KenLM Data StructuresUp to the main page
Binary fileUsing the binary format significantly reduces loading time. It also exposes more configuration options. The
build_binaryprogram converts ARPA files to binary files:
bin/build_binary foo.arpa foo.binary
bin/build_binary trie foo.arpa foo.binary
Probing is the fastest and default data structure. Unigram lookups happen by array index. Bigrams and longer n-grams are hashed to 64-bit integers which have very low probability of collision, even with the birthday attack. This 64-bit hash is the key to a probing hash table where values are probability and backoff.A linear probing hash table is an array consisting of blanks (zeros) and entries with non-zero keys. Lookup proceeds by hashing the key modulo the array size, starting at this point in the array, and scanning forward until the entry or a blank is found. The ratio of array size to number of entries is controlled by the probing multiplier parameter p. This is a time-space tradeoff: space is linear in p and time is O(p/(p-1)). The value of p can be set at binary building time with -p. The default is 1.5.
build_binary -p 1.5 probing foo.arpa foo.binary
The trie data structure uses the least memory, has the best memory locality, and is decently fast. However, it does take longer to build. It works in much the same way as SRI and IRST's inverted option, but with better memory and time performance. Like probing and sorted, unigram lookup is an array index. Records in the trie have a word index, probability, backoff, and pointer. All of the records for n-grams of the same order are stored consecutively in memory. An n-gram's pointer is actually the index into the (n+1)-gram array where block of (n+1)-grams with one more word of history starts. The end of this block is found by reading the next entry's pointer. Records within the block are sorted by word index. Because the vocabulary ids are randomly permuted, a uniform key distribution applies. Each block therefore constitutes a sorted map from word index to probability, backoff, and pointer. The trie is compacted by using the minimum number of bits to store each integer. Since the trie stores many vocabulary ids and uses the minimum number of bits to do so, vocabulary filtering is highly effective for reducing overall model size even if less n-grams of higher order are removed.
The sorted uniform map holds sorted keys-value pairs where keys are evenly distributed. In KenLM, keys are 64-bit hashes of words or vocabulary IDs assigned in random order. This means a simple form of interpolation search applies. Search starts by reading the first and last keys. Position of the desired key is estimated proportionally. If the desired key is found exactly, the value is returned. Otherwise, the estimated position becomes the new first or last element and the algorithm continues recursively. This known-distribution search is O(log log n) in the average case, O(n) in the worst case, and more local than ordinary binary search, thereby minimizing page faults.
Building the trie entails an on-disk sort. You can optimize this by passing -S to specify sorting memory like GNU sort (but final model building will still use the amount of memory needed to store the model). The -T option lets you customize where to place temporary files (the default is based on the output file name):
bin/build_binary -T /tmp/trie -S 1G trie foo.arpa foo.binary
QuantizationThe trie supports quantized probability and backoff storage with any number of bits from 1 to 25. Quantization bins are computed separately for each order and unigram probabilities are not quantized. To activate quantization, use -q with build_binary. You can also set the number of bits independently using e.g. -q 8 -b 7 to store 8 bits of probability and 7 bits of backoff. KenLM uses the binning method to determine quantization centers.
build_binary trie -q 8 -b 7 foo.arpa foo.binary
Pointer CompressionPointers in trie are a sorted array of integers. We can compress these using a simple version of the technique from Raj and Whittaker. The leading bits of the pointers are removed and stored implicitly using a table of offsets into the array. Retrieving these values has a time cost, so this compression is a time-space tradeoff. The maximum number of bits to chop is set by -a; KenLM will minimize memory subject to this upper bound. This compression is non-lossy and optionally stackable with quantization.
build_binary trie -a 64 foo.arpa foo.binary