Estimating Large Language Models with KenLM

lmplz estimates language models with modified Kneser-Ney smoothing. The builder is disk-based: you specify the amount of RAM to use and it performs disk-based merge sort when necessary. It is faster than SRILM and IRSTLM and scales to much larger models, as shown in the paper
Scalable Modified Kneser-Ney Language Model Estimation
by Kenneth Heafield, Ivan Pouzyrevsky, Jonathan H. Clark, and Philipp Koehn. ACL, Sofia, Bulgaria, 4-9 August 2013.
Usage

Command-line options are documented by running lmplz with no arguments:
- -o Required. Order of the language model to estimate.
- -S Recommended. Memory to use. This is a number followed by a single-character suffix: % for percentage of physical memory (on platforms where this is measured), b for bytes, K for kilobytes, M for megabytes, and so on for G and T. If no suffix is given, kilobytes are assumed, for compatibility with GNU sort. The sort program is not used; the command line is simply designed to be compatible.
- -T Recommended. Temporary file location.
bin/lmplz -o 5 -S 80% -T /tmp <text >text.arpa
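The -S size specification can be sketched as follows (illustrative Python, not lmplz's actual C++ parser; the suffix table mirrors the documented behavior, including the GNU sort convention that a bare number means kilobytes):

```python
# Hypothetical helper showing how a sort-style memory specification
# such as "80%" or "4G" can be interpreted.
SUFFIXES = {'b': 1, 'K': 1024, 'M': 1024**2, 'G': 1024**3, 'T': 1024**4}

def parse_memory(spec, physical_bytes):
    """Return a byte count for a sort-style size specification."""
    if spec.endswith('%'):
        # Percentage of physical memory.
        return int(physical_bytes * float(spec[:-1]) / 100)
    if spec[-1] in SUFFIXES:
        return int(float(spec[:-1]) * SUFFIXES[spec[-1]])
    return int(float(spec)) * 1024  # no suffix: kilobytes, like GNU sort

print(parse_memory('80%', 16 * 1024**3))  # 80% of 16 GiB, in bytes
print(parse_memory('4G', 0))              # 4 GiB in bytes
```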
Pruning

Models are not pruned by default. To enable pruning, use --prune and specify count thresholds for each order. The numbers must be non-decreasing, and the last number is extended to all higher orders. For example, --prune 0 disables pruning (the default) while --prune 0 0 1 prunes singletons for orders three and higher. Currently, unigram pruning is not supported, so the first number must always be zero. The pruning criterion differs from SRILM's in that lmplz thresholds based on raw counts rather than adjusted counts.
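The threshold rules can be sketched with a small Python helper (hypothetical, not lmplz code):

```python
def expand_prune_thresholds(thresholds, order):
    """Validate --prune style thresholds and extend the last one upward."""
    if thresholds[0] != 0:
        raise ValueError("unigram pruning is not supported; first threshold must be 0")
    if any(a > b for a, b in zip(thresholds, thresholds[1:])):
        raise ValueError("thresholds must be non-decreasing")
    # The last number is extended to any higher order.
    return thresholds + [thresholds[-1]] * (order - len(thresholds))

print(expand_prune_thresholds([0, 0, 1], 5))  # [0, 0, 1, 1, 1]
```

So --prune 0 0 1 on a 5-gram model behaves like --prune 0 0 1 1 1.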
Scalability

I built an unpruned 5-gram model with 500 billion unique n-grams on one machine. It was trained on 975 billion tokens of English derived by deduplicating CommonCrawl text at the sentence level. Prior to deduplication, the corpus had over 1,800 billion tokens, making it comparable to the corpus in Google's 2007 paper, in which they deemed it too expensive to estimate such a model. Of course, hardware has advanced significantly in the past seven years.
Comparison With Other Toolkits

IRSTLM can scale but, to do so, it approximates modified Kneser-Ney smoothing. BerkeleyLM's latest version does not implement interpolated modified Kneser-Ney smoothing; rather, it implements absolute discounting (i.e. every discount is 0.75) without support for interpolation or modified discounting. SRILM uses memory to the point that building large language models is infeasible. These two commands should build the same language model:
lmplz -o 5 --interpolate_unigrams 0 <text >text.arpa
ngram-count -order 5 -interpolate -kndiscount -unk -gt3min 1 -gt4min 1 -gt5min 1 -text text -lm text.arpa
SRILM's -minprune option does not impact count pruning. There are some details to note:
- KenLM is more numerically precise. Every floating-point value is calculated from exact integers using O(order) floating-point operations. Other toolkits do O(order * vocabulary) sums, losing precision and sometimes resorting to hard-coded values.
- SRILM disables unigram interpolation for language models of non-trivial size, giving the leftover mass to p(<unk>). The --interpolate_unigrams 0 option emulates SRILM's behavior but gives <unk> a large probability, so it should probably not be used.
- If the corpus contains a blank line, KenLM includes <s> </s> as a bigram. SRILM skips blank lines.
- Multiple simultaneous streams. There is one for each n-gram order, eliminating the need to store record size information, making code much more natural, allowing in-place edits in interpolation, and making auxiliary files like backoffs easy to handle as another stream.
- Chains of MapReduces are wasteful. In the second step or later, the map can be eliminated. The computation could have been done in the reducer, checkpointed, and sent directly to the next reducer.
- Working locally has computational advantages.
- My target audience might not have a cluster.
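The numerical-precision point above can be seen in miniature with a toy Python example (not KenLM code): repeatedly adding floating-point values that are only approximately representable accumulates rounding error, while arithmetic derived from exact integers does not.

```python
from fractions import Fraction

# Ten additions of 0.1 in binary floating point miss 1.0 slightly,
# while the same sum over exact rationals is exactly 1.
float_sum = sum([0.1] * 10)
exact_sum = sum([Fraction(1, 10)] * 10)
print(float_sum == 1.0)  # False
print(exact_sum == 1)    # True
```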
Future Work/Known Issues
- Sharding: split the data where possible.
- Directly build KenLM data structures. Currently, you can do
bin/lmplz -o 5 <text | bin/build_binary /dev/stdin text.binary
but it would be more efficient to pass binary files directly.
- Have build_binary recognize these ARPA files (via a comment at the top) and trigger a faster path that skips workarounds for other toolkits.
- Allow corpora that contain explicit <unk> tokens. Sorry UMD.
- The progress bars assume terminals are at least 100 characters wide. Some steps do not have a progress bar.
OOVs

There are two ways to determine p(<unk>):
- In the literature, <unk> is a unigram with count zero. All unigrams, including <unk>, are interpolated with the uniform distribution. The only mass for p(<unk>) comes from interpolation: p(<unk>) = interpolation / |vocabulary|. This method is the default in newer versions of KenLM. The --interpolate_unigrams option tells older versions to use this method and is ignored by newer versions (since it is already the default).
- SRILM does something different. For models of non-toy size, it gives all of the unigram interpolation mass to the unknown word. This results in a larger OOV probability, effectively multiplying it by the vocabulary size when compared to the literature. To use the SRILM method, pass --interpolate_unigrams 0 on the command line. This is also the default in older versions.
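The two conventions can be compared with a toy sketch (hypothetical helper names; lam stands for the unigram interpolation mass):

```python
def p_unk_literature(lam, vocab_size):
    # <unk> is a zero-count unigram interpolated with the uniform
    # distribution, so its only mass is lam / |vocabulary|.
    return lam / vocab_size

def p_unk_srilm(lam):
    # SRILM gives the entire unigram interpolation mass to <unk>.
    return lam

lam, vocab = 0.05, 100000
print(p_unk_literature(lam, vocab))  # vocab_size times smaller
print(p_unk_srilm(lam))
```

The difference is exactly a factor of the vocabulary size, which is why SRILM's convention can give <unk> a higher probability than observed words.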
To check which default your version uses, run lmplz with --help and look at the section for --interpolate_unigrams. Newer versions say =1, indicating the default is as described in the literature rather than SRILM's method. The literature will always give <unk> a lower probability than other words, which is usually what you want; SRILM often gives <unk> a higher probability than words it has seen. Note that a perplexity comparison is unfair because a model no longer sums to one if there are two or more unknown words in the corpus. Moreover, OOV probability is task-dependent (e.g. passthroughs in MT are often OOVs and have different distributional properties). I recommend using two features per language model: log probability and OOV count. Tuning feature weights has the effect of tuning p(<unk>), rendering the value in the language model moot. When using an OOV feature, be sure to use a regularized tuning method like MIRA or PRO. MERT gets lost because it is relatively rare for n-best entries to have different numbers of OOVs.
The --discount_fallback option exists because Kneser-Ney smoothing discounts are estimated from counts of counts, including the number of singletons. Class-based language models often have no singleton unigrams, making the discounts undefined. For orders where this is a problem, --discount_fallback substitutes user-provided discounts. The defaults are 0.5 for singletons, 1.0 for doubletons, and 1.5 for counts of 3 and above. Often the singleton and doubleton discounts do not matter because there are none.
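For context, here is a sketch of the standard discount estimate (Chen and Goodman's formulas, paraphrased in Python; not KenLM code), where n1..n4 are the numbers of n-grams appearing exactly 1..4 times. With no singletons the formulas divide by zero, which is why the fallback exists:

```python
def estimate_discounts(n1, n2, n3, n4):
    """Modified Kneser-Ney discounts from counts of counts."""
    if n1 == 0 or n2 == 0 or n3 == 0:
        # Undefined: this is the case --discount_fallback handles.
        raise ValueError("counts of counts too sparse; use --discount_fallback")
    y = n1 / (n1 + 2.0 * n2)
    return (1 - 2 * y * n2 / n1,   # discount for count 1
            2 - 3 * y * n3 / n2,   # discount for count 2
            3 - 4 * y * n4 / n3)   # discount for counts >= 3

print(estimate_discounts(1000, 400, 200, 100))
```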
Corpus Formatting Notes
Words are delimited by any number of '\0', '\t', '\r', and ' '. UNIX newline ('\n') delimits lines (but note that DOS files will work because '\r' will be treated as a word delimiter and ignored at the end of a line). I generally recommend UTF-8 encoding; for other encodings, consider whether processing the delimiters at the byte level will cause problems. In particular, UTF-16 produces null bytes and is not supported.
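The delimiter rules can be sketched as follows (illustrative Python; the real reader is C++ operating at the byte level):

```python
DELIMITERS = {'\0', '\t', '\r', ' '}

def split_line(line):
    """Split one line into words using the documented delimiters."""
    words, current = [], []
    for ch in line:
        if ch in DELIMITERS:
            if current:                  # end the word in progress
                words.append(''.join(current))
            current = []
        else:
            current.append(ch)
    if current:
        words.append(''.join(current))
    return words

# A DOS line ending leaves a trailing '\r', which is simply dropped.
print(split_line('a b\tc\r'))  # ['a', 'b', 'c']
```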
As with all language model toolkits, tokenization and preprocessing should be done beforehand. Windows users should take care to remove the byte order marker from text files.
The symbols <s>, </s>, and <unk> are not allowed and will trigger an exception. They will be added internally. If your training data might have stray symbols and you want them to be treated as whitespace, pass --skip_symbols.
If the relevant compression library was installed at build time, you may compress the corpus with gzip, bzip2, or xz. If support was not compiled in and a compressed format is detected, an exception will be thrown. Compressed formats are detected by magic bytes; the file name does not matter.