Estimating Large Language Models with KenLM

lmplz estimates language models with modified Kneser-Ney smoothing. The builder is disk-based: you specify the amount of RAM to use and it performs disk-based merge sort when necessary. It is faster than SRILM and IRSTLM and scales to much larger models, as shown in the paper
Scalable Modified Kneser-Ney Language Model Estimation
by Kenneth Heafield, Ivan Pouzyrevsky, Jonathan H. Clark, and Philipp Koehn. ACL, Sofia, Bulgaria, 4-9 August 2013.
Usage

Command-line options are documented by running lmplz with no arguments:
- -o Required. Order of the language model to estimate.
- -S Recommended. Memory to use. This is a number followed by a single-character suffix: % for percentage of physical memory (on platforms where this is measured), b for bytes, K for kilobytes, M for megabytes, and so on for G and T. If no suffix is given, kilobytes are assumed, for compatibility with GNU sort. The sort program is not used; the command line is simply designed to be compatible.
- -T Recommended. Temporary file location.
bin/lmplz -o 5 -S 80% -T /tmp <text >text.arpa
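The -S size specification can be sketched as follows (illustrative Python, not lmplz's actual C++ parser; the suffix table mirrors the documented behavior, including the GNU sort convention that a bare number means kilobytes):

```python
# Hypothetical helper showing how a sort-style memory specification
# such as "80%" or "4G" can be interpreted.
SUFFIXES = {'b': 1, 'K': 1024, 'M': 1024**2, 'G': 1024**3, 'T': 1024**4}

def parse_memory(spec, physical_bytes):
    """Return a byte count for a sort-style size specification."""
    if spec.endswith('%'):
        # Percentage of physical memory.
        return int(physical_bytes * float(spec[:-1]) / 100)
    if spec[-1] in SUFFIXES:
        return int(float(spec[:-1]) * SUFFIXES[spec[-1]])
    return int(float(spec)) * 1024  # no suffix: kilobytes, like GNU sort

print(parse_memory('80%', 16 * 1024**3))  # 80% of 16 GiB, in bytes
print(parse_memory('4G', 0))              # 4 GiB in bytes
```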
Pruning

Models are not pruned by default. To enable pruning, use --prune and specify count thresholds for each order. The numbers must be non-decreasing, and the last number is extended to all higher orders. For example, --prune 0 disables pruning (the default) while --prune 0 0 1 prunes singletons for orders three and higher. Currently, unigram pruning is not supported, so the first number must always be zero. The pruning criterion differs from SRILM's in that lmplz thresholds based on raw counts rather than adjusted counts.
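The threshold rules can be sketched with a small Python helper (hypothetical, not lmplz code):

```python
def expand_prune_thresholds(thresholds, order):
    """Validate --prune style thresholds and extend the last one upward."""
    if thresholds[0] != 0:
        raise ValueError("unigram pruning is not supported; first threshold must be 0")
    if any(a > b for a, b in zip(thresholds, thresholds[1:])):
        raise ValueError("thresholds must be non-decreasing")
    # The last number is extended to any higher order.
    return thresholds + [thresholds[-1]] * (order - len(thresholds))

print(expand_prune_thresholds([0, 0, 1], 5))  # [0, 0, 1, 1, 1]
```

So --prune 0 0 1 on a 5-gram model behaves like --prune 0 0 1 1 1.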
Scalability

I built an unpruned 5-gram model with 500 billion unique n-grams on one machine. It was trained on 975 billion tokens of English derived by deduplicating CommonCrawl text at the sentence level. Prior to deduplication, the corpus had over 1,800 billion tokens, making it comparable to the corpus in Google's 2007 paper, in which they deemed it too expensive to estimate such a model. Of course, hardware has advanced significantly in the past seven years.
Comparison With Other Toolkits

IRSTLM can scale but, to do so, it approximates modified Kneser-Ney smoothing. BerkeleyLM's latest version does not implement interpolated modified Kneser-Ney smoothing; rather, it implements absolute discounting (i.e. every discount is 0.75) without support for interpolation or modified discounting. SRILM uses memory to the point that building large language models is infeasible. These two commands should build the same language model:
lmplz -o 5 --interpolate_unigrams 0 <text >text.arpa
ngram-count -order 5 -interpolate -kndiscount -unk -gt3min 1 -gt4min 1 -gt5min 1 -text text -lm text.arpa
SRILM's -minprune option does not impact count pruning. There are some details to note:
- KenLM is more numerically precise. Every floating-point value is calculated from exact integers using O(order) floating-point operations. Other toolkits do O(order * vocabulary) sums, losing precision and sometimes resorting to hard-coded values.
- SRILM disables unigram interpolation for language models of non-trivial size, giving the leftover mass to p(<unk>). The --interpolate_unigrams 0 option emulates SRILM's behavior but gives <unk> a large probability, so it should probably not be used.
- If the corpus contains a blank line, KenLM includes <s> </s> as a bigram. SRILM skips blank lines.
- Multiple simultaneous streams. There is one for each n-gram order, eliminating the need to store record size information, making code much more natural, allowing in-place edits in interpolation, and making auxiliary files like backoffs easy to handle as another stream.
- Chains of MapReduces are wasteful. In the second step or later, the map can be eliminated. The computation could have been done in the reducer, checkpointed, and sent directly to the next reducer.
- Working locally has computational advantages.
- My target audience might not have a cluster.
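The numerical-precision point above can be seen in miniature with a toy Python example (not KenLM code): repeatedly adding floating-point values that are only approximately representable accumulates rounding error, while arithmetic derived from exact integers does not.

```python
from fractions import Fraction

# Ten additions of 0.1 in binary floating point miss 1.0 slightly,
# while the same sum over exact rationals is exactly 1.
float_sum = sum([0.1] * 10)
exact_sum = sum([Fraction(1, 10)] * 10)
print(float_sum == 1.0)  # False
print(exact_sum == 1)    # True
```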
Future Work/Known Issues
- Sharding: split the data where possible.
- Directly build KenLM data structures. Currently, you can do
bin/lmplz -o 5 <text | bin/build_binary /dev/stdin text.binary
but it would be more efficient to pass binary files directly.
- Have build_binary recognize these ARPA files (via a comment at the top) and trigger a faster path that skips workarounds for other toolkits.
- Allow corpora that contain explicit <unk> tokens. Sorry UMD.
- The progress bars assume terminals are at least 100 characters wide. Some steps do not have a progress bar.
OOVs

There are two ways to determine p(<unk>):
- In the literature, <unk> is a unigram with count zero. All unigrams, including <unk>, are interpolated with the uniform distribution. The only mass for p(<unk>) comes from interpolation: p(<unk>) = interpolation / |vocabulary|. This method is the default in newer versions of KenLM. The --interpolate_unigrams option tells older versions to use this method and is ignored by newer versions (since it is already the default).
- SRILM does something different. For models of non-toy size, it gives all of the unigram interpolation mass to the unknown word. This results in a larger OOV probability, effectively multiplying it by the vocabulary size when compared to the literature. To use the SRILM method, pass --interpolate_unigrams 0 on the command line. This is also the default in older versions.
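The two conventions can be compared with a toy sketch (hypothetical helper names; lam stands for the unigram interpolation mass):

```python
def p_unk_literature(lam, vocab_size):
    # <unk> is a zero-count unigram interpolated with the uniform
    # distribution, so its only mass is lam / |vocabulary|.
    return lam / vocab_size

def p_unk_srilm(lam):
    # SRILM gives the entire unigram interpolation mass to <unk>.
    return lam

lam, vocab = 0.05, 100000
print(p_unk_literature(lam, vocab))  # vocab_size times smaller
print(p_unk_srilm(lam))
```

The difference is exactly a factor of the vocabulary size, which is why SRILM's convention can give <unk> a higher probability than observed words.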
To check which default your version uses, run lmplz with --help and look at the section for --interpolate_unigrams. Newer versions say =1, indicating the default is as described in the literature rather than SRILM's method. The literature will always give <unk> a lower probability than other words, which is usually what you want; SRILM often gives <unk> a higher probability than words it has seen. Note that a perplexity comparison is unfair because a model no longer sums to one if there are two or more unknown words in the corpus. Moreover, OOV probability is task-dependent (e.g. passthroughs in MT are often OOVs and have different distributional properties). I recommend using two features per language model: log probability and OOV count. Tuning feature weights has the effect of tuning p(<unk>), rendering the value in the language model moot. When using an OOV feature, be sure to use a regularized tuning method like MIRA or PRO. MERT gets lost because it is relatively rare for n-best entries to have different numbers of OOVs.
The --discount_fallback option exists because Kneser-Ney smoothing discounts are estimated from counts of counts, including the number of singletons. Class-based language models often have no singleton unigrams, making the discounts undefined. For orders where this is a problem, --discount_fallback substitutes user-provided discounts. The defaults are 0.5 for singletons, 1.0 for doubletons, and 1.5 for counts of 3 and above. Often the singleton and doubleton discounts do not matter because there are none.
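For context, here is a sketch of the standard discount estimate (Chen and Goodman's formulas, paraphrased in Python; not KenLM code), where n1..n4 are the numbers of n-grams appearing exactly 1..4 times. With no singletons the formulas divide by zero, which is why the fallback exists:

```python
def estimate_discounts(n1, n2, n3, n4):
    """Modified Kneser-Ney discounts from counts of counts."""
    if n1 == 0 or n2 == 0 or n3 == 0:
        # Undefined: this is the case --discount_fallback handles.
        raise ValueError("counts of counts too sparse; use --discount_fallback")
    y = n1 / (n1 + 2.0 * n2)
    return (1 - 2 * y * n2 / n1,   # discount for count 1
            2 - 3 * y * n3 / n2,   # discount for count 2
            3 - 4 * y * n4 / n3)   # discount for counts >= 3

print(estimate_discounts(1000, 400, 200, 100))
```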
Corpus Formatting Notes
Words are delimited by any number of '\0', '\t', '\r', and ' '. UNIX newline ('\n') delimits lines (but note that DOS files will work because '\r' will be treated as a word delimiter and ignored at the end of a line). I generally recommend UTF-8 encoding; for other encodings, consider whether processing the delimiters at the byte level will cause problems. In particular, UTF-16 produces null bytes and is not supported.
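The delimiter rules can be sketched as follows (illustrative Python; the real reader is C++ operating at the byte level):

```python
DELIMITERS = {'\0', '\t', '\r', ' '}

def split_line(line):
    """Split one line into words using the documented delimiters."""
    words, current = [], []
    for ch in line:
        if ch in DELIMITERS:
            if current:                  # end the word in progress
                words.append(''.join(current))
            current = []
        else:
            current.append(ch)
    if current:
        words.append(''.join(current))
    return words

# A DOS line ending leaves a trailing '\r', which is simply dropped.
print(split_line('a b\tc\r'))  # ['a', 'b', 'c']
```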
As with all language model toolkits, tokenization and preprocessing should be done beforehand. Windows users should take care to remove the byte order marker from text files.
The symbols <s>, </s>, and <unk> are not allowed and will trigger an exception. They will be added internally. If your training data might have stray symbols and you want them to be treated as whitespace, pass --skip_symbols.
If the relevant compression library was installed at build time, you may compress the corpus with gzip, bzip2, or xz. If support was not compiled in and a compressed format is detected, an exception will be thrown. Compressed formats are detected by magic bytes; the file name does not matter.