KenLM Language Model Toolkit
KenLM estimates, filters, and queries language models. Estimation is fast and scalable due to streaming algorithms explained in the paper
Scalable Modified Kneser-Ney Language Model Estimation
Kenneth Heafield, Ivan Pouzyrevsky, Jonathan H. Clark, and Philipp Koehn. ACL, Sofia, Bulgaria, 4-9 August, 2013.
Querying is fast and low-memory, as shown in the paper
KenLM: Faster and Smaller Language Model Queries
Kenneth Heafield. WMT at EMNLP, Edinburgh, Scotland, United Kingdom, 30-31 July, 2011.
Usage
Moses, cdec, and Joshua already distribute KenLM and build it along with the decoder. See their documentation on where to find the programs. Estimation and filtering require Boost (at least version 1.36.0) and zlib. I also recommend tcmalloc from gperftools, bzlib (to read .bz2 files), and xz-utils (to read .xz files). If your distribution has "devel" packages, install those. For more help, see dependencies. Then compile
wget -O - http://kheafield.com/code/kenlm.tar.gz |tar xz
Run
./bjam --help
to see build options. If the dependencies are too difficult and you only need querying, use
./compile_query_only.sh
which depends only on g++ and bash. Programs will be located in
Estimating
Language models are estimated from text using modified Kneser-Ney smoothing without pruning. Estimation is done on disk, which makes it possible to build much larger models than would fit in RAM. For example, to estimate a 5-gram model:
bin/lmplz -o 5 <text >text.arpa
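To give a feel for what modified Kneser-Ney smoothing involves, the sketch below computes the three discounts (for n-grams seen once, twice, and three or more times) from count-of-counts, using the closed-form estimates of Chen and Goodman. It is a minimal illustration, not lmplz's code: the function name and the toy counts are invented here, and the real estimator performs many more steps (adjusted counts, interpolation, backoff weights).

```python
def modified_kneser_ney_discounts(count_of_counts):
    """Return the discounts (D1, D2, D3+) for one n-gram order.

    count_of_counts[k] is the number of distinct n-grams of that order
    that were seen exactly k times, for k = 1..4.
    """
    n1, n2, n3, n4 = (count_of_counts[k] for k in (1, 2, 3, 4))
    if min(n1, n2, n3) == 0:
        raise ValueError("n1, n2 and n3 must be non-zero to estimate discounts")
    y = n1 / (n1 + 2 * n2)
    d1 = 1 - 2 * y * n2 / n1        # discount for n-grams seen once
    d2 = 2 - 3 * y * n3 / n2        # discount for n-grams seen twice
    d3_plus = 3 - 4 * y * n4 / n3   # discount for n-grams seen 3+ times
    return d1, d2, d3_plus


# Toy count-of-counts, invented for illustration only.
toy_counts = {1: 100, 2: 40, 3: 20, 4: 10}
d1, d2, d3_plus = modified_kneser_ney_discounts(toy_counts)
print(f"D1={d1:.4f} D2={d2:.4f} D3+={d3_plus:.4f}")
```

Each discount is subtracted from the corresponding count before normalizing, and the probability mass freed this way is what gets redistributed to lower-order n-grams.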
Querying
The binary file format makes loading faster. Run
bin/build_binary text.arpa text.binary
to convert text.arpa to the binary file text.binary. See data structures for more on selecting data structures. Once your binary file is built, query it:
- Moses: Change the first field to 8 or 9.
- cdec: KenLM is the only supported language model.
- Joshua: The lm line in joshua.config should begin with "lm = kenlm".
- Command line:
bin/query text.binary <data
- Python:
cat python/example.py
and see the README.
- Your code: Download the source code and read the developer documentation.
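For the Python route, a minimal sketch of scoring a few sentences might look like the following. It assumes the kenlm Python module is installed and that text.binary was built as above; the helper names score_sentences and load_kenlm_model are made up for this example. python/example.py remains the reference for the actual interface.

```python
def score_sentences(model, sentences):
    """Map each sentence to its total log10 probability under the model.

    `model` is any object with a score(sentence, bos=..., eos=...) method,
    such as the object returned by load_kenlm_model below. Sentences are
    whitespace-tokenized strings; bos/eos add the sentence boundary tokens.
    """
    return {s: model.score(s, bos=True, eos=True) for s in sentences}


def load_kenlm_model(path):
    """Load an ARPA or binary model file (requires the kenlm module)."""
    # Imported here so score_sentences can be tried without KenLM installed.
    import kenlm
    return kenlm.Model(path)


# Typical use, assuming text.binary exists in the current directory:
#   model = load_kenlm_model("text.binary")
#   for sentence, logprob in score_sentences(model, ["this is a test ."]).items():
#       print(logprob, sentence)
```

Keeping the scoring helper independent of the import makes it easy to test with a stand-in model before the real one is built.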
"The biggest improvement for the language industry has been the addition of the new language model KenLM which is fast, memory-efficient, and above all, allows the use of multi-core processors under the open source license." --Achim Ruopp, TAUS
- Faster and lower memory than SRILM and IRSTLM.
- On-disk estimation with user-specified RAM.
- Two data structures for time-space tradeoff.
- Binary format with mmap. Or load ARPA files directly.
- If you have the appropriate libraries installed, it can also read text and ARPA files compressed with gzip, bzip2, or xz.
- More opportunities for hypothesis recombination. If the model backs off, State stores only the matched words. The FullScore function also returns the length of n-gram matched by the model.
- Querying has few dependencies: a C++ compiler and POSIX system calls. Filtering and estimation are multi-threaded, so they depend on Boost.
- Supports models of any order greater than one (recompilation required for orders >= 7).
- Thorough error handling. For example, ARPA parse errors include a message, the problematic string, the byte offset, and the file name. Compare with IRSTLM.
- Loading progress bar.
- Tests. These depend on Boost.
- Querying supports n-grams containing <unk> tokens; these appear in models built with restricted vocabulary.
- A permissive license means you can distribute it, unlike SRILM. There isn't a form to fill out before you can download.
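The point above about backing off and the matched n-gram length can be illustrated with a toy backoff query. This is a sketch under stated assumptions, not KenLM's implementation: the model is a plain dictionary with invented log10 probabilities and backoff weights, the function only borrows the name of FullScore, and returning length 0 for an unknown word is this sketch's own convention.

```python
def full_score(model, context, word):
    """Return (log10 probability, matched n-gram length) for word given context.

    `model` maps word tuples to (log10 probability, log10 backoff weight).
    """
    context = tuple(context)
    penalty = 0.0
    while True:
        ngram = context + (word,)
        if ngram in model:
            logprob, _ = model[ngram]
            return penalty + logprob, len(ngram)
        if not context:
            # Word is not in the vocabulary: score it as <unk>.
            logprob, _ = model[("<unk>",)]
            return penalty + logprob, 0
        # Back off: charge the context's backoff weight if the context is
        # known, then retry with the leftmost context word dropped.
        if context in model:
            penalty += model[context][1]
        context = context[1:]


# Toy model with invented numbers, for illustration only.
toy_model = {
    ("<unk>",): (-3.0, 0.0),
    ("the",): (-1.0, -0.5),
    ("cat",): (-2.0, -0.4),
    ("sat",): (-2.5, 0.0),
    ("the", "cat"): (-0.7, -0.3),
}

print(full_score(toy_model, ("the",), "cat"))        # bigram found directly
print(full_score(toy_model, ("the", "cat"), "sat"))  # backs off to the unigram
```

In the second query only the unigram "sat" matches, so a decoder that keeps just the matched words as state can treat hypotheses ending in "... the cat sat" and "... a dog sat" as equivalent for future scoring, which is where the extra recombination comes from.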
Supported Platforms
Best on Linux. Also supports Mac OS X, Cygwin, and Windows. Tested on x86, x86_64, ppc64, and ARM. The ARM port was contributed by NICT.
I do not actively maintain the Visual Studio build files or test on Windows. A version that works on Windows is tagged on github. See the windows directory for Visual Studio project files based on a contribution by Cong Duy Vu Hoang. Compile the kenlm project before the build_binary and ngram_query projects, preferably in x64 release mode.
Cygwin works too. However, please note that Cygwin is 32-bit even on 64-bit Windows, so you should not expect Cygwin to work with model sizes over 2 GB.
License
My code is LGPL, but there are files from other sources too. See the LICENSE file for details.
Not to be confused with KLM or The CMU-Cambridge Statistical Language Modeling toolkit. Hieu Hoang gave the name kenlm.
This implementation was mentioned in my January 2010 MT Marathon paper when early source code was publicly available. Integration into Moses was publicly announced on 18 October 2010. These precede both the 17 December 2010 submission deadline for the BerkeleyLM paper and their 20 June 2011 public release. Tests performed by Adam Pauls in May 2011 showed that KenLM is 4.49x faster. He omitted KenLM from his paper and his 20 June 2011 talk, claiming SRILM is the fastest package. After his talk, an error was discovered in the 4.49x number he reported, but corrected results still show KenLM is faster; see the benchmarks.