KenLM Language Model Toolkit

KenLM estimates, filters, and queries language models. Estimation is fast and scalable thanks to the streaming algorithms explained in the paper:
Scalable Modified Kneser-Ney Language Model Estimation
Kenneth Heafield, Ivan Pouzyrevsky, Jonathan H. Clark, and Philipp Koehn. ACL, Sofia, Bulgaria, 4–9 August, 2013.
[Paper] [Slides] [BibTeX]
Querying is fast and low-memory, as shown in the paper:
KenLM: Faster and Smaller Language Model Queries
Kenneth Heafield. WMT at EMNLP, Edinburgh, Scotland, United Kingdom, 30–31 July, 2011.
[Paper] [Slides] [BibTeX]

Usage

Moses, cdec, Joshua, Jane, and Phrasal already distribute KenLM and build it along with the decoder. See their documentation for where to find the programs.

Estimation and filtering require Boost 1.36.0 or later and zlib. I also recommend tcmalloc from gperftools, bzlib (to read .bz2 files), and xz-utils (to read .xz files). If your distribution has "devel" packages, install those. For more help, see dependencies. Then compile:
wget -O - http://kheafield.com/code/kenlm.tar.gz |tar xz
cd kenlm
./bjam -j4
Run ./bjam --help to see build options. If the dependencies are too difficult to install and you only need querying, use ./compile_query_only.sh, which depends only on g++ and bash. Programs will be located in bin/.

Estimating

Language models are estimated from text using modified Kneser-Ney smoothing without pruning. Estimation runs on disk, which makes it possible to build much larger models than would fit in memory:
bin/lmplz -o 5 <text >text.arpa
See the page on estimation for more.
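The estimation command above can also be driven from a script. The following is a minimal sketch, assuming the programs live in bin/ as described earlier and the corpus is a file named text; the helper names `lmplz_argv` and `estimate` are hypothetical and not part of KenLM. It uses only the `-o` option shown above.

```python
import subprocess
from pathlib import Path


def lmplz_argv(order, lmplz="bin/lmplz"):
    """Build the argument list for `bin/lmplz -o ORDER` (hypothetical helper)."""
    if order < 1:
        raise ValueError("n-gram order must be at least 1")
    return [lmplz, "-o", str(order)]


def estimate(text_path, arpa_path, order=5, lmplz="bin/lmplz"):
    """Run lmplz with the corpus on stdin and the ARPA model on stdout.

    Equivalent to the shell command: bin/lmplz -o 5 <text >text.arpa
    """
    with Path(text_path).open("rb") as corpus, Path(arpa_path).open("wb") as arpa:
        # check=True raises CalledProcessError if lmplz exits with an error.
        subprocess.run(lmplz_argv(order, lmplz), stdin=corpus, stdout=arpa, check=True)
```

With these helpers, calling `estimate("text", "text.arpa", order=5)` would run the same command as the shell line above. Nothing is executed until `estimate` is called.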

Querying

The binary file format makes loading faster. Run
bin/build_binary text.arpa text.binary
and pass text.binary wherever you would otherwise pass text.arpa. See data structures for more on selecting data structures. Once your binary file is built, query it:
Moses: In newer versions, use a line such as KENLM factor=0 order=5 path=filename.arpa. In older versions or legacy scripts, use language model number 8.
cdec: KenLM is the only supported language model.
Joshua: The lm line in joshua.config should begin with lm = kenlm.
Phrasal: Put kenlm: before the file name.
Kriya: KenLM is the default. Support for SRILM requires editing source code.
HiFST: KenLM is the default. HiFST also includes an OpenFST wrapper for KenLM.
Command line: bin/query text.binary <data
Python: Read python/example.py and see the README.
Your code: Download the source code and read the developer documentation.
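
For the Python entry above, the sketch below shows one way to score sentences and turn the scores into perplexities. It assumes the `kenlm` Python module is installed and that text.binary was built as described earlier. The `kenlm.Model` class and its `score(sentence, bos=True, eos=True)` method are used as I understand them from the Python README; check them against python/example.py before relying on this. The helper names `perplexity_from_log10` and `score_sentences` are hypothetical.

```python
def perplexity_from_log10(total_log10_prob, token_count):
    """Convert a summed log10 probability into per-token perplexity."""
    if token_count <= 0:
        raise ValueError("token_count must be positive")
    return 10.0 ** (-total_log10_prob / token_count)


def score_sentences(model_path, sentences):
    """Yield (sentence, log10 probability, perplexity) for each sentence.

    Assumes the kenlm Python module is installed. It is imported lazily so
    the arithmetic helper above works without it.
    """
    import kenlm  # assumption: the kenlm Python module is available

    model = kenlm.Model(model_path)
    for sentence in sentences:
        # score() returns the log10 probability of the whole sentence,
        # here with begin- and end-of-sentence markers included.
        log10_prob = model.score(sentence, bos=True, eos=True)
        # Count the words plus one for the end-of-sentence token.
        tokens = len(sentence.split()) + 1
        yield sentence, log10_prob, perplexity_from_log10(log10_prob, tokens)
```

For instance, `list(score_sentences("text.binary", ["this is a sentence ."]))` would return one tuple per input sentence. The perplexity helper is kept separate so the arithmetic can be checked without a model.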

Features

"The biggest improvement for the language industry has been the addition of the new language model KenLM which is fast, memory-efficient, and above all, allows the use of multi-core processors under the open source license." --Achim Ruopp, TAUS

Supported Platforms

Best on Linux. Also supports Mac OS X, Cygwin, and Windows. Tested on x86, x86_64, ppc64, and ARM. The ARM port was contributed by NICT.

Windows Users

I do not actively maintain the Visual Studio build files or test on Windows. A version that works on Windows is tagged on GitHub. See the windows directory for Visual Studio project files based on a contribution by Cong Duy Vu Hoang. Compile the kenlm project before the build_binary and ngram_query projects, preferably in x64 release mode.

Cygwin works too. However, please note that Cygwin is 32-bit even on 64-bit Windows, so you should not expect Cygwin to work with model sizes over 2 GB.

License

My code is licensed under the LGPL, but there are files from other sources too. See the LICENSE file for details.

Confusion

Not to be confused with KLM or The CMU-Cambridge Statistical Language Modeling toolkit. Hieu Hoang coined the name kenlm.

This implementation was mentioned in my January 2010 MT Marathon paper when early source code was publicly available. Integration into Moses was publicly announced on 18 October 2010. These precede both the 17 December 2010 submission deadline for the BerkeleyLM paper and their 20 June 2011 public release. Tests performed by Adam Pauls in May 2011 showed that KenLM is 4.49x faster. He omitted KenLM from his paper and his 20 June 2011 talk, claiming SRILM is the fastest package. After his talk, an error was discovered in the 4.49x number he reported, but corrected results still show KenLM is faster; see the benchmarks.