KenLM: Language Model Inference
KenLM is a library that loads language model files and returns probabilities. It is fast and small, as shown in the paper:
Heafield, 2011. KenLM: Faster and Smaller Language Model Queries. Proceedings of the EMNLP 2011 Sixth Workshop on Statistical Machine Translation. Edinburgh, UK, July.
Usage
Code is distributed with decoders. Joshua and cdec use KenLM by default (and have removed support for SRILM). In Moses, change the first field of the language model line in moses.ini to 8 or 9. To use KenLM as a library, download the source code and read the developer documentation.
The binary file format makes loading faster. Run
kenlm/build_binary foo.arpa foo.binary
and pass foo.binary wherever you would have passed foo.arpa.
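One reason a binary format can load faster than a text ARPA file is that it can be memory-mapped rather than parsed. The sketch below is a generic POSIX illustration only: the names WriteRawFloats and MappedFloats and the "raw array of floats" file layout are hypothetical, and none of this is KenLM's actual binary format or loader.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical helper: dump an array of floats to disk as raw bytes.
inline bool WriteRawFloats(const char *path, const float *data, std::size_t count) {
  std::FILE *f = std::fopen(path, "wb");
  if (!f) return false;
  bool wrote_all = std::fwrite(data, sizeof(float), count, f) == count;
  return std::fclose(f) == 0 && wrote_all;
}

// Hypothetical read-only view of such a file. Nothing is parsed or copied;
// the operating system pages the data in on first access.
class MappedFloats {
 public:
  explicit MappedFloats(const char *path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return;
    struct stat st;
    if (fstat(fd, &st) == 0 && st.st_size > 0) {
      std::size_t bytes = static_cast<std::size_t>(st.st_size);
      void *p = mmap(nullptr, bytes, PROT_READ, MAP_SHARED, fd, 0);
      if (p != MAP_FAILED) {
        base_ = p;
        bytes_ = bytes;
      }
    }
    close(fd);  // the mapping remains valid after the descriptor is closed
  }
  ~MappedFloats() {
    if (base_) munmap(base_, bytes_);
  }
  MappedFloats(const MappedFloats &) = delete;
  MappedFloats &operator=(const MappedFloats &) = delete;

  bool ok() const { return base_ != nullptr; }
  std::size_t size() const { return bytes_ / sizeof(float); }
  const float &operator[](std::size_t i) const {
    assert(i < size());
    return static_cast<const float *>(base_)[i];
  }

 private:
  void *base_ = nullptr;
  std::size_t bytes_ = 0;
};
```

With this sketch, constructing MappedFloats on a file previously written by WriteRawFloats gives indexed access to the stored values without reading the whole file up front. A text format cannot be used this way because every number must first be parsed.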
Features
"The biggest improvement for the language industry has been the addition of the new language model KenLM which is fast, memory-efficient, and above all, allows the use of multi-core processors under the open source license." --Achim Ruopp, TAUS
- Faster and lower memory than SRILM and IRSTLM.
- Two data structures for time-space tradeoff.
- Binary format with mmap. Or load an ARPA directly, including gzipped ARPA files.
- Threadsafe.
- More opportunities for hypothesis recombination. If the model backs off, State stores only the matched words. The FullScore function also returns the length of n-gram matched by the model.
- Few dependencies. C++ compiler and POSIX system calls. Optional ICU, Boost, and zlib functionality.
- Simple build process: compile.sh runs g++.
- Supports models of any order greater than one (recompilation required for orders >= 7).
- Thorough error handling. For example, ARPA parse errors include a message, the problematic string, the byte offset, and the file name. Compare with IRSTLM.
- Loading progress bar.
- Tests. These depend on Boost.
- Supports n-grams containing <unk> tokens; these appear in models built with restricted vocabulary.
- Permissive license (see below) means you can distribute it, unlike SRILM. There isn't a form to fill out before you can download.
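The recombination item in the list above can be illustrated with a toy backoff model. This is a minimal sketch under stated assumptions, not KenLM's implementation: ToyBackoffModel, ToyState, and ToyScore are hypothetical names, the table is a plain std::map, and the state rule is simplified to "keep the words of the longest matched n-gram, capped at order minus one". It only shows why storing matched words, rather than the full history, lets more hypotheses share a state.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

typedef std::vector<std::string> Words;

// Toy types for illustration only; these are not KenLM's classes.
struct ToyEntry { float log_prob; float backoff; };
struct ToyState {
  Words words;  // matched context, oldest word first
  bool operator==(const ToyState &other) const { return words == other.words; }
};
struct ToyScore {
  float log_prob;
  unsigned ngram_length;  // order of the longest n-gram matched, 0 if unknown
};

class ToyBackoffModel {
 public:
  ToyBackoffModel(std::size_t order, float unknown_log_prob)
      : order_(order), unknown_log_prob_(unknown_log_prob) {
    assert(order >= 2);
  }

  void Add(const Words &ngram, float log_prob, float backoff = 0.0f) {
    assert(!ngram.empty() && ngram.size() <= order_);
    table_[ngram] = ToyEntry{log_prob, backoff};
  }

  // Score `word` after context `in`; write the successor state to `out`.
  ToyScore Score(const ToyState &in, const std::string &word, ToyState &out) const {
    assert(in.words.size() < order_);
    float penalty = 0.0f;
    // Try the longest n-gram first, then drop the oldest context word.
    for (std::size_t skip = 0; skip <= in.words.size(); ++skip) {
      Words ngram(in.words.begin() + skip, in.words.end());
      ngram.push_back(word);
      std::map<Words, ToyEntry>::const_iterator hit = table_.find(ngram);
      if (hit != table_.end()) {
        // Keep only the matched words, at most order - 1 of them.
        std::size_t keep = std::min(ngram.size(), order_ - 1);
        out.words.assign(ngram.end() - keep, ngram.end());
        return ToyScore{hit->second.log_prob + penalty,
                        static_cast<unsigned>(ngram.size())};
      }
      ngram.pop_back();  // what remains is the context that failed to extend
      if (!ngram.empty()) {
        std::map<Words, ToyEntry>::const_iterator ctx = table_.find(ngram);
        if (ctx != table_.end()) penalty += ctx->second.backoff;
      }
    }
    out.words.clear();  // unknown word: nothing matched
    return ToyScore{unknown_log_prob_ + penalty, 0};
  }

 private:
  std::size_t order_;
  float unknown_log_prob_;
  std::map<Words, ToyEntry> table_;
};
```

Suppose a trigram toy model contains "the cat" and "a dog" as bigrams and "sat" only as a unigram. Scoring "sat" after either context backs off to the unigram, so both calls report ngram_length 1 and produce the same one-word state, and a decoder could recombine the two hypotheses. A state that always stored the last two words would keep them apart.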
Supported Platforms
Tested on Linux, Mac OS X, and Cygwin. There is no native Windows support; users are welcome to contribute. The processor must support unaligned uint64_t reads and writes. This includes x86, x86_64, and ppc64 but not ARM. ia64 might work but will likely be slow.
License
My code is LGPL. util/string_piece.hh and util/string_piece.cc come from Google under a license found in its comments. util/murmur_hash.cc says "All code is released to the public domain. For business purposes, Murmurhash is under the MIT license."
Confusion
Not to be confused with KLM or The CMU-Cambridge Statistical Language Modeling toolkit. Hieu Hoang gave the name kenlm.
KenLM was publicly announced and distributed with Moses on 18 October 2010. This precedes both the 17 December 2010 submission deadline for the BerkeleyLM paper and their 20 June 2011 public release. Tests performed by Adam Pauls in May 2011 showed that KenLM is 4.49x faster. He omitted KenLM from his paper and his 20 June 2011 talk, claiming SRILM is the fastest package.
