KenLM: Language Model Inference

KenLM is a library that loads language model files and returns probabilities. It is fast and small, as shown in the paper:
Heafield, 2011. KenLM: Faster and Smaller Language Model Queries. Proceedings of the EMNLP 2011 Sixth Workshop on Statistical Machine Translation. Edinburgh, UK, July.

Usage

The code is distributed with several decoders. Joshua and cdec use KenLM by default (and have removed support for SRILM). In Moses, change the first field of the language model entry in moses.ini to 8 or 9. To use KenLM as a library, download the source code and read the developer documentation.

The binary file format makes loading faster. Run kenlm/build_binary foo.arpa foo.binary to convert an ARPA file, then pass foo.binary wherever you would have passed foo.arpa.

Features

"The biggest improvement for the language industry has been the addition of the new language model KenLM which is fast, memory-efficient, and above all, allows the use of multi-core processors under the open source license." --Achim Ruopp, TAUS

Supported Platforms

Tested on Linux, Mac OS X, and Cygwin. There is no native Windows support; users are welcome to contribute. The processor must support unaligned uint64_t reads and writes. This includes x86, x86_64, and ppc64 but not ARM. ia64 might work but will likely be slow.
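
To illustrate what an unaligned uint64_t access is, here is a minimal C++ sketch. It is not taken from KenLM, and the names ReadUnaligned64, WriteUnaligned64, and RoundTrips are hypothetical. It uses memcpy, which is well defined at any alignment. On processors with hardware support for unaligned access, compilers typically lower such a copy to a single load or store.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not part of KenLM): read a 64-bit value that starts
// at an arbitrary byte offset, which need not be a multiple of 8.
std::uint64_t ReadUnaligned64(const unsigned char *base, std::size_t offset) {
  std::uint64_t value;
  std::memcpy(&value, base + offset, sizeof(value));
  return value;
}

// Hypothetical helper: write a 64-bit value at an arbitrary byte offset.
void WriteUnaligned64(unsigned char *base, std::size_t offset,
                      std::uint64_t value) {
  std::memcpy(base + offset, &value, sizeof(value));
}

// Write a known pattern at the given offset and check that it reads back
// unchanged. Offsets such as 3 or 5 make the access unaligned.
bool RoundTrips(std::size_t offset) {
  unsigned char buffer[16] = {0};
  if (offset + sizeof(std::uint64_t) > sizeof(buffer)) return false;
  const std::uint64_t pattern = 0x0123456789abcdefULL;
  WriteUnaligned64(buffer, offset, pattern);
  return ReadUnaligned64(buffer, offset) == pattern;
}
```

Calling RoundTrips(3) exercises a read and a write that both straddle an 8-byte boundary. The memcpy form is portable, so it also works on processors that do not support direct unaligned access, though it may compile to slower byte-wise code there.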

License

My code is LGPL. util/string_piece.hh and util/string_piece.cc come from Google under a license found in its comments. util/murmur_hash.cc says "All code is released to the public domain. For business purposes, Murmurhash is under the MIT license."

Confusion

Not to be confused with KLM or the CMU-Cambridge Statistical Language Modeling toolkit. Hieu Hoang came up with the name kenlm.

KenLM was publicly announced and distributed with Moses on 18 October 2010. This precedes both the 17 December 2010 submission deadline for the BerkeleyLM paper and their 20 June 2011 public release. Tests performed by Adam Pauls in May 2011 showed that KenLM is 4.49x faster than BerkeleyLM. He omitted KenLM from his paper and his 20 June 2011 talk, claiming SRILM is the fastest package.