Language Model Filter
This program filters language models down to a test set. By filtering we mean that n-grams are removed if they cannot be generated during decoding; the process is lossless with respect to the test set. In our experiments, the filter reduces model size by 92% for system combination and by 36% for machine translation. For usage, run the filter program with no arguments.
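The core idea can be sketched in a few lines. The following is an illustrative sketch, not the actual implementation: vocabulary filtering keeps an n-gram only if every word in it appears in the test-set vocabulary, so the model assigns unchanged probabilities to anything the decoder can produce. The function name and data layout here are hypothetical.

```python
def filter_ngrams(ngrams, test_vocab):
    """Keep only n-grams whose words all appear in the test-set vocabulary.

    n-grams that fail this check can never be queried during decoding,
    so removing them does not change any score on the test set.
    """
    return [ng for ng in ngrams if all(w in test_vocab for w in ng.split())]

# Toy example: "dog" is not in the test vocabulary, so "the dog" is dropped.
ngrams = ["the cat", "the dog", "a cat sat"]
vocab = {"the", "cat", "sat", "a"}
print(filter_ngrams(ngrams, vocab))  # ['the cat', 'a cat sat']
```

A real ARPA file also carries probabilities and backoff weights per entry; the filter preserves those for every surviving n-gram.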
Features

Two types of constraints are implemented:
- Vocabulary: check that every word in the n-gram appears in the target vocabulary.
- Phrase: determine whether the n-gram can be assembled from phrases, including phrases that cross n-gram boundaries.
Three output modes are available:
- Single: treat the entire input as one sentence. This is the mode most commonly seen in other filters and the least effective.
- Multiple: output a separately filtered model for each sentence. This mode produces the smallest individual models, but with one model per sentence the total output is larger.
- Union: generate one model that is the union of the individually filtered models. This constraint is stronger than in single mode because all of the words or phrases supporting an n-gram must appear in the same sentence. This mode minimizes loading time.
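The difference between single and union mode can be made concrete. In this hedged sketch (hypothetical function names, not the tool's code), single mode pools the vocabulary of all sentences, while union mode requires an n-gram's words to co-occur within one sentence, which is why it prunes more aggressively:

```python
def keep_single(ngram, sentences):
    """Single mode (sketch): pool all sentences into one vocabulary."""
    pooled = set(w for s in sentences for w in s.split())
    return all(w in pooled for w in ngram.split())

def keep_union(ngram, sentences):
    """Union mode (sketch): some single sentence must contain every word."""
    words = ngram.split()
    return any(all(w in set(s.split()) for w in words) for s in sentences)

sents = ["the cat sat", "a dog ran"]
print(keep_single("cat ran", sents))  # True: words exist somewhere in the input
print(keep_union("cat ran", sents))   # False: never together in one sentence
print(keep_union("cat sat", sents))   # True: both appear in the first sentence
```

Because "cat" and "ran" never co-occur in a sentence, union mode drops "cat ran" while single mode keeps it; summed over a large model, this gap is where union mode's extra savings come from.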
The filter is fast and multithreaded; disk is typically the bottleneck. Union mode takes 10 minutes for a 19 GB ARPA file and 2525 sentences.