Language Model Filter


This program filters language models against test sets. By filtering we mean that n-grams are removed if they cannot be generated during decoding; the process is not lossy, so decoding results are unchanged. In our experiments, the filter reduces model size by 92% for system combination and 36% for machine translation. For usage, run the filter program with no arguments.
Two types of constraints are implemented:
- Vocabulary: check that every word in the n-gram appears in the target vocabulary.
- Phrase: determine whether the n-gram can be assembled from phrases, including phrases that cross the n-gram's boundaries.
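As an illustration, the two constraints can be sketched in Python. This is a simplified toy model, not the actual implementation: the real filter operates on ARPA files and hashed vocabularies, and all function names here are invented.

```python
# Simplified sketches of the two constraints (hypothetical helpers).

def passes_vocab(ngram, sentences):
    """Vocabulary constraint: every word of the n-gram must occur
    somewhere in the target-side test sentences."""
    vocab = {w for s in sentences for w in s.split()}
    return all(w in vocab for w in ngram.split())

def segment_ok(seg, phrase, at_start, at_end):
    """Can `phrase` cover this segment of the n-gram? Edge segments may
    match only part of a phrase, since phrases can cross boundaries."""
    if at_start and at_end:   # one phrase covers the whole n-gram
        return any(phrase[i:i + len(seg)] == seg
                   for i in range(len(phrase) - len(seg) + 1))
    if at_start:              # phrase may extend left of the n-gram
        return len(phrase) >= len(seg) and phrase[-len(seg):] == seg
    if at_end:                # phrase may extend right of the n-gram
        return phrase[:len(seg)] == seg
    return phrase == seg      # interior segment must be a whole phrase

def passes_phrase(ngram, phrases):
    """Phrase constraint: dynamic program over the n-gram, marking each
    position reachable if some phrase covers a segment ending there."""
    words = ngram.split()
    plist = [p.split() for p in phrases]
    n = len(words)
    reach = [True] + [False] * n
    for i in range(n):
        if not reach[i]:
            continue
        for j in range(i + 1, n + 1):
            if any(segment_ok(words[i:j], p, i == 0, j == n) for p in plist):
                reach[j] = True
    return reach[n]

phrases = ["the quick", "quick brown fox"]
print(passes_phrase("quick brown", phrases))  # True: inside "quick brown fox"
print(passes_phrase("fox zebra", phrases))    # False: "zebra" is in no phrase
```

Note that the boundary cases matter: an n-gram's first covering phrase only needs its suffix to match, and its last only its prefix, which is what "phrases crossing n-gram boundaries" means.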
Orthogonally, the filter handles multiple sentences in parallel with three modes:
- Single: treat the entire input as one sentence. This is the mode most commonly seen in other filters, and the least effective.
- Multiple: output a separately filtered model for each sentence. This mode produces the smallest individual models, but since there is one model per sentence the total output is larger.
- Union: generate one model that is the union of the individually filtered models. The constraint is stronger than in Single mode because all words or phrases supporting an n-gram must appear in the same sentence. This mode minimizes loading time.
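To make the difference between Single and Union concrete, here is a hedged sketch using only the vocabulary constraint; the function names are invented for illustration.

```python
# Toy contrast of Single vs. Union filtering under the vocabulary
# constraint alone; not the real implementation.

def filter_single(ngrams, sentences):
    """Single mode: one vocabulary pooled over the whole test set."""
    vocab = {w for s in sentences for w in s.split()}
    return [g for g in ngrams if all(w in vocab for w in g.split())]

def filter_union(ngrams, sentences):
    """Union mode: an n-gram survives only if some single sentence
    contains all of its words."""
    vocabs = [set(s.split()) for s in sentences]
    return [g for g in ngrams
            if any(all(w in v for w in g.split()) for v in vocabs)]

sentences = ["the cat sat", "a dog ran"]
ngrams = ["the cat", "cat ran"]
print(filter_single(ngrams, sentences))  # ['the cat', 'cat ran']
print(filter_union(ngrams, sentences))   # ['the cat']
```

Here "cat ran" mixes words from two different sentences, so Union mode drops it while Single mode keeps it: that is the stronger same-sentence constraint.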

The filter is fast and multithreaded; disk is typically the bottleneck. In union mode, filtering a 19 GB ARPA file against 2525 sentences takes about 10 minutes.