Language Model Filter


This program filters language models to test sets. Filtering removes n-grams that cannot be generated during decoding; the process is not lossy because the removed entries would never have been queried. In our experiments, the filter reduces model size by 92% for system combination and 36% for machine translation. For usage, run

bin/filter
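
Running bin/filter with no arguments prints the full usage message. As an illustration only (the argument syntax below is an assumption, and lm.arpa, filtered.arpa, and test_sentences.txt are placeholder file names; defer to the printed usage message), a union-mode invocation could look like:

bin/filter union lm.arpa filtered.arpa <test_sentences.txt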

Features

Two types of constraints are implemented:
Vocabulary
Check that every word in the n-gram appears in the target vocabulary (see the sketch after this list).
Phrase
Determine if the n-gram can be assembled from phrases, including phrases crossing n-gram boundaries.
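
The vocabulary constraint amounts to a set-membership test over the n-gram's words. A minimal C++ sketch, assuming a simple string-based vocabulary (this is illustrative, not the filter's actual implementation):

#include <string>
#include <unordered_set>
#include <vector>

// Sketch only: an n-gram passes the vocabulary constraint when every
// one of its words can be found in the target-side vocabulary.
bool PassesVocabulary(const std::vector<std::string> &ngram,
                      const std::unordered_set<std::string> &vocab) {
  for (const std::string &word : ngram) {
    if (!vocab.count(word)) return false;  // word can never be produced
  }
  return true;
}

The phrase constraint is stricter: instead of testing words independently, it asks whether some sequence of phrases, possibly clipped at the n-gram boundaries, can assemble the n-gram.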
Orthogonal to these constraints, the filter handles multiple input sentences in one of three modes:
Single
Treat the entire input as one sentence. This is the mode most commonly seen in other filters and the least effective.
Multiple
Output a separately filtered model for each sentence. This mode produces the smallest individual models, but with one model per sentence the total output is larger.
Union
Generate one model that is the union of the individually filtered models. This constraint is stronger than Single mode's because all of the words or phrases supporting an n-gram must appear in the same sentence. Because only one model is loaded, this mode minimizes loading time.
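
A sketch of the union constraint under the vocabulary check (the semantics are taken from the description above; the names and data layout are illustrative assumptions):

#include <string>
#include <unordered_set>
#include <vector>

// Sketch only: under union mode, an n-gram survives when at least one
// input sentence contains all of its words, so per-sentence vocabularies
// are consulted rather than one pooled vocabulary.
bool PassesUnion(
    const std::vector<std::string> &ngram,
    const std::vector<std::unordered_set<std::string>> &sentence_vocabs) {
  for (const std::unordered_set<std::string> &vocab : sentence_vocabs) {
    bool supported = true;
    for (const std::string &word : ngram) {
      if (!vocab.count(word)) { supported = false; break; }
    }
    if (supported) return true;  // one sentence supports the whole n-gram
  }
  return false;  // no single sentence supports the n-gram
}

Pooling all sentences into one vocabulary, as Single mode effectively does, would accept any n-gram whose words are scattered across different sentences; the per-sentence loop above is what makes the union constraint stronger.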

The filter is fast and multithreaded; typically disk is the bottleneck. For example, union mode takes 10 minutes on a 19 GB ARPA file with 2525 sentences.