Language Model Filter
This program filters language models to test sets. By filtering we mean that n-grams are removed if they cannot be generated during decoding, so the process is not lossy. In our experiments, the filter reduces model size by 92% for system combination and 36% for machine translation. For usage, run
bin/filter
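For example, the following invocation filters an ARPA-format model to the vocabulary of a test set read on stdin, writing the result to filtered.arpa. The union keyword and the model: prefix reflect the tool's usage message as best we can reconstruct it here, so treat this as a sketch and verify against the usage output of your build:

bin/filter union model:lm.arpa filtered.arpa < test_set.txt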
Features
Two types of constraints are implemented:
- Vocabulary: Check that all words in the n-gram appear in the target vocabulary.
- Phrase: Determine whether the n-gram can be assembled from phrases, including phrases that cross n-gram boundaries (see the sketch after this list).
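As a sketch of the two input styles (the usage message printed by bin/filter is authoritative): vocabulary filtering reads whitespace-delimited words, while phrase filtering reads phrases delimited by tab characters, written here as \t:

the cat sat on the mat
the cat\tsat on\tthe mat

Given the second line, an n-gram like "cat sat on" passes the phrase constraint: it can be assembled from "the cat" and "sat on", with "the cat" crossing the n-gram's left boundary.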
Filtering operates in one of three modes:
- Single: Treat the entire input as one sentence. This is the mode most commonly seen in other filters and the least effective.
- Multiple: Output a separately filtered model for each input sentence (see the example after this list). This mode produces the smallest models, but because there is one model per sentence, the total output is larger.
- Union: Generate one model that is the union of the individually filtered models. The constraint is stronger than in Single mode because all of the words or phrases supporting an n-gram must appear in the same sentence. This mode minimizes loading time.
The filter is fast and multithreaded. Typically disk is the bottleneck; union mode takes 10 minutes for a 19 GB ARPA file and 2525 sentences.
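Thread count can be set on the command line; the threads: option shown below is from memory and may differ in your build, so run bin/filter with no arguments to see the supported options:

bin/filter union threads:4 model:lm.arpa filtered.arpa < test_set.txt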