Data

Parallel

ParaCrawl crawls the web for parallel data in 26 official EU languages, Icelandic, Norwegian, the co-official languages of Spain, and several other languages.

EuroPat mines patents for translations in German, Spanish, French, Croatian, Norwegian, and Polish.

The Medical Machine Translation project, in partnership with the University of Cape Town, gathered and created isiXhosa-English parallel text.

Monolingual

1.3 trillion words of English from ParaCrawl, counted after sentence-level deduplication.

CommonCrawl text for various languages. The English portion is about 0.5 trillion words after sentence-level deduplication.
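
The word counts above are reported after sentence-level deduplication. This page does not say how that step was carried out, so the following is only a minimal sketch of the general idea, assuming one sentence per line and exact-match duplicates. The function names `dedup_sentences` and `count_words` are hypothetical and not part of any released tool.

```python
import hashlib
from typing import Iterable, Iterator


def dedup_sentences(lines: Iterable[str]) -> Iterator[str]:
    """Yield each distinct sentence once, in first-seen order.

    Assumption: input is one sentence per line and duplicates are
    exact matches after stripping surrounding whitespace.
    """
    seen = set()
    for line in lines:
        sentence = line.strip()
        if not sentence:
            continue  # skip blank lines
        # Store a short hash rather than the full sentence to save memory.
        # Truncating to 8 bytes trades a tiny collision risk for space.
        digest = hashlib.sha1(sentence.encode("utf-8")).digest()[:8]
        if digest in seen:
            continue
        seen.add(digest)
        yield sentence


def count_words(sentences: Iterable[str]) -> int:
    """Count whitespace-separated tokens across all sentences."""
    return sum(len(s.split()) for s in sentences)


if __name__ == "__main__":
    corpus = [
        "Hello world .",
        "Hello world .",
        "Another sentence here .",
    ]
    unique = list(dedup_sentences(corpus))
    print(len(unique), count_words(unique))
```

Keeping only hashes in memory is a common choice for corpora of this size, since the set of full sentences would not fit in RAM. Near-duplicate detection would need a different technique.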