Monolingual data from ParaCrawl
The corpus has 96,470,655,818 lines, 1,337,127,886,176 tokens, and 9,153,226,323,307 characters of English. Text was extracted from HTML, classified, split, and deduplicated.
The corpus is available as 128 files, split by the hash of the line. The first and last URLs are:
https://neural.mt/data/paracrawl8-mono/en-000.gz
https://neural.mt/data/paracrawl8-mono/en-127.gz
#!/bin/bash
for i in {0..127}; do
  wget https://neural.mt/data/paracrawl8-mono/en-$(printf "%03i" $i).gz
done
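The files are large, so an interrupted download is worth resuming rather than restarting. The following is a minimal sketch of one way to do that, not part of the official instructions: the helper names `shard_url` and `all_shard_urls` and the file name `urls.txt` are made up for illustration, and only the URL pattern comes from the text above. It uses a plain `while` loop instead of `{0..127}` so it also runs under shells without brace expansion.

```shell
#!/bin/bash
# Hypothetical helper: print the URL of shard number $1 (0..127).
shard_url() {
  printf 'https://neural.mt/data/paracrawl8-mono/en-%03d.gz\n' "$1"
}

# Hypothetical helper: print all 128 shard URLs, one per line.
# The body runs in a subshell so the loop counter does not leak out.
all_shard_urls() (
  i=0
  while [ "$i" -le 127 ]; do
    shard_url "$i"
    i=$((i + 1))
  done
)

# Usage (not executed here): write the list once, then let wget
# continue partially downloaded files (-c) from that list (-i).
#   all_shard_urls > urls.txt
#   wget -c -i urls.txt
```

Re-running the same `wget -c -i urls.txt` command after a failure should pick up where the previous run stopped, provided the server supports range requests.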
Files are hosted on the Internet Archive. Because of its 1 TB limit per directory, I've set up redirects to the appropriate directory.
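Since the files are split by the hash of the line, a single file is probably a fairly representative sample of the whole corpus, which makes it convenient for a quick sanity check. The sketch below assumes each file is gzip-compressed text with one line per record, and it counts whitespace-delimited words, which may not match the tokenization behind the totals quoted above. The function name `shard_stats` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print "<lines> <words>" for one gzip-compressed
# file, streaming it so the decompressed text is never written to disk.
shard_stats() {
  # wc always reports lines before words; awk strips wc's padding.
  gzip -dc < "$1" | wc -l -w | awk '{print $1, $2}'
}

# Usage (not executed here), after downloading the first file:
#   shard_stats en-000.gz
```

Multiplying the result by 128 gives only a rough estimate of the corpus totals, because hashing spreads lines evenly on average but not exactly.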
Source data
This is all the English data used for ParaCrawl release 8, which is based on the following crawls.
- Internet Archive
  - wide00006, wide00015, and pages with en, is, hr, no, and ga in their URL.
- CommonCrawl
  - 2016-30, 2017-30, 2018-30, 2019-18, and 2019-35.
- Targeted
  - Philipp Koehn crawled domains that have a mix of multilingual content based on language classification in CommonCrawl. Marta Bañón aimed for sites in Basque, Catalan, Galician, and Spanish but picked up some English on the way. Hieu Hoang crawled sites that produced parallel sentences in earlier generations of ParaCrawl.
More languages
Coming, though ParaCrawl release 9 processing takes priority. That will have even more data!
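For low-resource languages, we've noticed that the language classifier's false-positive rate is higher than the language's actual frequency on the web, so most of the text labeled as that language is really something else. That meant most of the isiXhosa was baseball statistics. Please share better, more precise, language identification!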
Acknowledgements
ParaCrawl datasets include data from the Internet Archive (https://archive.org/), as part of an agreement between the Internet Archive and the University of Edinburgh. The Internet Archive is hosting this data set.
ParaCrawl is funded by the European Union's Connecting Europe Facility. Any communication or publication related to the action, made by the beneficiaries jointly or individually in any form and using any means, shall indicate that it reflects only the author's view and that the Agency is not responsible for any use that may be made of the information it contains.
This data was processed using resources provided by the Cambridge Service for Data Driven Discovery (CSD3) operated by the University of Cambridge Research Computing Service (www.csd3.cam.ac.uk), provided by Dell EMC and Intel using Tier-2 funding from the Engineering and Physical Sciences Research Council (capital grant EP/P020259/1), and DiRAC funding from the Science and Technology Facilities Council (www.dirac.ac.uk).