ParaCrawl v.7.1 is a parallel dataset with 41 language pairs, primarily aligned with English (39 of 41), mined using the parallel-data-crawling tool Bitextor, whose pipeline includes downloading documents, preprocessing and normalization, aligning documents and segments, and filtering noisy data via Bicleaner. ParaCrawl focuses on European languages, but also includes 9 lower-resource, non-European language pairs in v7.1.
https://choosealicense.com/licenses/cc0-1.0/
Dataset Card for OpusParaCrawl
Dataset Summary
Parallel corpora from Web Crawls collected in the ParaCrawl project. The dataset contains:
42 languages, 43 bitexts
total number of files: 59,996
total number of tokens: 56.11G
total number of sentence fragments: 3.13G
To load a language pair that isn't part of the config, specify the two language codes as a pair, e.g. dataset = load_dataset("opus_paracrawl", lang1="en", lang2="so")
You can find the valid… See the full description on the dataset page: https://huggingface.co/datasets/Helsinki-NLP/opus_paracrawl.
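A runnable version of the call above, as a minimal sketch assuming the Hugging Face datasets library is installed and the corpus exposes the usual train split:

from datasets import load_dataset

# Load the English-Somali pair, which is not among the preconfigured pairs.
dataset = load_dataset("opus_paracrawl", lang1="en", lang2="so")

# Each record holds one aligned sentence pair (assuming a single train split).
print(dataset["train"][0])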
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This corpus was run through BiRoamer (https://github.com/bitextor/biroamer) to anonymise the Norwegian Bokmål-English parallel data from release 7 of the ParaCrawl project, specifically "Broader Web-Scale Provision of Parallel Corpora for European Languages". This version is filtered with BiCleaner with a threshold of 0.5. Data was crawled from the web following robots.txt, as is standard practice. The crawl is not targeted to a particular domain, intending to provide broad coverage. Anonymisation is an automated process driven by named entity recognition and is far from perfect.
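The threshold mentioned above is a simple score cut-off. Below is a minimal sketch of such a filter, assuming a tab-separated file whose last column is the BiCleaner score; the filename and column layout are illustrative assumptions, not the actual ParaCrawl pipeline:

import csv

THRESHOLD = 0.5  # the threshold named in the description above

# Keep only segment pairs whose BiCleaner score meets the threshold.
with open("scored.tsv", encoding="utf-8") as src, \
        open("filtered.tsv", "w", encoding="utf-8") as dst:
    for row in csv.reader(src, delimiter="\t"):
        if float(row[-1]) >= THRESHOLD:
            dst.write("\t".join(row[:-1]) + "\n")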
Web-Scale Parallel Corpora for Official European Languages.
To use this dataset:
import tensorflow_datasets as tfds

# Load the training split of the ParaCrawl corpus.
ds = tfds.load('para_crawl', split='train')

# Print the first four translation pairs.
for ex in ds.take(4):
    print(ex)
See the guide for more information on tensorflow_datasets.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Bulgarian-English parallel data from release 4 of the ParaCrawl project, specifically "Provision of Web-Scale Parallel Corpora for Official European Languages". This version is filtered with BiCleaner with a threshold of 0.7. Data was crawled from the web following robots.txt, as is standard practice. The crawl is not targeted to a particular domain, intending to provide broad coverage.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Portuguese-English parallel data from release 8 of the ParaCrawl project, specifically "Broader Web-Scale Provision of Parallel Corpora for European Languages". This version is filtered with BiCleaner with a threshold of 0.5. Data was crawled from the web following robots.txt, as is standard practice. The crawl is not targeted to a particular domain, intending to provide broad coverage.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Slovenian-English parallel data from release 8 of the ParaCrawl project, specifically "Broader Web-Scale Provision of Parallel Corpora for European Languages". This version is filtered with BiCleaner with a threshold of 0.5. Data was crawled from the web following robots.txt, as is standard practice. The crawl is not targeted to a particular domain, intending to provide broad coverage.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Polish-English parallel data from release 6 of the ParaCrawl project, specifically "Broader Web-Scale Provision of Parallel Corpora for European Languages". This version is filtered with BiCleaner with a threshold of 0.7 and introduces near-duplicate removal as well. Data was crawled from the web following robots.txt, as is standard practice. The crawl is not targeted to a particular domain, intending to provide broad coverage.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
The BackpropBuff/ParaCrawl.en-es_2_of_2 dataset is hosted on Hugging Face and was contributed by the HF Datasets community.
https://choosealicense.com/licenses/cc0-1.0/
Dataset Card for ParaCrawl_Context
This is a dataset for document-level machine translation, introduced in the ACL 2024 paper Document-Level Machine Translation with Large-Scale Public Parallel Data. It consists of parallel sentence pairs from the ParaCrawl dataset, along with the preceding context extracted from the webpages the sentences were crawled from.
Dataset Details
Dataset Description
This dataset adds document-level… See the full description on the dataset page: https://huggingface.co/datasets/Proyag/paracrawl_context.
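A minimal loading sketch, assuming the dataset exposes a default configuration and a train split (neither is confirmed by the excerpt above):

from datasets import load_dataset

# Stream to avoid downloading the full corpus up front.
ds = load_dataset("Proyag/paracrawl_context", split="train", streaming=True)

# Each record should pair parallel sentences with their preceding web context.
print(next(iter(ds)))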
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Norwegian Bokmål-English parallel data from release 7 of the ParaCrawl project, specifically "Broader Web-Scale Provision of Parallel Corpora for European Languages". This version is filtered with BiCleaner with a threshold of 0.5. Data was crawled from the web following robots.txt, as is standard practice. The crawl is not targeted to a particular domain, intending to provide broad coverage.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Bulgarian-English parallel data from release 8 of the ParaCrawl project, specifically "Broader Web-Scale Provision of Parallel Corpora for European Languages". This version is filtered with BiCleaner with a threshold of 0.5. Data was crawled from the web following robots.txt, as is standard practice. The crawl is not targeted to a particular domain, intending to provide broad coverage.
WikiMatrix
This is a loader for the WikiMatrix dataset. I have nothing to do with dataset creation; I have just written a loader that downloads the files provided and loads them into a Hugging Face dataset. (The loader was in fact largely copied from the ParaCrawl dataset builder here, so my contribution is quite minimal.)
Use
You can load the dataset using two concatenated language codes of the WikiMatrix languages, for example: import datasets dataset =… See the full description on the dataset page: https://huggingface.co/datasets/isabelvp/wikimatrix-loader.
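Since the card's own example is truncated, the following is only a hypothetical illustration of that loading pattern; the config string ("en-fr" here) and its separator are assumptions:

import datasets

# Load an English-French bitext; the exact config naming scheme is assumed.
dataset = datasets.load_dataset("isabelvp/wikimatrix-loader", "en-fr")
print(dataset)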
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Danish-English parallel data from release 8 of the ParaCrawl project, specifically "Broader Web-Scale Provision of Parallel Corpora for European Languages". This version is filtered with BiCleaner with a threshold of 0.5. Data was crawled from the web following robots.txt, as is standard practice. The crawl is not targeted to a particular domain, intending to provide broad coverage.
TAUS also generated corpora by applying Matching Data selection to DataCloud and ParaCrawl data. The query corpus was crawled from the web for the latest coronavirus-related articles and news. The selected data relates to virology, epidemics, medicine, and healthcare.
English>German, English>Spanish, English>French, English>Italian, English>Chinese, English>Russian, English>Portuguese (Brazil), English>Japanese, English>Korean, English>Portuguese, English>Dutch, English>Polish, English>Swedish, English>Danish, English>Estonian, English>Czech
Other languages are available on demand.
IndoParaCrawl
IndoParaCrawl is the ParaCrawl v7.1 dataset bulk-translated to Indonesian using Google Translate. Thanks to HuggingFace for providing free storage for datasets <3.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Italian-English parallel data from release 8 of the ParaCrawl project, specifically "Broader Web-Scale Provision of Parallel Corpora for European Languages". This version is filtered with BiCleaner with a threshold of 0.5. Data was crawled from the web following robots.txt, as is standard practice. The crawl is not targeted to a particular domain, intending to provide broad coverage. Files contain full information for deferred crawling.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This file contains URLs and hashes of text to form a parallel corpus, but not the sentences themselves. You probably want the actual parallel data; see the version without "deferred files" in the title. To reconstruct a parallel corpus, use the deferred crawling tool at https://github.com/bitextor/deferred-crawling, which will download pages and produce a corpus, probably smaller than the original due to link rot. This format is intended to support parties whose lawyers believe it is ok to scrape websites directly but not ok to copy them from a third party. Based on English-Norwegian Bokmål parallel data from release 9 of the ParaCrawl project, specifically "Continued Web-Scale Provision of Parallel Corpora for European Languages". This version is filtered with BiCleaner AI. Data was crawled from the web following robots.txt, as is standard practice. The crawl is not targeted to a particular domain, intending to provide broad coverage.
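Conceptually, reconstruction means re-downloading each URL and keeping only the text whose hash matches an entry in the deferred file. The sketch below illustrates the idea only; it is not the bitextor/deferred-crawling tool, and the hashing scheme is an assumption:

import hashlib
import urllib.request

def reconstruct(url, wanted_hashes):
    """Fetch a page and keep lines whose hash appears in the deferred file."""
    html = urllib.request.urlopen(url).read().decode("utf-8", errors="replace")
    kept = []
    for line in html.splitlines():
        text = line.strip()
        # The hash function (SHA-1 here) is an illustrative assumption.
        if text and hashlib.sha1(text.encode("utf-8")).hexdigest() in wanted_hashes:
            kept.append(text)
    # Pages change over time, so the reconstructed corpus shrinks with link rot.
    return kept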
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The January 2018 release of ParaCrawl is the first version of the corpus. It contains parallel corpora for 11 languages paired with English, crawled from a large number of websites. The selection of websites is based on CommonCrawl, but ParaCrawl is extracted from a brand-new crawl with much higher coverage of these selected websites than CommonCrawl. Since the data is fairly raw, it is released with two quality metrics that can be used for corpus filtering; an official "clean" version of each corpus uses one of the metrics. For more details and raw data downloads, please visit http://paracrawl.eu/releases.html