ParaCrawl v.7.1 is a parallel dataset with 41 language pairs, primarily aligned with English (39 of 41), mined using the parallel-data-crawling tool Bitextor, whose pipeline includes downloading documents, preprocessing and normalization, aligning documents and segments, and filtering noisy data via Bicleaner. ParaCrawl focuses on European languages, but also includes 9 lower-resource, non-European language pairs in v7.1.
https://choosealicense.com/licenses/cc0-1.0/
Dataset Card for OpusParaCrawl
Dataset Summary
Parallel corpora from Web Crawls collected in the ParaCrawl project. The dataset contains:
42 languages, 43 bitexts
total number of files: 59,996
total number of tokens: 56.11G
total number of sentence fragments: 3.13G
To load a language pair that isn't part of the config, specify the two language codes as a pair, e.g. dataset = load_dataset("opus_paracrawl", lang1="en", lang2="so")
You can find the valid… See the full description on the dataset page: https://huggingface.co/datasets/Helsinki-NLP/opus_paracrawl.
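A runnable version of the call above, as a minimal sketch assuming the Hugging Face datasets library is installed and the corpus exposes the usual train split:

from datasets import load_dataset

# Load the English-Somali pair, which is not among the preconfigured pairs.
dataset = load_dataset("opus_paracrawl", lang1="en", lang2="so")

# Each record holds one aligned sentence pair (assuming a single train split).
print(dataset["train"][0])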
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This corpus was run through BiRoamer (https://github.com/bitextor/biroamer) to anonymise the Norwegian Bokmål-English parallel data from release 7 of the ParaCrawl project, specifically "Broader Web-Scale Provision of Parallel Corpora for European Languages". This version is filtered with BiCleaner with a threshold of 0.5. Data was crawled from the web following robots.txt, as is standard practice. The crawl is not targeted to a particular domain, intending to provide broad coverage. Anonymisation is an automated process driven by named entity recognition and is far from perfect.
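The threshold mentioned above is a simple score cut-off. Below is a minimal sketch of such a filter, assuming a tab-separated file whose last column is the BiCleaner score; the filename and column layout are illustrative assumptions, not the actual ParaCrawl pipeline:

import csv

THRESHOLD = 0.5  # the threshold named in the description above

# Keep only segment pairs whose BiCleaner score meets the threshold.
with open("scored.tsv", encoding="utf-8") as src, \
        open("filtered.tsv", "w", encoding="utf-8") as dst:
    for row in csv.reader(src, delimiter="\t"):
        if float(row[-1]) >= THRESHOLD:
            dst.write("\t".join(row[:-1]) + "\n")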
Web-Scale Parallel Corpora for Official European Languages.
To use this dataset:
import tensorflow_datasets as tfds

# Load the training split of the ParaCrawl corpus.
ds = tfds.load('para_crawl', split='train')

# Print the first four translation pairs.
for ex in ds.take(4):
    print(ex)
See the guide for more information on tensorflow_datasets.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Bulgarian-English parallel data from release 4 of the ParaCrawl project, specifically "Provision of Web-Scale Parallel Corpora for Official European Languages". This version is filtered with BiCleaner with a threshold of 0.7. Data was crawled from the web following robots.txt, as is standard practice. The crawl is not targeted to a particular domain, intending to provide broad coverage.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Portuguese-English parallel data from release 8 of the ParaCrawl project, specifically "Broader Web-Scale Provision of Parallel Corpora for European Languages". This version is filtered with BiCleaner with a threshold of 0.5. Data was crawled from the web following robots.txt, as is standard practice. The crawl is not targeted to a particular domain, intending to provide broad coverage.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Slovenian-English parallel data from release 8 of the ParaCrawl project, specifically "Broader Web-Scale Provision of Parallel Corpora for European Languages". This version is filtered with BiCleaner with a threshold of 0.5. Data was crawled from the web following robots.txt, as is standard practice. The crawl is not targeted to a particular domain, intending to provide broad coverage.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Polish-English parallel data from release 6 of the ParaCrawl project, specifically "Broader Web-Scale Provision of Parallel Corpora for European Languages". This version is filtered with BiCleaner with a threshold of 0.7 and introduces near-duplicate removal as well. Data was crawled from the web following robots.txt, as is standard practice. The crawl is not targeted to a particular domain, intending to provide broad coverage.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
The BackpropBuff/ParaCrawl.en-es_2_of_2 dataset is hosted on Hugging Face and was contributed by the HF Datasets community.
https://choosealicense.com/licenses/cc0-1.0/
Dataset Card for ParaCrawl_Context
This is a dataset for document-level machine translation, introduced in the ACL 2024 paper Document-Level Machine Translation with Large-Scale Public Parallel Data. It consists of parallel sentence pairs from the ParaCrawl dataset, along with the preceding context extracted from the webpages the sentences were crawled from.
Dataset Details
Dataset Description
This dataset adds document-level… See the full description on the dataset page: https://huggingface.co/datasets/Proyag/paracrawl_context.
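A minimal loading sketch, assuming the dataset exposes a default configuration and a train split (neither is confirmed by the excerpt above):

from datasets import load_dataset

# Stream to avoid downloading the full corpus up front.
ds = load_dataset("Proyag/paracrawl_context", split="train", streaming=True)

# Each record should pair parallel sentences with their preceding web context.
print(next(iter(ds)))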
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Norwegian Bokmål-English parallel data from release 7 of the ParaCrawl project, specifically "Broader Web-Scale Provision of Parallel Corpora for European Languages". This version is filtered with BiCleaner with a threshold of 0.5. Data was crawled from the web following robots.txt, as is standard practice. The crawl is not targeted to a particular domain, intending to provide broad coverage.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Bulgarian-English parallel data from release 8 of the ParaCrawl project, specifically "Broader Web-Scale Provision of Parallel Corpora for European Languages". This version is filtered with BiCleaner with a threshold of 0.5. Data was crawled from the web following robots.txt, as is standard practice. The crawl is not targeted to a particular domain, intending to provide broad coverage.
WikiMatrix
This is a loader for the WikiMatrix dataset. I have nothing to do with dataset creation; I have just written a loader that downloads the files provided and loads them into a Hugging Face dataset. (The loader was in fact largely copied from the ParaCrawl dataset builder here, so my contribution is quite minimal.)
Use
You can load the dataset using two concatenated language codes of the WikiMatrix languages, for example: import datasets dataset =… See the full description on the dataset page: https://huggingface.co/datasets/isabelvp/wikimatrix-loader.
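Since the card's own example is truncated, the following is only a hypothetical illustration of that loading pattern; the config string ("en-fr" here) and its separator are assumptions:

import datasets

# Load an English-French bitext; the exact config naming scheme is assumed.
dataset = datasets.load_dataset("isabelvp/wikimatrix-loader", "en-fr")
print(dataset)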
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Danish-English parallel data from release 8 of the ParaCrawl project, specifically "Broader Web-Scale Provision of Parallel Corpora for European Languages". This version is filtered with BiCleaner with a threshold of 0.5. Data was crawled from the web following robots.txt, as is standard practice. The crawl is not targeted to a particular domain, intending to provide broad coverage.
TAUS also generated corpora by applying Matching Data selection to DataCloud and ParaCrawl data. The query corpus was crawled from the web for the latest coronavirus-related articles and news. The selected data relates to virology, epidemics, medicine, and healthcare.
English>German, English>Spanish, English>French, English>Italian, English>Chinese, English>Russian, English>Portuguese (Brazil), English>Japanese, English>Korean, English>Portuguese, English>Dutch, English>Polish, English>Swedish, English>Danish, English>Estonian, English>Czech
Other languages are available on demand.
IndoParaCrawl
IndoParaCrawl is the ParaCrawl v7.1 dataset bulk-translated to Indonesian using Google Translate. Thanks to HuggingFace for providing free storage for datasets <3.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Italian-English parallel data from release 8 of the ParaCrawl project, specifically "Broader Web-Scale Provision of Parallel Corpora for European Languages". This version is filtered with BiCleaner with a threshold of 0.5. Data was crawled from the web following robots.txt, as is standard practice. The crawl is not targeted to a particular domain, intending to provide broad coverage. Files contain full information for deferred crawling.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This file contains URLs and hashes of text to form a parallel corpus, but not the sentences themselves. You probably want the actual parallel data; see the version without "deferred files" in the title. To reconstruct a parallel corpus, use the deferred crawling tool at https://github.com/bitextor/deferred-crawling, which will download pages and produce a corpus, probably smaller than the original due to link rot. This format is intended to support parties whose lawyers believe it is ok to scrape websites directly but not ok to copy them from a third party. Based on English-Norwegian Bokmål parallel data from release 9 of the ParaCrawl project, specifically "Continued Web-Scale Provision of Parallel Corpora for European Languages". This version is filtered with BiCleaner AI. Data was crawled from the web following robots.txt, as is standard practice. The crawl is not targeted to a particular domain, intending to provide broad coverage.
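Conceptually, reconstruction means re-downloading each URL and keeping only the text whose hash matches an entry in the deferred file. The sketch below illustrates the idea only; it is not the bitextor/deferred-crawling tool, and the hashing scheme is an assumption:

import hashlib
import urllib.request

def reconstruct(url, wanted_hashes):
    """Fetch a page and keep lines whose hash appears in the deferred file."""
    html = urllib.request.urlopen(url).read().decode("utf-8", errors="replace")
    kept = []
    for line in html.splitlines():
        text = line.strip()
        # The hash function (SHA-1 here) is an illustrative assumption.
        if text and hashlib.sha1(text.encode("utf-8")).hexdigest() in wanted_hashes:
            kept.append(text)
    # Pages change over time, so the reconstructed corpus shrinks with link rot.
    return kept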
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The January 2018 release of ParaCrawl is the first version of the corpus. It contains parallel corpora for 11 languages paired with English, crawled from a large number of websites. The selection of websites is based on CommonCrawl, but ParaCrawl is extracted from a brand-new crawl with much higher coverage of these selected websites than CommonCrawl. Since the data is fairly raw, it is released with two quality metrics that can be used for corpus filtering; an official "clean" version of each corpus uses one of the metrics. For more details and raw data downloads, please visit http://paracrawl.eu/releases.html