41 datasets found
  1. P

    ParaCrawl Dataset

    • paperswithcode.com
    Updated Jul 10, 2022
    Cite
    Marta Bañón; Pin-zhen Chen; Barry Haddow; Kenneth Heafield; Hieu Hoang; Miquel Esplà-Gomis; Mikel L. Forcada; Amir Kamran; Faheem Kirefu; Philipp Koehn; Sergio Ortiz Rojas; Leopoldo Pla Sempere; Gema Ramírez-Sánchez; Elsa Sarrías; Marek Strelec; Brian Thompson; William Waites; Dion Wiggins; Jaume Zaragoza (2022). ParaCrawl Dataset [Dataset]. https://paperswithcode.com/dataset/paracrawl
    Explore at:
    Dataset updated
    Jul 10, 2022
    Authors
    Marta Bañón; Pin-zhen Chen; Barry Haddow; Kenneth Heafield; Hieu Hoang; Miquel Esplà-Gomis; Mikel L. Forcada; Amir Kamran; Faheem Kirefu; Philipp Koehn; Sergio Ortiz Rojas; Leopoldo Pla Sempere; Gema Ramírez-Sánchez; Elsa Sarrías; Marek Strelec; Brian Thompson; William Waites; Dion Wiggins; Jaume Zaragoza
    Description

    ParaCrawl v7.1 is a parallel dataset with 41 language pairs, 39 of which are aligned with English. It was mined with the parallel-data-crawling tool Bitextor, whose pipeline covers downloading documents, preprocessing and normalization, aligning documents and segments, and filtering noisy data via Bicleaner. ParaCrawl focuses on European languages, but v7.1 also includes 9 lower-resource, non-European language pairs.

  2. h

    opus_paracrawl

    • huggingface.co
    Cite
    Language Technology Research Group at the University of Helsinki, opus_paracrawl [Dataset]. https://huggingface.co/datasets/Helsinki-NLP/opus_paracrawl
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset authored and provided by
    Language Technology Research Group at the University of Helsinki
    License

    https://choosealicense.com/licenses/cc0-1.0/

    Description

    Dataset Card for OpusParaCrawl

      Dataset Summary
    

    Parallel corpora from web crawls collected in the ParaCrawl project. The dataset contains:

    42 languages, 43 bitexts
    total number of files: 59,996
    total number of tokens: 56.11G
    total number of sentence fragments: 3.13G

    To load a language pair that isn't part of the config, specify the two language codes as a pair, e.g. dataset = load_dataset("opus_paracrawl", lang1="en", lang2="so")

    You can find the valid… See the full description on the dataset page: https://huggingface.co/datasets/Helsinki-NLP/opus_paracrawl.

  3. E

    Anonymised ParaCrawl release 7 Norwegian Bokmål-English

    • live.european-language-grid.eu
    tmx
    Updated Sep 9, 2021
    + more versions
    Cite
    (2021). Anonymised ParaCrawl release 7 Norwegian Bokmål-English [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/4771
    Explore at:
    Available download formats: tmx
    Dataset updated
    Sep 9, 2021
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    This corpus was run through BiRoamer https://github.com/bitextor/biroamer to anonymise the Norwegian Bokmål-English parallel data from release 7 of the ParaCrawl project, specifically "Broader Web-Scale Provision of Parallel Corpora for European Languages". This version is filtered with BiCleaner with a threshold of 0.5. Data was crawled from the web following robots.txt, as is standard practice. The crawl is not targeted to a particular domain, intending to provide broad coverage. Anonymisation is an automated process driven by named entity recognition and is far from perfect.

  4. T

    para_crawl

    • tensorflow.org
    Updated Dec 15, 2022
    Cite
    (2022). para_crawl [Dataset]. https://www.tensorflow.org/datasets/catalog/para_crawl
    Explore at:
    Dataset updated
    Dec 15, 2022
    Description

    Web-Scale Parallel Corpora for Official European Languages.

    To use this dataset:

    import tensorflow_datasets as tfds

    # Load the training split and print the first four examples.
    ds = tfds.load('para_crawl', split='train')
    for ex in ds.take(4):
        print(ex)
    

    See the guide for more information on tensorflow_datasets.

  5. E

    ParaCrawl release 4 Bulgarian-English

    • live.european-language-grid.eu
    tmx
    Updated Dec 7, 2023
    + more versions
    Cite
    (2023). ParaCrawl release 4 Bulgarian-English [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/4664
    Explore at:
    Available download formats: tmx
    Dataset updated
    Dec 7, 2023
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    Bulgarian-English parallel data from release 4 of the ParaCrawl project, specifically "Provision of Web-Scale Parallel Corpora for Official European Languages". This version is filtered with BiCleaner with a threshold of 0.7. Data was crawled from the web following robots.txt, as is standard practice. The crawl is not targeted to a particular domain, intending to provide broad coverage.
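
    The threshold filtering mentioned above can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that each sentence pair comes with a quality score in a tab-separated "source, target, score" line and that pairs scoring at or above the threshold are kept; the line layout, the helper names and the sample sentences are hypothetical and are not taken from BiCleaner or the ParaCrawl release files.

```python
from typing import Iterable, Iterator, Tuple

# (source, target, score): a hypothetical layout used only in this sketch.
ScoredPair = Tuple[str, str, float]


def parse_scored_line(line: str) -> ScoredPair:
    """Parse one 'source<TAB>target<TAB>score' line (an assumed format)."""
    source, target, score = line.rstrip("\n").split("\t")
    return source, target, float(score)


def filter_by_threshold(pairs: Iterable[ScoredPair], threshold: float) -> Iterator[ScoredPair]:
    """Yield only the pairs whose score is at least `threshold`."""
    for source, target, score in pairs:
        if score >= threshold:
            yield source, target, score


# Made-up sample data for illustration.
sample_lines = [
    "Добро утро\tGood morning\t0.93",
    "Кликнете тук\tCookie policy\t0.12",
    "Благодаря\tThank you\t0.70",
]

kept = list(filter_by_threshold(map(parse_scored_line, sample_lines), threshold=0.7))
print(len(kept))
```

    Raising the threshold keeps fewer but cleaner pairs; lowering it, as in the releases filtered at 0.5, trades some noise for more data.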

  6. s

    ParaCrawl

    • marketplace.sshopencloud.eu
    • opendatalab.com
    Updated Apr 24, 2020
    Cite
    (2020). ParaCrawl [Dataset]. https://marketplace.sshopencloud.eu/dataset/zLP0jv
    Explore at:
    Dataset updated
    Apr 24, 2020
    Description

    Web-scale parallel corpora for the languages of the EU

  7. E

    ParaCrawl release 8 Portuguese-English

    • live.european-language-grid.eu
    tmx
    Updated Jun 16, 2024
    + more versions
    Cite
    (2024). ParaCrawl release 8 Portuguese-English [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7068
    Explore at:
    Available download formats: tmx
    Dataset updated
    Jun 16, 2024
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    Portuguese-English parallel data from release 8 of the ParaCrawl project, specifically "Broader Web-Scale Provision of Parallel Corpora for European Languages". This version is filtered with BiCleaner with a threshold of 0.5. Data was crawled from the web following robots.txt, as is standard practice. The crawl is not targeted to a particular domain, intending to provide broad coverage.
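
    This release is offered in tmx format. As a rough sketch of how such a file could be read, the Python snippet below parses a small inline TMX sample with the standard library. The sample content, the `read_tmx_pairs` helper and the assumption that each translation unit holds one `tuv` element per language are illustrative and are not taken from the ParaCrawl files themselves.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Made-up two-unit sample; real files are far larger.
SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="en" datatype="plaintext" segtype="sentence"
          adminlang="en" creationtool="example" creationtoolversion="0"
          o-tmf="example"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Good morning.</seg></tuv>
      <tuv xml:lang="pt"><seg>Bom dia.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Thank you.</seg></tuv>
      <tuv xml:lang="pt"><seg>Obrigado.</seg></tuv>
    </tu>
  </body>
</tmx>
"""


def read_tmx_pairs(tmx_text: str) -> List[Dict[str, str]]:
    """Return one {language code: segment text} dict per translation unit."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.findall("tuv"):
            lang = tuv.get(XML_LANG)
            seg = tuv.find("seg")
            if lang is not None and seg is not None:
                # itertext() also collects text nested inside inline markup.
                segments[lang] = "".join(seg.itertext()).strip()
        pairs.append(segments)
    return pairs


pairs = read_tmx_pairs(SAMPLE_TMX)
for pair in pairs:
    print(pair["en"], "|", pair["pt"])
```

    For a full-size release file, a streaming parser such as `xml.etree.ElementTree.iterparse` would likely be a better fit than loading the whole document into memory.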

  8. E

    ParaCrawl release 8 Slovenian-English

    • live.european-language-grid.eu
    tmx
    Updated Jun 14, 2024
    + more versions
    Cite
    (2024). ParaCrawl release 8 Slovenian-English [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7062
    Explore at:
    Available download formats: tmx
    Dataset updated
    Jun 14, 2024
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    Slovenian-English parallel data from release 8 of the ParaCrawl project, specifically "Broader Web-Scale Provision of Parallel Corpora for European Languages". This version is filtered with BiCleaner with a threshold of 0.5. Data was crawled from the web following robots.txt, as is standard practice. The crawl is not targeted to a particular domain, intending to provide broad coverage.

  9. E

    ParaCrawl release 6 Polish-English

    • live.european-language-grid.eu
    tmx
    Updated Oct 28, 2023
    + more versions
    Cite
    (2023). ParaCrawl release 6 Polish-English [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/4719
    Explore at:
    Available download formats: tmx
    Dataset updated
    Oct 28, 2023
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    Polish-English parallel data from release 6 of the ParaCrawl project, specifically "Broader Web-Scale Provision of Parallel Corpora for European Languages". This version is filtered with BiCleaner with a threshold of 0.7 and also introduces near-duplicate removal. Data was crawled from the web following robots.txt, as is standard practice. The crawl is not targeted to a particular domain, intending to provide broad coverage.
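
    The description does not say how near-duplicates are detected. One common approach, shown here only as an assumed stand-in and not as the ParaCrawl method, is to compare sentence pairs after stripping case, digits, punctuation and whitespace, and to keep the first pair seen for each normalized form. The function names and sample sentences below are hypothetical.

```python
import re
from typing import Iterable, List, Set, Tuple

Pair = Tuple[str, str]


def normalize(text: str) -> str:
    """Lowercase the text and drop digits, punctuation, underscores and whitespace."""
    return re.sub(r"[\W\d_]+", "", text.lower())


def drop_near_duplicates(pairs: Iterable[Pair]) -> List[Pair]:
    """Keep the first pair for every distinct normalized (source, target) key."""
    seen: Set[Pair] = set()
    kept: List[Pair] = []
    for source, target in pairs:
        key = (normalize(source), normalize(target))
        if key not in seen:
            seen.add(key)
            kept.append((source, target))
    return kept


# Made-up sample: the first two pairs differ only in numbers and punctuation.
sample = [
    ("Pokój 12 kosztuje 40 EUR.", "Room 12 costs 40 EUR."),
    ("Pokój 15 kosztuje 45 EUR!", "Room 15 costs 45 EUR!"),
    ("Dziękuję bardzo.", "Thank you very much."),
]

deduplicated = drop_near_duplicates(sample)
print(len(deduplicated))
```

    Template-generated pages, such as listings that differ only in a price or a date, are the kind of content this style of normalization collapses.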

  10. h

    ParaCrawl.en-es_2_of_2

    • huggingface.co
    Updated Aug 1, 2012
    + more versions
    Cite
    BackpropBuff (2012). ParaCrawl.en-es_2_of_2 [Dataset]. https://huggingface.co/datasets/BackpropBuff/ParaCrawl.en-es_2_of_2
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 1, 2012
    Authors
    BackpropBuff
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    BackpropBuff/ParaCrawl.en-es_2_of_2 dataset hosted on Hugging Face and contributed by the HF Datasets community

  11. h

    paracrawl_context

    • huggingface.co
    Updated Oct 16, 2013
    Cite
    Proyag Pal (2013). paracrawl_context [Dataset]. https://huggingface.co/datasets/Proyag/paracrawl_context
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 16, 2013
    Authors
    Proyag Pal
    License

    https://choosealicense.com/licenses/cc0-1.0/

    Description

    Dataset Card for ParaCrawl_Context

    This is a dataset for document-level machine translation introduced in the ACL 2024 paper Document-Level Machine Translation with Large-Scale Public Parallel Data. It is a dataset consisting of parallel sentence pairs from the ParaCrawl dataset along with corresponding preceding context extracted from the webpages the sentences were crawled from.

      Dataset Details

      Dataset Description
    

    This dataset adds document-level… See the full description on the dataset page: https://huggingface.co/datasets/Proyag/paracrawl_context.

  12. E

    ParaCrawl release 7 Norwegian Bokmål-English

    • live.european-language-grid.eu
    tmx
    Updated Oct 30, 2023
    + more versions
    Cite
    (2023). ParaCrawl release 7 Norwegian Bokmål-English [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/4742
    Explore at:
    Available download formats: tmx
    Dataset updated
    Oct 30, 2023
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    Norwegian Bokmål-English parallel data from release 7 of the ParaCrawl project, specifically "Broader Web-Scale Provision of Parallel Corpora for European Languages". This version is filtered with BiCleaner with a threshold of 0.5. Data was crawled from the web following robots.txt, as is standard practice. The crawl is not targeted to a particular domain, intending to provide broad coverage.

  13. E

    ParaCrawl release 8 Bulgarian-English

    • live.european-language-grid.eu
    tmx
    Updated Dec 3, 2023
    + more versions
    Cite
    (2023). ParaCrawl release 8 Bulgarian-English [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7085
    Explore at:
    Available download formats: tmx
    Dataset updated
    Dec 3, 2023
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    Bulgarian-English parallel data from release 8 of the ParaCrawl project, specifically "Broader Web-Scale Provision of Parallel Corpora for European Languages". This version is filtered with BiCleaner with a threshold of 0.5. Data was crawled from the web following robots.txt, as is standard practice. The crawl is not targeted to a particular domain, intending to provide broad coverage.

  14. h

    wikimatrix-loader

    • huggingface.co
    Cite
    Isabel papadimitriou, wikimatrix-loader [Dataset]. https://huggingface.co/datasets/isabelvp/wikimatrix-loader
    Explore at:
    Authors
    Isabel papadimitriou
    Description

    WikiMatrix

    This is a loader for the WikiMatrix dataset. I have nothing to do with the dataset's creation and have just written a loader that downloads the provided files and loads them into a Hugging Face dataset. (The loader was in fact largely copied from the ParaCrawl dataset builder here, so my contribution is quite minimal.)

      Use
    

    You can load the dataset using two concatenated language codes of the WikiMatrix languages, for example: import datasets dataset =… See the full description on the dataset page: https://huggingface.co/datasets/isabelvp/wikimatrix-loader.

  15. E

    ParaCrawl release 8 Danish-English

    • live.european-language-grid.eu
    tmx
    Updated Dec 3, 2023
    + more versions
    Cite
    (2023). ParaCrawl release 8 Danish-English [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7087
    Explore at:
    Available download formats: tmx
    Dataset updated
    Dec 3, 2023
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    Danish-English parallel data from release 8 of the ParaCrawl project, specifically "Broader Web-Scale Provision of Parallel Corpora for European Languages". This version is filtered with BiCleaner with a threshold of 0.5. Data was crawled from the web following robots.txt, as is standard practice. The crawl is not targeted to a particular domain, intending to provide broad coverage.

  16. d

    TAUS Language Translation Data | Parallel translation for Covid-19, Medical...

    • datarade.ai
    Updated Dec 16, 2020
    Cite
    TAUS (2020). TAUS Language Translation Data | Parallel translation for Covid-19, Medical and Healthcare, various languages for Machine Learning [Dataset]. https://datarade.ai/data-products/taus-parallel-text-covid-medical-and-healthcare-languages-see-description-taus
    Explore at:
    Available download formats: .xml, .csv, .xls, .txt
    Dataset updated
    Dec 16, 2020
    Dataset authored and provided by
    TAUS
    Area covered
    United States of America, United Kingdom, Russian Federation, Poland, Czech Republic, Spain, Portugal, China, Denmark, Korea (Republic of)
    Description

    TAUS also generated corpora by applying Matching Data selection to DataCloud and ParaCrawl data. The query corpus used is crawled from the web for the latest Corona virus-related articles and news. The selected data is related to virology, epidemic, medicine, and healthcare.

    English>German, English>Spanish, English>French, English>Italian, English>Chinese, English>Russian, English>Portuguese (Brazil), English>Japanese, English>Korean, English>Portuguese, English>Dutch, English>Polish, English>Swedish, English>Danish, English>Estonian, English>Czech

    Other languages are available on demand.

  17. h

    IndoParaCrawl

    • huggingface.co
    Updated Apr 11, 2022
    Cite
    Akmal (2022). IndoParaCrawl [Dataset]. https://huggingface.co/datasets/Wikidepia/IndoParaCrawl
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 11, 2022
    Authors
    Akmal
    Description

    IndoParaCrawl

    IndoParaCrawl is the ParaCrawl v7.1 dataset bulk-translated into Indonesian using Google Translate. Thanks to Hugging Face for providing free storage for datasets <3.

  18. E

    ParaCrawl release 8 Italian-English - deferred files

    • live.european-language-grid.eu
    tmx
    Updated Dec 3, 2023
    + more versions
    Cite
    (2023). ParaCrawl release 8 Italian-English - deferred files [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7044
    Explore at:
    Available download formats: tmx
    Dataset updated
    Dec 3, 2023
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    Italian-English parallel data from release 8 of the ParaCrawl project, specifically "Broader Web-Scale Provision of Parallel Corpora for European Languages". This version is filtered with BiCleaner with a threshold of 0.5. Data was crawled from the web following robots.txt, as is standard practice. The crawl is not targeted to a particular domain, intending to provide broad coverage. Files contain full information for deferred crawling.

  19. E

    ParaCrawl release 9 English-Norwegian Bokmål - deferred files

    • live.european-language-grid.eu
    tmx
    Updated Oct 28, 2023
    + more versions
    Cite
    (2023). ParaCrawl release 9 English-Norwegian Bokmål - deferred files [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/19690
    Explore at:
    Available download formats: tmx
    Dataset updated
    Oct 28, 2023
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    This file contains the URLs and text hashes needed to form a parallel corpus, but not the sentences themselves. You probably want the actual parallel data; see the version without "deferred files" in the title. To reconstruct a parallel corpus, use the deferred crawling tool at https://github.com/bitextor/deferred-crawling, which downloads the pages and produces a corpus that is probably smaller than the original because of link rot. This format is intended to support parties whose lawyers believe it is ok to scrape websites directly but not ok to copy them from a third party. Based on English-Norwegian Bokmål parallel data from release 9 of the ParaCrawl project, specifically "Continued Web-Scale Provision of Parallel Corpora for European Languages". This version is filtered with BiCleaner AI. Data was crawled from the web following robots.txt, as is standard practice. The crawl is not targeted to a particular domain, intending to provide broad coverage.
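
    The idea behind a deferred file can be sketched in Python as follows. This is not the deferred-crawling tool and does not reproduce its file layout or hash function; it assumes, for illustration only, that each record is a (URL, hash) pair, that the hash is a SHA-1 of the sentence text, and that the pages have already been downloaded and split into sentences. All names and sample data are hypothetical.

```python
import hashlib
from typing import Dict, Iterable, List, Tuple


def text_hash(sentence: str) -> str:
    """Stand-in hash: SHA-1 of the stripped UTF-8 text (the real tool may differ)."""
    return hashlib.sha1(sentence.strip().encode("utf-8")).hexdigest()


def reconstruct(records: Iterable[Tuple[str, str]],
                fetched_pages: Dict[str, List[str]]) -> List[str]:
    """Recover the sentences whose hash is listed for a page that could be fetched."""
    recovered = []
    for url, wanted_hash in records:
        # A URL missing from fetched_pages models link rot: nothing is recovered.
        for sentence in fetched_pages.get(url, []):
            if text_hash(sentence) == wanted_hash:
                recovered.append(sentence.strip())
                break
    return recovered


# Made-up deferred records: URLs plus hashes, but no sentence text.
records = [
    ("https://example.org/a", text_hash("Good morning.")),
    ("https://example.org/gone", text_hash("This page no longer exists.")),
]

# Made-up result of re-downloading the pages; the second URL is unavailable.
fetched_pages = {
    "https://example.org/a": ["Welcome!", "Good morning.", "Contact us."],
}

recovered = reconstruct(records, fetched_pages)
print(recovered)
```

    In this sample only one of the two recorded sentences comes back, which mirrors why a reconstructed corpus is probably smaller than the original.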

  20. E

    ParaCrawl Corpus version 1.0

    • live.european-language-grid.eu
    • lindat.mff.cuni.cz
    binary format
    Updated Jan 13, 2018
    Cite
    (2018). ParaCrawl Corpus version 1.0 [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/1226
    Explore at:
    Available download formats: binary format
    Dataset updated
    Jan 13, 2018
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    The January 2018 release of ParaCrawl is the first version of the corpus. It contains parallel corpora for 11 languages paired with English, crawled from a large number of websites. The selection of websites is based on CommonCrawl, but ParaCrawl is extracted from a brand new crawl which has much higher coverage of these selected websites than CommonCrawl. Since the data is fairly raw, it is released with two quality metrics that can be used for corpus filtering. An official "clean" version of each corpus uses one of the metrics. For more details and raw data download please visit: http://paracrawl.eu/releases.html
