6 datasets found
  1. doclevel-MT-benchmark-discoMT2019

    • live.european-language-grid.eu
    • data.niaid.nih.gov
    Updated Apr 10, 2024
    Cite
    (2024). doclevel-MT-benchmark-discoMT2019 [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7742
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This release contains data sets for experiments with document-level machine translation. The data sets have been used in previous studies and are provided here for replicability and comparison with other systems. They are taken from the English-German news translation task at WMT 2019 and from the English-German bitext in the OpenSubtitles collection v2016 from OPUS. All data sets are sentence-aligned: corresponding lines on the two sides are aligned to each other. Document boundaries are marked with empty lines (on both sides of the parallel corpus).

    The data set has been used in the following publication:

    @inproceedings{scherrer-tiedemann-loaiciga-2019,
        title = "Analysing concatenation approaches to document-level NMT in two different domains",
        author = {Scherrer, Yves and Tiedemann, J{\"o}rg and Lo{\'a}iciga, Sharid},
        booktitle = "Proceedings of the Third Workshop on Discourse in Machine Translation",
        month = nov,
        year = "2019",
        address = "Hong-Kong",
        publisher = "Association for Computational Linguistics",
    }

    Please cite that paper if you use the data set in your own work.
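The layout described above (one sentence per line, aligned line by line, with empty lines marking document boundaries on both sides) could be read roughly as follows. This is a minimal sketch, not code from the release: the function name `read_parallel_documents` is hypothetical, and raising an error when only one side has an empty line is an assumption about how strictly to treat the format.

```python
from typing import Iterable, List, Tuple

# A document is a list of (source sentence, target sentence) pairs.
Document = List[Tuple[str, str]]


def read_parallel_documents(src_lines: Iterable[str],
                            tgt_lines: Iterable[str]) -> List[Document]:
    """Group line-aligned parallel text into documents.

    Hypothetical helper: a line that is empty on BOTH sides ends the
    current document, as the dataset description states.
    """
    src = [line.rstrip("\n") for line in src_lines]
    tgt = [line.rstrip("\n") for line in tgt_lines]
    if len(src) != len(tgt):
        raise ValueError("source and target must have the same number of lines")

    documents: List[Document] = []
    current: Document = []
    for number, (s, t) in enumerate(zip(src, tgt), start=1):
        s_empty, t_empty = not s.strip(), not t.strip()
        if s_empty and t_empty:
            # Document boundary; consecutive empty lines yield no empty documents.
            if current:
                documents.append(current)
                current = []
        elif s_empty != t_empty:
            # Assumption: a one-sided empty line means the files are misaligned.
            raise ValueError(f"boundary mismatch at line {number}")
        else:
            current.append((s, t))
    if current:
        documents.append(current)
    return documents


if __name__ == "__main__":
    english = ["Hello.\n", "How are you?\n", "\n", "Bye.\n"]
    german = ["Hallo.\n", "Wie geht's?\n", "\n", "Tschüss.\n"]
    for doc in read_parallel_documents(english, german):
        print(doc)
```

With real data, the two arguments would be open file handles for the source and target files; the in-memory lists here only stand in for them.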

  2. synth-greedy-decoded-doclevel

    • huggingface.co
    Updated Feb 7, 2025
    Cite
    Eole-NLP (2025). synth-greedy-decoded-doclevel [Dataset]. https://huggingface.co/datasets/eole-nlp/synth-greedy-decoded-doclevel
    Dataset authored and provided by
    Eole-NLP
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    eole-nlp/synth-greedy-decoded-doclevel dataset hosted on Hugging Face and contributed by the HF Datasets community

  3. mt-doclevel-ab-test

    • huggingface.co
    Updated Jun 17, 2025
    Cite
    Supertext (2025). mt-doclevel-ab-test [Dataset]. https://huggingface.co/datasets/Supertext/mt-doclevel-ab-test
    Dataset authored and provided by
    Supertext
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    A/B Test Supertext vs DeepL

    We release all evaluation data and scripts for further analysis and reproduction of the accompanying paper: A comparison of translation performance between DeepL and Supertext. The data consists of document-level translations by Supertext and DeepL as well as accompanying ratings by professional translators. Please find more details in the paper. Please note that the empty lines correspond to paragraph boundaries (i.e., double line breaks) in the original… See the full description on the dataset page: https://huggingface.co/datasets/Supertext/mt-doclevel-ab-test.

  4. kunpeng-doc-level-webnovel-instruction

    • huggingface.co
    Updated Oct 10, 2024
    Cite
    Web_Novel_Trans (2024). kunpeng-doc-level-webnovel-instruction [Dataset]. https://huggingface.co/datasets/WebNovelTrans/kunpeng-doc-level-webnovel-instruction
    Dataset authored and provided by
    Web_Novel_Trans
    Description

    WebNovelTrans/kunpeng-doc-level-webnovel-instruction dataset hosted on Hugging Face and contributed by the HF Datasets community

  5. DocNMT

    • huggingface.co
    Updated May 24, 2023
    Cite
    Guangsheng Bao (2023). DocNMT [Dataset]. https://huggingface.co/datasets/gshbao/DocNMT
    Authors
    Guangsheng Bao
    License

    https://choosealicense.com/licenses/afl-3.0/

    Description

    Dataset Card for Dataset Name

      Dataset Summary

    Benchmark datasets for document-level machine translation.

      Supported Tasks

    Document-level machine translation.

      Languages

    English-German

      Dataset Structure

      Data Instances

    TED: iwslt17, News: nc2016, Europarl: europarl7

      Data Fields

    Plain text in which each line represents a sentence, with multiple lines separated by '

      Data Splits

    train… See the full description on the dataset page: https://huggingface.co/datasets/gshbao/DocNMT.

  6. Flores-Indic-Doc-Level

    • huggingface.co
    Updated Feb 1, 2025
    Cite
    Varun Gumma (2025). Flores-Indic-Doc-Level [Dataset]. https://huggingface.co/datasets/VarunGumma/Flores-Indic-Doc-Level
    Authors
    Varun Gumma
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset was constructed by merging individual sentences from the Flores dataset based on matching domain, topic, and URL attributes. The result is a long-context, document-level parallel benchmark. For more details on the domains and dataset statistics, please refer to the original paper and the dataset.
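The merging step described above can be illustrated with a small sketch that groups sentence rows sharing the same domain, topic, and URL into one document. It is not the dataset author's code: the function name `merge_sentences`, the row keys (`domain`, `topic`, `url`, `src`, `tgt`), keeping the input order, and joining sentences with a single space are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def merge_sentences(rows: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Merge sentence-level rows into document-level rows.

    Hypothetical helper: rows with identical (domain, topic, url) are
    treated as one document, and their sentences are joined in input order.
    """
    groups: Dict[Tuple[str, str, str], List[Dict[str, str]]] = {}
    for row in rows:
        key = (row["domain"], row["topic"], row["url"])
        # dicts preserve insertion order, so documents appear in first-seen order.
        groups.setdefault(key, []).append(row)

    documents = []
    for (domain, topic, url), members in groups.items():
        documents.append({
            "domain": domain,
            "topic": topic,
            "url": url,
            # Assumption: a single space is an adequate sentence separator.
            "src": " ".join(m["src"] for m in members),
            "tgt": " ".join(m["tgt"] for m in members),
        })
    return documents


if __name__ == "__main__":
    sample = [
        {"domain": "news", "topic": "health", "url": "u1", "src": "A.", "tgt": "a."},
        {"domain": "news", "topic": "health", "url": "u1", "src": "B.", "tgt": "b."},
        {"domain": "wiki", "topic": "travel", "url": "u2", "src": "C.", "tgt": "c."},
    ]
    for doc in merge_sentences(sample):
        print(doc["url"], "|", doc["src"], "|", doc["tgt"])
```

Grouping on the full three-part key, rather than on the URL alone, follows the wording of the description; whether the actual construction applied further filtering is not stated there.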
