6 datasets found

doclevel-MT-benchmark-discoMT2019
live.european-language-grid.eu
data.niaid.nih.gov
+1 more
Updated Apr 10, 2024
Cite
(2024). doclevel-MT-benchmark-discoMT2019 [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7742
Dataset updated
Apr 10, 2024
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This release contains data sets for experiments with document-level machine translation. The data sets have been used in previous studies and are provided here for replicability and comparison with other systems. They are taken from the English-German news translation task at WMT 2019 and the English-German bitext in the OpenSubtitles collection v2016 from OPUS. All data sets are sentence-aligned: corresponding lines are aligned to each other. Document boundaries are marked with empty lines (on both sides of the parallel corpus).

The data set has been used in the following publication:

@inproceedings{scherrer-tiedemann-loaiciga-2019,
    title = "Analysing concatenation approaches to document-level NMT in two different domains",
    author = {Scherrer, Yves and Tiedemann, J{\"o}rg and Lo{\'a}iciga, Sharid},
    booktitle = "Proceedings of the Third Workshop on Discourse in Machine Translation",
    month = nov,
    year = "2019",
    address = "Hong-Kong",
    publisher = "Association for Computational Linguistics",
}

Please cite that paper if you use the data set in your own work.
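The layout described above (two line-aligned files, with an empty line on both sides marking a document boundary) can be read with a short routine. The following is a minimal sketch, not code shipped with the release: the function name `read_parallel_documents` and the sample sentences are hypothetical, and it assumes each side is available as an iterable of lines.

```python
from itertools import zip_longest
from typing import Iterable, Iterator, List, Optional, Tuple

# A document is a list of (source sentence, target sentence) pairs.
Document = List[Tuple[str, str]]


def read_parallel_documents(
    src_lines: Iterable[str], tgt_lines: Iterable[str]
) -> Iterator[Document]:
    """Yield documents from two line-aligned sides of a parallel corpus.

    Assumption: an empty line on BOTH sides marks a document boundary.
    """
    doc: Document = []
    pairs = zip_longest(src_lines, tgt_lines, fillvalue=None)
    for line_no, (src, tgt) in enumerate(pairs, start=1):
        if src is None or tgt is None:
            raise ValueError(f"line {line_no}: the two sides differ in length")
        src, tgt = src.strip(), tgt.strip()
        if not src and not tgt:
            # Boundary: emit the current document, if it has any sentences.
            if doc:
                yield doc
                doc = []
        elif not src or not tgt:
            raise ValueError(f"line {line_no}: empty line on one side only")
        else:
            doc.append((src, tgt))
    if doc:
        # The last document may not be followed by an empty line.
        yield doc


# Hypothetical sample data: two documents separated by an empty line.
en = ["Hello.", "How are you?", "", "Good night."]
de = ["Hallo.", "Wie geht es dir?", "", "Gute Nacht."]
docs = list(read_parallel_documents(en, de))
```

To read from disk, two open file handles can be passed in place of the lists. The sketch raises an error when only one side has an empty line, since that would indicate a broken alignment.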
synth-greedy-decoded-doclevel
huggingface.co
Updated Feb 7, 2025
+ more versions
Cite
Eole-NLP (2025). synth-greedy-decoded-doclevel [Dataset]. https://huggingface.co/datasets/eole-nlp/synth-greedy-decoded-doclevel
Dataset updated
Feb 7, 2025
Dataset authored and provided by
Eole-NLP
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
eole-nlp/synth-greedy-decoded-doclevel dataset hosted on Hugging Face and contributed by the HF Datasets community
mt-doclevel-ab-test
huggingface.co
Updated Jun 17, 2025
Cite
Supertext (2025). mt-doclevel-ab-test [Dataset]. https://huggingface.co/datasets/Supertext/mt-doclevel-ab-test
Dataset updated
Jun 17, 2025
Dataset authored and provided by
Supertext
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
A/B Test Supertext vs DeepL

We release all evaluation data and scripts for further analysis and reproduction of the accompanying paper: A comparison of translation performance between DeepL and Supertext. The data consists of document-level translations by Supertext and DeepL as well as accompanying ratings by professional translators. Please find more details in the paper. Please note that the empty lines correspond to paragraph boundaries (i.e., double line breaks) in the original… See the full description on the dataset page: https://huggingface.co/datasets/Supertext/mt-doclevel-ab-test.
kunpeng-doc-level-webnovel-instruction
huggingface.co
Updated Oct 10, 2024
Cite
Web_Novel_Trans (2024). kunpeng-doc-level-webnovel-instruction [Dataset]. https://huggingface.co/datasets/WebNovelTrans/kunpeng-doc-level-webnovel-instruction
Dataset updated
Oct 10, 2024
Dataset authored and provided by
Web_Novel_Trans
Description
WebNovelTrans/kunpeng-doc-level-webnovel-instruction dataset hosted on Hugging Face and contributed by the HF Datasets community
DocNMT
huggingface.co
Updated May 24, 2023
Cite
Guangsheng Bao (2023). DocNMT [Dataset]. https://huggingface.co/datasets/gshbao/DocNMT
Dataset updated
May 24, 2023
Authors
Guangsheng Bao
License
Academic Free License v3.0 (AFL-3.0): https://choosealicense.com/licenses/afl-3.0/
Description
Dataset Card for Dataset Name

Dataset Summary

The benchmark datasets for document-level machine translation.

Supported Tasks

Document-level Machine Translation Tasks.

Languages

English-German

Dataset Structure

Data Instances

TED: iwslt17, News: nc2016, Europarl: europarl7

Data Fields

Plain text in which each line represents a sentence and multiple lines separated by '

Data Splits

train… See the full description on the dataset page: https://huggingface.co/datasets/gshbao/DocNMT.
Flores-Indic-Doc-Level
huggingface.co
Updated Feb 1, 2025
Cite
Varun Gumma (2025). Flores-Indic-Doc-Level [Dataset]. https://huggingface.co/datasets/VarunGumma/Flores-Indic-Doc-Level
Dataset updated
Feb 1, 2025
Authors
Varun Gumma
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset was constructed by merging individual sentences from the Flores dataset based on matching domain, topic, and URL attributes. The result is a long-context, document-level parallel benchmark. For more details on the domains and dataset statistics, please refer to the original paper and the dataset.
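The merging step described above can be illustrated as a group-by over sentence records. This is a minimal sketch under stated assumptions, not the dataset author's actual script: the field names `domain`, `topic`, `url`, `src`, and `tgt`, the function name `merge_into_documents`, and the choice of a single space as the joining separator are all hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Row = Dict[str, str]


def merge_into_documents(rows: Iterable[Row], sep: str = " ") -> List[Row]:
    """Merge sentence-level rows that share (domain, topic, url) into documents.

    Sentences keep their original relative order within each document.
    """
    groups: Dict[Tuple[str, str, str], List[Row]] = {}
    for row in rows:
        key = (row["domain"], row["topic"], row["url"])
        # dicts preserve insertion order, so documents appear in first-seen order.
        groups.setdefault(key, []).append(row)

    documents: List[Row] = []
    for (domain, topic, url), members in groups.items():
        documents.append(
            {
                "domain": domain,
                "topic": topic,
                "url": url,
                "src": sep.join(m["src"] for m in members),
                "tgt": sep.join(m["tgt"] for m in members),
            }
        )
    return documents


# Hypothetical sentence-level rows: the first two share all three attributes.
rows = [
    {"domain": "wikinews", "topic": "health", "url": "u1", "src": "A.", "tgt": "X."},
    {"domain": "wikinews", "topic": "health", "url": "u1", "src": "B.", "tgt": "Y."},
    {"domain": "wikibooks", "topic": "travel", "url": "u2", "src": "C.", "tgt": "Z."},
]
documents = merge_into_documents(rows)
```

Grouping on the full attribute triple, rather than on the URL alone, follows the description: sentences are merged only when domain, topic, and URL all match.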