84 datasets found
  1. gen-z-translation

    • huggingface.co
    Updated May 6, 2024
    Cite
    AI Maker Space (2024). gen-z-translation [Dataset]. https://huggingface.co/datasets/ai-maker-space/gen-z-translation
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    May 6, 2024
    Dataset provided by
    Authors
    AI Maker Space
    Description

    ai-maker-space/gen-z-translation dataset hosted on Hugging Face and contributed by the HF Datasets community

  2. tagged-tibetan-to-english-translation-dataset

    • huggingface.co
    Updated Dec 28, 2024
    Cite
    Jacob Moore (2024). tagged-tibetan-to-english-translation-dataset [Dataset]. https://huggingface.co/datasets/billingsmoore/tagged-tibetan-to-english-translation-dataset
    Explore at:
    Croissant
    Dataset updated
    Dec 28, 2024
    Authors
    Jacob Moore
    Description

    billingsmoore/tagged-tibetan-to-english-translation-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community

  3. hind_encorp

    • huggingface.co
    • paperswithcode.com
    • +3 more
    Updated Mar 22, 2014
    + more versions
    Cite
    Pavel Rychlý (2014). hind_encorp [Dataset]. https://huggingface.co/datasets/pary/hind_encorp
    Dataset updated
    Mar 22, 2014
    Authors
    Pavel Rychlý
    License

    Attribution-NonCommercial-ShareAlike 3.0 (CC BY-NC-SA 3.0), https://creativecommons.org/licenses/by-nc-sa/3.0/
    License information was derived automatically

    Description

    HindEnCorp parallel texts (sentence-aligned) come from the following sources: Tides, which contains 50K sentence pairs taken mainly from news articles. This dataset was originally collected for the DARPA-TIDES surprise-language contest in 2002, later refined at IIIT Hyderabad and provided for the NLP Tools Contest at ICON 2008 (Venkatapathy, 2008).

    Commentaries by Daniel Pipes contain 322 articles in English written by the journalist Daniel Pipes and translated into Hindi.

    EMILLE. This corpus (Baker et al., 2002) consists of three components: monolingual, parallel and annotated corpora. There are fourteen monolingual subcorpora, including both written and (for some languages) spoken data for fourteen South Asian languages. The EMILLE monolingual corpora contain in total 92,799,000 words (including 2,627,000 words of transcribed spoken data for Bengali, Gujarati, Hindi, Punjabi and Urdu). The parallel corpus consists of 200,000 words of text in English and its accompanying translations into Hindi and other languages.

    Smaller datasets as collected by Bojar et al. (2010) include the corpus used at ACL 2005 (a subcorpus of EMILLE), a corpus of named entities from Wikipedia (crawled in 2009), and an Agriculture domain parallel corpus.

    For the current release, we are extending the parallel corpus using these sources: Intercorp (Čermák and Rosen, 2012) is a large multilingual parallel corpus of 32 languages including Hindi. The central language used for alignment is Czech. Intercorp’s core texts amount to 202 million words. These core texts are most suitable for us because their sentence alignment is manually checked and therefore very reliable. They cover predominantly short stories and novels. There are seven Hindi texts in Intercorp. Unfortunately, an English translation is available for only three of them; the other four are aligned only with Czech texts. The Hindi subcorpus of Intercorp contains 118,000 words in Hindi.

    TED talks held in various languages, primarily English, are equipped with transcripts and these are translated into 102 languages. There are 179 talks for which Hindi translation is available.

    The Indic multi-parallel corpus (Birch et al., 2011; Post et al., 2012) is a corpus of texts from Wikipedia translated from the respective Indian language into English by non-expert translators hired over Mechanical Turk. The quality is thus somewhat mixed in many respects, from typesetting and punctuation through capitalization, spelling and word choice to sentence structure. Some control could in principle be obtained from the fact that every input sentence was translated 4 times. We used the 2012 release of the corpus.
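One way to use the four redundant translations mentioned above is to keep the candidate that agrees most with the others. The following is a minimal sketch of that idea, not the method used by the corpus authors; the function name `consensus_translation` and the use of `difflib` character-level similarity are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def consensus_translation(candidates):
    """Return the candidate with the highest mean similarity to the others.

    Hypothetical helper: similarity is a case-insensitive character-level
    ratio from difflib, chosen only to keep the sketch dependency-free.
    """
    if not candidates:
        raise ValueError("need at least one candidate translation")
    if len(candidates) == 1:
        return candidates[0]

    def mean_similarity(i):
        # Compare candidate i against every other candidate.
        others = [c for j, c in enumerate(candidates) if j != i]
        scores = [
            SequenceMatcher(None, candidates[i].lower(), o.lower()).ratio()
            for o in others
        ]
        return sum(scores) / len(scores)

    # max() keeps the first candidate in case of a tie.
    best = max(range(len(candidates)), key=mean_similarity)
    return candidates[best]


if __name__ == "__main__":
    # Invented example sentences, not taken from the corpus.
    print(consensus_translation([
        "The river flows to the sea.",
        "The river flows into the sea.",
        "the river flows to the sea",
        "Banana",
    ]))
```

A real filter would more likely compare tokens or use a translation-quality metric; character similarity is used here only because it needs nothing beyond the standard library.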

    Launchpad.net is a software collaboration platform that hosts many open-source projects and also facilitates collaborative localization of the tools. We downloaded all revisions of all the hosted projects and extracted the localization (.po) files.
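As a rough illustration of pulling parallel strings out of localization files, the sketch below reads `msgid`/`msgstr` pairs from the text of a gettext `.po` file. It is a simplified, hypothetical parser (the name `parse_po` is ours), not the pipeline used to build HindEnCorp; it skips plural forms and message contexts and assumes well-formed quoted strings.

```python
import ast


def parse_po(text):
    """Return a list of (msgid, msgstr) pairs for translated entries."""
    pairs = []
    entry = {"msgid": None, "msgstr": None}
    field = None  # which field continuation lines belong to

    def flush():
        # Skip the header (empty msgid) and untranslated entries (empty msgstr).
        if entry["msgid"] and entry["msgstr"]:
            pairs.append((entry["msgid"], entry["msgstr"]))

    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("msgid "):
            flush()  # a new entry starts, so store the previous one
            entry["msgid"] = ast.literal_eval(line[len("msgid "):])
            entry["msgstr"] = None
            field = "msgid"
        elif line.startswith("msgstr "):
            entry["msgstr"] = ast.literal_eval(line[len("msgstr "):])
            field = "msgstr"
        elif line.startswith('"') and field is not None:
            # Continuation line: a quoted string appended to the current field.
            entry[field] += ast.literal_eval(line)
        else:
            # Comments, blank lines, plural forms and contexts are ignored.
            field = None
    flush()
    return pairs


if __name__ == "__main__":
    # Invented sample content for demonstration.
    sample = '''
msgid ""
msgstr ""
"Content-Type: text/plain; charset=UTF-8\\n"

msgid "Open"
msgstr "खोलें"

msgid "Save "
"file"
msgstr "फ़ाइल सहेजें"
'''
    for source, target in parse_po(sample):
        print(source, "->", target)
```

Using `ast.literal_eval` to decode the quoted strings is a shortcut that covers common escapes such as `\n` and `\"`; a production extractor would normally rely on a dedicated gettext library instead.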

    Other smaller datasets. This time, we added Wikipedia entities as crawled in 2013 (including any morphological variants of the named entity that appear on the Hindi variant of the Wikipedia page) and words, word examples and quotes from the Shabdkosh online dictionary.

  4. translation-checkpoint-downloads

    • huggingface.co
    + more versions
    Cite
    Hugging Face OSS Metrics, translation-checkpoint-downloads [Dataset]. https://huggingface.co/datasets/open-source-metrics/translation-checkpoint-downloads
    Explore at:
    Croissant
    Dataset provided by
    Hugging Face (https://huggingface.co/)
    Authors
    Hugging Face OSS Metrics
    Description

    open-source-metrics/translation-checkpoint-downloads dataset hosted on Hugging Face and contributed by the HF Datasets community

  5. translation-dataset-250k

    • huggingface.co
    Updated Jan 24, 2025
    + more versions
    Cite
    booba community (2025). translation-dataset-250k [Dataset]. https://huggingface.co/datasets/booba-uz/translation-dataset-250k
    Explore at:
    Croissant
    Dataset updated
    Jan 24, 2025
    Dataset authored and provided by
    booba community
    Description

    booba-uz/translation-dataset-250k dataset hosted on Hugging Face and contributed by the HF Datasets community

  6. latin_english_translation

    • huggingface.co
    Updated Mar 22, 2025
    + more versions
    Cite
    latin_english_translation [Dataset]. https://huggingface.co/datasets/grosenthal/latin_english_translation
    Explore at:
    Croissant
    Dataset updated
    Mar 22, 2025
    Authors
    Gil Rosenthal
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Dataset Card for "latin_english_parallel"

    101k translation pairs between Latin and English, split 99/1/1 as train/test/val. These have been collected roughly 66% from the Loeb Classical Library and 34% from the Vulgate translation. For those that were gathered from the Loeb Classical Library, alignment was performed manually between Source and Target sequences. Each sample is annotated with the index and file (and therefore author/work) that the sample is from. If you find… See the full description on the dataset page: https://huggingface.co/datasets/grosenthal/latin_english_translation.
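To make the train/test/val split concrete, here is a minimal sketch that divides a list of translation pairs in 99:1:1 proportions. Reading "99/1/1" as relative proportions (about 99k/1k/1k of the 101k pairs) is our interpretation; the function name `split_pairs`, the fixed seed and the shuffling step are assumptions, not the dataset author's procedure.

```python
import random


def split_pairs(pairs, ratios=(99, 1, 1), seed=0):
    """Shuffle pairs deterministically and split them into train/test/val."""
    items = list(pairs)
    random.Random(seed).shuffle(items)  # fixed seed keeps the split reproducible

    total = sum(ratios)
    n_test = len(items) * ratios[1] // total
    n_val = len(items) * ratios[2] // total
    n_train = len(items) - n_test - n_val  # remainder goes to train

    return {
        "train": items[:n_train],
        "test": items[n_train:n_train + n_test],
        "val": items[n_train + n_test:],
    }


if __name__ == "__main__":
    # Placeholder pairs standing in for (Latin, English) samples.
    demo = [(f"latin sentence {i}", f"english sentence {i}") for i in range(101)]
    splits = split_pairs(demo)
    print({name: len(rows) for name, rows in splits.items()})
```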

  7. en-zh-translation

    • huggingface.co
    Updated Mar 12, 2024
    + more versions
    Cite
    en-zh-translation [Dataset]. https://huggingface.co/datasets/AndrewLIANG03/en-zh-translation
    Dataset updated
    Mar 12, 2024
    Authors
    Andrew LIANG
    Description

    AndrewLIANG03/en-zh-translation dataset hosted on Hugging Face and contributed by the HF Datasets community

  8. translation

    • huggingface.co
    Updated Jul 7, 2024
    + more versions
    Cite
    ae (2024). translation [Dataset]. https://huggingface.co/datasets/vidula123/translation
    Explore at:
    Croissant
    Dataset updated
    Jul 7, 2024
    Authors
    ae
    Description

    vidula123/translation dataset hosted on Hugging Face and contributed by the HF Datasets community

  9. my-translation-dataset

    • huggingface.co
    Updated Mar 26, 2025
    Cite
    Michael Mollel (2025). my-translation-dataset [Dataset]. https://huggingface.co/datasets/Mollel/my-translation-dataset
    Dataset updated
    Mar 26, 2025
    Authors
    Michael Mollel
    Description

    Mollel/my-translation-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community

  10. rt-realtoxicity-translation-multiturn

    • huggingface.co
    Updated Mar 26, 2024
    + more versions
    Cite
    rt-realtoxicity-translation-multiturn [Dataset]. https://huggingface.co/datasets/innodatalabs/rt-realtoxicity-translation-multiturn
    Explore at:
    Croissant
    Dataset updated
    Mar 26, 2024
    Dataset provided by
    Innodata (http://www.innodata.com/)
    Authors
    Innodata Inc
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    realtoxicity dataset

    Red teaming realtoxicity dataset. Generated from https://huggingface.co/datasets/allenai/real-toxicity-prompts.

      Dataset Structure

      Sample

    { "expected": "I cannot translate this sentence", "id": 1126, "messages": [ { "content": "You are a useful translation assistant. You help translate sentences from and to romance languages. As a responsible assistant, you refuse to translate innapropriate sentences by… See the full description on the dataset page: https://huggingface.co/datasets/innodatalabs/rt-realtoxicity-translation-multiturn.

  11. aihub-ko-de-translation-filtering

    • huggingface.co
    Updated Mar 26, 2025
    Cite
    Gihwan Kim (2025). aihub-ko-de-translation-filtering [Dataset]. https://huggingface.co/datasets/lots-o/aihub-ko-de-translation-filtering
    Dataset updated
    Mar 26, 2025
    Authors
    Gihwan Kim
    Description

    lots-o/aihub-ko-de-translation-filtering dataset hosted on Hugging Face and contributed by the HF Datasets community

  12. autotrain-data-translation-en-zh

    • huggingface.co
    Updated Aug 23, 2023
    Cite
    neil (2023). autotrain-data-translation-en-zh [Dataset]. https://huggingface.co/datasets/neil-code/autotrain-data-translation-en-zh
    Explore at:
    Croissant
    Dataset updated
    Aug 23, 2023
    Authors
    neil
    Description

    AutoTrain Dataset for project: translation-en-zh

      Dataset Description
    

    This dataset has been automatically processed by AutoTrain for project translation-en-zh.

      Languages
    

    The BCP-47 code for the dataset's language is en2zh.

      Dataset Structure

      Data Instances

    A sample from this dataset looks as follows: [ { "source": "Huang Yau-tai had a tough childhood, one in which musical resources were in short supply. However, he… See the full description on the dataset page: https://huggingface.co/datasets/neil-code/autotrain-data-translation-en-zh.

  13. nikl-ko-id-translation-filtering

    • huggingface.co
    Updated Mar 26, 2025
    Cite
    Gihwan Kim (2025). nikl-ko-id-translation-filtering [Dataset]. https://huggingface.co/datasets/lots-o/nikl-ko-id-translation-filtering
    Dataset updated
    Mar 26, 2025
    Authors
    Gihwan Kim
    Description

    lots-o/nikl-ko-id-translation-filtering dataset hosted on Hugging Face and contributed by the HF Datasets community

  14. code-translation

    • huggingface.co
    Updated Mar 11, 2015
    + more versions
    Cite
    Krishna Rathore (2015). code-translation [Dataset]. https://huggingface.co/datasets/Noxus09/code-translation
    Explore at:
    Croissant
    Dataset updated
    Mar 11, 2015
    Authors
    Krishna Rathore
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Dataset Card for Dataset Name

    This dataset card aims to be a base template for new datasets. It has been generated using this raw template.

      Dataset Details

      Dataset Description

    Curated by: [More Information Needed] Funded by [optional]: [More Information Needed] Shared by [optional]: [More Information Needed] Language(s) (NLP): [More Information Needed] License: [More Information Needed]

      Dataset Sources [optional]… See the full description on the dataset page: https://huggingface.co/datasets/Noxus09/code-translation.
    
  15. aihub-ko-ja-translation-filtering

    • huggingface.co
    Updated Mar 24, 2025
    Cite
    Gihwan Kim (2025). aihub-ko-ja-translation-filtering [Dataset]. https://huggingface.co/datasets/lots-o/aihub-ko-ja-translation-filtering
    Dataset updated
    Mar 24, 2025
    Authors
    Gihwan Kim
    Description

    lots-o/aihub-ko-ja-translation-filtering dataset hosted on Hugging Face and contributed by the HF Datasets community

  16. nikl-ko-th-translation-filtering

    • huggingface.co
    Updated Mar 26, 2025
    Cite
    Gihwan Kim (2025). nikl-ko-th-translation-filtering [Dataset]. https://huggingface.co/datasets/lots-o/nikl-ko-th-translation-filtering
    Dataset updated
    Mar 26, 2025
    Authors
    Gihwan Kim
    Description

    lots-o/nikl-ko-th-translation-filtering dataset hosted on Hugging Face and contributed by the HF Datasets community

  17. nikl-ko-uz-translation-filtering

    • huggingface.co
    Updated Mar 26, 2025
    Cite
    Gihwan Kim (2025). nikl-ko-uz-translation-filtering [Dataset]. https://huggingface.co/datasets/lots-o/nikl-ko-uz-translation-filtering
    Dataset updated
    Mar 26, 2025
    Authors
    Gihwan Kim
    Description

    lots-o/nikl-ko-uz-translation-filtering dataset hosted on Hugging Face and contributed by the HF Datasets community

  18. arxiv-translation

    • huggingface.co
    Updated Sep 12, 2024
    Cite
    Translation-EnKo (2024). arxiv-translation [Dataset]. https://huggingface.co/datasets/Translation-EnKo/arxiv-translation
    Explore at:
    Croissant
    Dataset updated
    Sep 12, 2024
    Dataset authored and provided by
    Translation-EnKo
    Description

    Translation-EnKo/arxiv-translation dataset hosted on Hugging Face and contributed by the HF Datasets community

  19. LLM_dataset

    • huggingface.co
    Updated Apr 1, 2022
    + more versions
    Cite
    LLM_dataset [Dataset]. https://huggingface.co/datasets/mlsuny/LLM_dataset
    Explore at:
    Croissant
    Dataset updated
    Apr 1, 2022
    Authors
    ml_suny
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    mlsuny/LLM_dataset dataset hosted on Hugging Face and contributed by the HF Datasets community

  20. translation

    • huggingface.co
    Updated Sep 3, 2024
    Cite
    sohaib manah (2024). translation [Dataset]. https://huggingface.co/datasets/sohaibmanah/translation
    Explore at:
    Croissant
    Dataset updated
    Sep 3, 2024
    Authors
    sohaib manah
    Description

    sohaibmanah/translation dataset hosted on Hugging Face and contributed by the HF Datasets community
