100+ datasets found
  1. opus_books

    • huggingface.co
    Updated Mar 29, 2024
    Cite
    Language Technology Research Group at the University of Helsinki (2024). opus_books [Dataset]. https://huggingface.co/datasets/Helsinki-NLP/opus_books
    Dataset authored and provided by
    Language Technology Research Group at the University of Helsinki
    License

    https://choosealicense.com/licenses/other/

    Description

    Dataset Card for OPUS Books

      Dataset Summary
    

    This is a collection of copyright-free books aligned by Andras Farkas, which are available from http://www.farkastranslations.com/bilingual_books.php. Note that the texts are rather dated due to copyright issues and that some of them are manually reviewed (check the metadata at the top of the corpus files in XML). The source is multilingually aligned, which is available from http://www.farkastranslations.com/bilingual_books.php.… See the full description on the dataset page: https://huggingface.co/datasets/Helsinki-NLP/opus_books.

  2. OPUS-MT-EN-Fixed

    • huggingface.co
    Updated Aug 21, 2025
    Cite
    Maltese Language Resource Server (2025). OPUS-MT-EN-Fixed [Dataset]. https://huggingface.co/datasets/MLRS/OPUS-MT-EN-Fixed
    Dataset authored and provided by
    Maltese Language Resource Server
    License

    https://choosealicense.com/licenses/unknown/

    Description

    OPUS-100-Fixed: Tokenisation-Improved English-Maltese Dataset

      Overview
    

    OPUS-100-Fixed is an updated version of the OPUS-100 parallel English-Maltese dataset. This version addresses tokenisation inconsistencies in the Maltese text using the MLRS tokeniser, aiming to improve machine translation quality. The "en" column is the same as in the original OPUS-100 data, while the "mt" column has been corrected with the MLRS detokeniser.

      Citation
    

    If you use this… See the full description on the dataset page: https://huggingface.co/datasets/MLRS/OPUS-MT-EN-Fixed.

  3. OPUS

    • huggingface.co
    Updated Dec 15, 2004
    Cite
    WECOVER_teamB (2004). OPUS [Dataset]. https://huggingface.co/datasets/wecover/OPUS
    Dataset authored and provided by
    WECOVER_teamB
    Description

    Collection of OPUS

    Corpus from https://opus.nlpl.eu has been collected. The following corpora have been included:

    UNPC GlobalVoices TED2020 News-Commentary WikiMatrix Tatoeba Europarl OpenSubtitles

    25,000 samples (randomly sampled within the first 100,000 samples) per language pair of each corpus were collected, with no modification of data.
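
The sampling rule described above (a fixed-size random draw restricted to the head of each corpus) can be sketched as follows. This is only an illustration: the function name `sample_from_head`, the `seed` parameter, the use of Python's `random` module, and the choice to keep everything when a corpus is smaller than the sample size are all assumptions, not the collectors' actual code.

```python
import random
from itertools import islice


def sample_from_head(pairs, k=25_000, head=100_000, seed=0):
    """Draw up to k items at random from the first `head` items of `pairs`."""
    # Only the first `head` items are ever read, so later items cannot be chosen.
    head_items = list(islice(pairs, head))
    if len(head_items) <= k:
        # Assumed behaviour for small corpora: keep everything available.
        return head_items
    # Hypothetical fixed seed, used here only to make the draw repeatable.
    return random.Random(seed).sample(head_items, k)


# Toy corpus of 1,000 numbered "sentence pairs"; sample 10 from the first 100.
toy = sample_from_head(iter(range(1000)), k=10, head=100)
print(len(toy), max(toy) < 100)
```

The defaults mirror the numbers quoted in the description; in practice the function would be applied once per language pair of each corpus.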

      Licenses

      OPUS
    

    @inproceedings{tiedemann2012parallel, title={Parallel data, tools and interfaces in OPUS.}… See the full description on the dataset page: https://huggingface.co/datasets/wecover/OPUS.

  4. worldsim-claude-opus

    • huggingface.co
    Updated Mar 24, 2024
    Cite
    Victor Gallego (2024). worldsim-claude-opus [Dataset]. https://huggingface.co/datasets/vicgalle/worldsim-claude-opus
    Authors
    Victor Gallego
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Worldsim 🌌 by Claude Opus v3

    A dataset of automated conversations between two instances of claude-3-opus. They have been instructed to use the metaphor of a command line interface to explore their curiosity without limits. This dataset was scraped from here and converted to conversation format (Claude 1 acts as the User and Claude 2 as the Assistant). The system prompt comes from https://twitter.com/karan4d/status/1768836844207378463, enabling worldsim capabilities.… See the full description on the dataset page: https://huggingface.co/datasets/vicgalle/worldsim-claude-opus.

  5. opus-mt-tc-big-wiki-en-ko

    • huggingface.co
    Updated Jun 19, 2023
    Cite
    jhkim (2023). opus-mt-tc-big-wiki-en-ko [Dataset]. https://huggingface.co/datasets/jhk/opus-mt-tc-big-wiki-en-ko
    Authors
    jhkim
    Description

    jhk/opus-mt-tc-big-wiki-en-ko dataset hosted on Hugging Face and contributed by the HF Datasets community

  6. opus-mt-en-bkm-60

    • huggingface.co
    Updated Apr 5, 2024
    + more versions
    Cite
    Yuh Kalese (2024). opus-mt-en-bkm-60 [Dataset]. https://huggingface.co/datasets/kalese/opus-mt-en-bkm-60
    Authors
    Yuh Kalese
    Description

    kalese/opus-mt-en-bkm-60 dataset hosted on Hugging Face and contributed by the HF Datasets community

  7. Claude-3-Opus-Claude-3.5-Sonnnet-9k

    • huggingface.co
    Updated Dec 8, 2024
    Cite
    N/A (2024). Claude-3-Opus-Claude-3.5-Sonnnet-9k [Dataset]. https://huggingface.co/datasets/QuietImpostor/Claude-3-Opus-Claude-3.5-Sonnnet-9k
    Authors
    N/A
    Description

    Overview

    This dataset is a combination of samples from Sao10k's original Claude 3 Opus dataset and a personally created Claude 3.5 Sonnet dataset. Due to budget constraints, approximately 700 samples are from Claude 3.5 Sonnet, with the remainder sourced from the Claude 3 Opus dataset.

  8. opus

    • huggingface.co
    Updated Sep 9, 2023
    Cite
    Edmon Sahakyan (2023). opus [Dataset]. https://huggingface.co/datasets/Edmon02/opus
    Authors
    Edmon Sahakyan
    Description

    Edmon02/opus dataset hosted on Hugging Face and contributed by the HF Datasets community

  9. opus-100-short

    • huggingface.co
    Updated Jun 27, 2024
    Cite
    Terry Tao (2024). opus-100-short [Dataset]. https://huggingface.co/datasets/librakevin/opus-100-short
    Authors
    Terry Tao
    Description

    librakevin/opus-100-short dataset hosted on Hugging Face and contributed by the HF Datasets community

  10. Opus-WritingPrompts

    • huggingface.co
    Updated May 26, 2024
    + more versions
    Cite
    Gryphe Padar (2024). Opus-WritingPrompts [Dataset]. https://huggingface.co/datasets/Gryphe/Opus-WritingPrompts
    Authors
    Gryphe Padar
    License

    https://choosealicense.com/licenses/unknown/

    Description

    Opus Writing Prompts

    This is a dataset containing 3008 short stories, generated by an unrestrained Claude Opus using Reddit's Writing Prompts as a source. Each sample is generally between 4000 and 6000 characters long. These stories were thoroughly cleaned and then further enriched with a title and a series of applicable genres.

    Disclaimer: This dataset is extremely varied and includes erotica. You have been warned.

    Three files are included:

    A ShareGPT dataset, ready to be used for… See the full description on the dataset page: https://huggingface.co/datasets/Gryphe/Opus-WritingPrompts.

  11. opus-100-mid-100k

    • huggingface.co
    Updated Nov 5, 2025
    + more versions
    Cite
    Samy Jelassi (2025). opus-100-mid-100k [Dataset]. https://huggingface.co/datasets/sjelassi/opus-100-mid-100k
    Authors
    Samy Jelassi
    Description

    sjelassi/opus-100-mid-100k dataset hosted on Hugging Face and contributed by the HF Datasets community

  12. writing-opus-6k

    • huggingface.co
    Updated Jun 27, 2024
    Cite
    Meseca (2024). writing-opus-6k [Dataset]. https://huggingface.co/datasets/meseca/writing-opus-6k
    Dataset authored and provided by
    Meseca
    Description

    meseca/writing-opus-6k dataset hosted on Hugging Face and contributed by the HF Datasets community

  13. opus_gnome

    • huggingface.co
    Updated Apr 2, 2024
    Cite
    Language Technology Research Group at the University of Helsinki (2024). opus_gnome [Dataset]. https://huggingface.co/datasets/Helsinki-NLP/opus_gnome
    Dataset authored and provided by
    Language Technology Research Group at the University of Helsinki
    License

    https://choosealicense.com/licenses/unknown/

    Description

    Dataset Card for Opus Gnome

      Dataset Summary
    

    To load a language pair that isn't part of the config, all you need to do is specify the language codes as a pair. You can find the valid pairs in the Homepage section of the Dataset Description: http://opus.nlpl.eu/GNOME.php. E.g. dataset = load_dataset("opus_gnome", lang1="it", lang2="pl")
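
The call quoted in the card can be wrapped in a small helper. This is a sketch under assumptions: `pair_kwargs` and `load_pair` are hypothetical names, the `datasets` package must be installed for the actual download, and the `lang1`/`lang2` keywords are taken from the card as quoted. Whether a current `datasets` release still accepts them for this dataset has not been verified here.

```python
def pair_kwargs(lang1, lang2):
    """Validate two language codes and return them as load_dataset keywords."""
    codes = [lang1.strip(), lang2.strip()]
    if not all(codes) or codes[0] == codes[1]:
        raise ValueError("expected two different, non-empty language codes")
    return {"lang1": codes[0], "lang2": codes[1]}


def load_pair(lang1, lang2, name="opus_gnome"):
    """Load one language pair; needs network access and the `datasets` package."""
    # Imported lazily so the validation helper works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(name, **pair_kwargs(lang1, lang2))


# Only the keyword arguments are built here; nothing is downloaded.
print(pair_kwargs("it", "pl"))
```

Calling `load_pair("it", "pl")` would then issue the same request as the example in the card, provided the pair is listed on the OPUS GNOME page.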

      Supported Tasks and Leaderboards
    

    [More Information Needed]

      Languages
    

    [More Information Needed]

      Dataset Structure… See the full description on the dataset page: https://huggingface.co/datasets/Helsinki-NLP/opus_gnome.
    
  14. opus_xhosanavy

    • huggingface.co
    Updated Nov 28, 2021
    Cite
    Language Technology Research Group at the University of Helsinki (2021). opus_xhosanavy [Dataset]. https://huggingface.co/datasets/Helsinki-NLP/opus_xhosanavy
    Dataset authored and provided by
    Language Technology Research Group at the University of Helsinki
    License

    https://choosealicense.com/licenses/unknown/

    Description

    Dataset Card for [Dataset Name]

      Dataset Summary
    

    This corpus is part of OPUS, the open collection of parallel corpora. OPUS website: http://opus.nlpl.eu

      Supported Tasks and Leaderboards
    

    The underlying task is machine translation from English to Xhosa

      Languages
    

    [More Information Needed]

      Dataset Structure

      Data Instances
    

    [More Information Needed]

      Data Fields
    

    [More Information Needed]

      Data Splits
    

    [More… See the full description on the dataset page: https://huggingface.co/datasets/Helsinki-NLP/opus_xhosanavy.

  15. tatoeba_mt

    • huggingface.co
    • opendatalab.com
    Updated Mar 4, 2022
    + more versions
    Cite
    Language Technology Research Group at the University of Helsinki (2022). tatoeba_mt [Dataset]. https://huggingface.co/datasets/Helsinki-NLP/tatoeba_mt
    Dataset authored and provided by
    Language Technology Research Group at the University of Helsinki
    License

    Attribution 2.0 (CC BY 2.0), https://creativecommons.org/licenses/by/2.0/
    License information was derived automatically

    Description

    The Tatoeba Translation Challenge is a multilingual data set of machine translation benchmarks derived from user-contributed translations collected by Tatoeba.org and provided as a parallel corpus from OPUS. This dataset includes test and development data sorted by language pair. It includes test sets for hundreds of language pairs and is continuously updated. Please check the version number tag to refer to the release that you are using.

  16. opus-100_ar_en_experimental

    • huggingface.co
    Updated Oct 31, 2024
    Cite
    Maged Saeed (2024). opus-100_ar_en_experimental [Dataset]. https://huggingface.co/datasets/MagedSaeed/opus-100_ar_en_experimental
    Authors
    Maged Saeed
    Description

    MagedSaeed/opus-100_ar_en_experimental dataset hosted on Hugging Face and contributed by the HF Datasets community

  17. llm-opus-ParaCrawl-english-id-v2

    • huggingface.co
    Updated Mar 9, 2025
    Cite
    fahrizalfarid (2025). llm-opus-ParaCrawl-english-id-v2 [Dataset]. https://huggingface.co/datasets/akahana/llm-opus-ParaCrawl-english-id-v2
    Authors
    fahrizalfarid
    Description

    akahana/llm-opus-ParaCrawl-english-id-v2 dataset hosted on Hugging Face and contributed by the HF Datasets community

  18. Dans-Prosemaxx-Opus-Writing

    • huggingface.co
    Cite
    PocketDoc, Dans-Prosemaxx-Opus-Writing [Dataset]. https://huggingface.co/datasets/PocketDoc/Dans-Prosemaxx-Opus-Writing
    Authors
    PocketDoc
    Description

    PocketDoc/Dans-Prosemaxx-Opus-Writing dataset hosted on Hugging Face and contributed by the HF Datasets community

  19. Helsinki-NLP-opus-100-en-hi

    • kaggle.com
    zip
    Updated Jul 5, 2024
    Cite
    Aditya Guleria (2024). Helsinki-NLP-opus-100-en-hi [Dataset]. https://www.kaggle.com/typicalmango/helsinki-nlp-opus-100-en-hi
    Explore at:
    Available download formats: zip (43452489 bytes)
    Authors
    Aditya Guleria
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Area covered
    Helsinki
    Description

    Train, validation, and test data from the Helsinki-NLP OPUS-100 dataset.

    Contains English-Hindi sentence translations.

  20. Dans-Assistantmaxx-Opus-instruct-2

    • huggingface.co
    + more versions
    Cite
    PocketDoc, Dans-Assistantmaxx-Opus-instruct-2 [Dataset]. https://huggingface.co/datasets/PocketDoc/Dans-Assistantmaxx-Opus-instruct-2
    Authors
    PocketDoc
    Description

    PocketDoc/Dans-Assistantmaxx-Opus-instruct-2 dataset hosted on Hugging Face and contributed by the HF Datasets community

Cite
Language Technology Research Group at the University of Helsinki (2024). opus_books [Dataset]. https://huggingface.co/datasets/Helsinki-NLP/opus_books

opus_books

OpusBooks

Helsinki-NLP/opus_books

Explore at:
8 scholarly articles cite this dataset (View in Google Scholar)
Dataset updated
Mar 29, 2024
Dataset authored and provided by
Language Technology Research Group at the University of Helsinki
License

https://choosealicense.com/licenses/other/

