100+ datasets found

h
opus_books
huggingface.co
Updated Mar 29, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Language Technology Research Group at the University of Helsinki (2024). opus_books [Dataset]. https://huggingface.co/datasets/Helsinki-NLP/opus_books
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Mar 29, 2024
Dataset authored and provided by
Language Technology Research Group at the University of Helsinki
License
https://choosealicense.com/licenses/other/https://choosealicense.com/licenses/other/
Description
Dataset Card for OPUS Books

Dataset Summary

This is a collection of copyright free books aligned by Andras Farkas, which are available from http://www.farkastranslations.com/bilingual_books.php Note that the texts are rather dated due to copyright issues and that some of them are manually reviewed (check the meta-data at the top of the corpus files in XML). The source is multilingually aligned, which is available from http://www.farkastranslations.com/bilingual_books.php.… See the full description on the dataset page: https://huggingface.co/datasets/Helsinki-NLP/opus_books.
h
OPUS-MT-EN-Fixed
huggingface.co
Updated Aug 21, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Maltese Language Resource Server (2025). OPUS-MT-EN-Fixed [Dataset]. https://huggingface.co/datasets/MLRS/OPUS-MT-EN-Fixed
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Aug 21, 2025
Dataset authored and provided by
Maltese Language Resource Server
License
https://choosealicense.com/licenses/unknown/https://choosealicense.com/licenses/unknown/
Description
OPUS-100-Fixed: Tokenisation-Improved English-Maltese Dataset

Overview

OPUS-100-Fixed is an updated version of the OPUS-100 parallel English-Maltese dataset. This version addresses tokenisation inconsistencies in the Maltese text using the MLRS tokeniser, aiming to improve machine translation quality. The "en" column is the same as in the original OPUS-100 data, while the "mt" column has been corrected with the MLRS detokeniser.

Citation

If you use this… See the full description on the dataset page: https://huggingface.co/datasets/MLRS/OPUS-MT-EN-Fixed.
h
OPUS
huggingface.co
Updated Dec 15, 2004
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
WECOVER_teamB (2004). OPUS [Dataset]. https://huggingface.co/datasets/wecover/OPUS
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Dec 15, 2004
Dataset authored and provided by
WECOVER_teamB
Description
Collection of OPUS

Corpus from https://opus.nlpl.eu has been collected. The following corpora have been included:

UNPC GlobalVoices TED2020 News-Commentary WikiMatrix Tatoeba Europarl OpenSubtitles

25,000 samples (randomly sampled within the first 100,000 samples) per language pair of each corpus were collected, with no modification of data.

Licenses OPUS

@inproceedings{tiedemann2012parallel, title={Parallel data, tools and interfaces in OPUS.}… See the full description on the dataset page: https://huggingface.co/datasets/wecover/OPUS.
h
worldsim-claude-opus
huggingface.co
Updated Mar 24, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Victor Gallego (2024). worldsim-claude-opus [Dataset]. https://huggingface.co/datasets/vicgalle/worldsim-claude-opus
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Mar 24, 2024
Authors
Victor Gallego
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Worldsim 🌌 by Claude Opus v3

A dataset of automated conversations between two instances of claude-3-opus. They have been instructed to use the metaphor of a command line interface to explore its curiosity without limits. This dataset was scraped from here and converted to conversation format (Claude 1 acts as the User and Claude 2 as the Assistant). The system prompt comes from https://twitter.com/karan4d/status/1768836844207378463, enabling worldsim capabilities.… See the full description on the dataset page: https://huggingface.co/datasets/vicgalle/worldsim-claude-opus.
h
opus-mt-tc-big-wiki-en-ko
huggingface.co
Updated Jun 19, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
jhkim (2023). opus-mt-tc-big-wiki-en-ko [Dataset]. https://huggingface.co/datasets/jhk/opus-mt-tc-big-wiki-en-ko
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jun 19, 2023
Authors
jhkim
Description
jhk/opus-mt-tc-big-wiki-en-ko dataset hosted on Hugging Face and contributed by the HF Datasets community
h
opus-mt-en-bkm-60
huggingface.co
Updated Apr 5, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Yuh Kalese (2024). opus-mt-en-bkm-60 [Dataset]. https://huggingface.co/datasets/kalese/opus-mt-en-bkm-60
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Apr 5, 2024
Authors
Yuh Kalese
Description
kalese/opus-mt-en-bkm-60 dataset hosted on Hugging Face and contributed by the HF Datasets community
h
Claude-3-Opus-Claude-3.5-Sonnnet-9k
huggingface.co
Updated Dec 8, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
N/A (2024). Claude-3-Opus-Claude-3.5-Sonnnet-9k [Dataset]. https://huggingface.co/datasets/QuietImpostor/Claude-3-Opus-Claude-3.5-Sonnnet-9k
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Dec 8, 2024
Authors
N/A
Description
Overview

This dataset is a combination of samples from Sao10k's original Claude 3 Opus dataset and a personally created Claude 3.5 Sonnet dataset. Due to budget constraints, approximately 700 samples are from Claude 3.5 Sonnet, with the remainder sourced from the Claude 3 Opus dataset.
h
opus
huggingface.co
Updated Sep 9, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Edmon Sahakyan (2023). opus [Dataset]. https://huggingface.co/datasets/Edmon02/opus
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Sep 9, 2023
Authors
Edmon Sahakyan
Description
Edmon02/opus dataset hosted on Hugging Face and contributed by the HF Datasets community
h
opus-100-short
huggingface.co
Updated Jun 27, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Terry Tao (2024). opus-100-short [Dataset]. https://huggingface.co/datasets/librakevin/opus-100-short
Explore at:
Dataset updated
Jun 27, 2024
Authors
Terry Tao
Description
librakevin/opus-100-short dataset hosted on Hugging Face and contributed by the HF Datasets community
h
Opus-WritingPrompts
huggingface.co
Updated May 26, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Gryphe Padar (2024). Opus-WritingPrompts [Dataset]. https://huggingface.co/datasets/Gryphe/Opus-WritingPrompts
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
May 26, 2024
Authors
Gryphe Padar
License
https://choosealicense.com/licenses/unknown/https://choosealicense.com/licenses/unknown/
Description
Opus Writing Prompts

This is a dataset containing 3008 short stories, generated by an unrestrained Claude Opus using Reddit's Writing Prompts as a source. Each sample is generally between 4000-6000 characters long. These stories were thoroughly cleaned and then further enriched with a title and a series of applicable genres.
Disclaimer: This dataset is extremely varied and includes erotica. You have been warned. Three files are included:

A ShareGPT dataset, ready to be used for… See the full description on the dataset page: https://huggingface.co/datasets/Gryphe/Opus-WritingPrompts.
h
opus-100-mid-100k
huggingface.co
Updated Nov 5, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Samy Jelassi (2025). opus-100-mid-100k [Dataset]. https://huggingface.co/datasets/sjelassi/opus-100-mid-100k
Explore at:
Dataset updated
Nov 5, 2025
Authors
Samy Jelassi
Description
sjelassi/opus-100-mid-100k dataset hosted on Hugging Face and contributed by the HF Datasets community
h
writing-opus-6k
huggingface.co
Updated Jun 27, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Meseca (2024). writing-opus-6k [Dataset]. https://huggingface.co/datasets/meseca/writing-opus-6k
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jun 27, 2024
Dataset authored and provided by
Meseca
Description
meseca/writing-opus-6k dataset hosted on Hugging Face and contributed by the HF Datasets community
h
opus_gnome
huggingface.co
Updated Apr 2, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Language Technology Research Group at the University of Helsinki (2024). opus_gnome [Dataset]. https://huggingface.co/datasets/Helsinki-NLP/opus_gnome
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Apr 2, 2024
Dataset authored and provided by
Language Technology Research Group at the University of Helsinki
License
https://choosealicense.com/licenses/unknown/https://choosealicense.com/licenses/unknown/
Description
Dataset Card for Opus Gnome

Dataset Summary

To load a language pair which isn't part of the config, all you need to do is specify the language code as pairs. You can find the valid pairs in Homepage section of Dataset Description: http://opus.nlpl.eu/GNOME.php E.g. dataset = load_dataset("opus_gnome", lang1="it", lang2="pl")

Supported Tasks and Leaderboards

[More Information Needed]

Languages

[More Information Needed]

Dataset Structure… See the full description on the dataset page: https://huggingface.co/datasets/Helsinki-NLP/opus_gnome.
h
opus_xhosanavy
huggingface.co
Updated Nov 28, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Language Technology Research Group at the University of Helsinki (2021). opus_xhosanavy [Dataset]. https://huggingface.co/datasets/Helsinki-NLP/opus_xhosanavy
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Nov 28, 2021
Dataset authored and provided by
Language Technology Research Group at the University of Helsinki
License
https://choosealicense.com/licenses/unknown/https://choosealicense.com/licenses/unknown/
Description
Dataset Card for [Dataset Name]

Dataset Summary

This corpus is part of OPUS - the open collection of parallel corpora OPUS Website: http://opus.nlpl.eu

Supported Tasks and Leaderboards

The underlying task is machine translation from English to Xhosa

Languages

[More Information Needed]

Dataset Structure Data Instances

[More Information Needed]

Data Fields

[More Information Needed]

Data Splits

[More… See the full description on the dataset page: https://huggingface.co/datasets/Helsinki-NLP/opus_xhosanavy.
h
tatoeba_mt
huggingface.co
opendatalab.com
Updated Mar 4, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Language Technology Research Group at the University of Helsinki (2022). tatoeba_mt [Dataset]. https://huggingface.co/datasets/Helsinki-NLP/tatoeba_mt
Explore at:
Dataset updated
Mar 4, 2022
Dataset authored and provided by
Language Technology Research Group at the University of Helsinki
License
Attribution 2.0 (CC BY 2.0)https://creativecommons.org/licenses/by/2.0/
License information was derived automatically
Description
The Tatoeba Translation Challenge is a multilingual data set of machine translation benchmarks derived from user-contributed translations collected by Tatoeba.org and provided as parallel corpus from OPUS. This dataset includes test and development data sorted by language pair. It includes test sets for hundreds of language pairs and is continuously updated. Please, check the version number tag to refer to the release that your are using.
h
opus-100_ar_en_experimental
huggingface.co
Updated Oct 31, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Maged Saeed (2024). opus-100_ar_en_experimental [Dataset]. https://huggingface.co/datasets/MagedSaeed/opus-100_ar_en_experimental
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Oct 31, 2024
Authors
Maged Saeed
Description
MagedSaeed/opus-100_ar_en_experimental dataset hosted on Hugging Face and contributed by the HF Datasets community
h
llm-opus-ParaCrawl-english-id-v2
huggingface.co
Updated Mar 9, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
fahrizalfarid (2025). llm-opus-ParaCrawl-english-id-v2 [Dataset]. https://huggingface.co/datasets/akahana/llm-opus-ParaCrawl-english-id-v2
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Mar 9, 2025
Authors
fahrizalfarid
Description
akahana/llm-opus-ParaCrawl-english-id-v2 dataset hosted on Hugging Face and contributed by the HF Datasets community
h
Dans-Prosemaxx-Opus-Writing
huggingface.co
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
PocketDoc, Dans-Prosemaxx-Opus-Writing [Dataset]. https://huggingface.co/datasets/PocketDoc/Dans-Prosemaxx-Opus-Writing
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Authors
PocketDoc
Description
PocketDoc/Dans-Prosemaxx-Opus-Writing dataset hosted on Hugging Face and contributed by the HF Datasets community
Helsinki-NLP-opus-100-en-hi
kaggle.com
zip
Updated Jul 5, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Aditya Guleria (2024). Helsinki-NLP-opus-100-en-hi [Dataset]. https://www.kaggle.com/typicalmango/helsinki-nlp-opus-100-en-hi
Explore at:
zip(43452489 bytes)Available download formats
Dataset updated
Jul 5, 2024
Authors
Aditya Guleria
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Area covered
Helsinki
Description
train, validation and test data from the Helsinki NLP opus 100

Contains English-Hindi sentence translations.
h
Dans-Assistantmaxx-Opus-instruct-2
huggingface.co
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
PocketDoc, Dans-Assistantmaxx-Opus-instruct-2 [Dataset]. https://huggingface.co/datasets/PocketDoc/Dans-Assistantmaxx-Opus-instruct-2
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Authors
PocketDoc
Description
PocketDoc/Dans-Assistantmaxx-Opus-instruct-2 dataset hosted on Hugging Face and contributed by the HF Datasets community

Facebook

Twitter

Click to copy link

Link copied

Cite

Language Technology Research Group at the University of Helsinki (2024). opus_books [Dataset]. https://huggingface.co/datasets/Helsinki-NLP/opus_books

opus_books

OpusBooks

Helsinki-NLP/opus_books

Explore at:

8 scholarly articles cite this dataset (View in Google Scholar)

CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.

Dataset updated

Mar 29, 2024

Dataset authored and provided by

Language Technology Research Group at the University of Helsinki

License

https://choosealicense.com/licenses/other/https://choosealicense.com/licenses/other/

Description

Dataset Card for OPUS Books

  Dataset Summary

This is a collection of copyright free books aligned by Andras Farkas, which are available from http://www.farkastranslations.com/bilingual_books.php Note that the texts are rather dated due to copyright issues and that some of them are manually reviewed (check the meta-data at the top of the corpus files in XML). The source is multilingually aligned, which is available from http://www.farkastranslations.com/bilingual_books.php.… See the full description on the dataset page: https://huggingface.co/datasets/Helsinki-NLP/opus_books.

Clear search

Close search

Google apps

Main menu

opus_books

OPUS-MT-EN-Fixed

OPUS

worldsim-claude-opus

opus-mt-tc-big-wiki-en-ko

opus-mt-en-bkm-60

Claude-3-Opus-Claude-3.5-Sonnnet-9k

opus

opus-100-short

Opus-WritingPrompts

opus-100-mid-100k

writing-opus-6k

opus_gnome

opus_xhosanavy

tatoeba_mt

opus-100_ar_en_experimental

llm-opus-ParaCrawl-english-id-v2

Dans-Prosemaxx-Opus-Writing

Helsinki-NLP-opus-100-en-hi

Dans-Assistantmaxx-Opus-instruct-2

opus_books

OpusBooks

Helsinki-NLP/opus_books