Facebook
Twitterhttps://choosealicense.com/licenses/other/https://choosealicense.com/licenses/other/
Dataset Card for OPUS Books
Dataset Summary
This is a collection of copyright free books aligned by Andras Farkas, which are available from http://www.farkastranslations.com/bilingual_books.php Note that the texts are rather dated due to copyright issues and that some of them are manually reviewed (check the meta-data at the top of the corpus files in XML). The source is multilingually aligned, which is available from http://www.farkastranslations.com/bilingual_books.php.… See the full description on the dataset page: https://huggingface.co/datasets/Helsinki-NLP/opus_books.
Facebook
Twitterhttps://choosealicense.com/licenses/unknown/https://choosealicense.com/licenses/unknown/
OPUS-100-Fixed: Tokenisation-Improved English-Maltese Dataset
Overview
OPUS-100-Fixed is an updated version of the OPUS-100 parallel English-Maltese dataset. This version addresses tokenisation inconsistencies in the Maltese text using the MLRS tokeniser, aiming to improve machine translation quality. The "en" column is the same as in the original OPUS-100 data, while the "mt" column has been corrected with the MLRS detokeniser.
Citation
If you use this… See the full description on the dataset page: https://huggingface.co/datasets/MLRS/OPUS-MT-EN-Fixed.
Facebook
TwitterCollection of OPUS
Corpus from https://opus.nlpl.eu has been collected. The following corpora have been included:
UNPC GlobalVoices TED2020 News-Commentary WikiMatrix Tatoeba Europarl OpenSubtitles
25,000 samples (randomly sampled within the first 100,000 samples) per language pair of each corpus were collected, with no modification of data.
Licenses
OPUS
@inproceedings{tiedemann2012parallel, title={Parallel data, tools and interfaces in OPUS.}… See the full description on the dataset page: https://huggingface.co/datasets/wecover/OPUS.
Facebook
TwitterApache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Worldsim 🌌 by Claude Opus v3
A dataset of automated conversations between two instances of claude-3-opus. They have been instructed to use the metaphor of a command line interface to explore its curiosity without limits. This dataset was scraped from here and converted to conversation format (Claude 1 acts as the User and Claude 2 as the Assistant). The system prompt comes from https://twitter.com/karan4d/status/1768836844207378463, enabling worldsim capabilities.… See the full description on the dataset page: https://huggingface.co/datasets/vicgalle/worldsim-claude-opus.
Facebook
Twitterjhk/opus-mt-tc-big-wiki-en-ko dataset hosted on Hugging Face and contributed by the HF Datasets community
Facebook
Twitterkalese/opus-mt-en-bkm-60 dataset hosted on Hugging Face and contributed by the HF Datasets community
Facebook
TwitterOverview
This dataset is a combination of samples from Sao10k's original Claude 3 Opus dataset and a personally created Claude 3.5 Sonnet dataset. Due to budget constraints, approximately 700 samples are from Claude 3.5 Sonnet, with the remainder sourced from the Claude 3 Opus dataset.
Facebook
TwitterEdmon02/opus dataset hosted on Hugging Face and contributed by the HF Datasets community
Facebook
Twitterlibrakevin/opus-100-short dataset hosted on Hugging Face and contributed by the HF Datasets community
Facebook
Twitterhttps://choosealicense.com/licenses/unknown/https://choosealicense.com/licenses/unknown/
Opus Writing Prompts
This is a dataset containing 3008 short stories, generated by an unrestrained Claude Opus using Reddit's Writing Prompts as a source. Each sample is generally between 4000-6000 characters long.
These stories were thoroughly cleaned and then further enriched with a title and a series of applicable genres.
Disclaimer: This dataset is extremely varied and includes erotica. You have been warned.
Three files are included:
A ShareGPT dataset, ready to be used for… See the full description on the dataset page: https://huggingface.co/datasets/Gryphe/Opus-WritingPrompts.
Facebook
Twittersjelassi/opus-100-mid-100k dataset hosted on Hugging Face and contributed by the HF Datasets community
Facebook
Twittermeseca/writing-opus-6k dataset hosted on Hugging Face and contributed by the HF Datasets community
Facebook
Twitterhttps://choosealicense.com/licenses/unknown/https://choosealicense.com/licenses/unknown/
Dataset Card for Opus Gnome
Dataset Summary
To load a language pair which isn't part of the config, all you need to do is specify the language code as pairs. You can find the valid pairs in Homepage section of Dataset Description: http://opus.nlpl.eu/GNOME.php E.g. dataset = load_dataset("opus_gnome", lang1="it", lang2="pl")
Supported Tasks and Leaderboards
[More Information Needed]
Languages
[More Information Needed]
Dataset Structure… See the full description on the dataset page: https://huggingface.co/datasets/Helsinki-NLP/opus_gnome.
Facebook
Twitterhttps://choosealicense.com/licenses/unknown/https://choosealicense.com/licenses/unknown/
Dataset Card for [Dataset Name]
Dataset Summary
This corpus is part of OPUS - the open collection of parallel corpora OPUS Website: http://opus.nlpl.eu
Supported Tasks and Leaderboards
The underlying task is machine translation from English to Xhosa
Languages
[More Information Needed]
Dataset Structure
Data Instances
[More Information Needed]
Data Fields
[More Information Needed]
Data Splits
[More… See the full description on the dataset page: https://huggingface.co/datasets/Helsinki-NLP/opus_xhosanavy.
Facebook
TwitterAttribution 2.0 (CC BY 2.0)https://creativecommons.org/licenses/by/2.0/
License information was derived automatically
The Tatoeba Translation Challenge is a multilingual data set of machine translation benchmarks derived from user-contributed translations collected by Tatoeba.org and provided as parallel corpus from OPUS. This dataset includes test and development data sorted by language pair. It includes test sets for hundreds of language pairs and is continuously updated. Please, check the version number tag to refer to the release that your are using.
Facebook
TwitterMagedSaeed/opus-100_ar_en_experimental dataset hosted on Hugging Face and contributed by the HF Datasets community
Facebook
Twitterakahana/llm-opus-ParaCrawl-english-id-v2 dataset hosted on Hugging Face and contributed by the HF Datasets community
Facebook
TwitterPocketDoc/Dans-Prosemaxx-Opus-Writing dataset hosted on Hugging Face and contributed by the HF Datasets community
Facebook
TwitterApache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
train, validation and test data from the Helsinki NLP opus 100
Contains English-Hindi sentence translations.
Facebook
TwitterPocketDoc/Dans-Assistantmaxx-Opus-instruct-2 dataset hosted on Hugging Face and contributed by the HF Datasets community
Facebook
Twitterhttps://choosealicense.com/licenses/other/https://choosealicense.com/licenses/other/
Dataset Card for OPUS Books
Dataset Summary
This is a collection of copyright free books aligned by Andras Farkas, which are available from http://www.farkastranslations.com/bilingual_books.php Note that the texts are rather dated due to copyright issues and that some of them are manually reviewed (check the meta-data at the top of the corpus files in XML). The source is multilingually aligned, which is available from http://www.farkastranslations.com/bilingual_books.php.… See the full description on the dataset page: https://huggingface.co/datasets/Helsinki-NLP/opus_books.