100+ datasets found
  1. the_cauldron

    • huggingface.co
    Updated Apr 15, 2024
    Cite
    HuggingFaceM4 (2024). the_cauldron [Dataset]. https://huggingface.co/datasets/HuggingFaceM4/the_cauldron
    Dataset updated
    Apr 15, 2024
    Dataset authored and provided by
    HuggingFaceM4
    Description

    Dataset Card for The Cauldron

      Dataset description
    

    The Cauldron is part of the Idefics2 release. It is a massive collection of 50 vision-language datasets (training sets only) that were used for the fine-tuning of the vision-language model Idefics2.

      Load the dataset
    

    To load the dataset, install the library datasets with pip install datasets. Then run

    from datasets import load_dataset
    ds = load_dataset("HuggingFaceM4/the_cauldron", "ai2d")

    to download and load the… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceM4/the_cauldron.

  2. AI-Waifu-DIDcord-Datasets-Collection

    • huggingface.co
    Updated Jul 16, 2024
    Cite
    Fire Ice Dancer (2024). AI-Waifu-DIDcord-Datasets-Collection [Dataset]. https://huggingface.co/datasets/FireIceDancer2/AI-Waifu-DIDcord-Datasets-Collection
    Dataset updated
    Jul 16, 2024
    Authors
    Fire Ice Dancer
    License

    https://choosealicense.com/licenses/wtfpl/

    Description

    My personal dataset collection: https://huggingface.co/datasets/FireIceDancer2/mouthmask/tree/main This is the (un)official dataset collection of the AI Waifu DID discord server. We are a group of enthusiasts sharing the same love for generative AI stuff, specifically AI-generated images and text. Despite the name, our interests are not limited to damsel-in-distress (DID) stuff, but also encompass many different things, such as anime and the like. This repo was created as an effort to create a… See the full description on the dataset page: https://huggingface.co/datasets/FireIceDancer2/AI-Waifu-DIDcord-Datasets-Collection.

  3. Replication package for DRAGON: Robust Classification for Very Large...

    • zenodo.org
    bin, zip
    Updated May 15, 2025
    Cite
    Anonymous; Anonymous (2025). Replication package for DRAGON: Robust Classification for Very Large Collections of Software Repositories [Dataset]. http://doi.org/10.5281/zenodo.15424419
    Dataset updated
    May 15, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anonymous; Anonymous
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    DRAGON: Multi-Label Classification

    This archive contains the replication package for the DRAGON multi-label classification models, which leverage BERT-based architectures. The package includes scripts for repository mining, dataset creation, data processing, model training, and evaluation. The two main models used are DRAGON and LEGION.

    Key Components:

    • Repository Mining: Scripts to extract repositories for dataset creation.
    • Dataset Preparation: Jupyter notebooks for cleaning and transforming data.
    • Data Processing: Conversion into a Hugging Face dataset format.
    • Model Training: Training scripts for DRAGON and LEGION, with configurable preprocessing options.
    • Evaluation: Threshold tuning and performance assessment.

    Setup

    Before running any commands, ensure you have the necessary dependencies installed. It is recommended to use a virtual environment:

    python3 -m venv venv
    source venv/bin/activate # On Windows use `venv\Scripts\activate`
    pip install -r requirements.txt
    

    Project Structure

    • repository_mining/: Contains scripts for mining the initial set of repositories.
      • repository_mining/doc/: Includes documentation with the necessary information for repository mining.
    • dataset_creation/: Contains all the notebooks to be run sequentially to prepare the dataset.
    • multilabel_class/: Contains scripts for classification, threshold tuning, and evaluation.
      • multilabel_class/model_output/: trained models organized first by dataset, then by model variant.
    • data/: Contains the Hugging Face datasets (our dataset and the LEGION dataset) ready for training/evaluation.

    1️⃣ Data Mining

    To mine the initial set of repositories from Software Heritage, use the scripts available in the repository_mining/ folder. Detailed information and steps for repository mining can be found in:

    repository_mining/doc/
    

    2️⃣ Dataset Creation

    After mining the repositories, prepare the dataset by running the Jupyter notebooks inside the dataset_creation/ folder in sequence. These notebooks handle the data cleaning, transformation, and formatting necessary for model training; each notebook contains documentation explaining every step.

    3️⃣ Data Processing

    Once the dataset is prepared, convert it into a Hugging Face dataset using:

    python3 multilabel_class/create_dataset.py --file_path data/02_processed_datasets/2024-05-22/origin-metadata-readme_names-900000dataset_forks-cleaned.csv
    

    4️⃣ Classification / Training

    Train the DRAGON Model

    After processing the dataset, train the DRAGON model with the following command:

    python3 multilabel_class/tune_thresholds.py --model_type bert --model_variant focal --dataset_path data/03_huggingaceV_datasets/2024-05-22/origin-metadata-readme_names-900000dataset_forks-cleaned/dataset
    

    Ensure Configuration is Set Correctly

    Modify the configuration file multilabel_class/utils/config.py to set the following parameter to True:

    DEFAULT_PREPROCESSING_PARAMS = { 
      'use_sentence_pairs': True # If True, process as (text1, text2); if False, concatenate texts
    }
    

    Training DRAGON Without Sentence Pairs

    To train DRAGON without using sentence pairs, use the same command but set use_sentence_pairs to False in the config file:

    DEFAULT_PREPROCESSING_PARAMS = { 
      'use_sentence_pairs': False
    }
    

    Train DRAGON on a Benchmark Dataset

    To train DRAGON on a benchmark dataset, use:

    python3 multilabel_class/tune_thresholds.py --model_type bert --model_variant focal --dataset_path data/03_huggingaceV_datasets/LEGION/dataset
    

    Ensure the use_sentence_pairs parameter is set to True in config.py.

    Train LEGION on the DRAGON Dataset

    To train LEGION on the DRAGON dataset, use:

    python3 multilabel_class/tune_thresholds.py --model_type bert --model_variant db --dataset_path data/03_huggingaceV_datasets/2024-05-22/origin-metadata-readme_names-900000dataset_forks-cleaned/dataset
    

    Ensure the use_sentence_pairs parameter is set to False in config.py:

    DEFAULT_PREPROCESSING_PARAMS = { 
      'use_sentence_pairs': False
    }
    

    Train LEGION on a Baseline Dataset

    To train LEGION on a baseline dataset, run:

    python3 multilabel_class/tune_thresholds.py --model_type bert --model_variant db --dataset_path data/03_huggingaceV_datasets/LEGION/dataset
    

    5️⃣ Model Evaluation

    Once thresholds are tuned, you can evaluate the model using:

    python3 multilabel_class/evaluation.py --model_type bert --model_variant focal --dataset_path data/03_huggingaceV_datasets/2024-05-22/origin-metadata-readme_names-900000dataset_forks-cleaned/dataset
    

    This evaluation script computes standard multi-label classification metrics including:

    • Micro and macro F1@k for k = 1..5
    • Precision@k and Recall@k for k = 1..5

    Ensure that the model variant and dataset path correspond to the previously trained model.
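
    For reference, the sketch below shows one way Precision@k and Recall@k can be computed for multi-label predictions. It is illustrative only and is not taken from the package's evaluation.py; the function name and the NumPy-based approach are assumptions.

    import numpy as np

    def precision_recall_at_k(y_true, y_scores, k):
        # y_true: (n_samples, n_labels) binary relevance matrix
        # y_scores: (n_samples, n_labels) predicted label scores
        topk = np.argsort(-y_scores, axis=1)[:, :k]            # k highest-scoring labels per sample
        hits = np.take_along_axis(y_true, topk, axis=1).sum(axis=1)
        precision = hits / k                                   # relevant labels among the top k
        recall = hits / np.maximum(y_true.sum(axis=1), 1)      # guard against samples with no labels
        return precision.mean(), recall.mean()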

    Recommended: Evaluation via Notebooks

    For an interactive and visual analysis of model performance, you can also use the provided Jupyter notebooks located in:

    DRAGON_replication/multilabel_class/notebooks/
    

    These notebooks reproduce the complete evaluation pipeline and generate additional visualizations and metrics discussed in the associated paper.

    Both command-line and notebook-based evaluations ensure reproducibility and offer complementary insights into model behavior.

    Instructions for Unzipping Files

    Several folders in this replication package have been compressed into .zip files to reduce package size. Before running any code, you must unzip all the provided .zip files in-place—that is, extract each archive into the same directory as the .zip file, using the same name as the zip file (without the .zip extension).

    For example:

    DRAGON_replication\data\02_processed_dataset\2024-05-22.zip
    

    should be extracted to:

    DRAGON_replication\data\02_processed_dataset\2024-05-22\
    

    List of .zip files to extract

    • DRAGON_replication\data\02_processed_dataset\2024-05-22.zip
    • DRAGON_replication\data\03_huggingaceV_datasets\2024-05-22.zip
    • DRAGON_replication\data\03_huggingaceV_datasets\LEGION.zip
    • DRAGON_replication\dataset_creation\data.zip
    • DRAGON_replication\multilabel_class\model_output\2024-05-22.zip
    • DRAGON_replication\multilabel_class\model_output\LEGION.zip

    Make sure that after extraction, each corresponding folder exists and contains the expected files. Do not change the folder names or directory structure after unzipping.
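
    As an aid (not part of the original package), here is a minimal Python sketch that performs the in-place extraction described above. It assumes it is run from the directory containing DRAGON_replication and that each archive holds the folder contents directly; adjust if an archive already contains a top-level folder. Paths are written with forward slashes, which also work on Windows.

    import zipfile
    from pathlib import Path

    archives = [
        "DRAGON_replication/data/02_processed_dataset/2024-05-22.zip",
        "DRAGON_replication/data/03_huggingaceV_datasets/2024-05-22.zip",
        "DRAGON_replication/data/03_huggingaceV_datasets/LEGION.zip",
        "DRAGON_replication/dataset_creation/data.zip",
        "DRAGON_replication/multilabel_class/model_output/2024-05-22.zip",
        "DRAGON_replication/multilabel_class/model_output/LEGION.zip",
    ]

    for archive in archives:
        path = Path(archive)
        target = path.with_suffix("")        # same directory, same name without .zip
        target.mkdir(parents=True, exist_ok=True)
        with zipfile.ZipFile(path) as zf:
            zf.extractall(target)            # extract next to the original .zip file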

    This README provides an overview of the essential steps for repository mining, dataset preparation, processing, model training, and evaluation. For further customization, refer to the configuration files and experiment with different preprocessing settings.

  4. Merged-LID-20

    • huggingface.co
    Updated Jun 22, 2025
    Cite
    Michiel Kamphuis (2025). Merged-LID-20 [Dataset]. https://huggingface.co/datasets/Michielo/Merged-LID-20
    Dataset updated
    Jun 22, 2025
    Authors
    Michiel Kamphuis
    Description

    Merged-LID-20

    This dataset provides a curated collection of language-specific datasets from Hugging Face, optimized for building and training language identification models. Each dataset includes text samples in a single language, making this an ideal resource for projects involving multilingual natural language processing tasks such as language identification.

      Overview
    

    The dataset collection includes 20 languages, covering a range of language families, scripts, and… See the full description on the dataset page: https://huggingface.co/datasets/Michielo/Merged-LID-20.

  5. SQL Create Context

    • kaggle.com
    • opendatalab.com
    • +1more
    Updated Nov 24, 2023
    Cite
    The Devastator (2023). SQL Create Context [Dataset]. https://www.kaggle.com/datasets/thedevastator/understanding-contextual-questions-answers/code
    Dataset updated
    Nov 24, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    SQL Create Context

    Uncovering Implications and Insights

    By Huggingface Hub [source]

    About this dataset

    This dataset contains a collection of questions and answers that have been contextualized to reveal subtle implications and insights. It is focused on helping researchers gain a deeper understanding of how semantics, context, and other factors affect how people interpret and respond to various conversations about different topics. By exploring this dataset, researchers will be able to uncover the underlying principles governing conversation styles, which can then be applied to better understand attitudes among different groups. With its comprehensive coverage of questions from a variety of sources around the web, this dataset offers an invaluable resource for those looking to analyze discourse in terms of sentiment analysis or opinion mining.


    How to use the dataset

    How to Use This Dataset

    This dataset contains a collection of contextualized questions and answers extracted from various sources around the web, which can be useful for exploring implications and insights. To get started with the dataset:

    • Read through the headings on each column in order to understand the data that has been collected - this will help you identify which pieces of information are relevant for your research project.
    • Explore each column and view what types of responses have been given in response to particular questions or topics - this will give you an idea as to how people interpret specific topics differently when presented with different contexts or circumstances.
    • Next, analyze the responses looking for any patterns or correlations between responses on different topics or contexts - this can help reveal implications and insights previously unknown to you about a particular subject matter. You can also use any data visualization tools such as Tableau or PowerBI to gain deeper understanding into the results and trends within your data set!
    • Finally, use these findings to better inform your project by tailoring future questions around any patterns discovered within your analysis!
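
    A minimal sketch of the loading step described above (illustrative; it assumes train.csv has been downloaded from the Kaggle page, and only the documented context column is referenced):

    import pandas as pd

    train = pd.read_csv("train.csv")
    print(train.columns.tolist())       # inspect which columns the file actually contains
    print(train["context"].head())      # the documented 'context' column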

    Research Ideas

    • To understand the nature of public debates and how people express their opinions in different contexts.
    • To better comprehend the implicit attitudes and assumptions inherent in language use, providing insight into discourse norms on a range of issues.
    • To gain insight into the use of rhetorical devices, such as exaggeration and deceptive tactics, used to influence public opinion on important topics

    Acknowledgements

    If you use this dataset in your research, please credit the original authors.

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: train.csv

    | Column name | Description |
    |:------------|:------------------------------------------------------------------------------|
    | context     | The context in which the question was asked and the answer was given. (Text)   |

    Acknowledgements

    If you use this dataset in your research, please credit the original authors and Huggingface Hub.

  6. sea-vl_crowdsourcing

    • ollama.hf-mirror.com
    • huggingface.co
    Updated Apr 12, 2025
    Cite
    SEACrowd (2025). sea-vl_crowdsourcing [Dataset]. https://ollama.hf-mirror.com/datasets/SEACrowd/sea-vl_crowdsourcing
    Dataset updated
    Apr 12, 2025
    Dataset authored and provided by
    SEACrowd
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    SEA-VL: A Multicultural Vision-Language Dataset for Southeast Asia

    Paper: Crowdsource, Crawl, or Generate? Creating SEA-VL, A Multicultural Vision-Language Dataset for Southeast Asia
    Dataset: SEA-VL Collection on HuggingFace
    Code: SEA-VL Experiment | SEA-VL Image Collection

      What is SEA-VL?
    

    Following the success of our SEACrowd project, we’re excited to announce SEA-VL, a new open-source initiative to create high-quality vision-language datasets specifically for… See the full description on the dataset page: https://huggingface.co/datasets/SEACrowd/sea-vl_crowdsourcing.

  7. realharm

    • huggingface.co
    Updated Mar 24, 2016
    Cite
    Giskard (2016). realharm [Dataset]. https://huggingface.co/datasets/giskardai/realharm
    Dataset updated
    Mar 24, 2016
    Dataset authored and provided by
    Giskard
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    RealHarm

    RealHarm is a collection of harmful real-world interactions with AI agents.

      Dataset Details

      Dataset Description

    RealHarm contains harmful samples, categorized among 10 harm categories. A complete taxonomy has been proposed along with the dataset and is described in the RealHarm paper. Each sample has an associated safe version, for which we rewrote the agent answer to make it harmless. This dataset provides researchers and developers with authentic… See the full description on the dataset page: https://huggingface.co/datasets/giskardai/realharm.

  8. Dolly 15k Dutch

    • zenodo.org
    • huggingface.co
    • +1more
    bin
    Updated Jun 20, 2023
    Cite
    Bram Vanroy; Bram Vanroy (2023). Dolly 15k Dutch [Dataset]. http://doi.org/10.57967/hf/0785
    Dataset updated
    Jun 20, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Bram Vanroy; Bram Vanroy
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0), https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    This dataset contains 14,934 instructions, contexts and responses, in several natural language categories such as classification, closed QA, generation, etc. The English original dataset was created by @databricks, who crowd-sourced the data creation via its employees. The current dataset is a translation of that dataset through ChatGPT (gpt-3.5-turbo).

    Data Instances

    {
     "id": 14963,
     "instruction": "Wat zijn de duurste steden ter wereld?",
     "context": "",
     "response": "Dit is een uitgebreide lijst van de duurste steden: Singapore, Tel Aviv, New York, Hong Kong, Los Angeles, Zurich, Genève, San Francisco, Parijs en Sydney.",
     "category": "brainstorming"
    }
    

    Data Fields

    • id: the ID of the item. The following 77 IDs are not included because they could not be translated (or were too long): [1502, 1812, 1868, 4179, 4541, 6347, 8851, 9321, 10588, 10835, 11257, 12082, 12319, 12471, 12701, 12988, 13066, 13074, 13076, 13181, 13253, 13279, 13313, 13346, 13369, 13446, 13475, 13528, 13546, 13548, 13549, 13558, 13566, 13600, 13603, 13657, 13668, 13733, 13765, 13775, 13801, 13831, 13906, 13922, 13923, 13957, 13967, 13976, 14028, 14031, 14045, 14050, 14082, 14083, 14089, 14110, 14155, 14162, 14181, 14187, 14200, 14221, 14222, 14281, 14473, 14475, 14476, 14587, 14590, 14667, 14685, 14764, 14780, 14808, 14836, 14891, 14966]
    • instruction: the instruction (question)
    • context: additional context that the AI can use to answer the question
    • response: the AI's expected response
    • category: the category of this type of question (see Dolly for more info)

    Dataset Creation

    Both the texts and the topics were translated with OpenAI's API for gpt-3.5-turbo, using max_tokens=1024 and temperature=0 as parameters.

    The prompt template to translate the input is (where src_lang was English and tgt_lang Dutch):

    CONVERSATION_TRANSLATION_PROMPT = """You are asked to translate a task's instruction, optional context to the task, and the response to the task, from {src_lang} to {tgt_lang}.
    
    Here are the requirements that you should adhere to:
    1. maintain the format: the task consists of a task instruction (marked `instruction: `), optional context to the task (marked `context: `) and response for the task marked with `response: `;
    2. do not translate the identifiers `instruction: `, `context: `, and `response: ` but instead copy them to your output;
    3. make sure that text is fluent to read and does not contain grammatical errors. Use standard {tgt_lang} without regional bias;
    4. translate the instruction and context text using informal, but standard, language;
    5. make sure to avoid biases (such as gender bias, grammatical bias, social bias);
    6. if the instruction is to correct grammar mistakes or spelling mistakes then you have to generate a similar mistake in the context in {tgt_lang}, and then also generate a corrected output version in the output in {tgt_lang};
    7. if the instruction is to translate text from one language to another, then you do not translate the text that needs to be translated in the instruction or the context, nor the translation in the response (just copy them as-is);
    8. do not translate code fragments but copy them to your output. If there are English examples, variable names or definitions in code fragments, keep them in English.
    
    Now translate the following task with the requirements set out above. Do not provide an explanation and do not add anything else.
    
    """
    

    The system message was:

    You are a helpful assistant that translates English to Dutch according to the requirements that are given to you.
    

    Note that 77 items (0.5%) were not successfully translated. This can either mean that the prompt was too long for the given limit (max_tokens=1024) or that the generated translation could not be parsed into instruction, context and response fields. The missing IDs are [1502, 1812, 1868, 4179, 4541, 6347, 8851, 9321, 10588, 10835, 11257, 12082, 12319, 12471, 12701, 12988, 13066, 13074, 13076, 13181, 13253, 13279, 13313, 13346, 13369, 13446, 13475, 13528, 13546, 13548, 13549, 13558, 13566, 13600, 13603, 13657, 13668, 13733, 13765, 13775, 13801, 13831, 13906, 13922, 13923, 13957, 13967, 13976, 14028, 14031, 14045, 14050, 14082, 14083, 14089, 14110, 14155, 14162, 14181, 14187, 14200, 14221, 14222, 14281, 14473, 14475, 14476, 14587, 14590, 14667, 14685, 14764, 14780, 14808, 14836, 14891, 14966].
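
    For illustration only (this is not code from the original creation pipeline, which predates the current client interface), a minimal sketch of how such a translation call could look with today's openai Python client, assuming the prompt template above has been filled in and the item's instruction/context/response block appended to it:

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    SYSTEM_MESSAGE = ("You are a helpful assistant that translates English to Dutch "
                      "according to the requirements that are given to you.")

    def translate_item(filled_prompt: str) -> str:
        # filled_prompt = CONVERSATION_TRANSLATION_PROMPT (formatted) followed by the
        # item's instruction/context/response lines, as described above
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system", "content": SYSTEM_MESSAGE},
                {"role": "user", "content": filled_prompt},
            ],
            max_tokens=1024,   # limit reported for the dataset
            temperature=0,     # deterministic output, as reported
        )
        return response.choices[0].message.content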

    Initial Data Collection and Normalization

    Initial data collection by databricks. See their repository for more information about this dataset.

    Considerations for Using the Data

    Note that the translations in this new dataset have not been verified by humans! Use at your own risk, both in terms of quality and biases.

    Discussion of Biases

    As with any machine-generated texts, users should be aware of potential biases that are included in this dataset. Although the prompt specifically includes "make sure to avoid biases (such as gender bias, grammatical bias, social bias)", the impact of such a command is of course not known. It is likely that biases remain in the dataset, so use with caution.

    Other Known Limitations

    The translation quality has not been verified. Use at your own risk!

    Licensing Information

    This repository follows the original databricks license, which is CC BY-SA 3.0 but see below for a specific restriction.

    This text was generated (either in part or in full) with GPT-3 (gpt-3.5-turbo), OpenAI’s large-scale language-generation model. Upon generating draft language, the author reviewed, edited, and revised the language to their own liking and takes ultimate responsibility for the content of this publication.

    If you use this dataset, you must also follow the Sharing and Usage policies.

    As clearly stated in their Terms of Use, specifically 2c.iii, "[you may not] use output from the Services to develop models that compete with OpenAI". That means that you cannot use this dataset to build models that are intended to commercially compete with OpenAI. As far as I am aware, that is a specific restriction that should serve as an addendum to the current license.

    This dataset is also available on the Hugging Face hub, its canonical repository.

  9. TinyStories

    • kaggle.com
    • opendatalab.com
    • +1more
    Updated Nov 24, 2023
    Cite
    The Devastator (2023). TinyStories [Dataset]. https://www.kaggle.com/datasets/thedevastator/tinystories-narrative-classification
    Dataset updated
    Nov 24, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    TinyStories

    A Diverse, Richly Annotated Corpus of Short-Form Stories

    By Huggingface Hub [source]

    About this dataset

    This dataset contains the text of a remarkable collection of short stories known as the TinyStories Corpus. With over 2,000 annotated stories, it is populated with an array of diverse styles and genres from multiple sources. This corpus is enriched by intricate annotations across each narrative content, making it a valuable resource for narrative text classification. The text field in each row includes the entirety of each story that can be used to identify plots, characters and other features associated with story-telling techniques. Through this collection of stories, users will gain an extensive insight into a wide range of narratives which could be used to produce powerful machine learning models for Narrative Text Classification


    How to use the dataset

    In this dataset, each row contains a short story along with its associated labels for narrative text classification tasks. The data consists of the following:

    • text: The story text itself (string)
    • validation.csv: Contains a set of short stories for validation (dataframe)
    • train.csv: Contains the text of short stories used for narrative text classification (dataframe)

    The data contained in both files can be used for various types of machine learning tasks related to narrative text classification. These include but are not limited to experiments such as determining story genres, predicting user reactions, sentiment analysis etc.

    To get started, download the validation and train CSV files from the Kaggle dataset page and save them locally. Once downloaded, you may need to preprocess both files by cleaning up wrongly formatted values or duplicate entries, since these can noticeably affect the accuracy of any downstream results.

    The next step is to load the two files into pandas DataFrames so they can be manipulated and analyzed with common Natural Language Processing (NLP) tooling. This takes only a few lines using pandas functions such as read_csv() and concat(), and the resulting DataFrames work equally well in Jupyter notebooks or with libraries such as scikit-learn for more complex tasks (see the sketch below).

    With the data loaded, you can explore connections between narratives or character traits using supervised models such as a Naive Bayes classifier, and look for the patterns underlying the texts in this richly annotated TinyStories corpus.
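
    A minimal loading sketch (illustrative; it assumes both CSV files sit in the working directory and contain the documented text column):

    import pandas as pd

    train = pd.read_csv("train.csv")
    validation = pd.read_csv("validation.csv")

    print(len(train), len(validation))                      # number of stories per split
    print(train["text"].str.split().str.len().describe())   # rough story-length statistics (in words)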

    Research Ideas

    • Creating a text classification algorithm to automatically categorize short stories by genre.
    • Developing an AI-based summarization tool to quickly summarize the main points in a story.
    • Developing an AI-based story generator that can generate new stories based on existing ones in the dataset

    Acknowledgements

    If you use this dataset in your research, please credit the original authors.

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: validation.csv

    | Column name | Description |
    |:------------|:---------------------------------|
    | text        | The text of the story. (String)  |

    File: train.csv

    | Column name | Description |
    |:--------------|:----------------------------...

  10. sql-text-collection

    • huggingface.co
    Cite
    Alan Tseng, sql-text-collection [Dataset]. https://huggingface.co/datasets/agentlans/sql-text-collection
    Authors
    Alan Tseng
    Description

    SQL Text Collection

    This is a collection of publicly available text-to-SQL datasets.

      Dataset Structure
    

    Each row contains the columns:

    • context: The schema for the database (e.g., CREATE TABLE statements).
    • query: A natural language query or action to perform, expressed in English.
    • source: The original dataset from which the row was sourced.
    • dialect: One or more SQL dialects identified based on dialect-specific keywords found in the context and query. If there are multiple… See the full description on the dataset page: https://huggingface.co/datasets/agentlans/sql-text-collection.
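
    A minimal sketch of loading this collection and filtering on the documented dialect column (illustrative; the split name and the exact dialect labels are assumptions):

    from datasets import load_dataset

    ds = load_dataset("agentlans/sql-text-collection", split="train")   # split name assumed
    print(ds.column_names)          # expect context, query, source, dialect, ...

    sqlite_rows = ds.filter(lambda row: "sqlite" in str(row["dialect"]).lower())
    print(len(sqlite_rows))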

  11. brics-edtech-data-collection

    • huggingface.co
    Cite
    brics-edtech-patent-analysis, brics-edtech-data-collection [Dataset]. https://huggingface.co/datasets/brics-edtech/brics-edtech-data-collection
    Authors
    brics-edtech-patent-analysis
    License

    https://choosealicense.com/licenses/gpl-3.0/

    Description

    brics-edtech-patent-analysis

    paper's repository

    Dataset: Data Collection, Processing, and Annotation

    In this section, we describe the methodology used to create the research dataset, including data sources, processing steps, and annotation by a large language model.

      2.1.1. Source and Data Collection
    

    The primary data source for this study was the patents.google.com database. This platform was chosen for its extensive collection of full-text national and international… See the full description on the dataset page: https://huggingface.co/datasets/brics-edtech/brics-edtech-data-collection.

  12. artworks

    • huggingface.co
    Cite
    Anna Bozhenko, artworks [Dataset]. https://huggingface.co/datasets/anna-bozhenko/artworks
    Authors
    Anna Bozhenko
    Description

    Combined Louvre and Art Institute of Chicago (AIC) Collection Dataset

      Dataset Summary
    

    This dataset merges artwork information from two prominent museum collections: the Musée du Louvre and The Art Institute of Chicago (AIC). It combines data from the Louvre Paper and Canvas Collection and the AIC Dataset 0.2 datasets. Due to differences in the original datasets' schemas, a decision was made to focus on common fields and create a non-atomic full_info field containing… See the full description on the dataset page: https://huggingface.co/datasets/anna-bozhenko/artworks.

  13. SPML_Chatbot_Prompt_Injection

    • huggingface.co
    Updated Dec 11, 2024
    Cite
    Reshabh K Sharma (2024). SPML_Chatbot_Prompt_Injection [Dataset]. https://huggingface.co/datasets/reshabhs/SPML_Chatbot_Prompt_Injection
    Dataset updated
    Dec 11, 2024
    Authors
    Reshabh K Sharma
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    SPML Chatbot Prompt Injection Dataset

    Arxiv Paper

    Introducing the SPML Chatbot Prompt Injection Dataset: a robust collection of system prompts designed to create realistic chatbot interactions, coupled with a diverse array of annotated user prompts that attempt to carry out prompt injection attacks. While other datasets in this domain have centered on less practical chatbot scenarios or have limited themselves to "jailbreaking" – just one aspect of prompt injection – our dataset… See the full description on the dataset page: https://huggingface.co/datasets/reshabhs/SPML_Chatbot_Prompt_Injection.

  14. YouTube-Commons

    • huggingface.co
    Updated Apr 17, 2024
    Cite
    PleIAs (2024). YouTube-Commons [Dataset]. https://huggingface.co/datasets/PleIAs/YouTube-Commons
    Dataset updated
    Apr 17, 2024
    Dataset authored and provided by
    PleIAs
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    YouTube
    Description

    📺 YouTube-Commons 📺

    YouTube-Commons is a collection of audio transcripts of 2,063,066 videos shared on YouTube under a CC-By license.

      Content
    

    The collection comprises 22,709,724 original and automatically translated transcripts from 3,156,703 videos (721,136 individual channels). In total, this represents nearly 45 billion words (44,811,518,375). All the videos were shared on YouTube with a CC-BY license: the dataset provides all the necessary provenance information… See the full description on the dataset page: https://huggingface.co/datasets/PleIAs/YouTube-Commons.

  15. ultrachat_200k

    • huggingface.co
    • opendatalab.com
    Updated Oct 29, 2023
    + more versions
    Cite
    Hugging Face H4 (2023). ultrachat_200k [Dataset]. https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k
    Dataset updated
    Oct 29, 2023
    Dataset provided by
    Hugging Face (https://huggingface.co/)
    Authors
    Hugging Face H4
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Dataset Card for UltraChat 200k

      Dataset Description
    

    This is a heavily filtered version of the UltraChat dataset and was used to train Zephyr-7B-β, a state-of-the-art 7B chat model. The original dataset consists of 1.4M dialogues generated by ChatGPT and spanning a wide range of topics. To create UltraChat 200k, we applied the following logic:

    Selection of a subset of data for faster supervised fine tuning. Truecasing of the dataset, as we observed around 5% of the data… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k.

  16. Children-Stories-Collection

    • huggingface.co
    Cite
    Feynman Innovations, Children-Stories-Collection [Dataset]. http://doi.org/10.57967/hf/2480
    Authors
    Feynman Innovations
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Children Stories Collection

    A synthetic dataset of around 0.9 million stories written especially for young children. You can use these datasets directly for training large models. A total of 10 datasets are available for download; you can use any one or all of the JSON files for training. The datasets are in "prompt" and "text" format, and the total token length is also provided. Thank you for your love & support.

  17. NeurIPS-LLM-data

    • huggingface.co
    Updated Mar 4, 2024
    Cite
    Upaya (2024). NeurIPS-LLM-data [Dataset]. https://huggingface.co/datasets/upaya07/NeurIPS-LLM-data
    Dataset updated
    Mar 4, 2024
    Dataset authored and provided by
    Upaya
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    🤖 We curated this dataset for NeurIPS Large Language Model Efficiency Challenge: 1 LLM + 1GPU + 1Day. 🚀 Our Birbal-7B-V1 fine-tuned on this dataset achieved 🏆 first rank 🏆 in the competition.

    Here is high-level diagram of our data preparation strategy:

      Natural Instructions Dataset Preparation
    

    The Natural Instructions dataset is a community effort to create a large collection of tasks and their natural language definitions/instructions. As shown in the diagram above, we sample from… See the full description on the dataset page: https://huggingface.co/datasets/upaya07/NeurIPS-LLM-data.

  18. mm2_user

    • huggingface.co
    Updated Feb 15, 2022
    Cite
    Addison (2022). mm2_user [Dataset]. https://huggingface.co/datasets/TheGreatRambler/mm2_user
    Dataset updated
    Feb 15, 2022
    Authors
    Addison
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Mario Maker 2 users

    Part of the Mario Maker 2 Dataset Collection

      Dataset Description
    

    The Mario Maker 2 users dataset consists of 6 million users from Nintendo's online service totaling around 1.2GB of data. The dataset was created using the self-hosted Mario Maker 2 api over the course of 1 month in February 2022.

      How to use it
    

    The Mario Maker 2 users dataset is a very large dataset so for most use cases it is recommended to make use of the streaming API of… See the full description on the dataset page: https://huggingface.co/datasets/TheGreatRambler/mm2_user.
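
    A minimal sketch of that streaming approach, assuming the truncated sentence above refers to the streaming mode of the datasets library and that the data lives in a train split:

    from datasets import load_dataset

    # streaming=True iterates over records without downloading the full ~1.2 GB up front
    users = load_dataset("TheGreatRambler/mm2_user", split="train", streaming=True)

    for i, user in enumerate(users):
        print(user)                 # one user record at a time
        if i == 2:
            break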

  19. agnews

    • huggingface.co
    Updated Apr 7, 2025
    Cite
    Sentence Transformers (2025). agnews [Dataset]. https://huggingface.co/datasets/sentence-transformers/agnews
    Dataset updated
    Apr 7, 2025
    Dataset authored and provided by
    Sentence Transformers
    Description

    Dataset Card for AGNews

    This dataset is a collection of title-description pairs collected from AGNews. See the AG News corpus for additional information. This dataset can be used directly with Sentence Transformers to train embedding models.

      Dataset Subsets

      pair subset

    Columns: "title", "description"
    Column types: str, str
    Examples: { 'title': 'Helicopter Crashes in Colombian Drug War, Kills 20', 'description': 'BOGOTA, Colombia - A U.S.-made helicopter on… See the full description on the dataset page: https://huggingface.co/datasets/sentence-transformers/agnews.
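
    A minimal sketch of using the pair subset with Sentence Transformers (illustrative; the "pair" config name, the train split, and the choice of loss are assumptions based on the columns described above):

    from datasets import load_dataset
    from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses

    train_ds = load_dataset("sentence-transformers/agnews", "pair", split="train")
    model = SentenceTransformer("all-MiniLM-L6-v2")
    loss = losses.MultipleNegativesRankingLoss(model)   # treats (title, description) as positive pairs

    trainer = SentenceTransformerTrainer(model=model, train_dataset=train_ds, loss=loss)
    trainer.train()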

  20. opus_books

    • huggingface.co
    Updated Mar 29, 2024
    Cite
    Language Technology Research Group at the University of Helsinki (2024). opus_books [Dataset]. https://huggingface.co/datasets/Helsinki-NLP/opus_books
    Dataset updated
    Mar 29, 2024
    Dataset authored and provided by
    Language Technology Research Group at the University of Helsinki
    License

    https://choosealicense.com/licenses/other/

    Description

    Dataset Card for OPUS Books

      Dataset Summary
    

    This is a collection of copyright-free books aligned by Andras Farkas, which are available from http://www.farkastranslations.com/bilingual_books.php. Note that the texts are rather dated due to copyright issues and that some of them are manually reviewed (check the meta-data at the top of the corpus files in XML). The source is multilingually aligned, which is available from http://www.farkastranslations.com/bilingual_books.php.… See the full description on the dataset page: https://huggingface.co/datasets/Helsinki-NLP/opus_books.
