100+ datasets found
  1. the_cauldron

    • huggingface.co
    Updated Apr 15, 2024
    Cite
    HuggingFaceM4 (2024). the_cauldron [Dataset]. https://huggingface.co/datasets/HuggingFaceM4/the_cauldron
    Dataset updated
    Apr 15, 2024
    Dataset authored and provided by
    HuggingFaceM4
    Description

    Dataset Card for The Cauldron

      Dataset description
    

    The Cauldron is part of the Idefics2 release. It is a massive collection of 50 vision-language datasets (training sets only) that were used for the fine-tuning of the vision-language model Idefics2.

      Load the dataset
    

    To load the dataset, install the library datasets with pip install datasets. Then run

    from datasets import load_dataset
    ds = load_dataset("HuggingFaceM4/the_cauldron", "ai2d")

    to download and load the… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceM4/the_cauldron.

  2. AI-Waifu-DIDcord-Datasets-Collection

    • huggingface.co
    Updated Jul 16, 2024
    Cite
    Fire Ice Dancer (2024). AI-Waifu-DIDcord-Datasets-Collection [Dataset]. https://huggingface.co/datasets/FireIceDancer2/AI-Waifu-DIDcord-Datasets-Collection
    Dataset updated
    Jul 16, 2024
    Authors
    Fire Ice Dancer
    License

    https://choosealicense.com/licenses/wtfpl/

    Description

    My personal dataset collection: https://huggingface.co/datasets/FireIceDancer2/mouthmask/tree/main This is the (un)official dataset collection of the AI Waifu DID discord server. We are a group of enthusiasts sharing the same love for generative AI stuff, specifically AI-generated images and text. Despite the name, our interests are not limited to damsel-in-distress (DID) stuff, but also encompass many different things, such as anime and the like. This repo was created as an effort to create a… See the full description on the dataset page: https://huggingface.co/datasets/FireIceDancer2/AI-Waifu-DIDcord-Datasets-Collection.

  3. Replication package for DRAGON: Robust Classification for Very Large...

    • zenodo.org
    bin, zip
    Updated May 15, 2025
    Cite
    Anonymous; Anonymous (2025). Replication package for DRAGON: Robust Classification for Very Large Collections of Software Repositories [Dataset]. http://doi.org/10.5281/zenodo.15424419
    Dataset updated
    May 15, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anonymous; Anonymous
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    DRAGON: Multi-Label Classification

    This archive contains the replication package for the DRAGON multi-label classification models, which leverage BERT-based architectures. The package includes scripts for repository mining, dataset creation, data processing, model training, and evaluation. The two main models used are DRAGON and LEGION.

    Key Components:

    • Repository Mining: Scripts to extract repositories for dataset creation.
    • Dataset Preparation: Jupyter notebooks for cleaning and transforming data.
    • Data Processing: Conversion into a Hugging Face dataset format.
    • Model Training: Training scripts for DRAGON and LEGION, with configurable preprocessing options.
    • Evaluation: Threshold tuning and performance assessment.

    Setup

    Before running any commands, ensure you have the necessary dependencies installed. It is recommended to use a virtual environment:

    python3 -m venv venv
    source venv/bin/activate # On Windows use `venv\Scripts\activate`
    pip install -r requirements.txt
    

    Project Structure

    • repository_mining/: Contains scripts for mining the initial set of repositories.
      • repository_mining/doc/: Includes documentation with the necessary information for repository mining.
    • dataset_creation/: Contains all the notebooks to be run sequentially to prepare the dataset.
    • multilabel_class/: Contains scripts for classification, threshold tuning, and evaluation.
      • multilabel_class/model_output/: trained models organized first by dataset, then by model variant.
    • data/: Contains the Hugging Face datasets (our dataset and the LEGION dataset) ready for training/evaluation.

    1️⃣ Data Mining

    To mine the initial set of repositories from Software Heritage, use the scripts available in the repository_mining/ folder. Detailed information and steps for repository mining can be found in:

    repository_mining/doc/
    

    2️⃣ Dataset Creation

    After mining the repositories, prepare the dataset by running the Jupyter notebooks inside the dataset_creation/ folder in sequence. These notebooks handle the data cleaning, transformation, and formatting necessary for model training; each notebook contains documentation explaining every step.

    3️⃣ Data Processing

    Once the dataset is prepared, convert it into a Hugging Face dataset using:

    python3 multilabel_class/create_dataset.py --file_path data/02_processed_datasets/2024-05-22/origin-metadata-readme_names-900000dataset_forks-cleaned.csv
    

    4️⃣ Classification / Training

    Train the DRAGON Model

    After processing the dataset, train the DRAGON model with the following command:

    python3 multilabel_class/tune_thresholds.py --model_type bert --model_variant focal --dataset_path data/03_huggingaceV_datasets/2024-05-22/origin-metadata-readme_names-900000dataset_forks-cleaned/dataset
    

    Ensure Configuration is Set Correctly

    Modify the configuration file multilabel_class/utils/config.py to set the following parameter to True:

    DEFAULT_PREPROCESSING_PARAMS = { 
      'use_sentence_pairs': True # If True, process as (text1, text2); if False, concatenate texts
    }
    

    Training DRAGON Without Sentence Pairs

    To train DRAGON without using sentence pairs, use the same command but set use_sentence_pairs to False in the config file:

    DEFAULT_PREPROCESSING_PARAMS = { 
      'use_sentence_pairs': False
    }
    

    Train DRAGON on a Benchmark Dataset

    To train DRAGON on a benchmark dataset, use:

    python3 multilabel_class/tune_thresholds.py --model_type bert --model_variant focal --dataset_path data/03_huggingaceV_datasets/LEGION/dataset
    

    Ensure the use_sentence_pairs parameter is set to True in config.py.

    Train LEGION on the DRAGON Dataset

    To train LEGION on the DRAGON dataset, use:

    python3 multilabel_class/tune_thresholds.py --model_type bert --model_variant db --dataset_path data/03_huggingaceV_datasets/2024-05-22/origin-metadata-readme_names-900000dataset_forks-cleaned/dataset
    

    Ensure the use_sentence_pairs parameter is set to False in config.py:

    DEFAULT_PREPROCESSING_PARAMS = { 
      'use_sentence_pairs': False
    }
    

    Train LEGION on a Baseline Dataset

    To train LEGION on a baseline dataset, run:

    python3 multilabel_class/tune_thresholds.py --model_type bert --model_variant db --dataset_path data/03_huggingaceV_datasets/LEGION/dataset
    

    5️⃣ Model Evaluation

    Once thresholds are tuned, you can evaluate the model using:

    python3 multilabel_class/evaluation.py --model_type bert --model_variant focal --dataset_path data/03_huggingaceV_datasets/2024-05-22/origin-metadata-readme_names-900000dataset_forks-cleaned/dataset
    

    This evaluation script computes standard multi-label classification metrics including:

    • Micro and macro F1@k for k = 1..5
    • Precision@k and Recall@k for k = 1..5

    Ensure that the model variant and dataset path correspond to the previously trained model.
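
    For reference, the sketch below shows one way Precision@k and Recall@k can be computed for multi-label predictions. It is illustrative only and is not taken from the package's evaluation.py; the function name and the NumPy-based approach are assumptions.

    import numpy as np

    def precision_recall_at_k(y_true, y_scores, k):
        # y_true: (n_samples, n_labels) binary relevance matrix
        # y_scores: (n_samples, n_labels) predicted label scores
        topk = np.argsort(-y_scores, axis=1)[:, :k]            # k highest-scoring labels per sample
        hits = np.take_along_axis(y_true, topk, axis=1).sum(axis=1)
        precision = hits / k                                   # relevant labels among the top k
        recall = hits / np.maximum(y_true.sum(axis=1), 1)      # guard against samples with no labels
        return precision.mean(), recall.mean()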

    Recommended: Evaluation via Notebooks

    For an interactive and visual analysis of model performance, you can also use the provided Jupyter notebooks located in:

    DRAGON_replication/multilabel_class/notebooks/
    

    These notebooks reproduce the complete evaluation pipeline and generate additional visualizations and metrics discussed in the associated paper.

    Both command-line and notebook-based evaluations ensure reproducibility and offer complementary insights into model behavior.

    Instructions for Unzipping Files

    Several folders in this replication package have been compressed into .zip files to reduce package size. Before running any code, you must unzip all the provided .zip files in-place—that is, extract each archive into the same directory as the .zip file, using the same name as the zip file (without the .zip extension).

    For example:

    DRAGON_replication\data\02_processed_dataset\2024-05-22.zip
    

    should be extracted to:

    DRAGON_replication\data\02_processed_dataset\2024-05-22\
    

    List of .zip files to extract

    • DRAGON_replication\data\02_processed_dataset\2024-05-22.zip
    • DRAGON_replication\data\03_huggingaceV_datasets\2024-05-22.zip
    • DRAGON_replication\data\03_huggingaceV_datasets\LEGION.zip
    • DRAGON_replication\dataset_creation\data.zip
    • DRAGON_replication\multilabel_class\model_output\2024-05-22.zip
    • DRAGON_replication\multilabel_class\model_output\LEGION.zip

    Make sure that after extraction, each corresponding folder exists and contains the expected files. Do not change the folder names or directory structure after unzipping.
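
    As an aid (not part of the original package), here is a minimal Python sketch that performs the in-place extraction described above. It assumes it is run from the directory containing DRAGON_replication and that each archive holds the folder contents directly; adjust if an archive already contains a top-level folder. Paths are written with forward slashes, which also work on Windows.

    import zipfile
    from pathlib import Path

    archives = [
        "DRAGON_replication/data/02_processed_dataset/2024-05-22.zip",
        "DRAGON_replication/data/03_huggingaceV_datasets/2024-05-22.zip",
        "DRAGON_replication/data/03_huggingaceV_datasets/LEGION.zip",
        "DRAGON_replication/dataset_creation/data.zip",
        "DRAGON_replication/multilabel_class/model_output/2024-05-22.zip",
        "DRAGON_replication/multilabel_class/model_output/LEGION.zip",
    ]

    for archive in archives:
        path = Path(archive)
        target = path.with_suffix("")        # same directory, same name without .zip
        target.mkdir(parents=True, exist_ok=True)
        with zipfile.ZipFile(path) as zf:
            zf.extractall(target)            # extract next to the original .zip file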

    This README provides an overview of the essential steps for repository mining, dataset preparation, processing, model training, and evaluation. For further customization, refer to the configuration files and experiment with different preprocessing settings.

  4. Merged-LID-20

    • huggingface.co
    Updated Jun 22, 2025
    Cite
    Michiel Kamphuis (2025). Merged-LID-20 [Dataset]. https://huggingface.co/datasets/Michielo/Merged-LID-20
    Dataset updated
    Jun 22, 2025
    Authors
    Michiel Kamphuis
    Description

    Merged-LID-20

    This dataset provides a curated collection of language-specific datasets from Hugging Face, optimized for building and training language identification models. Each dataset includes text samples in a single language, making this an ideal resource for projects involving multilingual natural language processing tasks such as language identification.

      Overview
    

    The dataset collection includes 20 languages, covering a range of language families, scripts, and… See the full description on the dataset page: https://huggingface.co/datasets/Michielo/Merged-LID-20.

  5. SQL Create Context

    • kaggle.com
    • opendatalab.com
    • +1more
    Updated Nov 24, 2023
    Cite
    The Devastator (2023). SQL Create Context [Dataset]. https://www.kaggle.com/datasets/thedevastator/understanding-contextual-questions-answers/code
    Dataset updated
    Nov 24, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    SQL Create Context

    Uncovering Implications and Insights

    By Huggingface Hub [source]

    About this dataset

    This dataset contains a collection of questions and answers that have been contextualized to reveal subtle implications and insights. It is focused on helping researchers gain a deeper understanding of how semantics, context, and other factors affect how people interpret and respond to various conversations about different topics. By exploring this dataset, researchers will be able to uncover the underlying principles governing conversation styles, which can then be applied to better understand attitudes among different groups. With its comprehensive coverage of questions from a variety of sources around the web, this dataset offers an invaluable resource for those looking to analyze discourse in terms of sentiment analysis or opinion mining.


    How to use the dataset

    How to Use This Dataset

    This dataset contains a collection of contextualized questions and answers extracted from various sources around the web, which can be useful for exploring implications and insights. To get started with the dataset:

    • Read through the headings on each column in order to understand the data that has been collected - this will help you identify which pieces of information are relevant for your research project.
    • Explore each column and view what types of responses have been given in response to particular questions or topics - this will give you an idea as to how people interpret specific topics differently when presented with different contexts or circumstances.
    • Next, analyze the responses looking for any patterns or correlations between responses on different topics or contexts - this can help reveal implications and insights previously unknown to you about a particular subject matter. You can also use any data visualization tools such as Tableau or PowerBI to gain deeper understanding into the results and trends within your data set!
    • Finally, use these findings to better inform your project by tailoring future questions around any patterns discovered within your analysis!
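
    A minimal sketch of the loading step described above (illustrative; it assumes train.csv has been downloaded from the Kaggle page, and only the documented context column is referenced):

    import pandas as pd

    train = pd.read_csv("train.csv")
    print(train.columns.tolist())       # inspect which columns the file actually contains
    print(train["context"].head())      # the documented 'context' column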

    Research Ideas

    • To understand the nature of public debates and how people express their opinions in different contexts.
    • To better comprehend the implicit attitudes and assumptions inherent in language use, providing insight into discourse norms on a range of issues.
    • To gain insight into the use of rhetorical devices, such as exaggeration and deceptive tactics, used to influence public opinion on important topics

    Acknowledgements

    If you use this dataset in your research, please credit the original authors.

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: train.csv

    | Column name | Description |
    |:------------|:------------------------------------------------------------------------------|
    | context     | The context in which the question was asked and the answer was given. (Text)   |

    Acknowledgements

    If you use this dataset in your research, please credit the original authors and Huggingface Hub.

  6. sea-vl_crowdsourcing

    • ollama.hf-mirror.com
    • huggingface.co
    Updated Apr 12, 2025
    Cite
    SEACrowd (2025). sea-vl_crowdsourcing [Dataset]. https://ollama.hf-mirror.com/datasets/SEACrowd/sea-vl_crowdsourcing
    Dataset updated
    Apr 12, 2025
    Dataset authored and provided by
    SEACrowd
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    SEA-VL: A Multicultural Vision-Language Dataset for Southeast Asia

    Paper: Crowdsource, Crawl, or Generate? Creating SEA-VL, A Multicultural Vision-Language Dataset for Southeast Asia
    Dataset: SEA-VL Collection on HuggingFace
    Code: SEA-VL Experiment | SEA-VL Image Collection

      What is SEA-VL?
    

    Following the success of our SEACrowd project, we’re excited to announce SEA-VL, a new open-source initiative to create high-quality vision-language datasets specifically for… See the full description on the dataset page: https://huggingface.co/datasets/SEACrowd/sea-vl_crowdsourcing.

  7. realharm

    • huggingface.co
    Updated Mar 24, 2016
    Cite
    Giskard (2016). realharm [Dataset]. https://huggingface.co/datasets/giskardai/realharm
    Dataset updated
    Mar 24, 2016
    Dataset authored and provided by
    Giskard
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    RealHarm

    RealHarm is a collection of harmful real-world interactions with AI agents.

      Dataset Details

      Dataset Description

    RealHarm contains harmful samples, categorized among 10 harm categories. A complete taxonomy has been proposed along with the dataset and is described in the RealHarm paper. Each sample has an associated safe version, for which we rewrote the agent answer to make it harmless. This dataset provides researchers and developers with authentic… See the full description on the dataset page: https://huggingface.co/datasets/giskardai/realharm.

  8. Dolly 15k Dutch

    • zenodo.org
    • huggingface.co
    • +1more
    bin
    Updated Jun 20, 2023
    Cite
    Bram Vanroy; Bram Vanroy (2023). Dolly 15k Dutch [Dataset]. http://doi.org/10.57967/hf/0785
    Dataset updated
    Jun 20, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Bram Vanroy; Bram Vanroy
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0), https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    This dataset contains 14,934 instructions, contexts and responses, in several natural language categories such as classification, closed QA, generation, etc. The English original dataset was created by @databricks, who crowd-sourced the data creation via its employees. The current dataset is a translation of that dataset through ChatGPT (gpt-3.5-turbo).

    Data Instances

    {
     "id": 14963,
     "instruction": "Wat zijn de duurste steden ter wereld?",
     "context": "",
     "response": "Dit is een uitgebreide lijst van de duurste steden: Singapore, Tel Aviv, New York, Hong Kong, Los Angeles, Zurich, Genève, San Francisco, Parijs en Sydney.",
     "category": "brainstorming"
    }
    

    Data Fields

    • id: the ID of the item. The following 77 IDs are not included because they could not be translated (or were too long): [1502, 1812, 1868, 4179, 4541, 6347, 8851, 9321, 10588, 10835, 11257, 12082, 12319, 12471, 12701, 12988, 13066, 13074, 13076, 13181, 13253, 13279, 13313, 13346, 13369, 13446, 13475, 13528, 13546, 13548, 13549, 13558, 13566, 13600, 13603, 13657, 13668, 13733, 13765, 13775, 13801, 13831, 13906, 13922, 13923, 13957, 13967, 13976, 14028, 14031, 14045, 14050, 14082, 14083, 14089, 14110, 14155, 14162, 14181, 14187, 14200, 14221, 14222, 14281, 14473, 14475, 14476, 14587, 14590, 14667, 14685, 14764, 14780, 14808, 14836, 14891, 14966]
    • instruction: the instruction (question)
    • context: additional context that the AI can use to answer the question
    • response: the AI's expected response
    • category: the category of this type of question (see Dolly for more info)

    Dataset Creation

    Both the texts and the topics were translated with OpenAI's API for gpt-3.5-turbo, using max_tokens=1024 and temperature=0 as parameters.

    The prompt template to translate the input is (where src_lang was English and tgt_lang Dutch):

    CONVERSATION_TRANSLATION_PROMPT = """You are asked to translate a task's instruction, optional context to the task, and the response to the task, from {src_lang} to {tgt_lang}.
    
    Here are the requirements that you should adhere to:
    1. maintain the format: the task consists of a task instruction (marked `instruction: `), optional context to the task (marked `context: `) and response for the task marked with `response: `;
    2. do not translate the identifiers `instruction: `, `context: `, and `response: ` but instead copy them to your output;
    3. make sure that text is fluent to read and does not contain grammatical errors. Use standard {tgt_lang} without regional bias;
    4. translate the instruction and context text using informal, but standard, language;
    5. make sure to avoid biases (such as gender bias, grammatical bias, social bias);
    6. if the instruction is to correct grammar mistakes or spelling mistakes then you have to generate a similar mistake in the context in {tgt_lang}, and then also generate a corrected output version in the output in {tgt_lang};
    7. if the instruction is to translate text from one language to another, then you do not translate the text that needs to be translated in the instruction or the context, nor the translation in the response (just copy them as-is);
    8. do not translate code fragments but copy them to your output. If there are English examples, variable names or definitions in code fragments, keep them in English.
    
    Now translate the following task with the requirements set out above. Do not provide an explanation and do not add anything else.
    
    """
    

    The system message was:

    You are a helpful assistant that translates English to Dutch according to the requirements that are given to you.
    

    Note that 77 items (0.5%) were not successfully translated. This can either mean that the prompt was too long for the given limit (max_tokens=1024) or that the generated translation could not be parsed into instruction, context and response fields. The missing IDs are [1502, 1812, 1868, 4179, 4541, 6347, 8851, 9321, 10588, 10835, 11257, 12082, 12319, 12471, 12701, 12988, 13066, 13074, 13076, 13181, 13253, 13279, 13313, 13346, 13369, 13446, 13475, 13528, 13546, 13548, 13549, 13558, 13566, 13600, 13603, 13657, 13668, 13733, 13765, 13775, 13801, 13831, 13906, 13922, 13923, 13957, 13967, 13976, 14028, 14031, 14045, 14050, 14082, 14083, 14089, 14110, 14155, 14162, 14181, 14187, 14200, 14221, 14222, 14281, 14473, 14475, 14476, 14587, 14590, 14667, 14685, 14764, 14780, 14808, 14836, 14891, 14966].
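
    For illustration only (this is not code from the original creation pipeline, which predates the current client interface), a minimal sketch of how such a translation call could look with today's openai Python client, assuming the prompt template above has been filled in and the item's instruction/context/response block appended to it:

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    SYSTEM_MESSAGE = ("You are a helpful assistant that translates English to Dutch "
                      "according to the requirements that are given to you.")

    def translate_item(filled_prompt: str) -> str:
        # filled_prompt = CONVERSATION_TRANSLATION_PROMPT (formatted) followed by the
        # item's instruction/context/response lines, as described above
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system", "content": SYSTEM_MESSAGE},
                {"role": "user", "content": filled_prompt},
            ],
            max_tokens=1024,   # limit reported for the dataset
            temperature=0,     # deterministic output, as reported
        )
        return response.choices[0].message.content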

    Initial Data Collection and Normalization

    Initial data collection by databricks. See their repository for more information about this dataset.

    Considerations for Using the Data

    Note that the translations in this new dataset have not been verified by humans! Use at your own risk, both in terms of quality and biases.

    Discussion of Biases

    As with any machine-generated texts, users should be aware of potential biases that are included in this dataset. Although the prompt specifically includes "make sure to avoid biases (such as gender bias, grammatical bias, social bias)", the impact of such a command is of course not known. It is likely that biases remain in the dataset, so use with caution.

    Other Known Limitations

    The translation quality has not been verified. Use at your own risk!

    Licensing Information

    This repository follows the original databricks license, which is CC BY-SA 3.0 but see below for a specific restriction.

    This text was generated (either in part or in full) with GPT-3 (gpt-3.5-turbo), OpenAI’s large-scale language-generation model. Upon generating draft language, the author reviewed, edited, and revised the language to their own liking and takes ultimate responsibility for the content of this publication.

    If you use this dataset, you must also follow the Sharing and Usage policies.

    As clearly stated in their Terms of Use, specifically 2c.iii, "[you may not] use output from the Services to develop models that compete with OpenAI". That means that you cannot use this dataset to build models that are intended to commercially compete with OpenAI. As far as I am aware, that is a specific restriction that should serve as an addendum to the current license.

    This dataset is also available on the Hugging Face hub, its canonical repository.

  9. TinyStories

    • kaggle.com
    • opendatalab.com
    • +1more
    Updated Nov 24, 2023
    Cite
    The Devastator (2023). TinyStories [Dataset]. https://www.kaggle.com/datasets/thedevastator/tinystories-narrative-classification
    Dataset updated
    Nov 24, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    TinyStories

    A Diverse, Richly Annotated Corpus of Short-Form Stories

    By Huggingface Hub [source]

    About this dataset

    This dataset contains the text of a remarkable collection of short stories known as the TinyStories Corpus. With over 2,000 annotated stories, it is populated with an array of diverse styles and genres from multiple sources. This corpus is enriched by intricate annotations across each narrative content, making it a valuable resource for narrative text classification. The text field in each row includes the entirety of each story that can be used to identify plots, characters and other features associated with story-telling techniques. Through this collection of stories, users will gain an extensive insight into a wide range of narratives which could be used to produce powerful machine learning models for Narrative Text Classification


    How to use the dataset

    In this dataset, each row contains a short story along with its associated labels for narrative text classification tasks. The data consists of the following:

    • text: The story text itself (string)
    • validation.csv: Contains a set of short stories for validation (dataframe)
    • train.csv: Contains the text of short stories used for narrative text classification (dataframe)

    The data contained in both files can be used for various types of machine learning tasks related to narrative text classification. These include but are not limited to experiments such as determining story genres, predicting user reactions, sentiment analysis etc.

    To get started, download the validation and train CSV files from the Kaggle dataset page and save them locally. Once downloaded, you may need to preprocess both files by cleaning up wrongly formatted values or duplicate entries, since these can noticeably affect the accuracy of any downstream results.

    The next step is to load the two files into pandas DataFrames so they can be manipulated and analyzed with common Natural Language Processing (NLP) tooling. This takes only a few lines using pandas functions such as read_csv() and concat(), and the resulting DataFrames work equally well in Jupyter notebooks or with libraries such as scikit-learn for more complex tasks (see the sketch below).

    With the data loaded, you can explore connections between narratives or character traits using supervised models such as a Naive Bayes classifier, and look for the patterns underlying the texts in this richly annotated TinyStories corpus.
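
    A minimal loading sketch (illustrative; it assumes both CSV files sit in the working directory and contain the documented text column):

    import pandas as pd

    train = pd.read_csv("train.csv")
    validation = pd.read_csv("validation.csv")

    print(len(train), len(validation))                      # number of stories per split
    print(train["text"].str.split().str.len().describe())   # rough story-length statistics (in words)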

    Research Ideas

    • Creating a text classification algorithm to automatically categorize short stories by genre.
    • Developing an AI-based summarization tool to quickly summarize the main points in a story.
    • Developing an AI-based story generator that can generate new stories based on existing ones in the dataset

    Acknowledgements

    If you use this dataset in your research, please credit the original authors.

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: validation.csv

    | Column name | Description |
    |:------------|:---------------------------------|
    | text        | The text of the story. (String)  |

    File: train.csv

    | Column name | Description |
    |:--------------|:----------------------------...

  10. sql-text-collection

    • huggingface.co
    Cite
    Alan Tseng, sql-text-collection [Dataset]. https://huggingface.co/datasets/agentlans/sql-text-collection
    Authors
    Alan Tseng
    Description

    SQL Text Collection

    This is a collection of publicly available text-to-SQL datasets.

      Dataset Structure
    

    Each row contains the columns:

    • context: The schema for the database (e.g., CREATE TABLE statements).
    • query: A natural language query or action to perform, expressed in English.
    • source: The original dataset from which the row was sourced.
    • dialect: One or more SQL dialects identified based on dialect-specific keywords found in the context and query. If there are multiple… See the full description on the dataset page: https://huggingface.co/datasets/agentlans/sql-text-collection.
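
    A minimal sketch of loading this collection and filtering on the documented dialect column (illustrative; the split name and the exact dialect labels are assumptions):

    from datasets import load_dataset

    ds = load_dataset("agentlans/sql-text-collection", split="train")   # split name assumed
    print(ds.column_names)          # expect context, query, source, dialect, ...

    sqlite_rows = ds.filter(lambda row: "sqlite" in str(row["dialect"]).lower())
    print(len(sqlite_rows))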

  11. brics-edtech-data-collection

    • huggingface.co
    Cite
    brics-edtech-patent-analysis, brics-edtech-data-collection [Dataset]. https://huggingface.co/datasets/brics-edtech/brics-edtech-data-collection
    Authors
    brics-edtech-patent-analysis
    License

    https://choosealicense.com/licenses/gpl-3.0/

    Description

    brics-edtech-patent-analysis

    paper's repository

    Dataset: Data Collection, Processing, and Annotation

    In this section, we describe the methodology used to create the research dataset, including data sources, processing steps, and annotation by a large language model.

      2.1.1. Source and Data Collection
    

    The primary data source for this study was the patents.google.com database. This platform was chosen for its extensive collection of full-text national and international… See the full description on the dataset page: https://huggingface.co/datasets/brics-edtech/brics-edtech-data-collection.

  12. artworks

    • huggingface.co
    Cite
    Anna Bozhenko, artworks [Dataset]. https://huggingface.co/datasets/anna-bozhenko/artworks
    Authors
    Anna Bozhenko
    Description

    Combined Louvre and Art Institute of Chicago (AIC) Collection Dataset

      Dataset Summary
    

    This dataset merges artwork information from two prominent museum collections: the Musée du Louvre and The Art Institute of Chicago (AIC). It combines data from the Louvre Paper and Canvas Collection and the AIC Dataset 0.2 datasets. Due to differences in the original datasets' schemas, a decision was made to focus on common fields and create a non-atomic full_info field containing… See the full description on the dataset page: https://huggingface.co/datasets/anna-bozhenko/artworks.

  13. SPML_Chatbot_Prompt_Injection

    • huggingface.co
    Updated Dec 11, 2024
    Cite
    Reshabh K Sharma (2024). SPML_Chatbot_Prompt_Injection [Dataset]. https://huggingface.co/datasets/reshabhs/SPML_Chatbot_Prompt_Injection
    Dataset updated
    Dec 11, 2024
    Authors
    Reshabh K Sharma
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    SPML Chatbot Prompt Injection Dataset

    Arxiv Paper

    Introducing the SPML Chatbot Prompt Injection Dataset: a robust collection of system prompts designed to create realistic chatbot interactions, coupled with a diverse array of annotated user prompts that attempt to carry out prompt injection attacks. While other datasets in this domain have centered on less practical chatbot scenarios or have limited themselves to "jailbreaking" – just one aspect of prompt injection – our dataset… See the full description on the dataset page: https://huggingface.co/datasets/reshabhs/SPML_Chatbot_Prompt_Injection.

  14. YouTube-Commons

    • huggingface.co
    Updated Apr 17, 2024
    Cite
    PleIAs (2024). YouTube-Commons [Dataset]. https://huggingface.co/datasets/PleIAs/YouTube-Commons
    Dataset updated
    Apr 17, 2024
    Dataset authored and provided by
    PleIAs
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    YouTube
    Description

    📺 YouTube-Commons 📺

    YouTube-Commons is a collection of audio transcripts of 2,063,066 videos shared on YouTube under a CC-By license.

      Content
    

    The collection comprises 22,709,724 original and automatically translated transcripts from 3,156,703 videos (721,136 individual channels). In total, this represents nearly 45 billion words (44,811,518,375). All the videos were shared on YouTube with a CC-BY license: the dataset provides all the necessary provenance information… See the full description on the dataset page: https://huggingface.co/datasets/PleIAs/YouTube-Commons.

  15. ultrachat_200k

    • huggingface.co
    • opendatalab.com
    Updated Oct 29, 2023
    + more versions
    Cite
    Hugging Face H4 (2023). ultrachat_200k [Dataset]. https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k
    Dataset updated
    Oct 29, 2023
    Dataset provided by
    Hugging Face (https://huggingface.co/)
    Authors
    Hugging Face H4
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Dataset Card for UltraChat 200k

      Dataset Description
    

    This is a heavily filtered version of the UltraChat dataset and was used to train Zephyr-7B-β, a state-of-the-art 7B chat model. The original dataset consists of 1.4M dialogues generated by ChatGPT and spanning a wide range of topics. To create UltraChat 200k, we applied the following logic:

    Selection of a subset of data for faster supervised fine tuning. Truecasing of the dataset, as we observed around 5% of the data… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k.

  16. Children-Stories-Collection

    • huggingface.co
    Cite
    Feynman Innovations, Children-Stories-Collection [Dataset]. http://doi.org/10.57967/hf/2480
    Authors
    Feynman Innovations
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Children Stories Collection

    A synthetic dataset of around 0.9 million stories written especially for young children. You can use these datasets directly for training large models. A total of 10 datasets are available for download; you can use any one or all of the JSON files for training. The datasets are in "prompt" and "text" format, and the total token length is also provided. Thank you for your love & support.

  17. NeurIPS-LLM-data

    • huggingface.co
    Updated Mar 4, 2024
    Cite
    Upaya (2024). NeurIPS-LLM-data [Dataset]. https://huggingface.co/datasets/upaya07/NeurIPS-LLM-data
    Dataset updated
    Mar 4, 2024
    Dataset authored and provided by
    Upaya
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    🤖 We curated this dataset for NeurIPS Large Language Model Efficiency Challenge: 1 LLM + 1GPU + 1Day. 🚀 Our Birbal-7B-V1 fine-tuned on this dataset achieved 🏆 first rank 🏆 in the competition.

    Here is high-level diagram of our data preparation strategy:

      Natural Instructions Dataset Preparation
    

    The Natural Instructions dataset is a community effort to create a large collection of tasks and their natural language definitions/instructions. As shown in the diagram above, we sample from… See the full description on the dataset page: https://huggingface.co/datasets/upaya07/NeurIPS-LLM-data.

  18. mm2_user

    • huggingface.co
    Updated Feb 15, 2022
    Cite
    Addison (2022). mm2_user [Dataset]. https://huggingface.co/datasets/TheGreatRambler/mm2_user
    Dataset updated
    Feb 15, 2022
    Authors
    Addison
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Mario Maker 2 users

    Part of the Mario Maker 2 Dataset Collection

      Dataset Description
    

    The Mario Maker 2 users dataset consists of 6 million users from Nintendo's online service totaling around 1.2GB of data. The dataset was created using the self-hosted Mario Maker 2 api over the course of 1 month in February 2022.

      How to use it
    

    The Mario Maker 2 users dataset is a very large dataset so for most use cases it is recommended to make use of the streaming API of… See the full description on the dataset page: https://huggingface.co/datasets/TheGreatRambler/mm2_user.
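
    A minimal sketch of that streaming approach, assuming the truncated sentence above refers to the streaming mode of the datasets library and that the data lives in a train split:

    from datasets import load_dataset

    # streaming=True iterates over records without downloading the full ~1.2 GB up front
    users = load_dataset("TheGreatRambler/mm2_user", split="train", streaming=True)

    for i, user in enumerate(users):
        print(user)                 # one user record at a time
        if i == 2:
            break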

  19. agnews

    • huggingface.co
    Updated Apr 7, 2025
    Cite
    Sentence Transformers (2025). agnews [Dataset]. https://huggingface.co/datasets/sentence-transformers/agnews
    Dataset updated
    Apr 7, 2025
    Dataset authored and provided by
    Sentence Transformers
    Description

    Dataset Card for AGNews

    This dataset is a collection of title-description pairs collected from AGNews. See the AG News corpus for additional information. This dataset can be used directly with Sentence Transformers to train embedding models.

      Dataset Subsets

      pair subset

    Columns: "title", "description"
    Column types: str, str
    Examples: { 'title': 'Helicopter Crashes in Colombian Drug War, Kills 20', 'description': 'BOGOTA, Colombia - A U.S.-made helicopter on… See the full description on the dataset page: https://huggingface.co/datasets/sentence-transformers/agnews.
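
    A minimal sketch of using the pair subset with Sentence Transformers (illustrative; the "pair" config name, the train split, and the choice of loss are assumptions based on the columns described above):

    from datasets import load_dataset
    from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses

    train_ds = load_dataset("sentence-transformers/agnews", "pair", split="train")
    model = SentenceTransformer("all-MiniLM-L6-v2")
    loss = losses.MultipleNegativesRankingLoss(model)   # treats (title, description) as positive pairs

    trainer = SentenceTransformerTrainer(model=model, train_dataset=train_ds, loss=loss)
    trainer.train()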

  20. opus_books

    • huggingface.co
    Updated Mar 29, 2024
    Cite
    Language Technology Research Group at the University of Helsinki (2024). opus_books [Dataset]. https://huggingface.co/datasets/Helsinki-NLP/opus_books
    Dataset updated
    Mar 29, 2024
    Dataset authored and provided by
    Language Technology Research Group at the University of Helsinki
    License

    https://choosealicense.com/licenses/other/

    Description

    Dataset Card for OPUS Books

      Dataset Summary
    

    This is a collection of copyright-free books aligned by Andras Farkas, which are available from http://www.farkastranslations.com/bilingual_books.php. Note that the texts are rather dated due to copyright issues and that some of them are manually reviewed (check the meta-data at the top of the corpus files in XML). The source is multilingually aligned, which is available from http://www.farkastranslations.com/bilingual_books.php.… See the full description on the dataset page: https://huggingface.co/datasets/Helsinki-NLP/opus_books.
