88 datasets found
  1. airoboros-gpt4

    • huggingface.co
    Updated Jun 4, 2023
    + more versions
    Cite
    Jon Durbin (2023). airoboros-gpt4 [Dataset]. https://huggingface.co/datasets/jondurbin/airoboros-gpt4
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 4, 2023
    Authors
    Jon Durbin
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    The data was generated by GPT-4 and is therefore subject to the OpenAI Terms of Service. The tool used to generate the data, airoboros, is Apache-2.0 licensed. Specific areas of focus for this training data:

    • trivia
    • math
    • nonsensical math
    • coding
    • closed context question answering
    • closed context question answering, with multiple contexts to choose from as confounding factors
    • writing
    • multiple choice

      Usage and License Notices
    

    All airoboros models and datasets are intended and licensed for research use only.… See the full description on the dataset page: https://huggingface.co/datasets/jondurbin/airoboros-gpt4.
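
    For quick exploration, the dataset can be pulled with the Hugging Face datasets library. A minimal sketch, assuming a single train split (the record does not list split or column names):

    from datasets import load_dataset

    # Load from the Hugging Face Hub and peek at one record.
    # The split name 'train' is an assumption; inspect the builder first if unsure.
    ds = load_dataset("jondurbin/airoboros-gpt4", split="train")
    print(ds)      # features and number of rows
    print(ds[0])   # first record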

  2. GPT4-8K

    • huggingface.co
    Updated Jan 6, 2024
    Cite
    Erfan zare chavoshi (2024). GPT4-8K [Dataset]. https://huggingface.co/datasets/erfanzar/GPT4-8K
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 6, 2024
    Authors
    Erfan zare chavoshi
    Description

    Dataset Card for "GPT4-8K"

      Dataset Description
    

    This dataset was generated using GPT-4, a powerful language model developed by OpenAI. It contains a collection of dialogs between a user and an assistant, along with additional information, sourced from OpenChat.

      Dataset Configurations
    

    The dataset includes the following configurations:

    • Config Name: default
    • Data Files: Split: train, Path: data/train-*

      Dataset… See the full description on the dataset page: https://huggingface.co/datasets/erfanzar/GPT4-8K.
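
    Since the card above notes a default configuration with a train split at data/train-*, the data can be sampled without a full download via streaming. A sketch (field names are not documented here, so keys are inspected rather than assumed):

    from datasets import load_dataset

    # Stream the default config's train split; avoids downloading all shards.
    ds = load_dataset("erfanzar/GPT4-8K", split="train", streaming=True)
    first = next(iter(ds))
    print(list(first.keys()))  # inspect the dialog fields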
    
  3. Model Output of GPT-3.5 and GPT-4 for ECHR-AM

    • data.niaid.nih.gov
    • explore.openaire.eu
    Updated Dec 13, 2024
    Cite
    Mitrović, Jelena (2024). Model Output of GPT-3.5 and GPT-4 for ECHR-AM [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8246128
    Explore at:
    Dataset updated
    Dec 13, 2024
    Dataset provided by
    Zubaer, Abdullah Al
    Granitzer, Michael
    Mitrović, Jelena
    Description

    "gpt3.5-gpt4-input-output-echram.zip" :

    Input and output of GPT-3.5 and GPT-4, based on the ECHR dataset published in JSON format in this paper, for argument component classification only, i.e., clauses that are argumentative (conclusion/premise), extracted from the JSON file.

    Note: Output of the model is under OpenAI Terms & policies.

    If you use this dataset, please also cite our paper: Performance analysis of large language models in the domain of legal argument mining

    BibTeX:

    @ARTICLE{10.3389/frai.2023.1278796,
      AUTHOR={Al Zubaer, Abdullah and Granitzer, Michael and Mitrović, Jelena},
      TITLE={Performance analysis of large language models in the domain of legal argument mining},
      JOURNAL={Frontiers in Artificial Intelligence},
      VOLUME={6},
      YEAR={2023},
      URL={https://www.frontiersin.org/journals/artificial-intelligence/articles/10.3389/frai.2023.1278796},
      DOI={10.3389/frai.2023.1278796},
      ISSN={2624-8212},
      ABSTRACT={Generative pre-trained transformers (GPT) have recently demonstrated excellent performance in various natural language tasks. The development of ChatGPT and the recently released GPT-4 model has shown competence in solving complex and higher-order reasoning tasks without further training or fine-tuning. However, the applicability and strength of these models in classifying legal texts in the context of argument mining are yet to be realized and have not been tested thoroughly. In this study, we investigate the effectiveness of GPT-like models, specifically GPT-3.5 and GPT-4, for argument mining via prompting. We closely study the model's performance considering diverse prompt formulation and example selection in the prompt via semantic search using state-of-the-art embedding models from OpenAI and sentence transformers. We primarily concentrate on the argument component classification task on the legal corpus from the European Court of Human Rights. To address these models' inherent non-deterministic nature and make our result statistically sound, we conducted 5-fold cross-validation on the test set. Our experiments demonstrate, quite surprisingly, that relatively small domain-specific models outperform GPT 3.5 and GPT-4 in the F1-score for premise and conclusion classes, with 1.9% and 12% improvements, respectively. We hypothesize that the performance drop indirectly reflects the complexity of the structure in the dataset, which we verify through prompt and data analysis. Nevertheless, our results demonstrate a noteworthy variation in the performance of GPT models based on prompt formulation. We observe comparable performance between the two embedding models, with a slight improvement in the local model's ability for prompt selection. This suggests that local models are as semantically rich as the embeddings from the OpenAI model. Our results indicate that the structure of prompts significantly impacts the performance of GPT models and should be considered when designing them.}
    }

  4. Estimated water consumption for training GPT-3 2023

    • ai-chatbox.pro
    • statista.com
    Updated Nov 19, 2024
    Cite
    Statista (2024). Estimated water consumption for training GPT-3 2023 [Dataset]. https://www.ai-chatbox.pro/?_=%2Fstatistics%2F1536925%2Fgpt-3-estimated-water-consumption-training%2F%23XgboDwS6a1rKoGJjSPEePEUG%2FVFd%2Bik%3D
    Explore at:
    Dataset updated
    Nov 19, 2024
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Jul 2023
    Area covered
    Worldwide
    Description

    GPT-3's water consumption for the training phase was estimated at roughly 4.8 billion liters of water, assuming the model was trained in Microsoft's Iowa data center (OpenAI has disclosed that the data center was used for training parts of the GPT-4 model). If the model had been fully trained in the Washington data center, water consumption could have been as high as 15 billion liters. That would have amounted to more than Microsoft's total water withdrawals in 2023.

  5. llm-training-dataset

    • huggingface.co
    Cite
    Unidata, llm-training-dataset [Dataset]. https://huggingface.co/datasets/UniDataPro/llm-training-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Authors
    Unidata
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0), https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    LLM Fine-Tuning Dataset - 4,000,000+ logs, 32 languages

    The dataset contains over 4 million logs written in 32 languages and is tailored for LLM training. It includes log and response pairs from 3 models, and is designed for language-model instruction fine-tuning to achieve improved performance in various NLP tasks.

      Models used for text generation:
    

    • GPT-3.5
    • GPT-4
    • Uncensored GPT version (not included in the sample)

      Languages in the… See the full description on the dataset page: https://huggingface.co/datasets/UniDataPro/llm-training-dataset.
    
  6. o3-mini dataset.

    • plos.figshare.com
    • figshare.com
    csv
    Updated Jul 7, 2025
    + more versions
    Cite
    Myung Hye Yoo; Joungmin Kim; Sanghoun Song (2025). o3-mini dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0326943.s003
    Explore at:
    Available download formats: csv
    Dataset updated
    Jul 7, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Myung Hye Yoo; Joungmin Kim; Sanghoun Song
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This study examines the multilingual capabilities of GPT, focusing on its handling of syntactic ambiguity across English, Korean, and Japanese. We investigate whether GPT can capture language-specific attachment preferences or if it relies primarily on English-centric training patterns. Using ambiguous relative clauses as a testing ground, we assess GPT’s interpretation tendencies across language contexts. Our findings reveal that, while GPT (GPT-3.5-turbo, GPT-4-turbo, GPT-4o)’s performance aligns with native English speakers’ preferred interpretations, it overgeneralizes this interpretation in Korean and lacks clear preferences in Japanese, despite distinct attachment biases among native speakers of these languages. The newer, smaller-scale models—o1-mini and o3-mini—further reinforce this trend by closely mirroring English attachment patterns in both Korean and Japanese. Overall results suggest that GPT’s multilingual proficiency is limited, likely reflecting a bias toward high-resource languages like English, although differences in model size and tuning strategies may partially mitigate the extent of English-centric generalization. While GPT models demonstrate aspects of human-like language processing, our findings underscore the need for further refinement to achieve a more nuanced engagement with linguistic diversity across languages.

  7. alpaca-gpt4-data-zh

    • huggingface.co
    Updated Apr 11, 2023
    + more versions
    Cite
    Chris Alexiuk (2023). alpaca-gpt4-data-zh [Dataset]. https://huggingface.co/datasets/llm-wizard/alpaca-gpt4-data-zh
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 11, 2023
    Authors
    Chris Alexiuk
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Card for "alpaca-gpt4-data-zh"

    All of the work is done by this team.

      Usage and License Notices
    

    The data is intended and licensed for research use only. The dataset is CC BY NC 4.0 (allowing only non-commercial use) and models trained using the dataset should not be used outside of research purposes.

      English Dataset
    

    Found here

      Citation
    

    @article{peng2023gpt4llm, title={Instruction Tuning with GPT-4}, author={Baolin Peng, Chunyuan Li… See the full description on the dataset page: https://huggingface.co/datasets/llm-wizard/alpaca-gpt4-data-zh.

  8. Alpaca GPT-4

    • kaggle.com
    • opendatalab.com
    • +1more
    Updated Nov 24, 2023
    Cite
    The Devastator (2023). Alpaca GPT-4 [Dataset]. https://www.kaggle.com/datasets/thedevastator/gpt-4-instruction-following-dataset/versions/2
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 24, 2023
    Dataset provided by
    Kaggle
    Authors
    The Devastator
    License

    CC0 1.0 Universal, https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Alpaca GPT-4

    High-Performance NLP for Instruction-Following Reasoning

    By Huggingface Hub [source]

    About this dataset

    This dataset consists of 52K instruction-following data generated by GPT-4 in English using the same prompts as in Alpaca. This data has been crafted specifically to help researchers break ground and explore new strategies for natural language processing, with a special focus on instruction-following reasoning.

    What makes this dataset unique and powerful is the variety of options it offers for experimenting with models that can excel at instruction-following tasks: from refining specific components, such as predicting outputs or analyzing long textual conversations, to training and evaluating end-to-end approaches on the entire collection. It allows researchers to iterate quickly while working with a high-performing model with few limitations, making it a valuable resource for anyone pushing the boundaries of artificial intelligence techniques for logical reasoning problems.



    How to use the dataset

    This dataset is an invaluable resource for researching artificial intelligence approaches to logical reasoning problems. This dataset consists of 52K instruction-following samples generated by GPT-4 in English using the same prompts as in Alpaca. Here are some tips on how to make the most out of this dataset:

    • The columns in this dataset provide the essential data for evaluating models on instruction-following tasks: instruction, input, output, and text. To use this data effectively, researchers should be familiar with each column and its purpose (a loading sketch follows this list):
      a) The 'instruction' column provides a statement that an AI model must interpret in order to complete a task correctly;
      b) The 'input' column is pre-generated data that helps an AI model make sense of the instruction;
      c) The 'output' column indicates the result that must be returned once the AI model interprets the instruction correctly; and finally,
      d) The 'text' column is the full text generated by GPT-4, which gives deeper insight into how the input and instruction gave rise to the output.

      Note: researchers should pay attention to all four columns when working with this dataset, as the four components function together as an integral whole.

      To get better results, consider fine-tuning existing schemes so they become better suited for instruction-following tasks, using these four columns as guidance. It would also be useful if the dataset came with corresponding hyperparameters, so users could fine-tune models more quickly without losing accuracy or other metrics needed in such scenarios.

      Additionally, readers should review the context closely when assessing accuracy, since some model types may yield more accurate results while taking longer to process, or vice versa; the appropriate metric and model type will vary with the dataset and scenario.
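
    As a concrete starting point, the four columns can be loaded with pandas. A sketch (the file name train.csv is taken from the Columns section below):

    import pandas as pd

    # Load the instruction-following data; columns per the description above.
    df = pd.read_csv("train.csv", usecols=["instruction", "input", "output", "text"])
    print(df.head(3))
    print(df["instruction"].str.len().describe())  # rough sense of instruction lengths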

    Research Ideas

    • Training intelligent conversational agents with instruction-following reasoning capabilities.
    • Developing more complex and powerful instructions processing models driven by natural language understanding and reasoning algorithms.
    • Establishing an online platform that helps academic, business, or other organizations construct auto-grading systems for evaluating the instruction-following skills of their staff at large scale and at relatively low cost

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: train.csv | Colu...

  9. Implications for future LLM research.

    • plos.figshare.com
    xls
    Updated Jan 18, 2024
    Cite
    Jack Gallifant; Amelia Fiske; Yulia A. Levites Strekalova; Juan S. Osorio-Valencia; Rachael Parke; Rogers Mwavu; Nicole Martinez; Judy Wawira Gichoya; Marzyeh Ghassemi; Dina Demner-Fushman; Liam G. McCoy; Leo Anthony Celi; Robin Pierce (2024). Implications for future LLM research. [Dataset]. http://doi.org/10.1371/journal.pdig.0000417.t002
    Explore at:
    Available download formats: xls
    Dataset updated
    Jan 18, 2024
    Dataset provided by
    PLOS Digital Health
    Authors
    Jack Gallifant; Amelia Fiske; Yulia A. Levites Strekalova; Juan S. Osorio-Valencia; Rachael Parke; Rogers Mwavu; Nicole Martinez; Judy Wawira Gichoya; Marzyeh Ghassemi; Dina Demner-Fushman; Liam G. McCoy; Leo Anthony Celi; Robin Pierce
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The study provides a comprehensive review of OpenAI’s Generative Pre-trained Transformer 4 (GPT-4) technical report, with an emphasis on applications in high-risk settings like healthcare. A diverse team, including experts in artificial intelligence (AI), natural language processing, public health, law, policy, social science, healthcare research, and bioethics, analyzed the report against established peer review guidelines. The GPT-4 report shows a significant commitment to transparent AI research, particularly in creating a systems card for risk assessment and mitigation. However, it reveals limitations such as restricted access to training data, inadequate confidence and uncertainty estimations, and concerns over privacy and intellectual property rights. Key strengths identified include the considerable time and economic investment in transparent AI research and the creation of a comprehensive systems card. On the other hand, the lack of clarity in training processes and data raises concerns about encoded biases and interests in GPT-4. The report also lacks confidence and uncertainty estimations, crucial in high-risk areas like healthcare, and fails to address potential privacy and intellectual property issues. Furthermore, this study emphasizes the need for diverse, global involvement in developing and evaluating large language models (LLMs) to ensure broad societal benefits and mitigate risks. The paper presents recommendations such as improving data transparency, developing accountability frameworks, establishing confidence standards for LLM outputs in high-risk settings, and enhancing industry research review processes. It concludes that while GPT-4’s report is a step towards open discussions on LLMs, more extensive interdisciplinary reviews are essential for addressing bias, harm, and risk concerns, especially in high-risk domains. The review aims to expand the understanding of LLMs in general and highlights the need for new reflection forms on how LLMs are reviewed, the data required for effective evaluation, and addressing critical issues like bias and risk.

  10. GPT-4-Prompts

    • huggingface.co
    Updated Dec 22, 2024
    Cite
    Erfan zare chavoshi (2024). GPT-4-Prompts [Dataset]. https://huggingface.co/datasets/erfanzar/GPT-4-Prompts
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 22, 2024
    Authors
    Erfan zare chavoshi
    License

    MIT License, https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Multi-Turn Conversational Prompts from ChatGPT-4 (10K+ Tokens)

    Abstract: This dataset offers a valuable collection of multi-turn conversational prompts generated by ChatGPT-4, carefully curated for diverse prompt styles (chatml, gemma, llama). Each prompt exceeds 10,000 tokens, providing ample context and inspiration for training and evaluating large language models. Ideal for researchers and developers interested in exploring advanced conversational AI capabilities. Table of Contents:… See the full description on the dataset page: https://huggingface.co/datasets/erfanzar/GPT-4-Prompts.

  11. gpt3

    • tensorflow.org
    Updated Dec 19, 2023
    Cite
    (2023). gpt3 [Dataset]. https://www.tensorflow.org/datasets/catalog/gpt3
    Explore at:
    Dataset updated
    Dec 19, 2023
    Description

    Synthetic datasets for word scramble and arithmetic tasks described in the GPT-3 paper.

    To use this dataset:

    import tensorflow_datasets as tfds

    # Load the train split and print the first four examples.
    ds = tfds.load('gpt3', split='train')
    for ex in ds.take(4):
      print(ex)

    See the guide for more information on tensorflow_datasets.

  12. Replication Data for: Large Language Models as a Substitute for Human...

    • search.dataone.org
    Updated Mar 6, 2024
    Cite
    Heseltine, Michael (2024). Replication Data for: Large Language Models as a Substitute for Human Experts in Annotating Political Text [Dataset]. http://doi.org/10.7910/DVN/V2P6YL
    Explore at:
    Dataset updated
    Mar 6, 2024
    Dataset provided by
    Harvard Dataverse
    Authors
    Heseltine, Michael
    Description

    Large-scale text analysis has grown rapidly as a method in political science and beyond. To date, text-as-data methods rely on large volumes of human-annotated training examples, which places a premium on researcher resources. However, advances in large language models (LLMs) may make automated annotation increasingly viable. This paper tests the performance of GPT-4 across a range of scenarios relevant for analysis of political text. We compare GPT-4 coding with human expert coding of tweets and news articles across four variables (whether text is political, its negativity, its sentiment, and its ideology) and across four countries (the United States, Chile, Germany, and Italy). GPT-4 coding is highly accurate, especially for shorter texts such as tweets, correctly classifying texts up to 95% of the time. Performance drops for longer news articles, and very slightly for non-English text. We introduce a "hybrid" coding approach, in which disagreements of multiple GPT-4 runs are adjudicated by a human expert, which boosts accuracy. Finally, we explore downstream effects, finding that transformer models trained on hand-coded or GPT-4-coded data yield almost identical outcomes. Our results suggest that LLM-assisted coding is a viable and cost-efficient approach, although consideration should be given to task complexity.
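
    The "hybrid" approach described above reduces to unanimity checking with human adjudication of disagreements. The following is an illustrative sketch of that idea, not the authors' replication code:

    from collections import Counter

    def hybrid_code(gpt4_runs, ask_human):
        """gpt4_runs: labels from repeated GPT-4 runs on one text.
        ask_human: callback invoked only when the runs disagree."""
        counts = Counter(gpt4_runs)
        label, votes = counts.most_common(1)[0]
        if votes == len(gpt4_runs):   # unanimous: accept the GPT-4 label
            return label
        return ask_human(gpt4_runs)   # disagreement: human expert adjudicates

    # Example: three runs on one tweet; the human resolves the split decision.
    print(hybrid_code(["political", "political", "not political"],
                      ask_human=lambda runs: "political"))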

  13. Data for "Flexible, Model-Agnostic Method for Materials Data Extraction from...

    • figshare.com
    xlsx
    Updated May 10, 2024
    Cite
    Maciej Polak; Dane Morgan; Shrey Modi; Jinming Zhang; Anna Latosinska; Shaonan Wang; Jasmine Wang; Ayan Deep Hazra (2024). Data for "Flexible, Model-Agnostic Method for Materials Data Extraction from Text Using General Purpose Language Models" [Dataset]. http://doi.org/10.6084/m9.figshare.21861948.v5
    Explore at:
    Available download formats: xlsx
    Dataset updated
    May 10, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Maciej Polak; Dane Morgan; Shrey Modi; Jinming Zhang; Anna Latosinska; Shaonan Wang; Jasmine Wang; Ayan Deep Hazra
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Datasets for the paper entitled "Flexible, Model-Agnostic Method for Materials Data Extraction from Text Using General Purpose Language Models" by Maciej P. Polak, Shrey Modi, Anna Latosinska, Jinming Zhang, Ching-Wen Wang, Shaonan Wang, Ayan Deep Hazra, and Dane Morgan.

    • MPPolak_BulkModulus_ValidationData.xlsx - a dataset of bulk modulus sentences, positive (containing bulk modulus data) and negative (not containing data), used for model assessment.
    • MPPolak_BulkModulus_AllTrainData.xlsx - a dataset of bulk modulus sentences, positive (containing bulk modulus data) and negative (not containing data), used for fine-tuning of the model and for model assessment.
    • MPPolak_CritCoolRate_Dataset.xlsx - a dataset of critical cooling rates for metallic glasses developed in this paper with the method presented in the paper, consisting of names of materials, values of critical cooling rates, their units, and DOIs of the source documents.
    • MPPolak_DataExtraction_codes.zip - simple example codes necessary to reproduce the results. The provided 'positive' and 'negative' files are shortened versions of the training data allowing for quick execution and testing. The 'pos' and 'neg' files contain the full testing sets. The 'plotting' directory contains data and scripts that allow reproducing the figures.
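
    A loading sketch for the critical cooling rate file (column names are not listed in the record, so they are inspected rather than assumed; reading .xlsx with pandas requires openpyxl):

    import pandas as pd

    # Inspect the real column names first: material names, critical cooling
    # rates, units, and source DOIs are expected per the description above.
    df = pd.read_excel("MPPolak_CritCoolRate_Dataset.xlsx")
    print(df.columns.tolist())
    print(df.head())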

  14. Data from: MetaHarm: Harmful YouTube Video Dataset Annotated by Domain...

    • zenodo.org
    • data.niaid.nih.gov
    Updated Jun 12, 2025
    Cite
    Wonjeong Jo; Wonjeong Jo; Magdalena Wojcieszak; Magdalena Wojcieszak (2025). MetaHarm: Harmful YouTube Video Dataset Annotated by Domain Experts, GPT-4-Turbo, and Crowdworkers [Dataset]. http://doi.org/10.5281/zenodo.14647452
    Explore at:
    Dataset updated
    Jun 12, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Wonjeong Jo; Wonjeong Jo; Magdalena Wojcieszak; Magdalena Wojcieszak
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    YouTube
    Description

    We provide text metadata, image frames, and thumbnails of YouTube videos classified as harmful or harmless by domain experts, GPT-4-Turbo, and crowdworkers. Harmful videos are categorized into one or more of six harm categories: Information harms (IH), Hate and Harassment harms (HH), Clickbait harms (CB), Addictive harms (ADD), Sexual harms (SXL), and Physical harms (PH).

    This repository includes the text metadata and a link to external cloud storage for the image data.

    Text Metadata

    Folder | Subfolder | #Videos
    Ground Truth | Harmful_full_agreement (classified as harmful by all three actors) | 5,109
    Ground Truth | Harmful_subset_agreement (classified as harmful by more than two actors) | 14,019
    Domain Experts | Harmful | 15,115
    Domain Experts | Harmless | 3,303
    GPT-4-Turbo | Harmful | 10,495
    GPT-4-Turbo | Harmless | 7,818
    Crowdworkers (workers from Amazon Mechanical Turk) | Harmful | 12,668
    Crowdworkers | Harmless | 4,390
    Unannotated large pool | - | 60,906

    Note. The term "actor" refers to the annotating entities: domain experts, GPT-4-Turbo, and crowdworkers.

    Explanations about the indicators

    1. Ground truth - harmful_full_agreement & harmful_subset_agreement
    - links
    - video_id
    - channel
    - description
    - transcript
    - date
    - maj_harmcat: In the full_agreement version, this represents a harm category identified by all three actors. In the subset_agreement version, it represents a harm category classified by more than two actors.
    - all_harmcat: This includes all harm categories classified by any of the actors without requiring agreement. It captures all classified categories.
    2. Domain Experts, GPT-4-Turbo, Crowdworkers
    - links
    - video_id
    - channel
    - description
    - transcript
    - date
    - harmcat
    3. Unannotated large pool
    - links
    - video_id
    - channel
    - description
    - transcript
    - date
    Note. Some data from the external dataset does not include date information. In such cases, the date was marked as 1990-01-01.
    We retrieved transcripts using the YouTubeTranscriptApi. If a video does not have any text data in the transcript section, it means the API failed to retrieve the transcript, possibly because the video does not contain any detectable language.
    Some image frames are also available in the pickle file.
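
    A minimal sketch of transcript retrieval with the YouTubeTranscriptApi, mirroring the empty-transcript convention described above (video_id is a placeholder; the classic get_transcript interface is assumed):

    from youtube_transcript_api import YouTubeTranscriptApi

    def fetch_transcript_text(video_id):
        """Return the concatenated transcript text, or '' if none is retrievable."""
        try:
            # Each segment is a dict with 'text', 'start', and 'duration'.
            segments = YouTubeTranscriptApi.get_transcript(video_id)
        except Exception:
            return ""  # mirrors the dataset's empty-transcript convention
        return " ".join(seg["text"] for seg in segments)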

    Image data

    The image frames and thumbnails are available at this link: https://ucdavis.app.box.com/folder/302772803692?s=d23b20snl1slwkuh4pgvjs31m7r1xae2
    1. Image frames (imageframes_1-20.zip): Image frames are organized into 20 zip folders due to the large size of the image frames. Each zip folder contains subfolders named after the unique video IDs of the annotated videos. Inside each subfolder, there are 15 sequentially numbered image frames (from 0 to 14) extracted from the corresponding video. The image frame folders do not distinguish between videos classified as harmful or non-harmful.
    2. Thumbnails (Thumbnails.zip): The zip folder contains thumbnails from the individual videos used in classification. Each thumbnail is named using the unique video ID. This folder does not distinguish between videos classified as harmful or harmless.

    Related works (in preprint)

    For details about the harm classification taxonomy and the performance comparison between crowdworkers, GPT-4-Turbo, and domain experts, please see https://arxiv.org/abs/2411.05854.

  15. LLM - Detect AI Datamix

    • kaggle.com
    Updated Feb 2, 2024
    Cite
    Raja Biswas (2024). LLM - Detect AI Datamix [Dataset]. https://www.kaggle.com/datasets/conjuring92/ai-mix-v26/discussion
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 2, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Raja Biswas
    License

    Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This is the datamix created by Team 🔍 📝 🕵️‍♂️ 🤖 during the LLM - Detect AI Generated Text competition. This dataset helped us win the competition. It supports a text-classification task: separating LLM-generated essays from student-written ones.

    It was developed incrementally, focusing on size, diversity, and complexity. For each datamix iteration, we attempted to plug blind spots of the previous generation of models while maintaining robustness.

    To maximally leverage in-domain human texts, we used the entire PERSUADE corpus comprising all 15 prompts. We also included diverse human texts from sources such as the OpenAI GPT-2 output dataset, the ELLIPSE corpus, NarrativeQA, Wikipedia, the NLTK Brown corpus, and IMDB movie reviews.

    Sources for our generated essays can be grouped under four categories (the fourth is detailed below):

    • Proprietary LLMs (gpt-3.5, gpt-4, claude, cohere, gemini, palm)
    • Open-source LLMs (llama, falcon, mistral, mixtral)
    • Existing LLM-generated text datasets
      • Synthetic dataset made by T5
      • DAIGT V2 subset
      • OUTFOX
      • Ghostbuster
      • gpt-2-output-dataset

    • Fine-tuned open-source LLMs (mistral, llama, falcon, deci-lm, t5, pythia, OPT, BLOOM, GPT2). For LLM fine-tuning, we leveraged the PERSUADE corpus in different ways:
      • Instruction tuning: Instructions were composed of different metadata e.g. prompt name, holistic essay score, ELL status and grade level. Responses were the corresponding student essays.
      • One topic held out: LLMs fine-tuned on PERSUADE essays with one prompt held out. When generating, only the held out prompt essays were generated. This was done to encourage new writing styles.
      • Span wise generation: Generate one span (discourse) at a time conditioned on the remaining essay.

    We used a wide variety of generation configs and prompting strategies to promote diversity and complexity in the data. Generated essays leveraged a combination of the following:

    • Contrastive search
    • Use of guidance scale, typical_p, suppress_tokens
    • High temperature and large values of top-k
    • Prompting to fill in the blank: randomly mask words in an essay and ask the LLM to reconstruct the original essay (similar to MLM)
    • Prompting without source texts
    • Prompting with source texts
    • Prompting to rewrite existing essays

    Finally, we incorporated augmented essays to make our models aware of typical attacks on LLM content detection systems and obfuscations present in the provided training data. We mainly used a combination of the following augmentations on a random subset of essays (a sketch follows this list):

    • Spelling correction
    • Deletion/insertion/swapping of characters
    • Replacement with synonyms
    • Introduced obfuscations
    • Back translation
    • Random capitalization
    • Sentence swapping
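
    A minimal sketch (not the team's actual code) of the character-level corruptions above, applying deletion, insertion, swapping, and random capitalization to a small fraction of positions:

    import random
    import string

    def augment_chars(text, rate=0.02, seed=None):
        """Corrupt roughly `rate` of character positions with one random op each."""
        rng = random.Random(seed)
        chars = list(text)
        out = []
        i = 0
        while i < len(chars):
            c = chars[i]
            if rng.random() < rate:
                op = rng.choice(["delete", "insert", "swap", "recase"])
                if op == "delete":
                    i += 1          # drop the current character
                    continue
                if op == "insert":
                    out.append(rng.choice(string.ascii_lowercase))
                elif op == "swap" and i + 1 < len(chars):
                    chars[i], chars[i + 1] = chars[i + 1], chars[i]
                    c = chars[i]
                elif op == "recase":
                    c = c.swapcase()
            out.append(c)
            i += 1
        return "".join(out)

    print(augment_chars("Students wrote this essay without assistance.", rate=0.1, seed=0))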

  16. Curlie Enhanced with LLM Annotations: Two Datasets for Advancing...

    • zenodo.org
    • data.niaid.nih.gov
    csv
    Updated Dec 21, 2023
    Cite
    Peter Nutter; Mika Senghaas; Ludek Cizinsky; Peter Nutter; Mika Senghaas; Ludek Cizinsky (2023). Curlie Enhanced with LLM Annotations: Two Datasets for Advancing Homepage2Vec's Multilingual Website Classification [Dataset]. http://doi.org/10.5281/zenodo.10413068
    Explore at:
    Available download formats: csv
    Dataset updated
    Dec 21, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Peter Nutter; Mika Senghaas; Ludek Cizinsky; Peter Nutter; Mika Senghaas; Ludek Cizinsky
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Advancing Homepage2Vec with LLM-Generated Datasets for Multilingual Website Classification

    This dataset contains two subsets of labeled website data, specifically created to enhance the performance of Homepage2Vec, a multi-label model for website classification. The datasets were generated using Large Language Models (LLMs) to provide more accurate and diverse topic annotations for websites, addressing a limitation of existing Homepage2Vec training data.

    Key Features:

    • LLM-generated annotations: Both datasets feature website topic labels generated using LLMs, a novel approach to creating high-quality training data for website classification models.
    • Improved multi-label classification: Fine-tuning Homepage2Vec with these datasets has been shown to improve its macro F1 score from 38% to 43% evaluated on a human-labeled dataset, demonstrating their effectiveness in capturing a broader range of website topics.
    • Multilingual applicability: The datasets facilitate classification of websites in multiple languages, reflecting the inherent multilingual nature of Homepage2Vec.

    Dataset Composition:

    • curlie-gpt3.5-10k: 10,000 websites labeled using GPT-3.5, context 2 and 1-shot
    • curlie-gpt4-10k: 10,000 websites labeled using GPT-4, context 2 and zero-shot
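
    As an illustration, the two subsets might be loaded as follows (the file names are assumptions derived from the subset names, not confirmed by the record):

    import pandas as pd

    # Hypothetical file names; check the actual names in the Zenodo record.
    gpt35 = pd.read_csv("curlie-gpt3.5-10k.csv")
    gpt4 = pd.read_csv("curlie-gpt4-10k.csv")
    print(len(gpt35), len(gpt4))  # expected: 10,000 websites each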

    Intended Use:

    • Fine-tuning and advancing Homepage2Vec or similar website classification models
    • Research on LLM-generated datasets for text classification tasks
    • Exploration of multilingual website classification

    Additional Information:

    Acknowledgments:

    This dataset was created as part of a project at EPFL's Data Science Lab (DLab) in collaboration with Prof. Robert West and Tiziano Piccardi.

  17. Replication Package for "Improving the Readability of Generated Tests Using...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Oct 5, 2023
    Cite
    Gregory Gay (2023). Replication Package for "Improving the Readability of Generated Tests Using GPT-4 and ChatGPT Code Interpreter" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8289841
    Explore at:
    Dataset updated
    Oct 5, 2023
    Dataset authored and provided by
    Gregory Gay
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    While automated test generation can decrease the human burden associated with testing, it does not eliminate this burden. Humans must still work with generated test cases to interpret testing results, debug the code, build and maintain a comprehensive test suite, and perform many other tasks. Therefore, a major challenge with automated test generation is the understandability of generated test cases.

    Large language models (LLMs), machine learning models trained on massive corpora of textual data - including both natural language and programming languages - are an emerging technology with great potential for performing language-related predictive tasks such as translation, summarization, and decision support.

    In this study, we are exploring the capabilities of LLMs with regard to improving test case understandability.

    This package contains the data produced during this exploration:

    The examples directory contains the three case studies we tested our transformation process on:

    queue_example: Tests of a basic queue data structure

    httpie_sessions: Tests of the sessions module from the httpie project.

    string_utils_validation: Tests of the validation module from the python-string-utils project.

    Each directory contains the modules-under-test, the original test cases generated by Pynguin, and the transformed test cases.

    Two trials were performed per case example of the transformation technique to assess the impact of different results from the LLM.

    The survey directory contains the survey that was sent to assess the impact of the transformation on test readability.

    survey.pdf contains the survey questions.

    responses.xlsx contains the survey results.

  18. Data_Sheet_1_Performance analysis of large language models in the domain of...

    • figshare.com
    pdf
    Updated Nov 17, 2023
    + more versions
    Cite
    Abdullah Al Zubaer; Michael Granitzer; Jelena Mitrović (2023). Data_Sheet_1_Performance analysis of large language models in the domain of legal argument mining.PDF [Dataset]. http://doi.org/10.3389/frai.2023.1278796.s001
    Explore at:
    Available download formats: pdf
    Dataset updated
    Nov 17, 2023
    Dataset provided by
    Frontiers
    Authors
    Abdullah Al Zubaer; Michael Granitzer; Jelena Mitrović
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Generative pre-trained transformers (GPT) have recently demonstrated excellent performance in various natural language tasks. The development of ChatGPT and the recently released GPT-4 model has shown competence in solving complex and higher-order reasoning tasks without further training or fine-tuning. However, the applicability and strength of these models in classifying legal texts in the context of argument mining are yet to be realized and have not been tested thoroughly. In this study, we investigate the effectiveness of GPT-like models, specifically GPT-3.5 and GPT-4, for argument mining via prompting. We closely study the model's performance considering diverse prompt formulation and example selection in the prompt via semantic search using state-of-the-art embedding models from OpenAI and sentence transformers. We primarily concentrate on the argument component classification task on the legal corpus from the European Court of Human Rights. To address these models' inherent non-deterministic nature and make our result statistically sound, we conducted 5-fold cross-validation on the test set. Our experiments demonstrate, quite surprisingly, that relatively small domain-specific models outperform GPT 3.5 and GPT-4 in the F1-score for premise and conclusion classes, with 1.9% and 12% improvements, respectively. We hypothesize that the performance drop indirectly reflects the complexity of the structure in the dataset, which we verify through prompt and data analysis. Nevertheless, our results demonstrate a noteworthy variation in the performance of GPT models based on prompt formulation. We observe comparable performance between the two embedding models, with a slight improvement in the local model's ability for prompt selection. This suggests that local models are as semantically rich as the embeddings from the OpenAI model. Our results indicate that the structure of prompts significantly impacts the performance of GPT models and should be considered when designing them.

  19. MULTITuDE

    • data.europa.eu
    • zenodo.org
    unknown
    Updated Dec 18, 2023
    Cite
    Zenodo (2023). MULTITuDE [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-10013755?locale=sk
    Explore at:
    Available download formats: unknown
    Dataset updated
    Dec 18, 2023
    Dataset authored and provided by
    Zenodo (http://zenodo.org/)
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    MULTITuDE is a benchmark dataset for multilingual machine-generated text detection, described in an EMNLP 2023 conference paper. It consists of 7,992 human-written news texts in 11 languages subsampled from MassiveSumm, accompanied by 66,089 texts generated by 8 large language models (by using headlines of news articles). The creation process and scripts for replication/extension are located in a GitHub repository. If you use this dataset in any publication, project, tool, or in any other form, please cite the paper.

    Fields

    The dataset has the following fields:

    • 'text' - a text sample
    • 'label' - 0 for human-written text, 1 for machine-generated text
    • 'multi_label' - a string naming the large language model that generated the text, or the string "human" for a human-written text
    • 'split' - a string identifying the train or test split of the dataset, for training and evaluation respectively
    • 'language' - the ISO 639-1 language code identifying the language of the given text
    • 'length' - word count of the given text
    • 'source' - a string identifying the source dataset / news medium of the given text

    Statistics (number of samples)

    Splits: train - 44,786; test - 29,295

    Binary labels: 0 - 7,992; 1 - 66,089

    Multiclass labels: gpt-3.5-turbo - 8,300; gpt-4 - 8,300; text-davinci-003 - 8,297; alpaca-lora-30b - 8,290; vicuna-13b - 8,287; opt-66b - 8,229; llama-65b - 8,229; opt-iml-max-1.3b - 8,157; human - 7,992

    Languages: English (en) - 29,460 (train + test); Spanish (es) - 11,586 (train + test); Russian (ru) - 11,578 (train + test); Dutch (nl) - 2,695 (test); Catalan (ca) - 2,691 (test); Czech (cs) - 2,689 (test); German (de) - 2,685 (test); Chinese (zh) - 2,683 (test); Portuguese (pt) - 2,673 (test); Arabic (ar) - 2,673 (test); Ukrainian (uk) - 2,668 (test)
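
    A filtering sketch, assuming the dataset is materialized as a single delimited table with the fields above (the record lists the download format as "unknown", so the file name and format here are placeholders):

    import pandas as pd

    df = pd.read_csv("multitude.tsv", sep="\t")  # hypothetical file name/format
    train = df[df["split"] == "train"]
    en_machine = train[(train["language"] == "en") & (train["label"] == 1)]
    print(en_machine["multi_label"].value_counts())  # per-model counts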

  20. AI Training GPU Cluster Market Research Report 2033

    • growthmarketreports.com
    csv, pdf, pptx
    Updated Jun 29, 2025
    Cite
    Growth Market Reports (2025). AI Training GPU Cluster Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/ai-training-gpu-cluster-market
    Explore at:
    Available download formats: pptx, pdf, csv
    Dataset updated
    Jun 29, 2025
    Dataset authored and provided by
    Growth Market Reports
    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    AI Training GPU Cluster Market Outlook



    According to our latest research, the AI Training GPU Cluster market size reached USD 9.3 billion globally in 2024, reflecting the surging demand for high-performance computing resources in artificial intelligence development. The market is anticipated to grow at a robust CAGR of 23.7% during the forecast period, with projections indicating a value of approximately USD 74.2 billion by 2033. This remarkable expansion is primarily driven by the exponential growth in AI-powered applications, the proliferation of large language models, and the increasing adoption of generative AI solutions across industries. As organizations worldwide accelerate digital transformation and automation initiatives, the demand for scalable and efficient GPU clusters for AI training continues to intensify.



    One of the foremost growth factors propelling the AI Training GPU Cluster market is the rapid advancement and complexity of AI models, particularly in the realms of deep learning and natural language processing. As models such as GPT-4, BERT, and other transformer-based architectures become more intricate and data-hungry, the computational requirements for training these models have skyrocketed. Traditional CPU-based systems are no longer sufficient to handle these workloads efficiently. Instead, organizations are increasingly investing in GPU clusters, which offer the parallel processing power necessary for faster model training and iteration. This shift is especially pronounced in sectors like autonomous vehicles, healthcare diagnostics, and financial modeling, where the speed and accuracy of AI models can yield significant competitive advantages.



    Another significant driver is the democratization of AI development and the rise of cloud-based AI platforms. The availability of AI training GPU clusters through public cloud providers such as AWS, Google Cloud, and Microsoft Azure has lowered the entry barrier for startups and enterprises alike. This trend is fueling innovation across both established corporations and emerging players, enabling them to experiment with and deploy sophisticated AI models without the need for massive upfront hardware investments. Furthermore, the integration of advanced management software and orchestration tools is making it easier for organizations to scale their AI workloads dynamically, optimize resource utilization, and reduce operational complexity, thereby further accelerating market growth.



    The growing emphasis on edge computing and hybrid cloud strategies is also shaping the trajectory of the AI Training GPU Cluster market. As enterprises seek to process and analyze data closer to the source for latency-sensitive applications, there is a rising demand for hybrid and on-premises GPU clusters. This is particularly relevant in industries such as manufacturing, automotive, and telecommunications, where real-time decision-making is critical. The convergence of 5G, IoT, and AI is amplifying this demand, prompting vendors to develop flexible cluster architectures that can seamlessly integrate on-premises, cloud-based, and edge resources. This evolution is not only expanding the addressable market but also fostering new collaboration models between technology providers and end-users.



    Regionally, North America remains the dominant force in the AI Training GPU Cluster market, accounting for over 40% of global revenue in 2024. The region’s leadership can be attributed to its concentration of leading AI research institutions, technology giants, and a vibrant ecosystem of startups. However, Asia Pacific is emerging as the fastest-growing market, driven by substantial investments in AI infrastructure by China, Japan, and South Korea. Europe is also witnessing steady growth, bolstered by government initiatives supporting AI innovation and digital transformation. Latin America and the Middle East & Africa, while still nascent, are expected to register notable growth rates as AI adoption accelerates across various sectors. This regional diversification underscores the global nature of AI-driven transformation and the widespread need for advanced GPU cluster solutions.



