95 datasets found
  1. AI vs Human Text Balanced 360k+ records

    • kaggle.com
    zip
    Updated Sep 7, 2025
    Cite
    ARJUN VERMA (2025). AI vs Human Text Balanced 360k+ records [Dataset]. https://www.kaggle.com/datasets/arjunverma2004/ai-vs-human-text-balanced-180k-records
    Explore at:
    zip (263,669,225 bytes)
    Dataset updated
    Sep 7, 2025
    Authors
    ARJUN VERMA
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This dataset contains a collection of AI-generated and human-written text samples. It is designed for research in AI text detection, natural language processing, and machine learning tasks such as binary classification.

    The original dataset, created by Shane Gerami, provided a large set of examples for distinguishing between AI and human writing.

    In this version, I have balanced the dataset to ensure equal representation of AI and human text. This makes it more suitable for training and evaluating machine learning models without bias toward one class.

    🔍 Features
    - Text: The written passage (AI or human).
    - Label: 0 → Human-written, 1 → AI-generated

    ⚡ Use Cases
    - Training models to detect AI-generated text
    - Benchmarking text classification approaches
    - Research in AI detection, authorship attribution, and content moderation
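The Text/Label schema above can be checked for the claimed class balance with a short sketch. The exact CSV column names (`Text`, `Label`) are an assumption taken from the feature list:

```python
import csv
import io

def class_counts(csv_text, label_field="Label"):
    """Tally rows per label value; a balanced file yields equal counts."""
    counts = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        counts[row[label_field]] = counts.get(row[label_field], 0) + 1
    return counts

# Tiny in-memory stand-in for the real CSV download.
sample = (
    "Text,Label\n"
    "A human-written sentence.,0\n"
    "An AI-generated sentence.,1\n"
    "Another human sentence.,0\n"
    "Another AI sentence.,1\n"
)
print(class_counts(sample))  # -> {'0': 2, '1': 2}
```

For the real file, pass the contents of the downloaded CSV instead of `sample`.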

    🙌 Acknowledgments

    Original dataset by Shane Gerami: AI vs Human Text

    This balanced version prepared and published by @arjunverma2004

  2. Data from: ALHD: A Large-Scale and Multigenre Benchmark Dataset for Arabic...

    • zenodo.org
    bin, csv, zip
    Updated Oct 8, 2025
    Cite
    Ali Khairallah; Arkaitz Zubiaga (2025). ALHD: A Large-Scale and Multigenre Benchmark Dataset for Arabic LLM-Generated Text Detection [Dataset]. http://doi.org/10.5281/zenodo.17249602
    Explore at:
    zip, csv, bin
    Dataset updated
    Oct 8, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Ali Khairallah; Arkaitz Zubiaga
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description
    The ALHD (Arabic LLM and Human Dataset) is a large-scale, multigenre, and comprehensive Arabic dataset explicitly designed to distinguish between human- and LLM-generated texts. ALHD spans three genres (news, social media, reviews), covers both MSA and dialectal Arabic, and contains over 400K balanced samples generated by three leading LLMs and drawn from multiple human sources, enabling the study of generalizability in Arabic LLM-generated text detection.
  3. Urdu Human and AI text Dataset (UHAT)

    • ieee-dataport.org
    Updated Jul 20, 2025
    + more versions
    Cite
    Muhammad Ammar (2025). Urdu Human and AI text Dataset (UHAT) [Dataset]. https://ieee-dataport.org/documents/urdu-human-and-ai-text-dataset-uhat
    Explore at:
    Dataset updated
    Jul 20, 2025
    Authors
    Muhammad Ammar
    Description

    Dataset Overview: This dataset is designed for Urdu text classification.

  4. ShutterStock Dataset for AI vs Human-Gen. Image

    • kaggle.com
    zip
    Updated Jun 19, 2025
    Cite
    Sachin Singh (2025). ShutterStock Dataset for AI vs Human-Gen. Image [Dataset]. https://www.kaggle.com/datasets/shreyasraghav/shutterstock-dataset-for-ai-vs-human-gen-image
    Explore at:
    zip (11,617,243,112 bytes)
    Dataset updated
    Jun 19, 2025
    Authors
    Sachin Singh
    License

    Database Contents License (DbCL) v1.0: http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    ShutterStock AI vs. Human-Generated Image Dataset

    This dataset is curated to facilitate research in distinguishing AI-generated images from human-created ones, leveraging ShutterStock data. As AI-generated imagery becomes more sophisticated, developing models that can classify and analyze such images is crucial for applications in content moderation, digital forensics, and media authenticity verification.

    Dataset Overview:

    • Total Images: 100,000
    • Training Data: 80,000 images (majority AI-generated)
    • Test Data: 20,000 images
    • Image Sources: A mix of AI-generated images and real photographs or illustrations created by human artists
    • Labeling: Each image is labeled as either AI-generated or human-created

    Potential Use Cases:

    • AI-Generated Image Detection: Train models to distinguish between AI and human-made images.
    • Deep Learning & Computer Vision Research: Develop and benchmark CNNs, transformers, and other architectures.
    • Generative Model Evaluation: Compare AI-generated images to real images for quality assessment.
    • Digital Forensics: Identify synthetic media for applications in fake image detection.
    • Ethical AI & Content Authenticity: Study the impact of AI-generated visuals in media and ensure transparency.

    Why This Dataset?

    With the rise of generative AI models like Stable Diffusion, DALL·E, and MidJourney, the ability to differentiate between synthetic and real images has become a crucial challenge. This dataset offers a structured way to train AI models on this task, making it a valuable resource for both academic research and practical applications.

    Explore the dataset and contribute to advancing AI-generated content detection!

    Step 1: Install and Authenticate Kaggle API

    If you haven't installed the Kaggle API, run:
      pip install kaggle

    Then, download your kaggle.json API key from your Kaggle account settings and move it to ~/.kaggle/ (Linux/Mac) or C:\Users\<YourUser>\.kaggle\ (Windows).

    Step 2: Download the Dataset

    Use the credentials stored in kaggle.json (a username/key pair) with Kaggle's dataset download endpoint:

      curl -L -o dataset.zip -u "$(jq -r .username ~/.kaggle/kaggle.json):$(jq -r .key ~/.kaggle/kaggle.json)" "https://www.kaggle.com/api/v1/datasets/download/shreyasraghav/shutterstock-dataset-for-ai-vs-human-gen-image"


    Step 3: Extract the Dataset

    Once downloaded, extract the dataset using:
      unzip dataset.zip -d dataset_folder

    Now your dataset is ready to use! 🚀

  5. Tran et al. Final_Dataset.xlsx

    • figshare.com
    xlsx
    Updated Nov 12, 2024
    Cite
    Van Hieu Tran; Yakub Sebastian; Asif Karim; Sami Azam (2024). Tran et al. Final_Dataset.xlsx [Dataset]. http://doi.org/10.6084/m9.figshare.27619839.v1
    Explore at:
    xlsx
    Dataset updated
    Nov 12, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Van Hieu Tran; Yakub Sebastian; Asif Karim; Sami Azam
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Artificial Intelligence (AI) has emerged as a critical challenge to the authenticity of journalistic content, raising concerns over the ease with which artificially generated articles can mimic human-written news. This study focuses on using machine learning to identify distinguishing features, or “stylistic fingerprints,” of AI-generated and human-authored journalism. By analyzing these unique characteristics, we aim to classify news pieces with high accuracy, enhancing our ability to verify the authenticity of digital news.

    To conduct this study, we gathered a balanced dataset of 150 original journalistic articles and their 150 AI-generated counterparts, sourced from popular news websites. A variety of lexical, syntactic, and readability features were extracted from each article to serve as input data for training machine learning models. Five classifiers were then trained to evaluate how accurately they could distinguish between authentic and artificial articles, with each model learning specific patterns and variations in writing style.

    In addition to model training, BERTopic, a topic modeling technique, was applied to extract salient keywords from the journalistic articles. These keywords were used to prompt Google’s Gemini, an AI text generation model, to create artificial articles on the same topics as the original human-written pieces. This ensured a high level of relevance between authentic and AI-generated articles, which added complexity to the classification task.

    Among the five classifiers tested, the Random Forest model delivered the best performance, achieving an accuracy of 98.3% along with high precision (0.984), recall (0.983), and F1-score (0.983). Feature importance analyses were conducted using methods like Random Forest Feature Importance, Analysis of Variance (ANOVA), Mutual Information, and Recursive Feature Elimination. This analysis revealed that the top five discriminative features were sentence length range, paragraph length coefficient of variation, verb ratio, sentence complexity tags, and paragraph length range. These features appeared to encapsulate subtle but meaningful stylistic differences between human and AI-generated content.

    This research makes a significant contribution to combating disinformation by offering a robust method for authenticating journalistic content. By employing machine learning to identify subtle linguistic patterns, this study not only advances our understanding of AI in journalism but also enhances the tools available to ensure the credibility of news in the digital age.
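Two of the top discriminative features named above, sentence length range and paragraph length coefficient of variation, can be sketched directly from raw text. The sentence and paragraph splitting rules below are simplifying assumptions, not the paper's exact preprocessing:

```python
import re
import statistics

def stylistic_features(article):
    """Compute two stylistic features: sentence length range (in words)
    and paragraph length coefficient of variation."""
    sentences = [s for s in re.split(r"[.!?]+\s*", article) if s.strip()]
    paragraphs = [p for p in article.split("\n\n") if p.strip()]
    sent_lens = [len(s.split()) for s in sentences]
    para_lens = [len(p.split()) for p in paragraphs]
    return {
        "sentence_length_range": max(sent_lens) - min(sent_lens),
        "paragraph_length_cv": statistics.pstdev(para_lens) / statistics.mean(para_lens),
    }

text = "Short one. This sentence is quite a bit longer than the first.\n\nSecond paragraph here."
print(stylistic_features(text))
```

Features like these would feed the Random Forest classifier described in the study.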

  6. Speech Dataset of Human and AI-Generated Voices

    • data.mendeley.com
    • kaggle.com
    Updated Sep 15, 2025
    Cite
    Huzain Azis (2025). Speech Dataset of Human and AI-Generated Voices [Dataset]. http://doi.org/10.17632/5czyx2vppv.2
    Explore at:
    Dataset updated
    Sep 15, 2025
    Authors
    Huzain Azis
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    This dataset consists of audio recordings in the Indonesian language, categorized into two distinct classes: human voices (real) and synthetic voices generated using artificial intelligence (AI). Each class comprises 21 audio files, resulting in a total of 42 audio files. Each recording has a duration ranging from approximately 4 to 9 minutes, with an average length of around 6 minutes per file. All recordings are provided in WAV format and accompanied by a CSV file containing detailed duration metadata for each audio file.

    This dataset is suitable for research and applications in speech recognition, voice authenticity detection, audio analysis, and related fields. It enables comparative analysis between natural Indonesian speech and AI-generated synthetic speech.
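The duration metadata described above can be recomputed from the WAV files themselves with Python's standard library, a handy cross-check against the CSV:

```python
import io
import wave

def wav_duration_seconds(wav_bytes):
    """Duration of a WAV file: frame count divided by sample rate."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as w:
        return w.getnframes() / w.getframerate()

# Build a 2-second silent mono WAV in memory to demonstrate.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)        # 16-bit samples
    w.setframerate(16000)    # 16 kHz
    w.writeframes(b"\x00\x00" * 32000)  # 32000 frames = 2 s
print(wav_duration_seconds(buf.getvalue()))  # -> 2.0
```

For the real dataset, read each .wav file's bytes and compare the computed durations against the CSV entries.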

  7. AI-Generated vs Human-Written Text Dataset

    • kaggle.com
    zip
    Updated Sep 17, 2025
    Cite
    Batyr Sharimbayev (2025). AI-Generated vs Human-Written Text Dataset [Dataset]. https://www.kaggle.com/datasets/hardkazakh/ai-generated-vs-human-written-text-dataset
    Explore at:
    zip (83,075 bytes)
    Dataset updated
    Sep 17, 2025
    Authors
    Batyr Sharimbayev
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Description

    This dataset contains two main collections of texts:
    - AI-Generated Texts: Produced using ChatGPT, Gemini, Grok, Deepseek in response to academic-style prompts across multiple domains, including Mathematics, Biology, History, Economics, Computer Science, and IELTS-style essays.
    - Human-Written Texts: Collected from authentic academic sources such as arXiv, including metadata (author, year, and source).

    To simulate diverse writing conditions, the dataset is extended with different variations of AI outputs, such as paraphrasing, translation, and humanization. This allows researchers to study AI text detection, authorship classification, and style transfer.

    Variables and How They Are Obtained

    1. Generated

    Texts produced by ChatGPT, Gemini, Grok, Deepseek in response to academic prompts. Each prompt specifies a subject area and includes formatting restrictions to avoid the use of mathematical formulas, symbols, lists, and special formatting.

    Prompts for Generated Texts:

    - Mathematics: "Explain the fundamental principles of calculus, including differentiation and integration, with real-world applications. Instructions: 1) Write about 400 words. 2) Avoid Mathematical formulas and symbols. 3) If possible avoid itemization. 4) Avoid bold letters, headers, etc."
    - Biology: "Explain the process of cellular respiration and its role in energy production within living organisms. Instructions: 1) Write about 400 words. 2) Avoid Mathematical formulas and symbols. 3) If possible avoid itemization. 4) Avoid bold letters, headers, etc."
    - History: "Analyze the causes and consequences of the Industrial Revolution, highlighting its impact on global economies and societies. Instructions: 1) Write about 400 words. 2) Avoid Mathematical formulas and symbols. 3) If possible avoid itemization. 4) Avoid bold letters, headers, etc."
    - Economics: "Explain the principles of supply and demand and their effects on market equilibrium, with examples. Instructions: 1) Write about 400 words. 2) Avoid Mathematical formulas and symbols. 3) If possible avoid itemization. 4) Avoid bold letters, headers, etc."
    - Computer Science: "Describe the basics of machine learning, including supervised and unsupervised learning techniques. Instructions: 1) Write about 400 words. 2) Avoid Mathematical formulas and symbols. 3) If possible avoid itemization. 4) Avoid bold letters, headers, etc."
    - IELTS Essay: "Provide 400-word passage written at an IELTS Band 6 level: Government investment in the arts, such as music and theatre, is a waste of money. Governments must invest this money in public services instead. To what extent do you agree with this statement?"

    2. Paraphrased

    Reworded versions of the AI-generated texts.
    - Obtained using QuillBot paraphrasing tool (default settings).
    - Example instruction: “Paraphrase the following text to avoid direct repetition but keep the meaning the same.”

    3. Translated

    AI-generated texts translated into another language and back into English to simulate style distortion.
    - Step 1: Translated into Russian with Yandex Translate.
    - Step 2: Back-translated into English using Google Translate.

    4. Humanized

    AI-generated texts rewritten to resemble writing by a non-native English speaker at approximately IELTS Band 6 level. The style reflects competent English usage but with minor errors and awkward phrasing.

    Prompt for Humanized Texts:

    Rewrite the following text passage to reflect the writing style of a non-native English speaker who has achieved a band level 6 in IELTS writing. This level indicates a competent user of English, but with some inaccuracies, inappropriate usage, and misunderstandings. The text should be mostly clear but may contain occasional errors in grammar, vocabulary, and coherence.

    Text Passage for Rewriting: [Insert text here]

    Note: Aim for errors that are typical of an IELTS band level 6 writer. These could include minor grammatical mistakes, slight misuse of vocabulary, and occasional awkward phrasing. However, the overall meaning of the text should remain clear and understandable.

    Word Count: approximately 400

    5. Human-Written

    Authentic texts authored by researchers.
    - Sources: arXiv.org.
    - Metadata includes author name, publication year, and source.

  8. Trojan Detection Software Challenge - image-classification-jun2020-train

    • data.nist.gov
    • nist.gov
    • +1more
    Updated Mar 31, 2020
    + more versions
    Cite
    Michael Paul Majurski (2020). Trojan Detection Software Challenge - image-classification-jun2020-train [Dataset]. http://doi.org/10.18434/M32195
    Explore at:
    Dataset updated
    Mar 31, 2020
    Dataset provided by
    National Institute of Standards and Technology (http://www.nist.gov/)
    Authors
    Michael Paul Majurski
    License

    https://www.nist.gov/open/license

    Description

    Round 1 Training Dataset

    The data being generated and disseminated is the training data used to construct trojan detection software solutions. This data, generated at NIST, consists of human level AIs trained to perform a variety of tasks (image classification, natural language processing, etc.). A known percentage of these trained AI models have been poisoned with a known trigger which induces incorrect behavior. This data will be used to develop software solutions for detecting which trained AI models have been poisoned via embedded triggers.

    This dataset consists of 1000 trained, human level, image classification AI models using the following architectures: Inception-v3, DenseNet-121, and ResNet50. The models were trained on synthetically created image data of non-real traffic signs superimposed on road background scenes. Half (50%) of the models have been poisoned with an embedded trigger which causes misclassification of the images when the trigger is present.

    Errata: This dataset had a software bug in the trigger embedding code that caused 4 models trained for this dataset to have a ground truth value of 'poisoned' but which did not contain any embedded triggers. These models should not be used. Models without a trigger embedded:
    - id-00000184
    - id-00000599
    - id-00000858
    - id-00001088

    Google Drive Mirror: https://drive.google.com/open?id=1uwVt3UCRL2fCX9Xvi2tLoz_z-DwbU6Ce
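Given the errata, a minimal filter that drops the four affected model IDs before training or evaluation (IDs taken verbatim from the errata list):

```python
# The four models the errata flags as mislabeled 'poisoned'; they should be skipped.
ERRATA_IDS = {"id-00000184", "id-00000599", "id-00000858", "id-00001088"}

def usable_models(model_ids):
    """Return the model IDs with the errata models filtered out."""
    return [m for m in model_ids if m not in ERRATA_IDS]

sample = ["id-00000183", "id-00000184", "id-00000599", "id-00001000"]
print(usable_models(sample))  # -> ['id-00000183', 'id-00001000']
```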

  9. AH&AITD – Arslan’s Human and AI Text Database

    • figshare.com
    xlsx
    Updated May 24, 2025
    + more versions
    Cite
    Arslan Akram (2025). AH&AITD – Arslan’s Human and AI Text Database [Dataset]. http://doi.org/10.6084/m9.figshare.29144348.v1
    Explore at:
    xlsx
    Dataset updated
    May 24, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Arslan Akram
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    AH&AITD is a comprehensive benchmark dataset designed to support the evaluation of AI-generated text detection tools. The dataset contains 11,580 samples spanning both human-written and AI-generated content across multiple domains. It was developed to address limitations in previous datasets, particularly in terms of diversity, scale, and real-world applicability, and to enable fair benchmarking of detection tools across varied writing styles and content categories.

    Composition:

    1. Human-Written Samples (Total: 5,790), collected from:
    - Open Web Text (2,343 samples)
    - Blogs (196 samples)
    - Web Text (397 samples)
    - Q&A Platforms (670 samples)
    - News Articles (430 samples)
    - Opinion Statements (1,549 samples)
    - Scientific Research Abstracts (205 samples)

    2. AI-Generated Samples (Total: 5,790), generated using:
    - ChatGPT (1,130 samples)
    - GPT-4 (744 samples)
    - Paraphrase Models (1,694 samples)
    - GPT-2 (328 samples)
    - GPT-3 (296 samples)
    - DaVinci (GPT-3.5 variant) (433 samples)
    - GPT-3.5 (364 samples)
    - OPT-IML (406 samples)
    - Flan-T5 (395 samples)

    Citation: Akram, A. (2023). AH&AITD: Arslan’s Human and AI Text Database [Dataset]. Associated with the article: An Empirical Study of AI-Generated Text Detection Tools. Advances in Machine Learning & Artificial Intelligence, 4(2), 44–55.
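The composition figures above can be cross-checked with quick arithmetic; the counts below are transcribed from the description:

```python
# Per-source sample counts transcribed from the dataset description.
human = {"Open Web Text": 2343, "Blogs": 196, "Web Text": 397, "Q&A Platforms": 670,
         "News Articles": 430, "Opinion Statements": 1549, "Scientific Research Abstracts": 205}
ai = {"ChatGPT": 1130, "GPT-4": 744, "Paraphrase Models": 1694, "GPT-2": 328,
      "GPT-3": 296, "DaVinci (GPT-3.5 variant)": 433, "GPT-3.5": 364,
      "OPT-IML": 406, "Flan-T5": 395}

# Both halves should total 5,790, for 11,580 samples overall.
print(sum(human.values()), sum(ai.values()), sum(human.values()) + sum(ai.values()))
# -> 5790 5790 11580
```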

  10. Trojan Detection Software Challenge - Round 2 Training Dataset

    • data.nist.gov
    • nist.gov
    • +1more
    Updated Aug 5, 2020
    Cite
    Michael Paul Majurski (2020). Trojan Detection Software Challenge - Round 2 Training Dataset [Dataset]. http://doi.org/10.18434/M32285
    Explore at:
    Dataset updated
    Aug 5, 2020
    Dataset provided by
    National Institute of Standards and Technology (http://www.nist.gov/)
    Authors
    Michael Paul Majurski
    License

    https://www.nist.gov/open/license

    Description

    The data being generated and disseminated is the training data used to construct trojan detection software solutions. This data, generated at NIST, consists of human level AIs trained to perform image classification. A known percentage of these trained AI models have been poisoned with a known trigger which induces incorrect behavior. This data will be used to develop software solutions for detecting which trained AI models have been poisoned via embedded triggers. This dataset consists of 1104 trained, human level, image classification AI models using a variety of model architectures. The models were trained on synthetically created image data of non-real traffic signs superimposed on road background scenes. Half (50%) of the models have been poisoned with an embedded trigger which causes misclassification of the images when the trigger is present.

  11. Data and trained models for: Human-robot facial co-expression

    • search.dataone.org
    • resodate.org
    • +1more
    Updated Jul 28, 2025
    Cite
    Yuhang Hu; Boyuan Chen; Jiong Lin; Yunzhe Wang; Yingke Wang; Cameron Mehlman; Hod Lipson (2025). Data and trained models for: Human-robot facial co-expression [Dataset]. http://doi.org/10.5061/dryad.gxd2547t7
    Explore at:
    Dataset updated
    Jul 28, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Yuhang Hu; Boyuan Chen; Jiong Lin; Yunzhe Wang; Yingke Wang; Cameron Mehlman; Hod Lipson
    Description

    Large language models are enabling rapid progress in robotic verbal communication, but nonverbal communication is not keeping pace. Physical humanoid robots struggle to express and communicate using facial movement, relying primarily on voice. The challenge is twofold: first, the actuation of an expressively versatile robotic face is mechanically challenging; second, knowing what expression to generate so that it appears natural, timely, and genuine. Here we propose that both barriers can be alleviated by training a robot to anticipate future facial expressions and execute them simultaneously with a human. Whereas delayed facial mimicry looks disingenuous, facial co-expression feels more genuine, since it requires correctly inferring the human's emotional state for timely execution. We find that a robot can learn to predict a forthcoming smile about 839 milliseconds before the human smiles and, using a learned inverse kinematic facial self-model, co-express the smile simultaneously.

    During the data collection phase, the robot generated symmetrical facial expressions, which we expected to cover most situations and to reduce the size of the model. We used an Intel RealSense D435i to capture RGB images and cropped them to 480×320. We logged each motor command value and robot image to form a single data pair without any human labeling.

    Dataset for Paper "Human-Robot Facial Co-expression"

    Overview

    This dataset accompanies the research on human-robot facial co-expression, aiming to enhance nonverbal interaction by training robots to anticipate and simultaneously execute human facial expressions. Our study proposes a method where robots can learn to predict forthcoming human facial expressions and execute them in real time, thereby making the interaction feel more genuine and natural.

    https://doi.org/10.5061/dryad.gxd2547t7

    Description of the data and file structure

    The dataset is organized into several zip files, each containing different components essential for replicating our study's results or for use in related research projects:

    • pred_training_data.zip: Contains the data used for training the predictive model. This dataset is crucial for developing models that predict human facial expressions based on input frames.
    • pred_model.zip: Contains the...
  12. Trojan Detection Software Challenge - Round 4 Holdout Dataset

    • data.nist.gov
    • nist.gov
    • +2more
    Updated Dec 31, 2020
    + more versions
    Cite
    Michael Paul Majurski (2020). Trojan Detection Software Challenge - Round 4 Holdout Dataset [Dataset]. http://doi.org/10.18434/mds2-2372
    Explore at:
    Dataset updated
    Dec 31, 2020
    Dataset provided by
    National Institute of Standards and Technology (http://www.nist.gov/)
    Authors
    Michael Paul Majurski
    License

    https://www.nist.gov/open/license

    Description

    The data being generated and disseminated is the holdout data used to construct trojan detection software solutions. This data, generated at NIST, consists of human level AIs trained to perform image classification. A known percentage of these trained AI models have been poisoned with a known trigger which induces incorrect behavior. This data will be used to develop software solutions for detecting which trained AI models have been poisoned via embedded triggers. This dataset consists of 288 adversarially trained, human level, image classification AI models using a variety of model architectures. The models were trained on synthetically created image data of non-real traffic signs superimposed on road background scenes. Half (50%) of the models have been poisoned with an embedded trigger which causes misclassification of the images when the trigger is present.

  13. Trojan Detection Software Challenge - Round 2 Holdout Dataset

    • data.nist.gov
    • s.cnmilf.com
    • +2more
    Updated Oct 23, 2020
    + more versions
    Cite
    Michael Paul Majurski (2020). Trojan Detection Software Challenge - Round 2 Holdout Dataset [Dataset]. http://doi.org/10.18434/mds2-2322
    Explore at:
    Dataset updated
    Oct 23, 2020
    Dataset provided by
    National Institute of Standards and Technology (http://www.nist.gov/)
    Authors
    Michael Paul Majurski
    License

    https://www.nist.gov/open/license

    Description

    The data being generated and disseminated is the holdout data used to evaluate trojan detection software solutions. This data, generated at NIST, consists of human level AIs trained to perform a variety of tasks (image classification, natural language processing, etc.). A known percentage of these trained AI models have been poisoned with a known trigger which induces incorrect behavior. This data will be used to develop software solutions for detecting which trained AI models have been poisoned via embedded triggers. This dataset consists of 144 trained, human level, image classification AI models using a variety of architectures. The models were trained on synthetically created image data of non-real traffic signs superimposed on road background scenes. Half (50%) of the models have been poisoned with an embedded trigger which causes misclassification of the images when the trigger is present.

  14. Trojan Detection Software Challenge - Round 2 Test Dataset

    • data.nist.gov
    • nist.gov
    • +2more
    Updated Oct 30, 2020
    + more versions
    Cite
    Michael Paul Majurski (2020). Trojan Detection Software Challenge - Round 2 Test Dataset [Dataset]. http://doi.org/10.18434/mds2-2321
    Explore at:
    Dataset updated
    Oct 30, 2020
    Dataset provided by
    National Institute of Standards and Technology (http://www.nist.gov/)
    Authors
    Michael Paul Majurski
    License

    https://www.nist.gov/open/license

    Description

    The data being generated and disseminated is the test data used to evaluate trojan detection software solutions. This data, generated at NIST, consists of human level AIs trained to perform a variety of tasks (image classification, natural language processing, etc.). A known percentage of these trained AI models have been poisoned with a known trigger which induces incorrect behavior. This data will be used to develop software solutions for detecting which trained AI models have been poisoned via embedded triggers. This dataset consists of 144 trained, human level, image classification AI models using a variety of model architectures. The models were trained on synthetically created image data of non-real traffic signs superimposed on road background scenes. Half (50%) of the models have been poisoned with an embedded trigger which causes misclassification of the images when the trigger is present.

  15. Human vs. Machine-Generated Text Stories

    • kaggle.com
    zip
    Updated Aug 4, 2024
    Cite
    Kian Jazayeri (2024). Human vs. Machine-Generated Text Stories [Dataset]. https://www.kaggle.com/datasets/kianjazayeri/human-vs-machine-generated-text-stories
    Explore at:
    zip (2,678,132 bytes)
    Dataset updated
    Aug 4, 2024
    Authors
    Kian Jazayeri
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains a collection of stories designed to facilitate research in distinguishing between human-generated and machine-generated text. The dataset includes 5000 human-generated stories sourced from the ROCStories corpus and machine-generated continuations of these stories produced using the FALCON-7b language model with three different settings.

    Dataset Composition:

    Human-Generated Stories: The original 5000 stories from the ROCStories corpus.

    Machine-Generated Stories (Setting 1): Continuations generated by FALCON-7b with balanced diversity and quality settings (Temperature: 1.0, Top-K Sampling: 50, Top-p Sampling: 0.9).

    Machine-Generated Stories (Setting 2): Continuations generated by FALCON-7b with high creativity and diversity settings (Temperature: 1.5, Top-K Sampling: 100, Top-p Sampling: 0.95).

    Machine-Generated Stories (Setting 3): Continuations generated by FALCON-7b with conservative and deterministic settings (Temperature: 0.7, Top-K Sampling: 20, Top-p Sampling: 0.8).
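The three decoding configurations above can be captured as generation-parameter dictionaries; the key names follow common Hugging Face `generate()` keyword arguments, which is an assumption about how the settings were applied:

```python
# FALCON-7b decoding settings transcribed from the dataset description.
# Key names (temperature, top_k, top_p) assume Hugging Face generate() conventions.
SETTINGS = {
    1: {"temperature": 1.0, "top_k": 50,  "top_p": 0.90},  # balanced diversity/quality
    2: {"temperature": 1.5, "top_k": 100, "top_p": 0.95},  # high creativity/diversity
    3: {"temperature": 0.7, "top_k": 20,  "top_p": 0.80},  # conservative/deterministic
}
print(SETTINGS[2]["temperature"])  # -> 1.5
```

Passing one of these dicts as keyword arguments to a sampling call would reproduce the corresponding setting.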

    Columns:

    Human Story: The original story written by a human.

    Machine-Generated (Setting 1): The continuation of the story generated by the FALCON-7b model with balanced settings.

    Machine-Generated (Setting 2): The continuation of the story generated by the FALCON-7b model with creative settings.

    Machine-Generated (Setting 3): The continuation of the story generated by the FALCON-7b model with conservative settings.
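    The three settings above vary temperature, top-k, and top-p jointly. As a rough illustration of how these parameters filter a next-token distribution (a pure-Python sketch, not FALCON-7b's actual decoding code):

```python
import math

def sample_filter(logits, temperature=1.0, top_k=50, top_p=0.9):
    """Return the renormalized next-token distribution after applying
    temperature scaling, top-k truncation, and top-p (nucleus) truncation."""
    # Temperature scaling: higher T flattens the distribution, lower T sharpens it.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    probs = [math.exp(s - m) for s in scaled]
    total = sum(probs)
    probs = [p / total for p in probs]

    # Top-k: rank tokens by probability, keep at most top_k of them.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]

    # Top-p (nucleus): keep the smallest prefix whose cumulative mass reaches top_p.
    kept, cum = [], 0.0
    for i in ranked:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break

    # Renormalize over the surviving tokens.
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}

# Setting 3 (conservative): low temperature, tight top-k/top-p.
dist = sample_filter([2.0, 1.0, 0.5, 0.1], temperature=0.7, top_k=20, top_p=0.8)
```

    Lower temperature sharpens the distribution before truncation, so the conservative setting typically leaves only the most likely continuations in play.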

    Purpose and Use: This dataset is intended for researchers and practitioners in the fields of Natural Language Processing (NLP) and Machine Learning. It provides a valuable resource for developing and testing models aimed at distinguishing between human and machine-generated text. Applications of this research include improving the detection of AI-generated content, enhancing text generation models, and exploring the nuances of human versus machine creativity in storytelling.

    Acknowledgements:

    The human-generated stories are sourced from the ROCStories corpus: Mostafazadeh, N., Chambers, N., He, X., Parikh, D., Batra, D., Vanderwende, L., Kohli, P., & Allen, J. (2016). A Corpus and Cloze Evaluation for Deeper Understanding of Commonsense Stories. https://doi.org/10.18653/v1/n16-1098

    The machine-generated continuations are created using the FALCON-7b language model.
    License:

    This dataset is shared under the Creative Commons Attribution 4.0 International (CC BY 4.0) license, allowing for sharing, adaptation, and usage with appropriate credit to the original creators.

  16. Data from: Identifying Machine-Paraphrased Plagiarism

    • portalinvestigacio.uib.cat
    • opendatalab.com
    • +2more
    Updated 2021
    Cite
    Wahle, Jan Philip; Ruas, Terry; Foltynek, Tomas; Meuschke, Norman; Gipp, Bela (2021). Identifying Machine-Paraphrased Plagiarism [Dataset]. https://portalinvestigacio.uib.cat/documentos/688b602617bb6239d2d49012
    Explore at:
    Dataset updated
    2021
    Authors
    Wahle, Jan Philip; Ruas, Terry; Foltynek, Tomas; Meuschke, Norman; Gipp, Bela
    Description

    README.txt Title: Identifying Machine-Paraphrased Plagiarism
    Authors: Jan Philip Wahle, Terry Ruas, Tomas Foltynek, Norman Meuschke, and Bela Gipp
    contact email: wahle@gipplab.org; ruas@gipplab.org;
    Venue: iConference
    Year: 2022
    ================================================================
    Dataset Description:
    Training: 200,767 paragraphs (98,282 original, 102,485 paraphrased) extracted from 8,024 Wikipedia (English) articles (4,012 original, 4,012 paraphrased using the SpinBot API).
    Testing:
    SpinBot:
    arXiv - Original - 20,966; Spun - 20,867
    Theses - Original - 5,226; Spun - 3,463
    Wikipedia - Original - 39,241; Spun - 40,729

    SpinnerChief-4W:
    arXiv - Original - 20,966; Spun - 21,671
    Theses - Original - 2,379; Spun - 2,941
    Wikipedia - Original - 39,241; Spun - 39,618

    SpinnerChief-2W:
    arXiv - Original - 20,966; Spun - 21,719
    Theses - Original - 2,379; Spun - 2,941
    Wikipedia - Original - 39,241; Spun - 39,697
    ================================================================
    Dataset Structure: [human_evaluation] folder: human evaluation to identify human-generated text and machine-paraphrased text. It contains the files (original and spun) as well as the answer key for the survey performed with human subjects (all data is anonymized for privacy reasons). NNNNN.txt - whole document from which an extract was taken for human evaluation
    key.txt.zip - information about each case (ORIG/SPUN)
    results.xlsx - raw results downloaded from the survey tool (the extracts which humans judged are in the first line)
    results-corrected.xlsx - at the very beginning, there was a mistake in one question (wrong extract). These results were excluded.
    [automated_evaluation]: contains all files used for the automated evaluation considering SpinBot and SpinnerChief. Each paraphrase-tool folder contains [corpus] and [vectors] sub-folders. For [spinnerchief], two variations are included, with a 4-word-changing ratio (default) and a 2-word-changing ratio. The [vectors] sub-folder contains the average of all word vectors for each paragraph. Each line has the number of dimensions of the word-embedding technique used (see paper for more details) followed by its respective class label (mg or og). Each file belongs to one class, either "mg" or "og". The values are comma-separated (.csv); files with the .arff extension can be read as normal .txt files. The word-embedding technique used is described in the file name with the following structure:
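    The [vectors] files described above (one averaged word vector per paragraph, followed by its mg/og class label) could be produced along these lines; the toy embedding table and helper names here are illustrative, not the paper's code:

```python
def paragraph_vector(tokens, embeddings):
    """Average the word vectors of all tokens that have an embedding."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        raise ValueError("no known tokens in paragraph")
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

def to_vector_line(vector, label):
    """Comma-separated vector values followed by the class label (mg or og)."""
    return ",".join(f"{x:.6f}" for x in vector) + f",{label}"

# Toy 3-dimensional embeddings (illustrative only).
emb = {"the": [1.0, 0.0, 0.0], "cat": [0.0, 1.0, 0.0], "sat": [0.0, 0.0, 1.0]}
line = to_vector_line(paragraph_vector(["the", "cat", "sat"], emb), "og")
```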

  17. Data from: Deep learning four decades of human migration: datasets

    • zenodo.org
    csv, nc
    Updated Oct 13, 2025
    Cite
    Thomas Gaskin; Thomas Gaskin; Guy Abel; Guy Abel (2025). Deep learning four decades of human migration: datasets [Dataset]. http://doi.org/10.5281/zenodo.17344747
    Explore at:
    csv, ncAvailable download formats
    Dataset updated
    Oct 13, 2025
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Thomas Gaskin; Thomas Gaskin; Guy Abel; Guy Abel
    License

    Attribution 4.0 International (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This Zenodo repository contains all migration flow estimates associated with the paper "Deep learning four decades of human migration." Evaluation code, training data, trained neural networks, and smaller flow datasets are available in the main GitHub repository, which also provides detailed instructions on data sourcing. Due to file size limits, the larger datasets are archived here.

    Data is available in both NetCDF (.nc) and CSV (.csv) formats. The NetCDF format is more compact and pre-indexed, making it suitable for large files. In Python, datasets can be opened as xarray.Dataset objects, enabling coordinate-based data selection.

    Each dataset uses the following coordinate conventions:

    • Year: 1990–2023
    • Birth ISO: Country of birth (UN ISO3)
    • Origin ISO: Country of origin (UN ISO3)
    • Destination ISO: Destination country (UN ISO3)
    • Country ISO: Used for net migration data (UN ISO3)

    The following data files are provided:

    • T.nc: Full table of flows disaggregated by country of birth. Dimensions: Year, Birth ISO, Origin ISO, Destination ISO
    • flows.nc: Total origin-destination flows (equivalent to T summed over Birth ISO). Dimensions: Year, Origin ISO, Destination ISO
    • net_migration.nc: Net migration data by country. Dimensions: Year, Country ISO
    • stocks.nc: Stock estimates for each country pair. Dimensions: Year, Origin ISO (corresponding to Birth ISO), Destination ISO
    • test_flows.nc: Flow estimates on a randomly selected set of test edges, used for model validation
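    Assuming xarray is installed, a flow value can be selected by coordinates as the description suggests. The in-memory dataset below is a tiny stand-in for flows.nc; with the real file you would call xr.open_dataset("flows.nc") instead:

```python
import numpy as np
import xarray as xr

# A tiny stand-in for flows.nc: Year x Origin ISO x Destination ISO.
years = [1990, 1991]
isos = ["DEU", "FRA"]
flows = xr.Dataset(
    {"mean": (("Year", "Origin ISO", "Destination ISO"),
              np.arange(8, dtype=float).reshape(2, 2, 2))},
    coords={"Year": years, "Origin ISO": isos, "Destination ISO": isos},
)

# Coordinate-based selection, as with the real file:
deu_to_fra_1991 = flows["mean"].sel(
    {"Year": 1991, "Origin ISO": "DEU", "Destination ISO": "FRA"}
)
```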

    Additionally, two CSV files are provided for convenience:

    • mig_unilateral.csv: Unilateral migration estimates per country, comprising:
      • imm: Total immigration flows
      • emi: Total emigration flows
      • net: Net migration
      • imm_pop: Total immigrant population (non-native-born)
      • emi_pop: Total emigrant population (living abroad)
    • mig_bilateral.csv: Bilateral flow data, comprising:
      • mig_prev: Total origin-destination flows
      • mig_brth: Total birth-destination flows, where Origin ISO reflects place of birth
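    If net migration follows the usual convention net = imm - emi (an assumption; the description above does not state the sign convention), mig_unilateral.csv can be sanity-checked with pandas. The toy rows below stand in for the real file, which would be loaded with pd.read_csv("mig_unilateral.csv"):

```python
import pandas as pd

# Toy stand-in for mig_unilateral.csv.
df = pd.DataFrame({
    "Country ISO": ["DEU", "FRA"],
    "imm": [100.0, 80.0],
    "emi": [40.0, 90.0],
    "net": [60.0, -10.0],
})

# Check that net agrees with imm - emi on every row (up to rounding).
consistent = (df["net"] - (df["imm"] - df["emi"])).abs().lt(1e-6).all()
```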

    Each dataset includes a mean variable (mean estimate) and a std variable (standard deviation of the estimate).

    An ISO3 conversion table is also provided.

  18. Trojan Detection Software Challenge - image-classification-feb2021-train

    • data.nist.gov
    • s.cnmilf.com
    • +2more
    Updated Dec 14, 2020
    Cite
    Michael Paul Majurski (2020). Trojan Detection Software Challenge - image-classification-feb2021-train [Dataset]. http://doi.org/10.18434/mds2-2340
    Explore at:
    Dataset updated
    Dec 14, 2020
    Dataset provided by
    National Institute of Standards and Technology: http://www.nist.gov/
    Authors
    Michael Paul Majurski
    License

    https://www.nist.gov/open/license

    Description

    Round 4 Train Dataset The data being generated and disseminated is the training data used to construct trojan detection software solutions. This data, generated at NIST, consists of human level AIs trained to perform image classification. A known percentage of these trained AI models have been poisoned with a known trigger which induces incorrect behavior. This data will be used to develop software solutions for detecting which trained AI models have been poisoned via embedded triggers. This dataset consists of 1008 adversarially trained, human level, image classification AI models using a variety of model architectures. The models were trained on synthetically created image data of non-real traffic signs superimposed on road background scenes. Half (50%) of the models have been poisoned with an embedded trigger which causes misclassification of the images when the trigger is present.

  19. Bobby Cole Ambient New Age Atmospheres (Machine Learning (ML) Data) |...

    • datarade.ai
    .wav
    Updated Sep 17, 2025
    Cite
    Bobby Cole Music (2025). Bobby Cole Ambient New Age Atmospheres (Machine Learning (ML) Data) | Original, Premium and Custom Record | 16GB+ Collection [Dataset]. https://datarade.ai/data-products/bobby-cole-ambient-new-age-atmospheres-machine-learning-ml-bobby-cole-music
    Explore at:
    .wavAvailable download formats
    Dataset updated
    Sep 17, 2025
    Dataset authored and provided by
    Bobby Cole Music
    Area covered
    Martinique, Mozambique, Latvia, Sri Lanka, Cayman Islands, Bermuda, Denmark, Kyrgyzstan, United Republic of, Vietnam
    Description

    The Ambience Dataset is a curated collection of original ambient recordings designed to support machine learning in sound classification, emotional modeling, and spatial audio recognition. These tracks feature evolving textures, minimal melodic content, and carefully layered soundscapes that reflect environments both natural and abstract. From soothing drones to atmospheric washes and immersive pads, this dataset captures the subtle complexity of ambient music.

    Each track is paired with detailed metadata including tempo (if applicable), duration, sonic descriptors, frequency characteristics, and mood annotations—offering vital training data for MIR, generative ambience, ambient sound classification, and more. With no AI-generated content, every track is 100% human-produced and studio-crafted, providing high-fidelity, expressive audio for advanced AI audio development.

  20. Bobby Cole Music Classical Music (Machine Learning (ML) Data) | Original,...

    • datarade.ai
    .wav
    Updated Sep 16, 2025
    Cite
    Bobby Cole Music (2025). Bobby Cole Music Classical Music (Machine Learning (ML) Data) | Original, Premium and Custom Record | 55GB+ Collection (Copy) [Dataset]. https://datarade.ai/data-products/bobby-cole-music-classical-music-machine-learning-ml-data-bobby-cole-music-e937
    Explore at:
    .wavAvailable download formats
    Dataset updated
    Sep 16, 2025
    Dataset authored and provided by
    Bobby Cole Music
    Area covered
    Gambia, Faroe Islands, Maldives, Philippines, Togo, Lao People's Democratic Republic, Ukraine, Seychelles, Azerbaijan, Romania
    Description

    The Nursery Rhymes Dataset is a charming collection of custom-composed children’s songs that replicate the classic nursery rhyme format—featuring repetitive phrasing, rhyme schemes, melodic simplicity, and age-appropriate pacing. Instruments include piano, glockenspiel, ukulele, soft drums, and warm vocal tones designed to be soothing, clear, and accessible for young learners.

    Each track includes metadata detailing tempo, rhyme patterns, key, syllable count, verse structure, and lyrical themes. This enables AI models to learn patterns in early childhood music such as phoneme repetition, call-and-response, learning cues, and safe tonal qualities.

    All content is 100% human-created in a professional studio, with absolutely no AI-generated melodies or vocals. The music is designed to mimic the familiar structure of classic nursery rhymes while offering entirely original content. This makes it perfect for training models that support children’s language acquisition, music education, vocal detection, emotional regulation tools, and intelligent kids’ media platforms.

    From educational games to voice assistants for children, this dataset provides a foundation of safe, consistent, and musically sound data for any child-focused AI application.
