95 datasets found
  1. AI vs Human Text Balanced 360k+ records

    • kaggle.com
    zip
    Updated Sep 7, 2025
    Cite
    ARJUN VERMA (2025). AI vs Human Text Balanced 360k+ records [Dataset]. https://www.kaggle.com/datasets/arjunverma2004/ai-vs-human-text-balanced-180k-records
    Explore at:
    zip (263,669,225 bytes)
    Dataset updated
    Sep 7, 2025
    Authors
    ARJUN VERMA
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This dataset contains a collection of AI-generated and human-written text samples. It is designed for research in AI text detection, natural language processing, and machine learning tasks such as binary classification.

    The original dataset, created by Shane Gerami, provided a large set of examples for distinguishing between AI and human writing.

    In this version, I have balanced the dataset to ensure equal representation of AI and human text. This makes it more suitable for training and evaluating machine learning models without bias toward one class.

    🔍 Features
    - Text: The written passage (AI or human).
    - Label: 0 → Human-written, 1 → AI-generated

    ⚡ Use Cases
    - Training models to detect AI-generated text
    - Benchmarking text classification approaches
    - Research in AI detection, authorship attribution, and content moderation
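The Text/Label schema above can be checked for the claimed class balance with a short sketch. The exact CSV column names (`Text`, `Label`) are an assumption taken from the feature list:

```python
import csv
import io

def class_counts(csv_text, label_field="Label"):
    """Tally rows per label value; a balanced file yields equal counts."""
    counts = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        counts[row[label_field]] = counts.get(row[label_field], 0) + 1
    return counts

# Tiny in-memory stand-in for the real CSV download.
sample = (
    "Text,Label\n"
    "A human-written sentence.,0\n"
    "An AI-generated sentence.,1\n"
    "Another human sentence.,0\n"
    "Another AI sentence.,1\n"
)
print(class_counts(sample))  # -> {'0': 2, '1': 2}
```

For the real file, pass the contents of the downloaded CSV instead of `sample`.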

    🙌 Acknowledgments

    Original dataset by Shane Gerami: AI vs Human Text

    This balanced version prepared and published by @arjunverma2004

  2. Data from: ALHD: A Large-Scale and Multigenre Benchmark Dataset for Arabic...

    • zenodo.org
    bin, csv, zip
    Updated Oct 8, 2025
    Cite
    Ali Khairallah; Arkaitz Zubiaga (2025). ALHD: A Large-Scale and Multigenre Benchmark Dataset for Arabic LLM-Generated Text Detection [Dataset]. http://doi.org/10.5281/zenodo.17249602
    Explore at:
    zip, csv, bin
    Dataset updated
    Oct 8, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Ali Khairallah; Arkaitz Zubiaga
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description
    The ALHD (Arabic LLM and Human Dataset) is a large-scale, multigenre, and comprehensive Arabic dataset explicitly designed to distinguish between human- and LLM-generated texts. ALHD spans three genres (news, social media, reviews), covers both MSA and dialectal Arabic, and contains over 400K balanced samples generated by three leading LLMs and drawn from multiple human sources, enabling the study of generalizability in Arabic LLM-generated text detection.
  3. Urdu Human and AI text Dataset (UHAT)

    • ieee-dataport.org
    Updated Jul 20, 2025
    + more versions
    Cite
    Muhammad Ammar (2025). Urdu Human and AI text Dataset (UHAT) [Dataset]. https://ieee-dataport.org/documents/urdu-human-and-ai-text-dataset-uhat
    Explore at:
    Dataset updated
    Jul 20, 2025
    Authors
    Muhammad Ammar
    Description

    Dataset Overview: This dataset is designed for Urdu text classification.

  4. ShutterStock Dataset for AI vs Human-Gen. Image

    • kaggle.com
    zip
    Updated Jun 19, 2025
    Cite
    Sachin Singh (2025). ShutterStock Dataset for AI vs Human-Gen. Image [Dataset]. https://www.kaggle.com/datasets/shreyasraghav/shutterstock-dataset-for-ai-vs-human-gen-image
    Explore at:
    zip (11,617,243,112 bytes)
    Dataset updated
    Jun 19, 2025
    Authors
    Sachin Singh
    License

    Database Contents License (DbCL) v1.0: http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    ShutterStock AI vs. Human-Generated Image Dataset

    This dataset is curated to facilitate research in distinguishing AI-generated images from human-created ones, leveraging ShutterStock data. As AI-generated imagery becomes more sophisticated, developing models that can classify and analyze such images is crucial for applications in content moderation, digital forensics, and media authenticity verification.

    Dataset Overview:

    • Total Images: 100,000
    • Training Data: 80,000 images (majority AI-generated)
    • Test Data: 20,000 images
    • Image Sources: A mix of AI-generated images and real photographs or illustrations created by human artists
    • Labeling: Each image is labeled as either AI-generated or human-created

    Potential Use Cases:

    • AI-Generated Image Detection: Train models to distinguish between AI and human-made images.
    • Deep Learning & Computer Vision Research: Develop and benchmark CNNs, transformers, and other architectures.
    • Generative Model Evaluation: Compare AI-generated images to real images for quality assessment.
    • Digital Forensics: Identify synthetic media for applications in fake image detection.
    • Ethical AI & Content Authenticity: Study the impact of AI-generated visuals in media and ensure transparency.

    Why This Dataset?

    With the rise of generative AI models like Stable Diffusion, DALL·E, and MidJourney, the ability to differentiate between synthetic and real images has become a crucial challenge. This dataset offers a structured way to train AI models on this task, making it a valuable resource for both academic research and practical applications.

    Explore the dataset and contribute to advancing AI-generated content detection!

    Step 1: Install and Authenticate Kaggle API

    If you haven't installed the Kaggle API, run:
      pip install kaggle

    Then, download your kaggle.json API key from your Kaggle account settings and move it to ~/.kaggle/ (Linux/Mac) or C:\Users\<YourUser>\.kaggle\ (Windows).

    Step 2: Download the Dataset

    Use the credentials stored in kaggle.json (a username/key pair) with Kaggle's dataset download endpoint:

      curl -L -o dataset.zip -u "$(jq -r .username ~/.kaggle/kaggle.json):$(jq -r .key ~/.kaggle/kaggle.json)" "https://www.kaggle.com/api/v1/datasets/download/shreyasraghav/shutterstock-dataset-for-ai-vs-human-gen-image"


    Step 3: Extract the Dataset

    Once downloaded, extract the dataset using:
      unzip dataset.zip -d dataset_folder

    Now your dataset is ready to use! 🚀

  5. Tran et al. Final_Dataset.xlsx

    • figshare.com
    xlsx
    Updated Nov 12, 2024
    Cite
    Van Hieu Tran; Yakub Sebastian; Asif Karim; Sami Azam (2024). Tran et al. Final_Dataset.xlsx [Dataset]. http://doi.org/10.6084/m9.figshare.27619839.v1
    Explore at:
    xlsx
    Dataset updated
    Nov 12, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Van Hieu Tran; Yakub Sebastian; Asif Karim; Sami Azam
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Artificial Intelligence (AI) has emerged as a critical challenge to the authenticity of journalistic content, raising concerns over the ease with which artificially generated articles can mimic human-written news. This study focuses on using machine learning to identify distinguishing features, or “stylistic fingerprints,” of AI-generated and human-authored journalism. By analyzing these unique characteristics, we aim to classify news pieces with high accuracy, enhancing our ability to verify the authenticity of digital news.

    To conduct this study, we gathered a balanced dataset of 150 original journalistic articles and their 150 AI-generated counterparts, sourced from popular news websites. A variety of lexical, syntactic, and readability features were extracted from each article to serve as input data for training machine learning models. Five classifiers were then trained to evaluate how accurately they could distinguish between authentic and artificial articles, with each model learning specific patterns and variations in writing style.

    In addition to model training, BERTopic, a topic modeling technique, was applied to extract salient keywords from the journalistic articles. These keywords were used to prompt Google’s Gemini, an AI text generation model, to create artificial articles on the same topics as the original human-written pieces. This ensured a high level of relevance between authentic and AI-generated articles, which added complexity to the classification task.

    Among the five classifiers tested, the Random Forest model delivered the best performance, achieving an accuracy of 98.3% along with high precision (0.984), recall (0.983), and F1-score (0.983). Feature importance analyses were conducted using methods like Random Forest Feature Importance, Analysis of Variance (ANOVA), Mutual Information, and Recursive Feature Elimination. This analysis revealed that the top five discriminative features were sentence length range, paragraph length coefficient of variation, verb ratio, sentence complexity tags, and paragraph length range. These features appeared to encapsulate subtle but meaningful stylistic differences between human and AI-generated content.

    This research makes a significant contribution to combating disinformation by offering a robust method for authenticating journalistic content. By employing machine learning to identify subtle linguistic patterns, this study not only advances our understanding of AI in journalism but also enhances the tools available to ensure the credibility of news in the digital age.
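Two of the top discriminative features named above, sentence length range and paragraph length coefficient of variation, can be sketched directly from raw text. The sentence and paragraph splitting rules below are simplifying assumptions, not the paper's exact preprocessing:

```python
import re
import statistics

def stylistic_features(article):
    """Compute two stylistic features: sentence length range (in words)
    and paragraph length coefficient of variation."""
    sentences = [s for s in re.split(r"[.!?]+\s*", article) if s.strip()]
    paragraphs = [p for p in article.split("\n\n") if p.strip()]
    sent_lens = [len(s.split()) for s in sentences]
    para_lens = [len(p.split()) for p in paragraphs]
    return {
        "sentence_length_range": max(sent_lens) - min(sent_lens),
        "paragraph_length_cv": statistics.pstdev(para_lens) / statistics.mean(para_lens),
    }

text = "Short one. This sentence is quite a bit longer than the first.\n\nSecond paragraph here."
print(stylistic_features(text))
```

Features like these would feed the Random Forest classifier described in the study.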

  6. Speech Dataset of Human and AI-Generated Voices

    • data.mendeley.com
    • kaggle.com
    Updated Sep 15, 2025
    Cite
    Huzain Azis (2025). Speech Dataset of Human and AI-Generated Voices [Dataset]. http://doi.org/10.17632/5czyx2vppv.2
    Explore at:
    Dataset updated
    Sep 15, 2025
    Authors
    Huzain Azis
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    This dataset consists of audio recordings in the Indonesian language, categorized into two distinct classes: human voices (real) and synthetic voices generated using artificial intelligence (AI). Each class comprises 21 audio files, resulting in a total of 42 audio files. Each recording has a duration ranging from approximately 4 to 9 minutes, with an average length of around 6 minutes per file. All recordings are provided in WAV format and accompanied by a CSV file containing detailed duration metadata for each audio file.

    This dataset is suitable for research and applications in speech recognition, voice authenticity detection, audio analysis, and related fields. It enables comparative analysis between natural Indonesian speech and AI-generated synthetic speech.
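The duration metadata described above can be recomputed from the WAV files themselves with Python's standard library, a handy cross-check against the CSV:

```python
import io
import wave

def wav_duration_seconds(wav_bytes):
    """Duration of a WAV file: frame count divided by sample rate."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as w:
        return w.getnframes() / w.getframerate()

# Build a 2-second silent mono WAV in memory to demonstrate.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)        # 16-bit samples
    w.setframerate(16000)    # 16 kHz
    w.writeframes(b"\x00\x00" * 32000)  # 32000 frames = 2 s
print(wav_duration_seconds(buf.getvalue()))  # -> 2.0
```

For the real dataset, read each .wav file's bytes and compare the computed durations against the CSV entries.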

  7. AI-Generated vs Human-Written Text Dataset

    • kaggle.com
    zip
    Updated Sep 17, 2025
    Cite
    Batyr Sharimbayev (2025). AI-Generated vs Human-Written Text Dataset [Dataset]. https://www.kaggle.com/datasets/hardkazakh/ai-generated-vs-human-written-text-dataset
    Explore at:
    zip (83,075 bytes)
    Dataset updated
    Sep 17, 2025
    Authors
    Batyr Sharimbayev
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Description

    This dataset contains two main collections of texts:
    - AI-Generated Texts: Produced using ChatGPT, Gemini, Grok, Deepseek in response to academic-style prompts across multiple domains, including Mathematics, Biology, History, Economics, Computer Science, and IELTS-style essays.
    - Human-Written Texts: Collected from authentic academic sources such as arXiv, including metadata (author, year, and source).

    To simulate diverse writing conditions, the dataset is extended with different variations of AI outputs, such as paraphrasing, translation, and humanization. This allows researchers to study AI text detection, authorship classification, and style transfer.

    Variables and How They Are Obtained

    1. Generated

    Texts produced by ChatGPT, Gemini, Grok, Deepseek in response to academic prompts. Each prompt specifies a subject area and includes formatting restrictions to avoid the use of mathematical formulas, symbols, lists, and special formatting.

    Prompts for Generated Texts:

    - Mathematics: "Explain the fundamental principles of calculus, including differentiation and integration, with real-world applications. Instructions: 1) Write about 400 words. 2) Avoid Mathematical formulas and symbols. 3) If possible avoid itemization. 4) Avoid bold letters, headers, etc."
    - Biology: "Explain the process of cellular respiration and its role in energy production within living organisms. Instructions: 1) Write about 400 words. 2) Avoid Mathematical formulas and symbols. 3) If possible avoid itemization. 4) Avoid bold letters, headers, etc."
    - History: "Analyze the causes and consequences of the Industrial Revolution, highlighting its impact on global economies and societies. Instructions: 1) Write about 400 words. 2) Avoid Mathematical formulas and symbols. 3) If possible avoid itemization. 4) Avoid bold letters, headers, etc."
    - Economics: "Explain the principles of supply and demand and their effects on market equilibrium, with examples. Instructions: 1) Write about 400 words. 2) Avoid Mathematical formulas and symbols. 3) If possible avoid itemization. 4) Avoid bold letters, headers, etc."
    - Computer Science: "Describe the basics of machine learning, including supervised and unsupervised learning techniques. Instructions: 1) Write about 400 words. 2) Avoid Mathematical formulas and symbols. 3) If possible avoid itemization. 4) Avoid bold letters, headers, etc."
    - IELTS Essay: "Provide 400-word passage written at an IELTS Band 6 level: Government investment in the arts, such as music and theatre, is a waste of money. Governments must invest this money in public services instead. To what extent do you agree with this statement?"

    2. Paraphrased

    Reworded versions of the AI-generated texts.
    - Obtained using QuillBot paraphrasing tool (default settings).
    - Example instruction: “Paraphrase the following text to avoid direct repetition but keep the meaning the same.”

    3. Translated

    AI-generated texts translated into another language and back into English to simulate style distortion.
    - Step 1: Translated into Russian with Yandex Translate.
    - Step 2: Back-translated into English using Google Translate.

    4. Humanized

    AI-generated texts rewritten to resemble writing by a non-native English speaker at approximately IELTS Band 6 level. The style reflects competent English usage but with minor errors and awkward phrasing.

    Prompt for Humanized Texts:

    Rewrite the following text passage to reflect the writing style of a non-native English speaker who has achieved a band level 6 in IELTS writing. This level indicates a competent user of English, but with some inaccuracies, inappropriate usage, and misunderstandings. The text should be mostly clear but may contain occasional errors in grammar, vocabulary, and coherence.

    Text Passage for Rewriting: [Insert text here]

    Note: Aim for errors that are typical of an IELTS band level 6 writer. These could include minor grammatical mistakes, slight misuse of vocabulary, and occasional awkward phrasing. However, the overall meaning of the text should remain clear and understandable.

    Word Count: approximately 400

    5. Human-Written

    Authentic texts authored by researchers.
    - Sources: arXiv.org.
    - Metadata includes author name, publication year, and source.

  8. Trojan Detection Software Challenge - image-classification-jun2020-train

    • data.nist.gov
    • nist.gov
    • +1more
    Updated Mar 31, 2020
    + more versions
    Cite
    Michael Paul Majurski (2020). Trojan Detection Software Challenge - image-classification-jun2020-train [Dataset]. http://doi.org/10.18434/M32195
    Explore at:
    Dataset updated
    Mar 31, 2020
    Dataset provided by
    National Institute of Standards and Technology (http://www.nist.gov/)
    Authors
    Michael Paul Majurski
    License

    https://www.nist.gov/open/license

    Description

    Round 1 Training Dataset

    The data being generated and disseminated is the training data used to construct trojan detection software solutions. This data, generated at NIST, consists of human level AIs trained to perform a variety of tasks (image classification, natural language processing, etc.). A known percentage of these trained AI models have been poisoned with a known trigger which induces incorrect behavior. This data will be used to develop software solutions for detecting which trained AI models have been poisoned via embedded triggers.

    This dataset consists of 1000 trained, human level, image classification AI models using the following architectures: Inception-v3, DenseNet-121, and ResNet50. The models were trained on synthetically created image data of non-real traffic signs superimposed on road background scenes. Half (50%) of the models have been poisoned with an embedded trigger which causes misclassification of the images when the trigger is present.

    Errata: This dataset had a software bug in the trigger embedding code that caused 4 models trained for this dataset to have a ground truth value of 'poisoned' but which did not contain any embedded triggers. These models should not be used. Models without a trigger embedded:
    - id-00000184
    - id-00000599
    - id-00000858
    - id-00001088

    Google Drive Mirror: https://drive.google.com/open?id=1uwVt3UCRL2fCX9Xvi2tLoz_z-DwbU6Ce
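Given the errata, a minimal filter that drops the four affected model IDs before training or evaluation (IDs taken verbatim from the errata list):

```python
# The four models the errata flags as mislabeled 'poisoned'; they should be skipped.
ERRATA_IDS = {"id-00000184", "id-00000599", "id-00000858", "id-00001088"}

def usable_models(model_ids):
    """Return the model IDs with the errata models filtered out."""
    return [m for m in model_ids if m not in ERRATA_IDS]

sample = ["id-00000183", "id-00000184", "id-00000599", "id-00001000"]
print(usable_models(sample))  # -> ['id-00000183', 'id-00001000']
```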

  9. AH&AITD – Arslan’s Human and AI Text Database

    • figshare.com
    xlsx
    Updated May 24, 2025
    + more versions
    Cite
    Arslan Akram (2025). AH&AITD – Arslan’s Human and AI Text Database [Dataset]. http://doi.org/10.6084/m9.figshare.29144348.v1
    Explore at:
    xlsx
    Dataset updated
    May 24, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Arslan Akram
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    AH&AITD is a comprehensive benchmark dataset designed to support the evaluation of AI-generated text detection tools. The dataset contains 11,580 samples spanning both human-written and AI-generated content across multiple domains. It was developed to address limitations in previous datasets, particularly in terms of diversity, scale, and real-world applicability, and to enable fair benchmarking of detection tools across varied writing styles and content categories.

    Composition:

    1. Human-Written Samples (Total: 5,790), collected from:
    - Open Web Text (2,343 samples)
    - Blogs (196 samples)
    - Web Text (397 samples)
    - Q&A Platforms (670 samples)
    - News Articles (430 samples)
    - Opinion Statements (1,549 samples)
    - Scientific Research Abstracts (205 samples)

    2. AI-Generated Samples (Total: 5,790), generated using:
    - ChatGPT (1,130 samples)
    - GPT-4 (744 samples)
    - Paraphrase Models (1,694 samples)
    - GPT-2 (328 samples)
    - GPT-3 (296 samples)
    - DaVinci (GPT-3.5 variant) (433 samples)
    - GPT-3.5 (364 samples)
    - OPT-IML (406 samples)
    - Flan-T5 (395 samples)

    Citation: Akram, A. (2023). AH&AITD: Arslan’s Human and AI Text Database [Dataset]. Associated with the article: An Empirical Study of AI-Generated Text Detection Tools. Advances in Machine Learning & Artificial Intelligence, 4(2), 44–55.
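The composition figures above can be cross-checked with quick arithmetic; the counts below are transcribed from the description:

```python
# Per-source sample counts transcribed from the dataset description.
human = {"Open Web Text": 2343, "Blogs": 196, "Web Text": 397, "Q&A Platforms": 670,
         "News Articles": 430, "Opinion Statements": 1549, "Scientific Research Abstracts": 205}
ai = {"ChatGPT": 1130, "GPT-4": 744, "Paraphrase Models": 1694, "GPT-2": 328,
      "GPT-3": 296, "DaVinci (GPT-3.5 variant)": 433, "GPT-3.5": 364,
      "OPT-IML": 406, "Flan-T5": 395}

# Both halves should total 5,790, for 11,580 samples overall.
print(sum(human.values()), sum(ai.values()), sum(human.values()) + sum(ai.values()))
# -> 5790 5790 11580
```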

  10. Trojan Detection Software Challenge - Round 2 Training Dataset

    • data.nist.gov
    • nist.gov
    • +1more
    Updated Aug 5, 2020
    Cite
    Michael Paul Majurski (2020). Trojan Detection Software Challenge - Round 2 Training Dataset [Dataset]. http://doi.org/10.18434/M32285
    Explore at:
    Dataset updated
    Aug 5, 2020
    Dataset provided by
    National Institute of Standards and Technology (http://www.nist.gov/)
    Authors
    Michael Paul Majurski
    License

    https://www.nist.gov/open/license

    Description

    The data being generated and disseminated is the training data used to construct trojan detection software solutions. This data, generated at NIST, consists of human level AIs trained to perform image classification. A known percentage of these trained AI models have been poisoned with a known trigger which induces incorrect behavior. This data will be used to develop software solutions for detecting which trained AI models have been poisoned via embedded triggers. This dataset consists of 1104 trained, human level, image classification AI models using a variety of model architectures. The models were trained on synthetically created image data of non-real traffic signs superimposed on road background scenes. Half (50%) of the models have been poisoned with an embedded trigger which causes misclassification of the images when the trigger is present.

  11. Data and trained models for: Human-robot facial co-expression

    • search.dataone.org
    • resodate.org
    • +1more
    Updated Jul 28, 2025
    Cite
    Yuhang Hu; Boyuan Chen; Jiong Lin; Yunzhe Wang; Yingke Wang; Cameron Mehlman; Hod Lipson (2025). Data and trained models for: Human-robot facial co-expression [Dataset]. http://doi.org/10.5061/dryad.gxd2547t7
    Explore at:
    Dataset updated
    Jul 28, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Yuhang Hu; Boyuan Chen; Jiong Lin; Yunzhe Wang; Yingke Wang; Cameron Mehlman; Hod Lipson
    Description

    Large language models are enabling rapid progress in robotic verbal communication, but nonverbal communication is not keeping pace. Physical humanoid robots struggle to express and communicate using facial movement, relying primarily on voice. The challenge is twofold: first, the actuation of an expressively versatile robotic face is mechanically challenging; second, knowing what expression to generate so that it appears natural, timely, and genuine. Here we propose that both barriers can be alleviated by training a robot to anticipate future facial expressions and execute them simultaneously with a human. Whereas delayed facial mimicry looks disingenuous, facial co-expression feels more genuine, since it requires correctly inferring the human's emotional state for timely execution. We find that a robot can learn to predict a forthcoming smile about 839 milliseconds before the human smiles and, using a learned inverse kinematic facial self-model, co-express the smile simultaneously.

    During the data collection phase, the robot generated symmetrical facial expressions, which we expected to cover most situations and to reduce the size of the model. We used an Intel RealSense D435i to capture RGB images and cropped them to 480×320. We logged each motor command value and robot image to form a single data pair without any human labeling.

    Dataset for Paper "Human-Robot Facial Co-expression"

    Overview

    This dataset accompanies the research on human-robot facial co-expression, aiming to enhance nonverbal interaction by training robots to anticipate and simultaneously execute human facial expressions. Our study proposes a method where robots can learn to predict forthcoming human facial expressions and execute them in real time, thereby making the interaction feel more genuine and natural.

    https://doi.org/10.5061/dryad.gxd2547t7

    Description of the data and file structure

    The dataset is organized into several zip files, each containing different components essential for replicating our study's results or for use in related research projects:

    • pred_training_data.zip: Contains the data used for training the predictive model. This dataset is crucial for developing models that predict human facial expressions based on input frames.
    • pred_model.zip: Contains the...
  12. Trojan Detection Software Challenge - Round 4 Holdout Dataset

    • data.nist.gov
    • nist.gov
    • +2more
    Updated Dec 31, 2020
    + more versions
    Cite
    Michael Paul Majurski (2020). Trojan Detection Software Challenge - Round 4 Holdout Dataset [Dataset]. http://doi.org/10.18434/mds2-2372
    Explore at:
    Dataset updated
    Dec 31, 2020
    Dataset provided by
    National Institute of Standards and Technology (http://www.nist.gov/)
    Authors
    Michael Paul Majurski
    License

    https://www.nist.gov/open/license

    Description

    The data being generated and disseminated is the holdout data used to construct trojan detection software solutions. This data, generated at NIST, consists of human level AIs trained to perform image classification. A known percentage of these trained AI models have been poisoned with a known trigger which induces incorrect behavior. This data will be used to develop software solutions for detecting which trained AI models have been poisoned via embedded triggers. This dataset consists of 288 adversarially trained, human level, image classification AI models using a variety of model architectures. The models were trained on synthetically created image data of non-real traffic signs superimposed on road background scenes. Half (50%) of the models have been poisoned with an embedded trigger which causes misclassification of the images when the trigger is present.

  13. Trojan Detection Software Challenge - Round 2 Holdout Dataset

    • data.nist.gov
    • s.cnmilf.com
    • +2more
    Updated Oct 23, 2020
    + more versions
    Cite
    Michael Paul Majurski (2020). Trojan Detection Software Challenge - Round 2 Holdout Dataset [Dataset]. http://doi.org/10.18434/mds2-2322
    Explore at:
    Dataset updated
    Oct 23, 2020
    Dataset provided by
    National Institute of Standards and Technology (http://www.nist.gov/)
    Authors
    Michael Paul Majurski
    License

    https://www.nist.gov/open/license

    Description

    The data being generated and disseminated is the holdout data used to evaluate trojan detection software solutions. This data, generated at NIST, consists of human level AIs trained to perform a variety of tasks (image classification, natural language processing, etc.). A known percentage of these trained AI models have been poisoned with a known trigger which induces incorrect behavior. This data will be used to develop software solutions for detecting which trained AI models have been poisoned via embedded triggers. This dataset consists of 144 trained, human level, image classification AI models using a variety of architectures. The models were trained on synthetically created image data of non-real traffic signs superimposed on road background scenes. Half (50%) of the models have been poisoned with an embedded trigger which causes misclassification of the images when the trigger is present.

  14. Trojan Detection Software Challenge - Round 2 Test Dataset

    • data.nist.gov
    • nist.gov
    • +2more
    Updated Oct 30, 2020
    + more versions
    Cite
    Michael Paul Majurski (2020). Trojan Detection Software Challenge - Round 2 Test Dataset [Dataset]. http://doi.org/10.18434/mds2-2321
    Explore at:
    Dataset updated
    Oct 30, 2020
    Dataset provided by
    National Institute of Standards and Technology (http://www.nist.gov/)
    Authors
    Michael Paul Majurski
    License

    https://www.nist.gov/open/license

    Description

    The data being generated and disseminated is the test data used to evaluate trojan detection software solutions. This data, generated at NIST, consists of human level AIs trained to perform a variety of tasks (image classification, natural language processing, etc.). A known percentage of these trained AI models have been poisoned with a known trigger which induces incorrect behavior. This data will be used to develop software solutions for detecting which trained AI models have been poisoned via embedded triggers. This dataset consists of 144 trained, human level, image classification AI models using a variety of model architectures. The models were trained on synthetically created image data of non-real traffic signs superimposed on road background scenes. Half (50%) of the models have been poisoned with an embedded trigger which causes misclassification of the images when the trigger is present.

  15. Human vs. Machine-Generated Text Stories

    • kaggle.com
    zip
    Updated Aug 4, 2024
    Cite
    Kian Jazayeri (2024). Human vs. Machine-Generated Text Stories [Dataset]. https://www.kaggle.com/datasets/kianjazayeri/human-vs-machine-generated-text-stories
    Explore at:
    zip (2,678,132 bytes)
    Dataset updated
    Aug 4, 2024
    Authors
    Kian Jazayeri
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains a collection of stories designed to facilitate research in distinguishing between human-generated and machine-generated text. The dataset includes 5000 human-generated stories sourced from the ROCStories corpus and machine-generated continuations of these stories produced using the FALCON-7b language model with three different settings.

    Dataset Composition:

    Human-Generated Stories: The original 5000 stories from the ROCStories corpus.

    Machine-Generated Stories (Setting 1): Continuations generated by FALCON-7b with balanced diversity and quality settings (Temperature: 1.0, Top-K Sampling: 50, Top-p Sampling: 0.9).

    Machine-Generated Stories (Setting 2): Continuations generated by FALCON-7b with high creativity and diversity settings (Temperature: 1.5, Top-K Sampling: 100, Top-p Sampling: 0.95).

    Machine-Generated Stories (Setting 3): Continuations generated by FALCON-7b with conservative and deterministic settings (Temperature: 0.7, Top-K Sampling: 20, Top-p Sampling: 0.8).
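The three decoding configurations above can be captured as generation-parameter dictionaries; the key names follow common Hugging Face `generate()` keyword arguments, which is an assumption about how the settings were applied:

```python
# FALCON-7b decoding settings transcribed from the dataset description.
# Key names (temperature, top_k, top_p) assume Hugging Face generate() conventions.
SETTINGS = {
    1: {"temperature": 1.0, "top_k": 50,  "top_p": 0.90},  # balanced diversity/quality
    2: {"temperature": 1.5, "top_k": 100, "top_p": 0.95},  # high creativity/diversity
    3: {"temperature": 0.7, "top_k": 20,  "top_p": 0.80},  # conservative/deterministic
}
print(SETTINGS[2]["temperature"])  # -> 1.5
```

Passing one of these dicts as keyword arguments to a sampling call would reproduce the corresponding setting.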

    Columns:

    Human Story: The original story written by a human.

    Machine-Generated (Setting 1): The continuation of the story generated by the FALCON-7b model with balanced settings.

    Machine-Generated (Setting 2): The continuation of the story generated by the FALCON-7b model with creative settings.

    Machine-Generated (Setting 3): The continuation of the story generated by the FALCON-7b model with conservative settings.
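    The three settings above vary temperature, top-k, and top-p jointly. As a rough illustration of how these parameters filter a next-token distribution (a pure-Python sketch, not FALCON-7b's actual decoding code):

```python
import math

def sample_filter(logits, temperature=1.0, top_k=50, top_p=0.9):
    """Return the renormalized next-token distribution after applying
    temperature scaling, top-k truncation, and top-p (nucleus) truncation."""
    # Temperature scaling: higher T flattens the distribution, lower T sharpens it.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    probs = [math.exp(s - m) for s in scaled]
    total = sum(probs)
    probs = [p / total for p in probs]

    # Top-k: rank tokens by probability, keep at most top_k of them.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]

    # Top-p (nucleus): keep the smallest prefix whose cumulative mass reaches top_p.
    kept, cum = [], 0.0
    for i in ranked:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break

    # Renormalize over the surviving tokens.
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}

# Setting 3 (conservative): low temperature, tight top-k/top-p.
dist = sample_filter([2.0, 1.0, 0.5, 0.1], temperature=0.7, top_k=20, top_p=0.8)
```

    Lower temperature sharpens the distribution before truncation, so the conservative setting typically leaves only the most likely continuations in play.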

    Purpose and Use: This dataset is intended for researchers and practitioners in the fields of Natural Language Processing (NLP) and Machine Learning. It provides a valuable resource for developing and testing models aimed at distinguishing between human and machine-generated text. Applications of this research include improving the detection of AI-generated content, enhancing text generation models, and exploring the nuances of human versus machine creativity in storytelling.

    Acknowledgements:

    The human-generated stories are sourced from the ROCStories corpus: Mostafazadeh, N., Chambers, N., He, X., Parikh, D., Batra, D., Vanderwende, L., Kohli, P., & Allen, J. (2016). A Corpus and Cloze Evaluation for Deeper Understanding of Commonsense Stories. https://doi.org/10.18653/v1/n16-1098

    The machine-generated continuations are created using the FALCON-7b language model.
    License:

    This dataset is shared under the Creative Commons Attribution 4.0 International (CC BY 4.0) license, allowing for sharing, adaptation, and usage with appropriate credit to the original creators.

  16. Data from: Identifying Machine-Paraphrased Plagiarism

    • portalinvestigacio.uib.cat
    • opendatalab.com
    • +2more
    Updated 2021
    Cite
    Wahle, Jan Philip; Ruas, Terry; Foltynek, Tomas; Meuschke, Norman; Gipp, Bela (2021). Identifying Machine-Paraphrased Plagiarism [Dataset]. https://portalinvestigacio.uib.cat/documentos/688b602617bb6239d2d49012
    Explore at:
    Dataset updated
    2021
    Authors
    Wahle, Jan Philip; Ruas, Terry; Foltynek, Tomas; Meuschke, Norman; Gipp, Bela
    Description

    README.txt Title: Identifying Machine-Paraphrased Plagiarism
    Authors: Jan Philip Wahle, Terry Ruas, Tomas Foltynek, Norman Meuschke, and Bela Gipp
    contact email: wahle@gipplab.org; ruas@gipplab.org;
    Venue: iConference
    Year: 2022
    ================================================================
    Dataset Description:
    Training: 200,767 paragraphs (98,282 original, 102,485 paraphrased) extracted from 8,024 Wikipedia (English) articles (4,012 original, 4,012 paraphrased using the SpinBot API).
    Testing:
    SpinBot:
    arXiv - Original - 20,966; Spun - 20,867
    Theses - Original - 5,226; Spun - 3,463
    Wikipedia - Original - 39,241; Spun - 40,729

    SpinnerChief-4W:
    arXiv - Original - 20,966; Spun - 21,671
    Theses - Original - 2,379; Spun - 2,941
    Wikipedia - Original - 39,241; Spun - 39,618

    SpinnerChief-2W:
    arXiv - Original - 20,966; Spun - 21,719
    Theses - Original - 2,379; Spun - 2,941
    Wikipedia - Original - 39,241; Spun - 39,697
    ================================================================
    Dataset Structure: [human_evaluation] folder: human evaluation to identify human-generated text and machine-paraphrased text. It contains the files (original and spun) as well as the answer key for the survey performed with human subjects (all data is anonymized for privacy reasons). NNNNN.txt - whole document from which an extract was taken for human evaluation
    key.txt.zip - information about each case (ORIG/SPUN)
    results.xlsx - raw results downloaded from the survey tool (the extracts which humans judged are in the first line)
    results-corrected.xlsx - at the very beginning, there was a mistake in one question (wrong extract). These results were excluded.
    [automated_evaluation]: contains all files used for the automated evaluation considering SpinBot and SpinnerChief. Each paraphrase-tool folder contains [corpus] and [vectors] sub-folders. For [spinnerchief], two variations are included, with a 4-word-changing ratio (default) and a 2-word-changing ratio. The [vectors] sub-folder contains the average of all word vectors for each paragraph. Each line has the number of dimensions of the word-embedding technique used (see paper for more details) followed by its respective class label (mg or og). Each file belongs to one class, either "mg" or "og". The values are comma-separated (.csv); files with the .arff extension can be read as normal .txt files. The word-embedding technique used is described in the file name with the following structure:
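    The [vectors] files described above (one averaged word vector per paragraph, followed by its mg/og class label) could be produced along these lines; the toy embedding table and helper names here are illustrative, not the paper's code:

```python
def paragraph_vector(tokens, embeddings):
    """Average the word vectors of all tokens that have an embedding."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        raise ValueError("no known tokens in paragraph")
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

def to_vector_line(vector, label):
    """Comma-separated vector values followed by the class label (mg or og)."""
    return ",".join(f"{x:.6f}" for x in vector) + f",{label}"

# Toy 3-dimensional embeddings (illustrative only).
emb = {"the": [1.0, 0.0, 0.0], "cat": [0.0, 1.0, 0.0], "sat": [0.0, 0.0, 1.0]}
line = to_vector_line(paragraph_vector(["the", "cat", "sat"], emb), "og")
```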

  17. Data from: Deep learning four decades of human migration: datasets

    • zenodo.org
    csv, nc
    Updated Oct 13, 2025
    Cite
    Thomas Gaskin; Thomas Gaskin; Guy Abel; Guy Abel (2025). Deep learning four decades of human migration: datasets [Dataset]. http://doi.org/10.5281/zenodo.17344747
    Explore at:
    csv, ncAvailable download formats
    Dataset updated
    Oct 13, 2025
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Thomas Gaskin; Thomas Gaskin; Guy Abel; Guy Abel
    License

    Attribution 4.0 International (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This Zenodo repository contains all migration flow estimates associated with the paper "Deep learning four decades of human migration." Evaluation code, training data, trained neural networks, and smaller flow datasets are available in the main GitHub repository, which also provides detailed instructions on data sourcing. Due to file size limits, the larger datasets are archived here.

    Data is available in both NetCDF (.nc) and CSV (.csv) formats. The NetCDF format is more compact and pre-indexed, making it suitable for large files. In Python, datasets can be opened as xarray.Dataset objects, enabling coordinate-based data selection.

    Each dataset uses the following coordinate conventions:

    • Year: 1990–2023
    • Birth ISO: Country of birth (UN ISO3)
    • Origin ISO: Country of origin (UN ISO3)
    • Destination ISO: Destination country (UN ISO3)
    • Country ISO: Used for net migration data (UN ISO3)

    The following data files are provided:

    • T.nc: Full table of flows disaggregated by country of birth. Dimensions: Year, Birth ISO, Origin ISO, Destination ISO
    • flows.nc: Total origin-destination flows (equivalent to T summed over Birth ISO). Dimensions: Year, Origin ISO, Destination ISO
    • net_migration.nc: Net migration data by country. Dimensions: Year, Country ISO
    • stocks.nc: Stock estimates for each country pair. Dimensions: Year, Origin ISO (corresponding to Birth ISO), Destination ISO
    • test_flows.nc: Flow estimates on a randomly selected set of test edges, used for model validation
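    Assuming xarray is installed, a flow value can be selected by coordinates as the description suggests. The in-memory dataset below is a tiny stand-in for flows.nc; with the real file you would call xr.open_dataset("flows.nc") instead:

```python
import numpy as np
import xarray as xr

# A tiny stand-in for flows.nc: Year x Origin ISO x Destination ISO.
years = [1990, 1991]
isos = ["DEU", "FRA"]
flows = xr.Dataset(
    {"mean": (("Year", "Origin ISO", "Destination ISO"),
              np.arange(8, dtype=float).reshape(2, 2, 2))},
    coords={"Year": years, "Origin ISO": isos, "Destination ISO": isos},
)

# Coordinate-based selection, as with the real file:
deu_to_fra_1991 = flows["mean"].sel(
    {"Year": 1991, "Origin ISO": "DEU", "Destination ISO": "FRA"}
)
```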

    Additionally, two CSV files are provided for convenience:

    • mig_unilateral.csv: Unilateral migration estimates per country, comprising:
      • imm: Total immigration flows
      • emi: Total emigration flows
      • net: Net migration
      • imm_pop: Total immigrant population (non-native-born)
      • emi_pop: Total emigrant population (living abroad)
    • mig_bilateral.csv: Bilateral flow data, comprising:
      • mig_prev: Total origin-destination flows
      • mig_brth: Total birth-destination flows, where Origin ISO reflects place of birth
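    If net migration follows the usual convention net = imm - emi (an assumption; the description above does not state the sign convention), mig_unilateral.csv can be sanity-checked with pandas. The toy rows below stand in for the real file, which would be loaded with pd.read_csv("mig_unilateral.csv"):

```python
import pandas as pd

# Toy stand-in for mig_unilateral.csv.
df = pd.DataFrame({
    "Country ISO": ["DEU", "FRA"],
    "imm": [100.0, 80.0],
    "emi": [40.0, 90.0],
    "net": [60.0, -10.0],
})

# Check that net agrees with imm - emi on every row (up to rounding).
consistent = (df["net"] - (df["imm"] - df["emi"])).abs().lt(1e-6).all()
```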

    Each dataset includes a mean variable (mean estimate) and a std variable (standard deviation of the estimate).

    An ISO3 conversion table is also provided.

  18. Trojan Detection Software Challenge - image-classification-feb2021-train

    • data.nist.gov
    • s.cnmilf.com
    • +2more
    Updated Dec 14, 2020
    Cite
    Michael Paul Majurski (2020). Trojan Detection Software Challenge - image-classification-feb2021-train [Dataset]. http://doi.org/10.18434/mds2-2340
    Explore at:
    Dataset updated
    Dec 14, 2020
    Dataset provided by
    National Institute of Standards and Technology: http://www.nist.gov/
    Authors
    Michael Paul Majurski
    License

    https://www.nist.gov/open/license

    Description

    Round 4 Train Dataset The data being generated and disseminated is the training data used to construct trojan detection software solutions. This data, generated at NIST, consists of human level AIs trained to perform image classification. A known percentage of these trained AI models have been poisoned with a known trigger which induces incorrect behavior. This data will be used to develop software solutions for detecting which trained AI models have been poisoned via embedded triggers. This dataset consists of 1008 adversarially trained, human level, image classification AI models using a variety of model architectures. The models were trained on synthetically created image data of non-real traffic signs superimposed on road background scenes. Half (50%) of the models have been poisoned with an embedded trigger which causes misclassification of the images when the trigger is present.

  19. Bobby Cole Ambient New Age Atmospheres (Machine Learning (ML) Data) |...

    • datarade.ai
    .wav
    Updated Sep 17, 2025
    Cite
    Bobby Cole Music (2025). Bobby Cole Ambient New Age Atmospheres (Machine Learning (ML) Data) | Original, Premium and Custom Record | 16GB+ Collection [Dataset]. https://datarade.ai/data-products/bobby-cole-ambient-new-age-atmospheres-machine-learning-ml-bobby-cole-music
    Explore at:
    .wavAvailable download formats
    Dataset updated
    Sep 17, 2025
    Dataset authored and provided by
    Bobby Cole Music
    Area covered
    Martinique, Mozambique, Latvia, Sri Lanka, Cayman Islands, Bermuda, Denmark, Kyrgyzstan, United Republic of, Vietnam
    Description

    The Ambience Dataset is a curated collection of original ambient recordings designed to support machine learning in sound classification, emotional modeling, and spatial audio recognition. These tracks feature evolving textures, minimal melodic content, and carefully layered soundscapes that reflect environments both natural and abstract. From soothing drones to atmospheric washes and immersive pads, this dataset captures the subtle complexity of ambient music.

    Each track is paired with detailed metadata including tempo (if applicable), duration, sonic descriptors, frequency characteristics, and mood annotations—offering vital training data for MIR, generative ambience, ambient sound classification, and more. With no AI-generated content, every track is 100% human-produced and studio-crafted, providing high-fidelity, expressive audio for advanced AI audio development.

  20. Bobby Cole Music Classical Music (Machine Learning (ML) Data) | Original,...

    • datarade.ai
    .wav
    Updated Sep 16, 2025
    Cite
    Bobby Cole Music (2025). Bobby Cole Music Classical Music (Machine Learning (ML) Data) | Original, Premium and Custom Record | 55GB+ Collection (Copy) [Dataset]. https://datarade.ai/data-products/bobby-cole-music-classical-music-machine-learning-ml-data-bobby-cole-music-e937
    Explore at:
    .wavAvailable download formats
    Dataset updated
    Sep 16, 2025
    Dataset authored and provided by
    Bobby Cole Music
    Area covered
    Gambia, Faroe Islands, Maldives, Philippines, Togo, Lao People's Democratic Republic, Ukraine, Seychelles, Azerbaijan, Romania
    Description

    The Nursery Rhymes Dataset is a charming collection of custom-composed children’s songs that replicate the classic nursery rhyme format—featuring repetitive phrasing, rhyme schemes, melodic simplicity, and age-appropriate pacing. Instruments include piano, glockenspiel, ukulele, soft drums, and warm vocal tones designed to be soothing, clear, and accessible for young learners.

    Each track includes metadata detailing tempo, rhyme patterns, key, syllable count, verse structure, and lyrical themes. This enables AI models to learn patterns in early childhood music such as phoneme repetition, call-and-response, learning cues, and safe tonal qualities.

    All content is 100% human-created in a professional studio, with absolutely no AI-generated melodies or vocals. The music is designed to mimic the familiar structure of classic nursery rhymes while offering entirely original content. This makes it perfect for training models that support children’s language acquisition, music education, vocal detection, emotional regulation tools, and intelligent kids’ media platforms.

    From educational games to voice assistants for children, this dataset provides a foundation of safe, consistent, and musically sound data for any child-focused AI application.
