License: MIT License (https://opensource.org/licenses/MIT)
This dataset contains a collection of AI-generated and human-written text samples. It is designed for research in AI text detection, natural language processing, and machine learning tasks such as binary classification.
The original dataset, created by Shane Gerami, provided a large set of examples for distinguishing between AI and human writing.
In this version, I have balanced the dataset to ensure equal representation of AI and human text. This makes it more suitable for training and evaluating machine learning models without bias toward one class.
🔍 Features
- Text: The written passage (AI or human).
- Label:
- 0 → Human-written
- 1 → AI-generated
⚡ Use Cases
- Training models to detect AI-generated text
- Benchmarking text classification approaches
- Research in AI detection, authorship attribution, and content moderation
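As a quick illustration of the classification use case, here is a minimal baseline sketch. The `Text`/`Label` columns follow the feature description above; the CSV file name is hypothetical.

```python
# Sketch: TF-IDF + logistic regression baseline for AI-vs-human detection.
# Assumes columns "Text" and "Label" (0 = human, 1 = AI) as described above;
# the file name "ai_human_balanced.csv" is hypothetical.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

df = pd.read_csv("ai_human_balanced.csv")
X_train, X_test, y_train, y_test = train_test_split(
    df["Text"], df["Label"], test_size=0.2, random_state=42, stratify=df["Label"]
)

vec = TfidfVectorizer(max_features=50_000, ngram_range=(1, 2))
clf = LogisticRegression(max_iter=1000)
clf.fit(vec.fit_transform(X_train), y_train)

print(classification_report(y_test, clf.predict(vec.transform(X_test))))
```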
🙌 Acknowledgments
Original dataset by Shane Gerami: AI vs Human Text
This balanced version prepared and published by @arjunverma2004
License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0) (https://creativecommons.org/licenses/by-nc/4.0/)

Dataset Overview: This dataset is designed for Urdu text classification.
License: Database Contents License (DbCL) v1.0 (http://opendatacommons.org/licenses/dbcl/1.0/)
ShutterStock AI vs. Human-Generated Image Dataset
This dataset is curated to facilitate research in distinguishing AI-generated images from human-created ones, leveraging ShutterStock data. As AI-generated imagery becomes more sophisticated, developing models that can classify and analyze such images is crucial for applications in content moderation, digital forensics, and media authenticity verification.
With the rise of generative AI models like Stable Diffusion, DALL·E, and MidJourney, the ability to differentiate between synthetic and real images has become a crucial challenge. This dataset offers a structured way to train AI models on this task, making it a valuable resource for both academic research and practical applications.
Explore the dataset and contribute to advancing AI-generated content detection!
If you haven't installed the Kaggle API, run:
```bash
pip install kaggle
```
Then, download your kaggle.json API key from your Kaggle account settings and move it to ~/.kaggle/ (Linux/Mac) or `C:\Users\<YourUser>\.kaggle\` (Windows).

Next, download the dataset with the Kaggle CLI:

```bash
kaggle datasets download -d shreyasraghav/shutterstock-dataset-for-ai-vs-human-gen-image
```
Once downloaded, extract the dataset using:
```bash
unzip shutterstock-dataset-for-ai-vs-human-gen-image.zip -d dataset_folder
```
Now your dataset is ready to use! 🚀
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
Artificial Intelligence (AI) has emerged as a critical challenge to the authenticity of journalistic content, raising concerns over the ease with which artificially generated articles can mimic human-written news. This study focuses on using machine learning to identify distinguishing features, or “stylistic fingerprints,” of AI-generated and human-authored journalism. By analyzing these unique characteristics, we aim to classify news pieces with high accuracy, enhancing our ability to verify the authenticity of digital news.

To conduct this study, we gathered a balanced dataset of 150 original journalistic articles and their 150 AI-generated counterparts, sourced from popular news websites. A variety of lexical, syntactic, and readability features were extracted from each article to serve as input data for training machine learning models. Five classifiers were then trained to evaluate how accurately they could distinguish between authentic and artificial articles, with each model learning specific patterns and variations in writing style.

In addition to model training, BERTopic, a topic modeling technique, was applied to extract salient keywords from the journalistic articles. These keywords were used to prompt Google’s Gemini, an AI text generation model, to create artificial articles on the same topics as the original human-written pieces. This ensured a high level of relevance between authentic and AI-generated articles, which added complexity to the classification task.

Among the five classifiers tested, the Random Forest model delivered the best performance, achieving an accuracy of 98.3% along with high precision (0.984), recall (0.983), and F1-score (0.983). Feature importance analyses were conducted using methods like Random Forest Feature Importance, Analysis of Variance (ANOVA), Mutual Information, and Recursive Feature Elimination. This analysis revealed that the top five discriminative features were sentence length range, paragraph length coefficient of variation, verb ratio, sentence complexity tags, and paragraph length range. These features appeared to encapsulate subtle but meaningful stylistic differences between human and AI-generated content.

This research makes a significant contribution to combating disinformation by offering a robust method for authenticating journalistic content. By employing machine learning to identify subtle linguistic patterns, this study not only advances our understanding of AI in journalism but also enhances the tools available to ensure the credibility of news in the digital age.
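A minimal sketch of the kind of pipeline the study describes appears below; the two feature functions are simplified stand-ins, not the paper's exact lexical/syntactic/readability feature set.

```python
# Sketch: train a Random Forest on simple stylistic features, loosely
# following the study's setup. The two features below are simplified
# stand-ins, not the paper's exact feature set.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def sentence_length_range(text: str) -> int:
    lengths = [len(s.split()) for s in text.split(".") if s.strip()]
    return max(lengths) - min(lengths) if lengths else 0

def mean_word_length(text: str) -> float:
    words = text.split()
    return float(np.mean([len(w) for w in words])) if words else 0.0

def featurize(texts):
    return np.array([[sentence_length_range(t), mean_word_length(t)] for t in texts])

# Toy corpus: labels 0 = human-written, 1 = AI-generated.
texts = [
    "Short one. A much longer sentence follows here with many more words indeed.",
    "Uniform sentences. Uniform sentences.",
]
labels = [0, 1]

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(featurize(texts), labels)
print(clf.feature_importances_)
```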
License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0) (https://creativecommons.org/licenses/by-nc/4.0/)
This dataset consists of audio recordings in Indonesian language, categorized into two distinct classes: human voices (real) and synthetic voices generated using artificial intelligence (AI). Each class comprises 21 audio files, resulting in a total of 42 audio files. Each recording has a duration ranging from approximately 4 to 9 minutes, with an average length of around 6 minutes per file. All recordings are provided in WAV format and accompanied by a CSV file containing detailed duration metadata for each audio file.
This dataset is suitable for research and applications in speech recognition, voice authenticity detection, audio analysis, and related fields. It enables comparative analysis between natural Indonesian speech and AI-generated synthetic speech.
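As a quick sanity check, per-file durations can be recomputed from the WAV files and compared against the CSV metadata. The CSV file name and column names below are assumptions, not the dataset's documented schema.

```python
# Sketch: recompute WAV durations and compare with the bundled CSV metadata.
# The CSV name "durations.csv" and its column names are assumptions.
import csv
import wave

def wav_duration_seconds(path: str) -> float:
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()

with open("durations.csv", newline="") as f:
    for row in csv.DictReader(f):
        measured = wav_duration_seconds(row["filename"])
        print(row["filename"], round(measured, 2), row["duration"])
```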
License: MIT License (https://opensource.org/licenses/MIT)
This dataset contains two main collections of texts:
- AI-Generated Texts: Produced using ChatGPT, Gemini, Grok, and DeepSeek in response to academic-style prompts across multiple domains, including Mathematics, Biology, History, Economics, Computer Science, and IELTS-style essays.
- Human-Written Texts: Collected from authentic academic sources such as arXiv, including metadata (author, year, and source).
To simulate diverse writing conditions, the dataset is extended with different variations of AI outputs, such as paraphrasing, translation, and humanization. This allows researchers to study AI text detection, authorship classification, and style transfer.
Texts produced by ChatGPT, Gemini, Grok, and DeepSeek in response to academic prompts. Each prompt specifies a subject area and includes formatting restrictions to avoid the use of mathematical formulas, symbols, lists, and special formatting.
Prompts for Generated Texts:
| Prompt | Subject |
|---|---|
| "Explain the fundamental principles of calculus, including differentiation and integration, with real-world applications. Instructions: 1) Write about 400 words. 2) Avoid Mathematical formulas and symbols. 3) If possible avoid itemization. 4) Avoid bold letters, headers, etc." | Mathematics |
| "Explain the process of cellular respiration and its role in energy production within living organisms. Instructions: 1) Write about 400 words. 2) Avoid Mathematical formulas and symbols. 3) If possible avoid itemization. 4) Avoid bold letters, headers, etc." | Biology |
| "Analyze the causes and consequences of the Industrial Revolution, highlighting its impact on global economies and societies. Instructions: 1) Write about 400 words. 2) Avoid Mathematical formulas and symbols. 3) If possible avoid itemization. 4) Avoid bold letters, headers, etc." | History |
| "Explain the principles of supply and demand and their effects on market equilibrium, with examples. Instructions: 1) Write about 400 words. 2) Avoid Mathematical formulas and symbols. 3) If possible avoid itemization. 4) Avoid bold letters, headers, etc." | Economics |
| "Describe the basics of machine learning, including supervised and unsupervised learning techniques. Instructions: 1) Write about 400 words. 2) Avoid Mathematical formulas and symbols. 3) If possible avoid itemization. 4) Avoid bold letters, headers, etc." | Computer Science |
| "Provide 400-word passage written at an IELTS Band 6 level: Government investment in the arts, such as music and theatre, is a waste of money. Governments must invest this money in public services instead. To what extent do you agree with this statement?" | IELTS Essay |
Reworded versions of the AI-generated texts.
- Obtained using QuillBot paraphrasing tool (default settings).
- Example instruction: “Paraphrase the following text to avoid direct repetition but keep the meaning the same.”
AI-generated texts translated into another language and back into English to simulate style distortion.
- Step 1: Translated into Russian with Yandex Translate.
- Step 2: Back-translated into English using Google Translate.
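For reference, the round trip can be approximated programmatically. This sketch uses the third-party deep-translator package for both hops, which is an assumption: the dataset itself used the Yandex Translate and Google Translate tools as described above, so outputs will differ.

```python
# Sketch: English -> Russian -> English round trip to simulate style distortion.
# Uses deep-translator for both hops; the dataset used Yandex Translate for
# en->ru and Google Translate for ru->en, so results will differ.
from deep_translator import GoogleTranslator

def back_translate(text: str) -> str:
    russian = GoogleTranslator(source="en", target="ru").translate(text)
    return GoogleTranslator(source="ru", target="en").translate(russian)

print(back_translate("Calculus studies continuous change using derivatives and integrals."))
```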
AI-generated texts rewritten to resemble writing by a non-native English speaker at approximately IELTS Band 6 level. The style reflects competent English usage but with minor errors and awkward phrasing.
Prompt for Humanized Texts:
Rewrite the following text passage to reflect the writing style of a non-native English speaker who has achieved a band level 6 in IELTS writing. This level indicates a competent user of English, but with some inaccuracies, inappropriate usage, and misunderstandings. The text should be mostly clear but may contain occasional errors in grammar, vocabulary, and coherence.
Text Passage for Rewriting: [Insert text here]
Note: Aim for errors that are typical of an IELTS band level 6 writer. These could include minor grammatical mistakes, slight misuse of vocabulary, and occasional awkward phrasing. However, the overall meaning of the text should remain clear and understandable.
Word Count: approximately 400
Authentic texts authored by researchers.
- Sources: arXiv.org.
- Metadata includes author name, publication year, and source.
License: https://www.nist.gov/open/license
Round 1 Training Dataset

The data being generated and disseminated is the training data used to construct trojan detection software solutions. This data, generated at NIST, consists of human level AIs trained to perform a variety of tasks (image classification, natural language processing, etc.). A known percentage of these trained AI models have been poisoned with a known trigger which induces incorrect behavior. This data will be used to develop software solutions for detecting which trained AI models have been poisoned via embedded triggers. This dataset consists of 1000 trained, human level, image classification AI models using the following architectures: Inception-v3, DenseNet-121, and ResNet50. The models were trained on synthetically created image data of non-real traffic signs superimposed on road background scenes. Half (50%) of the models have been poisoned with an embedded trigger which causes misclassification of the images when the trigger is present.

Errata: This dataset had a software bug in the trigger embedding code that caused 4 models trained for this dataset to have a ground truth value of 'poisoned' but which did not contain any embedded triggers. These models should not be used.

Models Without a Trigger Embedded:
- id-00000184
- id-00000599
- id-00000858
- id-00001088

Google Drive Mirror: https://drive.google.com/open?id=1uwVt3UCRL2fCX9Xvi2tLoz_z-DwbU6Ce
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
AH&AITD is a comprehensive benchmark dataset designed to support the evaluation of AI-generated text detection tools. The dataset contains 11,580 samples spanning both human-written and AI-generated content across multiple domains. It was developed to address limitations in previous datasets, particularly in terms of diversity, scale, and real-world applicability. Its purpose is to facilitate research in the detection of AI-generated text by providing a diverse, multi-domain dataset. This dataset enables fair benchmarking of detection tools across various writing styles and content categories.

Composition:
1. Human-Written Samples (Total: 5,790), collected from:
- Open Web Text (2,343 samples)
- Blogs (196 samples)
- Web Text (397 samples)
- Q&A Platforms (670 samples)
- News Articles (430 samples)
- Opinion Statements (1,549 samples)
- Scientific Research Abstracts (205 samples)
2. AI-Generated Samples (Total: 5,790), generated using:
- ChatGPT (1,130 samples)
- GPT-4 (744 samples)
- Paraphrase Models (1,694 samples)
- GPT-2 (328 samples)
- GPT-3 (296 samples)
- DaVinci (GPT-3.5 variant) (433 samples)
- GPT-3.5 (364 samples)
- OPT-IML (406 samples)
- Flan-T5 (395 samples)

Citation: Akram, A. (2023). AH&AITD: Arslan’s Human and AI Text Database. [Dataset]. Associated with the article: An Empirical Study of AI-Generated Text Detection Tools. Advances in Machine Learning & Artificial Intelligence, 4(2), 44–55.
License: https://www.nist.gov/open/license
The data being generated and disseminated is the training data used to construct trojan detection software solutions. This data, generated at NIST, consists of human level AIs trained to perform image classification. A known percentage of these trained AI models have been poisoned with a known trigger which induces incorrect behavior. This data will be used to develop software solutions for detecting which trained AI models have been poisoned via embedded triggers. This dataset consists of 1104 trained, human level, image classification AI models using a variety of model architectures. The models were trained on synthetically created image data of non-real traffic signs superimposed on road background scenes. Half (50%) of the models have been poisoned with an embedded trigger which causes misclassification of the images when the trigger is present.
Large language models are enabling rapid progress in robotic verbal communication, but nonverbal communication is not keeping pace. Physical humanoid robots struggle to express and communicate using facial movement, relying primarily on voice. The challenge is twofold: First, the actuation of an expressively versatile robotic face is mechanically challenging. A second challenge is knowing what expression to generate so that they appear natural, timely, and genuine. Here we propose that both barriers can be alleviated by training a robot to anticipate future facial expressions and execute them simultaneously with a human. Whereas delayed facial mimicry looks disingenuous, facial co-expression feels more genuine since it requires correctly inferring the human's emotional state for timely execution. We find that a robot can learn to predict a forthcoming smile about 839 milliseconds before the human smiles and, using a learned inverse kinematic facial self-model, co-express the smile simultaneously.

During the data collection phase, the robot generated symmetrical facial expressions, which we thought could cover most situations and could reduce the size of the model. We used an Intel RealSense D435i to capture RGB images and cropped them to 480 × 320. We logged each motor command value and robot images to form a single data pair without any human labeling.

# Dataset for Paper "Human-Robot Facial Co-expression"
This dataset accompanies the research on human-robot facial co-expression, aiming to enhance nonverbal interaction by training robots to anticipate and simultaneously execute human facial expressions. Our study proposes a method where robots can learn to predict forthcoming human facial expressions and execute them in real time, thereby making the interaction feel more genuine and natural.
https://doi.org/10.5061/dryad.gxd2547t7
The dataset is organized into several zip files, each containing different components essential for replicating our study's results or for use in related research projects:
License: https://www.nist.gov/open/license
The data being generated and disseminated is the holdout data used to construct trojan detection software solutions. This data, generated at NIST, consists of human level AIs trained to perform image classification. A known percentage of these trained AI models have been poisoned with a known trigger which induces incorrect behavior. This data will be used to develop software solutions for detecting which trained AI models have been poisoned via embedded triggers. This dataset consists of 288 adversarially trained, human level, image classification AI models using a variety of model architectures. The models were trained on synthetically created image data of non-real traffic signs superimposed on road background scenes. Half (50%) of the models have been poisoned with an embedded trigger which causes misclassification of the images when the trigger is present.
License: https://www.nist.gov/open/license
The data being generated and disseminated is the holdout data used to evaluate trojan detection software solutions. This data, generated at NIST, consists of human level AIs trained to perform a variety of tasks (image classification, natural language processing, etc.). A known percentage of these trained AI models have been poisoned with a known trigger which induces incorrect behavior. This data will be used to develop software solutions for detecting which trained AI models have been poisoned via embedded triggers. This dataset consists of 144 trained, human level, image classification AI models using a variety of architectures. The models were trained on synthetically created image data of non-real traffic signs superimposed on road background scenes. Half (50%) of the models have been poisoned with an embedded trigger which causes misclassification of the images when the trigger is present.
License: https://www.nist.gov/open/license
The data being generated and disseminated is the test data used to evaluate trojan detection software solutions. This data, generated at NIST, consists of human level AIs trained to perform a variety of tasks (image classification, natural language processing, etc.). A known percentage of these trained AI models have been poisoned with a known trigger which induces incorrect behavior. This data will be used to develop software solutions for detecting which trained AI models have been poisoned via embedded triggers. This dataset consists of 144 trained, human level, image classification AI models using a variety of model architectures. The models were trained on synthetically created image data of non-real traffic signs superimposed on road background scenes. Half (50%) of the models have been poisoned with an embedded trigger which causes misclassification of the images when the trigger is present.
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
This dataset contains a collection of stories designed to facilitate research in distinguishing between human-generated and machine-generated text. The dataset includes 5000 human-generated stories sourced from the ROCStories corpus and machine-generated continuations of these stories produced using the FALCON-7b language model with three different settings.
Dataset Composition:
Human-Generated Stories: The original 5000 stories from the ROCStories corpus.
Machine-Generated Stories (Setting 1): Continuations generated by FALCON-7b with balanced diversity and quality settings (Temperature: 1.0, Top-K Sampling: 50, Top-p Sampling: 0.9).
Machine-Generated Stories (Setting 2): Continuations generated by FALCON-7b with high creativity and diversity settings (Temperature: 1.5, Top-K Sampling: 100, Top-p Sampling: 0.95).
Machine-Generated Stories (Setting 3): Continuations generated by FALCON-7b with conservative and deterministic settings (Temperature: 0.7, Top-K Sampling: 20, Top-p Sampling: 0.8).
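For reference, here is a minimal sketch of how continuations under these three sampling settings could be produced with Hugging Face transformers. The model ID, prompt, and generation length are assumptions; the dataset's exact generation code is not included here.

```python
# Sketch: generating story continuations with Falcon-7B under the three
# sampling settings described above. The model ID "tiiuae/falcon-7b",
# prompt, and generation length are assumptions, not the dataset's code.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")
model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-7b", device_map="auto")

settings = {
    "balanced":     dict(temperature=1.0, top_k=50,  top_p=0.9),
    "creative":     dict(temperature=1.5, top_k=100, top_p=0.95),
    "conservative": dict(temperature=0.7, top_k=20,  top_p=0.8),
}

prompt = "Anna packed her bags the night before the trip."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
for name, params in settings.items():
    out = model.generate(**inputs, do_sample=True, max_new_tokens=60, **params)
    print(name, tokenizer.decode(out[0], skip_special_tokens=True))
```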
Columns:
Human Story: The original story written by a human.
Machine-Generated (Setting 1): The continuation of the story generated by the FALCON-7b model with balanced settings.
Machine-Generated (Setting 2): The continuation of the story generated by the FALCON-7b model with creative settings.
Machine-Generated (Setting 3): The continuation of the story generated by the FALCON-7b model with conservative settings.
Purpose and Use: This dataset is intended for researchers and practitioners in the fields of Natural Language Processing (NLP) and Machine Learning. It provides a valuable resource for developing and testing models aimed at distinguishing between human and machine-generated text. Applications of this research include improving the detection of AI-generated content, enhancing text generation models, and exploring the nuances of human versus machine creativity in storytelling.
Acknowledgements:
The human-generated stories are sourced from the ROCStories corpus: Mostafazadeh, N., Chambers, N., He, X., Parikh, D., Batra, D., Vanderwende, L., Kohli, P., & Allen, J. (2016). A Corpus and Cloze Evaluation for Deeper Understanding of Commonsense Stories. https://doi.org/10.18653/v1/n16-1098
The machine-generated continuations are created using the FALCON-7b language model.

License:
This dataset is shared under the Creative Commons Attribution 4.0 International (CC BY 4.0) license, allowing for sharing, adaptation, and usage with appropriate credit to the original creators.
README.txt

Title: Identifying Machine-Paraphrased Plagiarism
Authors: Jan Philip Wahle, Terry Ruas, Tomas Foltynek, Norman Meuschke, and Bela Gipp
contact email: wahle@gipplab.org; ruas@gipplab.org;
Venue: iConference
Year: 2022
================================================================
Dataset Description:

Training:
200,767 paragraphs (98,282 original, 102,485 paraphrased) extracted from 8,024 Wikipedia (English) articles (4,012 original, 4,012 paraphrased using the SpinBot API).

Testing:
SpinBot:
arXiv - Original - 20,966; Spun - 20,867
Theses - Original - 5,226; Spun - 3,463
Wikipedia - Original - 39,241; Spun - 40,729
SpinnerChief-4W:
arXiv - Original - 20,966; Spun - 21,671
Theses - Original - 2,379; Spun - 2,941
Wikipedia - Original - 39,241; Spun - 39,618
SpinnerChief-2W:
arXiv - Original - 20,966; Spun - 21,719
Theses - Original - 2,379; Spun - 2,941
Wikipedia - Original - 39,241; Spun - 39,697
================================================================
Dataset Structure:

[human_evaluation] folder: human evaluation to identify human-generated text and machine-paraphrased text. It contains the files (original and spun) as well as the answer key for the survey performed with human subjects (all data is anonymized for privacy reasons).
NNNNN.txt - whole document from which an extract was taken for human evaluation
key.txt.zip - information about each case (ORIG/SPUN)
results.xlsx - raw results downloaded from the survey tool (the extracts which humans judged are in the first line)
results-corrected.xlsx - corrected results; at the very beginning of the survey there was a mistake in one question (wrong extract), and the affected responses were excluded
[automated_evaluation]: contains all files used for the automated evaluation considering SpinBot and SpinnerChief. Each paraphrase tool folder contains [corpus] and [vectors] sub-folders. For [spinnerchief], two variations are included: a 4-word-changing ratio (default) and a 2-word-changing ratio. The [vectors] sub-folder contains the average of all word vectors for each paragraph. Each line has the number of dimensions of the word embedding technique used (see paper for more details) followed by its respective class (i.e., label mg or og). Each file belongs to one class, either "mg" or "og". The values are comma-separated (.csv); the files use the .arff extension but can be read as normal .txt files. The word embedding technique used is described in the file name with the following structure:
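A minimal sketch for loading one of these [vectors] files, assuming the comma-separated layout described above; the file name is hypothetical, and any ARFF header lines are skipped defensively.

```python
# Sketch: read a vector file where each line is
# <embedding values...>,<class label ("mg" or "og")>, comma-separated.
# The file name is hypothetical; adjust to the actual [vectors] contents.
import numpy as np

features, labels = [], []
with open("example-embedding.arff") as f:
    for line in f:
        line = line.strip()
        if not line or line.startswith(("@", "%")):  # skip any ARFF header lines
            continue
        *values, label = line.split(",")
        features.append([float(v) for v in values])
        labels.append(label)

X = np.array(features)
print(X.shape, sorted(set(labels)))
```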
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
This Zenodo repository contains all migration flow estimates associated with the paper "Deep learning four decades of human migration." Evaluation code, training data, trained neural networks, and smaller flow datasets are available in the main GitHub repository, which also provides detailed instructions on data sourcing. Due to file size limits, the larger datasets are archived here.
Data is available in both NetCDF (.nc) and CSV (.csv) formats. The NetCDF format is more compact and pre-indexed, making it suitable for large files. In Python, datasets can be opened as xarray.Dataset objects, enabling coordinate-based data selection.
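For example, a flow file can be opened and sliced by coordinates as follows; the file name and coordinate value are hypothetical, while the mean and std variables are described in this record.

```python
# Sketch: open a NetCDF flow file and select a slice by coordinates.
# "mig_prev.nc" and the Year value are hypothetical; the "mean" and "std"
# variables are described in this record.
import xarray as xr

ds = xr.open_dataset("mig_prev.nc")
flow_2000 = ds["mean"].sel(Year=2000)   # mean estimate for one year
print(flow_2000)
```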
Each dataset uses the following coordinate conventions:
The following data files are provided:
T summed over Birth ISO). Dimensions: Year, Origin ISO, Destination ISO

Additionally, two CSV files are provided for convenience:

- imm: Total immigration flows
- emi: Total emigration flows
- net: Net migration
- imm_pop: Total immigrant population (non-native-born)
- emi_pop: Total emigrant population (living abroad)
- mig_prev: Total origin-destination flows
- mig_brth: Total birth-destination flows, where Origin ISO reflects place of birth

Each dataset includes a mean variable (mean estimate) and a std variable (standard deviation of the estimate).
An ISO3 conversion table is also provided.
License: https://www.nist.gov/open/license
Round 4 Train Dataset

The data being generated and disseminated is the training data used to construct trojan detection software solutions. This data, generated at NIST, consists of human level AIs trained to perform image classification. A known percentage of these trained AI models have been poisoned with a known trigger which induces incorrect behavior. This data will be used to develop software solutions for detecting which trained AI models have been poisoned via embedded triggers. This dataset consists of 1008 adversarially trained, human level, image classification AI models using a variety of model architectures. The models were trained on synthetically created image data of non-real traffic signs superimposed on road background scenes. Half (50%) of the models have been poisoned with an embedded trigger which causes misclassification of the images when the trigger is present.
The Ambience Dataset is a curated collection of original ambient recordings designed to support machine learning in sound classification, emotional modeling, and spatial audio recognition. These tracks feature evolving textures, minimal melodic content, and carefully layered soundscapes that reflect environments both natural and abstract. From soothing drones to atmospheric washes and immersive pads, this dataset captures the subtle complexity of ambient music.
Each track is paired with detailed metadata including tempo (if applicable), duration, sonic descriptors, frequency characteristics, and mood annotations—offering vital training data for MIR, generative ambience, ambient sound classification, and more. With no AI-generated content, every track is 100% human-produced and studio-crafted, providing high-fidelity, expressive audio for advanced AI audio development.
The Nursery Rhymes Dataset is a charming collection of custom-composed children’s songs that replicate the classic nursery rhyme format, featuring repetitive phrasing, rhyme schemes, melodic simplicity, and age-appropriate pacing. Instruments include piano, glockenspiel, ukulele, soft drums, and warm vocal tones designed to be soothing, clear, and accessible for young learners.
Each track includes metadata detailing tempo, rhyme patterns, key, syllable count, verse structure, and lyrical themes. This enables AI models to learn patterns in early childhood music such as phoneme repetition, call-and-response, learning cues, and safe tonal qualities.
All content is 100% human-created in a professional studio, with absolutely no AI-generated melodies or vocals. The music is designed to mimic the familiar structure of classic nursery rhymes while offering entirely original content. This makes it perfect for training models that support children’s language acquisition, music education, vocal detection, emotional regulation tools, and intelligent kids’ media platforms.
From educational games to voice assistants for children, this dataset provides a foundation of safe, consistent, and musically sound data for any child-focused AI application.