100+ datasets found

ChatGPT Classification Dataset
kaggle.com
zip
Updated Sep 7, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mahdi (2023). ChatGPT Classification Dataset [Dataset]. https://www.kaggle.com/datasets/mahdimaktabdar/chatgpt-classification-dataset
Explore at:
zip(718710 bytes)Available download formats
Dataset updated
Sep 7, 2023
Authors
Mahdi
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
We have compiled a dataset that consists of textual articles including common terminology, concepts and definitions in the field of computer science, artificial intelligence, and cyber security. This dataset consists of both human-generated text and OpenAI’s ChatGPT-generated text. Human-generated answers were collected from different computer science dictionaries and encyclopedias including “The Encyclopedia of Computer Science and Technology” and "Encyclopedia of Human-Computer Interaction". AI-generated content in our dataset was produced by simply posting questions to OpenAI’s ChatGPT and manually documenting the resulting responses. A rigorous data-cleaning process has been performed to remove unwanted Unicode characters, styling and formatting tags. To structure our dataset for binary classification, we combined both AI-generated and Human-generated answers into a single column and assigned appropriate labels to each data point (Human-generated = 0 and AI-generated = 1).

This creates our article-level dataset (article_level_data.csv) which consists of a total of 1018 articles, 509 AI-generated and 509 Human-generated. Additionally, we have divided each article into its sentences and labelled them accordingly. This is mainly to evaluate the performance of classification models and pipelines when it comes to shorter sentence-level data points. This constructs our sentence-level dataset (sentence_level_data.csv) which consists of a total of 7344 entries (4008 AI-generated and 3336 Human-generated).

We appreciate it, if you cite the following article if you happen to use this dataset in any scientific publication:

Maktab Dar Oghaz, M., Dhame, K., Singaram, G., & Babu Saheer, L. (2023). Detection and Classification of ChatGPT Generated Contents Using Deep Transformer Models. Frontiers in Artificial Intelligence.

https://www.techrxiv.org/users/692552/articles/682641/master/file/data/ChatGPT_generated_Content_Detection/ChatGPT_generated_Content_Detection.pdf
ChatGPT User Reviews
kaggle.com
zip
Updated Jun 30, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Bhavik Jikadara (2024). ChatGPT User Reviews [Dataset]. https://www.kaggle.com/datasets/bhavikjikadara/chatgpt-user-feedback
Explore at:
zip(5709734 bytes)Available download formats
Dataset updated
Jun 30, 2024
Authors
Bhavik Jikadara
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
Dataset Description

This dataset consists of daily-updated user reviews and ratings for the ChatGPT Android App. The dataset includes several key attributes that capture various aspects of the reviews, providing insights into user experiences and feedback over time.

Columns Explanation

userName: The display name of the user who posted the review.

content: The text content of the review. This column contains the actual review text written by the user. It includes user opinions, feedback, and detailed descriptions of their experiences with the ChatGPT app.

score: The rating given by the user, typically ranging from 1 to 5. This column captures the numerical rating provided by the user. Higher scores indicate better experiences, while lower scores indicate dissatisfaction.

thumbsUpCount: The number of thumbs up (likes) the review received. This column shows how many other users found the review helpful or agreed with the sentiments expressed. It serves as a measure of the review's relevancy and impact.

at: The timestamp of when the review was posted. This column includes the date and time when the review was submitted. It is crucial for tracking the temporal distribution of reviews and analyzing trends over time.

Collection Methods

Data Source: The data is collected from user reviews submitted through the ChatGPT Android App's review section on the Google Play Store.

Frequency: The dataset is updated daily to capture the most recent user feedback and ratings.

Automation: An automated script is used to scrape and compile the reviews, ensuring that the dataset is current and comprehensive.

Data Cleaning: Basic preprocessing is performed to ensure data quality, such as removing duplicates and handling missing values.
h
awesome-chatgpt-prompts
huggingface.co
Updated Dec 15, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Fatih Kadir Akın (2023). awesome-chatgpt-prompts [Dataset]. https://huggingface.co/datasets/fka/awesome-chatgpt-prompts
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Dec 15, 2023
Authors
Fatih Kadir Akın
License
https://choosealicense.com/licenses/cc0-1.0/https://choosealicense.com/licenses/cc0-1.0/
Description
🧠 Awesome ChatGPT Prompts [CSV dataset]

This is a Dataset Repository of Awesome ChatGPT Prompts View All Prompts on GitHub

License

CC-0
b
ChatGPT Revenue and Usage Statistics (2025)
businessofapps.com
Updated Feb 9, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Business of Apps (2023). ChatGPT Revenue and Usage Statistics (2025) [Dataset]. https://www.businessofapps.com/data/chatgpt-statistics/
Explore at:
Dataset updated
Feb 9, 2023
Dataset authored and provided by
Business of Apps
License
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0)https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Description
ChatGPT was the chatbot that kickstarted the generative AI revolution, which has been responsible for hundreds of billions of dollars in data centres, graphics chips and AI startups. Launched by...
S
Test dataset of ChatGPT in medical field
scidb.cn
Updated Mar 3, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
robin shen (2023). Test dataset of ChatGPT in medical field [Dataset]. http://doi.org/10.57760/sciencedb.o00130.00001
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.57760/sciencedb.o00130.00001
Dataset updated
Mar 3, 2023
Dataset provided by
Science Data Bank
Authors
robin shen
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The researcher tests the QA capability of ChatGPT in the medical field from the following aspects:1. Test their reserve capacity for medical knowledge2. Check their ability to read literature and understand medical literature3. Test their ability of auxiliary diagnosis after reading case data4. Test its error correction ability for case data5. Test its ability to standardize medical terms6. Test their evaluation ability to experts7. Check their ability to evaluate medical institutionsThe conclusion is:ChatGPT has great potential in the application of medical and health care, and may directly replace human beings or even professionals at a certain level in some fields;The researcher preliminarily believe that ChatGPT has basic medical knowledge and the ability of multiple rounds of dialogue, and its ability to understand Chinese is not weak;ChatGPT has the ability to read, understand and correct cases;ChatGPT has the ability of information extraction and terminology standardization, and is quite excellent;ChatGPT has the reasoning ability of medical knowledge;ChatGPT has the ability of continuous learning. After continuous training, its level has improved significantly;ChatGPT does not have the academic evaluation ability of Chinese medical talents, and the results are not ideal;ChatGPT does not have the academic evaluation ability of Chinese medical institutions, and the results are not ideal;ChatGPT is an epoch-making product, which can become a useful assistant for medical diagnosis and treatment, knowledge service, literature reading, review and paper writing.
s
Data from: ChatGPT in education: A discourse analysis of worries and...
socialmediaarchive.org
csv, json, txt
Updated Sep 26, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2023). ChatGPT in education: A discourse analysis of worries and concerns on social media [Dataset]. https://socialmediaarchive.org/record/54
Explore at:
csv(6528597), json(248465998), txt(4908229)Available download formats
Dataset updated
Sep 26, 2023
Description
The rapid advancements in generative AI models present new opportunities in the education sector. However, it is imperative to acknowledge and address the potential risks and concerns that may arise with their use. We collected Twitter data to identify key concerns related to the use of ChatGPT in education. This dataset is used to support the study "ChatGPT in education: A discourse analysis of worries and concerns on social media."

In this study, we particularly explored two research questions. RQ1 (Concerns): What are the key concerns that Twitter users perceive with using ChatGPT in education? RQ2 (Accounts): Which accounts are implicated in the discussion of these concerns? In summary, our study underscores the importance of responsible and ethical use of AI in education and highlights the need for collaboration among stakeholders to regulate AI policy.
h
Chatgpt
huggingface.co
Updated Apr 12, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Rajdeep Chatterjee (2023). Chatgpt [Dataset]. https://huggingface.co/datasets/RajChat/Chatgpt
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Apr 12, 2023
Authors
Rajdeep Chatterjee
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
OpenAssistant Conversations Dataset (OASST1)

Dataset Summary

In an effort to democratize research on large-scale alignment, we release OpenAssistant Conversations (OASST1), a human-generated, human-annotated assistant-style conversation corpus consisting of 161,443 messages in 35 different languages, annotated with 461,292 quality ratings, resulting in over 10,000 fully annotated conversation trees. The corpus is a product of a worldwide crowd-sourcing effort… See the full description on the dataset page: https://huggingface.co/datasets/RajChat/Chatgpt.
ChatGPT Reddit
kaggle.com
zip
Updated Jan 29, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Armita Razavi (2023). ChatGPT Reddit [Dataset]. https://www.kaggle.com/datasets/armitaraz/chatgpt-reddit/data
Explore at:
zip(5282154 bytes)Available download formats
Dataset updated
Jan 29, 2023
Authors
Armita Razavi
License
https://www.reddit.com/wiki/apihttps://www.reddit.com/wiki/api
Description
Here you can find about 50K comments on Reddit website regarding ChatGPT . The comments are gathered from Reddit's Posts from 4 subreddits.

The data includes comment_id, comment_parent_id, comment_body and subreddit

comment_id : the comment's id

comment_parent_id: the comment's id which the current comment is replied to.

comment_body: the comment

subreddit: the community/subreddit name of the comment

The Date and other information related to comments will be added in the next version. This dataset is useful to get insight about the public take on ChatGPT and also for text analysis, text visualizations, Inline Question Answering, Text Summarization, NER and other tasks like clustering and so on.

Please note that this dataset is not cleaned or preprocessed so if you want to get your hands dirty with data, it's a good practice to level up your skills in data cleaning too :)

And please don't forget to UPVOTE it in case you find it useful and enjoy it.
h
ASRS-ChatGPT
huggingface.co
Updated Jun 29, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Archana Tikayat Ray (2023). ASRS-ChatGPT [Dataset]. http://doi.org/10.57967/hf/0830
Explore at:
Unique identifier
https://doi.org/10.57967/hf/0830
Dataset updated
Jun 29, 2023
Authors
Archana Tikayat Ray
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Dataset Summary

The dataset contains a total of 9984 incident records and 9 columns. Some of the columns contain ground truth values whereas others contain information generated by ChatGPT based on the incident Narratives. The creation of this dataset is aimed at providing researchers with columns generated by using ChatGPT API which is not freely available.

Dataset Structure

The column names present in the dataset and their descriptions are provided below:

Column… See the full description on the dataset page: https://huggingface.co/datasets/archanatikayatray/ASRS-ChatGPT.
Global cases of sensitive data spill into ChatGPT 2023
statista.com
Updated Jun 15, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Statista (2023). Global cases of sensitive data spill into ChatGPT 2023 [Dataset]. https://www.statista.com/statistics/1378692/corporate-sensitive-data-spill-chatgpt/
Explore at:
Dataset updated
Jun 15, 2023
Dataset authored and provided by
Statistahttp://statista.com/
Area covered
Worldwide
Description
Between the 9th and 15th of April 2023, per 100,000 employees, *** cases of sensitive data leaking on ChatGPT were spotted in worldwide companies. Compared to an observation between February and March 2023, the figure had increased by around ** percent. The second-most common type of confidential data shared on ChatGPT was source code, with *** cases per 100,000 employees.
89k ChatGPT conversations
kaggle.com
zip
Updated May 4, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Noah Persaud (2023). 89k ChatGPT conversations [Dataset]. https://www.kaggle.com/datasets/noahpersaud/89k-chatgpt-conversations
Explore at:
zip(681600031 bytes)Available download formats
Dataset updated
May 4, 2023
Authors
Noah Persaud
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset contains all available conversations from chatlogs.net between users and ChatGPT. Version 1 contains all conversations available up to the cutoff date of April 4, 2023. Version 1 contains all conversations available up to the cutoff date of April 20, 2023.
Top user concerns about ChatGPT SEA 2023
statista.com
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Statista, Top user concerns about ChatGPT SEA 2023 [Dataset]. https://www.statista.com/statistics/1382944/sea-top-user-concerns-about-chat-gpt/
Explore at:
Dataset authored and provided by
Statistahttp://statista.com/
Time period covered
Feb 2023
Area covered
Asia
Description
In a survey conducted across **** Southeast Asian countries in February 2023, almost half of the respondents selected collection of personal data as one of the concerns they had regarding the usage of chatbots like ChatGPT. In contrast, ethical issues related to data privacy and intellectual property were a concern for ** percent of the respondents.
Z
A dataset to investigate ChatGPT for enhancing Students' Learning Experience...
data.niaid.nih.gov
zenodo.org
Updated Jun 19, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Schicchi, Daniele; Taibi, Davide (2024). A dataset to investigate ChatGPT for enhancing Students' Learning Experience via Concept Maps [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_12076680
Explore at:
Dataset updated
Jun 19, 2024
Dataset provided by
Institute for Educational Technology, National Research Council of Italy
Authors
Schicchi, Daniele; Taibi, Davide
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The dataset was compiled to examine the use of ChatGPT 3.5 in educational settings, particularly for creating and personalizing concept maps. The data has been organized into three folders: Maps, Texts, and Questionnaires. The Maps folder contains the graphical representation of the concept maps and the PlanUML code for drawing them in Italian and English. The Texts folder contains the source text used as input for the map's creation The Questionnaires folder includes the students' responses to the three administered questionnaires.
All GPT-4 Conversations
kaggle.com
Updated Nov 21, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
The Devastator (2023). All GPT-4 Conversations [Dataset]. https://www.kaggle.com/datasets/thedevastator/all-gpt-4-synthetic-chat-datasets
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Nov 21, 2023
Dataset provided by
Kagglehttp://kaggle.com/
Authors
The Devastator
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description

All GPT-4 Generated Datasets

Every chat dataset generated by GPT-4 from Huggingface at the same format

From [Huggingface datasets]

About this dataset

How to use the dataset

The dataset includes all chat conversations generated by GPT-4 that are hosted on open Huggingface datasets. Everything is converted to the same format so the datasets can be easily merged and used for large scale training of LLMs.

Acknowledgements

This dataset is a collection of several single chat datasets. If you use this dataset in your research, please credit the original authors of the internal datasets. Data Source

License

License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
m
Public data files containing the data used for the ChatGPT survey (XLSX) and...
figshare.mq.edu.au
researchdata.edu.au
xlsx
Updated Sep 15, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Matt Bower; Jodie Torrington; Jennifer Lai; Peter Petocz; Mark Alfano (2023). Public data files containing the data used for the ChatGPT survey (XLSX) and the survey containing variable selection codes (DOCX). [Dataset]. http://doi.org/10.25949/24123306.v1
Explore at:
xlsxAvailable download formats
Unique identifier
https://doi.org/10.25949/24123306.v1
Dataset updated
Sep 15, 2023
Dataset provided by
Macquarie University
Authors
Matt Bower; Jodie Torrington; Jennifer Lai; Peter Petocz; Mark Alfano
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This project investigated teacher attitudes towards Generative Artificial Intelligence Tools (GAITs). In excess of three hundred teachers were surveyed across a broad variety of teaching levels, demographic areas, experience levels, and disciplinary areas, to better understand how they believe teaching and assessment should change as a result of GAITs such as ChatGPT.Teachers were invited to complete an online survey relating to their perceptions of the open Artificial Intelligence (AI) tool ChatGPT, and how it will influence what they teach and how they assess. The purpose of the study is to provide teachers, policymakers, and society at large with an understanding of the potential impact of tools such as ChatGPT on Education.This dataset contains public data files used for the ChatGPT survey (XLSX) and the survey containing variable selection codes (DOCX). See the second sheet of the XLSX file for variable descriptions.
h
scraped-chatgpt-conversations
huggingface.co
Updated Apr 6, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Arya Nistane (2023). scraped-chatgpt-conversations [Dataset]. https://huggingface.co/datasets/ar852/scraped-chatgpt-conversations
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Apr 6, 2023
Authors
Arya Nistane
Description
Dataset Card for Dataset Name

Dataset Summary

scraped-chatgpt-conversations contains ~100k conversations between a user and chatgpt that were shared online through reddit, twitter, or sharegpt. For sharegpt, the conversations were directly scraped from the website. For reddit and twitter, images were downloaded from submissions, segmented, and run through an OCR pipeline to obtain a conversation list. For information on how the each json file is structured, please see… See the full description on the dataset page: https://huggingface.co/datasets/ar852/scraped-chatgpt-conversations.
t
ChatGPT Discussion Trends
tickertrends.io
html
Updated Oct 11, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
TickerTrends (2025). ChatGPT Discussion Trends [Dataset]. https://tickertrends.io/chatgpt-trends
Explore at:
htmlAvailable download formats
Dataset updated
Oct 11, 2025
Dataset authored and provided by
TickerTrends
License
https://tickertrends.io/termshttps://tickertrends.io/terms
Time period covered
Nov 2022 - Present
Area covered
Global
Variables measured
Keyword Volume, Topic Mentions, Trend Momentum
Description
Monthly dataset tracking topic frequency, keyword volume, and conversation patterns across ChatGPT discussions. Data is normalized on a 0 to 100 scale for easy comparison. Aggregates millions of AI interactions to reveal emerging trends, user interests, and discussion momentum across technology, finance, health, education, and business categories.
Datasets .csv
figshare.com
txt
Updated Jan 24, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Yaser Alhasawi (2024). Datasets .csv [Dataset]. http://doi.org/10.6084/m9.figshare.25053146.v1
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.25053146.v1
Dataset updated
Jan 24, 2024
Dataset provided by
figshare
Figsharehttp://figshare.com/
Authors
Yaser Alhasawi
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The dataset for this research project was meticulously constructed to investigate the adoption of ChatGPT among students in the United States. The primary objective was to gain insights into the technological barriers and resistances faced by students in integrating ChatGPT into their information systems. The dataset was designed to capture the diverse adoption patterns among students in various public and private schools and universities across the United States. By examining adoption rates, frequency of usage, and the contexts in which ChatGPT is employed, the research sought to provide a comprehensive understanding of how students are incorporating this technology into their information systems. Moreover, by including participants from diverse educational institutions, the research sought to ensure a comprehensive representation of the student population in the United States. This approach aimed to provide nuanced insights into how factors such as educational background, institution type, and technological familiarity influence ChatGPT adoption.
h
ChatGPT-Research-Abstracts
huggingface.co
opendatalab.com
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Nicolai Thorer Sivesind, ChatGPT-Research-Abstracts [Dataset]. https://huggingface.co/datasets/NicolaiSivesind/ChatGPT-Research-Abstracts
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Authors
Nicolai Thorer Sivesind
License
https://choosealicense.com/licenses/cc/https://choosealicense.com/licenses/cc/
Description
ChatGPT-Research-Abstracts

This is a dataset created in relation to a bachelor thesis written by Nicolai Thorer Sivesind and Andreas Bentzen Winje. It contains human-produced and machine-generated text samples of scientific research abstracts. A reformatted version for text-classification is available in the dataset collection Human-vs-Machine. In this collection, all samples are split into separate data points for real and generated, and labeled either 0 (human-produced) or 1… See the full description on the dataset page: https://huggingface.co/datasets/NicolaiSivesind/ChatGPT-Research-Abstracts.
e
Outcome of ChatGPT Advice – Survey Data
expresslegalfunding.com
html
Updated Sep 10, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Express Legal Funding (2025). Outcome of ChatGPT Advice – Survey Data [Dataset]. https://expresslegalfunding.com/chatgpt-study/
Explore at:
htmlAvailable download formats
Dataset updated
Sep 10, 2025
Dataset authored and provided by
Express Legal Funding
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variables measured
Unsure – Not sure yet, Helpful – It led to a good result, Neutral – It made no real difference, Harmful – It caused problems or a bad result
Description
This dataset summarizes how ChatGPT users rated the outcomes of the advice they received, including whether it was helpful, harmful, neutral, or uncertain, based on a 2025 U.S. survey.

Facebook

Twitter

Click to copy link

Link copied

Cite

Mahdi (2023). ChatGPT Classification Dataset [Dataset]. https://www.kaggle.com/datasets/mahdimaktabdar/chatgpt-classification-dataset

ChatGPT Classification Dataset

Classification of ChatGPT generated text from human generated text

Explore at:

114 scholarly articles cite this dataset (View in Google Scholar)

zip(718710 bytes)Available download formats

Dataset updated

Sep 7, 2023

Authors

Mahdi

License

Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically

Description

We have compiled a dataset that consists of textual articles including common terminology, concepts and definitions in the field of computer science, artificial intelligence, and cyber security. This dataset consists of both human-generated text and OpenAI’s ChatGPT-generated text. Human-generated answers were collected from different computer science dictionaries and encyclopedias including “The Encyclopedia of Computer Science and Technology” and "Encyclopedia of Human-Computer Interaction". AI-generated content in our dataset was produced by simply posting questions to OpenAI’s ChatGPT and manually documenting the resulting responses. A rigorous data-cleaning process has been performed to remove unwanted Unicode characters, styling and formatting tags. To structure our dataset for binary classification, we combined both AI-generated and Human-generated answers into a single column and assigned appropriate labels to each data point (Human-generated = 0 and AI-generated = 1).

This creates our article-level dataset (article_level_data.csv) which consists of a total of 1018 articles, 509 AI-generated and 509 Human-generated. Additionally, we have divided each article into its sentences and labelled them accordingly. This is mainly to evaluate the performance of classification models and pipelines when it comes to shorter sentence-level data points. This constructs our sentence-level dataset (sentence_level_data.csv) which consists of a total of 7344 entries (4008 AI-generated and 3336 Human-generated).

We appreciate it, if you cite the following article if you happen to use this dataset in any scientific publication:

Maktab Dar Oghaz, M., Dhame, K., Singaram, G., & Babu Saheer, L. (2023). Detection and Classification of ChatGPT Generated Contents Using Deep Transformer Models. Frontiers in Artificial Intelligence.

https://www.techrxiv.org/users/692552/articles/682641/master/file/data/ChatGPT_generated_Content_Detection/ChatGPT_generated_Content_Detection.pdf

Clear search

Close search

Google apps

Main menu

ChatGPT Classification Dataset

ChatGPT User Reviews

Dataset Description

Columns Explanation

Collection Methods

awesome-chatgpt-prompts

ChatGPT Revenue and Usage Statistics (2025)

Test dataset of ChatGPT in medical field

Data from: ChatGPT in education: A discourse analysis of worries and...

Chatgpt

ChatGPT Reddit

ASRS-ChatGPT

Global cases of sensitive data spill into ChatGPT 2023

89k ChatGPT conversations

Top user concerns about ChatGPT SEA 2023

A dataset to investigate ChatGPT for enhancing Students' Learning Experience...

All GPT-4 Conversations

All GPT-4 Generated Datasets

Every chat dataset generated by GPT-4 from Huggingface at the same format

About this dataset

How to use the dataset

Acknowledgements

License

Public data files containing the data used for the ChatGPT survey (XLSX) and...

scraped-chatgpt-conversations

ChatGPT Discussion Trends

Datasets .csv

ChatGPT-Research-Abstracts

Outcome of ChatGPT Advice – Survey Data

ChatGPT Classification Dataset

Classification of ChatGPT generated text from human generated text