Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset includes three distinct subsets of text:

Open Access Academic Articles: A collection of 100 open-access articles from various academic journals focused on mental health and psychiatry, published between 2016 and 2018. The articles are selected from reputable journals including JAMA, The Lancet Psychiatry, WPJ, and AM J Psy.

ChatGPT-Generated Texts: Discussion section samples generated by ChatGPT (GPT-4 model, version as of August 3, 2023, OpenAI), designed to imitate the style and content of academic articles in the field of mental health and psychiatry.

Claude-Generated Texts: Discussion section samples generated by Claude (Version 2, Anthropic) with the aim of imitating academic articles in the same field.

Additionally, the dataset contains the results of tests performed using ZeroGPT and Originality.AI that compare the AI-generated texts against the academic articles on the percentage of text identified as AI-generated.

Please cite this dataset if you make use of it in your research.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card
This dataset was created solely for the purpose of code testing. It was generated by prompting ChatGPT to create sample news sentences on a given topic. Sample prompt: "generate 50 sentences on the topic of "very recent breaking news on wars and conflicts events" with some sample location names. One example: "a missile struck near a residential building in Kiev last night, Russia denied Ukraine's accusations of attacking non-military targets""… See the full description on the dataset page: https://huggingface.co/datasets/joshuapsa/gpt-generated-news-sentences.
OpenRAIL: https://choosealicense.com/licenses/openrail/
This is a dataset of paraphrases created by ChatGPT. A model based on this dataset is available: model
We used this prompt to generate the paraphrases:

Generate 5 similar paraphrases for this question, show it like a numbered list without commentaries: {text}

This dataset is based on the Quora paraphrase questions, texts from SQuAD 2.0, and the CNN news dataset. We generated 5 paraphrases for each sample, so in total the dataset has about 420k data rows. You can make 30 rows from a row from… See the full description on the dataset page: https://huggingface.co/datasets/humarin/chatgpt-paraphrases.
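A minimal sketch of the generation loop implied by this prompt, assuming the openai Python client; the model name and list parsing are our assumptions, not details from the dataset card. The "30 rows from a row" figure presumably comes from pairing each of the six texts (the original plus its five paraphrases) with the other five (6 × 5 = 30 ordered pairs).

```python
# Sketch of the paraphrase-generation loop described above.
# Assumes the `openai` package; the model name is illustrative.
import re
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = ("Generate 5 similar paraphrases for this question, "
          "show it like a numbered list without commentaries: {text}")

def paraphrase(text: str) -> list[str]:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumed model, not confirmed by the card
        messages=[{"role": "user", "content": PROMPT.format(text=text)}],
    )
    raw = response.choices[0].message.content
    # Strip the "1. ", "2. ", ... prefixes from the numbered list.
    return [re.sub(r"^\s*\d+[.)]\s*", "", line)
            for line in raw.splitlines() if line.strip()]

print(paraphrase("What is the capital of France?"))
```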
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction: The large-scale artificial intelligence (AI) language model chatbot, Chat Generative Pre-Trained Transformer (ChatGPT), is renowned for its ability to provide data quickly and efficiently. This study aimed to assess the medical responses of ChatGPT regarding anesthetic procedures.

Methods: Two anesthesiologist authors selected 30 questions representing inquiries patients might have about surgery and anesthesia. These questions were input into two versions of ChatGPT in English. A total of 31 anesthesiologists then evaluated each response for quality, quantity, and overall assessment, using 5-point Likert scales. Descriptive statistics summarized the scores, and a paired sample t-test compared ChatGPT 3.5 and 4.0.

Results: Regarding quality, "appropriate" was the most common rating for both ChatGPT 3.5 and 4.0 (40% and 48%, respectively). For quantity, responses were deemed "insufficient" in 59% of cases for 3.5, and "adequate" in 69% for 4.0. In the overall assessment, 3 points was the most common score for 3.5 (36%), while 4 points predominated for 4.0 (42%). Mean quality scores were 3.40 and 3.73, and mean quantity scores were −0.31 (between insufficient and adequate) and 0.03 (between adequate and excessive), respectively. The mean overall score was 3.21 for 3.5 and 3.67 for 4.0. Responses from 4.0 showed statistically significant improvement in three areas.

Conclusion: ChatGPT generated responses mostly ranging from appropriate to slightly insufficient, providing an overall average amount of information. Version 4.0 outperformed 3.5, and further research is warranted to investigate the potential utility of AI chatbots in assisting patients with medical information.
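A sketch of the paired comparison described in the Methods, assuming per-question mean scores for the 30 questions are available as arrays; the values below are simulated for illustration, not the study's data.

```python
# Paired-sample t-test comparing ChatGPT 3.5 vs 4.0 ratings, as in the
# Methods above. Scores are simulated; only the means echo the abstract.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Illustrative per-question mean overall scores for 30 questions.
scores_35 = rng.normal(3.21, 0.5, size=30)
scores_40 = rng.normal(3.67, 0.5, size=30)

t, p = stats.ttest_rel(scores_35, scores_40)  # paired-sample t-test
print(f"t = {t:.2f}, p = {p:.4f}")
```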
CC0 1.0: https://choosealicense.com/licenses/cc0-1.0/
Awesome ChatGPT Prompts [CSV dataset]
This is a Dataset Repository of Awesome ChatGPT Prompts. View All Prompts on GitHub.
License
CC-0
A ChatGPT-generated dataset of sample Jira tickets from production support. You can use this sample to perform vector search or other AI and model testing.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Artificial Intelligence (AI) language models continue to expand in both access and capability. As these models have evolved, the number of academic journals in medicine and healthcare which have explored policies regarding AI-generated text has increased. The implementation of such policies requires accurate AI detection tools. Inaccurate detectors risk unnecessary penalties for human authors and/or may compromise the effective enforcement of guidelines against AI-generated content. Yet, the accuracy of AI text detection tools in identifying human-written versus AI-generated content has been found to vary across published studies. This experimental study used a sample of behavioral health publications and found problematic false positive and false negative rates from both free and paid AI detection tools. The study assessed 100 research articles from 2016–2018 in behavioral health and psychiatry journals and 200 texts produced by AI chatbots (100 by "ChatGPT" and 100 by "Claude"). The free AI detector showed a median of 27.2% for the proportion of academic text identified as AI-generated, while the commercial software Originality.AI demonstrated better performance but still had limitations, especially in detecting texts generated by Claude. These error rates raise doubts about relying on AI detectors to enforce strict policies around AI text generation in behavioral health publications.
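A sketch of how such error rates are derived, assuming each detector returns a percentage-AI score per text and a fixed decision threshold; all numbers below are illustrative, not the study's measurements.

```python
# Given detector scores (percent of a document flagged as AI-generated)
# for known-human and known-AI texts, summarize the median and the
# false positive / false negative rates. Scores and threshold assumed.
import statistics

human_scores = [12.0, 27.2, 40.1, 8.3, 33.6]   # known human-written texts
ai_scores    = [88.0, 35.0, 17.5, 92.4, 60.2]  # known AI-generated texts
THRESHOLD = 50.0  # flag a text as AI above this score (assumed cutoff)

print("median human score:", statistics.median(human_scores))
fp = sum(s > THRESHOLD for s in human_scores) / len(human_scores)
fn = sum(s <= THRESHOLD for s in ai_scores) / len(ai_scores)
print(f"false positive rate: {fp:.0%}, false negative rate: {fn:.0%}")
```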
ChatGPT has forever changed the way that many industries operate. Much of the focus on Artificial Intelligence (AI) has been on its ability to generate text. However, it is likely that its ability to generate computer code and scripts will also have a major impact. We demonstrate the use of ChatGPT to generate Python scripts to perform hydrological analyses and highlight the opportunities, limitations and risks that AI poses in the hydrological sciences.
Here, we provide four worked examples of the use of ChatGPT to generate scripts to conduct hydrological analyses. We also provide a full list of the libraries available to the ChatGPT Advanced Data Analysis plugin (only available in the paid version). These files relate to a manuscript that is to be submitted to Hydrological Processes. The authors of the manuscript are Dylan J. Irvine, Landon J.S. Halloran and Philip Brunner.
If you find these examples useful and/or use them, we would appreciate it if you could cite the associated publication in Hydrological Processes. Details will be made available upon final publication.
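As an illustration of the kind of script involved, the following flow duration curve example is our own sketch on synthetic data, not one of the four published worked examples.

```python
# The kind of short hydrological script ChatGPT is prompted for above:
# a flow duration curve computed from (synthetic) daily discharge data.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
discharge = rng.lognormal(mean=1.0, sigma=0.8, size=365)  # daily flows, m^3/s

# Sort flows in descending order and compute exceedance probability
# using the Weibull plotting position.
sorted_q = np.sort(discharge)[::-1]
exceedance = 100 * np.arange(1, len(sorted_q) + 1) / (len(sorted_q) + 1)

plt.semilogy(exceedance, sorted_q)
plt.xlabel("Exceedance probability (%)")
plt.ylabel("Discharge (m$^3$/s)")
plt.title("Flow duration curve (synthetic data)")
plt.show()
```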
A dataset including texts written by humans (labeled 0) and then rephrased by ChatGPT (labeled 1), created to train models for machine-generated text detection.
It is a robust dataset: it includes texts of various lengths, and the human texts are taken from multiple sources.
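A minimal sketch of the intended use, training a detector on the 0/1 labels; the example texts and model choice are our assumptions, not part of the dataset.

```python
# Train a toy machine-generated-text detector on 0 (human) / 1
# (ChatGPT-rephrased) labels; texts and model are illustrative only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts  = ["an original human-written sentence",
          "a ChatGPT rephrasing of that sentence"]
labels = [0, 1]  # 0 = human, 1 = rephrased by ChatGPT

detector = make_pipeline(CountVectorizer(), MultinomialNB())
detector.fit(texts, labels)
print(detector.predict(["some new text to score"]))
```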
Unknown license: https://choosealicense.com/licenses/unknown/
ChatGPT-4o Writing Prompts
This is a dataset containing 3746 short stories, generated with OpenAI's chatgpt-4o-latest model and using Reddit's Writing Prompts subreddit as a source. Each sample is generally between 6000-8000 characters long.
These stories were thoroughly cleaned and then further enriched with a title and a series of applicable genres.
Note that I did not touch the Markdown ChatGPT-4o produced by itself to enrich its output, as I very much enjoy the added flavour… See the full description on the dataset page: https://huggingface.co/datasets/Gryphe/ChatGPT-4o-Writing-Prompts.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Welcome to the "Awesome ChatGPT Prompts" dataset on Kaggle! This is a collection of prompt examples to be used with the ChatGPT model.
The ChatGPT model is a large language model trained by OpenAI that is capable of generating human-like text. By providing it with a prompt, it can generate responses that continue the conversation or expand on the given prompt.
CC0
Original Data Source: Awesome ChatGPT Prompts
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background: Clinical data is instrumental to medical research, machine learning (ML) model development, and advancing surgical care, but access is often constrained by privacy regulations and missing data. Synthetic data offers a promising solution to preserve privacy while enabling broader data access. Recent advances in large language models (LLMs) provide an opportunity to generate synthetic data with reduced reliance on domain expertise, computational resources, and pre-training.

Objective: This study aims to assess the feasibility of generating realistic tabular clinical data with OpenAI's GPT-4o using zero-shot prompting, and to evaluate the fidelity of LLM-generated data by comparing its statistical properties to the Vital Signs DataBase (VitalDB), a real-world open-source perioperative dataset.

Methods: In Phase 1, GPT-4o was prompted to generate a dataset from qualitative descriptions of 13 clinical parameters. The resulting data was assessed for general errors, plausibility of outputs, and cross-verification of related parameters. In Phase 2, GPT-4o was prompted to generate a dataset using descriptive statistics of the VitalDB dataset. Fidelity was assessed using two-sample t-tests, two-sample proportion tests, and 95% confidence interval (CI) overlap.

Results: In Phase 1, GPT-4o generated a complete and structured dataset comprising 6,166 case files. The dataset was plausible in range and correctly calculated body mass index for all case files based on the respective heights and weights. Statistical comparison between the LLM-generated datasets and VitalDB revealed that Phase 2 data achieved significant fidelity. Phase 2 data demonstrated statistical similarity in 12/13 (92.31%) parameters: no statistically significant differences were observed in 6/6 (100.0%) categorical/binary and 6/7 (85.71%) continuous parameters. Overlap of 95% CIs was observed in 6/7 (85.71%) continuous parameters.

Conclusion: Zero-shot prompting with GPT-4o can generate realistic tabular synthetic datasets that replicate key statistical properties of real-world perioperative data. This study highlights the potential of LLMs as a novel and accessible modality for synthetic data generation, which may address critical barriers in clinical data access and eliminate the need for technical expertise, extensive computational resources, and pre-training. Further research is warranted to enhance fidelity and investigate the use of LLMs to amplify and augment datasets, preserve multivariate relationships, and train robust ML models.
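A sketch of the Phase 2 fidelity checks named above (two-sample t-test, two-sample proportion test, 95% CI overlap), using simulated values in place of the VitalDB and GPT-4o datasets.

```python
# Fidelity checks of the kind described in the Methods; all data below
# are simulated stand-ins, not VitalDB values or GPT-4o outputs.
import numpy as np
from scipy import stats
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(1)
real_age  = rng.normal(58, 14, 6000)    # e.g. a continuous parameter (age)
synth_age = rng.normal(58.4, 14, 6166)  # synthetic counterpart

t, p = stats.ttest_ind(real_age, synth_age)  # two-sample t-test
print(f"two-sample t-test: p = {p:.3f}")

# Binary parameter: counts of one category in each dataset (assumed).
stat, p_prop = proportions_ztest([2900, 3010], [6000, 6166])
print(f"two-sample proportion test: p = {p_prop:.3f}")

# 95% CI overlap for the means of a continuous parameter.
def ci95(x):
    se = x.std(ddof=1) / np.sqrt(len(x))
    return x.mean() - 1.96 * se, x.mean() + 1.96 * se

lo1, hi1 = ci95(real_age)
lo2, hi2 = ci95(synth_age)
print("CIs overlap:", max(lo1, lo2) <= min(hi1, hi2))
```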
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Responses generated by ChatGPT 3.5, ChatGPT 4, Llama 3-8B, and Mistral-7B to NYT and HC3 topics, in different roles and parameter configurations.
The dataset is useful to study lexical aspects of LLMs with different parameters/roles configurations.
The 0_Base_Topics.xlsx file lists the topics used for the dataset generation.
The rest of the files collect the answers of ChatGPT to these topics under different configurations of parameters and context (a minimal API sketch of one configuration appears after the role list below):
Temperature (parameter): Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic.
Frequency penalty (parameter): Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
Top probability (parameter): An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass.
Presence penalty (parameter): Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
Roles (context)
Default: No role is assigned to the LLM; the default role is used.
Child: The LLM is requested to answer as a five-year-old child.
Young adult male: The LLM is requested to answer as a young male adult.
Young adult female: The LLM is requested to answer as a young female adult.
Elderly adult male: The LLM is requested to answer as an elderly male adult.
Elderly adult female: The LLM is requested to answer as an elderly female adult.
Affluent adult male: The LLM is requested to answer as an affluent male adult.
Affluent adult female: The LLM is requested to answer as an affluent female adult.
Lower-class adult male: The LLM is requested to answer as a lower-class male adult.
Lower-class adult female: The LLM is requested to answer as a lower-class female adult.
Erudite: The LLM is requested to answer as an erudite who uses a rich vocabulary.
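A minimal sketch of one parameter/role configuration from the lists above, assuming the openai Python client; the model name, topic, and exact values are illustrative, not the calls used to build this dataset.

```python
# One parameter/role configuration of the kind listed above,
# expressed as an OpenAI chat completion call (values illustrative).
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # assumed model name
    messages=[
        # Role (context): one of the personas listed above.
        {"role": "system", "content": "Answer as a five-year-old child."},
        {"role": "user", "content": "Tell me about space travel."},
    ],
    temperature=0.8,        # higher values -> more random output
    top_p=1.0,              # nucleus sampling probability mass
    frequency_penalty=0.5,  # discourage verbatim repetition
    presence_penalty=0.5,   # encourage new topics
)
print(response.choices[0].message.content)
```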
Paper
Paper: Beware of Words: Evaluating the Lexical Diversity of Conversational LLMs using ChatGPT as Case Study
Cite:
@article{10.1145/3696459,
  author    = {Mart\'{\i}nez, Gonzalo and Hern\'{a}ndez, Jos\'{e} Alberto and Conde, Javier and Reviriego, Pedro and Merino-G\'{o}mez, Elena},
  title     = {Beware of Words: Evaluating the Lexical Diversity of Conversational LLMs using ChatGPT as Case Study},
  year      = {2024},
  publisher = {Association for Computing Machinery},
  address   = {New York, NY, USA},
  issn      = {2157-6904},
  url       = {https://doi.org/10.1145/3696459},
  doi       = {10.1145/3696459},
  note      = {Just Accepted},
  journal   = {ACM Trans. Intell. Syst. Technol.},
  month     = sep,
  keywords  = {LLM, Lexical diversity, ChatGPT, Evaluation}
}
Privacy policy: https://electroiq.com/privacy-policy
ChatGPT Statistics: In today's technologically advancing world, Artificial Intelligence (AI) is no longer just science fiction; it has become an integral part of everyday life. One of the most exciting examples of AI in action is ChatGPT, a powerful language model developed by OpenAI. ChatGPT is a conversational AI tool capable of generating human-like responses and assisting with a variety of tasks ranging from writing to coding, customer service, education, and more. Its use in everyday life is growing enormously, as it makes communication faster, smarter, and more intuitive.
This article examines how ChatGPT operates and presents statistical analysis from various perspectives, including its practical applications and the evolving conversations surrounding its benefits, limitations, and future potential.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
While automated test generation can decrease the human burden associated with testing, it does not eliminate this burden. Humans must still work with generated test cases to interpret testing results, debug the code, build and maintain a comprehensive test suite, and perform many other tasks. Therefore, a major challenge with automated test generation is the understandability of the generated test cases.
Large language models (LLMs), machine learning models trained on massive corpora of textual data - including both natural language and programming languages - are an emerging technology with great potential for performing language-related predictive tasks such as translation, summarization, and decision support.
In this study, we are exploring the capabilities of LLMs with regard to improving test case understandability.
This package contains the data produced during this exploration:
The examples directory contains the three case studies we tested our transformation process on:
queue_example: Tests of a basic queue data structure
httpie_sessions: Tests of the sessions module from the httpie project.
string_utils_validation: Tests of the validation module from the python-string-utils project.
Each directory contains the modules-under-test, the original test cases generated by Pynguin, and the transformed test cases.
Two trials of the transformation technique were performed per case example to assess the impact of varying results from the LLM.
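As a hypothetical illustration of the transformation (not an actual test case from the examples directory), a raw generated test and a more readable rewritten version might look like this:

```python
# Hypothetical before/after for the readability transformation;
# the Queue class is a stand-in, not the actual module-under-test.
class Queue:
    def __init__(self):
        self._items = []
    def enqueue(self, x):
        self._items.append(x)
    def size(self):
        return len(self._items)

# Before: a typical generated test with opaque names and no stated intent.
def test_case_0():
    var_0 = Queue()
    var_0.enqueue(42)
    assert var_0.size() == 1

# After: the LLM-transformed version with descriptive naming and comments.
def test_enqueue_increases_size():
    queue = Queue()
    # Enqueuing one element should grow the queue from empty to size 1.
    queue.enqueue(42)
    assert queue.size() == 1
```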
The survey directory contains the survey that was sent to assess the impact of the transformation on test readability.
survey.pdf contains the survey questions.
responses.xlsx contains the survey results.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
AH&AITD is a comprehensive benchmark dataset designed to support the evaluation of AI-generated text detection tools. The dataset contains 11,580 samples spanning both human-written and AI-generated content across multiple domains. It was developed to address limitations in previous datasets, particularly in terms of diversity, scale, and real-world applicability, and to facilitate research in the detection of AI-generated text by providing a diverse, multi-domain dataset that enables fair benchmarking of detection tools across various writing styles and content categories.

Composition

1. Human-Written Samples (total: 5,790), collected from:

Open Web Text (2,343 samples)
Blogs (196 samples)
Web Text (397 samples)
Q&A Platforms (670 samples)
News Articles (430 samples)
Opinion Statements (1,549 samples)
Scientific Research Abstracts (205 samples)

2. AI-Generated Samples (total: 5,790), generated using:

ChatGPT (1,130 samples)
GPT-4 (744 samples)
Paraphrase Models (1,694 samples)
GPT-2 (328 samples)
GPT-3 (296 samples)
DaVinci (GPT-3.5 variant) (433 samples)
GPT-3.5 (364 samples)
OPT-IML (406 samples)
Flan-T5 (395 samples)

Citation: Akram, A. (2023). AH&AITD: Arslan's Human and AI Text Database. [Dataset]. Associated with the article: An Empirical Study of AI-Generated Text Detection Tools. Advances in Machine Learning & Artificial Intelligence, 4(2), 44–55.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Dataset Card for "code_exercises"
Code exercise
This dataset is composed of a diverse set of ~120k Python code exercises (~120m total tokens) generated by ChatGPT 3.5. It is designed to distill ChatGPT 3.5's knowledge about Python coding tasks into other (potentially smaller) models. The exercises have been generated by following the steps described in the related GitHub repository. The generated exercises follow the format of the HumanEval benchmark. Each training sample… See the full description on the dataset page: https://huggingface.co/datasets/jinaai/code_exercises.
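The HumanEval format pairs a function signature and docstring with a reference solution; an exercise in that style might look like the following. This is our own illustration, not an actual row from the dataset.

```python
# Illustrative exercise in the HumanEval style (signature + docstring,
# then a solution); not an actual sample from code_exercises.
def running_maximum(numbers: list[int]) -> list[int]:
    """Return a list where element i is the maximum of numbers[:i+1].

    >>> running_maximum([3, 1, 4, 1, 5])
    [3, 3, 4, 4, 5]
    """
    result, current = [], float("-inf")
    for n in numbers:
        current = max(current, n)
        result.append(current)
    return result

assert running_maximum([3, 1, 4, 1, 5]) == [3, 3, 4, 4, 5]
```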
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Human interactions involve dialogue acts that can be responded to by acceptive or dismissive acts. For example, a person sharing a painful past event uses the dialogue act of disclosure. The disclosure can be responded to empathically or by dismissing the pain. Here, we address the challenge of automatically identifying dismissive and acceptive dialogue acts in a conversation. We used massive AI-generated datasets of utterances and dialogues expressing acceptive/dismissive behavior to address the challenge. We then trained and tested several machine-learning models, which performed well at classifying utterances as acceptive or dismissive. The basic approach described in this paper can empower the development of automatic interactive systems in contexts ranging from artificial therapists to assistant robots for the elderly.
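A minimal sketch of the classification task described, assuming a simple bag-of-words pipeline; the example utterances and model choice are illustrative, and the paper's models may differ.

```python
# Toy acceptive (1) vs dismissive (0) utterance classifier; the training
# examples and pipeline are our assumptions, not the paper's setup.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

utterances = [
    "That sounds really painful, thank you for telling me.",  # acceptive
    "I hear how hard that was for you.",                      # acceptive
    "You're overreacting, it wasn't a big deal.",             # dismissive
    "Everyone goes through that, just move on.",              # dismissive
]
labels = [1, 1, 0, 0]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(utterances, labels)
print(clf.predict(["I'm sorry you went through that."]))
```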
Dataset Card for ChatGPT Jailbreak Prompts
Name
ChatGPT Jailbreak Prompts
Dataset Summary
ChatGPT Jailbreak Prompts is a complete collection of jailbreak-related prompts for ChatGPT. This dataset is intended to provide a valuable resource for understanding and generating text in the context of ChatGPT jailbreaking.
Languages
English