Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset includes three distinct subsets of text:

Open Access Academic Articles: A collection of 100 open-access articles from various academic journals focused on mental health and psychiatry, published between 2016 and 2018. The articles are selected from reputable journals including JAMA, The Lancet Psychiatry, WPJ, and AM J Psy.

ChatGPT-Generated Texts: Discussion section samples generated by ChatGPT (GPT-4 model, version as of August 3, 2023, OpenAI), designed to imitate the style and content of academic articles in the field of mental health and psychiatry.

Claude-Generated Texts: Discussion section samples generated by Claude (Version 2, Anthropic) with the aim of imitating academic articles in the same field.

Additionally, the dataset contains the results of tests performed using ZeroGPT and Originality.AI that compare the AI-generated texts against the academic articles on the percentage of text identified as AI-generated.

Please cite this dataset if you make use of it in your research.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card
This dataset was created solely for the purpose of code testing. It was generated by prompting ChatGPT to create sample news sentences on a given topic. Sample prompt: "generate 50 sentences on the topic of "very recent breaking news on wars and conflicts events" with some sample location names. One example: "a missile struck near a residential building in Kiev last night, Russia denied Ukraine's accusations of attacking non-military targets""… See the full description on the dataset page: https://huggingface.co/datasets/joshuapsa/gpt-generated-news-sentences.
OpenRAIL: https://choosealicense.com/licenses/openrail/
This is a dataset of paraphrases created by ChatGPT. A model based on this dataset is available: model
We used this prompt to generate the paraphrases:

Generate 5 similar paraphrases for this question, show it like a numbered list without commentaries: {text}

This dataset is based on the Quora paraphrase questions, texts from SQuAD 2.0, and the CNN news dataset. We generated 5 paraphrases for each sample, so in total the dataset has about 420k data rows. You can make 30 rows from a row from… See the full description on the dataset page: https://huggingface.co/datasets/humarin/chatgpt-paraphrases.
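A minimal sketch of the generation loop implied by this prompt, assuming the openai Python client; the model name and list parsing are our assumptions, not details from the dataset card. The "30 rows from a row" figure presumably comes from pairing each of the six texts (the original plus its five paraphrases) with the other five (6 × 5 = 30 ordered pairs).

```python
# Sketch of the paraphrase-generation loop described above.
# Assumes the `openai` package; the model name is illustrative.
import re
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = ("Generate 5 similar paraphrases for this question, "
          "show it like a numbered list without commentaries: {text}")

def paraphrase(text: str) -> list[str]:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumed model, not confirmed by the card
        messages=[{"role": "user", "content": PROMPT.format(text=text)}],
    )
    raw = response.choices[0].message.content
    # Strip the "1. ", "2. ", ... prefixes from the numbered list.
    return [re.sub(r"^\s*\d+[.)]\s*", "", line)
            for line in raw.splitlines() if line.strip()]

print(paraphrase("What is the capital of France?"))
```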
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction: The large-scale artificial intelligence (AI) language model chatbot, Chat Generative Pre-Trained Transformer (ChatGPT), is renowned for its ability to provide data quickly and efficiently. This study aimed to assess the medical responses of ChatGPT regarding anesthetic procedures.

Methods: Two anesthesiologist authors selected 30 questions representing inquiries patients might have about surgery and anesthesia. These questions were input into two versions of ChatGPT in English. A total of 31 anesthesiologists then evaluated each response for quality, quantity, and overall assessment, using 5-point Likert scales. Descriptive statistics summarized the scores, and a paired sample t-test compared ChatGPT 3.5 and 4.0.

Results: Regarding quality, "appropriate" was the most common rating for both ChatGPT 3.5 and 4.0 (40% and 48%, respectively). For quantity, responses were deemed "insufficient" in 59% of cases for 3.5, and "adequate" in 69% for 4.0. In the overall assessment, 3 points was the most common score for 3.5 (36%), while 4 points predominated for 4.0 (42%). Mean quality scores were 3.40 and 3.73, and mean quantity scores were −0.31 (between insufficient and adequate) and 0.03 (between adequate and excessive), respectively. The mean overall score was 3.21 for 3.5 and 3.67 for 4.0. Responses from 4.0 showed statistically significant improvement in three areas.

Conclusion: ChatGPT generated responses mostly ranging from appropriate to slightly insufficient, providing an overall average amount of information. Version 4.0 outperformed 3.5, and further research is warranted to investigate the potential utility of AI chatbots in assisting patients with medical information.
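A sketch of the paired comparison described in the Methods, assuming per-question mean scores for the 30 questions are available as arrays; the values below are simulated for illustration, not the study's data.

```python
# Paired-sample t-test comparing ChatGPT 3.5 vs 4.0 ratings, as in the
# Methods above. Scores are simulated; only the means echo the abstract.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Illustrative per-question mean overall scores for 30 questions.
scores_35 = rng.normal(3.21, 0.5, size=30)
scores_40 = rng.normal(3.67, 0.5, size=30)

t, p = stats.ttest_rel(scores_35, scores_40)  # paired-sample t-test
print(f"t = {t:.2f}, p = {p:.4f}")
```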
CC0 1.0: https://choosealicense.com/licenses/cc0-1.0/
Awesome ChatGPT Prompts [CSV dataset]
This is a Dataset Repository of Awesome ChatGPT Prompts. View All Prompts on GitHub.
License
CC-0
A ChatGPT-generated dataset of sample Jira tickets from production support. You can use this sample to perform vector search or other AI and model testing.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Artificial Intelligence (AI) language models continue to expand in both access and capability. As these models have evolved, the number of academic journals in medicine and healthcare which have explored policies regarding AI-generated text has increased. The implementation of such policies requires accurate AI detection tools. Inaccurate detectors risk unnecessary penalties for human authors and/or may compromise the effective enforcement of guidelines against AI-generated content. Yet, the accuracy of AI text detection tools in identifying human-written versus AI-generated content has been found to vary across published studies. This experimental study used a sample of behavioral health publications and found problematic false positive and false negative rates from both free and paid AI detection tools. The study assessed 100 research articles from 2016–2018 in behavioral health and psychiatry journals and 200 texts produced by AI chatbots (100 by "ChatGPT" and 100 by "Claude"). The free AI detector showed a median of 27.2% for the proportion of academic text identified as AI-generated, while the commercial software Originality.AI demonstrated better performance but still had limitations, especially in detecting texts generated by Claude. These error rates raise doubts about relying on AI detectors to enforce strict policies around AI text generation in behavioral health publications.
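A sketch of how such error rates are derived, assuming each detector returns a percentage-AI score per text and a fixed decision threshold; all numbers below are illustrative, not the study's measurements.

```python
# Given detector scores (percent of a document flagged as AI-generated)
# for known-human and known-AI texts, summarize the median and the
# false positive / false negative rates. Scores and threshold assumed.
import statistics

human_scores = [12.0, 27.2, 40.1, 8.3, 33.6]   # known human-written texts
ai_scores    = [88.0, 35.0, 17.5, 92.4, 60.2]  # known AI-generated texts
THRESHOLD = 50.0  # flag a text as AI above this score (assumed cutoff)

print("median human score:", statistics.median(human_scores))
fp = sum(s > THRESHOLD for s in human_scores) / len(human_scores)
fn = sum(s <= THRESHOLD for s in ai_scores) / len(ai_scores)
print(f"false positive rate: {fp:.0%}, false negative rate: {fn:.0%}")
```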
ChatGPT has forever changed the way that many industries operate. Much of the focus on Artificial Intelligence (AI) has been on its ability to generate text. However, it is likely that its ability to generate computer code and scripts will also have a major impact. We demonstrate the use of ChatGPT to generate Python scripts to perform hydrological analyses and highlight the opportunities, limitations and risks that AI poses in the hydrological sciences.
Here, we provide four worked examples of the use of ChatGPT to generate scripts to conduct hydrological analyses. We also provide a full list of the libraries available to the ChatGPT Advanced Data Analysis plugin (only available in the paid version). These files relate to a manuscript that is to be submitted to Hydrological Processes. The authors of the manuscript are Dylan J. Irvine, Landon J.S. Halloran and Philip Brunner.
If you find these examples useful and/or use them, we would appreciate it if you could cite the associated publication in Hydrological Processes. Details will be made available upon final publication.
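As an illustration of the kind of script involved, the following flow duration curve example is our own sketch on synthetic data, not one of the four published worked examples.

```python
# The kind of short hydrological script ChatGPT is prompted for above:
# a flow duration curve computed from (synthetic) daily discharge data.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
discharge = rng.lognormal(mean=1.0, sigma=0.8, size=365)  # daily flows, m^3/s

# Sort flows in descending order and compute exceedance probability
# using the Weibull plotting position.
sorted_q = np.sort(discharge)[::-1]
exceedance = 100 * np.arange(1, len(sorted_q) + 1) / (len(sorted_q) + 1)

plt.semilogy(exceedance, sorted_q)
plt.xlabel("Exceedance probability (%)")
plt.ylabel("Discharge (m$^3$/s)")
plt.title("Flow duration curve (synthetic data)")
plt.show()
```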
A dataset including texts written by humans (labeled 0) and then rephrased by ChatGPT (labeled 1), created to train models for machine-generated text detection.
It is a robust dataset: it includes texts of various lengths, and the human texts are taken from multiple sources.
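A minimal sketch of the intended use, training a detector on the 0/1 labels; the example texts and model choice are our assumptions, not part of the dataset.

```python
# Train a toy machine-generated-text detector on 0 (human) / 1
# (ChatGPT-rephrased) labels; texts and model are illustrative only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts  = ["an original human-written sentence",
          "a ChatGPT rephrasing of that sentence"]
labels = [0, 1]  # 0 = human, 1 = rephrased by ChatGPT

detector = make_pipeline(CountVectorizer(), MultinomialNB())
detector.fit(texts, labels)
print(detector.predict(["some new text to score"]))
```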
Unknown license: https://choosealicense.com/licenses/unknown/
ChatGPT-4o Writing Prompts
This is a dataset containing 3746 short stories, generated with OpenAI's chatgpt-4o-latest model and using Reddit's Writing Prompts subreddit as a source. Each sample is generally between 6000-8000 characters long.
These stories were thoroughly cleaned and then further enriched with a title and a series of applicable genres.
Note that I did not touch the Markdown ChatGPT-4o produced by itself to enrich its output, as I very much enjoy the added flavour… See the full description on the dataset page: https://huggingface.co/datasets/Gryphe/ChatGPT-4o-Writing-Prompts.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Welcome to the "Awesome ChatGPT Prompts" dataset on Kaggle! This is a collection of prompt examples to be used with the ChatGPT model.
The ChatGPT model is a large language model trained by OpenAI that is capable of generating human-like text. By providing it with a prompt, it can generate responses that continue the conversation or expand on the given prompt.
CC0
Original Data Source: Awesome ChatGPT Prompts
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background: Clinical data is instrumental to medical research, machine learning (ML) model development, and advancing surgical care, but access is often constrained by privacy regulations and missing data. Synthetic data offers a promising solution to preserve privacy while enabling broader data access. Recent advances in large language models (LLMs) provide an opportunity to generate synthetic data with reduced reliance on domain expertise, computational resources, and pre-training.

Objective: This study aims to assess the feasibility of generating realistic tabular clinical data with OpenAI's GPT-4o using zero-shot prompting, and to evaluate the fidelity of LLM-generated data by comparing its statistical properties to the Vital Signs DataBase (VitalDB), a real-world open-source perioperative dataset.

Methods: In Phase 1, GPT-4o was prompted to generate a dataset from qualitative descriptions of 13 clinical parameters. The resulting data was assessed for general errors, plausibility of outputs, and cross-verification of related parameters. In Phase 2, GPT-4o was prompted to generate a dataset using descriptive statistics of the VitalDB dataset. Fidelity was assessed using two-sample t-tests, two-sample proportion tests, and 95% confidence interval (CI) overlap.

Results: In Phase 1, GPT-4o generated a complete and structured dataset comprising 6,166 case files. The dataset was plausible in range and correctly calculated body mass index for all case files based on the respective heights and weights. Statistical comparison between the LLM-generated datasets and VitalDB revealed that Phase 2 data achieved significant fidelity. Phase 2 data demonstrated statistical similarity in 12/13 (92.31%) parameters: no statistically significant differences were observed in 6/6 (100.0%) categorical/binary and 6/7 (85.71%) continuous parameters. Overlap of 95% CIs was observed in 6/7 (85.71%) continuous parameters.

Conclusion: Zero-shot prompting with GPT-4o can generate realistic tabular synthetic datasets that replicate key statistical properties of real-world perioperative data. This study highlights the potential of LLMs as a novel and accessible modality for synthetic data generation, which may address critical barriers in clinical data access and eliminate the need for technical expertise, extensive computational resources, and pre-training. Further research is warranted to enhance fidelity and investigate the use of LLMs to amplify and augment datasets, preserve multivariate relationships, and train robust ML models.
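A sketch of the Phase 2 fidelity checks named above (two-sample t-test, two-sample proportion test, 95% CI overlap), using simulated values in place of the VitalDB and GPT-4o datasets.

```python
# Fidelity checks of the kind described in the Methods; all data below
# are simulated stand-ins, not VitalDB values or GPT-4o outputs.
import numpy as np
from scipy import stats
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(1)
real_age  = rng.normal(58, 14, 6000)    # e.g. a continuous parameter (age)
synth_age = rng.normal(58.4, 14, 6166)  # synthetic counterpart

t, p = stats.ttest_ind(real_age, synth_age)  # two-sample t-test
print(f"two-sample t-test: p = {p:.3f}")

# Binary parameter: counts of one category in each dataset (assumed).
stat, p_prop = proportions_ztest([2900, 3010], [6000, 6166])
print(f"two-sample proportion test: p = {p_prop:.3f}")

# 95% CI overlap for the means of a continuous parameter.
def ci95(x):
    se = x.std(ddof=1) / np.sqrt(len(x))
    return x.mean() - 1.96 * se, x.mean() + 1.96 * se

lo1, hi1 = ci95(real_age)
lo2, hi2 = ci95(synth_age)
print("CIs overlap:", max(lo1, lo2) <= min(hi1, hi2))
```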
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Responses generated by ChatGPT 3.5, ChatGPT 4, Llama 3-8B, and Mistral-7B to NYT and HC3 topics, in different roles and parameter configurations.
The dataset is useful to study lexical aspects of LLMs with different parameters/roles configurations.
The 0_Base_Topics.xlsx file lists the topics used for the dataset generation.
The rest of the files collect the answers of ChatGPT to these topics under different configurations of parameters and context (a minimal API sketch of one configuration appears after the role list below):
Temperature (parameter): Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic.
Frequency penalty (parameter): Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
Top probability (parameter): An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass.
Presence penalty (parameter): Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
Roles (context)
Default: No role is assigned to the LLM; the default role is used.
Child: The LLM is requested to answer as a five-year-old child.
Young adult male: The LLM is requested to answer as a young male adult.
Young adult female: The LLM is requested to answer as a young female adult.
Elderly adult male: The LLM is requested to answer as an elderly male adult.
Elderly adult female: The LLM is requested to answer as an elderly female adult.
Affluent adult male: The LLM is requested to answer as an affluent male adult.
Affluent adult female: The LLM is requested to answer as an affluent female adult.
Lower-class adult male: The LLM is requested to answer as a lower-class male adult.
Lower-class adult female: The LLM is requested to answer as a lower-class female adult.
Erudite: The LLM is requested to answer as an erudite who uses a rich vocabulary.
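A minimal sketch of one parameter/role configuration from the lists above, assuming the openai Python client; the model name, topic, and exact values are illustrative, not the calls used to build this dataset.

```python
# One parameter/role configuration of the kind listed above,
# expressed as an OpenAI chat completion call (values illustrative).
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # assumed model name
    messages=[
        # Role (context): one of the personas listed above.
        {"role": "system", "content": "Answer as a five-year-old child."},
        {"role": "user", "content": "Tell me about space travel."},
    ],
    temperature=0.8,        # higher values -> more random output
    top_p=1.0,              # nucleus sampling probability mass
    frequency_penalty=0.5,  # discourage verbatim repetition
    presence_penalty=0.5,   # encourage new topics
)
print(response.choices[0].message.content)
```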
Paper
Paper: Beware of Words: Evaluating the Lexical Diversity of Conversational LLMs using ChatGPT as Case Study
Cite:
@article{10.1145/3696459,
  author    = {Mart\'{\i}nez, Gonzalo and Hern\'{a}ndez, Jos\'{e} Alberto and Conde, Javier and Reviriego, Pedro and Merino-G\'{o}mez, Elena},
  title     = {Beware of Words: Evaluating the Lexical Diversity of Conversational LLMs using ChatGPT as Case Study},
  year      = {2024},
  publisher = {Association for Computing Machinery},
  address   = {New York, NY, USA},
  issn      = {2157-6904},
  url       = {https://doi.org/10.1145/3696459},
  doi       = {10.1145/3696459},
  note      = {Just Accepted},
  journal   = {ACM Trans. Intell. Syst. Technol.},
  month     = sep,
  keywords  = {LLM, Lexical diversity, ChatGPT, Evaluation}
}
Privacy policy: https://electroiq.com/privacy-policy
ChatGPT Statistics: In today's technologically advancing world, Artificial Intelligence (AI) is no longer just science fiction; it has become an integral part of everyday life. One of the most exciting examples of AI in action is ChatGPT, a powerful language model developed by OpenAI. ChatGPT is a conversational AI tool capable of generating human-like responses and assisting with a variety of tasks ranging from writing to coding, customer service, education, and more. Its use in everyday life is growing enormously, as it makes communication faster, smarter, and more intuitive.
This article examines how ChatGPT operates and presents statistical analysis from various perspectives, including its practical applications and the evolving conversations surrounding its benefits, limitations, and future potential.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
While automated test generation can decrease the human burden associated with testing, it does not eliminate this burden. Humans must still work with generated test cases to interpret testing results, debug the code, build and maintain a comprehensive test suite, and perform many other tasks. Therefore, a major challenge with automated test generation is the understandability of the generated test cases.
Large language models (LLMs), machine learning models trained on massive corpora of textual data - including both natural language and programming languages - are an emerging technology with great potential for performing language-related predictive tasks such as translation, summarization, and decision support.
In this study, we are exploring the capabilities of LLMs with regard to improving test case understandability.
This package contains the data produced during this exploration:
The examples directory contains the three case studies we tested our transformation process on:
queue_example: Tests of a basic queue data structure
httpie_sessions: Tests of the sessions module from the httpie project.
string_utils_validation: Tests of the validation module from the python-string-utils project.
Each directory contains the modules-under-test, the original test cases generated by Pynguin, and the transformed test cases.
Two trials of the transformation technique were performed per case example to assess the impact of varying results from the LLM.
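As a hypothetical illustration of the transformation (not an actual test case from the examples directory), a raw generated test and a more readable rewritten version might look like this:

```python
# Hypothetical before/after for the readability transformation;
# the Queue class is a stand-in, not the actual module-under-test.
class Queue:
    def __init__(self):
        self._items = []
    def enqueue(self, x):
        self._items.append(x)
    def size(self):
        return len(self._items)

# Before: a typical generated test with opaque names and no stated intent.
def test_case_0():
    var_0 = Queue()
    var_0.enqueue(42)
    assert var_0.size() == 1

# After: the LLM-transformed version with descriptive naming and comments.
def test_enqueue_increases_size():
    queue = Queue()
    # Enqueuing one element should grow the queue from empty to size 1.
    queue.enqueue(42)
    assert queue.size() == 1
```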
The survey directory contains the survey that was sent to assess the impact of the transformation on test readability.
survey.pdf contains the survey questions.
responses.xlsx contains the survey results.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
AH&AITD is a comprehensive benchmark dataset designed to support the evaluation of AI-generated text detection tools. The dataset contains 11,580 samples spanning both human-written and AI-generated content across multiple domains. It was developed to address limitations in previous datasets, particularly in terms of diversity, scale, and real-world applicability, and to facilitate research in the detection of AI-generated text by providing a diverse, multi-domain dataset that enables fair benchmarking of detection tools across various writing styles and content categories.

Composition

1. Human-Written Samples (total: 5,790), collected from:

Open Web Text (2,343 samples)
Blogs (196 samples)
Web Text (397 samples)
Q&A Platforms (670 samples)
News Articles (430 samples)
Opinion Statements (1,549 samples)
Scientific Research Abstracts (205 samples)

2. AI-Generated Samples (total: 5,790), generated using:

ChatGPT (1,130 samples)
GPT-4 (744 samples)
Paraphrase Models (1,694 samples)
GPT-2 (328 samples)
GPT-3 (296 samples)
DaVinci (GPT-3.5 variant) (433 samples)
GPT-3.5 (364 samples)
OPT-IML (406 samples)
Flan-T5 (395 samples)

Citation: Akram, A. (2023). AH&AITD: Arslan's Human and AI Text Database. [Dataset]. Associated with the article: An Empirical Study of AI-Generated Text Detection Tools. Advances in Machine Learning & Artificial Intelligence, 4(2), 44–55.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Dataset Card for "code_exercises"
Code exercise
This dataset is composed of a diverse set of ~120k Python code exercises (~120m total tokens) generated by ChatGPT 3.5. It is designed to distill ChatGPT 3.5's knowledge about Python coding tasks into other (potentially smaller) models. The exercises have been generated by following the steps described in the related GitHub repository. The generated exercises follow the format of the HumanEval benchmark. Each training sample… See the full description on the dataset page: https://huggingface.co/datasets/jinaai/code_exercises.
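The HumanEval format pairs a function signature and docstring with a reference solution; an exercise in that style might look like the following. This is our own illustration, not an actual row from the dataset.

```python
# Illustrative exercise in the HumanEval style (signature + docstring,
# then a solution); not an actual sample from code_exercises.
def running_maximum(numbers: list[int]) -> list[int]:
    """Return a list where element i is the maximum of numbers[:i+1].

    >>> running_maximum([3, 1, 4, 1, 5])
    [3, 3, 4, 4, 5]
    """
    result, current = [], float("-inf")
    for n in numbers:
        current = max(current, n)
        result.append(current)
    return result

assert running_maximum([3, 1, 4, 1, 5]) == [3, 3, 4, 4, 5]
```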
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Human interactions involve dialogue acts that can be responded to by acceptive or dismissive acts. For example, a person sharing a painful past event uses the dialogue act of disclosure. The disclosure can be responded to empathically or by dismissing the pain. Here, we address the challenge of automatically identifying dismissive and acceptive dialogue acts in a conversation. We used massive AI-generated datasets of utterances and dialogues expressing acceptive/dismissive behavior to address the challenge. We then trained and tested several machine-learning models, which performed well at classifying utterances as acceptive or dismissive. The basic approach described in this paper can empower the development of automatic interactive systems in contexts ranging from artificial therapists to assistant robots for the elderly.
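A minimal sketch of the classification task described, assuming a simple bag-of-words pipeline; the example utterances and model choice are illustrative, and the paper's models may differ.

```python
# Toy acceptive (1) vs dismissive (0) utterance classifier; the training
# examples and pipeline are our assumptions, not the paper's setup.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

utterances = [
    "That sounds really painful, thank you for telling me.",  # acceptive
    "I hear how hard that was for you.",                      # acceptive
    "You're overreacting, it wasn't a big deal.",             # dismissive
    "Everyone goes through that, just move on.",              # dismissive
]
labels = [1, 1, 0, 0]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(utterances, labels)
print(clf.predict(["I'm sorry you went through that."]))
```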
Dataset Card for ChatGPT Jailbreak Prompts
Name
ChatGPT Jailbreak Prompts
Dataset Summary
ChatGPT Jailbreak Prompts is a complete collection of jailbreak-related prompts for ChatGPT. This dataset is intended to provide a valuable resource for understanding and generating text in the context of ChatGPT jailbreaking.
Languages
English