80 datasets found
  1. Estimated water consumption for training GPT-3 2023

    • statista.com
    Cite
    Statista, Estimated water consumption for training GPT-3 2023 [Dataset]. https://www.statista.com/statistics/1536925/gpt-3-estimated-water-consumption-training/
    Explore at:
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Jul 2023
    Area covered
    Worldwide
    Description

    GPT-3's water consumption for the training phase was estimated at roughly 4.8 billion liters of water, assuming the model was trained in Microsoft's Iowa data center (OpenAI has disclosed that this data center was used to train parts of the GPT-4 model). Had the model been trained entirely in the Washington data center, water consumption could have been as high as 15 billion liters, which would have amounted to more than Microsoft's total water withdrawals in 2023.

  2. Test dataset of ChatGPT in medical field

    • scidb.cn
    Updated Mar 3, 2023
    Cite
    robin shen (2023). Test dataset of ChatGPT in medical field [Dataset]. http://doi.org/10.57760/sciencedb.o00130.00001
    Explore at:
    Croissant - Croissant is a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Mar 3, 2023
    Dataset provided by
    Science Data Bank
    Authors
    robin shen
    License

    Attribution 4.0 (CC BY 4.0) - https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The researcher tests the QA capability of ChatGPT in the medical field from the following aspects:
    1. Test its reserve of medical knowledge.
    2. Check its ability to read and understand medical literature.
    3. Test its ability to provide auxiliary diagnosis after reading case data.
    4. Test its error correction ability on case data.
    5. Test its ability to standardize medical terms.
    6. Test its ability to evaluate experts.
    7. Check its ability to evaluate medical institutions.

    The conclusions are:
    • ChatGPT has great potential in medical and healthcare applications and may, in some fields, directly replace humans or even professionals at a certain level.
    • The researcher preliminarily believes that ChatGPT has basic medical knowledge and the ability to hold multi-round dialogues, and that its ability to understand Chinese is not weak.
    • ChatGPT has the ability to read, understand and correct cases.
    • ChatGPT has the ability to extract information and standardize terminology, and does so quite well.
    • ChatGPT has medical-knowledge reasoning ability.
    • ChatGPT has the ability to learn continuously; after continuous training, its level improved significantly.
    • ChatGPT does not have the ability to academically evaluate Chinese medical experts, and the results are not ideal.
    • ChatGPT does not have the ability to academically evaluate Chinese medical institutions, and the results are not ideal.
    • ChatGPT is an epoch-making product that can become a useful assistant for medical diagnosis and treatment, knowledge services, literature reading, reviewing and paper writing.

  3. ChatGPT Revenue and Usage Statistics (2025)

    • businessofapps.com
    Updated Feb 9, 2023
    Cite
    Business of Apps (2023). ChatGPT Revenue and Usage Statistics (2025) [Dataset]. https://www.businessofapps.com/data/chatgpt-statistics/
    Explore at:
    Dataset updated
    Feb 9, 2023
    Dataset authored and provided by
    Business of Apps
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0) - https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    ChatGPT was the chatbot that kickstarted the generative AI revolution, which has driven hundreds of billions of dollars of spending on data centres, graphics chips and AI startups. Launched by...

  4. Date Set: ChatGPT as an education and learning tool for engineering,...

    • data.mendeley.com
    Updated Jun 25, 2025
    + more versions
    Cite
    RAVINDRA BHARDWAJ (2025). Date Set: ChatGPT as an education and learning tool for engineering, technology and general studies: performance analysis of ChatGPT 3.0 on CSE, GATE and JEE examinations of India [Dataset]. http://doi.org/10.17632/995zwcz5yt.2
    Explore at:
    Dataset updated
    Jun 25, 2025
    Authors
    RAVINDRA BHARDWAJ
    License

    Attribution 4.0 (CC BY 4.0) - https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    India
    Description

    This is the raw data that is used in the publication: ChatGPT as an education and learning tool for engineering, technology and general studies: performance analysis of ChatGPT 3.0 on CSE, GATE and JEE examinations of India.

  5. Few, But More Orgnized data for train and test!

    • kaggle.com
    zip
    Updated Nov 26, 2023
    Cite
    Reza JafariRaviz (2023). Few, But More Orgnized data for train and test! [Dataset]. https://www.kaggle.com/datasets/rezajafariraviz/few-but-more-orgnized-data-for-train-and-test
    Explore at:
    zip (2185756 bytes). Available download formats
    Dataset updated
    Nov 26, 2023
    Authors
    Reza JafariRaviz
    License

    Apache License, v2.0 - https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    The data has been created for use in an AI detection competition. Two prompts are passed to chatbots to elicit responses. The chatbots used are Bing, Bard, and ChatGPT. The data is also labeled to indicate whether the prompt includes the source text or not.

  6. ChatGPT Classification Dataset

    • kaggle.com
    zip
    Updated Sep 7, 2023
    Cite
    Mahdi (2023). ChatGPT Classification Dataset [Dataset]. https://www.kaggle.com/datasets/mahdimaktabdar/chatgpt-classification-dataset
    Explore at:
    zip (718710 bytes). Available download formats
    Dataset updated
    Sep 7, 2023
    Authors
    Mahdi
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) - https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    We have compiled a dataset that consists of textual articles covering common terminology, concepts and definitions in the fields of computer science, artificial intelligence and cyber security. This dataset consists of both human-generated text and OpenAI's ChatGPT-generated text. Human-generated answers were collected from different computer science dictionaries and encyclopedias, including “The Encyclopedia of Computer Science and Technology” and "Encyclopedia of Human-Computer Interaction". The AI-generated content in our dataset was produced by posting questions to OpenAI's ChatGPT and manually documenting the resulting responses. A rigorous data-cleaning process was performed to remove unwanted Unicode characters and styling and formatting tags. To structure our dataset for binary classification, we combined both AI-generated and human-generated answers into a single column and assigned appropriate labels to each data point (human-generated = 0 and AI-generated = 1).

    This creates our article-level dataset (article_level_data.csv) which consists of a total of 1018 articles, 509 AI-generated and 509 Human-generated. Additionally, we have divided each article into its sentences and labelled them accordingly. This is mainly to evaluate the performance of classification models and pipelines when it comes to shorter sentence-level data points. This constructs our sentence-level dataset (sentence_level_data.csv) which consists of a total of 7344 entries (4008 AI-generated and 3336 Human-generated).
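    To give a sense of how this layout can be consumed, here is a minimal sketch of a baseline binary classifier on the article-level file using scikit-learn. The column names "text" and "label" are assumptions about the CSV header, not documented above, so adjust them to the actual file.

      # Minimal sketch: TF-IDF + logistic regression baseline on article_level_data.csv.
      # Column names "text" and "label" (0 = human, 1 = AI) are assumed; check the CSV header.
      import pandas as pd
      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.linear_model import LogisticRegression
      from sklearn.metrics import classification_report
      from sklearn.model_selection import train_test_split

      df = pd.read_csv("article_level_data.csv")
      X_train, X_test, y_train, y_test = train_test_split(
          df["text"], df["label"], test_size=0.2, stratify=df["label"], random_state=42
      )

      vectorizer = TfidfVectorizer(max_features=20000, ngram_range=(1, 2))
      clf = LogisticRegression(max_iter=1000)
      clf.fit(vectorizer.fit_transform(X_train), y_train)

      preds = clf.predict(vectorizer.transform(X_test))
      print(classification_report(y_test, preds, target_names=["human", "AI"]))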

    We would appreciate it if you cite the following article when using this dataset in any scientific publication:

    Maktab Dar Oghaz, M., Dhame, K., Singaram, G., & Babu Saheer, L. (2023). Detection and Classification of ChatGPT Generated Contents Using Deep Transformer Models. Frontiers in Artificial Intelligence.

    https://www.techrxiv.org/users/692552/articles/682641/master/file/data/ChatGPT_generated_Content_Detection/ChatGPT_generated_Content_Detection.pdf

  7. Data Sheet 2_Large language models generating synthetic clinical datasets: a...

    • frontiersin.figshare.com
    • figshare.com
    xlsx
    Updated Feb 5, 2025
    + more versions
    Cite
    Austin A. Barr; Joshua Quan; Eddie Guo; Emre Sezgin (2025). Data Sheet 2_Large language models generating synthetic clinical datasets: a feasibility and comparative analysis with real-world perioperative data.xlsx [Dataset]. http://doi.org/10.3389/frai.2025.1533508.s002
    Explore at:
    xlsx. Available download formats
    Dataset updated
    Feb 5, 2025
    Dataset provided by
    Frontiers
    Authors
    Austin A. Barr; Joshua Quan; Eddie Guo; Emre Sezgin
    License

    Attribution 4.0 (CC BY 4.0) - https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: Clinical data is instrumental to medical research, machine learning (ML) model development, and advancing surgical care, but access is often constrained by privacy regulations and missing data. Synthetic data offers a promising solution to preserve privacy while enabling broader data access. Recent advances in large language models (LLMs) provide an opportunity to generate synthetic data with reduced reliance on domain expertise, computational resources, and pre-training.

    Objective: This study aims to assess the feasibility of generating realistic tabular clinical data with OpenAI's GPT-4o using zero-shot prompting, and evaluate the fidelity of LLM-generated data by comparing its statistical properties to the Vital Signs DataBase (VitalDB), a real-world open-source perioperative dataset.

    Methods: In Phase 1, GPT-4o was prompted to generate a dataset with qualitative descriptions of 13 clinical parameters. The resultant data was assessed for general errors, plausibility of outputs, and cross-verification of related parameters. In Phase 2, GPT-4o was prompted to generate a dataset using descriptive statistics of the VitalDB dataset. Fidelity was assessed using two-sample t-tests, two-sample proportion tests, and 95% confidence interval (CI) overlap.

    Results: In Phase 1, GPT-4o generated a complete and structured dataset comprising 6,166 case files. The dataset was plausible in range and correctly calculated body mass index for all case files based on respective heights and weights. Statistical comparison between the LLM-generated datasets and VitalDB revealed that Phase 2 data achieved significant fidelity. Phase 2 data demonstrated statistical similarity in 12/13 (92.31%) parameters, whereby no statistically significant differences were observed in 6/6 (100.0%) categorical/binary and 6/7 (85.71%) continuous parameters. Overlap of 95% CIs was observed in 6/7 (85.71%) continuous parameters.

    Conclusion: Zero-shot prompting with GPT-4o can generate realistic tabular synthetic datasets, which can replicate key statistical properties of real-world perioperative data. This study highlights the potential of LLMs as a novel and accessible modality for synthetic data generation, which may address critical barriers in clinical data access and eliminate the need for technical expertise, extensive computational resources, and pre-training. Further research is warranted to enhance fidelity and investigate the use of LLMs to amplify and augment datasets, preserve multivariate relationships, and train robust ML models.
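    As a rough illustration of the Phase 2 fidelity checks described above (two-sample t-test for a continuous parameter, two-sample proportion test for a binary one, and 95% CI overlap), the sketch below uses placeholder arrays rather than the study's actual VitalDB or GPT-4o data.

      # Sketch of the fidelity checks described above; all values are placeholders, not study data.
      import numpy as np
      from scipy import stats
      from statsmodels.stats.proportion import proportions_ztest

      real = np.random.default_rng(0).normal(65, 12, 6000)    # e.g. a real continuous parameter
      synth = np.random.default_rng(1).normal(64, 13, 6166)   # e.g. the GPT-4o-generated counterpart

      # Two-sample (Welch) t-test for a continuous parameter
      t_stat, p_cont = stats.ttest_ind(real, synth, equal_var=False)

      # 95% confidence interval overlap for the two means
      def ci95(x):
          half = stats.sem(x) * stats.t.ppf(0.975, len(x) - 1)
          return x.mean() - half, x.mean() + half

      (lo_r, hi_r), (lo_s, hi_s) = ci95(real), ci95(synth)
      overlap = lo_r <= hi_s and lo_s <= hi_r

      # Two-sample proportion test for a binary parameter (counts are placeholders)
      z_stat, p_bin = proportions_ztest(count=np.array([2950, 3030]), nobs=np.array([6000, 6166]))

      print(f"t-test p={p_cont:.3f}, 95% CI overlap={overlap}, proportion-test p={p_bin:.3f}")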

  8. Energy consumption when training LLMs in 2022 (in MWh)

    • statista.com
    + more versions
    Cite
    Statista, Energy consumption when training LLMs in 2022 (in MWh) [Dataset]. https://www.statista.com/statistics/1384401/energy-use-when-training-llm-models/
    Explore at:
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    2022
    Area covered
    Worldwide
    Description

    Energy consumption of artificial intelligence (AI) models in training is considerable, with both GPT-3, the original release of the current iteration of OpenAI's popular ChatGPT, and Gopher consuming well over ********** megawatt-hours of energy simply for training. As this covers only training, the energy consumption over the entire usage and lifetime of GPT-3 and other large language models (LLMs) is likely significantly higher. The largest consumer of energy, GPT-3, consumed roughly the equivalent of *** Germans in 2022. While not a staggering amount, it is a considerable use of energy.

    Energy savings through AI
    While it is undoubtedly true that training LLMs takes a considerable amount of energy, the energy savings are also likely to be substantial. Any AI model that improves processes by minute amounts might save hours on shipments, liters of fuel, or dozens of computations. Each of these uses energy as well, and the sum of energy saved through an LLM might vastly outperform its energy cost. A good example is mobile phone operators, of which a ***** expect that AI might reduce power consumption by *** to ******* percent. Considering that much of the world uses mobile phones, this would be a considerable energy saver.

    Emissions are considerable
    The amount of CO2 emissions from training LLMs is also considerable, with GPT-3 producing nearly *** tonnes of CO2. This could change radically based on the types of energy production behind the emissions. Most data center operators, for instance, would prefer nuclear energy to play a key role, as it is a significantly low-emission energy producer.

  9. GPQA Benchmarks and Pricing, Aug 2025

    • binaryverseai.com
    Updated Aug 9, 2025
    Cite
    VALS AI (2025). GPQA Benchmarks and Pricing, Aug 2025 [Dataset]. https://binaryverseai.com/chatgpt-o3-pro-review-benchmarks-hacks/
    Explore at:
    Dataset updated
    Aug 9, 2025
    Dataset authored and provided by
    VALS AI
    License

    Attribution 4.0 (CC BY 4.0) - https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Comparison of model accuracy on GPQA, token pricing, and latency for leading AI reasoning models.

  10. Global interest in ChatGPT on Google search weekly 2022-2025

    • statista.com
    Updated Nov 22, 2025
    Cite
    Statista (2025). Global interest in ChatGPT on Google search weekly 2022-2025 [Dataset]. https://www.statista.com/statistics/1366930/chatgpt-google-search-weekly-worldwide/
    Explore at:
    Dataset updated
    Nov 22, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Nov 6, 2022 - Oct 25, 2025
    Area covered
    Worldwide
    Description

    In the week from October 19 to 25, 2025, global Google searches for "ChatGPT" peaked at 100 index points, the highest level of interest over the observed period. On October 21, 2025, OpenAI introduced ChatGPT Atlas, a web browser with ChatGPT built in. Interest in the chatbot, developed by U.S.-based OpenAI and launched in November 2022, started rising in the week ending December 3, 2022. ChatGPT, which stands for Chat Generative Pre-trained Transformer, is an AI-powered generative text system able to give human-sounding replies and reproduce human-like interactions when prompted.

  11. ChatGPT for MLM

    • kaggle.com
    zip
    Updated Mar 15, 2023
    Cite
    Evgenii Pishchik (2023). ChatGPT for MLM [Dataset]. https://www.kaggle.com/datasets/pe4eniks/chatgpt-for-mlm/code
    Explore at:
    zip (20267 bytes). Available download formats
    Dataset updated
    Mar 15, 2023
    Authors
    Evgenii Pishchik
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description


    This is a small dataset of synthetically generated samples for the masked language modeling (MLM) task, produced using ChatGPT.

    For data construction, the following requests were used. All requests were generated one after another within a single chat.

    140 queries about general CV.
    40 queries about datasets for CV.
    40 queries about articles in CV.
    20 queries about transformers in CV.
    20 queries about training pipelines in CV.
    20 queries about libraries for CV.
    20 queries about hardware for CV.
    

    Training.

    You are given a prompt containing one [MASK] token; the task is to predict the correct word at that position.

    Data structure.

    • data.csv - main file with all data.
    • synthetic.txt - raw outputs from ChatGPT.
    • preprocess.py - conversion of the raw outputs into structured data.
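    Since only the file names are documented above, the following is a generic sketch (not the author's preprocess.py) of predicting the [MASK] token with an off-the-shelf masked language model via the Hugging Face transformers pipeline; the "text" column name and the bert-base-uncased checkpoint are arbitrary choices for illustration.

      # Generic fill-mask sketch; the model checkpoint and column name are assumptions.
      import pandas as pd
      from transformers import pipeline

      fill_mask = pipeline("fill-mask", model="bert-base-uncased")  # this model uses the [MASK] token

      df = pd.read_csv("data.csv")   # column name "text" is an assumption; check the CSV header
      prompt = df["text"].iloc[0]    # e.g. "Convolutional networks are widely used in [MASK] vision."

      for candidate in fill_mask(prompt, top_k=3):
          print(candidate["token_str"], round(candidate["score"], 3))
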
  12. Top web domains cited by LLMs 2025

    • statista.com
    Updated Jun 29, 2025
    Cite
    Statista (2025). Top web domains cited by LLMs 2025 [Dataset]. https://www.statista.com/statistics/1620335/top-web-domains-cited-by-llms/
    Explore at:
    Dataset updated
    Jun 29, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Jun 2025
    Area covered
    Worldwide
    Description

    A June 2025 study found that ****** was the most frequently cited web domain by large language models (LLMs). The platform was referenced in approximately ** percent of the analyzed cases, likely due to the content licensing agreement between Google and Reddit in early 2024 for the purpose of AI model training. ********* ranked second, mentioned in roughly ** percent of cases, while ****** and ******* were each mentioned in ** percent.

  13. Data Sheet 1_Evaluating the strengths and limitations of multimodal...

    • frontiersin.figshare.com
    docx
    Updated Jun 7, 2024
    + more versions
    Cite
    Saif Aldeen AlRyalat; Ayman Mohammed Musleh; Malik Y. Kahook (2024). Data Sheet 1_Evaluating the strengths and limitations of multimodal ChatGPT-4 in detecting glaucoma using fundus images.docx [Dataset]. http://doi.org/10.3389/fopht.2024.1387190.s001
    Explore at:
    docx. Available download formats
    Dataset updated
    Jun 7, 2024
    Dataset provided by
    Frontiers
    Authors
    Saif Aldeen AlRyalat; Ayman Mohammed Musleh; Malik Y. Kahook
    License

    Attribution 4.0 (CC BY 4.0) - https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Overview: This study evaluates the diagnostic accuracy of a multimodal large language model (LLM), ChatGPT-4, in recognizing glaucoma using color fundus photographs (CFPs) with a benchmark dataset and without prior training or fine tuning.

    Methods: The publicly accessible Retinal Fundus Glaucoma Challenge "REFUGE" dataset was utilized for analyses. The input data consisted of the entire 400-image testing set. The task involved classifying fundus images into either 'Likely Glaucomatous' or 'Likely Non-Glaucomatous'. We constructed a confusion matrix to visualize the results of predictions from ChatGPT-4, focusing on accuracy of binary classifications (glaucoma vs non-glaucoma).

    Results: ChatGPT-4 demonstrated an accuracy of 90% with a 95% confidence interval (CI) of 87.06%-92.94%. The sensitivity was found to be 50% (95% CI: 34.51%-65.49%), while the specificity was 94.44% (95% CI: 92.08%-96.81%). The precision was recorded at 50% (95% CI: 34.51%-65.49%), and the F1 Score was 0.50.

    Conclusion: ChatGPT-4 achieved relatively high diagnostic accuracy without prior fine tuning on CFPs. Considering the scarcity of data in specialized medical fields, including ophthalmology, the use of advanced AI techniques, such as LLMs, might require less data for training compared to other forms of AI, with potential savings in time and financial resources. It may also pave the way for the development of innovative tools to support specialized medical care, particularly those dependent on multimodal data for diagnosis and follow-up, irrespective of resource constraints.
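    The reported metrics follow directly from a 2x2 confusion matrix. The counts in the sketch below (TP=20, FN=20, FP=20, TN=340 on the 400-image REFUGE test set) are not taken from the paper's data sheet; they are inferred from the stated sensitivity, specificity and accuracy, so treat them as an illustrative reconstruction.

      # Reconstructing the reported metrics from an assumed 2x2 confusion matrix
      # (counts inferred from the stated 90% accuracy, 50% sensitivity, 94.44% specificity).
      tp, fn = 20, 20     # glaucomatous images classified correctly / missed
      fp, tn = 20, 340    # non-glaucomatous images flagged as glaucoma / classified correctly

      accuracy    = (tp + tn) / (tp + tn + fp + fn)
      sensitivity = tp / (tp + fn)                     # recall for the glaucoma class
      specificity = tn / (tn + fp)
      precision   = tp / (tp + fp)
      f1          = 2 * precision * sensitivity / (precision + sensitivity)

      print(f"accuracy={accuracy:.2%}  sensitivity={sensitivity:.2%}  "
            f"specificity={specificity:.2%}  precision={precision:.2%}  F1={f1:.2f}")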

  14. PIM treatments optimization with PM-TOM using STOPP and Beers criteria and...

    • data.mendeley.com
    Updated Nov 11, 2024
    + more versions
    Cite
    Adnan Kulenovic (2024). PIM treatments optimization with PM-TOM using STOPP and Beers criteria and ChatGPT - a case study [Dataset]. http://doi.org/10.17632/3mcz5hy342.2
    Explore at:
    Dataset updated
    Nov 11, 2024
    Authors
    Adnan Kulenovic
    License

    Attribution 4.0 (CC BY 4.0) - https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    PM-TOM (Personalized Medicine: Therapy Optimization Method) is a decision support tool designed to find treatments with the minimal STOPP and Beers criteria risks and ADRs caused by ADEs, DDIs, DCIs, DGIs and DFIs. The tool optimizes a treatment by considering drug classes selected by health professionals and applicable to the patient's conditions.

    This data set includes the details of a polypharmacy treatment of an older patient's case at admission to a deprescribing facility, discharge, and after applying the PM-TOM optimization. All three treatments were reviewed by ChatGPT 4.0, trained on a large set of medical literature, which confirmed the benefits of the optimized treatment and its alignment with the Beers and STOPP/START criteria.

    The integrated PM-TOM and ChatGPT approach has several advantages:
    1. PM-TOM simplifies finding and reviewing effective drug regimens, allowing healthcare providers to leverage their expertise more efficiently.
    2. Detailed PM-TOM reports facilitate the involvement of pharmacists in monitoring and adjusting polypharmacy treatments.
    3. Targeted PM-TOM recommendations help reduce non-actionable alerts, mitigating alert fatigue and minimizing prescribing errors.
    4. PM-TOM rapidly evaluates and optimizes complex treatment plans, enabling consideration of multiple safe alternatives.
    5. When applied at the primary care level, this approach helps prevent and reduce inappropriate prescribing, including prescribing cascades.
    6. AI tools like ChatGPT, trained on up-to-date medical information, provide additional insights to help healthcare providers refine treatment plans.

  15. DAIGT-V4-TRAIN-DATASET

    • kaggle.com
    zip
    Updated Jan 15, 2024
    Cite
    Darek Kłeczek (2024). DAIGT-V4-TRAIN-DATASET [Dataset]. https://www.kaggle.com/datasets/thedrcat/daigt-v4-train-dataset/data?select=train_v4_drcat_01.csv
    Explore at:
    zip (51270920 bytes). Available download formats
    Dataset updated
    Jan 15, 2024
    Authors
    Darek Kłeczek
    Description

    New release of DAIGT train dataset! Improvement:

    Everything that was already in V3 dataset, plus a little bit of extra magic!

    8000+ texts I generated with llama-based models finetuned on Persuade corpus 🔥🔥🔥

    Sources (please upvote the original datasets!):

    License: MIT for the data I generated. Check source datasets for the other sources mentioned above.

  16. DAIGT Proper Train Dataset

    • kaggle.com
    zip
    Updated Nov 5, 2023
    Cite
    Darek Kłeczek (2023). DAIGT Proper Train Dataset [Dataset]. https://www.kaggle.com/datasets/thedrcat/daigt-proper-train-dataset
    Explore at:
    zip (124388618 bytes). Available download formats
    Dataset updated
    Nov 5, 2023
    Authors
    Darek Kłeczek
    License

    MIT License - https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Version 2 updated on 11/2/2023:

    Since there is no proper train dataset for the LLM - Detect AI Generated Text competition, I decided to create one.

    Ingredients (please upvote the included datasets!):
    • Text generated with ChatGPT by MOTH (https://www.kaggle.com/datasets/alejopaullier/daigt-external-dataset)
    • Persuade corpus contributed by Nicholas Broad (https://www.kaggle.com/datasets/nbroad/persaude-corpus-2/)
    • Text generated with Llama-70b and Falcon180b by Nicholas Broad (https://www.kaggle.com/datasets/nbroad/daigt-data-llama-70b-and-falcon180b)
    • Text generated with ChatGPT by Radek (https://www.kaggle.com/datasets/radek1/llm-generated-essays)
    • Official train essays
    • Essays I generated with various LLMs

    New version includes:
    • EssayID if available
    • Generation prompt if available
    • Random 10-fold split stratified by source dataset (see the sketch after the version notes below)

    Version 3 updated on 11/3/2023:
    • Additional 2400+ AI examples generated with Mistral 7B instruct and a new prompt (let's see how it works!)

    Version 4 updated on 11/5/2023:
    • Additional 2000 Claude essays generated by @darraghdog (https://www.kaggle.com/datasets/darraghdog/hello-claude-1000-essays-from-anthropic)
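    A minimal sketch of the kind of split mentioned in the version notes above (a random 10-fold split stratified by source dataset), using scikit-learn. The file name and the "source" column are assumptions about the CSV layout, not documented here.

      # Sketch: assign a 10-fold split stratified by source dataset; names are assumptions.
      import pandas as pd
      from sklearn.model_selection import StratifiedKFold

      df = pd.read_csv("train_essays.csv")          # placeholder file name; use the dataset's actual CSV
      skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

      df["fold"] = -1
      for fold, (_, val_idx) in enumerate(skf.split(df, df["source"])):
          df.loc[df.index[val_idx], "fold"] = fold  # each row gets the fold where it is held out

      print(df.groupby(["fold", "source"]).size())  # check the per-fold source distribution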

  17. A comparative evaluation of ChatGPT 3.5 and ChatGPT 4 in responses to...

    • data.niaid.nih.gov
    • search.dataone.org
    • +1 more
    zip
    Updated Jun 4, 2024
    Cite
    Scott McGrath (2024). A comparative evaluation of ChatGPT 3.5 and ChatGPT 4 in responses to selected genetics questions - Full study data [Dataset]. http://doi.org/10.5061/dryad.s4mw6m9cv
    Explore at:
    zip. Available download formats
    Dataset updated
    Jun 4, 2024
    Dataset provided by
    University of California, Berkeley
    Authors
    Scott McGrath
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Objective: Our objective is to evaluate the efficacy of ChatGPT 4 in accurately and effectively delivering genetic information, building on previous findings with ChatGPT 3.5. We focus on assessing the utility, limitations, and ethical implications of using ChatGPT in medical settings.

    Materials and Methods: A structured questionnaire, including the Brief User Survey (BUS-15) and custom questions, was developed to assess ChatGPT 4's clinical value. An expert panel of genetic counselors and clinical geneticists independently evaluated ChatGPT 4's responses to these questions. We also performed a comparative analysis with ChatGPT 3.5, utilizing descriptive statistics and using R for data analysis.

    Results: ChatGPT 4 demonstrated improvements over 3.5 in context recognition, relevance, and informativeness. However, performance variability and concerns about the naturalness of the output were noted. No significant difference in accuracy was found between ChatGPT 3.5 and 4.0. Notably, the efficacy of ChatGPT 4 varied significantly across different genetic conditions, with specific differences identified between responses related to BRCA1 and HFE.

    Discussion and Conclusion: This study highlights ChatGPT 4's potential in genomics, noting significant advancements over its predecessor. Despite these improvements, challenges remain, including the risk of outdated information and the necessity of ongoing refinement. The variability in performance across different genetic conditions underscores the need for expert oversight and continuous AI training. ChatGPT 4, while showing promise, emphasizes the importance of balancing technological innovation with ethical responsibility in healthcare information delivery.

    Methods

    Study Design: This study was conducted to evaluate the performance of ChatGPT 4 (March 23rd, 2023 model) in the context of genetic counseling and education. The evaluation involved a structured questionnaire, which included questions selected from the Brief User Survey (BUS-15) and additional custom questions designed to assess the clinical value of ChatGPT 4's responses.

    Questionnaire Development: The questionnaire was built in Qualtrics and comprised twelve questions: seven selected from the BUS-15, preceded by two additional questions that we designed. The initial questions focused on quality and answer relevancy:
    1. The overall quality of the Chatbot's response is: (5-point Likert: Very poor to Very Good)
    2. The Chatbot delivered an answer that provided the relevant information you would include if asked the question. (5-point Likert: Strongly disagree to Strongly agree)

    The BUS-15 questions (7-point Likert: Strongly disagree to Strongly agree) focused on:
    1. Recognition and facilitation of users' goal and intent: the chatbot seems able to recognize the user's intent and guide the user to its goals.
    2. Relevance of information: the chatbot provides relevant and appropriate information/answers to people at each stage to move them closer to their goal.
    3. Maxim of quantity: the chatbot responds in an informative way without adding too much information.
    4. Resilience to failure: the chatbot seems able to find ways to respond appropriately even when it encounters situations or arguments it is not equipped to handle.
    5. Understandability and politeness: the chatbot seems able to understand input and convey correct statements and answers without ambiguity and with acceptable manners.
    6. Perceived conversational credibility: the chatbot responds in a credible and informative way without adding too much information.
    7. Meeting neurodiverse needs: the chatbot seems able to meet users' needs and be usable independently of their health conditions, well-being, age, etc.

    Expert Panel and Data Collection: A panel of experts (two genetic counselors and two clinical geneticists) was provided with a link to the survey containing the questions. They independently evaluated the responses from ChatGPT 4 without discussing the questions or answers among themselves until after the survey submission. This approach ensured unbiased evaluation.

  18. IT data center systems total spending worldwide 2012-2025

    • statista.com
    Updated Jun 25, 2025
    Cite
    Statista (2025). IT data center systems total spending worldwide 2012-2025 [Dataset]. https://www.statista.com/statistics/314596/total-data-center-systems-worldwide-spending-forecast/
    Explore at:
    Dataset updated
    Jun 25, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Area covered
    Worldwide
    Description

    Worldwide spending on data center systems is projected to reach over *** billion U.S. dollars in 2025, marking a significant ** percent increase from 2024. This growth reflects the ongoing digital transformation across industries and the increasing demand for advanced computing capabilities. The surge in data center investments is closely tied to the rapid expansion of artificial intelligence technologies, particularly in the wake of generative AI.

    AI chips fuel market growth
    The rise in data center spending aligns with the booming AI chip market, which is expected to reach ** billion U.S. dollars by 2025. Nvidia has emerged as a leader in this space, with its data center revenue skyrocketing due to the crucial role its GPUs play in training and running large language models like ChatGPT. The global GPU market, valued at ** billion U.S. dollars in 2024, is a key driver of this growth, powering advancements in machine learning and deep learning applications.

    Semiconductor industry adapts to AI demands
    The broader semiconductor industry is also evolving to meet the demands of AI technologies. With global semiconductor revenues surpassing *** billion U.S. dollars in 2023, the market is expected to approach *** billion U.S. dollars in 2024. AI chips are becoming increasingly prevalent in servers, data centers and storage infrastructures. This trend is reflected in the data centers and storage semiconductor market, which is projected to grow from ** billion U.S. dollars in 2023 to *** billion U.S. dollars by 2025, driven by the development of image sensors and edge AI processors.

  19. Large Language Model content safety considerations text data

    • nexdata.ai
    • m.nexdata.ai
    Updated Oct 3, 2023
    Cite
    Nexdata (2023). Large Language Model content safety considerations text data [Dataset]. https://www.nexdata.ai/datasets/llm/1349
    Explore at:
    Dataset updated
    Oct 3, 2023
    Dataset authored and provided by
    Nexdata
    Variables measured
    Language, Data size, Data content, Storage format, Collecting type, Collecting method
    Description

    Large Language Model content safety considerations text data, about 570,000 entries in total. This dataset can be used for tasks such as LLM and ChatGPT training.

  20. chinese_chatgpt_corpus

    • huggingface.co
    Updated Apr 29, 2023
    Cite
    zeye sun (2023). chinese_chatgpt_corpus [Dataset]. https://huggingface.co/datasets/sunzeyeah/chinese_chatgpt_corpus
    Explore at:
    Croissant - Croissant is a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Apr 29, 2023
    Authors
    zeye sun
    License

    https://choosealicense.com/licenses/unknown/

    Description

    Dataset Card for chinese_chatgpt_corpus

      Dataset Summary

    This repo collects Chinese corpora for Supervised Finetuning (SFT) and Reinforcement Learning From Human Feedback (RLHF).

      Supported Tasks and Leaderboards

    More Information Needed

      Languages

    Chinese

      Dataset Structure

      Data Instances

      train_data_external_v1.jsonl

    Size of downloaded dataset files: 5.04 GB. Size of the generated dataset: 0 GB. Total amount of disk used: … See the full description on the dataset page: https://huggingface.co/datasets/sunzeyeah/chinese_chatgpt_corpus.
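    A minimal loading sketch with the Hugging Face datasets library; the data_files value is taken from the file name mentioned in the card excerpt above, but the repository's exact configuration is an assumption, so verify it against the dataset page.

      # Sketch: stream one of the JSONL files from the Hugging Face Hub.
      from datasets import load_dataset

      ds = load_dataset(
          "sunzeyeah/chinese_chatgpt_corpus",
          data_files="train_data_external_v1.jsonl",   # file name from the card excerpt above
          split="train",
          streaming=True,                              # avoid downloading the full ~5 GB file up front
      )

      for i, example in enumerate(ds):
          print(example)                               # field names depend on the repo; inspect a record
          if i == 2:
              break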
