100+ datasets found
  1. long-llm-data

    • huggingface.co
    Updated Aug 25, 2024
    Cite
    Peitian Zhang (2024). long-llm-data [Dataset]. https://huggingface.co/datasets/namespace-Pt/long-llm-data
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 25, 2024
    Authors
    Peitian Zhang
    Description

    namespace-Pt/long-llm-data dataset hosted on Hugging Face and contributed by the HF Datasets community
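    As with any public Hugging Face dataset repository, this one can be pulled with the datasets library. A minimal sketch, assuming the default configuration and split layout (not verified against this repository):

    # Minimal sketch: load the dataset from the Hugging Face Hub.
    # Requires: pip install datasets
    from datasets import load_dataset

    # Repository id taken from the citation URL above; a specific config
    # name may be required if the repository defines several.
    ds = load_dataset("namespace-Pt/long-llm-data")
    print(ds)  # summary of available splits and columns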

  2. FileMarket | Telegram Users Geolocation Data with IP & Consent | 50,000...

    • filemarket.mydatastorefront.com
    Updated Aug 18, 2024
    Cite
    FileMarket (2024). FileMarket | Telegram Users Geolocation Data with IP & Consent | 50,000 Records | AI, ML, DL & LLM Training Data [Dataset]. https://filemarket.mydatastorefront.com/products/filemarket-telegram-users-geolocation-data-with-ip-consen-filemarket
    Explore at:
    Dataset updated
    Aug 18, 2024
    Dataset authored and provided by
    FileMarket
    Area covered
    Anguilla, Saint Pierre and Miquelon, Germany, Malaysia, Qatar, Gabon, Samoa, Luxembourg, State of, Georgia
    Description

    Comprehensive dataset of Telegram users' geolocations with IP addresses, fully consented, comprising 50,000 records. Ideal for AI, ML, DL, and LLM training, this dataset provides detailed geospatial insights across various regions, enhancing geofencing, localization, and behavioral analysis models.
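    Since the description positions this dataset for geofencing, a minimal sketch of the underlying point-in-radius test may help; the lat/lon field names are hypothetical, not taken from this product's schema:

    import math

    def haversine_km(lat1, lon1, lat2, lon2):
        # Great-circle distance between two points, in kilometers.
        r = 6371.0  # mean Earth radius, km
        dp = math.radians(lat2 - lat1)
        dl = math.radians(lon2 - lon1)
        a = (math.sin(dp / 2) ** 2
             + math.cos(math.radians(lat1)) * math.cos(math.radians(lat2))
             * math.sin(dl / 2) ** 2)
        return 2 * r * math.asin(math.sqrt(a))

    def in_geofence(record, center, radius_km):
        # "lat"/"lon" keys are hypothetical; adapt to the actual schema.
        return haversine_km(record["lat"], record["lon"], *center) <= radius_km

    print(in_geofence({"lat": 52.52, "lon": 13.405}, (52.5, 13.4), 5.0))  # True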

  3. Data Sheet 2_Large language models generating synthetic clinical datasets: a...

    • frontiersin.figshare.com
    xlsx
    Updated Feb 5, 2025
    Cite
    Austin A. Barr; Joshua Quan; Eddie Guo; Emre Sezgin (2025). Data Sheet 2_Large language models generating synthetic clinical datasets: a feasibility and comparative analysis with real-world perioperative data.xlsx [Dataset]. http://doi.org/10.3389/frai.2025.1533508.s002
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Feb 5, 2025
    Dataset provided by
    Frontiers
    Authors
    Austin A. Barr; Joshua Quan; Eddie Guo; Emre Sezgin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: Clinical data is instrumental to medical research, machine learning (ML) model development, and advancing surgical care, but access is often constrained by privacy regulations and missing data. Synthetic data offers a promising solution to preserve privacy while enabling broader data access. Recent advances in large language models (LLMs) provide an opportunity to generate synthetic data with reduced reliance on domain expertise, computational resources, and pre-training.

    Objective: This study aims to assess the feasibility of generating realistic tabular clinical data with OpenAI’s GPT-4o using zero-shot prompting, and to evaluate the fidelity of LLM-generated data by comparing its statistical properties to the Vital Signs DataBase (VitalDB), a real-world open-source perioperative dataset.

    Methods: In Phase 1, GPT-4o was prompted to generate a dataset with qualitative descriptions of 13 clinical parameters. The resultant data was assessed for general errors, plausibility of outputs, and cross-verification of related parameters. In Phase 2, GPT-4o was prompted to generate a dataset using descriptive statistics of the VitalDB dataset. Fidelity was assessed using two-sample t-tests, two-sample proportion tests, and 95% confidence interval (CI) overlap.

    Results: In Phase 1, GPT-4o generated a complete and structured dataset comprising 6,166 case files. The dataset was plausible in range and correctly calculated body mass index for all case files based on the respective heights and weights. Statistical comparison between the LLM-generated datasets and VitalDB revealed that Phase 2 data achieved significant fidelity. Phase 2 data demonstrated statistical similarity in 12/13 (92.31%) parameters: no statistically significant differences were observed in 6/6 (100.0%) categorical/binary and 6/7 (85.71%) continuous parameters. Overlap of 95% CIs was observed in 6/7 (85.71%) continuous parameters.

    Conclusion: Zero-shot prompting with GPT-4o can generate realistic tabular synthetic datasets that replicate key statistical properties of real-world perioperative data. This study highlights the potential of LLMs as a novel and accessible modality for synthetic data generation, which may address critical barriers in clinical data access and eliminate the need for technical expertise, extensive computational resources, and pre-training. Further research is warranted to enhance fidelity and to investigate the use of LLMs to amplify and augment datasets, preserve multivariate relationships, and train robust ML models.
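    The fidelity checks named in the Methods map directly onto standard SciPy/statsmodels calls. A minimal sketch on placeholder arrays (the values are illustrative, not the study's variables):

    # Requires: pip install numpy scipy statsmodels
    import numpy as np
    from scipy import stats
    from statsmodels.stats.proportion import proportions_ztest

    rng = np.random.default_rng(0)
    real = rng.normal(58, 12, 500)   # placeholder for a VitalDB parameter
    synth = rng.normal(59, 12, 500)  # placeholder for the LLM-generated parameter

    # Two-sample t-test for a continuous parameter.
    t_stat, p_cont = stats.ttest_ind(real, synth)

    # Two-sample proportion test for a binary parameter.
    z_stat, p_bin = proportions_ztest(count=np.array([260, 250]), nobs=np.array([500, 500]))

    # 95% confidence-interval overlap for the continuous parameter.
    def ci95(x):
        half = stats.sem(x) * stats.t.ppf(0.975, len(x) - 1)
        return x.mean() - half, x.mean() + half

    (a_lo, a_hi), (b_lo, b_hi) = ci95(real), ci95(synth)
    print(p_cont, p_bin, a_lo <= b_hi and b_lo <= a_hi)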

  4. llm-human-large-data

    • kaggle.com
    zip
    Updated Oct 15, 2024
    Cite
    Hemanthh Velliyangirie (2024). llm-human-large-data [Dataset]. https://www.kaggle.com/datasets/hemanthhvv/llm-human-large-data/suggestions
    Explore at:
    Available download formats: zip (153,243,329 bytes)
    Dataset updated
    Oct 15, 2024
    Authors
    Hemanthh Velliyangirie
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This dataset was created by Hemanthh Velliyangirie

    Released under Apache 2.0

  5. Coresignal | Private Company Data | Company Data | AI-Enriched Datasets |...

    • datarade.ai
    .json, .csv
    Cite
    Coresignal, Coresignal | Private Company Data | Company Data | AI-Enriched Datasets | Global / 35M+ Records / Updated Weekly [Dataset]. https://datarade.ai/data-products/coresignal-private-company-data-company-data-ai-enriche-coresignal
    Explore at:
    Available download formats: .json, .csv
    Dataset authored and provided by
    Coresignal
    Area covered
    Argentina, Jamaica, Kyrgyzstan, Benin, Senegal, Grenada, Kiribati, Pitcairn, Togo, Bhutan
    Description

    This Private Company Data dataset is a refined version of our company datasets, consisting of 35M+ data records.

    It’s an excellent data solution for companies with limited data engineering capabilities and those who want to reduce their time to value. You get filtered, cleaned, unified, and standardized B2B private company data. This data is also enriched by leveraging a carefully instructed large language model (LLM).

    AI-powered data enrichment offers more accurate information in key data fields, such as company descriptions. It also produces over 20 additional data points that are very valuable to B2B businesses. Enhancing and highlighting the most important information in web data contributes to quicker time to value, making data processing much faster and easier.

    For your convenience, you can choose from multiple data formats (Parquet, JSON, JSONL, or CSV) and select suitable delivery frequency (quarterly, monthly, or weekly).
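    Each of those delivery formats maps to a one-line pandas reader. A minimal sketch with hypothetical file names (not actual Coresignal delivery names):

    import pandas as pd

    df_csv = pd.read_csv("companies.csv")
    df_json = pd.read_json("companies.json")
    df_jsonl = pd.read_json("companies.jsonl", lines=True)  # JSON Lines
    df_parquet = pd.read_parquet("companies.parquet")       # needs pyarrow or fastparquet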

    Coresignal is a leading private company data provider in the web data sphere with an extensive focus on firmographic data and public employee profiles. More than 3B data records in different categories enable companies to build data-driven products and generate actionable insights. Coresignal is exceptional in terms of data freshness, with 890M+ records updated monthly for unprecedented accuracy and relevance.

  6. The Information’s AI Data Center Database

    • theinformation.com
    • notlon.app
    csv
    Updated Sep 3, 2024
    Cite
    The Information (2024). The Information’s AI Data Center Database [Dataset]. https://www.theinformation.com/projects/ai-data-center-database
    Explore at:
    Available download formats: csv
    Dataset updated
    Sep 3, 2024
    Dataset authored and provided by
    The Information
    Area covered
    Worldwide
    Dataset funded by
    The Information
    Description

    Top artificial intelligence firms are racing to build the biggest and most powerful Nvidia server chip clusters to win in AI. Below, we mapped the biggest completed and planned server clusters. Check back often, as we'll update the list when we confirm more data.

  7. llm-sgd-dst8-split-training-data

    • huggingface.co
    Updated Jul 24, 2023
    Cite
    llm-sgd-dst8-split-training-data [Dataset]. https://huggingface.co/datasets/amay01/llm-sgd-dst8-split-training-data
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 24, 2023
    Authors
    Ammer Ayach
    Description

    Dataset Card for "llm-sgd-dst8-split-training-data"

    More Information needed

  8. llm-science-wikipedia-data-b

    • kaggle.com
    Updated Oct 11, 2023
    Cite
    kami (2023). llm-science-wikipedia-data-b [Dataset]. https://www.kaggle.com/datasets/kami634/llm-science-wikipedia-data-b
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 11, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    kami
    License

    CC0 1.0 Universal (Public Domain): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset was created by kami

    Released under CC0: Public Domain

  9. augmented-data-for-llm-detect-ai-generated-text

    • kaggle.com
    zip
    Updated Jan 22, 2024
    Cite
    JISU KIM8873 (2024). augmented-data-for-llm-detect-ai-generated-text [Dataset]. https://www.kaggle.com/datasets/jisukim8873/augmented-data-for-llm-detect-ai-generated-text
    Explore at:
    Available download formats: zip (328,850,388 bytes)
    Dataset updated
    Jan 22, 2024
    Authors
    JISU KIM8873
    Description

    This dataset was created by JISU KIM8873

  10. Replication Data for: Synthetic Replacements for Human Survey Data? The...

    • dataone.org
    • dataverse.harvard.edu
    Updated Sep 25, 2024
    Cite
    Bisbee, James (2024). Replication Data for: Synthetic Replacements for Human Survey Data? The Perils of Large Language Models [Dataset]. http://doi.org/10.7910/DVN/VPN481
    Explore at:
    Dataset updated
    Sep 25, 2024
    Dataset provided by
    Harvard Dataverse
    Authors
    Bisbee, James
    Description

    Large Language Models (LLMs) offer new research possibilities for social scientists, but their potential as “synthetic data" is still largely unknown. In this paper, we investigate how accurately the popular LLM ChatGPT can recover public opinion, prompting the LLM to adopt different “personas” and then provide feeling thermometer scores for 11 sociopolitical groups. The average scores generated by ChatGPT correspond closely to the averages in our baseline survey, the 2016–2020 American National Election Study. Nevertheless, sampling by ChatGPT is not reliable for statistical inference: there is less variation in responses than in the real surveys, and regression coefficients often differ significantly from equivalent estimates obtained using ANES data. We also document how the distribution of synthetic responses varies with minor changes in prompt wording, and we show how the same prompt yields significantly different results over a three-month period. Altogether, our findings raise serious concerns about the quality, reliability, and reproducibility of synthetic survey data generated by LLMs.
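    The reduced variation the authors report is straightforward to test for; a minimal sketch using Levene's test for equal variances on placeholder score arrays (not the ANES or ChatGPT data):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    survey_scores = rng.normal(60, 25, 1000).clip(0, 100)     # placeholder survey-style scores
    synthetic_scores = rng.normal(60, 10, 1000).clip(0, 100)  # placeholder LLM scores

    # Similar means can mask very different spreads.
    print(survey_scores.mean(), synthetic_scores.mean())
    print(survey_scores.std(), synthetic_scores.std())

    # Levene's test: a small p-value indicates unequal variances.
    w_stat, p_val = stats.levene(survey_scores, synthetic_scores)
    print(p_val)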

  11. LLM-QE-DPO-Training-Data

    • huggingface.co
    Cite
    chengpingan, LLM-QE-DPO-Training-Data [Dataset]. https://huggingface.co/datasets/chengpingan/LLM-QE-DPO-Training-Data
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Authors
    chengpingan
    Description

    chengpingan/LLM-QE-DPO-Training-Data dataset hosted on Hugging Face and contributed by the HF Datasets community

  12. vpfrc_llm_vulnerability_classifier Dataset

    • paperswithcode.com
    Updated Dec 15, 2024
    Cite
    Sam Relins; Daniel Birks; Charlie Lloyd (2024). vpfrc_llm_vulnerability_classifier Dataset [Dataset]. https://paperswithcode.com/dataset/vpfrc-llm-vulnerability-classifier
    Explore at:
    Dataset updated
    Dec 15, 2024
    Authors
    Sam Relins; Daniel Birks; Charlie Lloyd
    Description

    LLM-Based Vulnerability Classification in Police Narratives

    This repository contains datasets used in our research on applying large language models (LLMs) to identify indicators of vulnerability in police incident narratives. These resources support the replication of findings in our paper: "Using Instruction-Tuned Large Language Models to Identify Indicators of Vulnerability in Police Incident Narratives."

    Project Overview

    Law enforcement frequently encounters vulnerable individuals, but identifying vulnerability factors in police records remains challenging. Our research explores how LLMs can assist in identifying four key vulnerability indicators in police Field Interrogation and Observation (FIO) narratives:

    Mental health issues
    Drug abuse
    Alcoholism
    Homelessness

    This project advances police research methodology by:

    1. Evaluating LLM performance in vulnerability classification against human labelers
    2. Comparing different LLM architectures and prompt engineering approaches
    3. Investigating potential demographic biases through counterfactual analysis
    4. Developing a reusable framework for qualitative text analysis

    Datasets

    This repository includes four key datasets:

    boston_narratives_test_classified_4000.csv: 4,000 narratives classified with our LLM pipeline, including all labels and model explanations
    counterfactual_narratives_all_coded.csv: Systematically generated counterfactual narratives with varied demographic characteristics
    examples_for_counterfactuals.csv: 100 base narratives used for counterfactual generation
    labelled_fio_data_for_analysis.csv: 500 pre-processed examples with human and GPT-4o labels
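    For the last file, a minimal sketch of a raw human-vs-model agreement check with pandas; the column names are assumptions, so consult the repository for the actual schema:

    import pandas as pd

    df = pd.read_csv("labelled_fio_data_for_analysis.csv")

    # "human_label" and "gpt4o_label" are assumed column names.
    agreement = (df["human_label"] == df["gpt4o_label"]).mean()
    print(f"Raw human/GPT-4o agreement: {agreement:.1%}")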

    Code Repository

    The complete codebase for replicating our research is available in our GitHub repository: llm-deductive-coding (particularly in the boston_fio_paper directory).

    The repository includes:

    - Data preprocessing scripts
    - Classification pipeline implementation
    - Counterfactual generation code
    - Analysis notebooks
    - Visualization tools

    Citation

    If you use these resources in your research, please cite our paper:

    @article{author2023llm,
      title={Using Instruction-Tuned Large Language Models to Identify Indicators of Vulnerability in Police Incident Narratives},
      author={Relins, S. and Birks, D. and Lloyd, C.},
      journal={arXiv preprint},
      year={2023},
      note={Currently under review for the Journal of Quantitative Criminology}
    }

    License

    These datasets are released under the MIT License. The original Boston FIO data is released under the Open Data Commons Public Domain Dedication and License (PDDL).

    Contact

    For questions about this research or datasets, please contact the authors or open an issue in our GitHub repository.

  13. Data from 199 residents of the United States comparing reported trust in...

    • data.csiro.au
    Updated Jan 22, 2025
    Cite
    Melanie McGrath; Patrick Cooper; Andreas Duenser (2025). Data from 199 residents of the United States comparing reported trust in output generated by a large language model and output provided via a different form of artificial intelligence (January 2024) [Dataset]. http://doi.org/10.25919/9g2g-ws49
    Explore at:
    Dataset updated
    Jan 22, 2025
    Dataset provided by
    CSIRO (http://www.csiro.au/)
    Authors
    Melanie McGrath; Patrick Cooper; Andreas Duenser
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 31, 2024
    Dataset funded by
    CSIRO (http://www.csiro.au/)
    Description

    This data was collected in an experiment aiming to establish whether trust in large language models (LLMs) may be inflated relative to other forms of artificial intelligence, with a particular focus on the content and forms of natural language used. One hundred and ninety-nine residents of the United States were recruited online and presented with a series of general knowledge questions. For each question they also received a recommendation from either an LLM or a non-LLM AI assistant. The accuracy of this recommendation was also varied. All data is deidentified and there is no missing data. This deidentified data may be used by researchers for the purposes of verifying published results or advancing other research on this topic. Lineage: Data was collected on the Qualtrics survey platform from participants sourced via the online recruitment platform Prolific.

  14. DSEval-Kaggle Dataset

    • paperswithcode.com
    Updated Feb 26, 2024
    Cite
    Yuge Zhang; Qiyang Jiang; Xingyu Han; Nan Chen; Yuqing Yang; Kan Ren (2024). DSEval-Kaggle Dataset [Dataset]. https://paperswithcode.com/dataset/dseval
    Explore at:
    Dataset updated
    Feb 26, 2024
    Authors
    Yuge Zhang; Qiyang Jiang; Xingyu Han; Nan Chen; Yuqing Yang; Kan Ren
    Description

    In this paper, we introduce a novel benchmarking framework designed specifically for evaluations of data science agents. Our contributions are three-fold. First, we propose DSEval, an evaluation paradigm that enlarges the evaluation scope to the full lifecycle of LLM-based data science agents. We also cover aspects including but not limited to the quality of the derived analytical solutions or machine learning models, as well as potential side effects such as unintentional changes to the original data. Second, we incorporate a novel bootstrapped annotation process that lets LLMs themselves generate and annotate the benchmarks with a "human in the loop". A novel language (i.e., DSEAL) has been proposed, and the four derived benchmarks have significantly improved benchmark scalability and coverage with largely reduced human labor. Third, based on DSEval and the four benchmarks, we conduct a comprehensive evaluation of various data science agents from different aspects. Our findings reveal the common challenges and limitations of current works, providing useful insights and shedding light on future research on LLM-based data science agents.

    This is one of the DSEval benchmarks.

  15. singled-llm-evaluator-data

    • huggingface.co
    Updated Mar 26, 2025
    Cite
    Irsh Vijay (2025). singled-llm-evaluator-data [Dataset]. https://huggingface.co/datasets/1rsh/singled-llm-evaluator-data
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 26, 2025
    Authors
    Irsh Vijay
    Description

    1rsh/singled-llm-evaluator-data dataset hosted on Hugging Face and contributed by the HF Datasets community

  16. Curlie Enhanced with LLM Annotations: Two Datasets for Advancing...

    • data.niaid.nih.gov
    Updated Dec 21, 2023
    Cite
    Curlie Enhanced with LLM Annotations: Two Datasets for Advancing Homepage2Vec's Multilingual Website Classification [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10413067
    Explore at:
    Dataset updated
    Dec 21, 2023
    Dataset provided by
    Cizinsky, Ludek
    Nutter, Peter
    Senghaas, Mika
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Advancing Homepage2Vec with LLM-Generated Datasets for Multilingual Website Classification

    This dataset contains two subsets of labeled website data, specifically created to enhance the performance of Homepage2Vec, a multi-label model for website classification. The datasets were generated using Large Language Models (LLMs) to provide more accurate and diverse topic annotations for websites, addressing a limitation of existing Homepage2Vec training data.

    Key Features:

    LLM-generated annotations: Both datasets feature website topic labels generated using LLMs, a novel approach to creating high-quality training data for website classification models.

    Improved multi-label classification: Fine-tuning Homepage2Vec with these datasets has been shown to improve its macro F1 score from 38% to 43%, evaluated on a human-labeled dataset (see the sketch after this list), demonstrating their effectiveness in capturing a broader range of website topics.

    Multilingual applicability: The datasets facilitate classification of websites in multiple languages, reflecting the inherent multilingual nature of Homepage2Vec.
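    As referenced above, macro F1 on a multi-label task averages the per-label F1 scores so that rare topics weigh as much as common ones. A minimal sketch with scikit-learn on toy label matrices (not the Curlie data):

    # Requires: pip install numpy scikit-learn
    import numpy as np
    from sklearn.metrics import f1_score

    # Multi-label indicator matrices: rows = websites, columns = topics.
    y_true = np.array([[1, 0, 1], [0, 1, 1], [1, 0, 0]])
    y_pred = np.array([[1, 0, 1], [0, 1, 0], [1, 0, 0]])

    # Macro averaging weights every topic equally, regardless of frequency.
    print(f1_score(y_true, y_pred, average="macro"))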

    Dataset Composition:

    curlie-gpt3.5-10k: 10,000 websites labeled using GPT-3.5, context 2 and 1-shot

    curlie-gpt4-10k: 10,000 websites labeled using GPT-4, context 2 and zero-shot

    Intended Use:

    Fine-tuning and advancing Homepage2Vec or similar website classification models

    Research on LLM-generated datasets for text classification tasks

    Exploration of multilingual website classification

    Additional Information:

    Project and report repository: https://github.com/CS-433/ml-project-2-mlp

    Acknowledgments:

    This dataset was created as part of a project at EPFL's Data Science Lab (DLab) in collaboration with Prof. Robert West and Tiziano Piccardi.

  17. Bahasa Open Ended Classification Prompt & Response Dataset

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Bahasa Open Ended Classification Prompt & Response Dataset [Dataset]. https://www.futurebeeai.com/dataset/prompt-response-dataset/bahasa-open-ended-classification-text-dataset
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    What’s Included

    Welcome to the Bahasa Open Ended Classification Prompt-Response Dataset—an extensive collection of 3000 meticulously curated prompt and response pairs. This dataset is a valuable resource for training Language Models (LMs) to classify input text accurately, a crucial aspect in advancing generative AI.

    Dataset Content: This open-ended classification dataset comprises a diverse set of prompts and responses. Each prompt contains input text to be classified and may also contain a task instruction, context, constraints, and restrictions, while the completion contains the best classification category as the response. Both the prompts and completions are available in the Bahasa language. As this is an open-ended dataset, no options to choose the right classification category are given as part of the prompt.

    These prompt and completion pairs cover a broad range of topics, including science, history, technology, geography, literature, current affairs, and more. Each prompt is accompanied by a response, providing valuable information and insights to enhance the language model training process. Both the prompt and response were manually curated by native Bahasa people, and references were taken from diverse sources like books, news articles, websites, and other reliable references.

    This open-ended classification prompt and completion dataset contains different types of prompts, including instruction type, continuation type, and in-context learning (zero-shot, few-shot) type. The dataset also contains prompts and responses with different types of rich text, including tables, code, JSON, etc., with proper markdown.

    Prompt Diversity: To ensure diversity, this open-ended classification dataset includes prompts with varying complexity levels, ranging from easy to medium and hard. Additionally, prompts are diverse in terms of length, from short to medium and long, creating a comprehensive variety. The classification dataset also contains prompts with constraints and persona restrictions, which makes it even more useful for LLM training.

    Response Formats: To accommodate diverse learning experiences, our dataset incorporates different types of responses depending on the prompt. These formats include single-word, short-phrase, and single-sentence responses. These responses encompass text strings, numerical values, and date and time formats, enhancing the language model's ability to generate reliable, coherent, and contextually appropriate answers.

    Data Format and Annotation Details: This fully labeled Bahasa Open Ended Classification Prompt Completion Dataset is available in JSON and CSV formats. It includes annotation details such as a unique ID, prompt, prompt type, prompt length, prompt complexity, domain, response, response type, and rich text presence.

    Quality and Accuracy: Our dataset upholds the highest standards of quality and accuracy. Each prompt undergoes meticulous validation, and the corresponding responses are thoroughly verified. We prioritize inclusivity, ensuring that the dataset incorporates prompts and completions representing diverse perspectives and writing styles, maintaining an unbiased and discrimination-free stance.

    The Bahasa version is grammatically accurate without any spelling or grammatical errors. No copyrighted, toxic, or harmful content is used during the construction of this dataset.
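    Based on the annotation fields listed above, a single record would look roughly like the following; this is a hypothetical illustration, and the actual keys and values may differ:

    # Hypothetical record shape inferred from the listed annotation fields.
    sample_record = {
        "id": "bhs-cls-000123",                  # unique ID (made-up value)
        "prompt": "Kelaskan teks berikut: ...",  # "Classify the following text: ..."
        "prompt_type": "instruction",            # instruction / continuation / in-context
        "prompt_length": "short",                # short / medium / long
        "prompt_complexity": "easy",             # easy / medium / hard
        "domain": "science",
        "response": "Sains",                     # "Science"
        "response_type": "single-word",
        "rich_text": False,                      # tables, code, JSON, etc. present?
    }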

    Continuous Updates and Customization: The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Ongoing efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to gather custom open-ended classification prompt and completion data tailored to specific needs, providing flexibility and customization options.

    License: The dataset, created by FutureBeeAI, is now available for commercial use. Researchers, data scientists, and developers can leverage this fully labeled and ready-to-deploy Bahasa Open Ended Classification Prompt-Completion Dataset to enhance the classification abilities and accurate response generation capabilities of their generative AI models and explore new approaches to NLP tasks.

  18. Energy consumption when training LLMs in 2022 (in MWh)

    • statista.com
    Updated Sep 10, 2024
    Cite
    Statista (2024). Energy consumption when training LLMs in 2022 (in MWh) [Dataset]. https://www.statista.com/statistics/1384401/energy-use-when-training-llm-models/
    Explore at:
    Dataset updated
    Sep 10, 2024
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    2022
    Area covered
    Worldwide
    Description

    Energy consumption of artificial intelligence (AI) models in training is considerable, with both GPT-3, the original release of the current iteration of OpenAI's popular ChatGPT, and Gopher consuming well over a thousand megawatt-hours of energy for training alone. Since this covers only training, the energy consumption over the entire usage and lifetime of GPT-3 and other large language models (LLMs) is likely significantly higher. The largest consumer of energy, GPT-3, consumed roughly the equivalent of the annual energy consumption of 200 Germans in 2022. While not a staggering amount, it is a considerable use of energy.
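    As a rough sanity check on the "200 Germans" comparison: taking the commonly cited estimate of about 1,287 MWh for GPT-3's training run and roughly 6.4 MWh of electricity per German resident per year (both figures are external, indicative assumptions, not values from this dataset):

    gpt3_training_mwh = 1287   # commonly cited training estimate (assumption)
    per_capita_mwh_de = 6.4    # rough German per-capita electricity use per year (assumption)

    print(gpt3_training_mwh / per_capita_mwh_de)  # ~201 person-years of electricity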

    Energy savings through AI

    While it is undoubtedly true that training LLMs takes a considerable amount of energy, the energy savings are also likely to be substantial. Any AI model that improves processes by even minute margins might save hours on shipments, liters of fuel, or dozens of computations. Each of these uses energy as well, and the sum of energy saved through an LLM might vastly outweigh its energy cost. A good example is mobile phone operators, a third of whom expect that AI might reduce power consumption by ten to fifteen percent. Considering that much of the world uses mobile phones, this would be a considerable energy saver.

    Emissions are considerable

    The amount of CO2 emitted in training LLMs is also considerable, with GPT-3 producing nearly 500 tonnes of CO2. This figure could change radically depending on the types of energy production behind the emissions. Most data center operators, for instance, would prefer nuclear energy, a significantly low-emission energy source, to play a key role.

  19. Data Sheet 1_A comparison of the diagnostic ability of large language models...

    • frontiersin.figshare.com
    application/csv
    Updated Aug 5, 2024
    Cite
    Maria Palwasha Khan; Eoin Daniel O’Sullivan (2024). Data Sheet 1_A comparison of the diagnostic ability of large language models in challenging clinical cases.csv [Dataset]. http://doi.org/10.3389/frai.2024.1379297.s001
    Explore at:
    Available download formats: application/csv
    Dataset updated
    Aug 5, 2024
    Dataset provided by
    Frontiers
    Authors
    Maria Palwasha Khan; Eoin Daniel O’Sullivan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction: The rise of accessible, consumer-facing large language models (LLMs) provides an opportunity for immediate diagnostic support for clinicians.

    Objectives: To compare the performance characteristics of common LLMs in solving complex clinical cases, and to assess the utility of a novel tool for grading LLM output.

    Methods: Using a newly developed rubric to assess the models' diagnostic utility, we measured the models' ability to answer cases according to accuracy, readability, clinical interpretability, and an assessment of safety. Here we present a comparative analysis of three LLM models, Bing, ChatGPT, and Gemini, across a diverse set of clinical cases as presented in the New England Journal of Medicine's case series.

    Results: Our results suggest that models performed differently when presented with identical clinical information, with Gemini performing best. Our grading tool had low interobserver variability and proved a reliable tool to grade LLM clinical output.

    Conclusion: This research underscores the variation in model performance in clinical scenarios and highlights the importance of considering diagnostic model performance in diverse clinical scenarios prior to deployment. Furthermore, we provide a new tool to assess LLM output.
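    Interobserver variability for a rubric like this is commonly quantified with a chance-corrected agreement statistic. A minimal sketch using Cohen's kappa for two raters (the grade lists are placeholders, not the study's data):

    # Requires: pip install scikit-learn
    from sklearn.metrics import cohen_kappa_score

    # Placeholder rubric grades from two raters over ten LLM outputs.
    rater_a = [3, 4, 2, 5, 4, 3, 2, 5, 4, 3]
    rater_b = [3, 4, 2, 4, 4, 3, 2, 5, 4, 3]

    # Kappa corrects raw agreement for chance; values near 1
    # indicate low interobserver variability.
    print(cohen_kappa_score(rater_a, rater_b))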

  20. Trojan Detection Software Challenge - mitigation-llm-instruct-oct2024-train

    • catalog.data.gov
    • s.cnmilf.com
    Updated Mar 14, 2025
    Cite
    National Institute of Standards and Technology (2025). Trojan Detection Software Challenge - mitigation-llm-instruct-oct2024-train [Dataset]. https://catalog.data.gov/dataset/trojan-detection-software-challenge-mitigation-llm-instruct-oct2024-train
    Explore at:
    Dataset updated
    Mar 14, 2025
    Dataset provided by
    National Institute of Standards and Technology (http://www.nist.gov/)
    Description

    This is the training data used to create and evaluate trojan detection software solutions. This data, generated at NIST, consists of instruction fine-tuned LLMs. A known percentage of these trained AI models have been poisoned with a known trigger that induces incorrect behavior. This data will be used to develop software solutions for mitigating that trigger behavior in the trained AI models.
