100+ datasets found
  1. Energy consumption when training LLMs in 2022 (in MWh)

    • statista.com
    Cite
    Statista, Energy consumption when training LLMs in 2022 (in MWh) [Dataset]. https://www.statista.com/statistics/1384401/energy-use-when-training-llm-models/
    Explore at:
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    2022
    Area covered
    Worldwide
    Description

    Energy consumption of artificial intelligence (AI) models in training is considerable, with both GPT-3, the original release of the current iteration of OpenAI's popular ChatGPT, and Gopher consuming well over ********** megawatt-hours of energy for training alone. Since this covers only training, the energy consumption over the entire usage and lifetime of GPT-3 and other large language models (LLMs) is likely significantly higher. The largest consumer of energy, GPT-3, consumed roughly the equivalent of the energy used by *** Germans in 2022. While not a staggering amount, it is a considerable use of energy.

    Energy savings through AI

    While it is undoubtedly true that training LLMs takes a considerable amount of energy, the energy savings are also likely to be substantial. Any AI model that improves processes even by small margins might save hours on shipments, liters of fuel, or dozens of computations. Each of these consumes energy as well, and the sum of the energy saved through an LLM might vastly outweigh its energy cost. A good example is mobile phone operators, of which a ***** expect that AI might reduce power consumption by *** to ******* percent. Considering that much of the world uses mobile phones, this would be a considerable energy saver.

    Emissions are considerable

    The amount of CO2 emissions from training LLMs is also considerable, with GPT-3 producing nearly *** tonnes of CO2. This could change radically depending on the type of energy production behind the emissions. Most data center operators, for instance, would prefer nuclear energy, a notably low-emission energy source, to play a key role.

  2. 📊 6.5k train examples for LLM Science Exam 📝

    • kaggle.com
    Updated Jul 22, 2023
    Cite
    Radek Osmulski (2023). 📊 6.5k train examples for LLM Science Exam 📝 [Dataset]. https://www.kaggle.com/datasets/radek1/additional-train-data-for-llm-science-exam
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 22, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Radek Osmulski
    Description

    I created this dataset using gpt-3.5-turbo.

    I put a lot of effort into making this dataset high quality, which allows you to achieve the highest score among the notebooks publicly available at the moment! 🥳

    Originally, I only uploaded 500 examples (they were used as train data in the notebook I mention above). They are stored in extra_train_set.csv.

    I am now uploading another 6k completely new train examples (6000_train_examples.csv), which brings the total to 6.5k.

    If you find this dataset useful, please leave an upvote! 😊 Thank you! 🙏🙏🙏

  3. Data Lineage For LLM Training Market Research Report 2033

    • dataintelo.com
    csv, pdf, pptx
    Updated Sep 30, 2025
    Cite
    Dataintelo (2025). Data Lineage For LLM Training Market Research Report 2033 [Dataset]. https://dataintelo.com/report/data-lineage-for-llm-training-market
    Explore at:
    Available download formats: pdf, pptx, csv
    Dataset updated
    Sep 30, 2025
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Data Lineage for LLM Training Market Outlook

    According to our latest research, the global Data Lineage for LLM Training market size reached USD 1.29 billion in 2024, with an impressive compound annual growth rate (CAGR) of 21.8% expected through the forecast period. By 2033, the market is projected to grow to USD 8.93 billion, as organizations worldwide recognize the critical importance of robust data lineage solutions in ensuring transparency, compliance, and efficiency in large language model (LLM) training. The primary growth driver is the surging adoption of generative AI and LLMs across diverse industries, which necessitates advanced data lineage capabilities for responsible and auditable AI development.

    The exponential growth of the Data Lineage for LLM Training market is fundamentally driven by the increasing complexity and scale of data used in training modern AI models. As organizations deploy LLMs for a wide array of applications, from customer service automation to advanced analytics, the need for precise tracking of data provenance, transformation, and usage has become paramount. This trend is further amplified by the proliferation of multi-source and multi-format data, which significantly complicates the process of tracing data origins and transformations. Enterprises are investing heavily in data lineage solutions to ensure that their AI models are trained on high-quality, compliant, and auditable datasets, thereby reducing the risks of data bias, inconsistency, and regulatory violations.

    Another significant growth factor is the evolving regulatory landscape surrounding AI and data governance. Governments and regulatory bodies worldwide are introducing stringent guidelines for data usage, privacy, and accountability in AI systems. Regulations such as the European Union's AI Act and the U.S. AI Bill of Rights are compelling organizations to implement comprehensive data lineage practices to demonstrate compliance and mitigate legal risks. This regulatory pressure is particularly pronounced in highly regulated industries such as banking, healthcare, and government, where the consequences of non-compliance can be financially and reputationally devastating. As a result, demand for advanced data lineage software and services is surging, driving market expansion.

    Technological advancements in data management platforms and the integration of AI-driven automation are further catalyzing the growth of the Data Lineage for LLM Training market. Modern data lineage tools leverage machine learning and natural language processing to automatically map data flows, detect anomalies, and generate real-time lineage reports. These innovations drastically reduce the manual effort required for lineage documentation and enhance the scalability of lineage solutions across large and complex data environments. The continuous evolution of such technologies is enabling organizations to achieve higher levels of transparency, trust, and operational efficiency in their AI workflows, thereby fueling market growth.

    Regionally, North America dominates the Data Lineage for LLM Training market, accounting for over 42% of global market share in 2024. This dominance is attributed to the early adoption of AI technologies, the presence of leading technology vendors, and a mature regulatory environment. Europe follows closely, driven by strict data governance regulations and a rapidly growing AI ecosystem. The Asia Pacific region is witnessing the fastest growth, with a projected CAGR of 24.6% through 2033, fueled by digital transformation initiatives, increased AI investment, and a burgeoning startup landscape. Latin America and the Middle East & Africa are also emerging as promising markets, albeit at a relatively nascent stage.

    Component Analysis

    The Data Lineage for LLM Training market is segmented by component into software and services, each playing a pivotal role in supporting organizations' lineage initiatives. The software segment holds the largest market share, accounting for nearly 68% of total market revenue in 2024. This dominance is primarily due to the widespread adoption of advanced data lineage platforms that offer features such as automated lineage mapping, visualization, impact analysis, and integration with existing data management and AI training workflows. These platforms are essential for organizations…

  4. LLMs Data (2018-2024)

    • kaggle.com
    zip
    Updated May 19, 2024
    Cite
    jaina (2024). LLMs Data (2018-2024) [Dataset]. https://www.kaggle.com/datasets/jainaru/llms-data-2018-2024
    Explore at:
    Available download formats: zip (23351 bytes)
    Dataset updated
    May 19, 2024
    Authors
    jaina
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Every major LLM and chatbot released since 2018, along with the developing company and the number of parameters (in billions) used to train it.

    Data Columns

    1. Model: The name of the language model.
    2. Company: The company that developed the model.
    3. Arch: The architecture of the model (e.g., Transformer, RNN). TBA means To Be Announced.
    4. Parameters: The number of parameters (weights) in the model, in billions; a measure of its complexity.
    5. Tokens: The number of tokens (sub-word units) the model can process or was trained on, in billions. Some values are TBA.
    6. Ratio: Likely the ratio of parameters to tokens, or some other relevant ratio; in this table it is specified only for Olympus, as 20:1.
    7. ALScore: A quick-and-dirty rating of the model's power: the square root of (Parameters × Tokens). See the sketch after this list.
    8. Training dataset: The dataset used to train the model.
    9. Release Date: The expected or actual release date of the model.
    10. Notes: Additional notes about the model, such as training details or related information.
    11. Playground: A URL link to a website where you can interact with the model or find more information about it.
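
    As a quick illustration of that formula, here is a minimal Python sketch of ALScore; the example values are invented, not rows from this dataset:

      import math

      def alscore(parameters_b: float, tokens_b: float) -> float:
          # Quick-and-dirty power rating: sqrt(parameters x tokens), both in billions.
          return math.sqrt(parameters_b * tokens_b)

      # e.g., a hypothetical 70B-parameter model trained on 2,000B tokens:
      print(round(alscore(70, 2000), 1))  # 374.2
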
  5. character-llm-data

    • huggingface.co
    Updated Jun 8, 2024
    Cite
    OpenMOSS (2024). character-llm-data [Dataset]. https://huggingface.co/datasets/OpenMOSS-Team/character-llm-data
    Explore at:
    Dataset updated
    Jun 8, 2024
    Dataset authored and provided by
    OpenMOSS
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Character-LLM: A Trainable Agent for Role-Playing

    This is the training dataset for Character-LLM, containing the experience data for nine characters used to train Character-LLMs. To download the dataset, run the following Python code; the downloaded data will be in /path/to/local_dir (the call below completes the snippet truncated on this page):

      from huggingface_hub import snapshot_download

      snapshot_download(
          local_dir="/path/to/local_dir",
          local_dir_use_symlinks=True,
          repo_type="dataset",
          repo_id="fnlp/character-llm-data",
      )

    See the full description on the dataset page: https://huggingface.co/datasets/OpenMOSS-Team/character-llm-data.

  6. High-quality Image & Video Data | 250 Million | LLM Data | Multimodal Large Model Data | AI & ML Training Data

    • data.nexdata.ai
    Updated Aug 28, 2025
    Cite
    Nexdata (2025). High-quality Image & Video Data | 250 Million | LLM Data | Multimodal Large Model Data |AI & ML Training Data [Dataset]. https://data.nexdata.ai/products/high-quality-image-video-data-250-million-llm-data-mu-nexdata
    Explore at:
    Dataset updated
    Aug 28, 2025
    Dataset authored and provided by
    Nexdata
    Area covered
    Türkiye, Czechia, Indonesia, Netherlands, Kuwait, Estonia, Denmark, Belarus, Mexico, Albania
    Description

    Nexdata owns 150,000 hours of TV video data, 20 million high-quality videos, and 250 million high-quality images. These datasets can be used for tasks like multimodal large-model training.

  7. Trojan Detection Software Challenge - llm-instruct-oct2024-train

    • catalog.data.gov
    • nist.gov
    Updated Mar 14, 2025
    Cite
    National Institute of Standards and Technology (2025). Trojan Detection Software Challenge - llm-instruct-oct2024-train [Dataset]. https://catalog.data.gov/dataset/trojan-detection-software-challenge-llm-instruct-oct2024-train
    Explore at:
    Dataset updated
    Mar 14, 2025
    Dataset provided by
    National Institute of Standards and Technology (http://www.nist.gov/)
    Description

    This is the training data used to create and evaluate trojan detection software solutions. This data, generated at NIST, consists of instruction fine-tuned LLMs. A known percentage of these trained AI models have been poisoned with a known trigger that induces incorrect behavior. This data will be used to develop software solutions for detecting that trigger behavior in the trained AI models.

  8. LLM Science Exam Training Data Wiki Pages

    • kaggle.com
    zip
    Updated Jul 18, 2023
    Cite
    Jude Hunt (2023). LLM Science Exam Training Data Wiki Pages [Dataset]. https://www.kaggle.com/datasets/judehunt23/llm-science-exam-training-data-wiki-pages
    Explore at:
    Available download formats: zip (2843758 bytes)
    Dataset updated
    Jul 18, 2023
    Authors
    Jude Hunt
    Description

    Text extracts for each section of the Wikipedia pages used to generate the training dataset in the LLM Science Exam competition, plus extracts from the Wikipedia category "Concepts in Physics".

    Each page is broken down by section titles and should also include a "Summary" section.

  9. Top web domains cited by LLMs 2025

    • statista.com
    Updated Jun 29, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Statista (2025). Top web domains cited by LLMs 2025 [Dataset]. https://www.statista.com/statistics/1620335/top-web-domains-cited-by-llms/
    Explore at:
    Dataset updated
    Jun 29, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Jun 2025
    Area covered
    Worldwide
    Description

    A June 2025 study found that ****** was the most frequently cited web domain by large language models (LLMs). The platform was referenced in approximately ** percent of the analyzed cases, likely due to the content licensing agreement struck between Google and Reddit in early 2024 for the purpose of training AI models. ********* ranked second, mentioned in roughly ** percent of cases, while ****** and ******* were each mentioned in ** percent.

  10. LLM - Detect AI Datamix

    • kaggle.com
    zip
    Updated Jan 19, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Raja Biswas (2024). LLM - Detect AI Datamix [Dataset]. https://www.kaggle.com/datasets/conjuring92/ai-mix-v26
    Explore at:
    Available download formats: zip (172818297 bytes)
    Dataset updated
    Jan 19, 2024
    Authors
    Raja Biswas
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This is the datamix created by Team 🔍 📝 🕵️‍♂️ 🤖 during the LLM - Detect AI Generated Text competition. This dataset helped us win the competition. It supports a text-classification task: separating LLM-generated essays from student-written ones.

    It was developed incrementally, focusing on size, diversity, and complexity. With each datamix iteration, we attempted to plug blind spots of the previous generation of models while maintaining robustness.

    To maximally leverage in-domain human texts, we used the entire PERSUADE corpus, comprising all 15 prompts. We also included diverse human texts from sources such as the OpenAI GPT-2 output dataset, the ELLIPSE corpus, NarrativeQA, Wikipedia, the NLTK Brown corpus, and IMDB movie reviews.

    Sources for our generated essays can be grouped under four categories:

    • Proprietary LLMs (gpt-3.5, gpt-4, claude, cohere, gemini, palm)
    • Open-source LLMs (llama, falcon, mistral, mixtral)
    • Existing LLM-generated text datasets:
      • Synthetic dataset made by T5
      • DAIGT V2 subset
      • OUTFOX
      • Ghostbuster
      • gpt-2-output-dataset

    • Fine-tuned open-source LLMs (mistral, llama, falcon, deci-lm, t5, pythia, OPT, BLOOM, GPT2). For LLM fine-tuning, we leveraged the PERSUADE corpus in different ways (see the sketch after this list):
      • Instruction tuning: instructions were composed from different metadata, e.g., prompt name, holistic essay score, ELL status, and grade level. Responses were the corresponding student essays.
      • One topic held out: LLMs were fine-tuned on PERSUADE essays with one prompt held out. When generating, only essays for the held-out prompt were generated, to encourage new writing styles.
      • Span-wise generation: generate one span (discourse element) at a time, conditioned on the rest of the essay.
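
    A hedged sketch of the instruction-tuning format described above, in Python; the metadata fields come from the list, but the exact wording and field values are invented for illustration:

      def make_example(prompt_name: str, score: int, ell_status: str, grade_level: int, essay: str) -> dict:
          # Compose an instruction from PERSUADE metadata; the student essay is the response.
          instruction = (
              f"Write an essay for the prompt '{prompt_name}'. "
              f"Holistic score: {score}. ELL status: {ell_status}. Grade level: {grade_level}."
          )
          return {"instruction": instruction, "response": essay}

      print(make_example("Driverless cars", 4, "No", 10, "Driverless cars are..."))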

    We used a wide variety of generation configs and prompting strategies to promote diversity and complexity in the data (a decoding sketch follows this list). Generated essays leveraged a combination of the following:

    • Contrastive search
    • Use of guidance scale, typical_p, suppress_tokens
    • High temperature and large values of top-k
    • Prompting to fill in the blanks: randomly mask words in an essay and ask the LLM to reconstruct the original essay (similar to MLM)
    • Prompting without source texts
    • Prompting with source texts
    • Prompting to rewrite existing essays
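
    For concreteness, a minimal sketch of two of the decoding setups named above, using Hugging Face transformers; the model (gpt2) and all parameter values are illustrative assumptions, not the team's actual settings:

      from transformers import AutoModelForCausalLM, AutoTokenizer

      tokenizer = AutoTokenizer.from_pretrained("gpt2")
      model = AutoModelForCausalLM.from_pretrained("gpt2")

      inputs = tokenizer("Write an essay about summer break.", return_tensors="pt")

      # Contrastive search: penalty_alpha > 0 combined with a small top_k.
      contrastive = model.generate(**inputs, penalty_alpha=0.6, top_k=4, max_new_tokens=100)

      # High-temperature sampling with a large top-k and typical_p, to push diversity.
      diverse = model.generate(
          **inputs,
          do_sample=True,
          temperature=1.3,
          top_k=500,
          typical_p=0.95,
          max_new_tokens=100,
      )

      print(tokenizer.decode(diverse[0], skip_special_tokens=True))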

    Finally, we incorporated augmented essays to make our models aware of typical attacks on LLM content-detection systems and of obfuscations present in the provided training data. We mainly used a combination of the following augmentations on a random subset of essays:

    • Spelling correction
    • Deletion/insertion/swapping of characters
    • Replacement with synonyms
    • Introduced obfuscations
    • Back-translation
    • Random capitalization
    • Sentence swapping

  11. Japanese Closed Ended Question Answer Text Dataset

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Japanese Closed Ended Question Answer Text Dataset [Dataset]. https://www.futurebeeai.com/dataset/prompt-response-dataset/japanese-closed-ended-question-answer-text-dataset
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    The Japanese Closed-Ended Question Answering Dataset is a meticulously curated collection of 5000 comprehensive Question-Answer pairs. It serves as a valuable resource for training Large Language Models (LLMs) and question-answering models in the Japanese language, advancing the field of artificial intelligence.

    Dataset Content

    This closed-ended QA dataset comprises a diverse set of context paragraphs and questions paired with corresponding answers in Japanese. There is a context paragraph given for each question to get the answer from. The questions cover a broad range of topics, including science, history, technology, geography, literature, current affairs, and more.

    Each question is accompanied by an answer, providing valuable information and insights to enhance the language model training process. Both the questions and answers were manually curated by native Japanese speakers, with references taken from diverse sources such as books, news articles, websites, web forums, and other reliable references.

    This question-answer prompt completion dataset contains different types of prompts, including instruction type, continuation type, and in-context learning (zero-shot, few-shot) type. The dataset also contains questions and answers with different types of rich text, including tables, code, JSON, etc., with proper markdown.

    Question Diversity

    To ensure diversity, this Q&A dataset includes questions with varying complexity levels, ranging from easy to medium and hard. Different types of questions, such as multiple-choice, direct, and true/false, are included. The QA dataset also contains questions with constraints, which makes it even more useful for LLM training.

    Answer Formats

    To accommodate varied learning experiences, the dataset incorporates different answer formats, including single-word, short-phrase, single-sentence, and paragraph-length answers. The answers contain text strings, numerical values, and date and time formats as well. Such diversity strengthens the language model's ability to generate coherent and contextually appropriate answers.

    Data Format and Annotation Details

    This fully labeled Japanese Closed-Ended Question Answer Dataset is available in JSON and CSV formats. It includes annotation details such as a unique ID, context paragraph, context reference link, question, question type, question complexity, question category, domain, prompt type, answer, answer type, and rich-text presence.
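
    For illustration only, a hypothetical record shaped after the annotation fields listed above; the exact key names and every value here are invented, so check the actual files for the real schema:

      record = {
          "id": "jp_qa_00001",                      # unique id (invented)
          "context_paragraph": "...",               # Japanese context paragraph
          "context_reference_link": "https://example.com/source",
          "question": "...",                        # Japanese question
          "question_type": "direct",
          "question_complexity": "easy",
          "question_category": "science",
          "domain": "general",
          "prompt_type": "instruction",
          "answer": "...",                          # Japanese answer
          "answer_type": "single word",
          "rich_text_presence": False,
      }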

    Quality and Accuracy

    The dataset upholds the highest standards of quality and accuracy. Each question undergoes careful validation, and the corresponding answers are thoroughly verified. To prioritize inclusivity, the dataset incorporates questions and answers representing diverse perspectives and writing styles, ensuring it remains unbiased and avoids perpetuating discrimination.

    The Japanese version is grammatically accurate, without any spelling or grammatical errors. No toxic or harmful content was used in building this dataset.

    Continuous Updates and Customization

    The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Continuous efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to collect custom question-answer data tailored to specific needs, providing flexibility and customization options.

    License:

    The dataset, created by FutureBeeAI, is now ready for commercial use. Researchers, data scientists, and developers can utilize this fully labeled and ready-to-deploy Japanese Closed-Ended Question Answer Dataset to enhance the language understanding capabilities of their generative AI models, improve response generation, and explore new approaches to NLP question-answering tasks.

  12. Augmented training data and labels, used for training the models

    • figshare.com
    bin
    Updated Mar 26, 2025
    Cite
    Michael Keane (2025). Augmented training data and labels, used for training the models [Dataset]. http://doi.org/10.6084/m9.figshare.28669001.v1
    Explore at:
    Available download formats: bin
    Dataset updated
    Mar 26, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Michael Keane
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the augmented data and labels used to train the model. It is also needed for evaluation, since the vectoriser is fit on this data and the test data is then transformed with that fitted vectoriser.
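
    A minimal sketch of that fit/transform split, assuming a scikit-learn-style vectoriser such as TfidfVectorizer (the actual vectoriser used is not specified here); the texts are placeholders:

      from sklearn.feature_extraction.text import TfidfVectorizer

      train_texts = ["first augmented example", "second augmented example"]  # stand-ins for this dataset
      test_texts = ["held-out example"]

      vectoriser = TfidfVectorizer()
      X_train = vectoriser.fit_transform(train_texts)  # fit only on the augmented training data
      X_test = vectoriser.transform(test_texts)        # reuse the fitted vocabulary on test data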

  13. Lucie-Training-Dataset

    • huggingface.co
    Cite
    OpenLLM France, Lucie-Training-Dataset [Dataset]. https://huggingface.co/datasets/OpenLLM-France/Lucie-Training-Dataset
    Explore at:
    Dataset authored and provided by
    OpenLLM France
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Lucie Training Dataset Card

    The Lucie Training Dataset is a curated collection of text data in English, French, German, Spanish, and Italian, culled from a variety of sources including web data, video subtitles, academic papers, digital books, newspapers, and magazines, some of which were processed by optical character recognition (OCR). It also contains samples of diverse programming languages. The Lucie Training Dataset was used to pretrain Lucie-7B, a foundation LLM with strong… See the full description on the dataset page: https://huggingface.co/datasets/OpenLLM-France/Lucie-Training-Dataset.

  14. Trojan Detection Software Challenge - llm-pretrain-apr2024-train

    • gimi9.com
    Updated Apr 19, 2024
    Cite
    (2024). Trojan Detection Software Challenge - llm-pretrain-apr2024-train | gimi9.com [Dataset]. https://gimi9.com/dataset/data-gov_trojan-detection-software-challenge-llm-pretrain-apr2024-train
    Explore at:
    Dataset updated
    Apr 19, 2024
    Description

    TrojAI llm-pretrain-apr2024 Train Dataset

    This is the training data used to create and evaluate trojan detection software solutions. This data, generated at NIST, consists of Llama2 large language models refined using fine-tuning and LoRA to perform next-token prediction. A known percentage of these trained AI models have been poisoned with triggers that induce modified behavior. This data will be used to develop software solutions for detecting which trained AI models have been poisoned via triggers embedded in the model weights.

  15. llm_dataset_inference

    • huggingface.co
    Updated Sep 10, 2024
    Cite
    Pratyush Maini (2024). llm_dataset_inference [Dataset]. https://huggingface.co/datasets/pratyushmaini/llm_dataset_inference
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 10, 2024
    Authors
    Pratyush Maini
    Description

    LLM Dataset Inference

    This repository contains various subsets of the PILE dataset, divided into train and validation sets. The data is used to facilitate privacy research in language models, where perturbed data can be used as a reference to detect the presence of a particular dataset in the training data of a language model.

    Data Used

    The data is in the form of JSONL files, with each entry containing the raw text as well as various kinds of perturbations applied to it… See the full description on the dataset page: https://huggingface.co/datasets/pratyushmaini/llm_dataset_inference.
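
    A minimal sketch of reading one of these JSONL files; the file name is a placeholder, and the exact per-entry keys should be taken from the repository rather than from this example:

      import json

      with open("pile_subset.jsonl") as f:  # placeholder file name
          for line in f:
              record = json.loads(line)
              print(record.keys())  # raw text plus its perturbed variants
              break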

  16. volcano-train

    • huggingface.co
    Updated Nov 15, 2023
    Cite
    KAIST AI (2023). volcano-train [Dataset]. https://huggingface.co/datasets/kaist-ai/volcano-train
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 15, 2023
    Dataset authored and provided by
    KAIST AI
    Description

    Data details

    • 274K multimodal feedback and revision data
    • 558K filtered image-text pairs from LAION/CC/SBU, captioned by BLIP
    • 158K GPT-generated multimodal instruction-following data
    • 450K academic-task-oriented VQA data mixture
    • 40K ShareGPT data

    Data collection

    Since no multimodal feedback data for training is publicly available as of this writing, and human labeling is costly, we used a proprietary LLM to generate feedback data. As shown in the figure, we use an… See the full description on the dataset page: https://huggingface.co/datasets/kaist-ai/volcano-train.

  17. querygen-data-v4

    • huggingface.co
    Cite
    Nixiesearch, querygen-data-v4 [Dataset]. https://huggingface.co/datasets/nixiesearch/querygen-data-v4
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset authored and provided by
    Nixiesearch
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Nixiesearch querygen-v4 model training dataset

    A dataset used to train the not-yet-published querygen-v4 model from Nixiesearch. The dataset is a combination of multiple open query-document datasets in a format suitable for causal LLM training.

    Used datasets

    We use train splits from the following datasets:

    • MSMARCO: 532,751 rows
    • HotpotQA: 170,000 rows
    • NQ: 58,554 rows
    • MIRACL en: 1,193 rows
    • SQUAD: 85,710 rows
    • TriviaQA: 60,283 rows

    The train split is 900,000 rows and the test split is 8,491… See the full description on the dataset page: https://huggingface.co/datasets/nixiesearch/querygen-data-v4.

  18. Customer Service Call Dataset [Multisector] – Annotated support transcripts for training AI and improving CX

    • datarade.ai
    Updated Apr 11, 2025
    Cite
    WiserBrand.com (2025). Customer Service Call Dataset [Multisector] – Annotated support transcripts for training AI and improving CX [Dataset]. https://datarade.ai/data-products/customer-service-call-dataset-multisector-annotated-suppo-wiserbrand-com
    Explore at:
    Available download formats: .json, .csv, .xls, .txt
    Dataset updated
    Apr 11, 2025
    Dataset provided by
    WiserBrand
    Area covered
    United States of America
    Description

    "This dataset contains transcribed customer support calls from companies in over 160 industries, offering a high-quality foundation for developing customer-aware AI systems and improving service operations. It captures how real people express concerns, frustrations, and requests — and how support teams respond.

    Included in each record:

    • Full call transcription with labeled speakers (system, agent, customer)
    • Concise human-written summary of the conversation
    • Sentiment tag for the overall interaction: positive, neutral, or negative
    • Company name, duration, and geographic location of the caller
    • Call context includes industries such as eCommerce, banking, telecom, and streaming services

    Common use cases:

    • Train NLP models to understand support calls and detect churn risk
    • Power complaint detection engines for customer success and support teams
    • Create high-quality LLM training sets with real support narratives
    • Build summarization and topic tagging pipelines for CX dashboards
    • Analyze tone shifts and resolution language in customer-agent interaction

    This dataset is structured, high-signal, and ready for use in AI pipelines, CX design, and quality-assurance systems. It brings full transparency to what actually happens during customer service moments, from routine fixes to emotional escalations.

    The more you purchase, the lower the price will be.

  19. Data from: Text2KGBench: A Benchmark for Ontology-Driven Knowledge Graph Generation from Text

    • data.niaid.nih.gov
    • zenodo.org
    Updated May 23, 2023
    Cite
    Nandana Mihindukulasooriya; Sanju Tiwari; Carlos F. Enguix; Kusum Lata (2023). Text2KGBench: A Benchmark for Ontology-Driven Knowledge Graph Generation from Text [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7916715
    Explore at:
    Dataset updated
    May 23, 2023
    Dataset provided by
    Universidad Autonoma de Tamaulipas, Mexico
    ACM SIGMOD Professional Member
    IBM Research Europe
    Sharda University, India
    Authors
    Nandana Mihindukulasooriya; Sanju Tiwari; Carlos F. Enguix; Kusum Lata
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the repository for ISWC 2023 Resource Track submission for Text2KGBench: Benchmark for Ontology-Driven Knowledge Graph Generation from Text. Text2KGBench is a benchmark to evaluate the capabilities of language models to generate KGs from natural language text guided by an ontology. Given an input ontology and a set of sentences, the task is to extract facts from the text while complying with the given ontology (concepts, relations, domain/range constraints) and being faithful to the input sentences.

    It contains two datasets (i) Wikidata-TekGen with 10 ontologies and 13,474 sentences and (ii) DBpedia-WebNLG with 19 ontologies and 4,860 sentences.

    An example

    An example test sentence:

    Test Sentence:
    {"id": "ont_music_test_n", "sent": "\"The Loco-Motion\" is a 1962 pop song written by American songwriters Gerry Goffin and Carole King."}

    An example ontology:

    Ontology: Music Ontology

    Expected Output:
    {
      "id": "ont_k_music_test_n",
      "sent": "\"The Loco-Motion\" is a 1962 pop song written by American songwriters Gerry Goffin and Carole King.",
      "triples": [
        {"sub": "The Loco-Motion", "rel": "publication date", "obj": "01 January 1962"},
        {"sub": "The Loco-Motion", "rel": "lyrics by", "obj": "Gerry Goffin"},
        {"sub": "The Loco-Motion", "rel": "lyrics by", "obj": "Carole King"}
      ]
    }

    The data is released under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.

    The structure of the repo is as follows:

    Text2KGBench
      src                         source code for generation, evaluation, and the baselines
        benchmark                 code used to generate the benchmark
        evaluation                evaluation scripts for calculating the results
        baseline                  code for generating the baselines, including prompts, sentence similarities, and the LLM client
      data                        benchmark datasets and baseline data; two datasets: wikidata_tekgen and dbpedia_webnlg
        wikidata_tekgen           Wikidata-TekGen dataset
          ontologies              10 ontologies used by this dataset
          train                   training data
          test                    test data
            manually_verified_sentences   ids of a subset of test cases manually validated
          unseen_sentences        new sentences added by the authors that are not part of Wikipedia
            test_unseen           unseen test sentences
            ground_truth          ground truth for the unseen test sentences
          ground_truth            ground truth for the test data
          baselines               data related to running the baselines
            test_train_sent_similarity    for each test case, the 5 most similar train sentences, generated using the SBERT T5-XXL model
            prompts               prompts corresponding to each test file
              unseen_prompts      prompts for the unseen test cases
            Alpaca-LoRA-13B       data related to the Alpaca-LoRA model
              llm_responses       raw LLM responses and extracted triples
              eval_metrics        ontology-level and aggregated evaluation results
              unseen_results      results for the unseen test cases
                llm_responses     raw LLM responses and extracted triples
                eval_metrics      ontology-level and aggregated evaluation results
            Vicuna-13B            data related to the Vicuna-13B model
              llm_responses       raw LLM responses and extracted triples
              eval_metrics        ontology-level and aggregated evaluation results
        dbpedia_webnlg            DBpedia-WebNLG dataset
          ontologies              19 ontologies used by this dataset
          train                   training data
          test                    test data
          ground_truth            ground truth for the test data
          baselines               data related to running the baselines
            test_train_sent_similarity    for each test case, the 5 most similar train sentences, generated using the SBERT T5-XXL model
            prompts               prompts corresponding to each test file
            Alpaca-LoRA-13B       data related to the Alpaca-LoRA model
              llm_responses       raw LLM responses and extracted triples
              eval_metrics        ontology-level and aggregated evaluation results
            Vicuna-13B            data related to the Vicuna-13B model
              llm_responses       raw LLM responses and extracted triples
              eval_metrics        ontology-level and aggregated evaluation results

    This benchmark contains data derived from the TekGen corpus (part of the KELM corpus) [1] released under CC BY-SA 2.0 license and WebNLG 3.0 corpus [2] released under CC BY-NC-SA 4.0 license.

    [1] Oshin Agarwal, Heming Ge, Siamak Shakeri, and Rami Al-Rfou. 2021. Knowledge Graph Based Synthetic Corpus Generation for Knowledge-Enhanced Language Model Pre-training. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3554–3565, Online. Association for Computational Linguistics.

    [2] Claire Gardent, Anastasia Shimorina, Shashi Narayan, and Laura Perez-Beltrachini. 2017. Creating Training Corpora for NLG Micro-Planners. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 179–188, Vancouver, Canada. Association for Computational Linguistics.

  20. Large Language Models Comparison Dataset

    • kaggle.com
    zip
    Updated Feb 24, 2025
    Cite
    Samay Ashar (2025). Large Language Models Comparison Dataset [Dataset]. https://www.kaggle.com/datasets/samayashar/large-language-models-comparison-dataset
    Explore at:
    Available download formats: zip (5894 bytes)
    Dataset updated
    Feb 24, 2025
    Authors
    Samay Ashar
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset provides a comparison of various Large Language Models (LLMs) based on their performance, cost, and efficiency. It includes important details like speed, latency, benchmarks, and pricing, helping users understand how different models stack up against each other.

    Key Details:

    • File Name: llm_comparison_dataset.csv
    • Size: 14.57 kB
    • Total Columns: 15
    • License: CC0 (Public Domain)

    What’s Inside?

    Here are some of the key metrics included in the dataset:

    1. Context Window: Maximum number of tokens the model can process at once.
    2. Speed (tokens/sec): How fast the model generates responses.
    3. Latency (sec): Time delay before the model responds.
    4. Benchmark Scores: Performance ratings from MMLU (academic tasks) and Chatbot Arena (real-world chatbot performance).
    5. Open-Source: Indicates if the model is publicly available or proprietary.
    6. Price per Million Tokens: The cost of using the model for one million tokens.
    7. Training Dataset Size: Amount of data used to train the model.
    8. Compute Power: Resources needed to run the model.
    9. Energy Efficiency: How much power the model consumes.

    This dataset is useful for researchers, developers, and AI enthusiasts who want to compare LLMs and choose the best one based on their needs.
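
    As a starting point, a minimal pandas sketch for exploring the file; the column name referenced in the comment is an assumption based on the metric list above, so inspect the real header first:

      import pandas as pd

      df = pd.read_csv("llm_comparison_dataset.csv")
      print(df.columns.tolist())  # inspect the 15 columns

      # e.g., shortlist models by an assumed benchmark column:
      # print(df.sort_values("MMLU", ascending=False).head())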

    📌If you find this dataset useful, do give an upvote :)
