100+ datasets found

Energy consumption when training LLMs in 2022 (in MWh)
statista.com
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Statista, Energy consumption when training LLMs in 2022 (in MWh) [Dataset]. https://www.statista.com/statistics/1384401/energy-use-when-training-llm-models/
Explore at:
Dataset authored and provided by
Statistahttp://statista.com/
Time period covered
2022
Area covered
Worldwide
Description
Energy consumption of artificial intelligence (AI) models in training is considerable, with both GPT-3, the original release of the current iteration of OpenAI's popular ChatGPT, and Gopher consuming well over **********-megawatt hours of energy simply for training. As this is only for the training model it is likely that the energy consumption for the entire usage and lifetime of GPT-3 and other large language models (LLMs) is significantly higher. The largest consumer of energy, GPT-3, consumed roughly the equivalent of *** Germans in 2022. While not a staggering amount, it is a considerable use of energy. Energy savings through AI While it is undoubtedly true that training LLMs takes a considerable amount of energy, the energy savings are also likely to be substantial. Any AI model that improves processes by minute numbers might save hours on shipment, liters of fuel, or dozens of computations. Each one of these uses energy as well and the sum of energy saved through a LLM might vastly outperform its energy cost. A good example is mobile phone operators, of which a ***** expect that AI might reduce power consumption by *** to ******* percent. Considering that much of the world uses mobile phones this would be a considerable energy saver. Emissions are considerable The amount of CO2 emissions from training LLMs is also considerable, with GPT-3 producing nearly *** tonnes of CO2. This again could be radically changed based on the types of energy production creating the emissions. Most data center operators for instance would prefer to have nuclear energy play a key role, a significantly low-emission energy producer.
📊 6.5k train examples for LLM Science Exam 📝
kaggle.com
Updated Jul 22, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Radek Osmulski (2023). 📊 6.5k train examples for LLM Science Exam 📝 [Dataset]. https://www.kaggle.com/datasets/radek1/additional-train-data-for-llm-science-exam
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jul 22, 2023
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Radek Osmulski
Description
I created this dataset using gpt-3.5-turbo.

I put a lot of effort into making this dataset high quality which allows you to achieve the highest score among the publicly available notebooks available at the moment! 🥳

Originally, I only uploaded 500 examples (they were used as train data in the notebook I mention above). They are stored in extra_train_set.csv.

I am now uploading another 6k (6000_train_examples.csv) completely new train examples which brings the total to 6.5k.

If you find this dataset useful, please leave an upvote! 😊 Thank you! 🙏🙏🙏
D
Data Lineage For LLM Training Market Research Report 2033
dataintelo.com
csv, pdf, pptx
Updated Sep 30, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dataintelo (2025). Data Lineage For LLM Training Market Research Report 2033 [Dataset]. https://dataintelo.com/report/data-lineage-for-llm-training-market
Explore at:
pdf, pptx, csvAvailable download formats
Dataset updated
Sep 30, 2025
Dataset authored and provided by
Dataintelo
License
https://dataintelo.com/privacy-and-policyhttps://dataintelo.com/privacy-and-policy
Time period covered
2024 - 2032
Area covered
Global
Description
Data Lineage for LLM Training Market Outlook

According to our latest research, the global Data Lineage for LLM Training market size reached USD 1.29 billion in 2024, with an impressive compound annual growth rate (CAGR) of 21.8% expected through the forecast period. By 2033, the market is projected to grow to USD 8.93 billion, as organizations worldwide recognize the critical importance of robust data lineage solutions in ensuring transparency, compliance, and efficiency in large language model (LLM) training. The primary growth driver stems from the surging adoption of generative AI and LLMs across diverse industries, necessitating advanced data lineage capabilities for responsible and auditable AI development.

The exponential growth of the Data Lineage for LLM Training market is fundamentally driven by the increasing complexity and scale of data used in training modern AI models. As organizations deploy LLMs for a wide array of applications—from customer service automation to advanced analytics—the need for precise tracking of data provenance, transformation, and usage has become paramount. This trend is further amplified by the proliferation of multi-source and multi-format data, which significantly complicates the process of tracing data origins and transformations. Enterprises are investing heavily in data lineage solutions to ensure that their AI models are trained on high-quality, compliant, and auditable datasets, thereby reducing risks associated with data bias, inconsistency, and regulatory violations.

Another significant growth factor is the evolving regulatory landscape surrounding AI and data governance. Governments and regulatory bodies worldwide are introducing stringent guidelines for data usage, privacy, and accountability in AI systems. Regulations such as the European Union’s AI Act and the U.S. AI Bill of Rights are compelling organizations to implement comprehensive data lineage practices to demonstrate compliance and mitigate legal risks. This regulatory pressure is particularly pronounced in highly regulated industries such as banking, healthcare, and government, where the consequences of non-compliance can be financially and reputationally devastating. As a result, the demand for advanced data lineage software and services is surging, driving market expansion.

Technological advancements in data management platforms and the integration of AI-driven automation are further catalyzing the growth of the Data Lineage for LLM Training market. Modern data lineage tools now leverage machine learning and natural language processing to automatically map data flows, detect anomalies, and generate real-time lineage reports. These innovations drastically reduce the manual effort required for lineage documentation and enhance the scalability of lineage solutions across large and complex data environments. The continuous evolution of such technologies is enabling organizations to achieve higher levels of transparency, trust, and operational efficiency in their AI workflows, thereby fueling market growth.

Regionally, North America dominates the Data Lineage for LLM Training market, accounting for over 42% of the global market share in 2024. This dominance is attributed to the early adoption of AI technologies, the presence of leading technology vendors, and a mature regulatory environment. Europe follows closely, driven by strict data governance regulations and a rapidly growing AI ecosystem. The Asia Pacific region is witnessing the fastest growth, with a projected CAGR of 24.6% through 2033, fueled by digital transformation initiatives, increased AI investments, and a burgeoning startup landscape. Latin America and the Middle East & Africa are also emerging as promising markets, albeit at a relatively nascent stage.

Component Analysis

The Data Lineage for LLM Training market is segmented by component into software and services, each playing a pivotal role in supporting organizations’ lineage initiatives. The software segment holds the largest market share, accounting for nearly 68% of the total market revenue in 2024. This dominance is primarily due to the widespread adoption of advanced data lineage platforms that offer features such as automated lineage mapping, visualization, impact analysis, and integration with existing data management and AI training workflows. These platforms are essential for organ
LLMs Data (2018-2024)
kaggle.com
zip
Updated May 19, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
jaina (2024). LLMs Data (2018-2024) [Dataset]. https://www.kaggle.com/datasets/jainaru/llms-data-2018-2024
Explore at:
zip(23351 bytes)Available download formats
Dataset updated
May 19, 2024
Authors
jaina
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Every major LLM and chatbot released since 2018 with company and no. of billion parameters used to train.

Data Columns

Model: The name of the language model.

Company: The company that developed the model.

Arch: The architecture of the model (e.g., Transformer, RNN).TBA means To Be Announced.

Parameters: The number of parameters (weights) in the model, which is a measure of its complexity. In Billions

Tokens: The number of tokens (sub-word units) the model can process or was trained on. Here, some values are TBA. In Billions

Ratio: Likely the ratio of parameters to tokens, or some other relevant ratio. In this table, it is specified only for Olympus as 20:01.

ALScore: ALScore is a quick and dirty rating of the model's power. The formula is: Sqr Root of (Parameters x Tokens).

Training dataset: The dataset used to train the model.

Release Date: The expected or actual release date of the model.

Notes: Additional notes about the model, such as training details or related information.

Playground: A URL link to a website where you can interact with the model or find more information about it.
h
character-llm-data
huggingface.co
Updated Jun 8, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
OpenMOSS (2024). character-llm-data [Dataset]. https://huggingface.co/datasets/OpenMOSS-Team/character-llm-data
Explore at:
Dataset updated
Jun 8, 2024
Dataset authored and provided by
OpenMOSS
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Character-LLM: A Trainable Agent for Role-Playing

This is the training datasets for Character-LLM, which contains nine characters experience data used to train Character-LLMs. To download the dataset, please run the following code with Python, and you can find the downloaded data in /path/to/local_dir. from huggingface_hub import snapshot_download snapshot_download( local_dir_use_symlinks=True, repo_type="dataset", repo_id="fnlp/character-llm-data"… See the full description on the dataset page: https://huggingface.co/datasets/OpenMOSS-Team/character-llm-data.
High-quality Image & Video Data | 250 Million | LLM Data | Multimodal Large...
data.nexdata.ai
Updated Aug 28, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Nexdata (2025). High-quality Image & Video Data | 250 Million | LLM Data | Multimodal Large Model Data |AI & ML Training Data [Dataset]. https://data.nexdata.ai/products/high-quality-image-video-data-250-million-llm-data-mu-nexdata
Explore at:
Dataset updated
Aug 28, 2025
Dataset authored and provided by
Nexdata
Area covered
Türkiye, Czechia, Indonesia, Netherlands, Kuwait, Estonia, Denmark, Belarus, Mexico, Albania
Description
Nexdata owns 150,000 hours of TV videos data, 20 Million high-quality video data and 250 Million high-quality image data.These datasets can be used for tasks like multimodal large model training.
Trojan Detection Software Challenge - llm-instruct-oct2024-train
catalog.data.gov
nist.gov
+1more
Updated Mar 14, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
National Institute of Standards and Technology (2025). Trojan Detection Software Challenge - llm-instruct-oct2024-train [Dataset]. https://catalog.data.gov/dataset/trojan-detection-software-challenge-llm-instruct-oct2024-train
Explore at:
Dataset updated
Mar 14, 2025
Dataset provided by
National Institute of Standards and Technologyhttp://www.nist.gov/
Description
This is the training data used to create and evaluate trojan detection software solutions. This data, generated at NIST, consists of instruction fine tuned LLMs. A known percentage of these trained AI models have been poisoned with a known trigger which induces incorrect behavior. This data will be used to develop software solutions for detecting that trigger behavior in the trained AI models.
LLM Science Exam Training Data Wiki Pages
kaggle.com
zip
Updated Jul 18, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jude Hunt (2023). LLM Science Exam Training Data Wiki Pages [Dataset]. https://www.kaggle.com/datasets/judehunt23/llm-science-exam-training-data-wiki-pages
Explore at:
zip(2843758 bytes)Available download formats
Dataset updated
Jul 18, 2023
Authors
Jude Hunt
Description
Text extracts for each section of the wikipedia pages used to generate the training dataset in the LLM Science Exam competition, plus extracts from the wikipedia category "Concepts in Physics".

Each page is broken down by section titles, and should also include a "Summary" section
Top web domains cited by LLMs 2025
statista.com
Updated Jun 29, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Statista (2025). Top web domains cited by LLMs 2025 [Dataset]. https://www.statista.com/statistics/1620335/top-web-domains-cited-by-llms/
Explore at:
Dataset updated
Jun 29, 2025
Dataset authored and provided by
Statistahttp://statista.com/
Time period covered
Jun 2025
Area covered
Worldwide
Description
A June 2025 study found that ****** was the most frequently cited web domain by large language models (LLMs). The platform was referenced in approximately ** percent of the analyzed cases, likely due to the content licensing agreement between Google and Reddit in early 2024 for the purpose of AI models training. ********* ranked second, being mentioned in roughly ** percent of the times, while ****** and ******* were mentioned ** percent.
LLM - Detect AI Datamix
kaggle.com
zip
Updated Jan 19, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Raja Biswas (2024). LLM - Detect AI Datamix [Dataset]. https://www.kaggle.com/datasets/conjuring92/ai-mix-v26
Explore at:
zip(172818297 bytes)Available download formats
Dataset updated
Jan 19, 2024
Authors
Raja Biswas
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
This is the datamix created by Team 🔍 📝 🕵️‍♂️ 🤖 during the LLM - Detect AI Generated Text competition. This dataset helped us to win the competition. It facilitates a text-classification task to separate LLM generate essays from the student written ones.

It was developed in an incremental way focusing on size, diversity and complexity. For each datamix iteration, we attempted to plug blindspots of the previous generation models while maintaining robustness.

To maximally leverage in-domain human texts, we used the entire Persuade corpus comprising all 15 prompts. We also included diverse human texts from sources such as OpenAI GPT2 output dataset, ELLIPSE corpus, NarrativeQA, wikipedia, NLTK Brown corpus and IMDB movie reviews.

Sources for our generated essays can be grouped under four categories: - Proprietary LLMs (gpt-3.5, gpt-4, claude, cohere, gemini, palm) - Open source LLMs (llama, falcon, mistral, mixtral) - Existing LLM generated text datasets - Synthetic dataset made by T5 - DAIGT V2 subset - OUTFOX - Ghostbuster - gpt-2-output-dataset

Fine-tuned open-source LLMs (mistral, llama, falcon, deci-lm, t5, pythia, OPT, BLOOM, GPT2). For LLM fine-tuning, we leveraged the PERSUADE corpus in different ways:

Instruction tuning: Instructions were composed of different metadata e.g. prompt name, holistic essay score, ELL status and grade level. Responses were the corresponding student essays.

One topic held out: LLMs fine-tuned on PERSUADE essays with one prompt held out. When generating, only the held out prompt essays were generated. This was done to encourage new writing styles.

Span wise generation: Generate one span (discourse) at a time conditioned on the remaining essay.

We used a wide variety of generation configs and prompting strategies to promote diversity & complexity to the data. Generated essays leveraged a combination of the following: - Contrastive search - Use of Guidance scale, typical_p, suppress_tokens - High temperature & large values of top-k - Prompting to fill-in-the-blank: randomly mask words in an essay and asking LLM to reconstruct the original essay (similar to MLM) - Prompting without source texts - Prompting with source texts - Prompting to rewrite existing essays

Finally, we incorporated augmented essays to make our models aware of typical attacks on LLM content detection systems and obfuscations present in the provided training data. We mainly used a combination of the following augmentations on a random subset of essays: - Spelling correction - Deletion/insertion/swapping of characters - Replacement with synonym - Introduce obfuscations - Back translation - Random capitalization - Swap sentence
F
Japanese Closed Ended Question Answer Text Dataset
futurebeeai.com
wav
Updated Aug 1, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
FutureBee AI (2022). Japanese Closed Ended Question Answer Text Dataset [Dataset]. https://www.futurebeeai.com/dataset/prompt-response-dataset/japanese-closed-ended-question-answer-text-dataset
Explore at:
wavAvailable download formats
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreementhttps://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
The Japanese Closed-Ended Question Answering Dataset is a meticulously curated collection of 5000 comprehensive Question-Answer pairs. It serves as a valuable resource for training Large Language Models (LLMs) and question-answering models in the Japanese language, advancing the field of artificial intelligence.
Dataset Content
This closed-ended QA dataset comprises a diverse set of context paragraphs and questions paired with corresponding answers in Japanese. There is a context paragraph given for each question to get the answer from. The questions cover a broad range of topics, including science, history, technology, geography, literature, current affairs, and more.
Each question is accompanied by an answer, providing valuable information and insights to enhance the language model training process. Both the questions and answers were manually curated by native Japanese people, and references were taken from diverse sources like books, news articles, websites, web forums, and other reliable references.
This question-answer prompt completion dataset contains different types of prompts, including instruction type, continuation type, and in-context learning (zero-shot, few-shot) type. The dataset also contains questions and answers with different types of rich text, including tables, code, JSON, etc., with proper markdown.
Question Diversity
To ensure diversity, this Q&A dataset includes questions with varying complexity levels, ranging from easy to medium and hard. Different types of questions, such as multiple-choice, direct, and true/false, are included. The QA dataset also contains questions with constraints, which makes it even more useful for LLM training.
Answer Formats
To accommodate varied learning experiences, the dataset incorporates different types of answer formats. These formats include single-word, short phrases, single sentences, and paragraphs types of answers. The answers contain text strings, numerical values, date and time formats as well. Such diversity strengthens the language model's ability to generate coherent and contextually appropriate answers.
Data Format and Annotation Details
This fully labeled Japanese Closed-Ended Question Answer Dataset is available in JSON and CSV formats. It includes annotation details such as a unique id, context paragraph, context reference link, question, question type, question complexity, question category, domain, prompt type, answer, answer type, and rich text presence.
Quality and Accuracy
The dataset upholds the highest standards of quality and accuracy. Each question undergoes careful validation, and the corresponding answers are thoroughly verified. To prioritize inclusivity, the dataset incorporates questions and answers representing diverse perspectives and writing styles, ensuring it remains unbiased and avoids perpetuating discrimination.
The Japanese versions is grammatically accurate without any spelling or grammatical errors. No toxic or harmful content is used while building this dataset.
Continuous Updates and Customization
The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Continuous efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to collect custom question-answer data tailored to specific needs, providing flexibility and customization options.
License:
The dataset, created by FutureBeeAI, is now ready for commercial use. Researchers, data scientists, and developers can utilize this fully labeled and ready-to-deploy Japanese Closed-Ended Question Answer Dataset to enhance the language understanding capabilities of their generative AI models, improve response generation, and explore new approaches to NLP question-answering tasks.
Augmented training data and labels, used for training the models
figshare.com
bin
Updated Mar 26, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Michael Keane (2025). Augmented training data and labels, used for training the models [Dataset]. http://doi.org/10.6084/m9.figshare.28669001.v1
Explore at:
binAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.28669001.v1
Dataset updated
Mar 26, 2025
Dataset provided by
Figsharehttp://figshare.com/
figshare
Authors
Michael Keane
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This is the augmented data and labels used in training the model, it is also needed for evaluation as the vectoriser is fit on this data and then the test data is transformed on that vectoriser
h
Lucie-Training-Dataset
huggingface.co
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
OpenLLM France, Lucie-Training-Dataset [Dataset]. https://huggingface.co/datasets/OpenLLM-France/Lucie-Training-Dataset
Explore at:
Dataset authored and provided by
OpenLLM France
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
Lucie Training Dataset Card

The Lucie Training Dataset is a curated collection of text data in English, French, German, Spanish and Italian culled from a variety of sources including: web data, video subtitles, academic papers, digital books, newspapers, and magazines, some of which were processed by Optical Character Recognition (OCR). It also contains samples of diverse programming languages. The Lucie Training Dataset was used to pretrain Lucie-7B, a foundation LLM with strong… See the full description on the dataset page: https://huggingface.co/datasets/OpenLLM-France/Lucie-Training-Dataset.
g
Trojan Detection Software Challenge - llm-pretrain-apr2024-train | gimi9.com...
gimi9.com
Updated Apr 19, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2024). Trojan Detection Software Challenge - llm-pretrain-apr2024-train | gimi9.com [Dataset]. https://gimi9.com/dataset/data-gov_trojan-detection-software-challenge-llm-pretrain-apr2024-train
Explore at:
Dataset updated
Apr 19, 2024
Description
TrojAI llm-pretrain-apr2024 Train DatasetThis is the training data used to create and evaluate trojan detection software solutions. This data, generated at NIST, consists Llama2 Large Language Models refined using fine-tuning and LoRA to perform next token prediction. A known percentage of these trained AI models have been poisoned with triggers which induces modified behavior. This data will be used to develop software solutions for detecting which trained AI models have been poisoned via embedded triggers into the model weights.
h
llm_dataset_inference
huggingface.co
Updated Sep 10, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Pratyush Maini (2024). llm_dataset_inference [Dataset]. https://huggingface.co/datasets/pratyushmaini/llm_dataset_inference
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Sep 10, 2024
Authors
Pratyush Maini
Description
LLM Dataset Inference

This repository contains various subsets of the PILE dataset, divided into train and validation sets. The data is used to facilitate privacy research in language models, where perturbed data can be used as a reference to detect the presence of a particular dataset in the training data of a language model.

Data Used

The data is in the form of JSONL files, with each entry containing the raw text, as well as various kinds of perturbations applied to it.… See the full description on the dataset page: https://huggingface.co/datasets/pratyushmaini/llm_dataset_inference.
h
volcano-train
huggingface.co
Updated Nov 15, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
KAIST AI (2023). volcano-train [Dataset]. https://huggingface.co/datasets/kaist-ai/volcano-train
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Nov 15, 2023
Dataset authored and provided by
KAIST AI
Description
Data details

274K multimodal feedback and revision data 558K filtered image-text pairs from LAION/CC/SBU, captioned by BLIP. 158K GPT-generated multimodal instruction-following data. 450K academic-task-oriented VQA data mixture. 40K ShareGPT data

Data collection

Since no multimodal feedback data for training is publicly available as of this writing and human labeling is costly, we used a proprietary LLM to generate feedback data. As shown in figure, we use an… See the full description on the dataset page: https://huggingface.co/datasets/kaist-ai/volcano-train.
h
querygen-data-v4
huggingface.co
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Nixiesearch, querygen-data-v4 [Dataset]. https://huggingface.co/datasets/nixiesearch/querygen-data-v4
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset authored and provided by
Nixiesearch
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Nixiesearch querygen-v4 model training dataset

A dataset used to train the not-yet-published querygen-v4 model from Nixiesearch. The dataset is a combination of multiple open query-document datasets in a format for Causal LLM training.

Used datasets

We use train splits from the following datasets:

MSMARCO: 532751 rows HotpotQA: 170000 rows NQ: 58554 rows MIRACL en: 1193 rows SQUAD: 85710 rows TriviaQA: 60283 rows

The train split is 900000 rows, and test split is 8491.… See the full description on the dataset page: https://huggingface.co/datasets/nixiesearch/querygen-data-v4.
d
Customer Service Call Dataset [Multisector] – Annotated support transcripts...
datarade.ai
Updated Apr 11, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
WiserBrand.com (2025). Customer Service Call Dataset [Multisector] – Annotated support transcripts for training AI and improving CX [Dataset]. https://datarade.ai/data-products/customer-service-call-dataset-multisector-annotated-suppo-wiserbrand-com
Explore at:
.json, .csv, .xls, .txtAvailable download formats
Dataset updated
Apr 11, 2025
Dataset provided by
WiserBrand
Area covered
United States of America
Description
"This dataset contains transcribed customer support calls from companies in over 160 industries, offering a high-quality foundation for developing customer-aware AI systems and improving service operations. It captures how real people express concerns, frustrations, and requests — and how support teams respond.

Included in each record:

Full call transcription with labeled speakers (system, agent, customer)

Concise human-written summary of the conversation

Sentiment tag for the overall interaction: positive, neutral, or negative

Company name, duration, and geographic location of the caller

Call context includes industries such as eCommerce, banking, telecom, and streaming services

Common use cases:

Train NLP models to understand support calls and detect churn risk

Power complaint detection engines for customer success and support teams

Create high-quality LLM training sets with real support narratives

Build summarization and topic tagging pipelines for CX dashboards

Analyze tone shifts and resolution language in customer-agent interaction

This dataset is structured, high-signal, and ready for use in AI pipelines, CX design, and quality assurance systems. It brings full transparency to what actually happens during customer service moments — from routine fixes to emotional escalations."

The more you purchase, the lower the price will be.
Z
Data from: Text2KGBench: A Benchmark for Ontology-Driven Knowledge Graph...
data.niaid.nih.gov
zenodo.org
Updated May 23, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Nandana Mihindukulasooriya; Sanju Tiwari; Carlos F. Enguix; Kusum Lata (2023). Text2KGBench: A Benchmark for Ontology-Driven Knowledge Graph Generation from Text [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7916715
Explore at:
Dataset updated
May 23, 2023
Dataset provided by
Universidad Autonoma de Tamaulipas, Mexico
ACM SIGMOD Professional Member
IBM Research Europe
Sharda University, India
Authors
Nandana Mihindukulasooriya; Sanju Tiwari; Carlos F. Enguix; Kusum Lata
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This is the repository for ISWC 2023 Resource Track submission for Text2KGBench: Benchmark for Ontology-Driven Knowledge Graph Generation from Text. Text2KGBench is a benchmark to evaluate the capabilities of language models to generate KGs from natural language text guided by an ontology. Given an input ontology and a set of sentences, the task is to extract facts from the text while complying with the given ontology (concepts, relations, domain/range constraints) and being faithful to the input sentences.

It contains two datasets (i) Wikidata-TekGen with 10 ontologies and 13,474 sentences and (ii) DBpedia-WebNLG with 19 ontologies and 4,860 sentences.

An example

An example test sentence:

Test Sentence: {"id": "ont_music_test_n", "sent": ""The Loco-Motion" is a 1962 pop song written by American songwriters Gerry Goffin and Carole King."}

An example of ontology:

Ontology: Music Ontology

Expected Output:

{ "id": "ont_k_music_test_n", "sent": ""The Loco-Motion" is a 1962 pop song written by American songwriters Gerry Goffin and Carole King.", "triples": [ { "sub": "The Loco-Motion", "rel": "publication date", "obj": "01 January 1962" },{ "sub": "The Loco-Motion", "rel": "lyrics by", "obj": "Gerry Goffin" },{ "sub": "The Loco-Motion", "rel": "lyrics by", "obj": "Carole King" },] }

The data is released under a Creative Commons Attribution-ShareAlike 4.0 International (CC BY 4.0) License.

The structure of the repo is as the following.

Text2KGBench

src: the source code used for generation and evaluation, and baseline

benchmark the code used to generate the benchmark

evaluation evaluation scripts for calculating the results

baseline code for generating the baselines including prompts, sentence similarities, and LLM client.

data: the benchmark datasets and baseline data. There are two datasets: wikidata_tekgen and dbpedia_webnlg.

wikidata_tekgen Wikidata-TekGen Dataset

ontologies 10 ontologies used by this dataset

train training data

test test data

manually_verified_sentences ids of a subset of test cases manually validated

unseen_sentences new sentences that are added by the authors which are not part of Wikipedia

test unseen test unseen test sentences

ground_truth ground truth for unseen test sentences.

ground_truth ground truth for the test data

baselines data related to running the baselines.

test_train_sent_similarity for each test case, 5 most similar train sentences generated using SBERT T5-XXL model.

prompts prompts corresponding to each test file

unseen prompts unseen prompts for the unseen test cases

Alpaca-LoRA-13B data related to the Alpaca-LoRA model

llm_responses raw LLM responses and extracted triples

eval_metrics ontology-level and aggregated evaluation results

unseen results results for the unseen test cases

llm_responses raw LLM responses and extracted triples

eval_metrics ontology-level and aggregated evaluation results

Vicuna-13B data related to the Vicuna-13B model

llm_responses raw LLM responses and extracted triples

eval_metrics ontology-level and aggregated evaluation results

dbpedia_webnlg DBpedia Dataset

ontologies 19 ontologies used by this dataset

train training data

test test data

ground_truth ground truth for the test data

baselines data related to running the baselines.

test_train_sent_similarity for each test case, 5 most similar train sentences generated using SBERT T5-XXL model.

prompts prompts corresponding to each test file

Alpaca-LoRA-13B data related to the Alpaca-LoRA model

llm_responses raw LLM responses and extracted triples

eval_metrics ontology-level and aggregated evaluation results

Vicuna-13B data related to the Vicuna-13B model

llm_responses raw LLM responses and extracted triples

eval_metrics ontology-level and aggregated evaluation results

This benchmark contains data derived from the TekGen corpus (part of the KELM corpus) [1] released under CC BY-SA 2.0 license and WebNLG 3.0 corpus [2] released under CC BY-NC-SA 4.0 license.

[1] Oshin Agarwal, Heming Ge, Siamak Shakeri, and Rami Al-Rfou. 2021. Knowledge Graph Based Synthetic Corpus Generation for Knowledge-Enhanced Language Model Pre-training. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3554–3565, Online. Association for Computational Linguistics.

[2] Claire Gardent, Anastasia Shimorina, Shashi Narayan, and Laura Perez-Beltrachini. 2017. Creating Training Corpora for NLG Micro-Planners. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 179–188, Vancouver, Canada. Association for Computational Linguistics.
Large Language Models Comparison Dataset
kaggle.com
zip
Updated Feb 24, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Samay Ashar (2025). Large Language Models Comparison Dataset [Dataset]. https://www.kaggle.com/datasets/samayashar/large-language-models-comparison-dataset
Explore at:
zip(5894 bytes)Available download formats
Dataset updated
Feb 24, 2025
Authors
Samay Ashar
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
This dataset provides a comparison of various Large Language Models (LLMs) based on their performance, cost, and efficiency. It includes important details like speed, latency, benchmarks, and pricing, helping users understand how different models stack up against each other.

Key Details:

File Name: llm_comparison_dataset.csv

Size: 14.57 kB

Total Columns: 15

License: CC0 (Public Domain)

What’s Inside?

Here are some of the key metrics included in the dataset:

Context Window: Maximum number of tokens the model can process at once.

Speed (tokens/sec): How fast the model generates responses.

Latency (sec): Time delay before the model responds.

Benchmark Scores: Performance ratings from MMLU (academic tasks) and Chatbot Arena (real-world chatbot performance).

Open-Source: Indicates if the model is publicly available or proprietary.

Price per Million Tokens: The cost of using the model for one million tokens.

Training Dataset Size: Amount of data used to train the model.

Compute Power: Resources needed to run the model.

Energy Efficiency: How much power the model consumes.

This dataset is useful for researchers, developers, and AI enthusiasts who want to compare LLMs and choose the best one based on their needs.

📌If you find this dataset useful, do give an upvote :)

Facebook

Twitter

Click to copy link

Link copied

Cite

Statista, Energy consumption when training LLMs in 2022 (in MWh) [Dataset]. https://www.statista.com/statistics/1384401/energy-use-when-training-llm-models/

Energy consumption when training LLMs in 2022 (in MWh)

Explore at:

6 scholarly articles cite this dataset (View in Google Scholar)

Dataset authored and provided by

Statistahttp://statista.com/

Time period covered

2022

Area covered

Worldwide

Description

Energy consumption of artificial intelligence (AI) models in training is considerable, with both GPT-3, the original release of the current iteration of OpenAI's popular ChatGPT, and Gopher consuming well over **********-megawatt hours of energy simply for training. As this is only for the training model it is likely that the energy consumption for the entire usage and lifetime of GPT-3 and other large language models (LLMs) is significantly higher. The largest consumer of energy, GPT-3, consumed roughly the equivalent of *** Germans in 2022. While not a staggering amount, it is a considerable use of energy. Energy savings through AI While it is undoubtedly true that training LLMs takes a considerable amount of energy, the energy savings are also likely to be substantial. Any AI model that improves processes by minute numbers might save hours on shipment, liters of fuel, or dozens of computations. Each one of these uses energy as well and the sum of energy saved through a LLM might vastly outperform its energy cost. A good example is mobile phone operators, of which a ***** expect that AI might reduce power consumption by *** to ******* percent. Considering that much of the world uses mobile phones this would be a considerable energy saver. Emissions are considerable The amount of CO2 emissions from training LLMs is also considerable, with GPT-3 producing nearly *** tonnes of CO2. This again could be radically changed based on the types of energy production creating the emissions. Most data center operators for instance would prefer to have nuclear energy play a key role, a significantly low-emission energy producer.

Clear search

Close search

Google apps

Main menu

Energy consumption when training LLMs in 2022 (in MWh)

📊 6.5k train examples for LLM Science Exam 📝

Data Lineage For LLM Training Market Research Report 2033

Data Lineage for LLM Training Market Outlook

Component Analysis

LLMs Data (2018-2024)

Data Columns

character-llm-data

High-quality Image & Video Data | 250 Million | LLM Data | Multimodal Large...

Trojan Detection Software Challenge - llm-instruct-oct2024-train

LLM Science Exam Training Data Wiki Pages

Top web domains cited by LLMs 2025

LLM - Detect AI Datamix

Japanese Closed Ended Question Answer Text Dataset

Dataset Content

Question Diversity

Answer Formats

Data Format and Annotation Details

Quality and Accuracy

Continuous Updates and Customization

License:

Augmented training data and labels, used for training the models

Lucie-Training-Dataset

Trojan Detection Software Challenge - llm-pretrain-apr2024-train | gimi9.com...

llm_dataset_inference

volcano-train

querygen-data-v4

Customer Service Call Dataset [Multisector] – Annotated support transcripts...

Data from: Text2KGBench: A Benchmark for Ontology-Driven Knowledge Graph...

Large Language Models Comparison Dataset

Key Details:

What’s Inside?

Energy consumption when training LLMs in 2022 (in MWh)See More Versions

Energy consumption when training LLMs in 2022 (in MWh)