Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset was created by Hemanthh Velliyangirie
Released under Apache 2.0
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Lucie Training Dataset Card
The Lucie Training Dataset is a curated collection of text data in English, French, German, Spanish and Italian culled from a variety of sources including: web data, video subtitles, academic papers, digital books, newspapers, and magazines, some of which were processed by Optical Character Recognition (OCR). It also contains samples of diverse programming languages. The Lucie Training Dataset was used to pretrain Lucie-7B, a foundation LLM with strong… See the full description on the dataset page: https://huggingface.co/datasets/OpenLLM-France/Lucie-Training-Dataset.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Grade-Aware LLM Training Dataset
Dataset Description
This dataset contains 1,107,690 high-quality instruction-tuning examples for grade-aware text simplification, designed for fine-tuning large language models to simplify text to specific reading grade levels with precision and semantic consistency.
Dataset Summary
Total Examples: 1,107,690 Task: Text simplification with precise grade-level targeting Language: English Grade Range: 1-12+ (precise 2-decimal… See the full description on the dataset page: https://huggingface.co/datasets/yimingwang123/grade-aware-llm-training-data.
I created this dataset using gpt-3.5-turbo.
I put a lot of effort into making this dataset high quality, which lets you achieve the highest score among the publicly available notebooks at the moment! 🥳
Originally, I only uploaded 500 examples (they were used as train data in the notebook I mention above). They are stored in extra_train_set.csv.
I am now uploading another 6k (6000_train_examples.csv) completely new train examples which brings the total to 6.5k.
If you find this dataset useful, please leave an upvote! 😊 Thank you! 🙏🙏🙏
https://dataintelo.com/privacy-and-policy
According to our latest research, the global Data Lineage for LLM Training market size reached USD 1.29 billion in 2024, with an impressive compound annual growth rate (CAGR) of 21.8% expected through the forecast period. By 2033, the market is projected to grow to USD 8.93 billion, as organizations worldwide recognize the critical importance of robust data lineage solutions in ensuring transparency, compliance, and efficiency in large language model (LLM) training. The primary growth driver stems from the surging adoption of generative AI and LLMs across diverse industries, necessitating advanced data lineage capabilities for responsible and auditable AI development.
The exponential growth of the Data Lineage for LLM Training market is fundamentally driven by the increasing complexity and scale of data used in training modern AI models. As organizations deploy LLMs for a wide array of applications—from customer service automation to advanced analytics—the need for precise tracking of data provenance, transformation, and usage has become paramount. This trend is further amplified by the proliferation of multi-source and multi-format data, which significantly complicates the process of tracing data origins and transformations. Enterprises are investing heavily in data lineage solutions to ensure that their AI models are trained on high-quality, compliant, and auditable datasets, thereby reducing risks associated with data bias, inconsistency, and regulatory violations.
Another significant growth factor is the evolving regulatory landscape surrounding AI and data governance. Governments and regulatory bodies worldwide are introducing stringent guidelines for data usage, privacy, and accountability in AI systems. Regulations such as the European Union’s AI Act and the U.S. AI Bill of Rights are compelling organizations to implement comprehensive data lineage practices to demonstrate compliance and mitigate legal risks. This regulatory pressure is particularly pronounced in highly regulated industries such as banking, healthcare, and government, where the consequences of non-compliance can be financially and reputationally devastating. As a result, the demand for advanced data lineage software and services is surging, driving market expansion.
Technological advancements in data management platforms and the integration of AI-driven automation are further catalyzing the growth of the Data Lineage for LLM Training market. Modern data lineage tools now leverage machine learning and natural language processing to automatically map data flows, detect anomalies, and generate real-time lineage reports. These innovations drastically reduce the manual effort required for lineage documentation and enhance the scalability of lineage solutions across large and complex data environments. The continuous evolution of such technologies is enabling organizations to achieve higher levels of transparency, trust, and operational efficiency in their AI workflows, thereby fueling market growth.
Regionally, North America dominates the Data Lineage for LLM Training market, accounting for over 42% of the global market share in 2024. This dominance is attributed to the early adoption of AI technologies, the presence of leading technology vendors, and a mature regulatory environment. Europe follows closely, driven by strict data governance regulations and a rapidly growing AI ecosystem. The Asia Pacific region is witnessing the fastest growth, with a projected CAGR of 24.6% through 2033, fueled by digital transformation initiatives, increased AI investments, and a burgeoning startup landscape. Latin America and the Middle East & Africa are also emerging as promising markets, albeit at a relatively nascent stage.
The Data Lineage for LLM Training market is segmented by component into software and services, each playing a pivotal role in supporting organizations’ lineage initiatives. The software segment holds the largest market share, accounting for nearly 68% of the total market revenue in 2024. This dominance is primarily due to the widespread adoption of advanced data lineage platforms that offer features such as automated lineage mapping, visualization, impact analysis, and integration with existing data management and AI training workflows. These platforms are essential for organ
https://choosealicense.com/licenses/cdla-sharing-1.0/
Bitext - Travel Tagged Training Dataset for LLM-based Virtual Assistants
Overview
This hybrid synthetic dataset is designed to be used to fine-tune Large Language Models such as GPT, Mistral and OpenELM, and has been generated using our NLP/NLG technology and our automated Data Labeling (DAL) tools. The goal is to demonstrate how Verticalization/Domain Adaptation for the [Travel] sector can be easily achieved using our two-step approach to LLM Fine-Tuning. An overview of… See the full description on the dataset page: https://huggingface.co/datasets/bitext/Bitext-travel-llm-chatbot-training-dataset.
Text extracts for each section of the Wikipedia pages used to generate the training dataset in the LLM Science Exam competition, plus extracts from the Wikipedia category "Concepts in Physics".
Each page is broken down by section titles, and should also include a "Summary" section
https://creativecommons.org/publicdomain/zero/1.0/
Every major LLM and chatbot released since 2018, with the company behind it and the number of billions of parameters used in training.
This dataset was created by Mehmet Deniz Kaya
Released under Other (specified in description)
A June 2025 study found that ****** was the most frequently cited web domain by large language models (LLMs). The platform was referenced in approximately ** percent of the analyzed cases, likely due to the content licensing agreement between Google and Reddit in early 2024 for the purpose of AI model training. ********* ranked second, being mentioned in roughly ** percent of cases, while ****** and ******* were each mentioned ** percent of the time.
Overview
Off-the-shelf pre-training text data with 50 million entries, covering test questions, textbooks, ebooks, journals and papers, multi-round dialogue text, etc.
About Nexdata
Nexdata owns off-the-shelf PB-level Large Language Model (LLM) data, 1 million hours of audio data and 800 TB of annotated imagery data. This ready-to-go data supports instant delivery and quickly improves the accuracy of AI models. For more details, please visit us at https://www.nexdata.ai/datasets/llm?source=Datarade
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Recruitment and career advisory teams in the HR industry often face challenges with sensitive, hard-to-access data. This dataset removes that barrier by providing synthetic HR conversations and resume screening Q&A, structured for LLM training in JSONL format.
It enables HR teams and AI developers to build smarter internal chatbots, automate candidate screening, accelerate onboarding workflows, and create AI-powered career advisory tools, all while keeping data privacy intact. This helps organizations improve efficiency, reduce manual effort, and scale AI-driven HR solutions.
This dataset contains synthetic HR conversations and resume screening Q&A, formatted for LLM fine-tuning. Each record represents a dialogue simulating real-world HR workflows like candidate evaluation and career guidance.
messages: An array of chat messages with roles (system, user, assistant) and their respective content.

{
  "messages": [
    {"role": "system", "content": "You are an informative assistant."},
    {"role": "user", "content": "What is AbdulMuiz Shaikh's current job title?"},
    {"role": "assistant", "content": "AbdulMuiz Shaikh's current job title is Associate Data Scientist."}
  ]
}
Fine-tune with OpenAI:

```bash
openai tools fine_tunes.prepare_data -f hr_resume_qna.jsonl
openai api fine_tunes.create -t "hr_resume_qna.jsonl" -m "gpt-3.5-turbo"
```
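Before running the fine-tuning commands above, it can help to sanity-check that every JSONL line matches the chat-format schema shown earlier. A minimal sketch of such a check; the `validate_chat_record` helper and the sample record below are illustrative, not part of the dataset:

```python
import json

def validate_chat_record(line: str) -> bool:
    """Return True if a JSONL line matches the chat schema:
    a non-empty "messages" array of {role, content} objects."""
    record = json.loads(line)
    messages = record.get("messages", [])
    allowed_roles = {"system", "user", "assistant"}
    return (
        len(messages) > 0
        and all(
            m.get("role") in allowed_roles and isinstance(m.get("content"), str)
            for m in messages
        )
    )

# Build one sample record in the same shape as the dataset.
sample = json.dumps({
    "messages": [
        {"role": "system", "content": "You are an informative assistant."},
        {"role": "user", "content": "What is the candidate's current job title?"},
        {"role": "assistant", "content": "The current job title is Associate Data Scientist."},
    ]
})
print(validate_chat_record(sample))  # True
```

Running this over each line of `hr_resume_qna.jsonl` before upload catches malformed records early, when they are cheap to fix.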
https://creativecommons.org/publicdomain/zero/1.0/
This dataset provides a comparison of various Large Language Models (LLMs) based on their performance, cost, and efficiency. It includes important details like speed, latency, benchmarks, and pricing, helping users understand how different models stack up against each other.
Here are some of the key metrics included in the dataset:
This dataset is useful for researchers, developers, and AI enthusiasts who want to compare LLMs and choose the best one based on their needs.
📌If you find this dataset useful, do give an upvote :)
https://choosealicense.com/licenses/cdla-sharing-1.0/
Bitext - Retail (eCommerce) Tagged Training Dataset for LLM-based Virtual Assistants
Overview
This hybrid synthetic dataset is designed to be used to fine-tune Large Language Models such as GPT, Mistral and OpenELM, and has been generated using our NLP/NLG technology and our automated Data Labeling (DAL) tools. The goal is to demonstrate how Verticalization/Domain Adaptation for the [Retail (eCommerce)] sector can be easily achieved using our two-step approach to LLM… See the full description on the dataset page: https://huggingface.co/datasets/bitext/Bitext-retail-ecommerce-llm-chatbot-training-dataset.
New release of DAIGT train dataset! New models: 'text-ada-001', 'text-babbage-001', 'text-curie-001', 'text-davinci-001', 'text-davinci-002', 'text-davinci-003'
These models from OpenAI are getting deprecated, so I made sure to generate some essays with them and share here. I also added following public datasets (please upvote!): - https://www.kaggle.com/datasets/phanisrikanth/daigt-essays-from-intel-neural-chat-7b - https://www.kaggle.com/datasets/carlmcbrideellis/llm-mistral-7b-instruct-texts - https://www.kaggle.com/datasets/nbroad/daigt-data-llama-70b-and-falcon180b - https://www.kaggle.com/datasets/snassimr/gpt4-rephrased-llm-daigt-dataset
All merged with my previous dataset for convenience (https://www.kaggle.com/datasets/thedrcat/daigt-v2-train-dataset)
Enjoy ❤️
Version 2 update:
- removed NaNs and duplicated/short generations
- applied cleaning procedure from @nbroad's notebook - give it an upvote please!
- added model column to indicate model family used in generations
We’ve developed another annotated dataset designed specifically for conversational AI and companion AI model training.
What you have here on Kaggle is our free sample - Think Salon Kitty meets AI
The 'Time Waster Identification & Retreat Model Dataset' enables AI handler agents to detect when users are likely to churn, saving valuable tokens and preventing wasted compute cycles in conversational models.
This batch has 167 entries annotated for sentiment, intent, user risk flagging (via behavioural tracking), user Recovery Potential per statement; among others. This dataset is designed to be a niche micro dataset for a specific use case: Time Waster Identification and Retreat.
👉 Buy the updated version: https://lifebricksglobal.gumroad.com/l/Time-WasterDetection-Dataset
This dataset is perfect for:
It is designed for AI researchers and developers building:
Use case:
👉 Good for teams working on conversational AI, companion AI, fraud detectors and those integrating routing logic for voice/chat agents
Contact us on LinkedIn: Life Bricks Global.
License:
This dataset is provided under a custom license. By using the dataset, you agree to the following terms:
Usage: You are allowed to use the dataset for non-commercial purposes, including research, development, and machine learning model training.
Modification: You may modify the dataset for your own use.
Redistribution: Redistribution of the dataset in its original or modified form is not allowed without permission.
Attribution: Proper attribution must be given when using or referencing this dataset.
No Warranty: The dataset is provided "as-is" without any warranties, express or implied, regarding its accuracy, completeness, or fitness for a particular purpose.
Comprehensive training data on 1M+ stores across the US & Canada. Includes detailed menus, inventory, pricing, and availability. Ideal for AI/ML models, powering retrieval-augmented generation, search, and personalization systems.
300 Million Pairs of High-Quality Image-Caption Dataset includes a large-scale collection of photographic and vector images paired with English textual descriptions. The complete image library comprises nearly 300 million images, with a curated subset of 100 million high-quality image-caption pairs available for generative AI and vision-language model training. All images are authentic and legally licensed works created by professional photographers. The dataset primarily features English captions with minimal Chinese, offering diverse scenes, objects, and compositions suitable for tasks such as image captioning, visual question answering (VQA), image-text retrieval, and multimodal foundation model pretraining. The dataset supports large-scale LLM and VLM applications and complies with global data privacy and copyright regulations, including GDPR, CCPA, and PIPL.
For the high-quality training data required in unsupervised and supervised learning, Nexdata provides flexible, customized Large Language Model (LLM) data annotation services for tasks such as supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF).
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Customer Service Tagged Training Dataset for LLM-based Virtual Assistants
Overview
This dataset can be used to train Large Language Models such as GPT, Llama 2 and Falcon, both for fine-tuning and domain adaptation.
The dataset has the following specs:
- Use Case: Intent Detection
- Vertical: Customer Service
- 27 intents assigned to 10 categories
- 26,872 question/answer pairs, around 1,000 per intent
- 30 entity/slot types
- 12 different types of language generation tags

The categories and intents have been selected from Bitext's collection of 20 vertical-specific datasets, covering the intents that are common across all 20 verticals. The verticals are:
Automotive, Retail Banking, Education, Events & Ticketing, Field Services, Healthcare, Hospitality, Insurance, Legal Services, Manufacturing, Media Streaming, Mortgages & Loans, Moving & Storage, Real Estate/Construction, Restaurant & Bar Chains, Retail/E-commerce, Telecommunications, Travel, Utilities, Wealth Management
Fields of the Dataset
Each entry in the dataset contains the following fields:
- flags: tags (explained below in the Language Generation Tags section)
- instruction: a user request from the Customer Service domain
- category: the high-level semantic category for the intent
- intent: the intent corresponding to the user instruction
- response: an example expected response from the virtual assistant

Categories and Intents
The categories and intents covered by the dataset are:
- ACCOUNT: create_account, delete_account, edit_account, switch_account
- CANCELLATION_FEE: check_cancellation_fee
- DELIVERY: delivery_options
- FEEDBACK: complaint, review
- INVOICE: check_invoice, get_invoice
- NEWSLETTER: newsletter_subscription
- ORDER: cancel_order, change_order, place_order
- PAYMENT: check_payment_methods, payment_issue
- REFUND: check_refund_policy, track_refund
- SHIPPING_ADDRESS: change_shipping_address, set_up_shipping_address

Entities
The entities covered by the dataset are:
- {{Order Number}}, typically present in: Intents: cancel_order, change_order, change_shipping_address, check_invoice, check_refund_policy, complaint, delivery_options, delivery_period, get_invoice, get_refund, place_order, track_order, track_refund
- {{Invoice Number}}, typically present in: Intents: check_invoice, get_invoice
- {{Online Order Interaction}}, typically present in: Intents: cancel_order, change_order, check_refund_policy, delivery_period, get_refund, review, track_order, track_refund
- {{Online Payment Interaction}}, typically present in: Intents: cancel_order, check_payment_methods
- {{Online Navigation Step}}, typically present in: Intents: complaint, delivery_options
- {{Online Customer Support Channel}}, typically present in: Intents: check_refund_policy, complaint, contact_human_agent, delete_account, delivery_options, edit_account, get_refund, payment_issue, registration_problems, switch_account
- {{Profile}}, typically present in: Intent: switch_account
- {{Profile Type}}, typically present in: Intent: switch_account
- {{Settings}}, typically present in: Intents: cancel_order, change_order, change_shipping_address, check_cancellation_fee, check_invoice, check_payment_methods, contact_human_agent, delete_account, delivery_options, edit_account, get_invoice, newsletter_subscription, payment_issue, place_order, recover_password, registration_problems, set_up_shipping_address, switch_account, track_order, track_refund
- {{Online Company Portal Info}}, typically present in: Intents: cancel_order, edit_account
- {{Date}}, typically present in: Intents: check_invoice, check_refund_policy, get_refund, track_order, track_refund
- {{Date Range}}, typically present in: Intents: check_cancellation_fee, check_invoice, get_invoice
- {{Shipping Cut-off Time}}, typically present in: Intent: delivery_options
- {{Delivery City}}, typically present in: Intent: delivery_options
- {{Delivery Country}}, typically present in: Intents: check_payment_methods, check_refund_policy, delivery_options, review, switch_account
- {{Salutation}}, typically present in: Intents: cancel_order, check_payment_methods, check_refund_policy, create_account, delete_account, delivery_options, get_refund, recover_password, review, set_up_shipping_address, switch_account, track_refund
- {{Client First Name}}, typically present in: Intents: check_invoice, get_invoice
- {{Client Last Name}}, typically present in: Intents: check_invoice, create_account, get_invoice
- {{Customer Support Phone Number}}, typically present in: Intents: change_shipping_address, contact_customer_service, contact_human_agent, payment_issue
- {{Customer Support Email}}, typically present in: Intents: cancel_order, change_shipping_address, check_invoice, check_refund_policy, complaint, contact_customer_service, contact_human_agent, get_invoice, get_refund, newsletter_subscription, payment_issue, recover_password, registration_problems, review, set_up_shipping_address, switch_account...
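Because entities appear in responses as {{Entity}} placeholders, downstream code often needs to pull the slot names out of a template before filling them in. A minimal sketch of such a helper; the regex and the example template are assumptions for illustration, not part of the dataset:

```python
import re

# Matches {{Entity Name}} placeholders, capturing the name inside the braces.
PLACEHOLDER = re.compile(r"\{\{([^{}]+)\}\}")

def extract_entities(text: str) -> list[str]:
    """Return the entity/slot names referenced by a template response."""
    return PLACEHOLDER.findall(text)

template = (
    "To cancel order {{Order Number}}, contact us at "
    "{{Customer Support Email}} or call {{Customer Support Phone Number}}."
)
print(extract_entities(template))
# ['Order Number', 'Customer Support Email', 'Customer Support Phone Number']
```

The same pattern can be used in reverse (with `re.sub` or `str.format`-style substitution) to fill templates with concrete values at generation time.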