100+ datasets found
  1. Sample-Training-Data-LLM

    • kaggle.com
    zip
    Updated May 4, 2024
    Cite
    Hemanthh Velliyangirie (2024). Sample-Training-Data-LLM [Dataset]. https://www.kaggle.com/datasets/hemanthhvv/sample-training-data-llm
    Explore at:
    Available download formats: zip (2164 bytes)
    Dataset updated
    May 4, 2024
    Authors
    Hemanthh Velliyangirie
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset

    This dataset was created by Hemanthh Velliyangirie

    Released under Apache 2.0


  2. Lucie-Training-Dataset

    • huggingface.co
    Cite
    OpenLLM France, Lucie-Training-Dataset [Dataset]. https://huggingface.co/datasets/OpenLLM-France/Lucie-Training-Dataset
    Explore at:
    Dataset authored and provided by
    OpenLLM France
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Lucie Training Dataset Card

    The Lucie Training Dataset is a curated collection of text data in English, French, German, Spanish and Italian culled from a variety of sources including: web data, video subtitles, academic papers, digital books, newspapers, and magazines, some of which were processed by Optical Character Recognition (OCR). It also contains samples of diverse programming languages. The Lucie Training Dataset was used to pretrain Lucie-7B, a foundation LLM with strong… See the full description on the dataset page: https://huggingface.co/datasets/OpenLLM-France/Lucie-Training-Dataset.

  3. grade-aware-llm-training-data

    • huggingface.co
    Cite
    Yiming Wang, grade-aware-llm-training-data [Dataset]. https://huggingface.co/datasets/yimingwang123/grade-aware-llm-training-data
    Explore at:
    Authors
    Yiming Wang
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Grade-Aware LLM Training Dataset

      Dataset Description
    

    This dataset contains 1,107,690 high-quality instruction-tuning examples for grade-aware text simplification, designed for fine-tuning large language models to simplify text to specific reading grade levels with precision and semantic consistency.

      Dataset Summary
    

    Total Examples: 1,107,690
    Task: Text simplification with precise grade-level targeting
    Language: English
    Grade Range: 1-12+ (precise 2-decimal… See the full description on the dataset page: https://huggingface.co/datasets/yimingwang123/grade-aware-llm-training-data.

  4. 📊 6.5k train examples for LLM Science Exam 📝

    • kaggle.com
    Updated Jul 22, 2023
    Cite
    Radek Osmulski (2023). 📊 6.5k train examples for LLM Science Exam 📝 [Dataset]. https://www.kaggle.com/datasets/radek1/additional-train-data-for-llm-science-exam
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 22, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Radek Osmulski
    Description

    I created this dataset using gpt-3.5-turbo.

    I put a lot of effort into making this dataset high quality, which allows you to achieve the highest score among the publicly available notebooks at the moment! 🥳

    Originally, I only uploaded 500 examples (they were used as train data in the notebook I mention above). They are stored in extra_train_set.csv.

    I am now uploading another 6k (6000_train_examples.csv) completely new train examples which brings the total to 6.5k.

    If you find this dataset useful, please leave an upvote! 😊 Thank you! 🙏🙏🙏

  5. Data Lineage For LLM Training Market Research Report 2033

    • dataintelo.com
    csv, pdf, pptx
    Updated Sep 30, 2025
    Cite
    Dataintelo (2025). Data Lineage For LLM Training Market Research Report 2033 [Dataset]. https://dataintelo.com/report/data-lineage-for-llm-training-market
    Explore at:
    Available download formats: pdf, pptx, csv
    Dataset updated
    Sep 30, 2025
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Data Lineage for LLM Training Market Outlook

    According to our latest research, the global Data Lineage for LLM Training market size reached USD 1.29 billion in 2024, with an impressive compound annual growth rate (CAGR) of 21.8% expected through the forecast period. By 2033, the market is projected to grow to USD 8.93 billion, as organizations worldwide recognize the critical importance of robust data lineage solutions in ensuring transparency, compliance, and efficiency in large language model (LLM) training. The primary growth driver stems from the surging adoption of generative AI and LLMs across diverse industries, necessitating advanced data lineage capabilities for responsible and auditable AI development.

    The exponential growth of the Data Lineage for LLM Training market is fundamentally driven by the increasing complexity and scale of data used in training modern AI models. As organizations deploy LLMs for a wide array of applications—from customer service automation to advanced analytics—the need for precise tracking of data provenance, transformation, and usage has become paramount. This trend is further amplified by the proliferation of multi-source and multi-format data, which significantly complicates the process of tracing data origins and transformations. Enterprises are investing heavily in data lineage solutions to ensure that their AI models are trained on high-quality, compliant, and auditable datasets, thereby reducing risks associated with data bias, inconsistency, and regulatory violations.

    Another significant growth factor is the evolving regulatory landscape surrounding AI and data governance. Governments and regulatory bodies worldwide are introducing stringent guidelines for data usage, privacy, and accountability in AI systems. Regulations such as the European Union’s AI Act and the U.S. AI Bill of Rights are compelling organizations to implement comprehensive data lineage practices to demonstrate compliance and mitigate legal risks. This regulatory pressure is particularly pronounced in highly regulated industries such as banking, healthcare, and government, where the consequences of non-compliance can be financially and reputationally devastating. As a result, the demand for advanced data lineage software and services is surging, driving market expansion.

    Technological advancements in data management platforms and the integration of AI-driven automation are further catalyzing the growth of the Data Lineage for LLM Training market. Modern data lineage tools now leverage machine learning and natural language processing to automatically map data flows, detect anomalies, and generate real-time lineage reports. These innovations drastically reduce the manual effort required for lineage documentation and enhance the scalability of lineage solutions across large and complex data environments. The continuous evolution of such technologies is enabling organizations to achieve higher levels of transparency, trust, and operational efficiency in their AI workflows, thereby fueling market growth.

    Regionally, North America dominates the Data Lineage for LLM Training market, accounting for over 42% of the global market share in 2024. This dominance is attributed to the early adoption of AI technologies, the presence of leading technology vendors, and a mature regulatory environment. Europe follows closely, driven by strict data governance regulations and a rapidly growing AI ecosystem. The Asia Pacific region is witnessing the fastest growth, with a projected CAGR of 24.6% through 2033, fueled by digital transformation initiatives, increased AI investments, and a burgeoning startup landscape. Latin America and the Middle East & Africa are also emerging as promising markets, albeit at a relatively nascent stage.

    Component Analysis

    The Data Lineage for LLM Training market is segmented by component into software and services, each playing a pivotal role in supporting organizations’ lineage initiatives. The software segment holds the largest market share, accounting for nearly 68% of the total market revenue in 2024. This dominance is primarily due to the widespread adoption of advanced data lineage platforms that offer features such as automated lineage mapping, visualization, impact analysis, and integration with existing data management and AI training workflows. These platforms are essential for organ

  6. Bitext-travel-llm-chatbot-training-dataset

    • huggingface.co
    Updated Jun 21, 2025
    + more versions
    Cite
    Bitext (2025). Bitext-travel-llm-chatbot-training-dataset [Dataset]. https://huggingface.co/datasets/bitext/Bitext-travel-llm-chatbot-training-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 21, 2025
    Dataset authored and provided by
    Bitext
    License

    https://choosealicense.com/licenses/cdla-sharing-1.0/

    Description

    Bitext - Travel Tagged Training Dataset for LLM-based Virtual Assistants

      Overview
    

    This hybrid synthetic dataset is designed to be used to fine-tune Large Language Models such as GPT, Mistral and OpenELM, and has been generated using our NLP/NLG technology and our automated Data Labeling (DAL) tools. The goal is to demonstrate how Verticalization/Domain Adaptation for the [Travel] sector can be easily achieved using our two-step approach to LLM Fine-Tuning. An overview of… See the full description on the dataset page: https://huggingface.co/datasets/bitext/Bitext-travel-llm-chatbot-training-dataset.

  7. LLM Science Exam Training Data Wiki Pages

    • kaggle.com
    zip
    Updated Jul 18, 2023
    Cite
    Jude Hunt (2023). LLM Science Exam Training Data Wiki Pages [Dataset]. https://www.kaggle.com/datasets/judehunt23/llm-science-exam-training-data-wiki-pages
    Explore at:
    Available download formats: zip (2843758 bytes)
    Dataset updated
    Jul 18, 2023
    Authors
    Jude Hunt
    Description

    Text extracts for each section of the Wikipedia pages used to generate the training dataset in the LLM Science Exam competition, plus extracts from the Wikipedia category "Concepts in Physics".

    Each page is broken down by section titles, and should also include a "Summary" section.

  8. LLMs Data (2018-2024)

    • kaggle.com
    zip
    Updated May 19, 2024
    Cite
    jaina (2024). LLMs Data (2018-2024) [Dataset]. https://www.kaggle.com/datasets/jainaru/llms-data-2018-2024
    Explore at:
    Available download formats: zip (23351 bytes)
    Dataset updated
    May 19, 2024
    Authors
    jaina
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Every major LLM and chatbot released since 2018, with the developing company and the number of parameters (in billions) used in training.

    Data Columns

    1. Model: The name of the language model.
    2. Company: The company that developed the model.
    3. Arch: The architecture of the model (e.g., Transformer, RNN). TBA means To Be Announced.
    4. Parameters: The number of parameters (weights) in the model, in billions; a measure of its complexity.
    5. Tokens: The number of tokens (sub-word units) the model can process or was trained on, in billions. Some values are TBA.
    6. Ratio: Likely the ratio of parameters to tokens, or some other relevant ratio. In this table, it is specified only for Olympus, as 20:1.
    7. ALScore: A quick-and-dirty rating of the model's power. The formula is: square root of (Parameters x Tokens).
    8. Training dataset: The dataset used to train the model.
    9. Release Date: The expected or actual release date of the model.
    10. Notes: Additional notes about the model, such as training details or related information.
    11. Playground: A URL link to a website where you can interact with the model or find more information about it.

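
    The ALScore formula in column 7 can be sketched in Python. A minimal illustration, assuming only the formula as described; the function name and example figures are hypothetical, not taken from the dataset:

```python
import math

def alscore(parameters_b: float, tokens_b: float) -> float:
    """Quick-and-dirty power rating: square root of (parameters x tokens),
    with both values expressed in billions, per the column description."""
    return math.sqrt(parameters_b * tokens_b)

# Hypothetical example: a 70B-parameter model trained on 2,000B (2T) tokens
print(round(alscore(70, 2000), 1))  # 374.2
```
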
  9. LLM Training Dataset

    • kaggle.com
    zip
    Updated May 18, 2025
    Cite
    Mehmet Deniz Kaya (2025). LLM Training Dataset [Dataset]. https://www.kaggle.com/datasets/mehmetdenizkaya/llm-training-dataset
    Explore at:
    Available download formats: zip (1773182 bytes)
    Dataset updated
    May 18, 2025
    Authors
    Mehmet Deniz Kaya
    Description

    Dataset

    This dataset was created by Mehmet Deniz Kaya

    Released under Other (specified in description)


  10. Top web domains cited by LLMs 2025

    • statista.com
    Updated Jun 29, 2025
    Cite
    Statista (2025). Top web domains cited by LLMs 2025 [Dataset]. https://www.statista.com/statistics/1620335/top-web-domains-cited-by-llms/
    Explore at:
    Dataset updated
    Jun 29, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Jun 2025
    Area covered
    Worldwide
    Description

    A June 2025 study found that ****** was the most frequently cited web domain by large language models (LLMs). The platform was referenced in approximately ** percent of the analyzed cases, likely due to the content licensing agreement between Google and Reddit in early 2024 for the purpose of training AI models. ********* ranked second, being mentioned in roughly ** percent of cases, while ****** and ******* were mentioned in ** percent.

  11. Pre-training Text Data | 50 Millions | Unsupervised Text Data | Large...

    • datarade.ai
    Updated Jan 3, 2025
    Cite
    Nexdata (2025). Pre-training Text Data | 50 Millions | Unsupervised Text Data | Large Language Model(LLM) Data [Dataset]. https://datarade.ai/data-products/nexdata-unsupervised-text-data-1-pb-foundation-model-nexdata
    Explore at:
    Available download formats: .bin, .json, .xml, .csv, .xls, .sql, .txt
    Dataset updated
    Jan 3, 2025
    Dataset authored and provided by
    Nexdata
    Area covered
    Mexico, United Kingdom, France, Philippines, Germany, Korea (Republic of), Spain, China, Malaysia, Taiwan
    Description
    1. Overview: Off-the-shelf 50 million pre-training text entries, covering test questions, textbooks, ebooks, journals and papers, multi-round dialogue text, etc.

    2. About Nexdata: Nexdata owns off-the-shelf PB-level Large Language Model (LLM) data, 1 million hours of audio data and 800TB of annotated imagery data. This ready-to-go data supports instant delivery and can quickly improve the accuracy of AI models. For more details, please visit us at https://www.nexdata.ai/datasets/llm?source=Datarade

  12. resume-screening-llm-training-dataset

    • kaggle.com
    zip
    Updated Sep 11, 2025
    Cite
    Syncora_ai (2025). resume-screening-llm-training-dataset [Dataset]. https://www.kaggle.com/datasets/syncoraai/resume-screening-llm-training-dataset
    Explore at:
    Available download formats: zip (60353 bytes)
    Dataset updated
    Sep 11, 2025
    Authors
    Syncora_ai
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Resume Screening & HR Conversations Dataset for LLM Training

    Recruitment and career advisory teams in the HR industry often face challenges with sensitive, hard-to-access data. This dataset removes that barrier by providing synthetic HR conversations and resume screening Q&A, structured for LLM training in JSONL format.

    It enables HR teams and AI developers to build smarter internal chatbots, automate candidate screening, accelerate onboarding workflows, and create AI-powered career advisory tools, all while keeping data privacy intact. This helps organizations improve efficiency, reduce manual effort, and scale AI-driven HR solutions.

    By Syncora.ai, enabling privacy-safe, high-quality synthetic data for smarter AI.

    ✅ Why This Dataset?

    • Accelerate HR AI development – No need to source or anonymize real resumes.
    • Ready for LLM Training – Structured in OpenAI-compatible JSONL format.
    • Privacy-Safe & Scalable – 100% synthetic, zero PII.

    📂 Dataset Description

    This dataset contains synthetic HR conversations and resume screening Q&A, formatted for LLM fine-tuning. Each record represents a dialogue simulating real-world HR workflows like candidate evaluation and career guidance.

    • Format: JSONL (OpenAI fine-tuning compatible)
    • Schema:
      • messages: An array of chat messages with roles (system, user, assistant) and their respective content.

    Example:

    {
     "messages": [
      {"role": "system", "content": "You are an informative assistant."},
      {"role": "user", "content": "What is AbdulMuiz Shaikh's current job title?"},
      {"role": "assistant", "content": "AbdulMuiz Shaikh's current job title is Associate Data Scientist."}
     ]
    }
    
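
    Records in this schema can be sanity-checked before fine-tuning. A minimal sketch, assuming only the messages/role/content structure shown above; the function name and validation rules are illustrative, not part of the dataset:

```python
import json

VALID_ROLES = {"system", "user", "assistant"}

def is_valid_chat_record(line: str) -> bool:
    """Return True if a JSONL line matches the messages schema shown above."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return False
    if not isinstance(record, dict):
        return False
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return False
    return all(
        isinstance(m, dict)
        and m.get("role") in VALID_ROLES
        and isinstance(m.get("content"), str)
        for m in messages
    )

sample = '{"messages": [{"role": "user", "content": "Hi"}]}'
print(is_valid_chat_record(sample))  # True
```
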

    🔍 What's Inside

    • Resume Screening Q&A
      • Job titles, experience, role responsibilities.
    • Career Guidance Questions
      • Tech trends, upskilling, and career planning.

    👥 Who Should Use This Dataset?

    • Recruitment Tech Startups – Build automated candidate screeners.
    • HR Teams – Deploy internal career chatbots.
    • AI Engineers & Researchers – Fine-tune LLMs for HR applications.
    • EdTech Platforms – Power career advisory assistants.

    🚀 How To Use

    Fine-tune with OpenAI:

    openai tools fine_tunes.prepare_data -f hr_resume_qna.jsonl
    openai api fine_tunes.create -t "hr_resume_qna.jsonl" -m "gpt-3.5-turbo"

    ✅ Why Synthetic?

    • No compliance headaches (GDPR, PII safe)
    • Faster experimentation for LLM applications
    • High-fidelity HR scenarios for realistic model behavior

    🔗 Start Generating Your Own Synthetic Data

    → Generate your own synthetic datasets now

  13. Large Language Models Comparison Dataset

    • kaggle.com
    zip
    Updated Feb 24, 2025
    Cite
    Samay Ashar (2025). Large Language Models Comparison Dataset [Dataset]. https://www.kaggle.com/datasets/samayashar/large-language-models-comparison-dataset
    Explore at:
    Available download formats: zip (5894 bytes)
    Dataset updated
    Feb 24, 2025
    Authors
    Samay Ashar
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset provides a comparison of various Large Language Models (LLMs) based on their performance, cost, and efficiency. It includes important details like speed, latency, benchmarks, and pricing, helping users understand how different models stack up against each other.

    Key Details:

    • File Name: llm_comparison_dataset.csv
    • Size: 14.57 kB
    • Total Columns: 15
    • License: CC0 (Public Domain)

    What’s Inside?

    Here are some of the key metrics included in the dataset:

    1. Context Window: Maximum number of tokens the model can process at once.
    2. Speed (tokens/sec): How fast the model generates responses.
    3. Latency (sec): Time delay before the model responds.
    4. Benchmark Scores: Performance ratings from MMLU (academic tasks) and Chatbot Arena (real-world chatbot performance).
    5. Open-Source: Indicates if the model is publicly available or proprietary.
    6. Price per Million Tokens: The cost of using the model for one million tokens.
    7. Training Dataset Size: Amount of data used to train the model.
    8. Compute Power: Resources needed to run the model.
    9. Energy Efficiency: How much power the model consumes.
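
    Metric 6 translates directly into usage cost. A minimal sketch; the function name and figures are hypothetical, not taken from the dataset:

```python
def usage_cost(tokens: int, price_per_million_tokens: float) -> float:
    """Cost of processing `tokens` tokens at the listed price per million tokens."""
    return tokens / 1_000_000 * price_per_million_tokens

# Hypothetical example: 250,000 tokens at $2.00 per million tokens
print(usage_cost(250_000, 2.00))  # 0.5
```
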

    This dataset is useful for researchers, developers, and AI enthusiasts who want to compare LLMs and choose the best one based on their needs.

    📌If you find this dataset useful, do give an upvote :)

  14. h

    Bitext-retail-ecommerce-llm-chatbot-training-dataset

    • huggingface.co
    Updated Aug 6, 2024
    + more versions
    Cite
    Bitext (2024). Bitext-retail-ecommerce-llm-chatbot-training-dataset [Dataset]. https://huggingface.co/datasets/bitext/Bitext-retail-ecommerce-llm-chatbot-training-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 6, 2024
    Dataset authored and provided by
    Bitext
    License

    https://choosealicense.com/licenses/cdla-sharing-1.0/

    Description

    Bitext - Retail (eCommerce) Tagged Training Dataset for LLM-based Virtual Assistants

      Overview
    

    This hybrid synthetic dataset is designed to be used to fine-tune Large Language Models such as GPT, Mistral and OpenELM, and has been generated using our NLP/NLG technology and our automated Data Labeling (DAL) tools. The goal is to demonstrate how Verticalization/Domain Adaptation for the [Retail (eCommerce)] sector can be easily achieved using our two-step approach to LLM… See the full description on the dataset page: https://huggingface.co/datasets/bitext/Bitext-retail-ecommerce-llm-chatbot-training-dataset.

  15. daigt-v3-train-dataset

    • kaggle.com
    zip
    Updated Dec 28, 2023
    Cite
    Darek Kłeczek (2023). daigt-v3-train-dataset [Dataset]. https://www.kaggle.com/datasets/thedrcat/daigt-v3-train-dataset
    Explore at:
    Available download formats: zip (86685168 bytes)
    Dataset updated
    Dec 28, 2023
    Authors
    Darek Kłeczek
    Description

    New release of DAIGT train dataset! New models: 'text-ada-001', 'text-babbage-001', 'text-curie-001', 'text-davinci-001', 'text-davinci-002', 'text-davinci-003'

    These models from OpenAI are getting deprecated, so I made sure to generate some essays with them and share them here. I also added the following public datasets (please upvote!):

    • https://www.kaggle.com/datasets/phanisrikanth/daigt-essays-from-intel-neural-chat-7b
    • https://www.kaggle.com/datasets/carlmcbrideellis/llm-mistral-7b-instruct-texts
    • https://www.kaggle.com/datasets/nbroad/daigt-data-llama-70b-and-falcon180b
    • https://www.kaggle.com/datasets/snassimr/gpt4-rephrased-llm-daigt-dataset

    All merged with my previous dataset for convenience (https://www.kaggle.com/datasets/thedrcat/daigt-v2-train-dataset)

    Enjoy ❤️

    Version 2 update:

    • removed NaNs and duplicated/short generations
    • applied cleaning procedure from @nbroad's notebook (give it an upvote please!)
    • added model column to indicate model family used in generations

  16. LLM RAG Chatbot Training Dataset

    • kaggle.com
    zip
    Updated May 20, 2025
    Cite
    Life Bricks Global (2025). LLM RAG Chatbot Training Dataset [Dataset]. https://www.kaggle.com/datasets/lifebricksglobal/llm-rag-chatbot-training-dataset
    Explore at:
    Available download formats: zip (199960 bytes)
    Dataset updated
    May 20, 2025
    Authors
    Life Bricks Global
    Description

    We’ve developed another annotated dataset designed specifically for conversational AI and companion AI model training.

    Watch: How To Use The Dataset

    What you have here on Kaggle is our free sample - Think Salon Kitty meets AI

    The 'Time Waster Identification & Retreat Model Dataset' enables AI handler agents to detect when users are likely to churn—saving valuable tokens and preventing wasted compute cycles in conversational models.

    This batch has 167 entries annotated for sentiment, intent, user risk flagging (via behavioural tracking), and user Recovery Potential per statement, among others. This dataset is designed to be a niche micro dataset for a specific use case: Time Waster Identification and Retreat.

    👉 Buy the updated version: https://lifebricksglobal.gumroad.com/l/Time-WasterDetection-Dataset

    This dataset is perfect for:

    • Fine-tuning LLM routing logic
    • Building intelligent AI agents for customer engagement
    • Companion AI training + moderation modelling

    This is part of a broader series of human-agent interaction datasets we are releasing under our independent data licensing program.

    It is designed for AI researchers and developers building:

    • Conversational AI agents
    • Companion AI models
    • Human-agent interaction simulators
    • LLM routing optimization models

    Use case:

    • Conversational AI
    • Companion AI
    • Defence & Aerospace
    • Customer Support AI
    • Gaming / Virtual Worlds
    • LLM Safety Research
    • AI Orchestration Platforms

    👉 Good for teams working on conversational AI, companion AI, fraud detectors and those integrating routing logic for voice/chat agents

    Contact us on LinkedIn: Life Bricks Global.

    License:

    This dataset is provided under a custom license. By using the dataset, you agree to the following terms:

    Usage: You are allowed to use the dataset for non-commercial purposes, including research, development, and machine learning model training.

    Modification: You may modify the dataset for your own use.

    Redistribution: Redistribution of the dataset in its original or modified form is not allowed without permission.

    Attribution: Proper attribution must be given when using or referencing this dataset.

    No Warranty: The dataset is provided "as-is" without any warranties, express or implied, regarding its accuracy, completeness, or fitness for a particular purpose.

  17. Large Language Model (LLM) Data | Machine Learning (ML) Data | AI Training...

    • data.mealme.ai
    Updated Jan 23, 2025
    + more versions
    Cite
    MealMe (2025). Large Language Model (LLM) Data | Machine Learning (ML) Data | AI Training Data (RAG) for 1M+ Global Grocery, Restaurant, and Retail Stores [Dataset]. https://data.mealme.ai/products/ai-training-data-rag-for-grocery-restaurant-and-retail-ra-mealme
    Explore at:
    Dataset updated
    Jan 23, 2025
    Dataset authored and provided by
    MealMe
    Area covered
    Venezuela, Wallis and Futuna, South Sudan, Madagascar, Somalia, Uzbekistan, Sao Tome and Principe, Greenland, Austria, Bosnia and Herzegovina
    Description

    Comprehensive training data on 1M+ stores across the US & Canada. Includes detailed menus, inventory, pricing, and availability. Ideal for AI/ML models, powering retrieval-augmented generation, search, and personalization systems.

  18. 300M Image-Caption Pairs – Large-Scale Vision-Language Dataset for AI...

    • m.nexdata.ai
    • nexdata.ai
    Updated Jan 30, 2025
    Cite
    Nexdata (2025). 300M Image-Caption Pairs – Large-Scale Vision-Language Dataset for AI Training [Dataset]. https://m.nexdata.ai/datasets/llm/1451?source=Github
    Explore at:
    Dataset updated
    Jan 30, 2025
    Dataset authored and provided by
    Nexdata
    Variables measured
    Data size, Data types, Data content, Data formats, Data resolution, Description languages
    Description

    300 Million Pairs of High-Quality Image-Caption Dataset includes a large-scale collection of photographic and vector images paired with English textual descriptions. The complete image library comprises nearly 300 million images, with a curated subset of 100 million high-quality image-caption pairs available for generative AI and vision-language model training. All images are authentic and legally licensed works created by professional photographers. The dataset primarily features English captions with minimal Chinese, offering diverse scenes, objects, and compositions suitable for tasks such as image captioning, visual question answering (VQA), image-text retrieval, and multimodal foundation model pretraining. The dataset supports large-scale LLM and VLM applications and complies with global data privacy and copyright regulations, including GDPR, CCPA, and PIPL.

  19. Foundation Model Data Collection and Data Annotation | Large Language...

    • data.nexdata.ai
    Updated Aug 15, 2024
    Cite
    Nexdata (2024). Foundation Model Data Collection and Data Annotation | Large Language Model(LLM) Data | SFT Data| Red Teaming Services [Dataset]. https://data.nexdata.ai/products/nexdata-foundation-model-data-solutions-llm-sft-rhlf-nexdata
    Explore at:
    Dataset updated
    Aug 15, 2024
    Dataset authored and provided by
    Nexdata
    Area covered
    Estonia, Nepal, Denmark, Costa Rica, Pakistan, Iran, Barbados, Lebanon, Grenada, Croatia
    Description

    For the high-quality training data required in unsupervised and supervised learning, Nexdata provides flexible and customized Large Language Model (LLM) data annotation services for tasks such as supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF).

  20. Customer support training data

    • kaggle.com
    zip
    Updated Feb 23, 2024
    Cite
    Talaviya Bhavik (2024). Customer support training data [Dataset]. https://www.kaggle.com/datasets/talaviyabhavik/customer-support-training-data
    Explore at:
    Available download formats: zip (3007673 bytes)
    Dataset updated
    Feb 23, 2024
    Authors
    Talaviya Bhavik
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Customer Service Tagged Training Dataset for LLM-based Virtual Assistants Overview This dataset can be used to train Large Language Models such as GPT, Llama 2, and Falcon, for both fine-tuning and domain adaptation.

    The dataset has the following specs:

    • Use Case: Intent Detection
    • Vertical: Customer Service
    • 27 intents assigned to 10 categories
    • 26872 question/answer pairs, around 1000 per intent
    • 30 entity/slot types
    • 12 different types of language generation tags

    The categories and intents have been selected from Bitext's collection of 20 vertical-specific datasets, covering the intents that are common across all 20 verticals. The verticals are:

    Automotive, Retail Banking, Education, Events & Ticketing, Field Services, Healthcare, Hospitality, Insurance, Legal Services, Manufacturing, Media Streaming, Mortgages & Loans, Moving & Storage, Real Estate/Construction, Restaurant & Bar Chains, Retail/E-commerce, Telecommunications, Travel, Utilities, Wealth Management

    Fields of the Dataset

    Each entry in the dataset contains the following fields:

    • flags: tags (explained below in the Language Generation Tags section)
    • instruction: a user request from the Customer Service domain
    • category: the high-level semantic category for the intent
    • intent: the intent corresponding to the user instruction
    • response: an example expected response from the virtual assistant

    Categories and Intents

    The categories and intents covered by the dataset are:
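    The five fields above describe each record's schema. As a minimal sketch of working with entries in that shape (the sample rows and flag values below are illustrative, not taken from the dataset), grouping by the intent field is a quick way to check the documented class balance of roughly 1000 pairs per intent:

    ```python
    from collections import Counter

    # Hypothetical entries mirroring the documented five-field schema.
    entries = [
        {"flags": "B", "instruction": "I want to cancel order {{Order Number}}",
         "category": "ORDER", "intent": "cancel_order",
         "response": "I can help you cancel order {{Order Number}}."},
        {"flags": "BL", "instruction": "how do i get my invoice?",
         "category": "INVOICE", "intent": "get_invoice",
         "response": "You can download it from {{Settings}}."},
    ]

    # Count examples per intent; on the full dataset each intent
    # should land near 1000.
    counts = Counter(e["intent"] for e in entries)
    print(counts)
    ```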

    • ACCOUNT: create_account, delete_account, edit_account, switch_account
    • CANCELLATION_FEE: check_cancellation_fee
    • DELIVERY: delivery_options
    • FEEDBACK: complaint, review
    • INVOICE: check_invoice, get_invoice
    • NEWSLETTER: newsletter_subscription
    • ORDER: cancel_order, change_order, place_order
    • PAYMENT: check_payment_methods, payment_issue
    • REFUND: check_refund_policy, track_refund
    • SHIPPING_ADDRESS: change_shipping_address, set_up_shipping_address

    Entities

    The entities covered by the dataset are:

    • {{Order Number}}, typically present in intents: cancel_order, change_order, change_shipping_address, check_invoice, check_refund_policy, complaint, delivery_options, delivery_period, get_invoice, get_refund, place_order, track_order, track_refund
    • {{Invoice Number}}, typically present in intents: check_invoice, get_invoice
    • {{Online Order Interaction}}, typically present in intents: cancel_order, change_order, check_refund_policy, delivery_period, get_refund, review, track_order, track_refund
    • {{Online Payment Interaction}}, typically present in intents: cancel_order, check_payment_methods
    • {{Online Navigation Step}}, typically present in intents: complaint, delivery_options
    • {{Online Customer Support Channel}}, typically present in intents: check_refund_policy, complaint, contact_human_agent, delete_account, delivery_options, edit_account, get_refund, payment_issue, registration_problems, switch_account
    • {{Profile}}, typically present in intent: switch_account
    • {{Profile Type}}, typically present in intent: switch_account
    • {{Settings}}, typically present in intents: cancel_order, change_order, change_shipping_address, check_cancellation_fee, check_invoice, check_payment_methods, contact_human_agent, delete_account, delivery_options, edit_account, get_invoice, newsletter_subscription, payment_issue, place_order, recover_password, registration_problems, set_up_shipping_address, switch_account, track_order, track_refund
    • {{Online Company Portal Info}}, typically present in intents: cancel_order, edit_account
    • {{Date}}, typically present in intents: check_invoice, check_refund_policy, get_refund, track_order, track_refund
    • {{Date Range}}, typically present in intents: check_cancellation_fee, check_invoice, get_invoice
    • {{Shipping Cut-off Time}}, typically present in intent: delivery_options
    • {{Delivery City}}, typically present in intent: delivery_options
    • {{Delivery Country}}, typically present in intents: check_payment_methods, check_refund_policy, delivery_options, review, switch_account
    • {{Salutation}}, typically present in intents: cancel_order, check_payment_methods, check_refund_policy, create_account, delete_account, delivery_options, get_refund, recover_password, review, set_up_shipping_address, switch_account, track_refund
    • {{Client First Name}}, typically present in intents: check_invoice, get_invoice
    • {{Client Last Name}}, typically present in intents: check_invoice, create_account, get_invoice
    • {{Customer Support Phone Number}}, typically present in intents: change_shipping_address, contact_customer_service, contact_human_agent, payment_issue
    • {{Customer Support Email}}, typically present in intents: cancel_order, change_shipping_address, check_invoice, check_refund_policy, complaint, contact_customer_service, contact_human_agent, get_invoice, get_refund, newsletter_subscription, payment_issue, recover_password, registration_problems, review, set_up_shipping_address, switch_account...
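    Responses in the dataset carry {{Entity}} placeholders like those listed above, which an application fills in at generation time. A minimal sketch of that substitution step (the slot values and the fill_slots helper are illustrative assumptions, not part of the dataset):

    ```python
    import re

    # A response template using the dataset's {{Entity}} placeholder convention.
    template = "I can help you cancel order {{Order Number}}."
    slots = {"Order Number": "12345"}

    def fill_slots(text: str, values: dict) -> str:
        # Replace each {{Entity}} with its slot value; leave unknown
        # placeholders untouched so they remain visible for debugging.
        return re.sub(r"\{\{(.+?)\}\}",
                      lambda m: values.get(m.group(1), m.group(0)),
                      text)

    print(fill_slots(template, slots))
    ```

    Leaving unresolved placeholders in place, rather than raising, makes it easy to spot missing slot values when inspecting generated responses.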
