100+ datasets found
  1. llm-training-dataset

    • huggingface.co
    Cite
    UniData, llm-training-dataset [Dataset]. https://huggingface.co/datasets/UniDataPro/llm-training-dataset
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets; learn more about it at mlcommons.org/croissant.
    Authors
    Unidata
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    LLM Fine-Tuning Dataset - 4,000,000+ logs, 32 languages

    The dataset contains more than 4 million logs written in 32 languages and is tailored for LLM training. It includes log and response pairs from 3 models and is designed for instruction fine-tuning of language models to achieve improved performance in various NLP tasks.

      Models used for text generation:
    • GPT-3.5
    • GPT-4
    • Uncensored GPT version (not included in the sample)

      Languages in the… See the full description on the dataset page: https://huggingface.co/datasets/UniDataPro/llm-training-dataset.
    
  2. LLM: 7 prompt training dataset

    • kaggle.com
    Updated Nov 15, 2023
    Cite
    Carl McBride Ellis (2023). LLM: 7 prompt training dataset [Dataset]. https://www.kaggle.com/datasets/carlmcbrideellis/llm-7-prompt-training-dataset
    Explore at:
    Croissant
    Dataset updated
    Nov 15, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Carl McBride Ellis
    License

    CDLA-Sharing-1.0: https://cdla.io/sharing-1-0/

    Description
    • Version 4: Adding the data from "LLM-generated essay using PaLM from Google Gen-AI" kindly generated by Kingki19 / Muhammad Rizqi.
      File: train_essays_RDizzl3_seven_v2.csv
      Human texts: 14,247; LLM texts: 3,004

      See also: a new dataset of an additional 4900 LLM generated texts: LLM: Mistral-7B Instruct texts



    • Version 3: "The RDizzl3 Seven"
      File: train_essays_RDizzl3_seven_v1.csv

    • "Car-free cities"

    • "Does the electoral college work?"

    • "Exploring Venus"

    • "The Face on Mars"

    • "Facial action coding system"

    • "A Cowboy Who Rode the Waves"

    • "Driverless cars"

    How this dataset was made: see the notebook "LLM: Make 7 prompt train dataset"

    • Version 2: (train_essays_7_prompts_v2.csv) This dataset is composed of 13,712 human texts and 1638 AI-LLM generated texts originating from 7 of the PERSUADE 2.0 corpus prompts.

    Namely:

    • "Car-free cities"
    • "Does the electoral college work?"
    • "Exploring Venus"
    • "The Face on Mars"
    • "Facial action coding system"
    • "Seeking multiple opinions"
    • "Phones and driving"

    This dataset is a derivative of the datasets

    as well as the original competition training dataset

    • Version 1: This dataset is composed of 13,712 human texts and 1165 AI-LLM generated texts originating from 7 of the PERSUADE 2.0 corpus prompts.
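A file with this human/LLM composition can be split by its label column. A minimal pandas sketch, where the column names `text` and `generated` are assumptions for illustration (the listing does not show the actual schema):

```python
import pandas as pd

# Toy stand-in for a combined human/LLM essay file such as
# train_essays_RDizzl3_seven_v2.csv; column names are hypothetical.
df = pd.DataFrame({
    "text": ["essay about car-free cities", "essay about exploring Venus",
             "model-written essay on driverless cars"],
    "generated": [0, 0, 1],  # 0 = human-written, 1 = LLM-generated
})

# Split rows by label to inspect each subset separately.
human = df[df["generated"] == 0]
llm = df[df["generated"] == 1]
print(len(human), len(llm))
```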
  3. llm-sgd-dst8-training-data

    • huggingface.co
    Updated Aug 4, 2023
    Cite
    Ammer Ayach (2023). llm-sgd-dst8-training-data [Dataset]. https://huggingface.co/datasets/amay01/llm-sgd-dst8-training-data
    Explore at:
    Croissant
    Dataset updated
    Aug 4, 2023
    Authors
    Ammer Ayach
    Description

    Dataset Card for "llm-sgd-dst8-training-data"

    More Information needed

  4. Bitext-travel-llm-chatbot-training-dataset

    • huggingface.co
    Updated Jun 21, 2025
    + more versions
    Cite
    Bitext (2025). Bitext-travel-llm-chatbot-training-dataset [Dataset]. https://huggingface.co/datasets/bitext/Bitext-travel-llm-chatbot-training-dataset
    Explore at:
    Croissant
    Dataset updated
    Jun 21, 2025
    Dataset authored and provided by
    Bitext
    License

    CDLA-Sharing-1.0: https://choosealicense.com/licenses/cdla-sharing-1.0/

    Description

    Bitext - Travel Tagged Training Dataset for LLM-based Virtual Assistants

      Overview
    

    This hybrid synthetic dataset is designed to be used to fine-tune Large Language Models such as GPT, Mistral and OpenELM, and has been generated using our NLP/NLG technology and our automated Data Labeling (DAL) tools. The goal is to demonstrate how Verticalization/Domain Adaptation for the [Travel] sector can be easily achieved using our two-step approach to LLM Fine-Tuning. An overview of… See the full description on the dataset page: https://huggingface.co/datasets/bitext/Bitext-travel-llm-chatbot-training-dataset.

  5. FileMarket | 20,000 photos | AI Training Data | Large Language Model (LLM)...

    • datarade.ai
    Updated Jun 28, 2024
    Cite
    FileMarket (2024). FileMarket | 20,000 photos | AI Training Data | Large Language Model (LLM) Data | Machine Learning (ML) Data | Deep Learning (DL) Data | [Dataset]. https://datarade.ai/data-products/filemarket-ai-training-data-large-language-model-llm-data-filemarket
    Explore at:
    .bin, .json, .xml, .csv, .xls, .sql, .txt (available download formats)
    Dataset updated
    Jun 28, 2024
    Dataset authored and provided by
    FileMarket
    Area covered
    Saudi Arabia, Antigua and Barbuda, Brazil, Papua New Guinea, Colombia, Benin, China, Saint Kitts and Nevis, French Southern Territories, Central African Republic
    Description

    FileMarket provides premium Large Language Model (LLM) Data designed to support and enhance a wide range of AI applications. Our globally sourced LLM Data sets are meticulously curated to ensure high quality, diversity, and accuracy, making them ideal for training robust and reliable language models. In addition to LLM Data, we also offer comprehensive datasets across Object Detection Data, Machine Learning (ML) Data, Deep Learning (DL) Data, and Biometric Data. Each dataset is carefully crafted to meet the specific needs of cutting-edge AI and machine learning projects.

    Key use cases of our Large Language Model (LLM) Data:

    • Text generation
    • Chatbots and virtual assistants
    • Machine translation
    • Sentiment analysis
    • Speech recognition
    • Content summarization

    Why choose FileMarket's data:

    • Object Detection Data: Essential for training AI in image and video analysis.
    • Machine Learning (ML) Data: Ideal for a broad spectrum of applications, from predictive analysis to NLP.
    • Deep Learning (DL) Data: Designed to support complex neural networks and deep learning models.
    • Biometric Data: Specialized for facial recognition, fingerprint analysis, and other biometric applications.

    FileMarket's premier sources for top-tier Large Language Model (LLM) Data and other specialized datasets ensure your AI projects drive innovation and achieve success across various applications.

  6. LLM - Detect AI Generated Text Dataset

    • kaggle.com
    Updated Nov 8, 2023
    Cite
    sunil thite (2023). LLM - Detect AI Generated Text Dataset [Dataset]. https://www.kaggle.com/datasets/sunilthite/llm-detect-ai-generated-text-dataset
    Explore at:
    Croissant
    Dataset updated
    Nov 8, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    sunil thite
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This dataset contains both AI-generated and human-written essays for training purposes. The challenge is to develop a machine learning model that can accurately detect whether an essay was written by a student or an LLM. The competition dataset comprises a mix of student-written essays and essays generated by a variety of LLMs.

    The dataset contains more than 28,000 essays, both student-written and AI-generated.

    Features:
    1. text: the essay text
    2. generated: the target label (0 = human-written essay, 1 = AI-generated essay)
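With only the two fields described above (text and a 0/1 generated label), the class balance can be computed directly. The inline CSV below is a made-up stand-in for the real file:

```python
import csv
import io

# Hypothetical miniature of the described schema: a "text" column and a
# binary "generated" label (0 = human, 1 = AI-generated).
sample = io.StringIO(
    "text,generated\n"
    '"An essay written by a student.",0\n'
    '"An essay produced by an LLM.",1\n'
    '"Another student essay.",0\n'
)

rows = list(csv.DictReader(sample))
# Share of AI-generated essays in the sample.
ai_share = sum(int(r["generated"]) for r in rows) / len(rows)
print(f"{ai_share:.2%} of essays are AI-generated")
```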

  7. Large Language Model (LLM) Training Data | 236 Countries | AI-Enhanced...

    • datarade.ai
    Updated Apr 15, 2025
    Cite
    Silencio Network (2025). Large Language Model (LLM) Training Data | 236 Countries | AI-Enhanced Ground Truth Based | 10M+ Hours of Measurements | 100% Traceable Consent [Dataset]. https://datarade.ai/data-products/large-language-model-llm-training-data-236-countries-ai-silencio-network
    Explore at:
    .json, .xml, .csv, .xls (available download formats)
    Dataset updated
    Apr 15, 2025
    Dataset provided by
    Quickkonnect UG
    Authors
    Silencio Network
    Area covered
    Libya, Taiwan, United Arab Emirates, Hungary, Sri Lanka, Saint Kitts and Nevis, Serbia, Puerto Rico, Guernsey, Oman
    Description

    Silencio’s interpolation dataset delivers spatially continuous noise data combining:
    • 10M+ hours of real dBA measurements
    • AI-generated interpolations

    Applications:
    • AI-based acoustic mapping
    • Digital twin and simulation models
    • Ground-truth data for AI validation

    Delivered via CSV or S3. GDPR-compliant.

  8. Large Language Model (LLM) Training Data | 236 Countries | AI-Enhanced...

    • storefront.silencio.network
    Updated Jun 16, 2025
    Cite
    Silencio Network (2025). Large Language Model (LLM) Training Data | 236 Countries | AI-Enhanced Ground Truth Based | 10M+ Hours of Measurements | 100% Traceable Consent [Dataset]. https://storefront.silencio.network/products/large-language-model-llm-training-data-236-countries-ai-silencio-network
    Explore at:
    Dataset updated
    Jun 16, 2025
    Dataset provided by
    Quickkonnect UG
    Authors
    Silencio Network
    Area covered
    Gambia, Timor-Leste, New Zealand, Morocco, Federated States of, Andorra, Virgin Islands, Samoa, Kuwait, Singapore
    Description

    Interpolated noise dataset built on 10M+ hours of real-world acoustic data combined with AI-generated predictions. Ideal for map generation, AI training, and model validation.

  9. Machine Learning (ML) Data | 800M+ B2B Profiles | AI-Ready for Deep Learning...

    • datarade.ai
    .json, .csv
    Cite
    Xverum, Machine Learning (ML) Data | 800M+ B2B Profiles | AI-Ready for Deep Learning (DL), NLP & LLM Training [Dataset]. https://datarade.ai/data-products/xverum-company-data-b2b-data-belgium-netherlands-denm-xverum
    Explore at:
    .json, .csv (available download formats)
    Dataset provided by
    Xverum LLC
    Authors
    Xverum
    Area covered
    Dominican Republic, India, United Kingdom, Western Sahara, Norway, Barbados, Jordan, Oman, Sint Maarten (Dutch part), Cook Islands
    Description

    Xverum’s AI & ML Training Data provides one of the most extensive datasets available for AI and machine learning applications, featuring 800M B2B profiles with 100+ attributes. This dataset is designed to enable AI developers, data scientists, and businesses to train robust and accurate ML models. From natural language processing (NLP) to predictive analytics, our data empowers a wide range of industries and use cases with unparalleled scale, depth, and quality.

    What Makes Our Data Unique?

    Scale and Coverage:
    - A global dataset encompassing 800M B2B profiles from a wide array of industries and geographies.
    - Includes coverage across the Americas, Europe, Asia, and other key markets, ensuring worldwide representation.

    Rich Attributes for Training Models:
    - Over 100 fields of detailed information, including company details, job roles, geographic data, industry categories, past experiences, and behavioral insights.
    - Tailored for training models in NLP, recommendation systems, and predictive algorithms.

    Compliance and Quality:
    - Fully GDPR and CCPA compliant, providing secure and ethically sourced data.
    - Extensive data cleaning and validation processes ensure reliability and accuracy.

    Annotation-Ready:
    - Pre-structured and formatted datasets that are easily ingestible into AI workflows.
    - Ideal for supervised learning with tagging options such as entities, sentiment, or categories.

    How Is the Data Sourced?
    - Publicly available information gathered through advanced, GDPR-compliant web aggregation techniques.
    - Proprietary enrichment pipelines that validate, clean, and structure raw data into high-quality datasets.
    This approach ensures we deliver comprehensive, up-to-date, and actionable data for machine learning training.

    Primary Use Cases and Verticals

    Natural Language Processing (NLP): Train models for named entity recognition (NER), text classification, sentiment analysis, and conversational AI. Ideal for chatbots, language models, and content categorization.

    Predictive Analytics and Recommendation Systems: Enable personalized marketing campaigns by predicting buyer behavior. Build smarter recommendation engines for ecommerce and content platforms.

    B2B Lead Generation and Market Insights: Create models that identify high-value leads using enriched company and contact information. Develop AI systems that track trends and provide strategic insights for businesses.

    HR and Talent Acquisition AI: Optimize talent-matching algorithms using structured job descriptions and candidate profiles. Build AI-powered platforms for recruitment analytics.

    How This Product Fits Into Xverum’s Broader Data Offering

    Xverum is a leading provider of structured, high-quality web datasets. While we specialize in B2B profiles and company data, we also offer complementary datasets tailored for specific verticals, including ecommerce product data, job listings, and customer reviews. The AI Training Data is a natural extension of our core capabilities, bridging the gap between structured data and machine learning workflows. By providing annotation-ready datasets, real-time API access, and customization options, we ensure our clients can seamlessly integrate our data into their AI development processes.

    Why Choose Xverum?
    - Experience and Expertise: A trusted name in structured web data with a proven track record.
    - Flexibility: Datasets can be tailored for any AI/ML application.
    - Scalability: With 800M profiles and more being added, you’ll always have access to fresh, up-to-date data.
    - Compliance: We prioritize data ethics and security, ensuring all data adheres to GDPR and other legal frameworks.

    Ready to supercharge your AI and ML projects? Explore Xverum’s AI Training Data to unlock the potential of 800M global B2B profiles. Whether you’re building a chatbot, predictive algorithm, or next-gen AI application, our data is here to help.

    Contact us for sample datasets or to discuss your specific needs.

  10. 10K rewritten texts dataset/LLM Prompt Recovery

    • kaggle.com
    zip
    Updated Apr 8, 2024
    Cite
    Aisha AL Mahmoud (2024). 10K rewritten texts dataset/LLM Prompt Recovery [Dataset]. https://www.kaggle.com/datasets/aishaalmahmoud/10k-rewritten-texts-datasetllm-prompt-recovery
    Explore at:
    zip (0 bytes; available download formats)
    Dataset updated
    Apr 8, 2024
    Authors
    Aisha AL Mahmoud
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    About 10,000 rewritten texts generated with Gemma 7b-it; the original texts come from the "Support" column of train.csv in the SciQ (Scientific Question Answering) dataset.

    If you find it useful, upvote it.

  11. Large Language Model (LLM) Data | Machine Learning (ML) Data | AI Training...

    • data.mealme.ai
    Updated Jan 29, 2025
    + more versions
    Cite
    MealMe (2025). Large Language Model (LLM) Data | Machine Learning (ML) Data | AI Training Data (RAG) for 1M+ Global Grocery, Restaurant, and Retail Stores [Dataset]. https://data.mealme.ai/products/ai-training-data-rag-for-grocery-restaurant-and-retail-ra-mealme
    Explore at:
    Dataset updated
    Jan 29, 2025
    Dataset authored and provided by
    MealMe
    Area covered
    Wallis and Futuna, Austria, Bosnia and Herzegovina, Venezuela, Uzbekistan, South Sudan, Madagascar, Sao Tome and Principe, Somalia, Greenland
    Description

    Comprehensive training data on 1M+ stores across the US & Canada. Includes detailed menus, inventory, pricing, and availability. Ideal for AI/ML models, powering retrieval-augmented generation, search, and personalization systems.

  12. Image and Video Description Data | 1 PB | Multimodal Data | GenAI | LLM Data...

    • datarade.ai
    Updated Jan 3, 2025
    Cite
    Nexdata (2025). Image and Video Description Data | 1 PB | Multimodal Data | GenAI | LLM Data | Large Language Model(LLM) Data| AI Datasets [Dataset]. https://datarade.ai/data-products/nexdata-image-and-video-description-data-1-pb-multimoda-nexdata
    Explore at:
    .bin, .json, .xml, .csv, .xls, .sql, .txt (available download formats)
    Dataset updated
    Jan 3, 2025
    Dataset authored and provided by
    Nexdata
    Area covered
    Czech Republic, Malta, Belgium, Israel, Mexico, Canada, Ecuador, United Arab Emirates, Netherlands, Finland
    Description
    1. Image Description Data
       Data Size: 500 million pairs
       Image Type: generic scenes (portraits, landscapes, animals, etc.), human actions, picture books, magazines, PPT & charts, app screenshots, etc.
       Resolution: 4K+
       Description Language: English, Spanish, Portuguese, French, Korean, German, Chinese, Japanese
       Description Length: text length is no less than 250 words
       Format: the image format is .jpg, the annotation format is .json, and the description format is .txt

    2. Video Description Data
       Data Size: 10 million pairs
       Video Type: generic scenes (portraits, landscapes, animals, etc.), ads, TV sports, documentaries
       Resolution: 1080p+
       Description Language: English, Spanish, Portuguese, French, Korean, German, Chinese, Japanese
       Description Length: text length is no less than 250 words
       Format: .mp4, .mov, .avi, and other common formats; .xlsx (annotation file format)

    3. About Nexdata
       Nexdata owns off-the-shelf PB-level Large Language Model (LLM) data, 1 million hours of audio data, and 800 TB of annotated imagery data. This ready-to-go data supports instant delivery and quickly improves the accuracy of AI models. For more details, please visit us at https://www.nexdata.ai/datasets/llm?source=Datarade
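Given a per-item layout like the one described above (a .jpg image, a .json annotation, and a .txt description sharing one filename stem), the three files can be paired by stem. A minimal sketch with invented file names; nothing here reflects Nexdata's actual delivery structure:

```python
from pathlib import Path
import tempfile

# Build a toy directory mimicking the described layout: each item is a
# .jpg image plus a .json annotation and a .txt description.
root = Path(tempfile.mkdtemp())
for name in ["scene_001", "scene_002"]:
    (root / f"{name}.jpg").write_bytes(b"")            # image placeholder
    (root / f"{name}.json").write_text("{}")           # annotation
    (root / f"{name}.txt").write_text("a description") # caption text

# Pair the three files for each item by shared stem.
pairs = {
    img.stem: {
        "image": img,
        "annotation": img.with_suffix(".json"),
        "description": img.with_suffix(".txt"),
    }
    for img in sorted(root.glob("*.jpg"))
}
print(sorted(pairs))
```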

  13. LLM prompts in the context of machine learning

    • kaggle.com
    Updated Jul 1, 2024
    Cite
    Jordan Nelson (2024). LLM prompts in the context of machine learning [Dataset]. https://www.kaggle.com/datasets/jordanln/llm-prompts-in-the-context-of-machine-learning
    Explore at:
    Croissant
    Dataset updated
    Jul 1, 2024
    Dataset provided by
    Kaggle
    Authors
    Jordan Nelson
    License

    CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset is an extension of my previous work on creating a dataset for natural language processing tasks. It leverages binary representation to characterise various machine learning models. The attributes in the dataset are derived from a dictionary, which was constructed from a corpus of prompts typically provided to a large language model (LLM). These prompts reference specific machine learning algorithms and their implementations. For instance, consider a user asking an LLM or a generative AI to create a Multi-Layer Perceptron (MLP) model for a particular application. By applying this concept to multiple machine learning models, we constructed our corpus. This corpus was then transformed into the current dataset using a bag-of-words approach.

    In this dataset, each attribute corresponds to a word from our dictionary, represented as a binary value: 1 indicates the presence of the word in a given prompt, and 0 indicates its absence. At the end of each entry, there is a label. Each entry in the dataset pertains to a single class, where each class represents a distinct machine learning model or algorithm. This dataset is intended for multi-class classification tasks, not multi-label classification, as each entry is associated with only one label and does not belong to multiple labels simultaneously.

    This dataset has been utilised with a Convolutional Neural Network (CNN) using the Keras Automodel API, achieving impressive training and testing accuracy rates exceeding 97%. Post-training, the model's predictive performance was rigorously evaluated in a production environment, where it continued to demonstrate exceptional accuracy. For this evaluation, we employed a series of questions, which are listed below. These questions were intentionally designed to be similar to ensure that the model can effectively distinguish between different machine learning models, even when the prompts are closely related.
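The binary bag-of-words encoding described above can be sketched in a few lines; the vocabulary here is a toy stand-in for the real dictionary built from the prompt corpus:

```python
# Toy vocabulary standing in for the dictionary built from the corpus.
vocabulary = ["knn", "decision", "tree", "classify", "model", "spam"]

def encode(prompt: str) -> list[int]:
    """Binary bag-of-words: 1 if the vocabulary word appears in the
    prompt, 0 otherwise (presence only, not counts)."""
    words = set(prompt.lower().split())
    return [1 if w in words else 0 for w in vocabulary]

vec = encode("How would you create a KNN model to classify emails as spam")
print(vec)  # one binary attribute per dictionary word
```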

    KNN
    • How would you create a KNN model to classify emails as spam or not spam based on their content and metadata?
    • How could you implement a KNN model to classify handwritten digits using the MNIST dataset?
    • How would you use a KNN approach to build a recommendation system for suggesting movies to users based on their ratings and preferences?
    • How could you employ a KNN algorithm to predict the price of a house based on features such as its location, size, and number of bedrooms, etc.?
    • Can you create a KNN model for classifying different species of flowers based on their petal length, petal width, sepal length, and sepal width?
    • How would you utilise a KNN model to predict the sentiment (positive, negative, or neutral) of text reviews or comments?
    • Can you create a KNN model for me that could be used in malware classification?
    • Can you make me a KNN model that can detect a network intrusion when looking at encrypted network traffic?
    • Can you make a KNN model that would predict the stock price of a given stock for the next week?
    • Can you create a KNN model that could be used to detect malware when using a dataset relating to certain permissions a piece of software may have access to?

    Decision Tree
    • Can you describe the steps involved in building a decision tree model to classify medical images as malignant or benign for cancer diagnosis and return a model for me?
    • How can you utilise a decision tree approach to develop a model for classifying news articles into different categories (e.g., politics, sports, entertainment) based on their textual content?
    • What approach would you take to create a decision tree model for recommending personalised university courses to students based on their academic strengths and weaknesses?
    • Can you describe how to create a decision tree model for identifying potential fraud in financial transactions based on transaction history, user behaviour, and other relevant data?
    • In what ways might you apply a decision tree model to classify customer complaints into different categories determining the severity of language used?
    • Can you create a decision tree classifier for me?
    • Can you make me a decision tree model that will help me determine the best course of action across a given set of strategies?
    • Can you create a decision tree model for me that can recommend certain cars to customers based on their preferences and budget?
    • How can you make a decision tree model that will predict the movement of star constellations in the sky based on data provided by the NASA website?
    • How do I create a decision tree for time-series forecasting?

    Random Forest
    • Can you describe the steps involved in building a random forest model to classify different types of anomalies in network traffic data for cybersecurity purposes and return the code for me?
    • In what ways could you implement a random forest model to predict the severity of traffic congestion in urban areas based on historical traffic patterns, weather...

  14. TagX Data collection for AI/ ML training | LLM data | Data collection for AI...

    • datarade.ai
    .json, .csv, .xls
    Updated Jun 18, 2021
    Cite
    TagX (2021). TagX Data collection for AI/ ML training | LLM data | Data collection for AI development & model finetuning | Text, image, audio, and document data [Dataset]. https://datarade.ai/data-products/data-collection-and-capture-services-tagx
    Explore at:
    .json, .csv, .xls (available download formats)
    Dataset updated
    Jun 18, 2021
    Dataset authored and provided by
    TagX
    Area covered
    Qatar, Belize, Russian Federation, Benin, Djibouti, Iceland, Saudi Arabia, Antigua and Barbuda, Equatorial Guinea, Colombia
    Description

    We offer comprehensive data collection services that cater to a wide range of industries and applications. Whether you require image, audio, or text data, we have the expertise and resources to collect and deliver high-quality data that meets your specific requirements. Our data collection methods include manual collection, web scraping, and other automated techniques that ensure accuracy and completeness of data.

    Our team of experienced data collectors and quality assurance professionals ensure that the data is collected and processed according to the highest standards of quality. We also take great care to ensure that the data we collect is relevant and applicable to your use case. This means that you can rely on us to provide you with clean and useful data that can be used to train machine learning models, improve business processes, or conduct research.

    We are committed to delivering data in the format that you require. Whether you need raw data or a processed dataset, we can deliver the data in your preferred format, including CSV, JSON, or XML. We understand that every project is unique, and we work closely with our clients to ensure that we deliver the data that meets their specific needs. So if you need reliable data collection services for your next project, look no further than us.

  15. daigt-v3-train-dataset

    • kaggle.com
    Updated Dec 28, 2023
    Cite
    Darek Kłeczek (2023). daigt-v3-train-dataset [Dataset]. https://www.kaggle.com/datasets/thedrcat/daigt-v3-train-dataset
    Explore at:
    Croissant
    Dataset updated
    Dec 28, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Darek Kłeczek
    Description

    New release of DAIGT train dataset! New models: 'text-ada-001', 'text-babbage-001', 'text-curie-001', 'text-davinci-001', 'text-davinci-002', 'text-davinci-003'

    These models from OpenAI are being deprecated, so I made sure to generate some essays with them and share them here. I also added the following public datasets (please upvote!):
    - https://www.kaggle.com/datasets/phanisrikanth/daigt-essays-from-intel-neural-chat-7b
    - https://www.kaggle.com/datasets/carlmcbrideellis/llm-mistral-7b-instruct-texts
    - https://www.kaggle.com/datasets/nbroad/daigt-data-llama-70b-and-falcon180b
    - https://www.kaggle.com/datasets/snassimr/gpt4-rephrased-llm-daigt-dataset

    All merged with my previous dataset for convenience (https://www.kaggle.com/datasets/thedrcat/daigt-v2-train-dataset)

    Enjoy ❤️

    Version 2 update:
    - removed NaNs and duplicated/short generations
    - applied cleaning procedure from @nbroad's notebook (give it an upvote please!)
    - added a model column to indicate the model family used in generations
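The cleaning steps listed above (dropping NaNs, duplicates, and short generations) might look roughly like this in pandas; the column names and the length threshold are assumptions for illustration, not the notebook's actual code:

```python
import pandas as pd

# Toy frame mimicking a generated-essay table; "text" and "model"
# columns and the 50-character cutoff are hypothetical.
df = pd.DataFrame({
    "text": ["a long generated essay " * 10, None,
             "short", "a long generated essay " * 10],
    "model": ["text-davinci-003", "text-ada-001",
              "text-curie-001", "text-davinci-003"],
})

cleaned = (
    df.dropna(subset=["text"])                 # drop NaN generations
      .drop_duplicates(subset=["text"])        # drop duplicated texts
      .loc[lambda d: d["text"].str.len() >= 50]  # drop short generations
      .reset_index(drop=True)
)
print(len(cleaned))
```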

  16. LLM-QE-Retriever-Training-Data

    • huggingface.co
    Updated May 15, 2013
    Cite
    chengpingan (2013). LLM-QE-Retriever-Training-Data [Dataset]. https://huggingface.co/datasets/chengpingan/LLM-QE-Retriever-Training-Data
    Explore at:
    Croissant
    Dataset updated
    May 15, 2013
    Authors
    chengpingan
    Description

    The chengpingan/LLM-QE-Retriever-Training-Data dataset is hosted on Hugging Face and was contributed by the HF Datasets community.

  17. Large Language Model Llm Market Report | Global Forecast From 2025 To 2033

    • dataintelo.com
    csv, pdf, pptx
    Updated Oct 5, 2024
    Cite
    Dataintelo (2024). Large Language Model Llm Market Report | Global Forecast From 2025 To 2033 [Dataset]. https://dataintelo.com/report/large-language-model-llm-market
    Explore at:
    pdf, pptx, csv (available download formats)
    Dataset updated
    Oct 5, 2024
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Large Language Model (LLM) Market Outlook



    The global market size for Large Language Models (LLMs) was valued at approximately USD 2.3 billion in 2023 and is projected to reach an astounding USD 15.8 billion by 2032, growing at a robust Compound Annual Growth Rate (CAGR) of 23.5%. The exponential growth of this market can be attributed to the increasing demand for AI-driven solutions across various sectors including healthcare, finance, and retail, among others. The rising adoption of natural language processing (NLP) technologies and advancements in machine learning algorithms are key factors driving this market.
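As a quick sanity check, the figures quoted above are roughly self-consistent: growing from USD 2.3 billion in 2023 to USD 15.8 billion in 2032 spans nine years, which implies a CAGR of about 24 percent, close to the stated 23.5 percent:

```python
# Back out the CAGR implied by the start/end market sizes quoted above.
start, end, years = 2.3, 15.8, 9  # USD billions, 2023 -> 2032
implied_cagr = (end / start) ** (1 / years) - 1
print(f"implied CAGR is about {implied_cagr:.1%}")
```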



    One of the major growth factors for the LLM market is the rapid development and adoption of artificial intelligence (AI) and machine learning technologies. The expanding capabilities of LLMs in understanding and generating human-like text have opened up new avenues for their application. This has led to increased investments in AI research and development, further propelling the advancements in LLM technologies. Moreover, the integration of LLMs with other advanced technologies such as cloud computing, big data, and IoT is enhancing their functionality and expanding their applicability across different sectors.



    Another crucial growth driver is the growing demand for automated customer service solutions. Businesses are increasingly deploying LLMs to improve customer engagement and satisfaction by providing instant, accurate, and personalized responses to customer queries. The ability of LLMs to understand and process natural language inputs makes them ideal for applications in chatbots, virtual assistants, and other automated customer service tools. This not only enhances customer experience but also significantly reduces operational costs for businesses by minimizing the need for human intervention.



    The healthcare sector is also witnessing a significant impact from the adoption of LLMs. These models are being utilized for various applications such as patient data management, diagnostics, and personalized medicine. The ability of LLMs to analyze large volumes of unstructured data and extract meaningful insights is revolutionizing the way healthcare providers deliver services. This is leading to improved patient outcomes, reduced medical errors, and more efficient healthcare delivery systems. Additionally, the ongoing advancements in AI technologies are expected to further enhance the capabilities of LLMs, driving their adoption in the healthcare sector.



    Regionally, North America is anticipated to dominate the LLM market, owing to the presence of major AI and technology companies, along with significant investments in AI research and development. The region's well-established IT infrastructure and high adoption rate of advanced technologies are further contributing to this growth. However, Asia Pacific is expected to witness the highest growth rate during the forecast period, driven by increasing digital transformation initiatives, rising investments in AI technology, and growing awareness about the benefits of LLMs in various applications.



    Component Analysis



    The LLM market can be segmented by component into software, hardware, and services. The software segment holds the largest share in the market, driven by the increasing demand for advanced AI software solutions that can leverage LLM capabilities. With the continuous advancements in machine learning algorithms and NLP technologies, the software segment is expected to maintain its dominance. Software solutions that incorporate LLMs are being used across various applications, from content generation to real-time language translation, making them indispensable tools for businesses and consumers alike.



    The hardware segment is also experiencing significant growth, as the deployment of LLMs requires substantial computational power. High-performance computing hardware, including Graphics Processing Units (GPUs) and Tensor Processing Units (TPUs), are essential for training and deploying LLMs. The increasing demand for powerful hardware solutions to support the computational requirements of LLMs is driving investments in this segment. Moreover, technological advancements in hardware components are enhancing the efficiency and performance of LLMs, further fueling their adoption.



    The services segment encompasses a wide range of offerings, including consulting, implementation, and maintenance services. As businesses increasingly adopt LLMs, the demand for specialized services to support the deployment and integration of these models is growing. Consulting services are

  18. LLM: 7 prompt training dataset

    • kaggle.com
    Updated Nov 18, 2023
    Cite
    GarrieD (2023). LLM: 7 prompt training dataset [Dataset]. https://www.kaggle.com/datasets/garried/llm-7-prompt-training-dataset
    Explore at:
Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Nov 18, 2023
    Dataset provided by
Kaggle (http://kaggle.com/)
    Authors
    GarrieD
    License

Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Dataset

    This dataset was created by GarrieD

    Released under Apache 2.0

  19. Bitext-retail-ecommerce-llm-chatbot-training-dataset

    • huggingface.co
    Updated Aug 6, 2024
    Cite
    Bitext (2024). Bitext-retail-ecommerce-llm-chatbot-training-dataset [Dataset]. https://huggingface.co/datasets/bitext/Bitext-retail-ecommerce-llm-chatbot-training-dataset
    Explore at:
Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Aug 6, 2024
    Dataset authored and provided by
    Bitext
    License

https://choosealicense.com/licenses/cdla-sharing-1.0/

    Description

    Bitext - Retail (eCommerce) Tagged Training Dataset for LLM-based Virtual Assistants

      Overview
    

    This hybrid synthetic dataset is designed to be used to fine-tune Large Language Models such as GPT, Mistral and OpenELM, and has been generated using our NLP/NLG technology and our automated Data Labeling (DAL) tools. The goal is to demonstrate how Verticalization/Domain Adaptation for the [Retail (eCommerce)] sector can be easily achieved using our two-step approach to LLM… See the full description on the dataset page: https://huggingface.co/datasets/bitext/Bitext-retail-ecommerce-llm-chatbot-training-dataset.
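Tagged instruction/response corpora like this one are usually mapped into a chat-style format before supervised fine-tuning. A minimal sketch of that mapping; the field names (`instruction`, `intent`, `category`, `response`) are assumptions based on typical Bitext releases, so check the dataset card before relying on them:

```python
# Turn one tagged record into the "messages" format most SFT trainers accept.
# The sample record below is invented for illustration.

def to_chat_example(record: dict) -> dict:
    """Map a tagged instruction/response record to a chat fine-tuning example."""
    return {
        "messages": [
            {"role": "system",
             "content": f"You are a retail support assistant. Intent: {record['intent']}"},
            {"role": "user", "content": record["instruction"]},
            {"role": "assistant", "content": record["response"]},
        ]
    }

sample = {
    "instruction": "where is my order?",
    "intent": "track_order",
    "category": "ORDER",
    "response": "I can help you track it. Could you share your order number?",
}
print(to_chat_example(sample)["messages"][1]["content"])  # where is my order?
```

Keeping the intent tag in the system message is one way to preserve the dataset's labeling through fine-tuning; dropping it is equally valid if only the dialogue text matters.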

  20. Data from: Text2KGBench: A Benchmark for Ontology-Driven Knowledge Graph...

    • data.niaid.nih.gov
    • zenodo.org
    Updated May 23, 2023
    Cite
    Carlos F. Enguix (2023). Text2KGBench: A Benchmark for Ontology-Driven Knowledge Graph Generation from Text [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7916715
    Explore at:
    Dataset updated
    May 23, 2023
    Dataset provided by
    Nandana Mihindukulasooriya
    Sanju Tiwari
    Kusum Lata
    Carlos F. Enguix
    License

Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

This is the repository for the ISWC 2023 Resource Track submission Text2KGBench: A Benchmark for Ontology-Driven Knowledge Graph Generation from Text. Text2KGBench is a benchmark to evaluate the capabilities of language models to generate KGs from natural language text guided by an ontology. Given an input ontology and a set of sentences, the task is to extract facts from the text while complying with the given ontology (concepts, relations, domain/range constraints) and being faithful to the input sentences.

    It contains two datasets (i) Wikidata-TekGen with 10 ontologies and 13,474 sentences and (ii) DBpedia-WebNLG with 19 ontologies and 4,860 sentences.

    An example

    An example test sentence:

Test Sentence: {"id": "ont_music_test_n", "sent": "\"The Loco-Motion\" is a 1962 pop song written by American songwriters Gerry Goffin and Carole King."}

    An example of ontology:

    Ontology: Music Ontology

    Expected Output:

{ "id": "ont_k_music_test_n", "sent": "\"The Loco-Motion\" is a 1962 pop song written by American songwriters Gerry Goffin and Carole King.", "triples": [ { "sub": "The Loco-Motion", "rel": "publication date", "obj": "01 January 1962" }, { "sub": "The Loco-Motion", "rel": "lyrics by", "obj": "Gerry Goffin" }, { "sub": "The Loco-Motion", "rel": "lyrics by", "obj": "Carole King" } ] }

The data is released under a Creative Commons Attribution 4.0 International (CC BY 4.0) License.

The structure of the repo is as follows.

Text2KGBench
  src: the source code used for generation, evaluation, and the baselines
    benchmark: the code used to generate the benchmark
    evaluation: evaluation scripts for calculating the results
    baseline: code for generating the baselines, including prompts, sentence similarities, and the LLM client
  data: the benchmark datasets and baseline data, for two datasets: wikidata_tekgen and dbpedia_webnlg
    wikidata_tekgen: Wikidata-TekGen dataset
      ontologies: 10 ontologies used by this dataset
      train: training data
      test: test data
      manually_verified_sentences: ids of a subset of test cases manually validated
      unseen_sentences: new sentences added by the authors that are not part of Wikipedia
        test: unseen test sentences
        ground_truth: ground truth for the unseen test sentences
      ground_truth: ground truth for the test data
      baselines: data related to running the baselines
        test_train_sent_similarity: for each test case, the 5 most similar train sentences, generated using the SBERT T5-XXL model
        prompts: prompts corresponding to each test file
        unseen prompts: prompts for the unseen test cases
        Alpaca-LoRA-13B: data related to the Alpaca-LoRA model
          llm_responses: raw LLM responses and extracted triples
          eval_metrics: ontology-level and aggregated evaluation results
          unseen results: results for the unseen test cases
            llm_responses: raw LLM responses and extracted triples
            eval_metrics: ontology-level and aggregated evaluation results
        Vicuna-13B: data related to the Vicuna-13B model
          llm_responses: raw LLM responses and extracted triples
          eval_metrics: ontology-level and aggregated evaluation results
    dbpedia_webnlg: DBpedia-WebNLG dataset
      ontologies: 19 ontologies used by this dataset
      train: training data
      test: test data
      ground_truth: ground truth for the test data
      baselines: data related to running the baselines
        test_train_sent_similarity: for each test case, the 5 most similar train sentences, generated using the SBERT T5-XXL model
        prompts: prompts corresponding to each test file
        Alpaca-LoRA-13B: data related to the Alpaca-LoRA model
          llm_responses: raw LLM responses and extracted triples
          eval_metrics: ontology-level and aggregated evaluation results
        Vicuna-13B: data related to the Vicuna-13B model
          llm_responses: raw LLM responses and extracted triples
          eval_metrics: ontology-level and aggregated evaluation results

    This benchmark contains data derived from the TekGen corpus (part of the KELM corpus) [1] released under CC BY-SA 2.0 license and WebNLG 3.0 corpus [2] released under CC BY-NC-SA 4.0 license.

    [1] Oshin Agarwal, Heming Ge, Siamak Shakeri, and Rami Al-Rfou. 2021. Knowledge Graph Based Synthetic Corpus Generation for Knowledge-Enhanced Language Model Pre-training. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3554–3565, Online. Association for Computational Linguistics.

    [2] Claire Gardent, Anastasia Shimorina, Shashi Narayan, and Laura Perez-Beltrachini. 2017. Creating Training Corpora for NLG Micro-Planners. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 179–188, Vancouver, Canada. Association for Computational Linguistics.
