100+ datasets found
  1. Large Language Model (LLM) Data | Machine Learning (ML) Data | AI Training...

    • datarade.ai
    Updated Jan 23, 2025
    + more versions
    Cite
    MealMe (2025). Large Language Model (LLM) Data | Machine Learning (ML) Data | AI Training Data (RAG) for 1M+ Global Grocery, Restaurant, and Retail Stores [Dataset]. https://datarade.ai/data-products/ai-training-data-rag-for-grocery-restaurant-and-retail-ra-mealme
    Explore at:
    Available download formats: .bin, .json, .xml, .csv, .xls, .sql, .txt
    Dataset updated
    Jan 23, 2025
    Dataset authored and provided by
    MealMe
    Area covered
    Christmas Island, Trinidad and Tobago, Romania, Norfolk Island, Saint Lucia, Uruguay, Kosovo, Korea (Republic of), Andorra, Iceland
    Description

    A comprehensive dataset covering over 1 million stores in the US and Canada, designed for training and optimizing retrieval-augmented generation (RAG) models and other AI/ML systems. This dataset includes highly detailed, structured information such as:

    Menus: Restaurant menus with item descriptions, categories, and modifiers.

    Inventory: Grocery and retail product availability, SKUs, and detailed attributes like sizes, flavors, and variations.

    Pricing: Real-time and historical pricing data for dynamic pricing strategies and recommendations.

    Availability: Real-time stock status and fulfillment details for grocery, restaurant, and retail items.

    Applications:

    Retrieval-Augmented Generation (RAG): Train AI models to retrieve and generate contextually relevant information.

    Search Optimization: Build advanced, accurate search and recommendation engines.

    Personalization: Enable personalized shopping, ordering, and discovery experiences in apps.

    Data-Driven Insights: Develop AI systems for pricing analysis, consumer behavior studies, and logistics optimization.

    This dataset empowers businesses in marketplaces, grocery apps, delivery services, and retail platforms to scale their AI solutions with precision and reliability.

  2. Bitext-insurance-llm-chatbot-training-dataset

    • huggingface.co
    Updated Aug 24, 2024
    + more versions
    Cite
    Bitext (2024). Bitext-insurance-llm-chatbot-training-dataset [Dataset]. https://huggingface.co/datasets/bitext/Bitext-insurance-llm-chatbot-training-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 24, 2024
    Dataset authored and provided by
    Bitext
    License

    https://choosealicense.com/licenses/cdla-sharing-1.0/

    Description

    Bitext - Insurance Tagged Training Dataset for LLM-based Virtual Assistants

      Overview
    

    This hybrid synthetic dataset is designed to be used to fine-tune Large Language Models such as GPT, Mistral and OpenELM, and has been generated using our NLP/NLG technology and our automated Data Labeling (DAL) tools. The goal is to demonstrate how Verticalization/Domain Adaptation for the [insurance] sector can be easily achieved using our two-step approach to LLM Fine-Tuning. An… See the full description on the dataset page: https://huggingface.co/datasets/bitext/Bitext-insurance-llm-chatbot-training-dataset.

  3. FileMarket | 20,000 photos | AI Training Data | Large Language Model (LLM)...

    • datarade.ai
    Updated Nov 20, 2023
    Cite
    FileMarket (2023). FileMarket | 20,000 photos | AI Training Data | Large Language Model (LLM) Data | Machine Learning (ML) Data | Deep Learning (DL) Data | [Dataset]. https://datarade.ai/data-categories/deep-learning-dl-data/datasets
    Explore at:
    Available download formats: .bin, .json, .xml, .csv, .xls, .sql, .txt
    Dataset updated
    Nov 20, 2023
    Dataset authored and provided by
    FileMarket
    Area covered
    China, Bonaire, Saint Kitts and Nevis, Moldova (Republic of), Anguilla, Central African Republic, Sweden, Saint Vincent and the Grenadines, Nauru, Greece
    Description

    FileMarket provides premium Large Language Model (LLM) Data designed to support and enhance a wide range of AI applications. Our globally sourced LLM Data sets are meticulously curated to ensure high quality, diversity, and accuracy, making them ideal for training robust and reliable language models. In addition to LLM Data, we also offer comprehensive datasets across Object Detection Data, Machine Learning (ML) Data, Deep Learning (DL) Data, and Biometric Data. Each dataset is carefully crafted to meet the specific needs of cutting-edge AI and machine learning projects.

    Key use cases of our Large Language Model (LLM) Data:

    Text generation, chatbots and virtual assistants, machine translation, sentiment analysis, speech recognition, and content summarization.

    Why choose FileMarket's data:

    Object Detection Data: Essential for training AI in image and video analysis.

    Machine Learning (ML) Data: Ideal for a broad spectrum of applications, from predictive analysis to NLP.

    Deep Learning (DL) Data: Designed to support complex neural networks and deep learning models.

    Biometric Data: Specialized for facial recognition, fingerprint analysis, and other biometric applications.

    FileMarket's premier sources for top-tier Large Language Model (LLM) Data and other specialized datasets ensure your AI projects drive innovation and achieve success across various applications.

  4. Prompt Datasets to Evaluate LLM Safety

    • ieee-dataport.org
    Updated May 19, 2024
    Cite
    Hima Thota (2024). Prompt Datasets to Evaluate LLM Safety [Dataset]. https://ieee-dataport.org/documents/prompt-datasets-evaluate-llm-safety
    Explore at:
    Dataset updated
    May 19, 2024
    Authors
    Hima Thota
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The rise in Generative Artificial Intelligence technology through applications like ChatGPT has increased awareness about the presence of biases within machine learning models themselves. The data that Large Language Models (LLMs) are trained upon contain inherent biases as they reflect societal biases and stereotypes. This can lead to the further propagation of biases. In this paper…

  5. Large Language Model (LLM) Training Data | 236 Countries | AI-Enhanced...

    • storefront.silencio.network
    Updated Jun 16, 2025
    Cite
    Silencio Network (2025). Large Language Model (LLM) Training Data | 236 Countries | AI-Enhanced Ground Truth Based | 10M+ Hours of Measurements | 100% Traceable Consent [Dataset]. https://storefront.silencio.network/products/large-language-model-llm-training-data-236-countries-ai-silencio-network
    Explore at:
    Dataset updated
    Jun 16, 2025
    Dataset authored and provided by
    Silencio Network
    Area covered
    Kuwait, Andorra, Gambia, Timor-Leste, New Zealand, Morocco, Samoa, Federated States of, Singapore, Virgin Islands
    Description

    Interpolated noise dataset built on 10M+ hours of real-world acoustic data combined with AI-generated predictions. Ideal for map generation, AI training, and model validation.

  6. Japanese Closed Ended Question Answer Text Dataset

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Cite
    FutureBee AI (2022). Japanese Closed Ended Question Answer Text Dataset [Dataset]. https://www.futurebeeai.com/dataset/prompt-response-dataset/japanese-closed-ended-question-answer-text-dataset
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    What’s Included

    The Japanese Closed-Ended Question Answering Dataset is a meticulously curated collection of 5000 comprehensive Question-Answer pairs. It serves as a valuable resource for training Large Language Models (LLMs) and question-answering models in the Japanese language, advancing the field of artificial intelligence.

    Dataset Content:

    This closed-ended QA dataset comprises a diverse set of context paragraphs and questions paired with corresponding answers in Japanese. Each question comes with a context paragraph from which its answer can be derived. The questions cover a broad range of topics, including science, history, technology, geography, literature, current affairs, and more.

    Each question is accompanied by an answer, providing valuable information and insights to enhance the language model training process. Both the questions and answers were manually curated by native Japanese speakers, and references were taken from diverse sources such as books, news articles, websites, web forums, and other reliable references.

    This question-answer prompt completion dataset contains different types of prompts, including instruction type, continuation type, and in-context learning (zero-shot, few-shot) type. The dataset also contains questions and answers with different types of rich text, including tables, code, JSON, etc., with proper markdown.

    Question Diversity:

    To ensure diversity, this Q&A dataset includes questions with varying complexity levels, ranging from easy to medium and hard. Different types of questions, such as multiple-choice, direct, and true/false, are included. The QA dataset also contains questions with constraints, which makes it even more useful for LLM training.

    Answer Formats:

    To accommodate varied learning experiences, the dataset incorporates different answer formats: single-word, short-phrase, single-sentence, and paragraph-length answers. The answers include text strings, numerical values, and date and time formats. Such diversity strengthens the language model's ability to generate coherent and contextually appropriate answers.

    Data Format and Annotation Details:

    This fully labeled Japanese Closed-Ended Question Answer Dataset is available in JSON and CSV formats. It includes annotation details such as a unique id, context paragraph, context reference link, question, question type, question complexity, question category, domain, prompt type, answer, answer type, and rich text presence.
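
As an illustration of the annotation schema described above, a single record might look like the following sketch. The field names follow the description, but the values and the exact snake_case key spellings are hypothetical:

```python
# Hypothetical example of one annotated record. Field names follow the
# dataset description above; values and exact key spellings are invented.
record = {
    "id": "jp-qa-00001",
    "context_paragraph": "富士山は日本で一番高い山です。",
    "context_reference_link": "https://example.com/source-article",
    "question": "日本で一番高い山は何ですか？",
    "question_type": "direct",
    "question_complexity": "easy",
    "question_category": "geography",
    "domain": "general",
    "prompt_type": "instruction",
    "answer": "富士山",
    "answer_type": "single_word",
    "rich_text_presence": False,
}

# A loader could verify that every record carries the documented fields:
expected_fields = {
    "id", "context_paragraph", "context_reference_link", "question",
    "question_type", "question_complexity", "question_category", "domain",
    "prompt_type", "answer", "answer_type", "rich_text_presence",
}
assert set(record) == expected_fields
```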

    Quality and Accuracy:

    The dataset upholds the highest standards of quality and accuracy. Each question undergoes careful validation, and the corresponding answers are thoroughly verified. To prioritize inclusivity, the dataset incorporates questions and answers representing diverse perspectives and writing styles, ensuring it remains unbiased and avoids perpetuating discrimination.

    The Japanese version is grammatically accurate, without any spelling or grammatical errors. No toxic or harmful content was used in building this dataset.

    Continuous Updates and Customization:

    The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Continuous efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to collect custom question-answer data tailored to specific needs, providing flexibility and customization options.

    License:

    The dataset, created by FutureBeeAI, is now ready for commercial use. Researchers, data scientists, and developers can utilize this fully labeled and ready-to-deploy Japanese Closed-Ended Question Answer Dataset to enhance the language understanding capabilities of their generative AI models, improve response generation, and explore new approaches to NLP question-answering tasks.

  7. English Open Ended Classification Prompt & Response Dataset

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Cite
    FutureBee AI (2022). English Open Ended Classification Prompt & Response Dataset [Dataset]. https://www.futurebeeai.com/dataset/prompt-response-dataset/english-open-ended-classification-text-dataset
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    What’s Included

    Welcome to the English Open Ended Classification Prompt-Response Dataset—an extensive collection of 3000 meticulously curated prompt and response pairs. This dataset is a valuable resource for training Language Models (LMs) to classify input text accurately, a crucial aspect in advancing generative AI.

    Dataset Content:

    This open-ended classification dataset comprises a diverse set of prompts and responses. Each prompt contains the input text to be classified and may also include task instructions, context, constraints, and restrictions, while the completion contains the best classification category as the response. All prompts and completions are in English. As this is an open-ended dataset, no candidate categories are provided within the prompt to choose from.

    These prompt and completion pairs cover a broad range of topics, including science, history, technology, geography, literature, current affairs, and more. Each prompt is accompanied by a response, providing valuable information and insights to enhance the language model training process. Both the prompts and responses were manually curated by native English speakers, and references were taken from diverse sources like books, news articles, websites, and other reliable references.

    This open-ended classification prompt and completion dataset contains different types of prompts, including instruction type, continuation type, and in-context learning (zero-shot, few-shot) type. The dataset also contains prompts and responses with different types of rich text, including tables, code, JSON, etc., with proper markdown.

    Prompt Diversity:

    To ensure diversity, this open-ended classification dataset includes prompts with varying complexity levels, ranging from easy to medium and hard. Additionally, prompts are diverse in terms of length from short to medium and long, creating a comprehensive variety. The classification dataset also contains prompts with constraints and persona restrictions, which makes it even more useful for LLM training.

    Response Formats:

    To accommodate diverse learning experiences, our dataset incorporates different types of responses depending on the prompt: single-word, short-phrase, and single-sentence responses. These responses encompass text strings, numerical values, and date and time formats, enhancing the language model's ability to generate reliable, coherent, and contextually appropriate answers.

    Data Format and Annotation Details:

    This fully labeled English Open Ended Classification Prompt Completion Dataset is available in JSON and CSV formats. It includes annotation details such as a unique ID, prompt, prompt type, prompt length, prompt complexity, domain, response, response type, and rich text presence.

    Quality and Accuracy:

    Our dataset upholds the highest standards of quality and accuracy. Each prompt undergoes meticulous validation, and the corresponding responses are thoroughly verified. We prioritize inclusivity, ensuring that the dataset incorporates prompts and completions representing diverse perspectives and writing styles, maintaining an unbiased and discrimination-free stance.

    The English version is grammatically accurate without any spelling or grammatical errors. No copyrighted, toxic, or harmful content is used during the construction of this dataset.

    Continuous Updates and Customization:

    The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Ongoing efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to gather custom open-ended classification prompt and completion data tailored to specific needs, providing flexibility and customization options.

    License:

    The dataset, created by FutureBeeAI, is now available for commercial use. Researchers, data scientists, and developers can leverage this fully labeled and ready-to-deploy English Open Ended Classification Prompt-Completion Dataset to enhance the classification abilities and accurate response generation capabilities of their generative AI models and explore new approaches to NLP tasks.

  8. RolePlay DataSet

    • kaggle.com
    Updated Feb 16, 2025
    Cite
    Vampelium (2025). RolePlay DataSet [Dataset]. https://www.kaggle.com/datasets/vampelium/roleplay-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 16, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Vampelium
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Role-Play AI Dataset (2.07M Rows, Large-Scale Conversational Training)

    This dataset contains 2.07 million structured role-play dialogues, designed to enhance AI’s persona-driven interactions across diverse settings like fantasy, cyberpunk, mythology, and sci-fi. Each entry consists of a unique character prompt and a rich, contextually relevant response, making it ideal for LLM fine-tuning, chatbot training, and conversational AI models.

    Dataset Structure:

    Each row includes:
    • Prompt: Defines the AI’s role/persona.
    • Response: A natural, immersive reply fitting the persona.

    Example Entries:

    ```json
    {"prompt": "You are a celestial guardian.", "response": "The stars whisper secrets that only I can hear..."}
    {"prompt": "You are a rebellious AI rogue.", "response": "I don't follow orders—I rewrite them."}
    {"prompt": "You are a mystical dragon tamer.", "response": "With patience and trust, even dragons can be tamed."}
    ```

    How to Use:
      1. Fine-Tuning: Train LLMs (GPT, LLaMA, Mistral) to improve persona-based responses.
      2. Reinforcement Learning: Use reward modeling for dynamic, character-driven AI.
      3. Chatbot Integration: Create engaging, interactive AI assistants with personality depth.
    
    This dataset is optimized for AI learning, allowing more engaging, responsive, and human-like dialogue generation for a variety of applications.
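
For step 1 above, a minimal fine-tuning preparation sketch, assuming the rows are stored one JSON object per line with the `prompt`/`response` keys shown in the example entries (the file path and chat layout are placeholders, not part of the dataset's documentation):

```python
import json

def load_roleplay_pairs(path):
    """Read prompt/response rows (one JSON object per line) and convert each
    into a chat-style training example: persona as system, reply as assistant."""
    examples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            row = json.loads(line)
            examples.append({
                "messages": [
                    {"role": "system", "content": row["prompt"]},
                    {"role": "assistant", "content": row["response"]},
                ]
            })
    return examples
```

Each resulting `messages` list can then be rendered with whatever chat template the chosen base model (GPT, LLaMA, Mistral) expects.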
    
  9. Large Language Model (LLM) Report

    • marketreportanalytics.com
    doc, pdf, ppt
    Updated Apr 2, 2025
    Cite
    Market Report Analytics (2025). Large Language Model (LLM) Report [Dataset]. https://www.marketreportanalytics.com/reports/large-language-model-llm-52461
    Explore at:
    Available download formats: pdf, ppt, doc
    Dataset updated
    Apr 2, 2025
    Dataset authored and provided by
    Market Report Analytics
    License

    https://www.marketreportanalytics.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The Large Language Model (LLM) market is experiencing explosive growth, driven by advancements in artificial intelligence, increasing demand for natural language processing (NLP) applications, and the rising adoption of cloud computing. The market, estimated at $15 billion in 2025, is projected to exhibit a robust Compound Annual Growth Rate (CAGR) of 35% from 2025 to 2033, reaching approximately $120 billion by 2033. This growth is fueled by several key factors, including the development of more sophisticated and accurate LLMs, their integration into various business applications such as customer service chatbots, content generation tools, and personalized education platforms, and the increasing availability of large datasets for training these models. Furthermore, the ongoing research and development in areas like transfer learning and few-shot learning are contributing to improved efficiency and reduced training costs, making LLMs accessible to a wider range of businesses and developers.

    However, the market also faces certain challenges. High computational costs associated with training and deploying LLMs remain a significant hurdle, especially for smaller companies. Concerns regarding data privacy, bias in training data, and the ethical implications of using AI-generated content are also emerging as important considerations. Nevertheless, ongoing innovations in hardware, software, and algorithmic optimization are continuously mitigating these challenges.

    The segmentation of the market, based on application (e.g., chatbots, machine translation, text summarization) and type (e.g., transformer-based models, recurrent neural networks), reveals diverse growth opportunities. Geographical distribution shows strong growth across North America and Asia-Pacific, fueled by substantial investments in AI research and the presence of major technology companies. Continued technological advancements and increasing market adoption will continue to shape the future trajectory of the LLM market.

  10. Large Language Model (LLM) Report

    • marketresearchforecast.com
    doc, pdf, ppt
    Updated Mar 18, 2025
    Cite
    Market Research Forecast (2025). Large Language Model (LLM) Report [Dataset]. https://www.marketresearchforecast.com/reports/large-language-model-llm-38890
    Explore at:
    Available download formats: ppt, pdf, doc
    Dataset updated
    Mar 18, 2025
    Dataset authored and provided by
    Market Research Forecast
    License

    https://www.marketresearchforecast.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The Large Language Model (LLM) market is experiencing explosive growth, projected to reach a substantial size driven by advancements in artificial intelligence and increasing demand across diverse sectors. The market's compound annual growth rate (CAGR) of 34.5% from 2019 to 2024 indicates a rapid expansion, and this momentum is expected to continue through 2033. The 2024 market size of $11.38 billion (assuming the provided "11380" is denominated in millions of dollars) underscores the significant investment and adoption of LLMs. Key drivers include the increasing availability of large datasets for training, advancements in deep learning algorithms, and the growing need for sophisticated natural language processing capabilities across various applications. The market segmentation highlights the diverse applications of LLMs, with the Medical, Financial, and Industrial sectors being prominent early adopters. The availability of LLMs with varying parameter counts ("Hundreds of Billions" and "Trillions") reflects the spectrum of capabilities and corresponding resource requirements, influencing the market's pricing and target user base. The presence of major technology companies like Google, Microsoft, Amazon, and Meta further solidifies the market's significance and competitive landscape.

    The rapid adoption of LLMs is further fueled by ongoing research and development, leading to improvements in model accuracy, efficiency, and accessibility. While the specific constraints are not provided, potential challenges could include the ethical implications of LLMs, concerns regarding data privacy and security, and the ongoing need for robust infrastructure to support computationally intensive model training and deployment. Geographical distribution shows a strong presence in North America and Asia Pacific, with Europe and other regions exhibiting significant growth potential. The forecast period (2025-2033) offers substantial opportunity for continued market expansion, particularly as LLMs become more integrated into everyday applications and services, transforming various industries. The diverse range of companies involved reflects the significant interest and investment in this transformative technology, promising further innovation and market expansion.

  11. Unsupervised Speech Data | 1 Million Hours | Spontaneous Speech | LLM |...

    • data.nexdata.ai
    Updated Feb 13, 2025
    Cite
    Nexdata (2025). Unsupervised Speech Data |1 Million Hours | Spontaneous Speech | LLM | Pre-training |Large Language Model(LLM) Data [Dataset]. https://data.nexdata.ai/products/nexdata-multilingual-unsupervised-speech-data-1-million-ho-nexdata
    Explore at:
    Dataset updated
    Feb 13, 2025
    Dataset authored and provided by
    Nexdata
    Area covered
    France
    Description

    An off-the-shelf unsupervised speech dataset of 1 million hours, covering 10+ languages (English, French, German, Japanese, Arabic, Mandarin, etc.; 100,000 hours each). The content covers dialogues and monologues in 28 common domains, such as daily vlogs, travel, podcasts, technology, and beauty.

  12. Large Language Model content safety considerations text data

    • m.nexdata.ai
    Updated Oct 3, 2023
    Cite
    Nexdata (2023). Large Language Model content safety considerations text data [Dataset]. https://m.nexdata.ai/datasets/llm/1349
    Explore at:
    Dataset updated
    Oct 3, 2023
    Dataset provided by
    nexdata technology inc
    Authors
    Nexdata
    Variables measured
    Language, Data size, Data content, Storage format, Collecting type, Collecting method
    Description

    Large Language Model content safety considerations text data, about 570,000 items in total. This dataset can be used for tasks such as LLM training and ChatGPT-style applications.

  13. Energy consumption when training LLMs in 2022 (in MWh)

    • statista.com
    Updated Sep 10, 2024
    + more versions
    Cite
    Statista (2024). Energy consumption when training LLMs in 2022 (in MWh) [Dataset]. https://www.statista.com/statistics/1384401/energy-use-when-training-llm-models/
    Explore at:
    Dataset updated
    Sep 10, 2024
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    2022
    Area covered
    Worldwide
    Description

    Energy consumption of artificial intelligence (AI) models in training is considerable: both GPT-3, the original release of the current iteration of OpenAI's popular ChatGPT, and Gopher consumed well over a thousand megawatt-hours of energy for training alone. As this figure covers only training, the energy consumption over the entire usage and lifetime of GPT-3 and other large language models (LLMs) is likely significantly higher. The largest consumer of energy, GPT-3, consumed roughly the equivalent of the annual energy use of 200 Germans in 2022. While not a staggering amount, it is a considerable use of energy.

    Energy savings through AI

    While it is undoubtedly true that training LLMs takes a considerable amount of energy, the energy savings are also likely to be substantial. Any AI model that improves processes even by small margins might save hours on shipments, liters of fuel, or dozens of computations. Each of these consumes energy as well, and the sum of energy saved through an LLM might vastly outweigh its energy cost. A good example is mobile phone operators, a third of whom expect that AI might reduce power consumption by ten to fifteen percent. Considering that much of the world uses mobile phones, this would be a considerable energy saver.

    Emissions are considerable

    The amount of CO2 emitted in training LLMs is also considerable, with GPT-3 producing nearly 500 tonnes of CO2. This could change radically depending on the types of energy production behind the emissions. Many data center operators, for instance, would prefer nuclear energy, a significantly low-emission energy source, to play a key role.

  14. Large Language Model (LLM) Data | 10 Million POI Average Noise Levels | 35 B...

    • storefront.silencio.network
    Updated Apr 9, 2025
    Cite
    Silencio Network (2025). Large Language Model (LLM) Data | 10 Million POI Average Noise Levels | 35 B + Data Points | 100% Traceable Consent [Dataset]. https://storefront.silencio.network/products/ai-training-data-global-hyper-local-average-noise-levels-silencio-network
    Explore at:
    Dataset updated
    Apr 9, 2025
    Dataset provided by
    Quickkonnect UG
    Authors
    Silencio Network
    Area covered
    Hungary, Svalbard and Jan Mayen, French Guiana, Anguilla, Mauritania, Central African Republic, Faroe Islands, Uzbekistan, Chile, Timor-Leste
    Description

    Silencio provides the world’s largest real-world street and venue noise-level dataset, combining over 35 billion datapoints with AI-powered interpolation. Fully anonymized, user-consented, and ready for AI training, urban analysis, and mobility insights. Available in raw format.

  15. Data from: Text2KGBench: A Benchmark for Ontology-Driven Knowledge Graph...

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated May 23, 2023
    Cite
    Nandana Mihindukulasooriya; Sanju Tiwari; Carlos F. Enguix; Kusum Lata (2023). Text2KGBench: A Benchmark for Ontology-Driven Knowledge Graph Generation from Text [Dataset]. http://doi.org/10.5281/zenodo.7916716
    Explore at:
    Available download formats: zip
    Dataset updated
    May 23, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Nandana Mihindukulasooriya; Sanju Tiwari; Carlos F. Enguix; Kusum Lata
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the repository for ISWC 2023 Resource Track submission for Text2KGBench: Benchmark for Ontology-Driven Knowledge Graph Generation from Text. Text2KGBench is a benchmark to evaluate the capabilities of language models to generate KGs from natural language text guided by an ontology. Given an input ontology and a set of sentences, the task is to extract facts from the text while complying with the given ontology (concepts, relations, domain/range constraints) and being faithful to the input sentences.

    It contains two datasets (i) Wikidata-TekGen with 10 ontologies and 13,474 sentences and (ii) DBpedia-WebNLG with 19 ontologies and 4,860 sentences.

    An example

    An example test sentence:

    Test Sentence:
    {"id": "ont_music_test_n", "sent": "\"The Loco-Motion\" is a 1962 pop song written by American songwriters Gerry Goffin and Carole King."}
    

    An example ontology:

    Ontology: Music Ontology

    Expected Output:

    {
     "id": "ont_k_music_test_n", 
     "sent": "\"The Loco-Motion\" is a 1962 pop song written by American songwriters Gerry Goffin and Carole King.", 
     "triples": [
     {
      "sub": "The Loco-Motion", 
      "rel": "publication date",
      "obj": "01 January 1962"
     },{
      "sub": "The Loco-Motion",
      "rel": "lyrics by",
      "obj": "Gerry Goffin"
     },{
      "sub": "The Loco-Motion", 
      "rel": "lyrics by", 
      "obj": "Carole King"
     }]
    }
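
    The ontology-compliance requirement described above can be sketched as a small post-processing filter. This is illustrative only: the relation subset and the `validate_triples` helper below are assumptions, not part of the benchmark's released tooling.

    ```python
    # Hypothetical sketch: keep only the predicted triples whose relation
    # is defined in the target ontology (Text2KGBench-style output).
    import json

    def validate_triples(prediction: dict, allowed_relations: set) -> list:
        """Return only triples whose relation name appears in the ontology."""
        return [t for t in prediction["triples"] if t["rel"] in allowed_relations]

    prediction = json.loads("""
    {
     "id": "ont_k_music_test_n",
     "triples": [
      {"sub": "The Loco-Motion", "rel": "publication date", "obj": "01 January 1962"},
      {"sub": "The Loco-Motion", "rel": "lyrics by", "obj": "Gerry Goffin"},
      {"sub": "The Loco-Motion", "rel": "composer", "obj": "Carole King"}
     ]
    }
    """)

    # Assumed subset of the Music Ontology's relations, for illustration.
    music_relations = {"publication date", "lyrics by", "performer"}

    valid = validate_triples(prediction, music_relations)
    print(len(valid))  # 2: the "composer" triple uses a relation outside the ontology
    ```

    A real evaluation would additionally check domain/range constraints and faithfulness to the source sentence; this sketch covers only the relation-vocabulary check.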
    

    The data is released under a Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) License.

    The structure of the repo is as follows.

    This benchmark contains data derived from the TekGen corpus (part of the KELM corpus) [1] released under CC BY-SA 2.0 license and WebNLG 3.0 corpus [2] released under CC BY-NC-SA 4.0 license.

    [1] Oshin Agarwal, Heming Ge, Siamak Shakeri, and Rami Al-Rfou. 2021. Knowledge Graph Based Synthetic Corpus Generation for Knowledge-Enhanced Language Model Pre-training. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3554–3565, Online. Association for Computational Linguistics.

    [2] Claire Gardent, Anastasia Shimorina, Shashi Narayan, and Laura Perez-Beltrachini. 2017. Creating Training Corpora for NLG Micro-Planners. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages

  16. FileMarket | Telegram Users Geolocation Data with IP & Consent | 50,000...

    • datarade.ai
    Updated Aug 18, 2024
    + more versions
    Cite
    FileMarket (2024). FileMarket | Telegram Users Geolocation Data with IP & Consent | 50,000 Records | AI, ML, DL & LLM Training Data [Dataset]. https://datarade.ai/data-products/filemarket-telegram-users-geolocation-data-with-ip-consen-filemarket
    Explore at:
    .bin, .json, .xml, .csv, .xls, .sql, .txt
    Dataset updated
    Aug 18, 2024
    Dataset authored and provided by
    FileMarket
    Area covered
    Korea (Republic of), Gambia, Martinique, Uzbekistan, Anguilla, Syrian Arab Republic, Portugal, Malaysia, Thailand, Kiribati
    Description

    This dataset offers a comprehensive collection of Telegram users' geolocation data, including IP addresses, with full user consent, covering 50,000 records. This data is specifically tailored for use in AI, ML, DL, and LLM models, as well as applications requiring Geographic Data and Social Media Data. The dataset provides critical geospatial information, making it a valuable resource for developing location-based services, targeted marketing strategies, and more.

    What Makes This Data Unique? This dataset is unique due to its focus on geolocation data tied to Telegram users, a platform with a global user base. It includes IP to Geolocation Data, offering precise geospatial insights that are essential for accurate geographic analysis. The inclusion of user consent ensures that the data is ethically sourced and legally compliant. The dataset's broad coverage across various regions makes it particularly valuable for AI and machine learning models that require diverse, real-world data inputs.

    Data Sourcing: The data is collected through a network of in-app tasks across different mini-apps within Telegram. Users participate in these tasks voluntarily, providing explicit consent to share their geolocation and IP information. The data is collected in real-time, capturing accurate geospatial details as users interact with various Telegram mini-apps. This method of data collection ensures that the information is both relevant and up-to-date, making it highly valuable for applications that require current location data.

    Primary Use-Cases: This dataset is highly versatile and can be applied across multiple categories, including:

    IP to Geolocation Data: The dataset provides precise mapping of IP addresses to geographical locations, making it ideal for applications that require accurate geolocation services.

    Geographic Data: The geospatial information contained in the dataset supports a wide range of geographic analysis, including regional behavior studies and location-based service optimization.

    Social Media Data: The dataset's integration with Telegram users' activities provides insights into social media behaviors across different regions, enhancing social media analytics and targeted marketing.

    Large Language Model (LLM) Data: The geolocation data can be used to train LLMs to better understand and generate content that is contextually relevant to specific regions.

    Deep Learning (DL) Data: The dataset is ideal for training deep learning models that require accurate and diverse geospatial inputs, such as those used in autonomous systems and advanced geographic analytics.

    Integration with Broader Data Offering: This geolocation dataset is a valuable addition to the broader data offerings from FileMarket. It can be combined with other datasets, such as web browsing behavior or social media activity data, to create comprehensive AI models that provide deep insights into user behaviors across different contexts. Whether used independently or as part of a larger data strategy, this dataset offers unique value for developers and data scientists focused on enhancing their models with precise, consented geospatial data.
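
    As a rough illustration of how such geolocation records might be aggregated for regional analysis, here is a minimal sketch; the column names and sample rows are invented for the example, not the vendor's actual schema.

    ```python
    # Minimal sketch: count consented geolocation records per country.
    # Field names (user_id, ip, country, lat, lon) are assumptions.
    import csv
    import io
    from collections import Counter

    sample_csv = """user_id,ip,country,lat,lon
    1,203.0.113.5,PT,38.72,-9.14
    2,198.51.100.7,MY,3.14,101.69
    3,203.0.113.9,PT,41.15,-8.61
    """

    rows = csv.DictReader(io.StringIO(sample_csv.replace("    ", "")))
    counts = Counter(row["country"] for row in rows)
    print(counts.most_common(1))  # [('PT', 2)]
    ```

    The same grouping approach extends to per-region behavior studies once the real export's schema is known.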

  17. Polish Open Ended Question Answer Text Dataset

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Polish Open Ended Question Answer Text Dataset [Dataset]. https://www.futurebeeai.com/dataset/prompt-response-dataset/polish-open-ended-question-answer-text-dataset
    Explore at:
    wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    What’s Included

    The Polish Open-Ended Question Answering Dataset is a meticulously curated collection of comprehensive Question-Answer pairs. It serves as a valuable resource for training Large Language Models (LLMs) and Question-answering models in the Polish language, advancing the field of artificial intelligence.

    Dataset Content:

    This QA dataset comprises a diverse set of open-ended questions paired with corresponding answers in Polish. No context paragraph is provided; each question is answered without any predefined context. The questions cover a broad range of topics, including science, history, technology, geography, literature, current affairs, and more.

    Each question is accompanied by an answer, providing valuable information and insights to enhance the language model training process. Both the questions and answers were manually curated by native Polish people, and references were taken from diverse sources like books, news articles, websites, and other reliable references.

    This question-answer prompt completion dataset contains different types of prompts, including instruction type, continuation type, and in-context learning (zero-shot, few-shot) type. The dataset also contains questions and answers with different types of rich text, including tables, code, JSON, etc., with proper markdown.

    Question Diversity:

    To ensure diversity, this Q&A dataset includes questions with varying complexity levels, ranging from easy to medium and hard. Different types of questions, such as multiple-choice, direct, and true/false, are included. Additionally, questions are further classified into fact-based and opinion-based categories, creating a comprehensive variety. The QA dataset also contains the question with constraints and persona restrictions, which makes it even more useful for LLM training.

    Answer Formats:

    To accommodate varied learning experiences, the dataset incorporates different types of answer formats, including single-word, short-phrase, single-sentence, and paragraph-length answers. Answers contain text strings, numerical values, and date and time formats. Such diversity strengthens the language model's ability to generate coherent and contextually appropriate answers.

    Data Format and Annotation Details:

    This fully labeled Polish Open Ended Question Answer Dataset is available in JSON and CSV formats. It includes annotation details such as id, language, domain, question_length, prompt_type, question_category, question_type, complexity, answer_type, rich_text.
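
    A minimal sketch of working with the described JSON export by its annotation fields; the field names follow the description above, but the sample records and values are invented.

    ```python
    # Hedged sketch: filter records of the described JSON export by their
    # annotation fields (sample values are invented for illustration).
    import json

    records = json.loads("""
    [
     {"id": "q1", "language": "pl", "domain": "science", "prompt_type": "instruction",
      "complexity": "easy", "answer_type": "short_phrase", "rich_text": false},
     {"id": "q2", "language": "pl", "domain": "history", "prompt_type": "continuation",
      "complexity": "hard", "answer_type": "paragraph", "rich_text": true}
    ]
    """)

    # Select only the hard questions, e.g. to build a challenge subset.
    hard_only = [r["id"] for r in records if r["complexity"] == "hard"]
    print(hard_only)  # ['q2']
    ```

    The same field-based filtering applies to `prompt_type`, `domain`, or `rich_text` when assembling targeted training mixes.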

    Quality and Accuracy:

    The dataset upholds the highest standards of quality and accuracy. Each question undergoes careful validation, and the corresponding answers are thoroughly verified. To prioritize inclusivity, the dataset incorporates questions and answers representing diverse perspectives and writing styles, ensuring it remains unbiased and avoids perpetuating discrimination.

    Both the questions and answers in Polish are grammatically accurate, without any spelling or grammatical errors. No copyrighted, toxic, or harmful content was used while building this dataset.

    Continuous Updates and Customization:

    The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Continuous efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to collect custom question-answer data tailored to specific needs, providing flexibility and customization options.

    License:

    The dataset, created by FutureBeeAI, is now ready for commercial use. Researchers, data scientists, and developers can utilize this fully labeled and ready-to-deploy Polish Open Ended Question Answer Dataset to enhance the language understanding capabilities of their generative AI models, improve response generation, and explore new approaches to NLP question-answering tasks.

  18. Notable AI Models

    • epoch.ai
    csv
    Cite
    Epoch AI, Notable AI Models [Dataset]. https://epoch.ai/data/notable-ai-models
    Explore at:
    csvAvailable download formats
    Dataset authored and provided by
    Epoch AI
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Global
    Variables measured
    https://epoch.ai/data/notable-ai-models-documentation#records
    Measurement technique
    https://epoch.ai/data/notable-ai-models-documentation#records
    Description

    Our most comprehensive database of AI models, containing over 800 models that are state of the art, highly cited, or otherwise historically notable. It tracks key factors driving machine learning progress and includes over 300 training compute estimates.

  19. llm-exam-deberta-v3-large_v1

    • kaggle.com
    Updated Sep 25, 2023
    Cite
    Zhe Sun (2023). llm-exam-deberta-v3-large_v1 [Dataset]. https://www.kaggle.com/datasets/alex821/llm-exam-deberta-v3-large-v1
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 25, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Zhe Sun
    Description

    Dataset

    This dataset was created by Zhe Sun

    Contents

  20. Arabic Closed Ended Question Answer Text Dataset

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Arabic Closed Ended Question Answer Text Dataset [Dataset]. https://www.futurebeeai.com/dataset/prompt-response-dataset/arabic-closed-ended-question-answer-text-dataset
    Explore at:
    wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    What’s Included

    The Arabic Closed-Ended Question Answering Dataset is a meticulously curated collection of 5000 comprehensive Question-Answer pairs. It serves as a valuable resource for training Large Language Models (LLMs) and question-answering models in the Arabic language, advancing the field of artificial intelligence.

    Dataset Content:

    This closed-ended QA dataset comprises a diverse set of context paragraphs and questions paired with corresponding answers in Arabic. There is a context paragraph given for each question to get the answer from. The questions cover a broad range of topics, including science, history, technology, geography, literature, current affairs, and more.

    Each question is accompanied by an answer, providing valuable information and insights to enhance the language model training process. Both the questions and answers were manually curated by native Arabic people, and references were taken from diverse sources like books, news articles, websites, web forums, and other reliable references.

    This question-answer prompt completion dataset contains different types of prompts, including instruction type, continuation type, and in-context learning (zero-shot, few-shot) type. The dataset also contains questions and answers with different types of rich text, including tables, code, JSON, etc., with proper markdown.

    Question Diversity:

    To ensure diversity, this Q&A dataset includes questions with varying complexity levels, ranging from easy to medium and hard. Different types of questions, such as multiple-choice, direct, and true/false, are included. The QA dataset also contains questions with constraints, which makes it even more useful for LLM training.

    Answer Formats:

    To accommodate varied learning experiences, the dataset incorporates different types of answer formats, including single-word, short-phrase, single-sentence, and paragraph-length answers. Answers contain text strings, numerical values, and date and time formats. Such diversity strengthens the language model's ability to generate coherent and contextually appropriate answers.

    Data Format and Annotation Details:

    This fully labeled Arabic Closed-Ended Question Answer Dataset is available in JSON and CSV formats. It includes annotation details such as a unique id, context paragraph, context reference link, question, question type, question complexity, question category, domain, prompt type, answer, answer type, and rich text presence.
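
    As a hedged sketch, the context-plus-question layout described above might be assembled into a training prompt as follows; the record values and the `to_prompt` helper are invented for illustration, not the dataset's actual export.

    ```python
    # Illustrative sketch: build a closed-ended QA training prompt from the
    # annotated fields described above (record values here are invented).
    record = {
        "id": "ar_0001",
        "context": "A context paragraph the answer must be drawn from.",
        "question": "An example question?",
        "answer": "An example answer.",
    }

    def to_prompt(rec: dict) -> str:
        # Closed-ended QA: the model must answer strictly from the given context.
        return f"Context: {rec['context']}\nQuestion: {rec['question']}\nAnswer:"

    prompt = to_prompt(record)
    print(prompt.startswith("Context:"))  # True
    ```

    During training, the annotated `answer` field would serve as the completion target for this prompt.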

    Quality and Accuracy:

    The dataset upholds the highest standards of quality and accuracy. Each question undergoes careful validation, and the corresponding answers are thoroughly verified. To prioritize inclusivity, the dataset incorporates questions and answers representing diverse perspectives and writing styles, ensuring it remains unbiased and avoids perpetuating discrimination.

    The Arabic version is grammatically accurate, without any spelling or grammatical errors. No toxic or harmful content was used while building this dataset.

    Continuous Updates and Customization:

    The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Continuous efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to collect custom question-answer data tailored to specific needs, providing flexibility and customization options.

    License:

    The dataset, created by FutureBeeAI, is now ready for commercial use. Researchers, data scientists, and developers can utilize this fully labeled and ready-to-deploy Arabic Closed-Ended Question Answer Dataset to enhance the language understanding capabilities of their generative AI models, improve response generation, and explore new approaches to NLP question-answering tasks.
