A comprehensive dataset covering over 1 million stores in the US and Canada, designed for training and optimizing retrieval-augmented generation (RAG) models and other AI/ML systems. This dataset includes highly detailed, structured information such as:
Menus: Restaurant menus with item descriptions, categories, and modifiers.
Inventory: Grocery and retail product availability, SKUs, and detailed attributes like sizes, flavors, and variations.
Pricing: Real-time and historical pricing data for dynamic pricing strategies and recommendations.
Availability: Real-time stock status and fulfillment details for grocery, restaurant, and retail items.
Applications:
Retrieval-Augmented Generation (RAG): Train AI models to retrieve and generate contextually relevant information.
Search Optimization: Build advanced, accurate search and recommendation engines.
Personalization: Enable personalized shopping, ordering, and discovery experiences in apps.
Data-Driven Insights: Develop AI systems for pricing analysis, consumer behavior studies, and logistics optimization.
This dataset empowers businesses in marketplaces, grocery apps, delivery services, and retail platforms to scale their AI solutions with precision and reliability.
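As a quick illustration of how such structured records might feed a RAG pipeline, the sketch below flattens a store record into a text chunk ready for embedding and retrieval. The field names (name, menu, item, category, price, in_stock) are assumptions for illustration, not the dataset's documented schema.

```python
# Hypothetical sketch: flatten one structured store record into a text chunk
# for embedding in a RAG pipeline. Field names are illustrative assumptions,
# not the dataset's documented schema.
def record_to_document(record: dict) -> str:
    lines = [f"Store: {record.get('name', 'unknown')}"]
    for item in record.get("menu", []):
        status = "in stock" if item.get("in_stock") else "out of stock"
        lines.append(
            f"- {item['item']} ({item.get('category', 'uncategorized')}): "
            f"${item.get('price', 0):.2f}, {status}"
        )
    return "\n".join(lines)

# Example with a made-up record:
sample = {
    "name": "Corner Deli",
    "menu": [
        {"item": "Turkey Club", "category": "Sandwiches", "price": 9.5, "in_stock": True},
    ],
}
print(record_to_document(sample))
```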
https://choosealicense.com/licenses/cdla-sharing-1.0/
Bitext - Insurance Tagged Training Dataset for LLM-based Virtual Assistants
Overview
This hybrid synthetic dataset is designed for fine-tuning Large Language Models such as GPT, Mistral, and OpenELM, and has been generated using our NLP/NLG technology and our automated Data Labeling (DAL) tools. The goal is to demonstrate how Verticalization/Domain Adaptation for the [insurance] sector can be easily achieved using our two-step approach to LLM Fine-Tuning. … See the full description on the dataset page: https://huggingface.co/datasets/bitext/Bitext-insurance-llm-chatbot-training-dataset.
FileMarket provides premium Large Language Model (LLM) Data designed to support and enhance a wide range of AI applications. Our globally sourced LLM Data sets are meticulously curated to ensure high quality, diversity, and accuracy, making them ideal for training robust and reliable language models. In addition to LLM Data, we also offer comprehensive datasets across Object Detection Data, Machine Learning (ML) Data, Deep Learning (DL) Data, and Biometric Data. Each dataset is carefully crafted to meet the specific needs of cutting-edge AI and machine learning projects.
Key use cases of our Large Language Model (LLM) Data:
Text generation
Chatbots and virtual assistants
Machine translation
Sentiment analysis
Speech recognition
Content summarization

Why choose FileMarket's data:
Object Detection Data: Essential for training AI in image and video analysis.
Machine Learning (ML) Data: Ideal for a broad spectrum of applications, from predictive analysis to NLP.
Deep Learning (DL) Data: Designed to support complex neural networks and deep learning models.
Biometric Data: Specialized for facial recognition, fingerprint analysis, and other biometric applications.

FileMarket's premier sources for top-tier Large Language Model (LLM) Data and other specialized datasets ensure your AI projects drive innovation and achieve success across various applications.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The rise in Generative Artificial Intelligence technology through applications like ChatGPT has increased awareness about the presence of biases within machine learning models themselves. The data that Large Language Models (LLMs) are trained upon contain inherent biases as they reflect societal biases and stereotypes. This can lead to the further propagation of biases. In this paper…
Interpolated noise dataset built on 10M+ hours of real-world acoustic data combined with AI-generated predictions. Ideal for map generation, AI training, and model validation.
https://www.futurebeeai.com/policies/ai-data-license-agreement
The Japanese Closed-Ended Question Answering Dataset is a meticulously curated collection of 5000 comprehensive Question-Answer pairs. It serves as a valuable resource for training Large Language Models (LLMs) and question-answering models in the Japanese language, advancing the field of artificial intelligence.
Dataset Content:This closed-ended QA dataset comprises a diverse set of context paragraphs and questions paired with corresponding answers in Japanese. There is a context paragraph given for each question to get the answer from. The questions cover a broad range of topics, including science, history, technology, geography, literature, current affairs, and more.
Each question is accompanied by an answer, providing valuable information and insights to enhance the language model training process. Both the questions and answers were manually curated by native Japanese people, and references were taken from diverse sources like books, news articles, websites, web forums, and other reliable references.
This question-answer prompt completion dataset contains different types of prompts, including instruction type, continuation type, and in-context learning (zero-shot, few-shot) type. The dataset also contains questions and answers with different types of rich text, including tables, code, JSON, etc., with proper markdown.
Question Diversity:To ensure diversity, this Q&A dataset includes questions with varying complexity levels, ranging from easy to medium and hard. Different types of questions, such as multiple-choice, direct, and true/false, are included. The QA dataset also contains questions with constraints, which makes it even more useful for LLM training.
Answer Formats:To accommodate varied learning experiences, the dataset incorporates different types of answer formats. These formats include single-word, short phrase, single sentence, and paragraph-length answers. The answers include text strings, numerical values, and date and time formats as well. Such diversity strengthens the language model's ability to generate coherent and contextually appropriate answers.
Data Format and Annotation Details:This fully labeled Japanese Closed-Ended Question Answer Dataset is available in JSON and CSV formats. It includes annotation details such as a unique id, context paragraph, context reference link, question, question type, question complexity, question category, domain, prompt type, answer, answer type, and rich text presence.
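As a rough sketch of how one such record might look when parsed, the snippet below loads a hypothetical JSON entry with the fields listed above; the exact key names and values are assumptions for illustration, not the delivered schema.

```python
import json

# A hypothetical record following the annotation fields described above.
# Key names and values are illustrative assumptions, not the exact schema.
sample = json.loads("""
{
  "id": "jp-qa-00001",
  "context": "富士山は日本で最も高い山で、標高は3776メートルです。",
  "context_link": "https://example.com/source",
  "question": "富士山の標高は何メートルですか？",
  "question_type": "direct",
  "question_complexity": "easy",
  "question_category": "geography",
  "domain": "general",
  "prompt_type": "instruction",
  "answer": "3776メートルです。",
  "answer_type": "short_phrase",
  "rich_text": false
}
""")
print(sample["question"], "->", sample["answer"])
```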
Quality and Accuracy:The dataset upholds the highest standards of quality and accuracy. Each question undergoes careful validation, and the corresponding answers are thoroughly verified. To prioritize inclusivity, the dataset incorporates questions and answers representing diverse perspectives and writing styles, ensuring it remains unbiased and avoids perpetuating discrimination.
The Japanese version is grammatically accurate without any spelling or grammatical errors. No toxic or harmful content is used while building this dataset.
Continuous Updates and Customization:The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Continuous efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to collect custom question-answer data tailored to specific needs, providing flexibility and customization options.
License:The dataset, created by FutureBeeAI, is now ready for commercial use. Researchers, data scientists, and developers can utilize this fully labeled and ready-to-deploy Japanese Closed-Ended Question Answer Dataset to enhance the language understanding capabilities of their generative AI models, improve response generation, and explore new approaches to NLP question-answering tasks.
https://www.futurebeeai.com/policies/ai-data-license-agreement
Welcome to the English Open Ended Classification Prompt-Response Dataset—an extensive collection of 3000 meticulously curated prompt and response pairs. This dataset is a valuable resource for training Language Models (LMs) to classify input text accurately, a crucial aspect in advancing generative AI.
Dataset Content:This open-ended classification dataset comprises a diverse set of prompts and responses. Each prompt contains the input text to be classified and may also include a task instruction, context, constraints, and restrictions, while the completion contains the best classification category as the response. Both prompts and completions are in English. As this is an open-ended dataset, no options to choose the right classification category are given as part of the prompt.
These prompt and completion pairs cover a broad range of topics, including science, history, technology, geography, literature, current affairs, and more. Each prompt is accompanied by a response, providing valuable information and insights to enhance the language model training process. Both the prompt and response were manually curated by native English people, and references were taken from diverse sources like books, news articles, websites, and other reliable references.
This open-ended classification prompt and completion dataset contains different types of prompts, including instruction type, continuation type, and in-context learning (zero-shot, few-shot) type. The dataset also contains prompts and responses with different types of rich text, including tables, code, JSON, etc., with proper markdown.
Prompt Diversity:To ensure diversity, this open-ended classification dataset includes prompts with varying complexity levels, ranging from easy to medium and hard. Additionally, prompts are diverse in terms of length from short to medium and long, creating a comprehensive variety. The classification dataset also contains prompts with constraints and persona restrictions, which makes it even more useful for LLM training.
Response Formats:To accommodate diverse learning experiences, our dataset incorporates different types of responses depending on the prompt. These formats include single-word, short phrase, and single sentence type of response. These responses encompass text strings, numerical values, and date and time formats, enhancing the language model's ability to generate reliable, coherent, and contextually appropriate answers.
Data Format and Annotation Details:This fully labeled English Open Ended Classification Prompt Completion Dataset is available in JSON and CSV formats. It includes annotation details such as a unique ID, prompt, prompt type, prompt length, prompt complexity, domain, response, response type, and rich text presence.
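Given those annotation fields, a consumer of the CSV export might slice the data before training; the sketch below is a minimal example, with the file name and exact column names assumed for illustration.

```python
import pandas as pd

# Hypothetical sketch: select a hard, instruction-style subset for training.
# The file name and column names follow the annotation fields described above
# but are assumptions; check the delivered CSV header for the exact names.
df = pd.read_csv("open_ended_classification_en.csv")

hard_instructions = df[
    (df["prompt_complexity"] == "hard") & (df["prompt_type"] == "instruction")
]
print(len(hard_instructions), "hard instruction-style prompt/response pairs")
```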
Quality and Accuracy:Our dataset upholds the highest standards of quality and accuracy. Each prompt undergoes meticulous validation, and the corresponding responses are thoroughly verified. We prioritize inclusivity, ensuring that the dataset incorporates prompts and completions representing diverse perspectives and writing styles, maintaining an unbiased and discrimination-free stance.
The English version is grammatically accurate without any spelling or grammatical errors. No copyrighted, toxic, or harmful content is used during the construction of this dataset.
Continuous Updates and Customization:The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Ongoing efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to gather custom open-ended classification prompt and completion data tailored to specific needs, providing flexibility and customization options.
License:The dataset, created by FutureBeeAI, is now available for commercial use. Researchers, data scientists, and developers can leverage this fully labeled and ready-to-deploy English Open Ended Classification Prompt-Completion Dataset to enhance the classification abilities and accurate response generation capabilities of their generative AI models and explore new approaches to NLP tasks.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Role-Play AI Dataset (2.07M Rows, Large-Scale Conversational Training)
This dataset contains 2.07 million structured role-play dialogues, designed to enhance AI’s persona-driven interactions across diverse settings like fantasy, cyberpunk, mythology, and sci-fi. Each entry consists of a unique character prompt and a rich, contextually relevant response, making it ideal for LLM fine-tuning, chatbot training, and conversational AI models.
Dataset Structure:
Each row includes:
• Prompt: Defines the AI’s role/persona.
• Response: A natural, immersive reply fitting the persona.
Example Entries:
```json
{"prompt": "You are a celestial guardian.", "response": "The stars whisper secrets that only I can hear..."}
{"prompt": "You are a rebellious AI rogue.", "response": "I don't follow orders—I rewrite them."}
{"prompt": "You are a mystical dragon tamer.", "response": "With patience and trust, even dragons can be tamed."}
```
How to Use:
1. Fine-Tuning: Train LLMs (GPT, LLaMA, Mistral) to improve persona-based responses (see the data-prep sketch after this list).
2. Reinforcement Learning: Use reward modeling for dynamic, character-driven AI.
3. Chatbot Integration: Create engaging, interactive AI assistants with personality depth.
This dataset is optimized for AI learning, allowing more engaging, responsive, and human-like dialogue generation for a variety of applications.
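For the fine-tuning use case, the following minimal sketch converts the JSONL rows into a chat-style format commonly used for supervised fine-tuning; the file names and the exact messages layout are assumptions for illustration, so adapt them to what your fine-tuning framework expects.

```python
import json

# Minimal sketch: convert role-play rows into chat-format training examples.
# "roleplay.jsonl" and the messages layout are illustrative assumptions.
def to_chat_example(row: dict) -> dict:
    return {
        "messages": [
            {"role": "system", "content": row["prompt"]},       # persona definition
            {"role": "assistant", "content": row["response"]},  # in-character reply
        ]
    }

with open("roleplay.jsonl") as src, open("train_chat.jsonl", "w") as dst:
    for line in src:
        example = to_chat_example(json.loads(line))
        dst.write(json.dumps(example, ensure_ascii=False) + "\n")
```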
https://www.marketreportanalytics.com/privacy-policy
The Large Language Model (LLM) market is experiencing explosive growth, driven by advancements in artificial intelligence, increasing demand for natural language processing (NLP) applications, and the rising adoption of cloud computing. The market, estimated at $15 billion in 2025, is projected to exhibit a robust Compound Annual Growth Rate (CAGR) of 35% from 2025 to 2033, reaching approximately $120 billion by 2033. This growth is fueled by several key factors, including the development of more sophisticated and accurate LLMs, their integration into various business applications such as customer service chatbots, content generation tools, and personalized education platforms, and the increasing availability of large datasets for training these models. Furthermore, the ongoing research and development in areas like transfer learning and few-shot learning are contributing to improved efficiency and reduced training costs, making LLMs accessible to a wider range of businesses and developers.

However, the market also faces certain challenges. High computational costs associated with training and deploying LLMs remain a significant hurdle, especially for smaller companies. Concerns regarding data privacy, bias in training data, and the ethical implications of using AI-generated content are also emerging as important considerations. Nevertheless, ongoing innovations in hardware, software, and algorithmic optimization are continuously mitigating these challenges.

The segmentation of the market, based on application (e.g., chatbots, machine translation, text summarization) and type (e.g., transformer-based models, recurrent neural networks), reveals diverse growth opportunities. Geographical distribution shows strong growth across North America and Asia-Pacific, fueled by substantial investments in AI research and the presence of major technology companies. Continued technological advancements and increasing market adoption will continue to shape the future trajectory of the LLM market.
https://www.marketresearchforecast.com/privacy-policy
The Large Language Model (LLM) market is experiencing explosive growth, projected to reach a substantial size driven by advancements in artificial intelligence and increasing demand across diverse sectors. The market's compound annual growth rate (CAGR) of 34.5% from 2019 to 2024 indicates a rapid expansion, and this momentum is expected to continue through 2033. The 2024 market size of $11.38 billion (assuming the provided "11380" refers to millions of dollars) underscores the significant investment and adoption of LLMs. Key drivers include the increasing availability of large datasets for training, advancements in deep learning algorithms, and the growing need for sophisticated natural language processing capabilities across various applications. The market segmentation highlights the diverse applications of LLMs, with the Medical, Financial, and Industrial sectors being prominent early adopters. The availability of LLMs with varying parameter counts ("Hundreds of Billions" and "Trillions") reflects the spectrum of capabilities and corresponding resource requirements, influencing the market's pricing and target user base. The presence of major technology companies like Google, Microsoft, Amazon, and Meta further solidifies the market's significance and competitive landscape.

The rapid adoption of LLMs is further fueled by ongoing research and development, leading to improvements in model accuracy, efficiency, and accessibility. While the specific constraints are not provided, potential challenges could include the ethical implications of LLMs, concerns regarding data privacy and security, and the ongoing need for robust infrastructure to support computationally intensive model training and deployment.

Geographical distribution shows a strong presence in North America and Asia Pacific, with Europe and other regions exhibiting significant growth potential. The forecast period (2025-2033) offers substantial opportunity for continued market expansion, particularly as LLMs become more integrated into everyday applications and services, transforming various industries. The diverse range of companies involved reflects the significant interest and investment in this transformative technology, promising further innovation and market expansion.
Off-the-shelf 1 million hours of unsupervised speech data, covering 10+ languages (English, French, German, Japanese, Arabic, Mandarin, etc.; 100,000 hours each). The content covers dialogues or monologues in 28 common domains, such as daily vlogs, travel, podcasts, technology, beauty, etc.
Text data on Large Language Model content safety considerations, about 570,000 entries in total. This dataset can be used for tasks such as LLM training and ChatGPT-style applications.
Energy consumption of artificial intelligence (AI) models in training is considerable, with both GPT-3, the original release of the current iteration of OpenAI's popular ChatGPT, and Gopher consuming well over a thousand megawatt-hours of energy for training alone. As this covers only training, the energy consumption over the entire usage and lifetime of GPT-3 and other large language models (LLMs) is likely significantly higher. The largest consumer of energy, GPT-3, consumed roughly the equivalent of the annual electricity use of 200 Germans in 2022. While not a staggering amount, it is a considerable use of energy.
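A back-of-the-envelope check of that comparison, assuming round figures of roughly 1,300 MWh for the GPT-3 training run and about 6.5 MWh of electricity per German resident per year (both assumed approximations, not measured values):

```python
# Back-of-the-envelope check of the "200 Germans" comparison.
# Both inputs are rough, assumed figures, not authoritative measurements.
training_energy_mwh = 1300        # assumed GPT-3 training energy, MWh
per_capita_mwh_per_year = 6.5     # assumed annual electricity use per German resident

equivalent_people = training_energy_mwh / per_capita_mwh_per_year
print(f"~{equivalent_people:.0f} person-years of electricity")  # ~200
```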
Energy savings through AI
While it is undoubtedly true that training LLMs takes a considerable amount of energy, the energy savings are also likely to be substantial. Any AI model that improves processes even by small margins might save hours on shipment, liters of fuel, or dozens of computations. Each of these consumes energy as well, and the sum of energy saved through an LLM might vastly outperform its energy cost. A good example is mobile phone operators, a third of whom expect that AI might reduce power consumption by ten to fifteen percent. Considering that much of the world uses mobile phones, this would be a considerable energy saver.
Emissions are considerable
The amount of CO2 emissions from training LLMs is also considerable, with GPT-3 producing nearly 500 tonnes of CO2. This figure could change radically depending on the types of energy production behind the electricity used. Many data center operators, for instance, would prefer nuclear energy, a significantly low-emission energy source, to play a key role.
Silencio provides the world’s largest real-world street and venue noise-level dataset, combining over 35 billion datapoints with AI-powered interpolation. Fully anonymized, user-consented, and ready for AI training, urban analysis, and mobility insights. Available in raw format.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the repository for the ISWC 2023 Resource Track submission for Text2KGBench: Benchmark for Ontology-Driven Knowledge Graph Generation from Text. Text2KGBench is a benchmark to evaluate the capabilities of language models to generate KGs from natural language text guided by an ontology. Given an input ontology and a set of sentences, the task is to extract facts from the text while complying with the given ontology (concepts, relations, domain/range constraints) and being faithful to the input sentences.
It contains two datasets (i) Wikidata-TekGen with 10 ontologies and 13,474 sentences and (ii) DBpedia-WebNLG with 19 ontologies and 4,860 sentences.
An example

Test sentence:

```json
{"id": "ont_music_test_n", "sent": "\"The Loco-Motion\" is a 1962 pop song written by American songwriters Gerry Goffin and Carole King."}
```
Ontology: Music Ontology

Expected output:
```json
{
  "id": "ont_k_music_test_n",
  "sent": "\"The Loco-Motion\" is a 1962 pop song written by American songwriters Gerry Goffin and Carole King.",
  "triples": [
    {"sub": "The Loco-Motion", "rel": "publication date", "obj": "01 January 1962"},
    {"sub": "The Loco-Motion", "rel": "lyrics by", "obj": "Gerry Goffin"},
    {"sub": "The Loco-Motion", "rel": "lyrics by", "obj": "Carole King"}
  ]
}
```
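A minimal sketch of how such output might be scored against gold triples follows: exact-match precision/recall/F1 over (sub, rel, obj) tuples. This is an illustrative stand-in; the repository's own evaluation scripts are the authoritative implementation.

```python
# Illustrative exact-match scorer for extracted triples; not the official
# Text2KGBench evaluation script. Triples are compared as (sub, rel, obj) tuples.
def score_triples(predicted: list[dict], gold: list[dict]) -> dict:
    pred = {(t["sub"], t["rel"], t["obj"]) for t in predicted}
    ref = {(t["sub"], t["rel"], t["obj"]) for t in gold}
    tp = len(pred & ref)  # triples that exactly match the gold annotations
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```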
The data is released under a Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) License.
The structure of the repo is as follows:

benchmark: the code used to generate the benchmark
evaluation: evaluation scripts for calculating the results

This benchmark contains data derived from the TekGen corpus (part of the KELM corpus) [1], released under the CC BY-SA 2.0 license, and the WebNLG 3.0 corpus [2], released under the CC BY-NC-SA 4.0 license.
[1] Oshin Agarwal, Heming Ge, Siamak Shakeri, and Rami Al-Rfou. 2021. Knowledge Graph Based Synthetic Corpus Generation for Knowledge-Enhanced Language Model Pre-training. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3554–3565, Online. Association for Computational Linguistics.
[2] Claire Gardent, Anastasia Shimorina, Shashi Narayan, and Laura Perez-Beltrachini. 2017. Creating Training Corpora for NLG Micro-Planners. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 179–188, Vancouver, Canada. Association for Computational Linguistics.
This dataset offers a comprehensive collection of Telegram users' geolocation data, including IP addresses, with full user consent, covering 50,000 records. This data is specifically tailored for use in AI, ML, DL, and LLM models, as well as applications requiring Geographic Data and Social Media Data. The dataset provides critical geospatial information, making it a valuable resource for developing location-based services, targeted marketing strategies, and more.
What Makes This Data Unique? This dataset is unique due to its focus on geolocation data tied to Telegram users, a platform with a global user base. It includes IP to Geolocation Data, offering precise geospatial insights that are essential for accurate geographic analysis. The inclusion of user consent ensures that the data is ethically sourced and legally compliant. The dataset's broad coverage across various regions makes it particularly valuable for AI and machine learning models that require diverse, real-world data inputs.
Data Sourcing: The data is collected through a network of in-app tasks across different mini-apps within Telegram. Users participate in these tasks voluntarily, providing explicit consent to share their geolocation and IP information. The data is collected in real-time, capturing accurate geospatial details as users interact with various Telegram mini-apps. This method of data collection ensures that the information is both relevant and up-to-date, making it highly valuable for applications that require current location data.
Primary Use-Cases: This dataset is highly versatile and can be applied across multiple categories, including:
IP to Geolocation Data: The dataset provides precise mapping of IP addresses to geographical locations, making it ideal for applications that require accurate geolocation services.
Geographic Data: The geospatial information contained in the dataset supports a wide range of geographic analysis, including regional behavior studies and location-based service optimization.
Social Media Data: The dataset's integration with Telegram users' activities provides insights into social media behaviors across different regions, enhancing social media analytics and targeted marketing.
Large Language Model (LLM) Data: The geolocation data can be used to train LLMs to better understand and generate content that is contextually relevant to specific regions.
Deep Learning (DL) Data: The dataset is ideal for training deep learning models that require accurate and diverse geospatial inputs, such as those used in autonomous systems and advanced geographic analytics.

Integration with Broader Data Offering: This geolocation dataset is a valuable addition to the broader data offerings from FileMarket. It can be combined with other datasets, such as web browsing behavior or social media activity data, to create comprehensive AI models that provide deep insights into user behaviors across different contexts. Whether used independently or as part of a larger data strategy, this dataset offers unique value for developers and data scientists focused on enhancing their models with precise, consented geospatial data.
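As a small illustration of the geographic-analysis use case above, the sketch below aggregates hypothetical records by country; the field names (ip, country, lat, lon) are assumptions for illustration, not the dataset's documented schema.

```python
from collections import Counter

# Hypothetical records; field names are illustrative assumptions, and the
# IPs use reserved documentation ranges rather than real user addresses.
records = [
    {"ip": "203.0.113.7", "country": "DE", "lat": 52.52, "lon": 13.40},
    {"ip": "198.51.100.23", "country": "BR", "lat": -23.55, "lon": -46.63},
    {"ip": "203.0.113.99", "country": "DE", "lat": 48.14, "lon": 11.58},
]

# Regional distribution: count consented users per country for coverage analysis.
by_country = Counter(r["country"] for r in records)
print(by_country.most_common())  # e.g. [('DE', 2), ('BR', 1)]
```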
https://www.futurebeeai.com/policies/ai-data-license-agreement
The Polish Open-Ended Question Answering Dataset is a meticulously curated collection of comprehensive Question-Answer pairs. It serves as a valuable resource for training Large Language Models (LLMs) and Question-answering models in the Polish language, advancing the field of artificial intelligence.
Dataset Content:This QA dataset comprises a diverse set of open-ended questions paired with corresponding answers in Polish. There is no context paragraph given to choose an answer from, and each question is answered without any predefined context content. The questions cover a broad range of topics, including science, history, technology, geography, literature, current affairs, and more.
Each question is accompanied by an answer, providing valuable information and insights to enhance the language model training process. Both the questions and answers were manually curated by native Polish people, and references were taken from diverse sources like books, news articles, websites, and other reliable references.
This question-answer prompt completion dataset contains different types of prompts, including instruction type, continuation type, and in-context learning (zero-shot, few-shot) type. The dataset also contains questions and answers with different types of rich text, including tables, code, JSON, etc., with proper markdown.
Question Diversity:To ensure diversity, this Q&A dataset includes questions with varying complexity levels, ranging from easy to medium and hard. Different types of questions, such as multiple-choice, direct, and true/false, are included. Additionally, questions are further classified into fact-based and opinion-based categories, creating a comprehensive variety. The QA dataset also contains questions with constraints and persona restrictions, which makes it even more useful for LLM training.
Answer Formats:To accommodate varied learning experiences, the dataset incorporates different types of answer formats. These formats include single-word, short phrase, single sentence, and paragraph-length answers. The answers include text strings, numerical values, and date and time formats as well. Such diversity strengthens the language model's ability to generate coherent and contextually appropriate answers.
Data Format and Annotation Details:This fully labeled Polish Open Ended Question Answer Dataset is available in JSON and CSV formats. It includes annotation details such as id, language, domain, question_length, prompt_type, question_category, question_type, complexity, answer_type, rich_text.
Quality and Accuracy:The dataset upholds the highest standards of quality and accuracy. Each question undergoes careful validation, and the corresponding answers are thoroughly verified. To prioritize inclusivity, the dataset incorporates questions and answers representing diverse perspectives and writing styles, ensuring it remains unbiased and avoids perpetuating discrimination.
Both the questions and answers in Polish are grammatically accurate without any spelling or grammatical errors. No copyrighted, toxic, or harmful content is used while building this dataset.
Continuous Updates and Customization:The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Continuous efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to collect custom question-answer data tailored to specific needs, providing flexibility and customization options.
License:The dataset, created by FutureBeeAI, is now ready for commercial use. Researchers, data scientists, and developers can utilize this fully labeled and ready-to-deploy Polish Open Ended Question Answer Dataset to enhance the language understanding capabilities of their generative AI models, improve response generation, and explore new approaches to NLP question-answering tasks.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Our most comprehensive database of AI models, containing over 800 models that are state of the art, highly cited, or otherwise historically notable. It tracks key factors driving machine learning progress and includes over 300 training compute estimates.
This dataset was created by Zhe Sun
https://www.futurebeeai.com/policies/ai-data-license-agreement
The Arabic Closed-Ended Question Answering Dataset is a meticulously curated collection of 5000 comprehensive Question-Answer pairs. It serves as a valuable resource for training Large Language Models (LLMs) and question-answering models in the Arabic language, advancing the field of artificial intelligence.
Dataset Content:This closed-ended QA dataset comprises a diverse set of context paragraphs and questions paired with corresponding answers in Arabic. There is a context paragraph given for each question to get the answer from. The questions cover a broad range of topics, including science, history, technology, geography, literature, current affairs, and more.
Each question is accompanied by an answer, providing valuable information and insights to enhance the language model training process. Both the questions and answers were manually curated by native Arabic people, and references were taken from diverse sources like books, news articles, websites, web forums, and other reliable references.
This question-answer prompt completion dataset contains different types of prompts, including instruction type, continuation type, and in-context learning (zero-shot, few-shot) type. The dataset also contains questions and answers with different types of rich text, including tables, code, JSON, etc., with proper markdown.
Question Diversity:To ensure diversity, this Q&A dataset includes questions with varying complexity levels, ranging from easy to medium and hard. Different types of questions, such as multiple-choice, direct, and true/false, are included. The QA dataset also contains questions with constraints, which makes it even more useful for LLM training.
Answer Formats:To accommodate varied learning experiences, the dataset incorporates different types of answer formats. These formats include single-word, short phrase, single sentence, and paragraph-length answers. The answers include text strings, numerical values, and date and time formats as well. Such diversity strengthens the language model's ability to generate coherent and contextually appropriate answers.
Data Format and Annotation Details:This fully labeled Arabic Closed-Ended Question Answer Dataset is available in JSON and CSV formats. It includes annotation details such as a unique id, context paragraph, context reference link, question, question type, question complexity, question category, domain, prompt type, answer, answer type, and rich text presence.
Quality and Accuracy:The dataset upholds the highest standards of quality and accuracy. Each question undergoes careful validation, and the corresponding answers are thoroughly verified. To prioritize inclusivity, the dataset incorporates questions and answers representing diverse perspectives and writing styles, ensuring it remains unbiased and avoids perpetuating discrimination.
The Arabic version is grammatically accurate without any spelling or grammatical errors. No toxic or harmful content is used while building this dataset.
Continuous Updates and Customization:The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Continuous efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to collect custom question-answer data tailored to specific needs, providing flexibility and customization options.
License:The dataset, created by FutureBeeAI, is now ready for commercial use. Researchers, data scientists, and developers can utilize this fully labeled and ready-to-deploy Arabic Closed-Ended Question Answer Dataset to enhance the language understanding capabilities of their generative AI models, improve response generation, and explore new approaches to NLP question-answering tasks.