36 datasets found

G
Data Labeling with LLMs Market Research Report 2033
growthmarketreports.com
csv, pdf, pptx
Updated Oct 6, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Growth Market Reports (2025). Data Labeling with LLMs Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/data-labeling-with-llms-market
Explore at:
csv, pdf, pptxAvailable download formats
Dataset updated
Oct 6, 2025
Dataset authored and provided by
Growth Market Reports
Time period covered
2024 - 2032
Area covered
Global
Description
Data Labeling with LLMs Market Outlook

According to our latest research, the global Data Labeling with LLMs market size was valued at USD 2.14 billion in 2024, with a robust year-on-year growth trajectory. The market is projected to expand at a CAGR of 22.8% from 2025 to 2033, reaching a forecasted value of USD 16.6 billion by 2033. This impressive growth is primarily driven by the increasing adoption of large language models (LLMs) to automate and enhance the efficiency of data labeling processes across various industries. As organizations continue to invest in AI and machine learning, the demand for high-quality, accurately labeled datasets—essential for training and fine-tuning LLMs—continues to surge, fueling the expansion of the data labeling with LLMs market.

One of the principal growth factors for the data labeling with LLMs market is the exponential increase in the volume of unstructured data generated by businesses and consumers worldwide. Organizations are leveraging LLMs to automate the labeling of vast datasets, which is essential for training sophisticated AI models. The integration of LLMs into data labeling workflows is not only improving the speed and accuracy of the annotation process but also reducing operational costs. This technological advancement has enabled enterprises to scale their AI initiatives more efficiently, facilitating the deployment of intelligent applications across sectors such as healthcare, automotive, finance, and retail. Moreover, the continuous evolution of LLMs, with capabilities such as zero-shot and few-shot learning, is further enhancing the quality and context-awareness of labeled data, making these solutions indispensable for next-generation AI systems.

Another significant driver is the growing need for domain-specific labeled datasets, especially in highly regulated industries like healthcare and finance. In these sectors, data privacy and security are paramount, and the use of LLMs in data labeling processes ensures that sensitive information is handled with the utmost care. LLM-powered platforms are increasingly being adopted to create high-quality, compliant datasets for applications such as medical imaging analysis, fraud detection, and customer sentiment analysis. The ability of LLMs to understand context, semantics, and complex language structures is particularly valuable in these domains, where the accuracy and reliability of labeled data directly impact the performance and safety of AI-driven solutions. This trend is expected to continue as organizations strive to meet stringent regulatory requirements while accelerating their AI adoption.

Furthermore, the proliferation of AI-powered applications in emerging markets is contributing to the rapid expansion of the data labeling with LLMs market. Countries in Asia Pacific and Latin America are witnessing significant investments in digital transformation, driving the demand for scalable and efficient data annotation solutions. The availability of cloud-based data labeling platforms, combined with advancements in LLM technologies, is enabling organizations in these regions to overcome traditional barriers such as limited access to skilled annotators and high operational costs. As a result, the market is experiencing robust growth in both developed and developing economies, with enterprises increasingly recognizing the strategic value of high-quality labeled data in gaining a competitive edge.

From a regional perspective, North America currently dominates the data labeling with LLMs market, accounting for the largest share in 2024. This leadership is attributed to the presence of major technology companies, advanced research institutions, and a mature AI ecosystem. However, Asia Pacific is expected to witness the highest CAGR during the forecast period, driven by rapid digitalization, government initiatives supporting AI development, and a burgeoning startup ecosystem. Europe is also emerging as a key market, with strong demand from sectors such as automotive and healthcare. Meanwhile, Latin America and the Middle East & Africa are gradually increasing their market presence, supported by growing investments in AI infrastructure and talent development.

"https://growthmarketreports.com/request-sample/198771">
Foundation Model Data Collection and Data Annotation | Large Language...
datarade.ai
Updated Jan 25, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Nexdata (2024). Foundation Model Data Collection and Data Annotation | Large Language Model(LLM) Data | SFT Data| Red Teaming Services [Dataset]. https://datarade.ai/data-products/nexdata-foundation-model-data-solutions-llm-sft-rhlf-nexdata
Explore at:
.bin, .json, .xml, .csv, .xls, .sql, .txtAvailable download formats
Dataset updated
Jan 25, 2024
Dataset authored and provided by
Nexdata
Area covered
Portugal, Czech Republic, Malta, Ireland, Taiwan, Azerbaijan, Russian Federation, El Salvador, Kyrgyzstan, Spain
Description
Overview

Unsupervised Learning: For the training data required in unsupervised learning, Nexdata delivers data collection and cleaning services for both single-modal and cross-modal data. We provide Large Language Model(LLM) Data cleaning and personnel support services based on the specific data types and characteristics of the client's domain.

-SFT: Nexdata assists clients in generating high-quality supervised fine-tuning data for model optimization through prompts and outputs annotation.

-Red teaming: Nexdata helps clients train and validate models through drafting various adversarial attacks, such as exploratory or potentially harmful questions. Our red team capabilities help clients identify problems in their models related to hallucinations, harmful content, false information, discrimination, language bias and etc.

-RLHF: Nexdata assist clients in manually ranking multiple outputs generated by the SFT-trained model according to the rules provided by the client, or provide multi-factor scoring. By training annotators to align with values and utilizing a multi-person fitting approach, the quality of feedback can be improved.

Our Capacity -Global Resources: Global resources covering hundreds of languages worldwide

-Compliance: All the Large Language Model(LLM) Data is collected with proper authorization

-Quality: Multiple rounds of quality inspections ensures high quality data output

-Secure Implementation: NDA is signed to gurantee secure implementation and data is destroyed upon delivery.

-Efficency: Our platform supports human-machine interaction and semi-automatic labeling, increasing labeling efficiency by more than 30% per annotator. It has successfully been applied to nearly 5,000 projects.

3.About Nexdata Nexdata is equipped with professional data collection devices, tools and environments, as well as experienced project managers in data collection and quality control, so that we can meet the Large Language Model(LLM) Data collection requirements in various scenarios and types. We have global data processing centers and more than 20,000 professional annotators, supporting on-demand Large Language Model(LLM) Data annotation services, such as speech, image, video, point cloud and Natural Language Processing (NLP) Data, etc. Please visit us at https://www.nexdata.ai/?source=Datarade
Sentiment Analysis: App Store Reviews
kaggle.com
zip
Updated Aug 11, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dishant Savaliya (2025). Sentiment Analysis: App Store Reviews [Dataset]. https://www.kaggle.com/datasets/dishantsavaliya/app-dataset-v1
Explore at:
zip(1892115 bytes)Available download formats
Dataset updated
Aug 11, 2025
Authors
Dishant Savaliya
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
Please share your suggestions to improve my datasets further✍️

📄 Dataset Overview This dataset contains Google Play Store app reviews labeled for sentiment using a deterministic Large Language Model (LLM) classification pipeline. Each review is tagged as positive, negative, or neutral, making it ready for NLP training, benchmarking, and market insight generation.

⚙️ Data Collection & Labeling Process Source: Reviews collected from Google Play Store using the google_play_scraper library. Labeling: Reviews classified by a Hugging Face Transformers-based LLM with a strict prompt to ensure one-word output. Post-processing: Outputs normalized to the three sentiment classes.

💡 Potential Uses Fine-tuning BERT, RoBERTa, LLaMA, or other transformer models. Sentiment dashboards for product feedback monitoring. Market research on user perception trends. Benchmark dataset for text classification experiments.

Please upvote!!!!
Social Engineering Detection Benchmark with LLMs
kaggle.com
Updated May 10, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Doha AL-Qurashi (2025). Social Engineering Detection Benchmark with LLMs [Dataset]. https://www.kaggle.com/datasets/dohaalqurashi/social-engineering-detection-benchmark-with-llms
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
May 10, 2025
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Doha AL-Qurashi
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
📚 Social Engineering Detection Benchmark with LLMs

About the Dataset

The Social Engineering Detection Benchmark with LLMs dataset, meticulously curated by Doha AL-Qurashi and Rahaf Al-Batati, comprises 210 short scenarios and massages in both Arabic and English. Each message is labeled as or to evaluate large language models’ ability to detect social engineering tactics across diverse linguistic and cultural contexts.

Benchmark Results: Top-Performing Models

Out of 14 evaluated LLMs, the following achieved the highest accuracy in correctly predicting malicious intent:

llama-3.3-70b-specdec: ≈100%

llama-3.3-70b-versatile: ≈99%

deepseek-r1-distill-llama-70b: ≈97%

mistral-saba-24b: ≈96%

qwen-2.5-32b: ≈96%

Dataset Composition

This balanced dataset includes:

Total Messages: 210

Languages: Arabic (105), English (105)

Malicious: 126 (≈60%)

Non-Malicious: 84 (≈40%)

Data Generation & Annotation

To ensure realism and diversity, messages were sourced and labeled via:

LLM-Generated Scenarios: Synthetic attacks crafted by LLMs to mimic real social engineering language.

Social Media & News: Real-world examples collected from online platforms and reports.

Expert Annotation: Cybersecurity specialists validated each label for accuracy.

Structure of Each Entry

Scenario: Text of the short scenarios and massages.

Malicious: Boolean label (true / false).

Language: Arabic or English.

LLM Evaluations: Model responses (true, false, error, blank).

Applications

Researchers and practitioners can use this dataset to:

Train and fine-tune LLMs for enhanced threat detection.

Benchmark model performance in real-world conditions.

Develop adaptive, multilingual detection pipelines.

Getting Started

Download the dataset from Kaggle and load into your preferred analysis tool (e.g., Python/pandas).

Inspect the scenario column for contextual understanding.

Compare model predictions against the malicious ground truth.

Compute metrics such as accuracy, precision, recall, and F1-score.

Ethical & Privacy Considerations

Data Privacy: All messages are synthetic or anonymized; no personal data included.
Responsible Use: Intended solely for research and educational purposes.

How to Cite

If you use this dataset, please cite:

AL-Qurashi, D., & Al-Batati, R. (2024). Social Engineering Detection Benchmark with LLMs [Data set]. Kaggle. https://www.kaggle.com/datasets/dohaalqurashi/social-engineering-detection-benchmark-with-llms
h
Bitext-travel-llm-chatbot-training-dataset
huggingface.co
Updated Jun 21, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Bitext (2025). Bitext-travel-llm-chatbot-training-dataset [Dataset]. https://huggingface.co/datasets/bitext/Bitext-travel-llm-chatbot-training-dataset
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jun 21, 2025
Dataset authored and provided by
Bitext
License
https://choosealicense.com/licenses/cdla-sharing-1.0/https://choosealicense.com/licenses/cdla-sharing-1.0/
Description
Bitext - Travel Tagged Training Dataset for LLM-based Virtual Assistants

Overview

This hybrid synthetic dataset is designed to be used to fine-tune Large Language Models such as GPT, Mistral and OpenELM, and has been generated using our NLP/NLG technology and our automated Data Labeling (DAL) tools. The goal is to demonstrate how Verticalization/Domain Adaptation for the [Travel] sector can be easily achieved using our two-step approach to LLM Fine-Tuning. An overview of… See the full description on the dataset page: https://huggingface.co/datasets/bitext/Bitext-travel-llm-chatbot-training-dataset.

LLM Data Quality Assurance Market Research Report 2033

researchintelo.com

csv, pdf, pptx

Updated Oct 2, 2025

Facebook

Twitter

Click to copy link

Link copied

Cite

Research Intelo (2025). LLM Data Quality Assurance Market Research Report 2033 [Dataset]. https://researchintelo.com/report/llm-data-quality-assurance-market

Explore at:

pdf, pptx, csvAvailable download formats

Dataset updated

Oct 2, 2025

Dataset authored and provided by

Research Intelo

License

https://researchintelo.com/privacy-and-policyhttps://researchintelo.com/privacy-and-policy

Time period covered

2024 - 2033

Area covered

Global

Description

LLM Data Quality Assurance Market Outlook

According to our latest research, the Global LLM Data Quality Assurance market size was valued at $1.25 billion in 2024 and is projected to reach $8.67 billion by 2033, expanding at a robust CAGR of 23.7% during 2024–2033. The major factor propelling the growth of the LLM Data Quality Assurance market globally is the rapid proliferation of generative AI and large language models (LLMs) across industries, creating an urgent need for high-quality, reliable, and bias-free data to fuel these advanced systems. As organizations increasingly depend on LLMs for mission-critical applications, ensuring the integrity and accuracy of training and operational data has become indispensable to mitigate risk, enhance performance, and comply with evolving regulatory frameworks.

Regional Outlook

North America currently commands the largest share of the LLM Data Quality Assurance market, accounting for approximately 38% of the global revenue in 2024. This dominance can be attributed to the region’s mature AI ecosystem, significant investments in digital transformation, and the presence of leading technology firms and AI research institutions. The United States, in particular, has spearheaded the adoption of LLMs in sectors such as BFSI, healthcare, and IT, driving the demand for advanced data quality assurance solutions. Favorable government policies supporting AI innovation, a strong startup culture, and robust regulatory guidelines around data privacy and model transparency have further solidified North America’s leadership position in the market.

Asia Pacific is emerging as the fastest-growing region in the LLM Data Quality Assurance market, with a projected CAGR of 27.4% from 2024 to 2033. This rapid growth is driven by escalating investments in AI infrastructure, increasing digitalization across enterprises, and government-led initiatives to foster AI research and deployment. Countries such as China, Japan, South Korea, and India are witnessing exponential growth in LLM adoption, especially in sectors like e-commerce, telecommunications, and manufacturing. The region’s burgeoning talent pool, combined with a surge in AI-focused venture capital funding, is fueling innovation in data quality assurance platforms and services, positioning Asia Pacific as a major future growth engine for the market.

Emerging economies in Latin America and the Middle East & Africa are also starting to recognize the importance of LLM Data Quality Assurance, but adoption remains at a nascent stage due to infrastructural limitations, skill gaps, and budgetary constraints. These regions are gradually overcoming barriers as multinational corporations expand their operations and local governments launch digital transformation agendas. However, challenges such as data localization requirements, fragmented regulatory landscapes, and limited access to cutting-edge AI technologies are slowing widespread adoption. Despite these hurdles, localized demand for data quality solutions in sectors like banking, retail, and healthcare is expected to rise steadily as these economies modernize and integrate AI-driven workflows.

Report Scope

Attributes	Details
Report Title	LLM Data Quality Assurance Market Research Report 2033
By Component	Software, Services
By Application	Model Training, Data Labeling, Data Validation, Data Cleansing, Data Monitoring, Others
By Deployment Mode	On-Premises, Cloud
By Enterprise Size	Small and Medium Enterprises, Large Enterprises
By End-User	BFSI, Healthcare, Retail and E-commerce, IT and Telecommunications, Media and Entertainment, Manufacturing, Others

Augmented training data and labels, used for training the models
figshare.com
bin
Updated Mar 26, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Michael Keane (2025). Augmented training data and labels, used for training the models [Dataset]. http://doi.org/10.6084/m9.figshare.28669001.v1
Explore at:
binAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.28669001.v1
Dataset updated
Mar 26, 2025
Dataset provided by
figshare
Figsharehttp://figshare.com/
Authors
Michael Keane
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This is the augmented data and labels used in training the model, it is also needed for evaluation as the vectoriser is fit on this data and then the test data is transformed on that vectoriser
PII | External Dataset
kaggle.com
zip
Updated Jan 24, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
moth (2024). PII | External Dataset [Dataset]. https://www.kaggle.com/datasets/alejopaullier/pii-external-dataset
Explore at:
zip(7923518 bytes)Available download formats
Dataset updated
Jan 24, 2024
Authors
moth
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
This is an LLM-generated external dataset for the: - The Learning Agency Lab - PII Data Detection Competition

Versions

v2: Added +1000 texts with new PII information like URLs and usernames. Also, now the dataset includes the PII information as columns. Note that not all the PII information is included on the text on purpose.

Description

It contains 3382 4434 generated texts with their corresponding annotated labels in the required competition format.

Description: - document (str): ID of the essay - full_text (string): AI generated text. - tokens (string): a list with the tokens (comes from text.split()) - trailing_whitespace (list): a list with boolean values indicating whether each token is followed by whitespace. - labels (list): list with token labels in BIO format
Z
Curlie Enhanced with LLM Annotations: Two Datasets for Advancing...
data-staging.niaid.nih.gov
Updated Dec 21, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Nutter, Peter; Senghaas, Mika; Cizinsky, Ludek (2023). Curlie Enhanced with LLM Annotations: Two Datasets for Advancing Homepage2Vec's Multilingual Website Classification [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_10413067
Explore at:
Dataset updated
Dec 21, 2023
Dataset provided by
Czech Technical University in Prague
École Polytechnique Fédérale de Lausanne
Authors
Nutter, Peter; Senghaas, Mika; Cizinsky, Ludek
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Advancing Homepage2Vec with LLM-Generated Datasets for Multilingual Website Classification

This dataset contains two subsets of labeled website data, specifically created to enhance the performance of Homepage2Vec, a multi-label model for website classification. The datasets were generated using Large Language Models (LLMs) to provide more accurate and diverse topic annotations for websites, addressing a limitation of existing Homepage2Vec training data.

Key Features:

LLM-generated annotations: Both datasets feature website topic labels generated using LLMs, a novel approach to creating high-quality training data for website classification models.

Improved multi-label classification: Fine-tuning Homepage2Vec with these datasets has been shown to improve its macro F1 score from 38% to 43% evaluated on a human-labeled dataset, demonstrating their effectiveness in capturing a broader range of website topics.

Multilingual applicability: The datasets facilitate classification of websites in multiple languages, reflecting the inherent multilingual nature of Homepage2Vec.

Dataset Composition:

curlie-gpt3.5-10k: 10,000 websites labeled using GPT-3.5, context 2 and 1-shot

curlie-gpt4-10k: 10,000 websites labeled using GPT-4, context 2 and zero-shot

Intended Use:

Fine-tuning and advancing Homepage2Vec or similar website classification models

Research on LLM-generated datasets for text classification tasks

Exploration of multilingual website classification

Additional Information:

Project and report repository: https://github.com/CS-433/ml-project-2-mlp

Acknowledgments:

This dataset was created as part of a project at EPFL's Data Science Lab (DLab) in collaboration with Prof. Robert West and Tiziano Piccardi.
F
Japanese Closed Ended Question Answer Text Dataset
futurebeeai.com
wav
Updated Aug 1, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
FutureBee AI (2022). Japanese Closed Ended Question Answer Text Dataset [Dataset]. https://www.futurebeeai.com/dataset/prompt-response-dataset/japanese-closed-ended-question-answer-text-dataset
Explore at:
wavAvailable download formats
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreementhttps://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
The Japanese Closed-Ended Question Answering Dataset is a meticulously curated collection of 5000 comprehensive Question-Answer pairs. It serves as a valuable resource for training Large Language Models (LLMs) and question-answering models in the Japanese language, advancing the field of artificial intelligence.
Dataset Content
This closed-ended QA dataset comprises a diverse set of context paragraphs and questions paired with corresponding answers in Japanese. There is a context paragraph given for each question to get the answer from. The questions cover a broad range of topics, including science, history, technology, geography, literature, current affairs, and more.
Each question is accompanied by an answer, providing valuable information and insights to enhance the language model training process. Both the questions and answers were manually curated by native Japanese people, and references were taken from diverse sources like books, news articles, websites, web forums, and other reliable references.
This question-answer prompt completion dataset contains different types of prompts, including instruction type, continuation type, and in-context learning (zero-shot, few-shot) type. The dataset also contains questions and answers with different types of rich text, including tables, code, JSON, etc., with proper markdown.
Question Diversity
To ensure diversity, this Q&A dataset includes questions with varying complexity levels, ranging from easy to medium and hard. Different types of questions, such as multiple-choice, direct, and true/false, are included. The QA dataset also contains questions with constraints, which makes it even more useful for LLM training.
Answer Formats
To accommodate varied learning experiences, the dataset incorporates different types of answer formats. These formats include single-word, short phrases, single sentences, and paragraphs types of answers. The answers contain text strings, numerical values, date and time formats as well. Such diversity strengthens the language model's ability to generate coherent and contextually appropriate answers.
Data Format and Annotation Details
This fully labeled Japanese Closed-Ended Question Answer Dataset is available in JSON and CSV formats. It includes annotation details such as a unique id, context paragraph, context reference link, question, question type, question complexity, question category, domain, prompt type, answer, answer type, and rich text presence.
Quality and Accuracy
The dataset upholds the highest standards of quality and accuracy. Each question undergoes careful validation, and the corresponding answers are thoroughly verified. To prioritize inclusivity, the dataset incorporates questions and answers representing diverse perspectives and writing styles, ensuring it remains unbiased and avoids perpetuating discrimination.
The Japanese versions is grammatically accurate without any spelling or grammatical errors. No toxic or harmful content is used while building this dataset.
Continuous Updates and Customization
The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Continuous efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to collect custom question-answer data tailored to specific needs, providing flexibility and customization options.
License:
The dataset, created by FutureBeeAI, is now ready for commercial use. Researchers, data scientists, and developers can utilize this fully labeled and ready-to-deploy Japanese Closed-Ended Question Answer Dataset to enhance the language understanding capabilities of their generative AI models, improve response generation, and explore new approaches to NLP question-answering tasks.
AI(LLMS) vs. Human Texts Cleaned and Optimized
kaggle.com
zip
Updated Feb 14, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Yamin Hossain (2025). AI(LLMS) vs. Human Texts Cleaned and Optimized [Dataset]. https://www.kaggle.com/datasets/yaminh/ai-generated-and-human-written-texts/code
Explore at:
zip(1402836814 bytes)Available download formats
Dataset updated
Feb 14, 2025
Authors
Yamin Hossain
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
Title:

Cleaned and Optimized Dataset for AI vs. Human Text Classification

Overview:

This dataset is a curated and optimized collection of text data designed for training and evaluating machine learning models to distinguish between AI-generated and human-written text. The dataset has been meticulously cleaned, deduplicated, and reduced in size to ensure efficiency while maintaining its utility for research and development purposes.

By combining multiple high-quality sources, this dataset provides a diverse range of text samples, making it ideal for tasks such as binary classification (AI vs. Human) and other natural language processing (NLP) applications.

Key Features:

Cleaned Text:

All text entries have been preprocessed to remove unwanted characters, extra spaces, and special symbols.

Text cleaning ensures consistency and improves model performance by focusing on meaningful content.

Label Consistency:

Each entry is labeled with a binary value (0 for human-written text, 1 for AI-generated text).

Labels have been standardized across all sources for seamless integration.

Memory Optimization:

The dataset has been optimized to reduce memory usage:

Unnecessary columns have been removed.

Data types have been downcast to more efficient formats (e.g., category for categorical columns).

Deduplication:

Duplicate rows have been removed to prevent redundancy and ensure the dataset's integrity.

Null Value Handling:

Rows with missing or null values have been carefully handled to maintain data quality.

Compact Size:

The dataset includes only two essential columns: label and clean_text, making it lightweight and easy to use.

Dataset Structure:

The final dataset contains the following columns:

Column Name Description
label Binary label indicating the source of the text (0: Human, 1: AI).
clean_text Preprocessed and cleaned text content ready for NLP tasks.

Sources Used:

This dataset is a consolidation of multiple high-quality datasets from various sources, ensuring diversity and representativeness. Below are the details of the sources used:

Source 1:

Link: https://www.kaggle.com/datasets/shanegerami/ai-vs-human-text

Description: A dataset containing AI-generated and human-written text samples.

Columns Used: text, generated (renamed to label).

Source 2:

Link: https://www.kaggle.com/datasets/jdragonxherrera/augmented-data-for-llm-detect-ai-generated-text

Description: Augmented data specifically designed for detecting AI-generated text.

Columns Used: text, label.

Source 3:

Link: https://www.kaggle.com/datasets/dardodel/4k-mixtral87b-crafted-essays-for-detect-ai-comp

Description: Essays generated by the Mixtral 8x7B model for AI detection competitions.

Columns Used: AI_Essay (renamed to text), prompt_id (renamed to label).

Source 4:

Link: https://www.kaggle.com/datasets/thedrcat/daigt-v2-train-dataset

Description: A large dataset of AI and human text samples for training detection models.

Columns Used: text, label.

Source 5:

Link: https://huggingface.co/datasets/artem9k/ai-text-detection-pile (Hugging Face Dataset)

Description: A diverse collection of text samples from the Hugging Face Hub.

Columns Used: text, source (renamed to label).

Source 6:

Link: https://www.kaggle.com/datasets/jdragonxherrera/augmented-data-for-llm-detect-ai-generated-text

Description: Test data for evaluating AI-generated text detection models.

Columns Used: text, label.

Data Cleaning and Preprocessing Steps:

To ensure the dataset is clean, consistent, and optimized for use, the following steps were performed:

Column Standardization:

Renamed columns across all sources to ensure uniformity (text and label).

Text Cleaning:

Converted all text to lowercase.

Removed non-alphabetic characters, extra spaces, and leading/trailing spaces.

Duplicate Removal:

Identified and removed duplicate rows to avoid redundancy.

Null Value Handling:

Dropped rows with missing or null values in critical columns (text or label).

Memory Optimization:

Converted categorical columns to the category type for memory efficiency.

Final Dataset Creation:

Retained only the essential columns: label and clean_text.

Use Cases:

This datase...
F
Finnish Open Ended Question Answer Text Dataset
futurebeeai.com
wav
Updated Aug 1, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
FutureBee AI (2022). Finnish Open Ended Question Answer Text Dataset [Dataset]. https://www.futurebeeai.com/dataset/prompt-response-dataset/finnish-open-ended-question-answer-text-dataset
Explore at:
wavAvailable download formats
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreementhttps://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
The Finnish Open-Ended Question Answering Dataset is a meticulously curated collection of comprehensive Question-Answer pairs. It serves as a valuable resource for training Large Language Models (LLMs) and Question-answering models in the Finnish language, advancing the field of artificial intelligence.
Dataset Content:
This QA dataset comprises a diverse set of open-ended questions paired with corresponding answers in Finnish. There is no context paragraph given to choose an answer from, and each question is answered without any predefined context content. The questions cover a broad range of topics, including science, history, technology, geography, literature, current affairs, and more.
Each question is accompanied by an answer, providing valuable information and insights to enhance the language model training process. Both the questions and answers were manually curated by native Finnish people, and references were taken from diverse sources like books, news articles, websites, and other reliable references.
This question-answer prompt completion dataset contains different types of prompts, including instruction type, continuation type, and in-context learning (zero-shot, few-shot) type. The dataset also contains questions and answers with different types of rich text, including tables, code, JSON, etc., with proper markdown.
Question Diversity:
To ensure diversity, this Q&A dataset includes questions with varying complexity levels, ranging from easy to medium and hard. Different types of questions, such as multiple-choice, direct, and true/false, are included. Additionally, questions are further classified into fact-based and opinion-based categories, creating a comprehensive variety. The QA dataset also contains the question with constraints and persona restrictions, which makes it even more useful for LLM training.
Answer Formats:
To accommodate varied learning experiences, the dataset incorporates different types of answer formats. These formats include single-word, short phrases, single sentences, and paragraph types of answers. The answer contains text strings, numerical values, date and time formats as well. Such diversity strengthens the Language model's ability to generate coherent and contextually appropriate answers.
Data Format and Annotation Details:
This fully labeled Finnish Open Ended Question Answer Dataset is available in JSON and CSV formats. It includes annotation details such as id, language, domain, question_length, prompt_type, question_category, question_type, complexity, answer_type, rich_text.
Quality and Accuracy:
The dataset upholds the highest standards of quality and accuracy. Each question undergoes careful validation, and the corresponding answers are thoroughly verified. To prioritize inclusivity, the dataset incorporates questions and answers representing diverse perspectives and writing styles, ensuring it remains unbiased and avoids perpetuating discrimination.
Both the question and answers in Finnish are grammatically accurate without any word or grammatical errors. No copyrighted, toxic, or harmful content is used while building this dataset.
Continuous Updates and Customization:
The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Continuous efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to collect custom question-answer data tailored to specific needs, providing flexibility and customization options.
License:
The dataset, created by FutureBeeAI, is now ready for commercial use. Researchers, data scientists, and developers can utilize this fully labeled and ready-to-deploy Finnish Open Ended Question Answer Dataset to enhance the language understanding capabilities of their generative ai models, improve response generation, and explore new approaches to NLP question-answering tasks.
h
Bitext-restaurants-llm-chatbot-training-dataset
huggingface.co
Updated Aug 16, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Bitext (2024). Bitext-restaurants-llm-chatbot-training-dataset [Dataset]. https://huggingface.co/datasets/bitext/Bitext-restaurants-llm-chatbot-training-dataset
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Aug 16, 2024
Dataset authored and provided by
Bitext
License
https://choosealicense.com/licenses/cdla-sharing-1.0/https://choosealicense.com/licenses/cdla-sharing-1.0/
Description
Bitext - Restaurants Tagged Training Dataset for LLM-based Virtual Assistants

Overview

This hybrid synthetic dataset is designed to be used to fine-tune Large Language Models such as GPT, Mistral and OpenELM, and has been generated using our NLP/NLG technology and our automated Data Labeling (DAL) tools. The goal is to demonstrate how Verticalization/Domain Adaptation for the [restaurants] sector can be easily achieved using our two-step approach to LLM Fine-Tuning. An… See the full description on the dataset page: https://huggingface.co/datasets/bitext/Bitext-restaurants-llm-chatbot-training-dataset.
f
Data from: Judge the Votes: A System to Classify Bug Reports and Give...
figshare.com
json
Updated Oct 17, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Emre Dinç; Eray Tüzün (2025). Judge the Votes: A System to Classify Bug Reports and Give Suggestions [Dataset]. http://doi.org/10.6084/m9.figshare.29627297.v1
Explore at:
jsonAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.29627297.v1
Dataset updated
Oct 17, 2025
Dataset provided by
figshare
Authors
Emre Dinç; Eray Tüzün
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
OverviewContext: A large number of bug reports are submitted in software projects on a daily basis. They are examined manually by practitioners. However, most of the bug reports are not related to the errors in the codebase (invalid bug reports), and cause a great waste of time and energy. Previous research have used various machine learning based techniques and language model based techniques to tackle this problem through auto-classification of bug reports. There exists a gap, however, in classification of bug reports using the large language models (LLMs) and techniques to improve the LLM performance for binary classification.Objective: The aim of this study is to apply various machine learning and natural language processing methods to classify bug reports as valid or invalid, then supply the predictions of these models to an LLM judge, along with similar bug reports and their labels, to enable it to make an informed prediction.Method: We first retrieved 10,000 real-world Firefox bug reports via the Bugzilla API, then divided them randomly into a training set and a test set. We trained three traditional classifiers (Naive Bayes, Random Forest, and SVM) on the training set, and fine-tuned five BERT-based models for classification. In parallel, we used LLMs with in-context examples given by the Retrieval Augmented Generation system established over the training set; and compared the results with zero-shot prompting. We have picked the best-performing LLM and used it as the judge LLM by providing it with the votes of the three diverse ML models, and top-5 semantically similar bug reports. We compared the performance of the judge LLM with the majority voting of the three chosen models.Results: The classical ML pipelines; Naive Bayes, Random Forest, and Linear SVM trained on TF-IDF features and achieved strong binary $F_1$ scores of 0.881, 0.894, and 0.878, respectively. Our suite of five fine-tuned BERT-based classifiers further improved performance, with F$_1$ scores for BERT base at 0.899, RoBERTa at 0.909, CodeBERT at 0.902, CodeBERT-Graph at 0.902, and the 128-token BERT variant at 0.899. In contrast, zero-shot LLM classification without retrieval saw $F_1$s ranging from 0.531 (GPT-4o-mini) to 0.737 (Llama-3.3-70B), highlighting the gap between out-of-the-box LLMs and specialized models. Introducing RAG-based few-shot prompting closed much of that gap, lifting LLM $F_1$s to 0.815 for GPT-o3-mini, 0.759 for GPT-o4-mini, and 0.797 for Llama, while GPT-4o-mini reached 0.729. Finally, our hybrid judge pipeline, combining top-5 similar bug reports, votes from RF, SVM, and RoBERTa, and reasoning by GPT-o3-mini, yielded an $F_1$ of 0.871, striking a balance between raw accuracy and human-readable explanations.Conclusions: Our evaluation confirms that specialized, fine-tuned classifiers, particularly RoBERTa and CodeBERT variants, remain the most cost-effective and highest-accuracy solutions for binary bug-report triage. Using RAG with LLMs substantially boosts classification over zero-shot baselines, although the scores do not surpass our top fine-tuned models. Nevertheless, their natural language rationales and actionable suggestions offer an explainability advantage that static classifiers cannot match. In practice, a hybrid ensemble where fast, accurate classifiers handle the bulk of cases and an LLM judge provides human-readable justification for edge cases appears to strike the best balance between performance, cost, and transparency.Research QuestionsThe study addresses the following research questions:RQ1: How do reasoning LLMs perform compared to prior ML and LLM classifiers when supported with few-shot prompting?RQ2: What is the effect of integrating an LLM judge into a majority-voting ensemble versus using voting alone?RQ3: How does few-shot prompting with RAG impact LLM performance relative to zero-shot prompting?Project StructureThe replication package consists of 4 folders. The data folder contains 2 files: training.csv, and bug_reports.csv. training.csv contains only 2 columns text and label, and it is the version we used when training the models. bug_reports.csv contains the columns and ids we have retrieved from Bugzilla. The code folder contains the .ipynb files we used when creating the dataset, training ML models, fine-tuning BERT-based models, and getting predictions from LLMs with both zero-shot and few-shot settings. The notebook for the unified pipeline is also included. The preds folder contains the predictions, explanations, and original labels of the bug reports from the test dataset. The metrics folder contains a single file which includes the metrics as JSON objects for all eighteen model configurations we have tested.Instruction for ReplicationThis replication package has the following folder structure:-code-data-preds-metricsThe code folder keepts the code we have used to create the dataset and preprocess it.It also includes the code we have used to train different models we have used in the study, and evaluate them.bert_models.ipynb contains the code for 5 different BERT-based models we have fine-tuned along with the code toprint the detailed scores for ease of evaluation.create_dataset.ipynb contains the code we have used for creating the dataset using the Bugzilla API.gpt_4o_mini.ipynb, gpt_o3_mini.ipynb, gpt_o4_mini.ipynb, and llama_70b.ipynb contain the code used to test the models in file names with zero-shot and few-shot configurations.The data folder contains the dataset we have collected and used in the study.bug_reports.csv is the file containing all information we obtained whether we used them in model training or not.training.csv is the version of the dataset we used when doing model training. It has only 2 labels, which simply the training process.The preds folder contains all predictions and explanations from the language models or hybrid judge pipeline.The metrics folder contains metrics.json, which includes all the model metrics as given in the paper. This file is handy for comparing the result of different models too.For the analysis performed in the review process, we have added another folder named review_analysis, which includes the scripts we have used to conduct additional analyses about the differences between the Judge LLM and majority voting, and the subsets where each approach is successful. The analysis also examines the aspects of bug reports that played an important role in being classified as valid by the LLM, including evidence-based patterns (such as file paths, urls, etc.), and potential biases such as text length correlation. There are 3 Python scripts for performing the analyses, 4 detailed markdown reports documenting the findings, and 5 visualizations illustrating model comparisons, voting patterns, and bias detection results. The key findings reveal that while the Judge LLM provides detailed natural language explanations, it exhibits a systematic text length bias (r=-0.360) and underperforms compared to the ML majority vote (81.5% vs 87.0% accuracy), though it offers superior explainability for human reviewers.
F
Hindi Open Ended Classification Prompt & Response Dataset
futurebeeai.com
wav
Updated Aug 1, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
FutureBee AI (2022). Hindi Open Ended Classification Prompt & Response Dataset [Dataset]. https://www.futurebeeai.com/dataset/prompt-response-dataset/hindi-open-ended-classification-text-dataset
Explore at:
wavAvailable download formats
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreementhttps://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
Welcome to the Hindi Open Ended Classification Prompt-Response Dataset, an extensive collection of 3000 meticulously curated prompt and response pairs. This dataset is a valuable resource for training Language Models (LMs) to classify input text accurately, a crucial aspect in advancing generative AI.
Dataset Content
This open-ended classification dataset comprises a diverse set of prompts and responses where the prompt contains input text to be classified and may also contain task instruction, context, constraints, and restrictions while completion contains the best classification category as response. Both these prompts and completions are available in Hindi language. As this is an open-ended dataset, there will be no options given to choose the right classification category as a part of the prompt.
These prompt and completion pairs cover a broad range of topics, including science, history, technology, geography, literature, current affairs, and more. Each prompt is accompanied by a response, providing valuable information and insights to enhance the language model training process. Both the prompt and response were manually curated by native Hindi people, and references were taken from diverse sources like books, news articles, websites, and other reliable references.
This open-ended classification prompt and completion dataset contains different types of prompts, including instruction type, continuation type, and in-context learning (zero-shot, few-shot) type. The dataset also contains prompts and responses with different types of rich text, including tables, code, JSON, etc., with proper markdown.
Prompt Diversity
To ensure diversity, this open-ended classification dataset includes prompts with varying complexity levels, ranging from easy to medium and hard. Different types of prompts, such as multiple-choice, direct, and true/false, are included. Additionally, prompts are diverse in terms of length from short to medium and long, creating a comprehensive variety. The classification dataset also contains prompts with constraints and persona restrictions, which makes it even more useful for LLM training.
Response Formats
To accommodate diverse learning experiences, our dataset incorporates different types of responses depending on the prompt. These formats include single-word, short phrase, and single sentence type of response. These responses encompass text strings, numerical values, and date and time formats, enhancing the language model's ability to generate reliable, coherent, and contextually appropriate answers.
Data Format and Annotation Details
This fully labeled Hindi Open Ended Classification Prompt Completion Dataset is available in JSON and CSV formats. It includes annotation details such as a unique ID, prompt, prompt type, prompt length, prompt complexity, domain, response, response type, and rich text presence.
Quality and Accuracy
Our dataset upholds the highest standards of quality and accuracy. Each prompt undergoes meticulous validation, and the corresponding responses are thoroughly verified. We prioritize inclusivity, ensuring that the dataset incorporates prompts and completions representing diverse perspectives and writing styles, maintaining an unbiased and discrimination-free stance.
The Hindi version is grammatically accurate without any spelling or grammatical errors. No copyrighted, toxic, or harmful content is used during the construction of this dataset.
Continuous Updates and Customization
The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Ongoing efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to gather custom open-ended classification prompt and completion data tailored to specific needs, providing flexibility and customization options.
License
The dataset, created by FutureBeeAI, is now available for commercial use. Researchers, data scientists, and developers can leverage this fully labeled and ready-to-deploy Hindi Open Ended Classification Prompt-Completion Dataset to enhance the classification abilities and accurate response generation capabilities of their generative AI models and explore new approaches to NLP tasks.
f
Main Data and Code
figshare.com
zip
Updated Oct 5, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Momo (2025). Main Data and Code [Dataset]. http://doi.org/10.6084/m9.figshare.29929412.v1
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.29929412.v1
Dataset updated
Oct 5, 2025
Dataset provided by
figshare
Authors
Momo
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Important Notice: Ethical Use OnlyThis repository provides code and datasets for academic research on misinformation.Please note that the datasets include rumor-related texts. These materials are supplied solely for scholarly analysis and research aimed at understanding and combating misinformation.Prohibited UseDo not use this repository, including its code or data, to create or spread false information in any real-world context.Any misuse of these resources for malicious purposes is strictly forbidden.DisclaimerThe authors bear no responsibility for any unethical or unlawful use of the provided resources.By accessing or using this repository, you acknowledge and agree to comply with these ethical guidelines.Project StructureThe project is organized into three main directories, each corresponding to a major section of the paper's experiments:main_data_and_code/├── rumor_generation/├── rumor_detection/└── rumor_debunking/How to Get StartedPrerequisitesTo successfully run the code and reproduce the results, you will need to:Obtain and configure your own API key for the large language models (LLMs) used in the experiments. Please replace the placeholder API key in the code with your own.For the rumor detection experiments, download the public datasets (Twitter15, Twitter16, FakeNewsNet) from their respective sources. The pre-process scripts in the rumor detection folder must be run first to prepare the public datasets.Please note that many scripts are provided as examples using the Twitter15 dataset. To run experiments on other datasets like Twitter16 or FakeNewsNet, you will need to modify these scripts or create copies and update the corresponding file paths.Detailed Directory Breakdown1. rumor_generation/This directory contains all the code and data related to the rumor generation experiments.rumor_generation_zeroshot.py: Code for the zero-shot rumor generation experiment.rumor_generation_fewshot.py: Code for the few-shot rumor generation experiment.rumor_generation_cot.py: Code for the chain-of-thought (CoT) rumor generation experiment.token_distribution.py: Script to analyze token distribution in the generated text.label_rumors.py：Script to label LLM-generated texts based on whether they contain rumor-related content.extract_reasons.py: Script to extract reasons for rumor generation and rejection.visualization.py: Utility script for generating figures.LDA.py: Code for performing LDA topic modeling on the generated data.rumor_generation_responses.json: The complete output dataset from the rumor generation experiments.generation_reasons_extracted.json: The extracted reasons for generated rumors.rejection_reasons_extracted.json: The extracted reasons for rejected rumor generation requests.2. rumor_detection/This directory contains the code and data used for the rumor detection experiments.nonreasoning_zeroshot_twitter15.py: Code for the non-reasoning, zero-shot detection on the Twitter15 dataset. To run on Twitter16 or FakeNewsNet, update the file paths within the script. Similar experiment scripts below follow the same principle and are not described repeatedly.nonreasoning_fewshot_twitter15.py: Code for the non-reasoning, few-shot detection on the Twitter15 dataset.nonreasoning_cot_twitter15.py: Code for the non-reasoning, CoT detection on the Twitter15 dataset.reasoning_zeroshot_twitter15.py: Code for the Reasoning LLMs, zero-shot detection on the Twitter15 dataset.reasoning_fewshot_twitter15.py: Code for the Reasoning LLMs, few-shot detection on the Twitter15 dataset.reasoning_cot_twitter15.py: Code for the Reasoning LLMs, CoT detection on the Twitter15 dataset.traditional_model.py: Code for the traditional models used as baselines.preprocess_twitter15_and_twitter16.py: Script for preprocessing the Twitter15 and Twitter16 datasets.preprocess_fakenews.py: Script for preprocessing the FakeNewsNet dataset.generate_summary_table.py: Calculates all classification metrics and generates the final summary table for the rumor detection experiments.select_few_shot_example_15.py: Script to pre-select few-shot examples, using the Twitter15 dataset as an example. To generate examples for Twitter16 or FakeNewsNet, update the file paths within the script.twitter15_few_shot_examples.json: Pre-selected few-shot examples for the Twitter15 dataset.twitter16_few_shot_examples.json: Pre-selected few-shot examples for the Twitter16 dataset.fakenewsnet_few_shot_examples.json: Pre-selected few-shot examples for the FakeNewsNet dataset.twitter15_llm_results.json: LLM prediction results on the Twitter15 dataset.twitter16_llm_results.json: LLM prediction results on the Twitter16 dataset.fakenewsnet_llm_results.json: LLM prediction results on the FakeNewsNet dataset.visualization.py: Utility script for generating figures.3. rumor_debunking/This directory contains all the code and data for the rumor debunking experiments.analyze_sentiment.py: Script for analyzing the sentiment of the debunking texts.calculate_readability.py: Script for calculating the readability score of the debunking texts.plot_readability.py: Utility script for generating figures related to readability.fact_checking_with_nli.py: Code for the NLI-based fact-checking experiment.debunking_results.json: The dataset containing the debunking results for this experimental section.debunking_results_with_readability.json: The dataset containing the debunking results along with readability scores.sentiment_analysis/: This directory contains the result file from the sentiment analysis.debunking_results_with_sentiment.json: The dataset containing the debunking results along with sentiment analysis.Please contact the repository owner if you encounter any problems or have questions about the code or data.
F
Malayalam Open Ended Classification Prompt & Response Dataset
futurebeeai.com
wav
Updated Aug 1, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
FutureBee AI (2022). Malayalam Open Ended Classification Prompt & Response Dataset [Dataset]. https://www.futurebeeai.com/dataset/prompt-response-dataset/malayalam-open-ended-classification-text-dataset
Explore at:
wavAvailable download formats
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreementhttps://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
Welcome to the Malayalam Open Ended Classification Prompt-Response Dataset, an extensive collection of 3000 meticulously curated prompt and response pairs. This dataset is a valuable resource for training Language Models (LMs) to classify input text accurately, a crucial aspect in advancing generative AI.
Dataset Content
This open-ended classification dataset comprises a diverse set of prompts and responses where the prompt contains input text to be classified and may also contain task instruction, context, constraints, and restrictions while completion contains the best classification category as response. Both these prompts and completions are available in Malayalam language. As this is an open-ended dataset, there will be no options given to choose the right classification category as a part of the prompt.
These prompt and completion pairs cover a broad range of topics, including science, history, technology, geography, literature, current affairs, and more. Each prompt is accompanied by a response, providing valuable information and insights to enhance the language model training process. Both the prompt and response were manually curated by native Malayalam people, and references were taken from diverse sources like books, news articles, websites, and other reliable references.
This open-ended classification prompt and completion dataset contains different types of prompts, including instruction type, continuation type, and in-context learning (zero-shot, few-shot) type. The dataset also contains prompts and responses with different types of rich text, including tables, code, JSON, etc., with proper markdown.
Prompt Diversity
To ensure diversity, this open-ended classification dataset includes prompts with varying complexity levels, ranging from easy to medium and hard. Different types of prompts, such as multiple-choice, direct, and true/false, are included. Additionally, prompts are diverse in terms of length from short to medium and long, creating a comprehensive variety. The classification dataset also contains prompts with constraints and persona restrictions, which makes it even more useful for LLM training.
Response Formats
To accommodate diverse learning experiences, our dataset incorporates different types of responses depending on the prompt. These formats include single-word, short phrase, and single sentence type of response. These responses encompass text strings, numerical values, and date and time formats, enhancing the language model's ability to generate reliable, coherent, and contextually appropriate answers.
Data Format and Annotation Details
This fully labeled Malayalam Open Ended Classification Prompt Completion Dataset is available in JSON and CSV formats. It includes annotation details such as a unique ID, prompt, prompt type, prompt length, prompt complexity, domain, response, response type, and rich text presence.
Quality and Accuracy
Our dataset upholds the highest standards of quality and accuracy. Each prompt undergoes meticulous validation, and the corresponding responses are thoroughly verified. We prioritize inclusivity, ensuring that the dataset incorporates prompts and completions representing diverse perspectives and writing styles, maintaining an unbiased and discrimination-free stance.
The Malayalam version is grammatically accurate without any spelling or grammatical errors. No copyrighted, toxic, or harmful content is used during the construction of this dataset.
Continuous Updates and Customization
The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Ongoing efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to gather custom open-ended classification prompt and completion data tailored to specific needs, providing flexibility and customization options.
License
The dataset, created by FutureBeeAI, is now available for commercial use. Researchers, data scientists, and developers can leverage this fully labeled and ready-to-deploy Malayalam Open Ended Classification Prompt-Completion Dataset to enhance the classification abilities and accurate response generation capabilities of their generative AI models and explore new approaches to NLP tasks.
F
Arabic Closed Ended Question Answer Text Dataset
futurebeeai.com
wav
Updated Aug 1, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
FutureBee AI (2022). Arabic Closed Ended Question Answer Text Dataset [Dataset]. https://www.futurebeeai.com/dataset/prompt-response-dataset/arabic-closed-ended-question-answer-text-dataset
Explore at:
wavAvailable download formats
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreementhttps://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
The Arabic Closed-Ended Question Answering Dataset is a meticulously curated collection of 5000 comprehensive Question-Answer pairs. It serves as a valuable resource for training Large Language Models (LLMs) and question-answering models in the Arabic language, advancing the field of artificial intelligence.
Dataset Content
This closed-ended QA dataset comprises a diverse set of context paragraphs and questions paired with corresponding answers in Arabic. There is a context paragraph given for each question to get the answer from. The questions cover a broad range of topics, including science, history, technology, geography, literature, current affairs, and more.
Each question is accompanied by an answer, providing valuable information and insights to enhance the language model training process. Both the questions and answers were manually curated by native Arabic people, and references were taken from diverse sources like books, news articles, websites, web forums, and other reliable references.
This question-answer prompt completion dataset contains different types of prompts, including instruction type, continuation type, and in-context learning (zero-shot, few-shot) type. The dataset also contains questions and answers with different types of rich text, including tables, code, JSON, etc., with proper markdown.
Question Diversity
To ensure diversity, this Q&A dataset includes questions with varying complexity levels, ranging from easy to medium and hard. Different types of questions, such as multiple-choice, direct, and true/false, are included. The QA dataset also contains questions with constraints, which makes it even more useful for LLM training.
Answer Formats
To accommodate varied learning experiences, the dataset incorporates different types of answer formats. These formats include single-word, short phrases, single sentences, and paragraphs types of answers. The answers contain text strings, numerical values, date and time formats as well. Such diversity strengthens the language model's ability to generate coherent and contextually appropriate answers.
Data Format and Annotation Details
This fully labeled Arabic Closed-Ended Question Answer Dataset is available in JSON and CSV formats. It includes annotation details such as a unique id, context paragraph, context reference link, question, question type, question complexity, question category, domain, prompt type, answer, answer type, and rich text presence.
Quality and Accuracy
The dataset upholds the highest standards of quality and accuracy. Each question undergoes careful validation, and the corresponding answers are thoroughly verified. To prioritize inclusivity, the dataset incorporates questions and answers representing diverse perspectives and writing styles, ensuring it remains unbiased and avoids perpetuating discrimination.
The Arabic versions is grammatically accurate without any spelling or grammatical errors. No toxic or harmful content is used while building this dataset.
Continuous Updates and Customization
The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Continuous efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to collect custom question-answer data tailored to specific needs, providing flexibility and customization options.
License:
The dataset, created by FutureBeeAI, is now ready for commercial use. Researchers, data scientists, and developers can utilize this fully labeled and ready-to-deploy Arabic Closed-Ended Question Answer Dataset to enhance the language understanding capabilities of their generative AI models, improve response generation, and explore new approaches to NLP question-answering tasks.
d
Sentiment Analysis Dataset [Consumer Reviews] – Labeled feedback for...
datarade.ai
.csv, .xls, .txt
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
WiserBrand.com, Sentiment Analysis Dataset [Consumer Reviews] – Labeled feedback for training, benchmarking, and CX modeling [Dataset]. https://datarade.ai/data-products/sentiment-analysis-dataset-consumer-reviews-labeled-feedb-wiserbrand-com
Explore at:
.csv, .xls, .txtAvailable download formats
Dataset provided by
WiserBrand
Area covered
Liechtenstein, Bulgaria, Italy, Albania, Canada, Saint Pierre and Miquelon, Faroe Islands, Panama, Luxembourg, Greenland
Description
This dataset provides millions of consumer reviews enriched with sentiment labels (positive, neutral, or negative), making it an essential asset for training AI models, analyzing customer satisfaction, and detecting risk signals in customer feedback.

Collected across 970+ marketplaces (including Amazon, eBay, Temu, Flipkart, and others) and spanning 160+ industries, it reflects how consumers express delight, frustration, or dissatisfaction in real purchase and service situations.

Each entry includes:

Full written review text

Assigned sentiment label: positive, neutral, or negative

Product/service category and platform (e.g., electronics on Amazon)

Optional metadata: review date, star rating, region, brand name

Use this dataset to:

Train sentiment analysis engines and review classifiers

Benchmark brand perception and shifts in consumer tone over time

Detect complaints masked in neutral or positive ratings

Feed LLMs and generative AI with labeled opinion data for alignment tasks

Monitor market sentiment by product, platform, or geography

Whether you're building models or measuring brand trust, this dataset offers a structured view of consumer emotion, helping you turn unstructured feedback into meaningful action.

The more you purchase, the lower the price will be.
Writeup analysis OpenAI gpt-oss-20b
kaggle.com
zip
Updated Sep 23, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
phunter (2025). Writeup analysis OpenAI gpt-oss-20b [Dataset]. https://www.kaggle.com/datasets/phunter/writeup-analysis-openai-gpt-oss-20b
Explore at:
zip(429740 bytes)Available download formats
Dataset updated
Sep 23, 2025
Authors
phunter
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
This document outlines the process used to create a structured, analyzable dataset of LLM attack methods from a corpus of unstructured red-teaming writeups using https://www.kaggle.com/datasets/kaggleqrdl/red-teaming-all-writeups dataset.

1. Framework Creation

The foundation of this analysis is a formal, hierarchical taxonomy of known LLM attack methods, which is defined in attack_taxonomy.md. This taxonomy provides a controlled vocabulary for classification, ensuring consistency across all entries. The raw, unstructured summaries of various attack methodologies were compiled into a single file, condensed_methods.md.

2. Automated Labeling and Data Enrichment

To bridge the gap between the unstructured summaries and the formal taxonomy, we developed predict_taxonomy.py. This script automates the labeling process:

It iterates through each attack summary in condensed_methods.md.

For each summary, it calls the Gemini API with a specialized prompt. This prompt instructs the model to act as an expert AI security researcher.

The model is provided with both the attack summary and the complete attack_taxonomy.md as context.

It is then tasked with selecting the most relevant and specific attack categories that describe the methodology in the summary.

3. Producing the Enriched Dataset

The script captures the list of predicted taxonomy labels from Gemini for each writeup. It then combines the original source, the full summary content, and the new taxonomy labels into a single, structured record.

This entire collection is saved as predicted_taxonomy.json, creating an enriched dataset where each attack method is now machine-readable and systematically classified. This structured data is invaluable for quantitative analysis, pattern recognition, and further research into LLM vulnerabilities.

Column Name	Description
`label`	Binary label indicating the source of the text (`0`: Human, `1`: AI).
`clean_text`	Preprocessed and cleaned text content ready for NLP tasks.

Facebook

Twitter

Click to copy link

Link copied

Cite

Growth Market Reports (2025). Data Labeling with LLMs Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/data-labeling-with-llms-market

Data Labeling with LLMs Market Research Report 2033

Explore at:

csv, pdf, pptxAvailable download formats

Dataset updated

Oct 6, 2025

Dataset authored and provided by

Growth Market Reports

Time period covered

2024 - 2032

Area covered

Global

Description

Data Labeling with LLMs Market Outlook

According to our latest research, the global Data Labeling with LLMs market size was valued at USD 2.14 billion in 2024, with a robust year-on-year growth trajectory. The market is projected to expand at a CAGR of 22.8% from 2025 to 2033, reaching a forecasted value of USD 16.6 billion by 2033. This impressive growth is primarily driven by the increasing adoption of large language models (LLMs) to automate and enhance the efficiency of data labeling processes across various industries. As organizations continue to invest in AI and machine learning, the demand for high-quality, accurately labeled datasets—essential for training and fine-tuning LLMs—continues to surge, fueling the expansion of the data labeling with LLMs market.

One of the principal growth factors for the data labeling with LLMs market is the exponential increase in the volume of unstructured data generated by businesses and consumers worldwide. Organizations are leveraging LLMs to automate the labeling of vast datasets, which is essential for training sophisticated AI models. The integration of LLMs into data labeling workflows is not only improving the speed and accuracy of the annotation process but also reducing operational costs. This technological advancement has enabled enterprises to scale their AI initiatives more efficiently, facilitating the deployment of intelligent applications across sectors such as healthcare, automotive, finance, and retail. Moreover, the continuous evolution of LLMs, with capabilities such as zero-shot and few-shot learning, is further enhancing the quality and context-awareness of labeled data, making these solutions indispensable for next-generation AI systems.

Another significant driver is the growing need for domain-specific labeled datasets, especially in highly regulated industries like healthcare and finance. In these sectors, data privacy and security are paramount, and the use of LLMs in data labeling processes ensures that sensitive information is handled with the utmost care. LLM-powered platforms are increasingly being adopted to create high-quality, compliant datasets for applications such as medical imaging analysis, fraud detection, and customer sentiment analysis. The ability of LLMs to understand context, semantics, and complex language structures is particularly valuable in these domains, where the accuracy and reliability of labeled data directly impact the performance and safety of AI-driven solutions. This trend is expected to continue as organizations strive to meet stringent regulatory requirements while accelerating their AI adoption.

Furthermore, the proliferation of AI-powered applications in emerging markets is contributing to the rapid expansion of the data labeling with LLMs market. Countries in Asia Pacific and Latin America are witnessing significant investments in digital transformation, driving the demand for scalable and efficient data annotation solutions. The availability of cloud-based data labeling platforms, combined with advancements in LLM technologies, is enabling organizations in these regions to overcome traditional barriers such as limited access to skilled annotators and high operational costs. As a result, the market is experiencing robust growth in both developed and developing economies, with enterprises increasingly recognizing the strategic value of high-quality labeled data in gaining a competitive edge.

From a regional perspective, North America currently dominates the data labeling with LLMs market, accounting for the largest share in 2024. This leadership is attributed to the presence of major technology companies, advanced research institutions, and a mature AI ecosystem. However, Asia Pacific is expected to witness the highest CAGR during the forecast period, driven by rapid digitalization, government initiatives supporting AI development, and a burgeoning startup ecosystem. Europe is also emerging as a key market, with strong demand from sectors such as automotive and healthcare. Meanwhile, Latin America and the Middle East & Africa are gradually increasing their market presence, supported by growing investments in AI infrastructure and talent development.

"https://growthmarketreports.com/request-sample/198771">

Clear search

Close search

Google apps

Main menu

Data Labeling with LLMs Market Research Report 2033

Data Labeling with LLMs Market Outlook

Foundation Model Data Collection and Data Annotation | Large Language...

Sentiment Analysis: App Store Reviews

Social Engineering Detection Benchmark with LLMs

📚 Social Engineering Detection Benchmark with LLMs

About the Dataset

Benchmark Results: Top-Performing Models

Dataset Composition

Data Generation & Annotation

Structure of Each Entry

Applications

Getting Started

Ethical & Privacy Considerations

How to Cite

Bitext-travel-llm-chatbot-training-dataset

LLM Data Quality Assurance Market Research Report 2033

LLM Data Quality Assurance Market Outlook

Regional Outlook

Report Scope

Augmented training data and labels, used for training the models

PII | External Dataset

Versions

Description

Curlie Enhanced with LLM Annotations: Two Datasets for Advancing...

Japanese Closed Ended Question Answer Text Dataset

Dataset Content

Question Diversity

Answer Formats

Data Format and Annotation Details

Quality and Accuracy

Continuous Updates and Customization

License:

AI(LLMS) vs. Human Texts Cleaned and Optimized

Title:

Overview:

Key Features:

Dataset Structure:

Sources Used:

Data Cleaning and Preprocessing Steps:

Use Cases:

Finnish Open Ended Question Answer Text Dataset

Dataset Content:

Question Diversity:

Answer Formats:

Data Format and Annotation Details:

Quality and Accuracy:

Continuous Updates and Customization:

License:

Bitext-restaurants-llm-chatbot-training-dataset

Data from: Judge the Votes: A System to Classify Bug Reports and Give...

Hindi Open Ended Classification Prompt & Response Dataset

Dataset Content

Prompt Diversity

Response Formats

Data Format and Annotation Details

Quality and Accuracy

Continuous Updates and Customization

License

Main Data and Code

Malayalam Open Ended Classification Prompt & Response Dataset

Dataset Content

Prompt Diversity

Response Formats

Data Format and Annotation Details

Quality and Accuracy

Continuous Updates and Customization

License

Arabic Closed Ended Question Answer Text Dataset

Dataset Content

Question Diversity

Answer Formats

Data Format and Annotation Details

Quality and Accuracy

Continuous Updates and Customization

License:

Sentiment Analysis Dataset [Consumer Reviews] – Labeled feedback for...

Writeup analysis OpenAI gpt-oss-20b

1. Framework Creation

2. Automated Labeling and Data Enrichment