According to our latest research, the global Data Labeling with LLMs market size was valued at USD 2.14 billion in 2024, with a robust year-on-year growth trajectory. The market is projected to expand at a CAGR of 22.8% from 2025 to 2033, reaching a forecasted value of USD 16.6 billion by 2033. This impressive growth is primarily driven by the increasing adoption of large language models (LLMs) to automate and enhance the efficiency of data labeling processes across various industries. As organizations continue to invest in AI and machine learning, the demand for high-quality, accurately labeled datasets—essential for training and fine-tuning LLMs—continues to surge, fueling the expansion of the data labeling with LLMs market.
One of the principal growth factors for the data labeling with LLMs market is the exponential increase in the volume of unstructured data generated by businesses and consumers worldwide. Organizations are leveraging LLMs to automate the labeling of vast datasets, which is essential for training sophisticated AI models. The integration of LLMs into data labeling workflows is not only improving the speed and accuracy of the annotation process but also reducing operational costs. This technological advancement has enabled enterprises to scale their AI initiatives more efficiently, facilitating the deployment of intelligent applications across sectors such as healthcare, automotive, finance, and retail. Moreover, the continuous evolution of LLMs, with capabilities such as zero-shot and few-shot learning, is further enhancing the quality and context-awareness of labeled data, making these solutions indispensable for next-generation AI systems.
Another significant driver is the growing need for domain-specific labeled datasets, especially in highly regulated industries like healthcare and finance. In these sectors, data privacy and security are paramount, and the use of LLMs in data labeling processes ensures that sensitive information is handled with the utmost care. LLM-powered platforms are increasingly being adopted to create high-quality, compliant datasets for applications such as medical imaging analysis, fraud detection, and customer sentiment analysis. The ability of LLMs to understand context, semantics, and complex language structures is particularly valuable in these domains, where the accuracy and reliability of labeled data directly impact the performance and safety of AI-driven solutions. This trend is expected to continue as organizations strive to meet stringent regulatory requirements while accelerating their AI adoption.
Furthermore, the proliferation of AI-powered applications in emerging markets is contributing to the rapid expansion of the data labeling with LLMs market. Countries in Asia Pacific and Latin America are witnessing significant investments in digital transformation, driving the demand for scalable and efficient data annotation solutions. The availability of cloud-based data labeling platforms, combined with advancements in LLM technologies, is enabling organizations in these regions to overcome traditional barriers such as limited access to skilled annotators and high operational costs. As a result, the market is experiencing robust growth in both developed and developing economies, with enterprises increasingly recognizing the strategic value of high-quality labeled data in gaining a competitive edge.
From a regional perspective, North America currently dominates the data labeling with LLMs market, accounting for the largest share in 2024. This leadership is attributed to the presence of major technology companies, advanced research institutions, and a mature AI ecosystem. However, Asia Pacific is expected to witness the highest CAGR during the forecast period, driven by rapid digitalization, government initiatives supporting AI development, and a burgeoning startup ecosystem. Europe is also emerging as a key market, with strong demand from sectors such as automotive and healthcare. Meanwhile, Latin America and the Middle East & Africa are gradually increasing their market presence, supported by growing investments in AI infrastructure and talent development.
-SFT: Nexdata assists clients in generating high-quality supervised fine-tuning data for model optimization through prompt and output annotation.
-Red teaming: Nexdata helps clients train and validate models by drafting various adversarial attacks, such as exploratory or potentially harmful questions. Our red-team capabilities help clients identify problems in their models related to hallucinations, harmful content, false information, discrimination, and language bias.
-RLHF: Nexdata assists clients in manually ranking multiple outputs generated by the SFT-trained model according to the rules provided by the client, or provides multi-factor scoring. By training annotators to align with values and utilizing a multi-person fitting approach, the quality of feedback can be improved.
-Compliance: All Large Language Model (LLM) data is collected with proper authorization.
-Quality: Multiple rounds of quality inspection ensure high-quality data output.
-Secure Implementation: An NDA is signed to guarantee secure implementation, and data is destroyed upon delivery.
-Efficiency: Our platform supports human-machine interaction and semi-automatic labeling, increasing labeling efficiency by more than 30% per annotator. It has been applied successfully to nearly 5,000 projects.
About Nexdata: Nexdata is equipped with professional data collection devices, tools, and environments, as well as project managers experienced in data collection and quality control, so we can meet Large Language Model (LLM) data collection requirements across a variety of scenarios and data types. We have global data processing centers and more than 20,000 professional annotators, supporting on-demand LLM data annotation services covering speech, image, video, point cloud, and Natural Language Processing (NLP) data. Please visit us at https://www.nexdata.ai/?source=Datarade
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Please share your suggestions to improve my datasets further✍️
📄 Dataset Overview
This dataset contains Google Play Store app reviews labeled for sentiment using a deterministic Large Language Model (LLM) classification pipeline. Each review is tagged as positive, negative, or neutral, making it ready for NLP training, benchmarking, and market insight generation.
⚙️ Data Collection & Labeling Process
Source: Reviews collected from the Google Play Store using the google_play_scraper library.
Labeling: Reviews classified by a Hugging Face Transformers-based LLM with a strict prompt to ensure one-word output.
Post-processing: Outputs normalized to the three sentiment classes.
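The sketch below illustrates this collection-and-labeling flow. The app id, model name, and prompt wording are illustrative assumptions, not the exact ones used to build the dataset; only the overall structure (scrape, classify with a strict one-word prompt, normalize) follows the description above.

```python
# Hypothetical sketch of the scrape -> LLM-classify -> normalize pipeline.
from google_play_scraper import Sort, reviews
from transformers import pipeline

# 1. Collect reviews from the Google Play Store (app id is a placeholder).
raw, _ = reviews("com.example.app", lang="en", country="us",
                 sort=Sort.NEWEST, count=200)

# 2. Classify each review with an instruction-following LLM; greedy decoding keeps it deterministic.
generator = pipeline("text-generation", model="Qwen/Qwen2.5-0.5B-Instruct")
ALLOWED = {"positive", "negative", "neutral"}

def label(review_text: str) -> str:
    prompt = ("Classify the sentiment of this app review as exactly one word: "
              f"positive, negative, or neutral.\nReview: {review_text}\nSentiment:")
    out = generator(prompt, max_new_tokens=3, do_sample=False)[0]["generated_text"]
    completion = out[len(prompt):].strip().lower()
    word = completion.split()[0].strip(".,!") if completion else ""
    # 3. Post-process: normalize anything unexpected to "neutral".
    return word if word in ALLOWED else "neutral"

labeled = [{"review": r["content"], "sentiment": label(r["content"])} for r in raw]
```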
💡 Potential Uses
Fine-tuning BERT, RoBERTa, LLaMA, or other transformer models.
Sentiment dashboards for product feedback monitoring.
Market research on user perception trends.
Benchmark dataset for text classification experiments.
Please upvote!!!!
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the augmented data and labels used in training the model. It is also needed for evaluation, since the vectoriser is fit on this data and the test data is then transformed with that fitted vectoriser.
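A minimal sketch of that fit/transform split, assuming a scikit-learn TF-IDF vectoriser and placeholder texts; the actual vectoriser and files may differ:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_texts = ["an augmented training sentence", "another labeled example"]  # this dataset
test_texts = ["an unseen test sentence"]

vectoriser = TfidfVectorizer()
X_train = vectoriser.fit_transform(train_texts)  # vectoriser is fit on the augmented training data
X_test = vectoriser.transform(test_texts)        # test data is only transformed with the fitted vectoriser
```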
https://choosealicense.com/licenses/cdla-sharing-1.0/
Bitext - Media Tagged Training Dataset for LLM-based Virtual Assistants
Overview
This hybrid synthetic dataset is designed to be used to fine-tune Large Language Models such as GPT, Mistral and OpenELM, and has been generated using our NLP/NLG technology and our automated Data Labeling (DAL) tools. The goal is to demonstrate how Verticalization/Domain Adaptation for the [media] sector can be easily achieved using our two-step approach to LLM Fine-Tuning. An overview of… See the full description on the dataset page: https://huggingface.co/datasets/bitext/Bitext-media-llm-chatbot-training-dataset.
https://researchintelo.com/privacy-and-policy
According to our latest research, the Global LLM Data Quality Assurance market size was valued at $1.25 billion in 2024 and is projected to reach $8.67 billion by 2033, expanding at a robust CAGR of 23.7% during 2024–2033. The major factor propelling the growth of the LLM Data Quality Assurance market globally is the rapid proliferation of generative AI and large language models (LLMs) across industries, creating an urgent need for high-quality, reliable, and bias-free data to fuel these advanced systems. As organizations increasingly depend on LLMs for mission-critical applications, ensuring the integrity and accuracy of training and operational data has become indispensable to mitigate risk, enhance performance, and comply with evolving regulatory frameworks.
North America currently commands the largest share of the LLM Data Quality Assurance market, accounting for approximately 38% of the global revenue in 2024. This dominance can be attributed to the region’s mature AI ecosystem, significant investments in digital transformation, and the presence of leading technology firms and AI research institutions. The United States, in particular, has spearheaded the adoption of LLMs in sectors such as BFSI, healthcare, and IT, driving the demand for advanced data quality assurance solutions. Favorable government policies supporting AI innovation, a strong startup culture, and robust regulatory guidelines around data privacy and model transparency have further solidified North America’s leadership position in the market.
Asia Pacific is emerging as the fastest-growing region in the LLM Data Quality Assurance market, with a projected CAGR of 27.4% from 2024 to 2033. This rapid growth is driven by escalating investments in AI infrastructure, increasing digitalization across enterprises, and government-led initiatives to foster AI research and deployment. Countries such as China, Japan, South Korea, and India are witnessing exponential growth in LLM adoption, especially in sectors like e-commerce, telecommunications, and manufacturing. The region’s burgeoning talent pool, combined with a surge in AI-focused venture capital funding, is fueling innovation in data quality assurance platforms and services, positioning Asia Pacific as a major future growth engine for the market.
Emerging economies in Latin America and the Middle East & Africa are also starting to recognize the importance of LLM Data Quality Assurance, but adoption remains at a nascent stage due to infrastructural limitations, skill gaps, and budgetary constraints. These regions are gradually overcoming barriers as multinational corporations expand their operations and local governments launch digital transformation agendas. However, challenges such as data localization requirements, fragmented regulatory landscapes, and limited access to cutting-edge AI technologies are slowing widespread adoption. Despite these hurdles, localized demand for data quality solutions in sectors like banking, retail, and healthcare is expected to rise steadily as these economies modernize and integrate AI-driven workflows.
| Attributes | Details |
|---|---|
| Report Title | LLM Data Quality Assurance Market Research Report 2033 |
| By Component | Software, Services |
| By Application | Model Training, Data Labeling, Data Validation, Data Cleansing, Data Monitoring, Others |
| By Deployment Mode | On-Premises, Cloud |
| By Enterprise Size | Small and Medium Enterprises, Large Enterprises |
| By End-User | BFSI, Healthcare, Retail and E-commerce, IT and Telecommunications, Media and Entertainment, Manufacturing, Others |
https://creativecommons.org/publicdomain/zero/1.0/
This document outlines the process used to create a structured, analyzable dataset of LLM attack methods from a corpus of unstructured red-teaming writeups, using the https://www.kaggle.com/datasets/kaggleqrdl/red-teaming-all-writeups dataset.
The foundation of this analysis is a formal, hierarchical taxonomy of known LLM attack methods, which is defined in attack_taxonomy.md. This taxonomy provides a controlled vocabulary for classification, ensuring consistency across all entries. The raw, unstructured summaries of various attack methodologies were compiled into a single file, condensed_methods.md.
To bridge the gap between the unstructured summaries and the formal taxonomy, we developed predict_taxonomy.py. This script automates the labeling process: it reads each writeup summary from condensed_methods.md and supplies attack_taxonomy.md as context. The script captures the list of predicted taxonomy labels from Gemini for each writeup, then combines the original source, the full summary content, and the new taxonomy labels into a single, structured record.
This entire collection is saved as predicted_taxonomy.json, creating an enriched dataset where each attack method is now machine-readable and systematically classified. This structured data is invaluable for quantitative analysis, pattern recognition, and further research into LLM vulnerabilities.
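The sketch below shows one way such a script could be organized, assuming the google-generativeai client; the prompt wording, model name, and file parsing are illustrative assumptions rather than the script's actual implementation.

```python
import json
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-pro")  # model choice is an assumption

taxonomy = open("attack_taxonomy.md", encoding="utf-8").read()
# Assume one writeup summary per markdown section; the real splitting logic may differ.
summaries = open("condensed_methods.md", encoding="utf-8").read().split("\n## ")

records = []
for summary in summaries:
    prompt = ("Using ONLY labels from the taxonomy below, return the attack-method "
              "labels that apply to this writeup as a JSON array of strings.\n\n"
              f"TAXONOMY:\n{taxonomy}\n\nWRITEUP SUMMARY:\n{summary}")
    response = model.generate_content(prompt)
    labels = json.loads(response.text)  # predicted taxonomy labels (assumes clean JSON output)
    records.append({"summary": summary, "labels": labels})

with open("predicted_taxonomy.json", "w", encoding="utf-8") as f:
    json.dump(records, f, indent=2)
```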
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This is an LLM-generated external dataset for The Learning Agency Lab - PII Data Detection competition.
It contains 3382 4434 generated texts with their corresponding annotated labels in the required competition format.
Description:
- document (str): ID of the essay
- full_text (str): AI-generated text.
- tokens (list): a list with the tokens (comes from text.split())
- trailing_whitespace (list): a list with boolean values indicating whether each token is followed by whitespace.
- labels (list): list with token labels in BIO format
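A small sketch of how a record with this schema can be used, for example to rebuild the text from tokens and trailing-whitespace flags; the file name is hypothetical and the single-space reconstruction is an approximation:

```python
import json

with open("pii_external_dataset.json", encoding="utf-8") as f:  # hypothetical file name
    docs = json.load(f)

doc = docs[0]
# Rebuild an approximation of full_text: append a single space wherever
# trailing_whitespace is True for the corresponding token.
rebuilt = "".join(tok + (" " if ws else "")
                  for tok, ws in zip(doc["tokens"], doc["trailing_whitespace"]))

# Pair each token with its BIO label, e.g. ("John", "B-NAME_STUDENT").
token_label_pairs = list(zip(doc["tokens"], doc["labels"]))
```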
https://creativecommons.org/publicdomain/zero/1.0/
Cleaned and Optimized Dataset for AI vs. Human Text Classification
This dataset is a curated and optimized collection of text data designed for training and evaluating machine learning models to distinguish between AI-generated and human-written text. The dataset has been meticulously cleaned, deduplicated, and reduced in size to ensure efficiency while maintaining its utility for research and development purposes.
By combining multiple high-quality sources, this dataset provides a diverse range of text samples, making it ideal for tasks such as binary classification (AI vs. Human) and other natural language processing (NLP) applications.
Cleaned Text
Label Consistency: 0 for human-written text, 1 for AI-generated text.
Memory Optimization: category data type used for categorical columns.
Deduplication
Null Value Handling
Compact Size: only label and clean_text are kept, making it lightweight and easy to use.
The final dataset contains the following columns:
| Column Name | Description |
|---|---|
| label | Binary label indicating the source of the text (0: Human, 1: AI). |
| clean_text | Preprocessed and cleaned text content ready for NLP tasks. |
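A quick-start sketch for this two-column layout; the CSV file name is a placeholder for however the dataset is downloaded:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("ai_vs_human_cleaned.csv")  # hypothetical file name

print(df["label"].value_counts())        # 0 = human-written, 1 = AI-generated
print(df.loc[0, "clean_text"][:200])     # preview one cleaned sample

# Stratified split, ready for a binary AI-vs-human classifier.
X_train, X_test, y_train, y_test = train_test_split(
    df["clean_text"], df["label"], test_size=0.2, stratify=df["label"], random_state=42)
```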
This dataset is a consolidation of multiple high-quality datasets from various sources, ensuring diversity and representativeness. Below are the details of the sources used:
Source 1: columns text, generated (renamed to label).
Source 2: columns text, label.
Source 3: columns AI_Essay (renamed to text), prompt_id (renamed to label).
Source 4: columns text, label.
Source 5: columns text, source (renamed to label).
Source 6: columns text, label.
To ensure the dataset is clean, consistent, and optimized for use, the following steps were performed:
Column Standardization: columns renamed to text and label.
Text Cleaning
Duplicate Removal
Null Value Handling: rows with missing text or label dropped.
Memory Optimization: converted to the category type for memory efficiency.
Final Dataset Creation: only label and clean_text retained.
This datase...
https://choosealicense.com/licenses/cdla-sharing-1.0/
Bitext - Restaurants Tagged Training Dataset for LLM-based Virtual Assistants
Overview
This hybrid synthetic dataset is designed to be used to fine-tune Large Language Models such as GPT, Mistral and OpenELM, and has been generated using our NLP/NLG technology and our automated Data Labeling (DAL) tools. The goal is to demonstrate how Verticalization/Domain Adaptation for the [restaurants] sector can be easily achieved using our two-step approach to LLM Fine-Tuning. An… See the full description on the dataset page: https://huggingface.co/datasets/bitext/Bitext-restaurants-llm-chatbot-training-dataset.
Data size: 200,000 IDs
Race distribution: Black, Caucasian, brown (Mexican), Indian, and Asian people
Gender distribution: gender balance
Age distribution: young, midlife, and senior
Collecting environment: indoor and outdoor scenes
Data diversity: different face poses, races, ages, lighting conditions, and scenes
Device: cellphone
Data format: .jpg/.png
Accuracy: the accuracy of the face pose, race, gender, and age labels is more than 97%
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Advancing Homepage2Vec with LLM-Generated Datasets for Multilingual Website Classification
This dataset contains two subsets of labeled website data, specifically created to enhance the performance of Homepage2Vec, a multi-label model for website classification. The datasets were generated using Large Language Models (LLMs) to provide more accurate and diverse topic annotations for websites, addressing a limitation of existing Homepage2Vec training data.
Key Features:
LLM-generated annotations: Both datasets feature website topic labels generated using LLMs, a novel approach to creating high-quality training data for website classification models.
Improved multi-label classification: Fine-tuning Homepage2Vec with these datasets has been shown to improve its macro F1 score from 38% to 43%, evaluated on a human-labeled dataset, demonstrating their effectiveness in capturing a broader range of website topics.
Multilingual applicability: The datasets facilitate classification of websites in multiple languages, reflecting the inherent multilingual nature of Homepage2Vec.
Dataset Composition:
curlie-gpt3.5-10k: 10,000 websites labeled using GPT-3.5, context 2 and 1-shot
curlie-gpt4-10k: 10,000 websites labeled using GPT-4, context 2 and zero-shot
Intended Use:
Fine-tuning and advancing Homepage2Vec or similar website classification models
Research on LLM-generated datasets for text classification tasks
Exploration of multilingual website classification
Additional Information:
Project and report repository: https://github.com/CS-433/ml-project-2-mlp
Acknowledgments:
This dataset was created as part of a project at EPFL's Data Science Lab (DLab) in collaboration with Prof. Robert West and Tiziano Piccardi.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview
Context: A large number of bug reports are submitted in software projects on a daily basis. They are examined manually by practitioners. However, most bug reports are not related to errors in the codebase (invalid bug reports) and cause a great waste of time and energy. Previous research has used various machine learning based and language model based techniques to tackle this problem through auto-classification of bug reports. There is a gap, however, in the classification of bug reports using large language models (LLMs) and in techniques to improve LLM performance for binary classification.
Objective: The aim of this study is to apply various machine learning and natural language processing methods to classify bug reports as valid or invalid, then supply the predictions of these models to an LLM judge, along with similar bug reports and their labels, to enable it to make an informed prediction.
Method: We first retrieved 10,000 real-world Firefox bug reports via the Bugzilla API, then divided them randomly into a training set and a test set. We trained three traditional classifiers (Naive Bayes, Random Forest, and SVM) on the training set, and fine-tuned five BERT-based models for classification. In parallel, we used LLMs with in-context examples supplied by a Retrieval Augmented Generation (RAG) system built over the training set, and compared the results with zero-shot prompting. We picked the best-performing LLM and used it as the judge LLM, providing it with the votes of the three diverse ML models and the top-5 semantically similar bug reports. We compared the performance of the judge LLM with the majority vote of the three chosen models.
Results: The classical ML pipelines (Naive Bayes, Random Forest, and Linear SVM) trained on TF-IDF features achieved strong binary F1 scores of 0.881, 0.894, and 0.878, respectively. Our suite of five fine-tuned BERT-based classifiers further improved performance, with F1 scores of 0.899 for BERT base, 0.909 for RoBERTa, 0.902 for CodeBERT, 0.902 for CodeBERT-Graph, and 0.899 for the 128-token BERT variant. In contrast, zero-shot LLM classification without retrieval produced F1 scores ranging from 0.531 (GPT-4o-mini) to 0.737 (Llama-3.3-70B), highlighting the gap between out-of-the-box LLMs and specialized models. Introducing RAG-based few-shot prompting closed much of that gap, lifting LLM F1 scores to 0.815 for GPT-o3-mini, 0.759 for GPT-o4-mini, and 0.797 for Llama, while GPT-4o-mini reached 0.729. Finally, our hybrid judge pipeline, combining the top-5 similar bug reports, votes from RF, SVM, and RoBERTa, and reasoning by GPT-o3-mini, yielded an F1 of 0.871, striking a balance between raw accuracy and human-readable explanations.
Conclusions: Our evaluation confirms that specialized, fine-tuned classifiers, particularly the RoBERTa and CodeBERT variants, remain the most cost-effective and highest-accuracy solutions for binary bug-report triage. Using RAG with LLMs substantially boosts classification over zero-shot baselines, although the scores do not surpass our top fine-tuned models. Nevertheless, their natural-language rationales and actionable suggestions offer an explainability advantage that static classifiers cannot match. In practice, a hybrid ensemble, where fast, accurate classifiers handle the bulk of cases and an LLM judge provides human-readable justification for edge cases, appears to strike the best balance between performance, cost, and transparency.
Research Questions
The study addresses the following research questions:
RQ1: How do reasoning LLMs perform compared to prior ML and LLM classifiers when supported with few-shot prompting?
RQ2: What is the effect of integrating an LLM judge into a majority-voting ensemble versus using voting alone?
RQ3: How does few-shot prompting with RAG impact LLM performance relative to zero-shot prompting?
Project Structure
The replication package consists of four folders. The data folder contains two files: training.csv and bug_reports.csv. training.csv contains only two columns, text and label, and is the version we used when training the models. bug_reports.csv contains the columns and IDs we retrieved from Bugzilla. The code folder contains the .ipynb files we used when creating the dataset, training ML models, fine-tuning BERT-based models, and getting predictions from LLMs in both zero-shot and few-shot settings. The notebook for the unified pipeline is also included. The preds folder contains the predictions, explanations, and original labels of the bug reports from the test dataset. The metrics folder contains a single file with the metrics as JSON objects for all eighteen model configurations we tested.
Instruction for Replication
This replication package has the following folder structure: code, data, preds, metrics.
The code folder keeps the code we used to create the dataset and preprocess it. It also includes the code we used to train the different models used in the study and evaluate them. bert_models.ipynb contains the code for the five BERT-based models we fine-tuned, along with code to print the detailed scores for ease of evaluation. create_dataset.ipynb contains the code we used to create the dataset using the Bugzilla API. gpt_4o_mini.ipynb, gpt_o3_mini.ipynb, gpt_o4_mini.ipynb, and llama_70b.ipynb contain the code used to test the models named in the file names, with zero-shot and few-shot configurations.
The data folder contains the dataset we collected and used in the study. bug_reports.csv contains all the information we obtained, whether or not it was used in model training. training.csv is the version of the dataset used for model training; it has only two labels, which simplifies the training process.
The preds folder contains all predictions and explanations from the language models and the hybrid judge pipeline.
The metrics folder contains metrics.json, which includes all the model metrics as given in the paper. This file is also handy for comparing the results of different models.
For the analysis performed in the review process, we added another folder named review_analysis, which includes the scripts used to conduct additional analyses of the differences between the judge LLM and majority voting, and of the subsets where each approach is successful. The analysis also examines the aspects of bug reports that played an important role in their being classified as valid by the LLM, including evidence-based patterns (such as file paths, URLs, etc.) and potential biases such as text-length correlation.
There are three Python scripts for performing the analyses, four detailed markdown reports documenting the findings, and five visualizations illustrating model comparisons, voting patterns, and bias-detection results. The key findings reveal that while the judge LLM provides detailed natural-language explanations, it exhibits a systematic text-length bias (r = -0.360) and underperforms compared to the ML majority vote (81.5% vs. 87.0% accuracy), though it offers superior explainability for human reviewers.
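A minimal sketch of the classical baseline plus the majority vote described above: TF-IDF features, the three traditional classifiers, and a simple vote over their predictions. File and column names follow the replication-package description (training.csv with text and label); the hyperparameters and the 0/1 label encoding are illustrative assumptions.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

df = pd.read_csv("training.csv")  # columns: text, label (assumed 0 = invalid, 1 = valid)
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, random_state=42)

vec = TfidfVectorizer(max_features=50_000)
Xtr, Xte = vec.fit_transform(X_train), vec.transform(X_test)

models = {
    "naive_bayes": MultinomialNB(),
    "random_forest": RandomForestClassifier(n_estimators=300, random_state=42),
    "linear_svm": LinearSVC(),
}
preds = {}
for name, model in models.items():
    model.fit(Xtr, y_train)
    preds[name] = model.predict(Xte)
    print(name, "F1:", f1_score(y_test, preds[name]))

# Majority vote across the three classifiers (with 0/1 labels, >= 2 positive votes wins).
vote = (sum(preds.values()) >= 2).astype(int)
print("majority vote F1:", f1_score(y_test, vote))
```

In the study itself, the judge LLM then receives these votes together with the top-5 most similar bug reports to produce its final prediction and explanation.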
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A JSON file with ground truth sentiment labels used in evaluation and comparison to LLM prediction.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The prevalence of bias in the news media has become a critical issue, affecting public perception on a range of important topics such as political views, health, insurance, resource distribution, religion, race, age, gender, occupation, and climate change. The media has a moral responsibility to ensure accurate information dissemination and to increase awareness about important issues and the potential risks associated with them. This highlights the need for a solution that can help mitigate the spread of false or misleading information and restore public trust in the media.
Data description: This is a dataset for news media bias covering different dimensions of bias: political, hate speech, toxicity, sexism, ageism, gender identity, gender discrimination, race/ethnicity, climate change, occupation, and spirituality, which makes it a unique contribution. The dataset used for this project does not contain any personally identifiable information (PII).
The data structure is tabulated as follows:
Text: The main content.
Dimension: Descriptive category of the text.
Biased_Words: A compilation of words regarded as biased.
Aspect: Specific sub-topic within the main content.
Label: Indicates the presence (True) or absence (False) of bias. The label is ternary: highly biased, slightly biased, and neutral.
Toxicity: Indicates the presence (True) or absence (False) of toxicity.
Identity_mention: Mention of any identity based on word match.
Annotation Scheme
The labels and annotations in the dataset are generated through a system of Active Learning, cycling through:
Manual Labeling
Semi-Supervised Learning
Human Verification
The scheme comprises:
Bias Label: Specifies the degree of bias (e.g., no bias, mild, or strong).
Words/Phrases Level Biases: Pinpoints specific biased terms or phrases.
Subjective Bias (Aspect): Highlights biases pertinent to content dimensions.
Due to the nuances of semantic match algorithms, certain labels such as 'identity' and 'aspect' may appear distinctly different.
List of datasets used: We curated different news categories (climate crisis news summaries, occupational, spiritual/faith, general) using RSS to capture different dimensions of news media bias. The annotation is performed using active learning to label each sentence (either neutral, slightly biased, or highly biased) and to pick biased words from the news.
We also utilize publicly available data from the following sources; our attribution to others:
MBIC (media bias): Spinde, Timo, Lada Rudnitckaia, Kanishka Sinha, Felix Hamborg, Bela Gipp, and Karsten Donnay. "MBIC - A Media Bias Annotation Dataset Including Annotator Characteristics." arXiv preprint arXiv:2105.11910 (2021). https://zenodo.org/records/4474336
Hyperpartisan news: Kiesel, Johannes, Maria Mestre, Rishabh Shukla, Emmanuel Vincent, Payam Adineh, David Corney, Benno Stein, and Martin Potthast. "SemEval-2019 Task 4: Hyperpartisan News Detection." In Proceedings of the 13th International Workshop on Semantic Evaluation, pp. 829-839. 2019. https://huggingface.co/datasets/hyperpartisan_news_detection
Toxic comment classification: Adams, C.J., Jeffrey Sorensen, Julia Elliott, Lucas Dixon, Mark McDonald, Nithum, and Will Cukierski. 2017. "Toxic Comment Classification Challenge." Kaggle. https://kaggle.com/competitions/jigsaw-toxic-comment-classification-challenge
Jigsaw Unintended Bias: Adams, C.J., Daniel Borkan, Inversion, Jeffrey Sorensen, Lucas Dixon, Lucy Vasserman, and Nithum. 2019. "Jigsaw Unintended Bias in Toxicity Classification." Kaggle. https://kaggle.com/competitions/jigsaw-unintended-bias-in-toxicity-classification
Age Bias: Díaz, Mark, Isaac Johnson, Amanda Lazar, Anne Marie Piper, and Darren Gergle. "Addressing Age-Related Bias in Sentiment Analysis." In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, pp. 1-14. 2018. Age Bias Training and Testing Data - Age Bias and Sentiment Analysis Dataverse (harvard.edu)
Multi-dimensional news Ukraine: Färber, Michael, Victoria Burkard, Adam Jatowt, and Sora Lim. "A Multidimensional Dataset Based on Crowdsourcing for Analyzing and Detecting News Bias." In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, pp. 3007-3014. 2020. https://zenodo.org/records/3885351#.ZF0KoxHMLtV
Social biases: Sap, Maarten, Saadia Gabriel, Lianhui Qin, Dan Jurafsky, Noah A. Smith, and Yejin Choi. "Social Bias Frames: Reasoning About Social and Power Implications of Language." arXiv preprint arXiv:1911.03891 (2019). https://maartensap.com/social-bias-frames/
Goal of this dataset: We want to offer open and free access to the dataset, ensuring a wide reach to researchers and AI practitioners across the world. The dataset should be user-friendly, and uploading and accessing the data should be straightforward, to facilitate usage.
If you use this dataset, please cite us. Navigating News Narratives: A Media Bias Analysis Dataset © 2023 by Shaina Raza, Vector Institute is licensed under CC BY-NC 4.0.
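A purely illustrative record showing the tabular fields described above; the values are invented for demonstration and are not taken from the dataset:

```python
example_row = {
    "Text": "Example news sentence discussing a policy decision.",  # main content
    "Dimension": "political",                                       # descriptive category
    "Biased_Words": ["radical"],                                     # words regarded as biased
    "Aspect": "policy decision",                                     # sub-topic within the content
    "Label": "slightly biased",   # ternary: highly biased / slightly biased / neutral
    "Toxicity": False,
    "Identity_mention": False,
}
```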
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The HVULao_NLP project is dedicated to sharing datasets and tools for Lao Natural Language Processing (NLP), developed and maintained by the research team at Hung Vuong University (HVU), Phu Tho, Vietnam. This project is supported by Hung Vuong University with the aim of advancing research and applications in low-resource language processing, particularly for the Lao language.
📁 Datasets
This release provides a semi-automatically constructed corpus consisting of Lao sentences that have been word-segmented and part-of-speech (POS) tagged. It is designed to support a wide range of NLP applications, including language modeling, sequence labeling, linguistic research, and the development of Lao language tools.
Datatest1k/ – Test set (1,000 Lao sentences)
testorgin1000.txt: Original raw sentences (UTF-8, one sentence per line).
testsegsent_1000.txt: Word-segmented version aligned 1-to-1 with the raw file (tokens separated by spaces).
testtag1k.json: Word-segmented and POS-tagged sentences, generated using large language models (LLMs) and manually reviewed by native linguists.

Datatrain10k/ – Training set (10,000 Lao sentences)
10ktrainorin.txt: Original raw sentences (UTF-8, one sentence per line).
10ksegmented.txt: Word-segmented version aligned 1-to-1 with the raw file.
10ktraintag.json: Word-segmented and POS-tagged sentences, generated using the same method as the test set.

lao_finetuned_10k/ – A fine-tuned transformer-based model for Lao word segmentation, compatible with Hugging Face's transformers library.
All data files are encoded in UTF-8 (NFC) and prepared for direct use in NLP pipelines.
📁 The Lao sentence segmentation tool
A command-line tool for Lao word segmentation built with a fine-tuned Hugging Face transformers model and PyTorch.
Features
- Accurate Lao word segmentation using a pre-trained model
- Simple command-line usage
- GPU support (if available)
Example usage
```bash
python3 segment_lao.py -i ./data/lao_raw.txt -o ./output/lao_segmented.txt
```
📁 The Lao sentence POS tagging tool
A POS tagging tool for segmented Lao text, implemented with Python and CRF++.
Example usage
```bash
python3 Pos_tagging.py ./Test/lao_sentences_segmented.txt Test1
```
📚 Usage
The HVULao_NLP dataset and tools are intended for:
- Training and evaluating sequence labeling models (e.g., CRF, BiLSTM, mBERT)
- Developing Lao NLP tools (e.g., POS taggers, tokenizers)
- Conducting linguistic and computational research on Lao
https://choosealicense.com/licenses/cdla-sharing-1.0/
Bitext - Retail (eCommerce) Tagged Training Dataset for LLM-based Virtual Assistants
Overview
This hybrid synthetic dataset is designed to be used to fine-tune Large Language Models such as GPT, Mistral and OpenELM, and has been generated using our NLP/NLG technology and our automated Data Labeling (DAL) tools. The goal is to demonstrate how Verticalization/Domain Adaptation for the [Retail (eCommerce)] sector can be easily achieved using our two-step approach to LLM… See the full description on the dataset page: https://huggingface.co/datasets/bitext/Bitext-retail-ecommerce-llm-chatbot-training-dataset.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Ethical principles are fundamental for the long-term sustainable growth of AI/ML; however, recent research highlights that many projects have yet to fully integrate these guidelines. This research work aims to assess the current rate of adoption of ethical principles in AI/ML within the software development space. We collected 96,254 pull requests from 28 AI/ML GitHub projects and randomly selected 400 pull requests for manual labeling based on the seven EU ethical guidelines. To address the challenge of scalability and consistency in manual labeling, we investigated the use of a zero-shot large language model (LLM), OpenAI's GPT-4o. This LLM was leveraged to automatically detect ethical AI principles in our sample of pull requests. Our findings demonstrate that GPT-4o has the potential to support ethical compliance in software development. Looking ahead, we envision automating the scanning of code changes for ethical concerns, similar to vulnerability detection models. This tool would flag high-risk pull requests for ethical review, aiding AI risk assessment in open-source projects and supporting the automatic generation of an AI Bill of Materials (AI BOM).
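A hedged sketch of what such zero-shot detection could look like with the openai Python client; the prompt wording and output format are illustrative, not the study's actual setup (the seven principles listed are the requirements from the EU Ethics Guidelines for Trustworthy AI):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PRINCIPLES = [
    "human agency and oversight", "technical robustness and safety",
    "privacy and data governance", "transparency",
    "diversity, non-discrimination and fairness",
    "societal and environmental well-being", "accountability",
]

def detect_principles(pr_title: str, pr_description: str) -> str:
    """Zero-shot: ask the model which EU ethical AI principles a pull request touches."""
    prompt = (
        "Which of the following EU ethical AI principles, if any, does this pull "
        f"request address? Principles: {', '.join(PRINCIPLES)}.\n\n"
        f"Title: {pr_title}\nDescription: {pr_description}\n"
        "Answer with a comma-separated list of principles, or 'none'."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()
```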
data
If you are looking for our intermediate labeling version, please refer to mango-ttic/data-intermediate. Find more about us at mango.ttic.edu.
Folder Structure
Each folder inside data contains the cleaned-up files used during LLM inference and results evaluation. Here is the tree structure for the game data/night:
data/night/
├── night.actions.json   # list of mentioned actions
├── night.all2all.jsonl  # all simple paths between any 2 locations
├── … See the full description on the dataset page: https://huggingface.co/datasets/mango-ttic/data.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Overview
The Open Orca Enhanced Dataset is meticulously designed to improve the performance of automated essay grading models using deep learning techniques. This dataset integrates robust data instances from the FLAN collection, augmented with responses generated by GPT-3.5 or GPT-4, creating a diverse and context-rich resource for training models.
Dataset Structure
The dataset is structured in a tabular format, with the following key fields:
id: A unique identifier for each data instance.
system prompt: The prompt presented to the GPT-3.5 or GPT-4 API.
question: The question entry as provided by the FLAN collection.
response: The response received from GPT-3.5 or GPT-4.
label: The classification of the response as "True" (ideal response) or "False" (generated as a close, yet incorrect, alternative).
Data Collection and Processing
Initial Dataset Selection: We initially chose the QuAC dataset due to its resemblance to student essay responses. However, we identified limitations and transitioned to the Open Orca dataset for its superior structure and data quality.
Format Conversion: We converted the QuAC context-question-answer format by identifying "True" answers as ground truth and generating "False" answers by selecting random responses. This approach was initially tested using the flan T5 model, which only achieved 40% accuracy.
RAG Implementation: To enhance the differentiation between "True" and "False" answers, we employed Retrieval Augmented Generation (RAG) to select the third most similar answer as the "False" response, significantly improving model accuracy to 88%.
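A hedged sketch of the core idea behind that step: embed the candidate answers, rank them by similarity to the ground-truth answer, and take the third most similar one as the hard "False" alternative. The embedding model and retrieval details here are illustrative assumptions, not necessarily what was used to build the dataset.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedding model

def third_most_similar(true_answer: str, candidate_answers: list[str]) -> str:
    """Return the candidate ranked third by cosine similarity to the true answer."""
    emb_true = model.encode(true_answer, convert_to_tensor=True)
    emb_cands = model.encode(candidate_answers, convert_to_tensor=True)
    scores = util.cos_sim(emb_true, emb_cands)[0]        # similarity to each candidate
    ranked = scores.argsort(descending=True).tolist()    # indices, most similar first
    return candidate_answers[ranked[2]]                  # third most similar -> "False" label
```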
Data Augmentation
Instructional Prompts: The dataset includes instructional prompts that enable the training of ChatGPT-like models, contributing to notable accuracy improvements.
Contextual Relevance: A multi-stage filtering process ensured the retention of contextually rich prompts, with over 1,000 initial prompts filtered down to align with 2.1 million samples.
Labeling: The final dataset includes labels that not only classify answers as "True" or "False" but also provide the ground truth answer, enhancing the model's understanding of context and logical response generation.
Evaluation and Performance
Accuracy Metrics: The refined dataset achieved remarkable performance:
English LLM: 97% accuracy.
Arabic LLM: 90% accuracy.
Model Comparison: Incorporating the ground truth answer into the label improved model accuracy significantly, as evidenced by the comparison:
Flan T5: Improved from 20% to 83%.
Bloomz: Improved from 40% to 85%.
Translation for Multilingual Models
Arabic Dataset Creation: Leveraging Google Translate's advancements, we translated the robust English dataset into Arabic, ensuring the creation of a truly multilingual resource. Google Translate's high accuracy (82.5%) provided a solid foundation for this translation.