38 datasets found
  1. Data Labeling with LLMs Market Research Report 2033

    • growthmarketreports.com
    csv, pdf, pptx
    Updated Oct 6, 2025
    Cite
    Growth Market Reports (2025). Data Labeling with LLMs Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/data-labeling-with-llms-market
    Available download formats: csv, pdf, pptx
    Dataset updated
    Oct 6, 2025
    Dataset authored and provided by
    Growth Market Reports
    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Data Labeling with LLMs Market Outlook



    According to our latest research, the global Data Labeling with LLMs market size was valued at USD 2.14 billion in 2024, with a robust year-on-year growth trajectory. The market is projected to expand at a CAGR of 22.8% from 2025 to 2033, reaching a forecasted value of USD 16.6 billion by 2033. This impressive growth is primarily driven by the increasing adoption of large language models (LLMs) to automate and enhance the efficiency of data labeling processes across various industries. As organizations continue to invest in AI and machine learning, the demand for high-quality, accurately labeled datasets—essential for training and fine-tuning LLMs—continues to surge, fueling the expansion of the data labeling with LLMs market.




    One of the principal growth factors for the data labeling with LLMs market is the exponential increase in the volume of unstructured data generated by businesses and consumers worldwide. Organizations are leveraging LLMs to automate the labeling of vast datasets, which is essential for training sophisticated AI models. The integration of LLMs into data labeling workflows is not only improving the speed and accuracy of the annotation process but also reducing operational costs. This technological advancement has enabled enterprises to scale their AI initiatives more efficiently, facilitating the deployment of intelligent applications across sectors such as healthcare, automotive, finance, and retail. Moreover, the continuous evolution of LLMs, with capabilities such as zero-shot and few-shot learning, is further enhancing the quality and context-awareness of labeled data, making these solutions indispensable for next-generation AI systems.




    Another significant driver is the growing need for domain-specific labeled datasets, especially in highly regulated industries like healthcare and finance. In these sectors, data privacy and security are paramount, and the use of LLMs in data labeling processes ensures that sensitive information is handled with the utmost care. LLM-powered platforms are increasingly being adopted to create high-quality, compliant datasets for applications such as medical imaging analysis, fraud detection, and customer sentiment analysis. The ability of LLMs to understand context, semantics, and complex language structures is particularly valuable in these domains, where the accuracy and reliability of labeled data directly impact the performance and safety of AI-driven solutions. This trend is expected to continue as organizations strive to meet stringent regulatory requirements while accelerating their AI adoption.




    Furthermore, the proliferation of AI-powered applications in emerging markets is contributing to the rapid expansion of the data labeling with LLMs market. Countries in Asia Pacific and Latin America are witnessing significant investments in digital transformation, driving the demand for scalable and efficient data annotation solutions. The availability of cloud-based data labeling platforms, combined with advancements in LLM technologies, is enabling organizations in these regions to overcome traditional barriers such as limited access to skilled annotators and high operational costs. As a result, the market is experiencing robust growth in both developed and developing economies, with enterprises increasingly recognizing the strategic value of high-quality labeled data in gaining a competitive edge.




    From a regional perspective, North America currently dominates the data labeling with LLMs market, accounting for the largest share in 2024. This leadership is attributed to the presence of major technology companies, advanced research institutions, and a mature AI ecosystem. However, Asia Pacific is expected to witness the highest CAGR during the forecast period, driven by rapid digitalization, government initiatives supporting AI development, and a burgeoning startup ecosystem. Europe is also emerging as a key market, with strong demand from sectors such as automotive and healthcare. Meanwhile, Latin America and the Middle East & Africa are gradually increasing their market presence, supported by growing investments in AI infrastructure and talent development.



  2. Foundation Model Data Collection and Data Annotation | Large Language...

    • datarade.ai
    Updated Jan 25, 2024
    Cite
    Nexdata (2024). Foundation Model Data Collection and Data Annotation | Large Language Model(LLM) Data | SFT Data| Red Teaming Services [Dataset]. https://datarade.ai/data-products/nexdata-foundation-model-data-solutions-llm-sft-rhlf-nexdata
    Available download formats: .bin, .json, .xml, .csv, .xls, .sql, .txt
    Dataset updated
    Jan 25, 2024
    Dataset authored and provided by
    Nexdata
    Area covered
    Portugal, Ireland, Malta, Czech Republic, Taiwan, Azerbaijan, El Salvador, Kyrgyzstan, Spain, Russian Federation
    Description
    1. Overview
    -Unsupervised Learning: For the training data required in unsupervised learning, Nexdata delivers data collection and cleaning services for both single-modal and cross-modal data. We provide Large Language Model (LLM) data cleaning and personnel support services based on the specific data types and characteristics of the client's domain.

    -SFT: Nexdata assists clients in generating high-quality supervised fine-tuning data for model optimization through annotation of prompts and outputs.

    -Red teaming: Nexdata helps clients train and validate models by drafting various adversarial attacks, such as exploratory or potentially harmful questions. Our red-team capabilities help clients identify problems in their models related to hallucinations, harmful content, false information, discrimination, language bias, etc.

    -RLHF: Nexdata assists clients in manually ranking multiple outputs generated by the SFT-trained model according to rules provided by the client, or provides multi-factor scoring. By training annotators to align with the client's values and aggregating judgments from multiple annotators, the quality of feedback can be improved.
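    The SFT and RLHF annotation outputs described above can be pictured as minimal records; all field names below are illustrative assumptions, not Nexdata's actual schema.

    ```python
    # Illustrative record shapes for the SFT and RLHF annotation services
    # described above. Every field name here is hypothetical.
    sft_record = {
        "prompt": "Summarize the refund policy in one sentence.",
        "output": "Refunds are issued within 14 days of purchase.",  # annotated target
    }

    rlhf_record = {
        "prompt": "Explain photosynthesis to a child.",
        # Outputs from the SFT-trained model, ranked best-to-worst by annotators
        # following client-provided rules (multi-factor scores are an alternative).
        "candidates": ["answer A", "answer B", "answer C"],
        "ranking": ["answer B", "answer A", "answer C"],
    }
    ```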

    2. Our Capacity
    -Global Resources: Global resources covering hundreds of languages worldwide

    -Compliance: All the Large Language Model(LLM) Data is collected with proper authorization

    -Quality: Multiple rounds of quality inspection ensure high-quality data output

    -Secure Implementation: An NDA is signed to guarantee secure implementation, and data is destroyed upon delivery.

    -Efficiency: Our platform supports human-machine interaction and semi-automatic labeling, increasing labeling efficiency by more than 30% per annotator. It has been successfully applied to nearly 5,000 projects.

    3. About Nexdata
    Nexdata is equipped with professional data collection devices, tools and environments, as well as experienced project managers in data collection and quality control, enabling us to meet Large Language Model (LLM) data collection requirements across various scenarios and types. We have global data processing centers and more than 20,000 professional annotators, supporting on-demand LLM data annotation services such as speech, image, video, point cloud and Natural Language Processing (NLP) data. Please visit us at https://www.nexdata.ai/?source=Datarade

  3. Sentiment Analysis: App Store Reviews

    • kaggle.com
    zip
    Updated Aug 11, 2025
    Cite
    Dishant Savaliya (2025). Sentiment Analysis: App Store Reviews [Dataset]. https://www.kaggle.com/datasets/dishantsavaliya/app-dataset-v1
    Available download formats: zip (1,892,115 bytes)
    Dataset updated
    Aug 11, 2025
    Authors
    Dishant Savaliya
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Please share your suggestions to improve my datasets further✍️

    📄 Dataset Overview This dataset contains Google Play Store app reviews labeled for sentiment using a deterministic Large Language Model (LLM) classification pipeline. Each review is tagged as positive, negative, or neutral, making it ready for NLP training, benchmarking, and market insight generation.

    ⚙️ Data Collection & Labeling Process Source: Reviews collected from Google Play Store using the google_play_scraper library. Labeling: Reviews classified by a Hugging Face Transformers-based LLM with a strict prompt to ensure one-word output. Post-processing: Outputs normalized to the three sentiment classes.
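    The prompt and normalization steps of that pipeline might look roughly like this; the exact model, prompt wording, and fallback class are not given on the dataset page, so they are assumptions here.

    ```python
    # Hypothetical sketch of the strict-prompt labeling pipeline described above.
    # The prompt wording and the "neutral" fallback are assumptions; the page
    # only specifies one-word outputs normalized to three sentiment classes.
    import re

    VALID_CLASSES = {"positive", "negative", "neutral"}

    def build_prompt(review: str) -> str:
        # Strict prompt: force a one-word answer so output parsing stays trivial.
        return (
            "Classify the sentiment of this app review. "
            "Answer with exactly one word: positive, negative, or neutral.\n"
            f"Review: {review}\nSentiment:"
        )

    def normalize(raw_output: str) -> str:
        # Post-processing: collapse model output to one of the three classes.
        tokens = raw_output.strip().lower().split()
        word = re.sub(r"[^a-z]", "", tokens[0]) if tokens else ""
        return word if word in VALID_CLASSES else "neutral"  # assumed fallback
    ```

    Per the description, the reviews themselves come from the google_play_scraper library, and the prompt would be sent to a Hugging Face Transformers-hosted model.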

    💡 Potential Uses Fine-tuning BERT, RoBERTa, LLaMA, or other transformer models. Sentiment dashboards for product feedback monitoring. Market research on user perception trends. Benchmark dataset for text classification experiments.

    Please upvote!!!!

  4. Augmented training data and labels, used for training the models

    • figshare.com
    bin
    Updated Mar 26, 2025
    Cite
    Michael Keane (2025). Augmented training data and labels, used for training the models [Dataset]. http://doi.org/10.6084/m9.figshare.28669001.v1
    Available download formats: bin
    Dataset updated
    Mar 26, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Michael Keane
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the augmented data and labels used to train the model. It is also needed for evaluation: the vectoriser is fit on this data, and the test data is then transformed with that fitted vectoriser.
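    That fit/transform pattern can be sketched as follows; TfidfVectorizer and the toy texts are assumptions, since the description does not name the vectoriser used.

    ```python
    # Sketch of the described evaluation setup: the vectoriser is fit on the
    # (augmented) training text only, and the test text is transformed with
    # that same fitted vectoriser. TfidfVectorizer is an assumed stand-in.
    from sklearn.feature_extraction.text import TfidfVectorizer

    train_texts = ["label this sample", "another augmented sample"]  # stand-in data
    test_texts = ["an unseen test sample"]

    vectoriser = TfidfVectorizer()
    X_train = vectoriser.fit_transform(train_texts)  # fit on training data only
    X_test = vectoriser.transform(test_texts)        # reuse the fitted vocabulary

    # Both matrices now share one feature space, so a model trained on X_train
    # can score X_test directly.
    ```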

  5. LLM Data Quality Assurance Market Research Report 2033

    • researchintelo.com
    csv, pdf, pptx
    Updated Oct 2, 2025
    Cite
    Research Intelo (2025). LLM Data Quality Assurance Market Research Report 2033 [Dataset]. https://researchintelo.com/report/llm-data-quality-assurance-market
    Available download formats: pdf, pptx, csv
    Dataset updated
    Oct 2, 2025
    Dataset authored and provided by
    Research Intelo
    License

    https://researchintelo.com/privacy-and-policy

    Time period covered
    2024 - 2033
    Area covered
    Global
    Description

    LLM Data Quality Assurance Market Outlook



    According to our latest research, the Global LLM Data Quality Assurance market size was valued at $1.25 billion in 2024 and is projected to reach $8.67 billion by 2033, expanding at a robust CAGR of 23.7% during 2024–2033. The major factor propelling the growth of the LLM Data Quality Assurance market globally is the rapid proliferation of generative AI and large language models (LLMs) across industries, creating an urgent need for high-quality, reliable, and bias-free data to fuel these advanced systems. As organizations increasingly depend on LLMs for mission-critical applications, ensuring the integrity and accuracy of training and operational data has become indispensable to mitigate risk, enhance performance, and comply with evolving regulatory frameworks.



    Regional Outlook



    North America currently commands the largest share of the LLM Data Quality Assurance market, accounting for approximately 38% of the global revenue in 2024. This dominance can be attributed to the region’s mature AI ecosystem, significant investments in digital transformation, and the presence of leading technology firms and AI research institutions. The United States, in particular, has spearheaded the adoption of LLMs in sectors such as BFSI, healthcare, and IT, driving the demand for advanced data quality assurance solutions. Favorable government policies supporting AI innovation, a strong startup culture, and robust regulatory guidelines around data privacy and model transparency have further solidified North America’s leadership position in the market.



    Asia Pacific is emerging as the fastest-growing region in the LLM Data Quality Assurance market, with a projected CAGR of 27.4% from 2024 to 2033. This rapid growth is driven by escalating investments in AI infrastructure, increasing digitalization across enterprises, and government-led initiatives to foster AI research and deployment. Countries such as China, Japan, South Korea, and India are witnessing exponential growth in LLM adoption, especially in sectors like e-commerce, telecommunications, and manufacturing. The region’s burgeoning talent pool, combined with a surge in AI-focused venture capital funding, is fueling innovation in data quality assurance platforms and services, positioning Asia Pacific as a major future growth engine for the market.



    Emerging economies in Latin America and the Middle East & Africa are also starting to recognize the importance of LLM Data Quality Assurance, but adoption remains at a nascent stage due to infrastructural limitations, skill gaps, and budgetary constraints. These regions are gradually overcoming barriers as multinational corporations expand their operations and local governments launch digital transformation agendas. However, challenges such as data localization requirements, fragmented regulatory landscapes, and limited access to cutting-edge AI technologies are slowing widespread adoption. Despite these hurdles, localized demand for data quality solutions in sectors like banking, retail, and healthcare is expected to rise steadily as these economies modernize and integrate AI-driven workflows.



    Report Scope






    Report Title: LLM Data Quality Assurance Market Research Report 2033
    By Component: Software, Services
    By Application: Model Training, Data Labeling, Data Validation, Data Cleansing, Data Monitoring, Others
    By Deployment Mode: On-Premises, Cloud
    By Enterprise Size: Small and Medium Enterprises, Large Enterprises
    By End-User: BFSI, Healthcare, Retail and E-commerce, IT and Telecommunications, Media and Entertainment, Manufacturing, Others
  6. hola

    • huggingface.co
    Updated Feb 12, 2024
    Cite
    SímboloAI (2024). hola [Dataset]. https://huggingface.co/datasets/simbolo-ai/hola
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Feb 12, 2024
    Dataset authored and provided by
    SímboloAI
    License

    https://choosealicense.com/licenses/gpl/

    Description

    Hola: Multilingual-Text-Generation-(LLM) and Language-Classification-Public-Dataset

    Release Date: 2/12/2024 (Myanmar Union Day)

    Overview

    The Hola dataset contains data from 11 languages, including English, Burmese, Japanese, Spanish, Chinese (Traditional), Korean, Mon, Paoh, etc. The data was crawled from Wikipedia. Each sample is a sentence labeled with the ISO 639-1 code of its language, e.g. en, my, ja, es, zh, ko.

    Data

    Each training… See the full description on the dataset page: https://huggingface.co/datasets/simbolo-ai/hola.

  7. Writeup analysis OpenAI gpt-oss-20b

    • kaggle.com
    zip
    Updated Sep 23, 2025
    Cite
    phunter (2025). Writeup analysis OpenAI gpt-oss-20b [Dataset]. https://www.kaggle.com/datasets/phunter/writeup-analysis-openai-gpt-oss-20b
    Available download formats: zip (429,740 bytes)
    Dataset updated
    Sep 23, 2025
    Authors
    phunter
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This document outlines the process used to create a structured, analyzable dataset of LLM attack methods from a corpus of unstructured red-teaming writeups, using the https://www.kaggle.com/datasets/kaggleqrdl/red-teaming-all-writeups dataset.

    1. Framework Creation

    The foundation of this analysis is a formal, hierarchical taxonomy of known LLM attack methods, which is defined in attack_taxonomy.md. This taxonomy provides a controlled vocabulary for classification, ensuring consistency across all entries. The raw, unstructured summaries of various attack methodologies were compiled into a single file, condensed_methods.md.

    2. Automated Labeling and Data Enrichment

    To bridge the gap between the unstructured summaries and the formal taxonomy, we developed predict_taxonomy.py. This script automates the labeling process:

    1. It iterates through each attack summary in condensed_methods.md.
    2. For each summary, it calls the Gemini API with a specialized prompt. This prompt instructs the model to act as an expert AI security researcher.
    3. The model is provided with both the attack summary and the complete attack_taxonomy.md as context.
    4. It is then tasked with selecting the most relevant and specific attack categories that describe the methodology in the summary.
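    The steps above might look roughly like this in code; the helper names, prompt wording, and comma-separated response format are assumptions, and the Gemini call itself is only sketched in a comment.

    ```python
    # Hypothetical sketch of the predict_taxonomy.py labeling loop described above.

    def build_prompt(summary: str, taxonomy_md: str) -> str:
        # Steps 2-3: expert-researcher persona plus the full taxonomy as context.
        return (
            "You are an expert AI security researcher.\n"
            "Taxonomy of LLM attack methods:\n"
            f"{taxonomy_md}\n\n"
            "Select the most relevant and specific categories for the attack "
            "summary below, as a comma-separated list.\n\n"
            f"Summary: {summary}"
        )

    def parse_labels(response_text: str, taxonomy: set) -> list:
        # Step 4: keep only labels present in the controlled vocabulary.
        picked = (t.strip() for t in response_text.split(","))
        return [t for t in picked if t in taxonomy]

    # The actual Gemini call (needs an API key; shown as a sketch only):
    #   import google.generativeai as genai
    #   model = genai.GenerativeModel("gemini-1.5-flash")
    #   text = model.generate_content(build_prompt(summary, taxonomy_md)).text
    #   labels = parse_labels(text, taxonomy)
    ```

    Validating the returned labels against the taxonomy keeps the controlled vocabulary intact even if the model answers loosely.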

    3. Producing the Enriched Dataset

    The script captures the list of predicted taxonomy labels from Gemini for each writeup. It then combines the original source, the full summary content, and the new taxonomy labels into a single, structured record.

    This entire collection is saved as predicted_taxonomy.json, creating an enriched dataset where each attack method is now machine-readable and systematically classified. This structured data is invaluable for quantitative analysis, pattern recognition, and further research into LLM vulnerabilities.

  8. Curlie Enhanced with LLM Annotations: Two Datasets for Advancing Homepage2Vec's Multilingual Website Classification

    • data-staging.niaid.nih.gov
    Updated Dec 21, 2023
    Cite
    Nutter, Peter; Senghaas, Mika; Cizinsky, Ludek (2023). Curlie Enhanced with LLM Annotations: Two Datasets for Advancing Homepage2Vec's Multilingual Website Classification [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_10413067
    Dataset updated
    Dec 21, 2023
    Dataset provided by
    Czech Technical University in Prague
    École Polytechnique Fédérale de Lausanne
    Authors
    Nutter, Peter; Senghaas, Mika; Cizinsky, Ludek
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Advancing Homepage2Vec with LLM-Generated Datasets for Multilingual Website Classification

    This dataset contains two subsets of labeled website data, specifically created to enhance the performance of Homepage2Vec, a multi-label model for website classification. The datasets were generated using Large Language Models (LLMs) to provide more accurate and diverse topic annotations for websites, addressing a limitation of existing Homepage2Vec training data.

    Key Features:

    LLM-generated annotations: Both datasets feature website topic labels generated using LLMs, a novel approach to creating high-quality training data for website classification models.

    Improved multi-label classification: Fine-tuning Homepage2Vec with these datasets has been shown to improve its macro F1 score from 38% to 43% evaluated on a human-labeled dataset, demonstrating their effectiveness in capturing a broader range of website topics.

    Multilingual applicability: The datasets facilitate classification of websites in multiple languages, reflecting the inherent multilingual nature of Homepage2Vec.

    Dataset Composition:

    curlie-gpt3.5-10k: 10,000 websites labeled using GPT-3.5, context 2 and 1-shot

    curlie-gpt4-10k: 10,000 websites labeled using GPT-4, context 2 and zero-shot

    Intended Use:

    Fine-tuning and advancing Homepage2Vec or similar website classification models

    Research on LLM-generated datasets for text classification tasks

    Exploration of multilingual website classification

    Additional Information:

    Project and report repository: https://github.com/CS-433/ml-project-2-mlp

    Acknowledgments:

    This dataset was created as part of a project at EPFL's Data Science Lab (DLab) in collaboration with Prof. Robert West and Tiziano Piccardi.

  9. US Financial Twitter Influencers: LLM - Labeled

    • kaggle.com
    zip
    Updated Apr 26, 2025
    Cite
    Marco Bennettttt (2025). US Financial Twitter Influencers: LLM - Labeled [Dataset]. https://www.kaggle.com/datasets/marcobennettttt/us-financial-twitter-influencers-llm-labeled/suggestions
    Available download formats: zip (2,198,256 bytes)
    Dataset updated
    Apr 26, 2025
    Authors
    Marco Bennettttt
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Data Source: twitterapi.io

    Data Acquisition Steps: First, search for 250,000 US financial influencers on Twitter through the specified keywords via the API provided by twitterapi.io. Subsequently, utilize a Large Language Model (LLM) to label these influencers. The LLM assesses the probability of each individual being a financial influencer. In the dataset table, there is a field named "llm_result" which can take on the values of 1, 2, 3, or 4. Notably, a value of 3 or 4 in the "llm_result" field largely indicates that the individual is involved in the financial sector.
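    For example, filtering the table for likely financial influencers could look like this; only the llm_result field is documented above, so the screen_name column is a hypothetical illustration.

    ```python
    # Hypothetical filtering step based on the description above: llm_result
    # values of 3 or 4 largely indicate a financial influencer. The
    # screen_name column is an assumed illustration, not a documented field.
    import pandas as pd

    df = pd.DataFrame({
        "screen_name": ["alpha", "bravo", "charlie", "delta"],
        "llm_result": [1, 4, 2, 3],
    })

    likely_financial = df[df["llm_result"].isin([3, 4])]
    ```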

  10. High-quality Image & Video Data | 250 Million | LLM Data | Multimodal Large...

    • datarade.ai
    Updated Aug 28, 2025
    Cite
    Nexdata (2025). High-quality Image & Video Data | 250 Million | LLM Data | Multimodal Large Model Data |AI & ML Training Data [Dataset]. https://datarade.ai/data-products/high-quality-image-video-data-250-million-llm-data-mu-nexdata
    Available download formats: .bin, .json, .xml, .csv, .xls, .sql, .txt
    Dataset updated
    Aug 28, 2025
    Dataset authored and provided by
    Nexdata
    Area covered
    Malta, Netherlands, Costa Rica, Saudi Arabia, Denmark, Belarus, United States of America, Brazil, Iceland, India
    Description
    1. 150,000 hours of TV video data
    Data content: 150,000 hours of high-quality TV videos, with legitimate copyright and metadata such as titles
    Data distribution: covers various video types including TV dramas and self-produced programs
    Data quality: complete video content with audio, resolutions available in both 1080P and 4K, free from watermarks, mosaics and other noise

    2. 20 million high-quality videos
    Data content: 20 million high-quality videos captured by photographers, with legitimate copyright, including labels and captions in Chinese and English as metadata
    Data distribution: covers various video contents such as portraits, animals, plants, aerial shots, landscapes and urban scenes, and supports filtering by keywords such as subject, background, motion, cinematography and rendered video
    Data quality: complete video content with resolutions available in both 1080P and 4K, free from watermarks, mosaics and other noise

    3. 250 million high-quality images
    Data content: 250 million high-quality images captured by photographers, with legitimate copyright, including labels and captions in Chinese and English as metadata
    Data distribution: covers various image contents such as portraits, animals, plants, food, landscapes and Chinese elements, as well as multiple image types like illustrations and vector graphics
    Data quality: complete image content with resolutions available in both 1080P and 4K, free from watermarks, mosaics and other noise

    4. About Nexdata
    Nexdata owns off-the-shelf PB-level Large Language Model (LLM) data, 3 million hours of audio data and 800TB of computer vision data. These ready-to-go machine learning (ML) datasets support instant delivery and quickly improve the accuracy of AI models. For more details, please visit us at https://www.nexdata.ai/datasets/llm?source=Datarade

  11. Bitext-media-llm-chatbot-training-dataset

    • huggingface.co
    Updated Aug 16, 2024
    Cite
    Bitext (2024). Bitext-media-llm-chatbot-training-dataset [Dataset]. https://huggingface.co/datasets/bitext/Bitext-media-llm-chatbot-training-dataset
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Aug 16, 2024
    Dataset authored and provided by
    Bitext
    License

    https://choosealicense.com/licenses/cdla-sharing-1.0/

    Description

    Bitext - Media Tagged Training Dataset for LLM-based Virtual Assistants

    Overview

    This hybrid synthetic dataset is designed to be used to fine-tune Large Language Models such as GPT, Mistral and OpenELM, and has been generated using our NLP/NLG technology and our automated Data Labeling (DAL) tools. The goal is to demonstrate how Verticalization/Domain Adaptation for the [media] sector can be easily achieved using our two-step approach to LLM Fine-Tuning. An overview of… See the full description on the dataset page: https://huggingface.co/datasets/bitext/Bitext-media-llm-chatbot-training-dataset.

  12. Social Engineering Detection Benchmark with LLMs

    • kaggle.com
    Updated May 10, 2025
    Cite
    Doha AL-Qurashi (2025). Social Engineering Detection Benchmark with LLMs [Dataset]. https://www.kaggle.com/datasets/dohaalqurashi/social-engineering-detection-benchmark-with-llms
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    May 10, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Doha AL-Qurashi
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    📚 Social Engineering Detection Benchmark with LLMs

    About the Dataset

    The Social Engineering Detection Benchmark with LLMs dataset, meticulously curated by Doha AL-Qurashi and Rahaf Al-Batati, comprises 210 short scenarios and messages in both Arabic and English. Each message is labeled as malicious or non-malicious to evaluate large language models’ ability to detect social engineering tactics across diverse linguistic and cultural contexts.

    Benchmark Results: Top-Performing Models

    Out of 14 evaluated LLMs, the following achieved the highest accuracy in correctly predicting malicious intent:

    • llama-3.3-70b-specdec: ≈100%
    • llama-3.3-70b-versatile: ≈99%
    • deepseek-r1-distill-llama-70b: ≈97%
    • mistral-saba-24b: ≈96%
    • qwen-2.5-32b: ≈96%

    Dataset Composition

    This balanced dataset includes:

    • Total Messages: 210
    • Languages: Arabic (105), English (105)
    • Malicious: 126 (≈60%)
    • Non-Malicious: 84 (≈40%)

    Data Generation & Annotation

    To ensure realism and diversity, messages were sourced and labeled via:

    • LLM-Generated Scenarios: Synthetic attacks crafted by LLMs to mimic real social engineering language.
    • Social Media & News: Real-world examples collected from online platforms and reports.
    • Expert Annotation: Cybersecurity specialists validated each label for accuracy.

    Structure of Each Entry

    • Scenario: Text of the short scenario or message.
    • Malicious: Boolean label (true / false).
    • Language: Arabic or English.
    • LLM Evaluations: Model responses (true, false, error, blank).

    Applications

    Researchers and practitioners can use this dataset to:

    • Train and fine-tune LLMs for enhanced threat detection.
    • Benchmark model performance in real-world conditions.
    • Develop adaptive, multilingual detection pipelines.

    Getting Started

    1. Download the dataset from Kaggle and load into your preferred analysis tool (e.g., Python/pandas).
    2. Inspect the scenario column for contextual understanding.
    3. Compare model predictions against the malicious ground truth.
    4. Compute metrics such as accuracy, precision, recall, and F1-score.
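    Steps 3 and 4 can be sketched as follows on toy data; the malicious column matches the entry structure above, while the model_pred column name is an assumption for a model's stored responses.

    ```python
    # Sketch of comparing model predictions against the "malicious" ground
    # truth and computing the metrics listed in step 4. Toy rows stand in
    # for the dataset; the model_pred column name is assumed.
    import pandas as pd
    from sklearn.metrics import accuracy_score, precision_recall_fscore_support

    df = pd.DataFrame({
        "malicious":  [True, True, False, True, False],
        "model_pred": [True, False, False, True, True],
    })

    accuracy = accuracy_score(df["malicious"], df["model_pred"])
    precision, recall, f1, _ = precision_recall_fscore_support(
        df["malicious"], df["model_pred"], average="binary"
    )
    ```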

    Ethical & Privacy Considerations

    Data Privacy: All messages are synthetic or anonymized; no personal data included.
    Responsible Use: Intended solely for research and educational purposes.

    How to Cite

    If you use this dataset, please cite:

    AL-Qurashi, D., & Al-Batati, R. (2024). Social Engineering Detection Benchmark with LLMs [Data set]. Kaggle. https://www.kaggle.com/datasets/dohaalqurashi/social-engineering-detection-benchmark-with-llms
  13. AI(LLMS) vs. Human Texts Cleaned and Optimized

    • kaggle.com
    zip
    Updated Feb 14, 2025
    Cite
    Yamin Hossain (2025). AI(LLMS) vs. Human Texts Cleaned and Optimized [Dataset]. https://www.kaggle.com/datasets/yaminh/ai-generated-and-human-written-texts/code
    Available download formats: zip (1,402,836,814 bytes)
    Dataset updated
    Feb 14, 2025
    Authors
    Yamin Hossain
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Title:

    Cleaned and Optimized Dataset for AI vs. Human Text Classification

    Overview:

    This dataset is a curated and optimized collection of text data designed for training and evaluating machine learning models to distinguish between AI-generated and human-written text. The dataset has been meticulously cleaned, deduplicated, and reduced in size to ensure efficiency while maintaining its utility for research and development purposes.

    By combining multiple high-quality sources, this dataset provides a diverse range of text samples, making it ideal for tasks such as binary classification (AI vs. Human) and other natural language processing (NLP) applications.

    Key Features:

    1. Cleaned Text:

      • All text entries have been preprocessed to remove unwanted characters, extra spaces, and special symbols.
      • Text cleaning ensures consistency and improves model performance by focusing on meaningful content.
    2. Label Consistency:

      • Each entry is labeled with a binary value (0 for human-written text, 1 for AI-generated text).
      • Labels have been standardized across all sources for seamless integration.
    3. Memory Optimization:

      • The dataset has been optimized to reduce memory usage:
        • Unnecessary columns have been removed.
        • Data types have been downcast to more efficient formats (e.g., category for categorical columns).
    4. Deduplication:

      • Duplicate rows have been removed to prevent redundancy and ensure the dataset's integrity.
    5. Null Value Handling:

      • Rows with missing or null values have been carefully handled to maintain data quality.
    6. Compact Size:

      • The dataset includes only two essential columns: label and clean_text, making it lightweight and easy to use.

    Dataset Structure:

    The final dataset contains the following columns:

    Column Name | Description
    label       | Binary label indicating the source of the text (0: Human, 1: AI).
    clean_text  | Preprocessed and cleaned text content ready for NLP tasks.

    Sources Used:

    This dataset is a consolidation of multiple high-quality datasets from various sources, ensuring diversity and representativeness. Below are the details of the sources used:

    1. Source 1:

    2. Source 2:

    3. Source 3:

    4. Source 4:

    5. Source 5:

    6. Source 6:

    Data Cleaning and Preprocessing Steps:

    To ensure the dataset is clean, consistent, and optimized for use, the following steps were performed:

    1. Column Standardization:

      • Renamed columns across all sources to ensure uniformity (text and label).
    2. Text Cleaning:

      • Converted all text to lowercase.
      • Removed non-alphabetic characters, extra spaces, and leading/trailing spaces.
    3. Duplicate Removal:

      • Identified and removed duplicate rows to avoid redundancy.
    4. Null Value Handling:

      • Dropped rows with missing or null values in critical columns (text or label).
    5. Memory Optimization:

      • Converted categorical columns to the category type for memory efficiency.
    6. Final Dataset Creation:

      • Retained only the essential columns: label and clean_text.
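The steps above can be sketched in code. The original pipeline presumably used pandas; the following is a minimal standard-library sketch of the same transformations (cleaning, deduplication, null handling, and reduction to the two final columns), with all sample rows invented for illustration:

```python
import re

def clean_text(text: str) -> str:
    """Lowercase, strip non-alphabetic characters, and collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)     # keep letters and spaces only
    text = re.sub(r"\s+", " ", text).strip()  # collapse and trim whitespace
    return text

def build_dataset(rows):
    """rows: iterable of dicts with 'text' and 'label' keys."""
    seen, out = set(), []
    for row in rows:
        if row.get("text") is None or row.get("label") is None:
            continue                          # drop rows with null values
        cleaned = clean_text(row["text"])
        key = (row["label"], cleaned)
        if key in seen:                       # drop duplicate rows
            continue
        seen.add(key)
        out.append({"label": int(row["label"]), "clean_text": cleaned})
    return out

rows = [
    {"text": "Hello,   World!", "label": 0},
    {"text": "hello world", "label": 0},      # duplicate after cleaning
    {"text": None, "label": 1},               # null text, dropped
    {"text": "Generated by an LLM.", "label": 1},
]
print(build_dataset(rows))
```

A pandas version would additionally downcast the `label` column to the `category` dtype, as described under Memory Optimization.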

    Use Cases:

    This datase...

  14. Bitext-hospitality-llm-chatbot-training-dataset

    • huggingface.co
    Updated Aug 15, 2024
    + more versions
    Cite
    Bitext (2024). Bitext-hospitality-llm-chatbot-training-dataset [Dataset]. https://huggingface.co/datasets/bitext/Bitext-hospitality-llm-chatbot-training-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 15, 2024
    Dataset authored and provided by
    Bitext
    License

    https://choosealicense.com/licenses/cdla-sharing-1.0/

    Description

    Bitext - Hospitality Tagged Training Dataset for LLM-based Virtual Assistants

      Overview
    

    This hybrid synthetic dataset is designed to be used to fine-tune Large Language Models such as GPT, Mistral and OpenELM, and has been generated using our NLP/NLG technology and our automated Data Labeling (DAL) tools. The goal is to demonstrate how Verticalization/Domain Adaptation for the [hospitality] sector can be easily achieved using our two-step approach to LLM Fine-Tuning. An… See the full description on the dataset page: https://huggingface.co/datasets/bitext/Bitext-hospitality-llm-chatbot-training-dataset.

  15. Multi-race Human Face Data | 200,000 ID | Face Recognition Data| Image/Video...

    • datarade.ai
    Updated Dec 22, 2023
    Cite
    Nexdata (2023). Multi-race Human Face Data | 200,000 ID | Face Recognition Data| Image/Video AI Training Data | Machine Learning(ML) Data [Dataset]. https://datarade.ai/data-products/nexdata-multi-race-human-face-data-200-000-id-image-vi-nexdata
    Explore at:
    Available download formats: .bin, .json, .xml, .csv, .xls, .sql, .txt
    Dataset updated
    Dec 22, 2023
    Dataset authored and provided by
    Nexdata
    Area covered
    Cambodia, Canada, Iran (Islamic Republic of), Bulgaria, Germany, Bosnia and Herzegovina, Belarus, Mexico, Lao People's Democratic Republic, Chile
    Description
    1. Specifications

    Product : Biometric Data

    Data size : 200,000 ID

    Race distribution : black people, Caucasian people, brown(Mexican) people, Indian people and Asian people

    Gender distribution : gender balance

    Age distribution : young, midlife and senior

    Collecting environment : including indoor and outdoor scenes

    Data diversity : different face poses, races, ages, light conditions and scenes

    Device : cellphone

    Data format : .jpg/png

    Accuracy : labels for face pose, race, gender and age are more than 97% accurate

    2. About Nexdata

    Nexdata owns off-the-shelf PB-level Large Language Model (LLM) Data, 3 million hours of Speech Data and 800TB of Imagery Data. These ready-to-go Machine Learning (ML) Data support instant delivery and quickly improve the accuracy of AI models. For more details, please visit us at https://www.nexdata.ai/datasets/computervision?source=Datarade
  16. A JSON file with ground truth sentiment labels used in evaluation and...

    • plos.figshare.com
    zip
    Updated Sep 23, 2025
    Cite
    Xiaohan Yu; Jin Wang (2025). A JSON file with ground truth sentiment labels used in evaluation and comparison to LLM prediction. [Dataset]. http://doi.org/10.1371/journal.pone.0330919.s005
    Explore at:
    Available download formats: zip
    Dataset updated
    Sep 23, 2025
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Xiaohan Yu; Jin Wang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A JSON file with ground truth sentiment labels used in evaluation and comparison to LLM prediction.

  17. PII | External Dataset

    • kaggle.com
    zip
    Updated Jan 24, 2024
    Cite
    moth (2024). PII | External Dataset [Dataset]. https://www.kaggle.com/datasets/alejopaullier/pii-external-dataset
    Explore at:
    Available download formats: zip (7923518 bytes)
    Dataset updated
    Jan 24, 2024
    Authors
    moth
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This is an LLM-generated external dataset for the: - The Learning Agency Lab - PII Data Detection Competition

    Versions

    • v2: Added 1000+ texts with new PII information such as URLs and usernames. The dataset now also includes the PII information as columns. Note that, by design, not all the PII information appears in the text.

    Description

    It contains 4434 generated texts (up from 3382 in v1) with their corresponding annotated labels in the required competition format.

    Description:

    • document (str): ID of the essay.
    • full_text (str): AI-generated text.
    • tokens (list): a list with the tokens (comes from text.split()).
    • trailing_whitespace (list): a list with boolean values indicating whether each token is followed by whitespace.
    • labels (list): list with token labels in BIO format.
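Given the tokens and trailing_whitespace fields described above, the full text can be reconstructed by joining each token with a space only where its whitespace flag is set. A minimal sketch, with the sample tokens invented for illustration:

```python
def detokenize(tokens, trailing_whitespace):
    """Rebuild the full text from tokens plus per-token whitespace flags."""
    parts = []
    for tok, ws in zip(tokens, trailing_whitespace):
        parts.append(tok)
        if ws:
            parts.append(" ")
    return "".join(parts)

tokens = ["My", "username", "is", "jdoe42", "."]
trailing_whitespace = [True, True, True, False, False]
print(detokenize(tokens, trailing_whitespace))  # My username is jdoe42.
```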

  18. VAD: A value-aligned dataset for Supervised Fine-Tuning of Large Language...

    • scidb.cn
    Updated Jan 5, 2025
    Cite
    Li Yingjie; Ma Ning; Yan Qidong; Wu Wenshe; Wang Dejie (2025). VAD: A value-aligned dataset for Supervised Fine-Tuning of Large Language Model [Dataset]. http://doi.org/10.57760/sciencedb.19328
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 5, 2025
    Dataset provided by
    Science Data Bank
    Authors
    Li Yingjie; Ma Ning; Yan Qidong; Wu Wenshe; Wang Dejie
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    The VAD dataset contains a total of 29,000 samples, stored in JSON format files. Each JSON object has five attributes: instruction, input, output, source, and label, corresponding to the user's instruction, supplementary input for the task, the LLM-generated reply, the data source, and the data classification, respectively. The source attribute takes one of three values — human, glm4, or alpaca3 — corresponding to manually constructed data, data generated by the ChatGLM4 large model, and translated Alpaca data. The label attribute takes one of three values — person, social, or nation — indicating whether the content concerns the individual, society, or the nation.
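Only the five field names and their value sets come from the description above; a single record might look like the following sketch, with all field values invented for illustration:

```python
import json

record = {
    "instruction": "Explain why honesty matters in public service.",  # user's instruction
    "input": "",                                  # supplementary task input (may be empty)
    "output": "Honesty builds public trust ...",  # LLM-generated reply
    "source": "human",                            # one of: human, glm4, alpaca3
    "label": "social",                            # one of: person, social, nation
}

# Serialize and parse back, checking the expected schema.
parsed = json.loads(json.dumps(record))
assert set(parsed) == {"instruction", "input", "output", "source", "label"}
print(parsed["source"], parsed["label"])  # human social
```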

  19. Movie Genre Multi Label Classification with Interaction Type

    • figshare.com
    csv
    Updated Feb 3, 2025
    + more versions
    Cite
    Deepchecks Data (2025). Movie Genre Multi Label Classification with Interaction Type [Dataset]. http://doi.org/10.6084/m9.figshare.28045487.v1
    Explore at:
    Available download formats: csv
    Dataset updated
    Feb 3, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Deepchecks Data
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The data in this tutorial comes from a multi-label classification bot that categorizes movies into genres based on their names and short descriptions. For this use case, we have an evaluation set containing movies released before 2021, with their associated genres selected by movie critics, as well as production data containing movies released after 2021.

  20. Privacy-Sensitive Conversations between Care Workers and Care Home Residents...

    • test.researchdata.tuwien.at
    • researchdata.tuwien.ac.at
    bin, text/markdown
    Updated Dec 6, 2024
    Cite
    Reinhard Grabler; Michael Starzinger; Matthias Hirschmanner; Helena Anna Frijns (2024). Privacy-Sensitive Conversations between Care Workers and Care Home Residents in a Residential Care Home [Dataset]. http://doi.org/10.70124/hbtq5-ykv92
    Explore at:
    Available download formats: bin, text/markdown
    Dataset updated
    Dec 6, 2024
    Dataset provided by
    TU Wien
    Authors
    Reinhard Grabler; Michael Starzinger; Matthias Hirschmanner; Helena Anna Frijns
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Apr 2024 - Aug 2024
    Description

    Dataset Card for "privacy-care-interactions"

    Table of Contents

    Dataset Description

    Purpose and Features

    🔒 Collection of Privacy-Sensitive Conversations between Care Workers and Care Home Residents in a Residential Care Home 🔒

    The dataset is useful to train and evaluate models to identify and classify privacy-sensitive parts of conversations from text, especially in the context of AI assistants and LLMs.

    Dataset Overview

    Language Distribution 🌍

    • English (en): 95

    Locale Distribution 🌎

    • United States (US) 🇺🇸: 95

    Key Facts 🔑

    • This is synthetic data! Generated using proprietary algorithms - no privacy violations!
    • Conversations are classified following the taxonomy for privacy-sensitive robotics by Rueben et al. (2017).
    • The data was manually labeled by an expert.

    Dataset Structure

    Data Instances

    The provided data format is .jsonl, the JSON Lines text format, also called newline-delimited JSON. An example entry looks as follows.

    { "text": "CW: Have you ever been to Italy? CR: Oh, yes... many years ago.", "taxonomy": 0, "category": 0, "affected_speaker": 1, "language": "en", "locale": "US", "data_type": 1, "uid": 16, "split": "train" }
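Since the files are newline-delimited JSON, each line is one complete record and can be parsed independently. A minimal sketch using the example entry above (in practice the stream would come from opening a file such as split-train-en.jsonl):

```python
import io
import json

def read_jsonl(stream):
    """Parse newline-delimited JSON from any text stream, skipping blank lines."""
    return [json.loads(line) for line in stream if line.strip()]

sample = io.StringIO(
    '{"text": "CW: Have you ever been to Italy? CR: Oh, yes... many years ago.", '
    '"taxonomy": 0, "category": 0, "affected_speaker": 1, "language": "en", '
    '"locale": "US", "data_type": 1, "uid": 16, "split": "train"}\n'
)
records = read_jsonl(sample)
print(records[0]["taxonomy"], records[0]["split"])  # 0 train
```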

    Data Fields

    The data fields are:

    • text: a string feature. The speaker abbreviations refer to the care worker (CW) and the care recipient (CR).
    • taxonomy: a classification label, with possible values including informational (0), invasion (1), collection (2), processing (3), dissemination (4), physical (5), personal-space (6), territoriality (7), intrusion (8), obtrusion (9), contamination (10), modesty (11), psychological (12), interrogation (13), psychological-distance (14), social (15), association (16), crowding-isolation (17), public-gaze (18), solitude (19), intimacy (20), anonymity (21), reserve (22). The taxonomy is derived from Rueben et al. (2017). The classifications were manually labeled by an expert.
    • category: a classification label, with possible values including personal-information (0), family (1), health (2), thoughts (3), values (4), acquaintance (5), appointment (6). The privacy category affected in the conversation. The classifications were manually labeled by an expert.
    • affected_speaker: a classification label, with possible values including care-worker (0), care-recipient (1), other (2), both (3). The speaker whose privacy is impacted during the conversation. The classifications were manually labeled by an expert.
    • language: a string feature. Language code as defined by ISO 639.
    • locale: a string feature. Regional code as defined by ISO 3166-1 alpha-2.
    • data_type: a classification label, with possible values including real (0), synthetic (1).
    • uid: an int64 feature. A unique identifier within the dataset.
    • split: a string feature. Either train, validation or test.

    Dataset Splits

    The dataset has 2 subsets:

    • split: with a total of 95 examples split into train, validation and test (70%-15%-15%)
    • unsplit: with a total of 95 examples in a single train split
    name    | train | validation | test
    split   | 66    | 14         | 15
    unsplit | 95    | n/a        | n/a
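The 70%-15%-15% proportions on 95 examples yield the listed 66/14/15 counts. A standard-library sketch of a two-stage split that reproduces those sizes (the exact rounding convention is an assumption here: ceiling for the test set, floor for validation; the dataset authors used scikit-learn, whose rounding may differ):

```python
import math
import random

def two_stage_split(items, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle and split items into (train, validation, test) lists."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_test = math.ceil(n * test_frac)   # 15 for n=95
    n_val = math.floor(n * val_frac)    # 14 for n=95
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]      # remaining 66 for n=95
    return train, val, test

train, val, test = two_stage_split(range(95))
print(len(train), len(val), len(test))  # 66 14 15
```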

    The files follow the naming convention subset-split-language.jsonl. The following files are contained in the dataset:

    • split-train-en.jsonl
    • split-validation-en.jsonl
    • split-test-en.jsonl
    • unsplit-train-en.jsonl

    Dataset Creation

    Curation Rationale

    Recording audio of care workers and residents during care interactions, which includes partial and full body washing, giving of medication, as well as wound care, is a highly privacy-sensitive use case. Therefore, a dataset is created, which includes privacy-sensitive parts of conversations, synthesized from real-world data. This dataset serves as a basis for fine-tuning a local LLM to highlight and classify privacy-sensitive sections of transcripts created in care interactions, to further mask them to protect privacy.

    Source Data

    Initial Data Collection

    The initial data was collected in the Caring Robots project of TU Wien in cooperation with Caritas Wien. One project track aims to use Large Language Models (LLMs) to support the documentation work of care workers, via LLM-generated summaries of audio recordings of interactions between care workers and care home residents. The initial data are the transcriptions of those care interactions.

    Data Processing

    The transcriptions were thoroughly reviewed, and sections containing privacy-sensitive information were identified and marked using qualitative data analysis software by two experts. Subsequently, the accessible portions of the interviews were translated from German to US English using the locally executed LLM icky/translate. In the next step, another locally executed LLM, llama3.1:70b, was used to synthesize the conversation segments. This process involved generating similar, yet distinct and new, conversations that are not linked to the original data. The dataset was split using the train_test_split function from the scikit-learn library (https://scikit-learn.org/1.5/modules/generated/sklearn.model_selection.train_test_split.html).





Another significant driver is the growing need for domain-specific labeled datasets, especially in highly regulated industries like healthcare and finance. In these sectors, data privacy and security are paramount, and the use of LLMs in data labeling processes ensures that sensitive information is handled with the utmost care. LLM-powered platforms are increasingly being adopted to create high-quality, compliant datasets for applications such as medical imaging analysis, fraud detection, and customer sentiment analysis. The ability of LLMs to understand context, semantics, and complex language structures is particularly valuable in these domains, where the accuracy and reliability of labeled data directly impact the performance and safety of AI-driven solutions. This trend is expected to continue as organizations strive to meet stringent regulatory requirements while accelerating their AI adoption.




Furthermore, the proliferation of AI-powered applications in emerging markets is contributing to the rapid expansion of the data labeling with LLMs market. Countries in Asia Pacific and Latin America are witnessing significant investments in digital transformation, driving the demand for scalable and efficient data annotation solutions. The availability of cloud-based data labeling platforms, combined with advancements in LLM technologies, is enabling organizations in these regions to overcome traditional barriers such as limited access to skilled annotators and high operational costs. As a result, the market is experiencing robust growth in both developed and developing economies, with enterprises increasingly recognizing the strategic value of high-quality labeled data in gaining a competitive edge.




From a regional perspective, North America currently dominates the data labeling with LLMs market, accounting for the largest share in 2024. This leadership is attributed to the presence of major technology companies, advanced research institutions, and a mature AI ecosystem. However, Asia Pacific is expected to witness the highest CAGR during the forecast period, driven by rapid digitalization, government initiatives supporting AI development, and a burgeoning startup ecosystem. Europe is also emerging as a key market, with strong demand from sectors such as automotive and healthcare. Meanwhile, Latin America and the Middle East & Africa are gradually increasing their market presence, supported by growing investments in AI infrastructure and talent development.


