46 datasets found
  1. Foundation Model Data Collection and Data Annotation | Large Language...

    • datarade.ai
    Updated Jan 25, 2024
    Cite
    Nexdata (2024). Foundation Model Data Collection and Data Annotation | Large Language Model(LLM) Data | SFT Data| Red Teaming Services [Dataset]. https://datarade.ai/data-products/nexdata-foundation-model-data-solutions-llm-sft-rhlf-nexdata
    Explore at:
    .bin, .json, .xml, .csv, .xls, .sql, .txt (available download formats)
    Dataset updated
    Jan 25, 2024
    Dataset authored and provided by
    Nexdata
    Area covered
    Portugal, Taiwan, Czech Republic, Maldives, Ireland, Azerbaijan, El Salvador, Kyrgyzstan, Spain, Russian Federation
    Description
    1. Overview
    -Unsupervised Learning: For the training data required in unsupervised learning, Nexdata delivers data collection and cleaning services for both single-modal and cross-modal data. We provide Large Language Model (LLM) data cleaning and personnel support services based on the specific data types and characteristics of the client's domain.

    -SFT: Nexdata assists clients in generating high-quality supervised fine-tuning data for model optimization through prompt and output annotation.

    -Red teaming: Nexdata helps clients train and validate models by drafting various adversarial attacks, such as exploratory or potentially harmful questions. Our red team capabilities help clients identify problems in their models related to hallucinations, harmful content, false information, discrimination, and language bias.

    -RLHF: Nexdata assists clients in manually ranking multiple outputs generated by the SFT-trained model according to rules provided by the client, or in providing multi-factor scoring. By training annotators to align with the required values and aggregating judgments from multiple annotators, the quality of feedback can be improved.

    2. Our Capacity
    -Global Resources: Global resources covering hundreds of languages worldwide.

    -Compliance: All Large Language Model (LLM) data is collected with proper authorization.

    -Quality: Multiple rounds of quality inspection ensure high-quality data output.

    -Secure Implementation: An NDA is signed to guarantee secure implementation, and data is destroyed upon delivery.

    -Efficiency: Our platform supports human-machine interaction and semi-automatic labeling, increasing labeling efficiency by more than 30% per annotator. It has been applied successfully to nearly 5,000 projects.

    3. About Nexdata
    Nexdata is equipped with professional data collection devices, tools, and environments, as well as experienced project managers for data collection and quality control, so we can meet Large Language Model (LLM) data collection requirements across a wide range of scenarios and data types. We have global data processing centers and more than 20,000 professional annotators, supporting on-demand Large Language Model (LLM) data annotation services for speech, image, video, point cloud, and Natural Language Processing (NLP) data. Please visit us at https://www.nexdata.ai/?source=Datarade

  2. TagX Data collection for AI/ ML training | LLM data | Data collection for AI...

    • datarade.ai
    .json, .csv, .xls
    Updated Jun 18, 2021
    Cite
    TagX (2021). TagX Data collection for AI/ ML training | LLM data | Data collection for AI development & model finetuning | Text, image, audio, and document data [Dataset]. https://datarade.ai/data-products/data-collection-and-capture-services-tagx
    Explore at:
    .json, .csv, .xls (available download formats)
    Dataset updated
    Jun 18, 2021
    Dataset authored and provided by
    TagX
    Area covered
    Colombia, Iceland, Belize, Antigua and Barbuda, Saudi Arabia, Equatorial Guinea, Benin, Djibouti, Russian Federation, Qatar
    Description

    We offer comprehensive data collection services that cater to a wide range of industries and applications. Whether you require image, audio, or text data, we have the expertise and resources to collect and deliver high-quality data that meets your specific requirements. Our data collection methods include manual collection, web scraping, and other automated techniques that ensure accuracy and completeness of data.

    Our team of experienced data collectors and quality assurance professionals ensure that the data is collected and processed according to the highest standards of quality. We also take great care to ensure that the data we collect is relevant and applicable to your use case. This means that you can rely on us to provide you with clean and useful data that can be used to train machine learning models, improve business processes, or conduct research.

    We are committed to delivering data in the format that you require. Whether you need raw data or a processed dataset, we can deliver the data in your preferred format, including CSV, JSON, or XML. We understand that every project is unique, and we work closely with our clients to ensure that we deliver the data that meets their specific needs. So if you need reliable data collection services for your next project, look no further than us.

  3. LLM-Assisted Content Analysis (LACA): Coded data and model reasons

    • figshare.com
    txt
    Updated Jun 22, 2023
    Cite
    Rob Chew; Michael Wenger; John Bollenbacher; Jessica Speer; Annice Kim (2023). LLM-Assisted Content Analysis (LACA): Coded data and model reasons [Dataset]. http://doi.org/10.6084/m9.figshare.23291147.v1
    Explore at:
    txt (available download formats)
    Dataset updated
    Jun 22, 2023
    Dataset provided by
    figshare
    Authors
    Rob Chew; Michael Wenger; John Bollenbacher; Jessica Speer; Annice Kim
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This resource consists of calibration sets (N=100) for 4 publicly available datasets, coded using the LLM-Assisted Content Analysis (LACA) method. Each dataset contains the following columns (a loading sketch follows the list):

    • text_id: Unique ID for each text document
    • code_id: Unique ID for each code category
    • text: Document text that has been coded
    • original_code: Coded response from the original dataset
    • replicated_code: Coded response from the independent coding exercise by our study team
    • model_code: Coded response generated by the LLM (GPT-3.5-turbo)
    • reason: LLM-generated reason for the coding decision
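
    A minimal loading sketch, assuming the calibration sets parse as delimited text files with the columns above (the file name is hypothetical):

    ```python
    import pandas as pd

    # Hypothetical file name; the release ships one calibration set per source dataset.
    df = pd.read_csv("bbc_news_calibration.txt", sep=None, engine="python")  # sniff the delimiter

    # Per-code agreement between the LLM and the original coders.
    agreement = (
        df.assign(match=df["model_code"] == df["original_code"])
          .groupby("code_id")["match"]
          .mean()
    )
    print(agreement.sort_values())

    # Inspect the LLM's stated reason where it disagrees with the original coding.
    disagree = df[df["model_code"] != df["original_code"]]
    print(disagree[["text_id", "code_id", "original_code", "model_code", "reason"]].head())
    ```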

    Additional details on methods and definitions of individual code categories are available in the following paper:

    Chew, R., Bollenbacher, J., Speer, J., Wenger, M., & Kim, A. (2023). LLM-Assisted Content Analysis: Using Large Language Models to Support Deductive Coding.

    Trump Tweets

    Citation: Coe, Kevin, Berger, Julia, Blumling, Allison, Brooks, Katelyn, Giorgi, Elizabeth, Jackson, Jennifer, … Wellman, Mariah. Quantitative Content Analysis of Donald Trump’s Twitter, 2017-2019. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], 2020-04-01. https://doi.org/10.3886/E118603V1 Source: https://www.openicpsr.org/openicpsr/project/118603/version/V1/view

    BBC News

    Citation: Greene, D., & Cunningham, P. (2006). Practical solutions to the problem of diagonal dominance in kernel document clustering. In Proceedings of the 23rd international conference on Machine learning (pp. 377-384). Source: https://www.kaggle.com/datasets/shivamkushwaha/bbc-full-text-document-classification

    Contrarian Claims

    Citation: Coan, T. G., Boussalis, C., Cook, J., & Nanko, M. O. (2021). Computer-assisted classification of contrarian claims about climate change. Scientific reports, 11(1), 22320. Source: https://socialanalytics.ex.ac.uk/cards/data.zip

    Ukraine Water Problems

    Citation: Afanasyev S, N. B, Bodnarchuk T, S. V, M. V, T. V, Yu V, K. G, V. D, Konovalenko O, O. K, E. K, Lietytska O, O. L, V. M, Marushevska O, Mokin V, K. M, Osadcha N, O. I (2013) River Basin Management Plan for Pivdenny Bug: river basin analysis and measures Source: https://www.kaggle.com/datasets/vbmokin/nlp-reports-news-classification

  4. Data from: LLM-assisted Graph-RAG Information Extraction from IFC Data

    • figshare.com
    pdf
    Updated Apr 23, 2025
    Cite
    Hadeel Saadany (2025). LLM-assisted Graph-RAG Information Extraction from IFC Data [Dataset]. http://doi.org/10.6084/m9.figshare.28771409.v2
    Explore at:
    pdf (available download formats)
    Dataset updated
    Apr 23, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Hadeel Saadany
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    IFC data has become the general building information standard for collaborative work in the construction industry. However, IFC data can be very complicated because it allows for multiple ways to represent the same product information. In this research, we utilise the capabilities of LLMs to parse IFC data with a Graph Retrieval-Augmented Generation (Graph-RAG) technique to retrieve building object properties and their relations. We show that, despite limitations due to the complex hierarchy of the IFC data, Graph-RAG parsing enhances generative LLMs like GPT-4o with graph-based knowledge, enabling natural-language query-response retrieval without the need for a complex pipeline.

  5. Data from: EHRNoteQA: An LLM Benchmark for Real-World Clinical Practice...

    • physionet.org
    Updated Jun 26, 2024
    Cite
    Sunjun Kweon; Jiyoun Kim; Heeyoung Kwak; Dongchul Cha; Hangyul Yoon; Kwang Hyun Kim; Jeewon Yang; Seunghyun Won; Edward Choi (2024). EHRNoteQA: An LLM Benchmark for Real-World Clinical Practice Using Discharge Summaries [Dataset]. http://doi.org/10.13026/acga-ht95
    Explore at:
    Dataset updated
    Jun 26, 2024
    Authors
    Sunjun Kweon; Jiyoun Kim; Heeyoung Kwak; Dongchul Cha; Hangyul Yoon; Kwang Hyun Kim; Jeewon Yang; Seunghyun Won; Edward Choi
    License

    https://github.com/MIT-LCP/license-and-dua/tree/master/drafts

    Area covered
    World
    Description

    Discharge summaries in Electronic Health Records (EHRs) are crucial for clinical decision-making, but their length and complexity make information extraction challenging, especially when dealing with accumulated summaries across multiple patient admissions. Large Language Models (LLMs) show promise in addressing this challenge by efficiently analyzing vast and complex data. Existing benchmarks, however, fall short in properly evaluating LLMs' capabilities in this context, as they typically focus on single-note information or limited topics, failing to reflect the real-world inquiries required by clinicians. To bridge this gap, we introduce EHRNoteQA, a novel benchmark built on the MIMIC-IV EHR, comprising 962 different QA pairs each linked to distinct patients' discharge summaries. Every QA pair is initially generated using GPT-4 and then manually reviewed and refined by three clinicians to ensure clinical relevance. EHRNoteQA includes questions that require information across multiple discharge summaries and covers eight diverse topics, mirroring the complexity and diversity of real clinical inquiries.

  6. Data from: Data and code for, "Large language models design sequence-defined...

    • data.niaid.nih.gov
    Updated Aug 29, 2024
    Cite
    Statt, Antonia (2024). Data and code for, "Large language models design sequence-defined macromolecules via evolutionary optimization" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_11121884
    Explore at:
    Dataset updated
    Aug 29, 2024
    Dataset provided by
    Reinhart, Wesley
    Statt, Antonia
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Code and data for "Large language models design sequence-defined macromolecules via evolutionary optimization". Note: this repository contains the code and data files for the manuscript. This is a snapshot of the repository, frozen at the time of submission.

    Codes

    LLM codes
    • run_claude.py - the routine for performing LLM-based rollouts; intended for command-line execution using argparse
    • message_utils.py - utilities for constructing and parsing messages for LLM I/O
    • model_utils.py - lightweight utilities for retrieving formatted predictions from the RNN ensemble
    • target_defs.py - defines the sequence, locations, and natural-language descriptions of the target structures
    • ask_about_oracle.ipynb - asks the LLM to speculate about the nature of the optimization task

    Other algorithms
    • active_learning.ipynb - use EI acquisition with an RF surrogate to label new sequences; includes an unused tokenization scheme
    • evolutionary_algorithm.ipynb - use the DEAP library to perform evolutionary optimization
    • random_sampling.ipynb - sample sequences randomly from all possible sequences

    Postprocessing
    • process_aggregated_logs.py - reads data from the raw log files and prepares it for visualization
    • process_sample_rollouts.py - reads data from the raw log files and prepares individual rollouts

    Visualization
    • figure1b.ipynb - renders panel b of Fig. 1
    • figure1efg.ipynb - renders the last row of Fig. 1 (panels e-g)
    • figure2.ipynb - renders all of Fig. 2
    • figure_si.ipynb - renders Figs. S1 and S2
    • figure_md_validation.ipynb - renders Fig. S3

    Data files
    • prompts/
      • prompt-scientific-v4.4.yml - the full text of the scientific prompt, to be read by run_claude.py
      • prompt-oracle-v4.4.yml - the full text of the oracle prompt, to be read by run_claude.py
    • models/ - the TorchScript RNN models used to make predictions
    • data/
      • embeddings - calculated embeddings for a collection of sequences from our prior work
      • llm-logs - the raw logs obtained from the Claude 3.5 Sonnet LLM (other algorithms made to look like the LLM logs after the fact)
      • llm-logs-opus - the raw logs obtained from the Claude 3.0 Opus LLM (used in the first draft of the article, replaced by Claude 3.5 Sonnet)
      • all-rollouts-kltd.csv - postprocessed logs for all the rollouts using the "top $k < d^*$" metric
      • all-rollouts-topkd.csv - postprocessed logs for all the rollouts using the "mean $d$ for top $k$" metric
      • sample-rollout-membranes-x-3.csv - postprocessed logs for a single rollout replica, x = each algorithm type
      • snapshots - png snapshots of MD simulation results at different locations in the manifold
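
    A minimal sketch, assuming the postprocessed rollout CSVs load as flat tables, for comparing algorithms on one of the two metrics described above (the column names used here are assumptions; inspect the actual header first):

    ```python
    import pandas as pd

    # Postprocessed logs shipped under data/ (file name taken from the listing above).
    rollouts = pd.read_csv("data/all-rollouts-topkd.csv")

    # "algorithm", "rollout", and "metric" are assumed column names, not confirmed by the release.
    best_per_rollout = rollouts.groupby(["algorithm", "rollout"])["metric"].min()
    print(best_per_rollout.groupby("algorithm").describe())
    ```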

  7. Machine Learning (ML) Data | 800M+ B2B Profiles | AI-Ready for Deep Learning...

    • datarade.ai
    .json, .csv
    Cite
    Xverum, Machine Learning (ML) Data | 800M+ B2B Profiles | AI-Ready for Deep Learning (DL), NLP & LLM Training [Dataset]. https://datarade.ai/data-products/xverum-company-data-b2b-data-belgium-netherlands-denm-xverum
    Explore at:
    .json, .csv (available download formats)
    Dataset provided by
    Xverum LLC
    Authors
    Xverum
    Area covered
    Jordan, United Kingdom, Western Sahara, India, Dominican Republic, Sint Maarten (Dutch part), Cook Islands, Barbados, Norway, Oman
    Description

    Xverum’s AI & ML Training Data provides one of the most extensive datasets available for AI and machine learning applications, featuring 800M B2B profiles with 100+ attributes. This dataset is designed to enable AI developers, data scientists, and businesses to train robust and accurate ML models. From natural language processing (NLP) to predictive analytics, our data empowers a wide range of industries and use cases with unparalleled scale, depth, and quality.

    What Makes Our Data Unique?

    Scale and Coverage: - A global dataset encompassing 800M B2B profiles from a wide array of industries and geographies. - Includes coverage across the Americas, Europe, Asia, and other key markets, ensuring worldwide representation.

    Rich Attributes for Training Models: - Over 100 fields of detailed information, including company details, job roles, geographic data, industry categories, past experiences, and behavioral insights. - Tailored for training models in NLP, recommendation systems, and predictive algorithms.

    Compliance and Quality: - Fully GDPR and CCPA compliant, providing secure and ethically sourced data. - Extensive data cleaning and validation processes ensure reliability and accuracy.

    Annotation-Ready: - Pre-structured and formatted datasets that are easily ingestible into AI workflows. - Ideal for supervised learning with tagging options such as entities, sentiment, or categories.

    How Is the Data Sourced? - Publicly available information gathered through advanced, GDPR-compliant web aggregation techniques. - Proprietary enrichment pipelines that validate, clean, and structure raw data into high-quality datasets. This approach ensures we deliver comprehensive, up-to-date, and actionable data for machine learning training.

    Primary Use Cases and Verticals

    Natural Language Processing (NLP): Train models for named entity recognition (NER), text classification, sentiment analysis, and conversational AI. Ideal for chatbots, language models, and content categorization.

    Predictive Analytics and Recommendation Systems: Enable personalized marketing campaigns by predicting buyer behavior. Build smarter recommendation engines for ecommerce and content platforms.

    B2B Lead Generation and Market Insights: Create models that identify high-value leads using enriched company and contact information. Develop AI systems that track trends and provide strategic insights for businesses.

    HR and Talent Acquisition AI: Optimize talent-matching algorithms using structured job descriptions and candidate profiles. Build AI-powered platforms for recruitment analytics.

    How This Product Fits Into Xverum’s Broader Data Offering: Xverum is a leading provider of structured, high-quality web datasets. While we specialize in B2B profiles and company data, we also offer complementary datasets tailored for specific verticals, including ecommerce product data, job listings, and customer reviews. The AI Training Data is a natural extension of our core capabilities, bridging the gap between structured data and machine learning workflows. By providing annotation-ready datasets, real-time API access, and customization options, we ensure our clients can seamlessly integrate our data into their AI development processes.

    Why Choose Xverum? - Experience and Expertise: A trusted name in structured web data with a proven track record. - Flexibility: Datasets can be tailored for any AI/ML application. - Scalability: With 800M profiles and more being added, you’ll always have access to fresh, up-to-date data. - Compliance: We prioritize data ethics and security, ensuring all data adheres to GDPR and other legal frameworks.

    Ready to supercharge your AI and ML projects? Explore Xverum’s AI Training Data to unlock the potential of 800M global B2B profiles. Whether you’re building a chatbot, predictive algorithm, or next-gen AI application, our data is here to help.

    Contact us for sample datasets or to discuss your specific needs.

  8. Bitext-customer-support-llm-chatbot-training-dataset

    • huggingface.co
    • opendatalab.com
    + more versions
    Cite
    Bitext, Bitext-customer-support-llm-chatbot-training-dataset [Dataset]. https://huggingface.co/datasets/bitext/Bitext-customer-support-llm-chatbot-training-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset authored and provided by
    Bitext
    License

    https://choosealicense.com/licenses/cdla-sharing-1.0/

    Description

    Bitext - Customer Service Tagged Training Dataset for LLM-based Virtual Assistants

      Overview
    

    This hybrid synthetic dataset is designed to be used to fine-tune Large Language Models such as GPT, Mistral and OpenELM, and has been generated using our NLP/NLG technology and our automated Data Labeling (DAL) tools. The goal is to demonstrate how Verticalization/Domain Adaptation for the Customer Support sector can be easily achieved using our two-step approach to LLM… See the full description on the dataset page: https://huggingface.co/datasets/bitext/Bitext-customer-support-llm-chatbot-training-dataset.

  9. Replication Package: Tracking the Moving Target: A Framework for Continuous...

    • zenodo.org
    • ekoizpen-zientifikoa.ehu.eus
    application/gzip, bin +1
    Updated Apr 27, 2025
    Cite
    Maider Azanza Sesé; Maider Azanza Sesé; Beatriz Pérez Lamancha; Eneko Pizarro; Beatriz Pérez Lamancha; Eneko Pizarro (2025). Replication Package: Tracking the Moving Target: A Framework for Continuous Evaluation of LLM Test Generation in Industry [Dataset]. http://doi.org/10.5281/zenodo.15274212
    Explore at:
    pdf, bin, application/gzip (available download formats)
    Dataset updated
    Apr 27, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Maider Azanza Sesé; Maider Azanza Sesé; Beatriz Pérez Lamancha; Eneko Pizarro; Beatriz Pérez Lamancha; Eneko Pizarro
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Replication Package: Tracking the Moving Target: A Framework for Continuous Evaluation of LLM Test Generation in Industry

    Version: 1.0 (Date: April 27, 2025)
    DOI: https://doi.org/10.5281/zenodo.14779767

    Paper Information

    Title: Tracking the Moving Target: A Framework for Continuous Evaluation of LLM Test Generation in Industry
    Authors: Maider Azanza, Beatriz Pérez Lamancha, Eneko Pizarro
    Publication: International Conference on Evaluation and Assessment in Software Engineering (EASE), 2025 edition.

    Package Overview

    This repository contains the replication package for the research paper cited above. It provides the necessary data, source code, and prompts to understand, verify, and potentially extend our findings on the continuous evaluation of LLM-based test generation in an industrial context. The data reflects evaluations conducted between November 2024 and January 2025.

    Package Contents

    1. Metrics-Results-by-Function.7z (Archive, requires 7-Zip or a compatible tool to extract)

      • Description: Contains the detailed, raw, and processed metric results for each of the 7 Java methods and classes evaluated in the study.
      • Structure: Inside this archive, you will find 7 individual .zip files, one for each function (e.g., addUser-Metrics-Results.zip, assemble-Metrics-Results.zip, ...).
      • Contents (per function zip): Each function-specific zip file typically includes:
        • Raw test cases generated by the evaluated LLMs.
        • Metric measurements (e.g., code coverage reports from SonarQube/JaCoCo).
        • Analysis or intermediate conclusions specific to that function.
        • The specific prompt variations used for that function, if applicable beyond the main prompt.
      • Purpose: Allows for in-depth analysis of LLM performance on specific methods and verification of the metric collection process described in the paper. Data collected between November 2024 and January 2025.
    2. Metric Results by function Nov. 2024 - Jan.2025.pdf (PDF Document)

      • Description: Provides a consolidated tabular view of the key raw metrics collected for each function and LLM evaluated during the November 2024 - January 2025 period.
      • Contents: Tables summarizing metrics like code coverage, number of generated tests, expert assessment scores, etc., broken down by function and LLM. This data is directly derived from the detailed results in Metrics-Results-by-Function.7z.
      • Purpose: Offers a more detailed quantitative overview than the aggregated summary, facilitating direct comparison of raw performance metrics across functions and LLMs without needing to extract all archives.
    3. Aggregated Results by function Nov. 2024 - Jan.2025.pdf (PDF Document)

      • Description: Presents a high-level summary of the evaluation results across all tested methods and LLMs.
      • Contents: Includes an aggregated metric table showing overall performance trends, potentially including the weighted metrics discussed in the paper.
      • Purpose: Provides a quick overview of the main findings and comparative performance of the LLMs according to the evaluation framework.
    4. Prompt_for_Integration_Testing-2025.pdf (PDF Document)

      • Description: The final, refined version of the prompt provided to the LLMs for generating integration test cases.
      • Contents: Details the instructions, context (including source code snippets or descriptions), constraints, and desired output format given to the LLMs. Reflects the prompt-chaining methodology described in the paper.
      • Purpose: Enables understanding of how the LLMs were instructed and allows others to reuse or adapt the prompt engineering approach.
    5. sources.tar.gz (Compressed Tar Archive, requires tar or compatible tool to extract)

      • Description: Contains the original Java source code for the 7 methods that were the targets for test generation.
      • Contents:
        • The specific Java files containing the methods under test.
        • Relevant context or dependency information needed to understand the methods' functionality and complexity.
        • May include documentation (e.g., Javadoc) describing the intended behavior of each method.
      • Purpose: Provides the necessary code context for understanding the test generation task and potentially replicating the test execution or analysis.
  10. law-tasks

    • huggingface.co
    • opendatalab.com
    Updated Sep 20, 2023
    + more versions
    Cite
    AdaptLLM (2023). law-tasks [Dataset]. https://huggingface.co/datasets/AdaptLLM/law-tasks
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 20, 2023
    Authors
    AdaptLLM
    Description

    Adapting LLMs to Domains via Continual Pre-Training (ICLR 2024)

    This repo contains the evaluation datasets for our paper Adapting Large Language Models via Reading Comprehension. We explore continued pre-training on domain-specific corpora for large language models. While this approach enriches LLMs with domain knowledge, it significantly hurts their prompting ability for question answering. Inspired by human learning via reading comprehension, we propose a simple method to… See the full description on the dataset page: https://huggingface.co/datasets/AdaptLLM/law-tasks.

  11. Data from: Can Large Language Models Identify Locations Better Than Linked...

    • zenodo.org
    • portaldelaciencia.uva.es
    bin
    Updated Jun 5, 2025
    Cite
    Pablo García-Zarza; Pablo García-Zarza; Juan I. Asensio-Pérez; Juan I. Asensio-Pérez; Miguel L. Bote-Lorenzo; Miguel L. Bote-Lorenzo; Luis F. Sánchez-Turrión; Luis F. Sánchez-Turrión; Davide Taibi; Davide Taibi; Guillermo Vega-Gorgojo; Guillermo Vega-Gorgojo (2025). Can Large Language Models Identify Locations Better Than Linked Open Data for U-Learning? [Dataset]. http://doi.org/10.5281/zenodo.15600171
    Explore at:
    bin (available download formats)
    Dataset updated
    Jun 5, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Pablo García-Zarza; Pablo García-Zarza; Juan I. Asensio-Pérez; Juan I. Asensio-Pérez; Miguel L. Bote-Lorenzo; Miguel L. Bote-Lorenzo; Luis F. Sánchez-Turrión; Luis F. Sánchez-Turrión; Davide Taibi; Davide Taibi; Guillermo Vega-Gorgojo; Guillermo Vega-Gorgojo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository contains the dataset and analysis associated with the research paper "Can Large Language Models Identify Locations Better Than Linked Open Data for U-Learning?", presented at the 20th European Conference on Technology Enhanced Learning (ECTEL), 2025.

    Overview

    Ubiquitous learning (u-learning) applications often rely on identifying relevant Points of Interest (POIs) where students can engage in contextualized learning tasks. Traditionally, these POIs have been retrieved from structured datasets like Linked Open Data (LOD). However, with the rise of Large Language Models (LLMs), a new question arises: can LLMs outperform LOD in identifying such locations?

    This study compares the performance of a LOD dataset (Wikidata) and two LLMs (ChatGPT and DeepSeek) in retrieving 16th-century cultural heritage sites (churches, cathedrals, castles, and palaces) across three European cities (two in Spain and one in Italy) and their regions.

    Dataset

    The file LODvsLLMs.xlsx includes:

    • Raw data retrieved from Wikidata and the two LLMs.
    • SPARQL queries and LLM prompts used for data collection.
    • Comparative analysis across four key dimensions:
      • Accuracy: Are the retrieved sites real and verifiable?
      • Consistency: Do repeated queries yield stable results?
      • Completeness: How exhaustive are the lists of POIs?
      • Validity: Are the geographic coordinates and Wikipedia links correct?

    Key Findings

    • LOD (Wikidata) outperformed LLMs in terms of consistency, completeness (especially in larger regions), and validity of data.
    • LLMs were able to retrieve some POIs not found in Wikidata, but also introduced hallucinations and invalid links.
    • A hybrid approach combining LOD and LLMs is proposed for future u-learning applications to maximize coverage and reliability.

    Citation

    If you use this dataset or refer to the findings in your work, please cite the original paper presented at ECTEL 2025.

    García-Zarza, P., Asensio-Pérez, J.I., Bote-Lorenzo, M.L., Sánchez-Turrión, L.F., Taibi, D., Vega-Gorgojo, G. (2025). Can Large Language Models Identify Locations Better Than Linked Open Data for U-Learning? In Proceedings of the 20th European Conference on Technology Enhanced Learning (ECTEL 2025), Newcastle & Durham, United Kingdom, September 2025.

  12. Replication Data for: Advanced System Integration: Analyzing OpenAPI...

    • darus.uni-stuttgart.de
    Updated Dec 9, 2024
    Cite
    Robin D. Pesl; Jerin George Mathew; Massimo Mecella; Marco Aiello (2024). Replication Data for: Advanced System Integration: Analyzing OpenAPI Chunking for Retrieval-Augmented Generation [Dataset]. http://doi.org/10.18419/DARUS-4605
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 9, 2024
    Dataset provided by
    DaRUS
    Authors
    Robin D. Pesl; Jerin George Mathew; Massimo Mecella; Marco Aiello
    License

    https://darus.uni-stuttgart.de/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.18419/DARUS-4605

    Dataset funded by
    BMWK
    MWK
    Description

    Integrating multiple (sub-)systems is essential to create advanced Information Systems (ISs). Difficulties mainly arise when integrating dynamic environments across the IS lifecycle, e.g., services not yet existent at design time. A traditional approach is a registry that provides the API documentation of the systems’ endpoints. Large Language Models (LLMs) have shown to be capable of automatically creating system integrations (e.g., as service composition) based on this documentation but require concise input due to input token limitations, especially regarding comprehensive API descriptions. Currently, it is unknown how best to preprocess these API descriptions. Within this work, we (i) analyze the usage of Retrieval Augmented Generation (RAG) for endpoint discovery and the chunking, i.e., preprocessing, of state-of-practice OpenAPIs to reduce the input token length while preserving the most relevant information. To further reduce the input token length for the composition prompt and improve endpoint retrieval, we propose (ii) a Discovery Agent that only receives a summary of the most relevant endpoints and retrieves specification details on demand. We evaluate RAG for endpoint discovery using the RestBench benchmark, first, for the different chunking possibilities and parameters measuring the endpoint retrieval recall, precision, and F1 score. Then, we assess the Discovery Agent using the same test set. With our prototype, we demonstrate how to successfully employ RAG for endpoint discovery to reduce token count. While revealing high values for recall, precision, and F1, further research is necessary to retrieve all requisite endpoints. Our experiments show that for preprocessing, LLM-based and format-specific approaches outperform naïve chunking methods. Relying on an agent further enhances these results as the agent splits the tasks into multiple fine granular subtasks, improving the overall RAG performance in the token count, precision, and F1 score.

    Content:
    • code.zip: Python source code to perform the experiments.
      • evaluate.py: Script to execute the experiments (uncomment lines to select the embedding model).
      • socrag/*: Source code for the RAG.
      • benchmark/*: RestBench specification.
    • results.zip: Results of the RAG experiments (in the folder /results/data/ inside the zip file).
      • Experiment results for the RAG: results_{embedding_model}_{top-k}.json.
      • Experiment results for the Discovery Agent: results_{embedding_model}_{agent}_{refinement}_{llm}.json.
      • FAISS store (intermediate data required for exact reproduction of results; one folder for each embedding model): bge_small, nvidia, and oai.
      • Intermediate data of the LLM-based refinement methods required for the exact reproduction of results: *_parser.json.
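
    A minimal, generic sketch of the endpoint-retrieval metrics reported above (recall, precision, and F1 over retrieved vs. ground-truth endpoints); it illustrates the measurement, not the code shipped in code.zip:

    ```python
    def retrieval_metrics(retrieved: list[str], relevant: list[str]) -> dict[str, float]:
        """Precision, recall, and F1 for a set of retrieved endpoints against the ground truth."""
        retrieved_set, relevant_set = set(retrieved), set(relevant)
        hits = len(retrieved_set & relevant_set)
        precision = hits / len(retrieved_set) if retrieved_set else 0.0
        recall = hits / len(relevant_set) if relevant_set else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return {"precision": precision, "recall": recall, "f1": f1}

    # Hypothetical example: endpoints suggested by the RAG step vs. those a RestBench task requires.
    print(retrieval_metrics(
        retrieved=["GET /movie/{id}", "GET /search/movie", "GET /person/{id}"],
        relevant=["GET /search/movie", "GET /movie/{id}"],
    ))
    ```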

  13. Replication Package for the Paper: "An Insight into Security Code Review...

    • zenodo.org
    zip
    Updated Jun 2, 2025
    Cite
    Jiaxin Yu; Peng Liang; Yujia Fu; Amjed Tahir; Mojtaba Shahin; Chong Wang; Yangxiao Cai; Jiaxin Yu; Peng Liang; Yujia Fu; Amjed Tahir; Mojtaba Shahin; Chong Wang; Yangxiao Cai (2025). Replication Package for the Paper: "An Insight into Security Code Review with LLMs: Capabilities, Obstacles, and Influential Factors". [Dataset]. http://doi.org/10.5281/zenodo.15572151
    Explore at:
    zip (available download formats)
    Dataset updated
    Jun 2, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jiaxin Yu; Peng Liang; Yujia Fu; Amjed Tahir; Mojtaba Shahin; Chong Wang; Yangxiao Cai; Jiaxin Yu; Peng Liang; Yujia Fu; Amjed Tahir; Mojtaba Shahin; Chong Wang; Yangxiao Cai
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the replication package for the paper: "An Insight into Security Code Review with LLMs: Capabilities, Obstacles, and Influential Factors".

    The replication package is organized into three folders:

    1. RQ1 Performance of LLMs

    - Five prompt templates.pdf
    This PDF demonstrates the detailed structures of the five prompt templates designed in Section 3.3.2 of our paper.

    - source code of the Python and C/C++ datasets
    This folder contains the source code of the Python and C/C++ datasets, used to construct prompts and apply the baseline tools for static analysis.

    - prompts for the Python and C/C++ datasets
    This folder contains the prompts constructed from the source code of the Python and C/C++ datasets based on the five prompt templates.

    - responses of LLMs and baselines
    This folder contains the responses generated by LLMs for each prompt and the analysis results of baseline tools. For CodeQL, you need to upload results.sarif to GitHub (https://docs.github.com/en/code-security/code-scanning/integrating-with-code-scanning/uploading-a-sarif-file-to-github) to view the analysis results. For SonarQube, you need to import the export file into an Enterprise Edition or higher instance of the same version (v10.5 in our work) and similar configuration (default configuration in our work) to view the analysis results.

    - entropy_calculation.py
    This Python script calculates the average entropy of each LLM-prompt combination to measure the consistency of LLM responses across the three repeated experiments (a generic sketch of this calculation appears at the end of this RQ1 listing).

    - Data Labelling for the C/C++ Dataset.xlsx
    - Data Labelling for the Python Dataset.xlsx
    The two Microsoft (MS) Excel files contain the labeling results for LLMs and baselines on the C/C++ and Python datasets, including the category of each response generated by an LLM for each prompt, as well as the category of each analysis result generated by a baseline for each code file. The four categories (i.e., Instrumental, Helpful, Misleading, and Uncertain) are defined in Section 3.3.3 of our paper as the labelling criteria.

    How to Read the MS Excel files:
    Both MS Excel files contain 5 sheets. The first sheet ('all_c++_data' or 'all_python_data') includes the information of all data in each dataset. The sheets 'first round', 'second round' and 'third round' represent the labelling results for LLMs under five prompts in three repetitive experiments. The sheet 'Baselines' include the labelling results for baseline tools.

    Columns in each sheet:
    • File ID: the identifier of each code file in our dataset.
    • Security Defect: the security defect(s) that the code file contains.
    • Project: the source project of the code file.
    • Suffix: the suffix of the code file.
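
    A generic sketch of the consistency measure described for entropy_calculation.py above: the Shannon entropy of the label distribution across the three repetitions, averaged over prompts (an illustration of the idea, not the released script):

    ```python
    from collections import Counter
    from math import log2

    def avg_entropy(labels_per_prompt: list[list[str]]) -> float:
        """Average entropy over prompts; each inner list holds one prompt's labels across repeats."""
        def entropy(labels: list[str]) -> float:
            counts = Counter(labels)
            total = sum(counts.values())
            return -sum((c / total) * log2(c / total) for c in counts.values())
        return sum(entropy(labels) for labels in labels_per_prompt) / len(labels_per_prompt)

    # Identical labels across the three runs give 0 bits; a 2-1 split gives ~0.918 bits.
    print(avg_entropy([
        ["Instrumental", "Instrumental", "Instrumental"],
        ["Helpful", "Misleading", "Helpful"],
    ]))
    ```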

    2. RQ2 Quality Problem in Responses
    - data_analysis_first_round.mx22
    - data_analysis_second_round.mx22
    - data_analysis_third_round.mx22

    These three MAXQDA project files contain the results of data extraction for quality problems present in responses generated by the best-performing LLM-prompt combination across three repetitive experiments. These files can be opened with MAXQDA 2022 or higher versions, available for download at https://www.maxqda.com/. You may also use the free 14-day trial version of MAXQDA 2024, available for download at https://www.maxqda.com/trial.

    3. RQ3 Factor influencing LLMs
    This folder contains two sub-folders:

    - Step 1 - correlation analysis
    Files in this subfolder are for conducting correlation analysis for explanatory variables through a Python script.

    - Step 2 - redundancy analysis and model fitting
    Files in this subfolder are for conducting redundancy analysis, allocation of degree of freedoms, model fitting and evaluation through an R script. Detailed instructions for running the R script can be found in readme.md in this subfolder.

  14. LORE PMKB-CV

    • zenodo.org
    bin
    Updated Mar 5, 2025
    Cite
    Peng-Hsuan Li; Peng-Hsuan Li (2025). LORE PMKB-CV [Dataset]. http://doi.org/10.5281/zenodo.14607639
    Explore at:
    bin (available download formats)
    Dataset updated
    Mar 5, 2025
    Dataset provided by
    Taiwan AI Labs
    Authors
    Peng-Hsuan Li; Peng-Hsuan Li
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Time period covered
    Jan 7, 2025
    Description

    LORE PMKB-CV

    • Knowledge graph (LLM-ORE)
      • 70M relations between 8k Diseases (MeSH) and 18k Genes (NCBI, human protein coding) curated by LLMs reading PubMed
      • Data format: (D_id, G_id, PMID, relation) csv file
    • Semantic embedding (LLM-EMB)
      • 2.5M DG vectors created by LLMs reading the knowledge graph
      • Data format: (D_id, G_id, vector) pkl file (see the loading sketch after this list)
    • DG pathogenicity scores (ML-Ranker)
      • 3.1M DG scores predicted by pretrained models
      • Features, training annotations, pretrained models are also provided
    • Curated key semantics taxonomy
      • A manually curated taxonomy of 105 semantic tags about DG pathogenicity in the knowledge graph
      • Use the GitHub LORE Key-Semantics module to apply the taxonomy as tags and add them to the knowledge graph
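
    A minimal loading sketch for the two file formats listed above (a (D_id, G_id, PMID, relation) CSV and a (D_id, G_id, vector) pickle); the file names are hypothetical and the header assumption should be checked against the release:

    ```python
    import pickle
    import pandas as pd

    # Knowledge-graph relations: (D_id, G_id, PMID, relation) rows in a CSV file.
    relations = pd.read_csv("lore_pmkb_cv_relations.csv",
                            names=["D_id", "G_id", "PMID", "relation"])
    print(relations.head())

    # Semantic embeddings: (D_id, G_id, vector) records in a pickle file.
    with open("lore_pmkb_cv_embeddings.pkl", "rb") as f:
        embeddings = pickle.load(f)  # exact structure depends on the release; inspect before use
    ```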

    Source project

  15. DOVE

    • huggingface.co
    Updated Mar 2, 2025
    + more versions
    Cite
    nlphuji (2025). DOVE [Dataset]. https://huggingface.co/datasets/nlphuji/DOVE
    Explore at:
    Dataset updated
    Mar 2, 2025
    Dataset authored and provided by
    nlphuji
    License

    https://choosealicense.com/licenses/cdla-permissive-2.0/

    Description

    🕊️ DOVE: A Large-Scale Multi-Dimensional Predictions Dataset Towards Meaningful LLM Evaluation

    🌐 Project Website | 📄 Read our paper

      Updates 📅
    

    2025-06-11: Added Llama 70B evaluations with ~5,700 MMLU examples across 100 different prompt variations (= 570K new predictions!), based on data from ReliableEval: A Recipe for Stochastic LLM Evaluation via Method of Moments
    2025-04-12: Added MMLU predictions from dozens of models including OpenAI, Qwen, Mistral, Gemini… See the full description on the dataset page: https://huggingface.co/datasets/nlphuji/DOVE.

  16. Data from: Medical Expert Annotations of Unsupported Facts in Doctor-Written...

    • physionet.org
    Updated Apr 30, 2025
    Cite
    Stefan Hegselmann; Shannon Shen; Florian Gierse; Monica Agrawal; David Sontag; Xiaoyi Jiang (2025). Medical Expert Annotations of Unsupported Facts in Doctor-Written and LLM-Generated Patient Summaries [Dataset]. http://doi.org/10.13026/gedc-j464
    Explore at:
    Dataset updated
    Apr 30, 2025
    Authors
    Stefan Hegselmann; Shannon Shen; Florian Gierse; Monica Agrawal; David Sontag; Xiaoyi Jiang
    License

    https://github.com/MIT-LCP/license-and-dua/tree/master/drafts

    Description

    Large language models in healthcare can generate informative patient summaries while reducing the documentation workload of healthcare professionals. However, these models are prone to producing hallucinations, that is, generating unsupported information, which is problematic in the sensitive healthcare domain. To better characterize unsupported facts in medical texts, we developed a rigorous labeling protocol. Following this protocol, two medical experts annotated unsupported facts in 100 doctor-written summaries from the MIMIC-IV-Note Discharge Instructions and hallucinations in 100 LLM-generated patient summaries. Here, we are releasing two datasets based on these annotations: Hallucinations-MIMIC-DI and Hallucinations-Generated-DI. We find that using these datasets to train on hallucination-free examples effectively reduces hallucinations for both Llama 2 (2.60 to 1.55 hallucinations per summary) and GPT-4 (0.70 to 0.40). Furthermore, we created a preprocessed version of the MIMIC-IV-Notes Discharge Instructions, releasing both a full-context version (MIMIC-IV-Note-Ext-DI) and a version that only uses the Brief Hospital Course for context (MIMIC-IV-Note-Ext-DI-BHC).

  17. finance-tasks

    • huggingface.co
    • opendatalab.com
    Updated Dec 31, 2011
    Cite
    AdaptLLM (2011). finance-tasks [Dataset]. https://huggingface.co/datasets/AdaptLLM/finance-tasks
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 31, 2011
    Authors
    AdaptLLM
    Description

    Adapting LLMs to Domains via Continual Pre-Training (ICLR 2024)

    This repo contains the evaluation datasets for our paper Adapting Large Language Models via Reading Comprehension. We explore continued pre-training on domain-specific corpora for large language models. While this approach enriches LLMs with domain knowledge, it significantly hurts their prompting ability for question answering. Inspired by human learning via reading comprehension, we propose a simple method to… See the full description on the dataset page: https://huggingface.co/datasets/AdaptLLM/finance-tasks.

  18. Simulation Parameters.

    • plos.figshare.com
    xls
    Updated Oct 7, 2024
    Cite
    Adrielli Tina Lopes Rego; Joshua Snell; Martijn Meeter (2024). Simulation Parameters. [Dataset]. http://doi.org/10.1371/journal.pcbi.1012117.t001
    Explore at:
    xls (available download formats)
    Dataset updated
    Oct 7, 2024
    Dataset provided by
    PLOS Computational Biology
    Authors
    Adrielli Tina Lopes Rego; Joshua Snell; Martijn Meeter
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Although word predictability is commonly considered an important factor in reading, sophisticated accounts of predictability in theories of reading are lacking. Computational models of reading traditionally use cloze norming as a proxy of word predictability, but what cloze norms precisely capture remains unclear. This study investigates whether large language models (LLMs) can fill this gap. Contextual predictions are implemented via a novel parallel-graded mechanism, where all predicted words at a given position are pre-activated as a function of contextual certainty, which varies dynamically as text processing unfolds. Through reading simulations with OB1-reader, a cognitive model of word recognition and eye-movement control in reading, we compare the model’s fit to eye-movement data when using predictability values derived from a cloze task against those derived from LLMs (GPT-2 and LLaMA). Root Mean Square Error between simulated and human eye movements indicates that LLM predictability provides a better fit than cloze. This is the first study to use LLMs to augment a cognitive model of reading with higher-order language processing while proposing a mechanism on the interplay between word predictability and eye movements.
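
    A minimal sketch of deriving word predictability from GPT-2, i.e., the probability the model assigns to the upcoming word given the preceding context, which is the kind of value compared against cloze norms above (details such as handling multi-token words are simplified here):

    ```python
    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    model.eval()

    def next_word_probability(context: str, word: str) -> float:
        """Probability GPT-2 assigns to `word` continuing `context` (first subtoken only)."""
        context_ids = tokenizer(context, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(context_ids).logits[0, -1]      # distribution over the next token
        probs = torch.softmax(logits, dim=-1)
        word_id = tokenizer(" " + word).input_ids[0]       # leading space marks a word boundary in GPT-2 BPE
        return probs[word_id].item()

    print(next_word_probability("The cat sat on the", "mat"))
    ```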

  19. Synthetic Colorectal Cancer Global Dataset

    • opendatabay.com
    .undefined
    Updated Jun 28, 2025
    Cite
    Opendatabay Labs (2025). Synthetic Colorectal Cancer Global Dataset [Dataset]. https://www.opendatabay.com/data/synthetic/ae2aba99-491d-45a1-a99e-7be14927f4af
    Explore at:
    .undefined (available download formats)
    Dataset updated
    Jun 28, 2025
    Dataset provided by
    Buy & Sell Data | Opendatabay - AI & Synthetic Data Marketplace
    Authors
    Opendatabay Labs
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Patient Health Records & Digital Health
    Description

    The Synthetic Colorectal Cancer Global Dataset is a fully anonymised, high-dimensional synthetic dataset designed for global cancer research, predictive modelling, and educational use. It encompasses demographic, clinical, lifestyle, genetic, and healthcare access factors relevant to colorectal cancer incidence, outcomes, and survivability.

    Dataset Features

    • Patient_ID: Unique identifier for each patient.
    • Country: Patient's country of residence.
    • Age: Age at diagnosis (in years).
    • Gender: Biological sex of the patient (Male/Female/Other).
    • Cancer_Stage: Stage of colorectal cancer at diagnosis (e.g., Stage I–IV).
    • Tumor_Size_mm: Size of the tumor in millimeters.
    • Family_History: Presence of colorectal cancer in family history (True/False).
    • Smoking_History: Smoking behavior or history (e.g., Current, Former, Never).
    • Alcohol_Consumption: Level of alcohol consumption (e.g., High, Moderate, None).
    • Obesity_BMI: BMI classification related to obesity.
    • Diet_Risk: Diet-related cancer risk (e.g., High Fat, Low Fiber).
    • Physical_Activity: Level of physical activity (e.g., Sedentary, Active).
    • Diabetes: Diabetes diagnosis (True/False).
    • Inflammatory_Bowel_Disease: Presence of IBD (True/False).
    • Genetic_Mutation: Genetic mutations relevant to colorectal cancer (e.g., APC, KRAS).
    • Screening_History: History of cancer screenings (True/False).
    • Early_Detection: Whether cancer was detected early (True/False).
    • Treatment_Type: Primary treatment type (e.g., Surgery, Chemotherapy, Radiation).
    • Survival_5_years: 5-year survival status (True/False).
    • Mortality: Mortality outcome (Alive/Deceased).
    • Healthcare_Costs: Estimated treatment costs (in USD).
    • Incidence_Rate_per_100K: Country-level incidence rate per 100,000 people.
    • Mortality_Rate_per_100K: Country-level mortality rate per 100,000 people.
    • Urban_or_Rural: Patient's living area (Urban/Rural).
    • Economic_Classification: Country's economic level (e.g., Low, Middle, High income).
    • Healthcare_Access: Access level to healthcare services (e.g., Good, Limited).
    • Insurance_Status: Insurance coverage status (Insured/Uninsured).
    • Survival_Prediction: Model-derived survival prediction (probability or binary).

    Distribution

    Distribution plot (Synthetic Colorectal Cancer Global Data Distribution): https://storage.googleapis.com/opendatabay_public/ae2aba99-491d-45a1-a99e-7be14927f4af/299af3fa2502_patient_analysis_plots.png

    Usage

    This dataset can be used for:

    • Global Cancer Research: Analyze how clinical, lifestyle, and socioeconomic factors affect colorectal cancer outcomes worldwide.
    • Predictive Modeling: Develop models to estimate survival probability or treatment outcomes (see the sketch after this list).
    • Healthcare Policy Analysis: Study disparities in healthcare access and outcomes across countries.
    • Educational Use: Support training in epidemiology, oncology, public health, and machine learning.
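
    As noted under Predictive Modeling above, a minimal sketch of fitting a survival classifier on a handful of the listed columns (the file name is hypothetical, and the dataset is assumed to load as a flat CSV):

    ```python
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder

    df = pd.read_csv("synthetic_colorectal_cancer_global.csv")  # hypothetical file name

    features = ["Age", "Cancer_Stage", "Tumor_Size_mm", "Family_History",
                "Smoking_History", "Screening_History", "Treatment_Type"]
    categorical = ["Cancer_Stage", "Smoking_History", "Treatment_Type"]

    pipeline = Pipeline([
        ("encode", ColumnTransformer(
            [("cat", OneHotEncoder(handle_unknown="ignore"), categorical)],
            remainder="passthrough")),
        ("model", RandomForestClassifier(n_estimators=200, random_state=0)),
    ])

    X_train, X_test, y_train, y_test = train_test_split(
        df[features], df["Survival_5_years"], test_size=0.2, random_state=0)
    pipeline.fit(X_train, y_train)
    print("held-out accuracy:", pipeline.score(X_test, y_test))
    ```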

    Coverage

    The dataset includes 100% synthetic yet clinically plausible records from diverse countries and demographic groups. It is anonymized and modeled to reflect real-world variability in risk factors, diagnosis stages, treatment, and survival without compromising patient privacy.

    License

    CC0 (Public Domain)

    Who Can Use It

    • Epidemiologists and Medical Researchers: To explore global patterns in colorectal cancer.
    • Public Health Experts and Policymakers: For assessing equity in healthcare access and cancer outcomes.
    • Data Scientists and Educators: As a rich dataset for teaching data analysis, classification, regression, and health informatics.
  20. Replication Data for: Large Language Models as a Substitute for Human...

    • search.dataone.org
    Updated Mar 6, 2024
    Cite
    Heseltine, Michael (2024). Replication Data for: Large Language Models as a Substitute for Human Experts in Annotating Political Text [Dataset]. http://doi.org/10.7910/DVN/V2P6YL
    Explore at:
    Dataset updated
    Mar 6, 2024
    Dataset provided by
    Harvard Dataverse
    Authors
    Heseltine, Michael
    Description

    Large-scale text analysis has grown rapidly as a method in political science and beyond. To date, text-as-data methods rely on large volumes of human-annotated training examples, which places a premium on researcher resources. However, advances in large language models (LLMs) may make automated annotation increasingly viable. This paper tests the performance of GPT-4 across a range of scenarios relevant for analysis of political text. We compare GPT-4 coding with human expert coding of tweets and news articles across four variables (whether text is political, its negativity, its sentiment, and its ideology) and across four countries (the United States, Chile, Germany, and Italy). GPT-4 coding is highly accurate, especially for shorter texts such as tweets, correctly classifying texts up to 95% of the time. Performance drops for longer news articles, and very slightly for non-English text. We introduce a "hybrid" coding approach, in which disagreements of multiple GPT-4 runs are adjudicated by a human expert, which boosts accuracy. Finally, we explore downstream effects, finding that transformer models trained on hand-coded or GPT-4-coded data yield almost identical outcomes. Our results suggest that LLM-assisted coding is a viable and cost-efficient approach, although consideration should be given to task complexity.
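
    A minimal sketch of the "hybrid" coding idea described above: run GPT-4 several times per text, keep the label when the runs agree, and queue disagreements for a human expert (pure illustration; no API calls, labels and IDs are made up):

    ```python
    from collections import Counter

    def hybrid_code(runs_per_text: dict[str, list[str]]) -> tuple[dict[str, str], list[str]]:
        """Accept unanimous GPT-4 labels; route everything else to a human adjudication queue."""
        accepted, needs_human = {}, []
        for text_id, labels in runs_per_text.items():
            label, freq = Counter(labels).most_common(1)[0]
            if freq == len(labels):          # all runs agree
                accepted[text_id] = label
            else:                            # disagreement -> human expert adjudicates
                needs_human.append(text_id)
        return accepted, needs_human

    accepted, queue = hybrid_code({
        "tweet_001": ["political", "political", "political"],
        "tweet_002": ["negative", "neutral", "negative"],
    })
    print(accepted)   # {'tweet_001': 'political'}
    print(queue)      # ['tweet_002']
    ```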
