46 datasets found
  1. Foundation Model Data Collection and Data Annotation | Large Language...

    • datarade.ai
    Updated Jan 25, 2024
    Cite
    Nexdata (2024). Foundation Model Data Collection and Data Annotation | Large Language Model(LLM) Data | SFT Data| Red Teaming Services [Dataset]. https://datarade.ai/data-products/nexdata-foundation-model-data-solutions-llm-sft-rhlf-nexdata
    Explore at:
    .bin, .json, .xml, .csv, .xls, .sql, .txt (available download formats)
    Dataset updated
    Jan 25, 2024
    Dataset authored and provided by
    Nexdata
    Area covered
    Portugal, Taiwan, Czech Republic, Maldives, Ireland, Azerbaijan, El Salvador, Kyrgyzstan, Spain, Russian Federation
    Description
    1. Overview
    -Unsupervised Learning: For the training data required in unsupervised learning, Nexdata delivers data collection and cleaning services for both single-modal and cross-modal data. We provide Large Language Model (LLM) data cleaning and personnel support services based on the specific data types and characteristics of the client's domain.

    -SFT: Nexdata assists clients in generating high-quality supervised fine-tuning data for model optimization through prompt and output annotation.

    -Red teaming: Nexdata helps clients train and validate models by drafting various adversarial attacks, such as exploratory or potentially harmful questions. Our red team capabilities help clients identify problems in their models related to hallucinations, harmful content, false information, discrimination, and language bias.

    -RLHF: Nexdata assists clients in manually ranking multiple outputs generated by the SFT-trained model according to rules provided by the client, or in providing multi-factor scoring. By training annotators to align with the required values and aggregating judgments from multiple annotators, the quality of feedback can be improved.

    2. Our Capacity
    -Global Resources: Global resources covering hundreds of languages worldwide.

    -Compliance: All Large Language Model (LLM) data is collected with proper authorization.

    -Quality: Multiple rounds of quality inspection ensure high-quality data output.

    -Secure Implementation: An NDA is signed to guarantee secure implementation, and data is destroyed upon delivery.

    -Efficiency: Our platform supports human-machine interaction and semi-automatic labeling, increasing labeling efficiency by more than 30% per annotator. It has been applied successfully to nearly 5,000 projects.

    3. About Nexdata
    Nexdata is equipped with professional data collection devices, tools, and environments, as well as experienced project managers for data collection and quality control, so we can meet Large Language Model (LLM) data collection requirements across a wide range of scenarios and data types. We have global data processing centers and more than 20,000 professional annotators, supporting on-demand Large Language Model (LLM) data annotation services for speech, image, video, point cloud, and Natural Language Processing (NLP) data. Please visit us at https://www.nexdata.ai/?source=Datarade

  2. TagX Data collection for AI/ ML training | LLM data | Data collection for AI...

    • datarade.ai
    .json, .csv, .xls
    Updated Jun 18, 2021
    Cite
    TagX (2021). TagX Data collection for AI/ ML training | LLM data | Data collection for AI development & model finetuning | Text, image, audio, and document data [Dataset]. https://datarade.ai/data-products/data-collection-and-capture-services-tagx
    Explore at:
    .json, .csv, .xls (available download formats)
    Dataset updated
    Jun 18, 2021
    Dataset authored and provided by
    TagX
    Area covered
    Colombia, Iceland, Belize, Antigua and Barbuda, Saudi Arabia, Equatorial Guinea, Benin, Djibouti, Russian Federation, Qatar
    Description

    We offer comprehensive data collection services that cater to a wide range of industries and applications. Whether you require image, audio, or text data, we have the expertise and resources to collect and deliver high-quality data that meets your specific requirements. Our data collection methods include manual collection, web scraping, and other automated techniques that ensure accuracy and completeness of data.

    Our team of experienced data collectors and quality assurance professionals ensure that the data is collected and processed according to the highest standards of quality. We also take great care to ensure that the data we collect is relevant and applicable to your use case. This means that you can rely on us to provide you with clean and useful data that can be used to train machine learning models, improve business processes, or conduct research.

    We are committed to delivering data in the format that you require. Whether you need raw data or a processed dataset, we can deliver the data in your preferred format, including CSV, JSON, or XML. We understand that every project is unique, and we work closely with our clients to ensure that we deliver the data that meets their specific needs. So if you need reliable data collection services for your next project, look no further than us.

  3. LLM-Assisted Content Analysis (LACA): Coded data and model reasons

    • figshare.com
    txt
    Updated Jun 22, 2023
    Cite
    Rob Chew; Michael Wenger; John Bollenbacher; Jessica Speer; Annice Kim (2023). LLM-Assisted Content Analysis (LACA): Coded data and model reasons [Dataset]. http://doi.org/10.6084/m9.figshare.23291147.v1
    Explore at:
    txt (available download formats)
    Dataset updated
    Jun 22, 2023
    Dataset provided by
    figshare
    Authors
    Rob Chew; Michael Wenger; John Bollenbacher; Jessica Speer; Annice Kim
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This resource consists of calibration sets (N=100) for 4 publicly available datasets, coded using the LLM-Assisted Content Analysis (LACA) method. Each dataset contains the following columns (a loading sketch follows the list):

    • text_id: Unique ID for each text document
    • code_id: Unique ID for each code category
    • text: Document text that has been coded
    • original_code: Coded response from the original dataset
    • replicated_code: Coded response from the independent coding exercise by our study team
    • model_code: Coded response generated by the LLM (GPT-3.5-turbo)
    • reason: LLM-generated reason for the coding decision
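
    A minimal loading sketch, assuming the calibration sets parse as delimited text files with the columns above (the file name is hypothetical):

    ```python
    import pandas as pd

    # Hypothetical file name; the release ships one calibration set per source dataset.
    df = pd.read_csv("bbc_news_calibration.txt", sep=None, engine="python")  # sniff the delimiter

    # Per-code agreement between the LLM and the original coders.
    agreement = (
        df.assign(match=df["model_code"] == df["original_code"])
          .groupby("code_id")["match"]
          .mean()
    )
    print(agreement.sort_values())

    # Inspect the LLM's stated reason where it disagrees with the original coding.
    disagree = df[df["model_code"] != df["original_code"]]
    print(disagree[["text_id", "code_id", "original_code", "model_code", "reason"]].head())
    ```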

    Additional details on methods and definitions of individual code categories are available in the following paper:

    Chew, R., Bollenbacher, J., Speer, J., Wenger, M., & Kim, A. (2023). LLM-Assisted Content Analysis: Using Large Language Models to Support Deductive Coding.

    Trump Tweets

    Citation: Coe, Kevin, Berger, Julia, Blumling, Allison, Brooks, Katelyn, Giorgi, Elizabeth, Jackson, Jennifer, … Wellman, Mariah. Quantitative Content Analysis of Donald Trump’s Twitter, 2017-2019. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], 2020-04-01. https://doi.org/10.3886/E118603V1 Source: https://www.openicpsr.org/openicpsr/project/118603/version/V1/view

    BBC News

    Citation: Greene, D., & Cunningham, P. (2006). Practical solutions to the problem of diagonal dominance in kernel document clustering. In Proceedings of the 23rd international conference on Machine learning (pp. 377-384). Source: https://www.kaggle.com/datasets/shivamkushwaha/bbc-full-text-document-classification

    Contrarian Claims

    Citation: Coan, T. G., Boussalis, C., Cook, J., & Nanko, M. O. (2021). Computer-assisted classification of contrarian claims about climate change. Scientific reports, 11(1), 22320. Source: https://socialanalytics.ex.ac.uk/cards/data.zip

    Ukraine Water Problems

    Citation: Afanasyev S, N. B, Bodnarchuk T, S. V, M. V, T. V, Yu V, K. G, V. D, Konovalenko O, O. K, E. K, Lietytska O, O. L, V. M, Marushevska O, Mokin V, K. M, Osadcha N, O. I (2013) River Basin Management Plan for Pivdenny Bug: river basin analysis and measures Source: https://www.kaggle.com/datasets/vbmokin/nlp-reports-news-classification

  4. Data from: LLM-assisted Graph-RAG Information Extraction from IFC Data

    • figshare.com
    pdf
    Updated Apr 23, 2025
    Cite
    Hadeel Saadany (2025). LLM-assisted Graph-RAG Information Extraction from IFC Data [Dataset]. http://doi.org/10.6084/m9.figshare.28771409.v2
    Explore at:
    pdf (available download formats)
    Dataset updated
    Apr 23, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Hadeel Saadany
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    IFC data has become the general building information standard for collaborative work in the construction industry. However, IFC data can be very complicated because it allows for multiple ways to represent the same product information. In this research, we utilise the capabilities of LLMs to parse IFC data with a Graph Retrieval-Augmented Generation (Graph-RAG) technique to retrieve building object properties and their relations. We show that, despite limitations due to the complex hierarchy of the IFC data, Graph-RAG parsing enhances generative LLMs like GPT-4o with graph-based knowledge, enabling natural-language query-response retrieval without the need for a complex pipeline.

  5. Data from: EHRNoteQA: An LLM Benchmark for Real-World Clinical Practice...

    • physionet.org
    Updated Jun 26, 2024
    Cite
    Sunjun Kweon; Jiyoun Kim; Heeyoung Kwak; Dongchul Cha; Hangyul Yoon; Kwang Hyun Kim; Jeewon Yang; Seunghyun Won; Edward Choi (2024). EHRNoteQA: An LLM Benchmark for Real-World Clinical Practice Using Discharge Summaries [Dataset]. http://doi.org/10.13026/acga-ht95
    Explore at:
    Dataset updated
    Jun 26, 2024
    Authors
    Sunjun Kweon; Jiyoun Kim; Heeyoung Kwak; Dongchul Cha; Hangyul Yoon; Kwang Hyun Kim; Jeewon Yang; Seunghyun Won; Edward Choi
    License

    https://github.com/MIT-LCP/license-and-dua/tree/master/drafts

    Area covered
    World
    Description

    Discharge summaries in Electronic Health Records (EHRs) are crucial for clinical decision-making, but their length and complexity make information extraction challenging, especially when dealing with accumulated summaries across multiple patient admissions. Large Language Models (LLMs) show promise in addressing this challenge by efficiently analyzing vast and complex data. Existing benchmarks, however, fall short in properly evaluating LLMs' capabilities in this context, as they typically focus on single-note information or limited topics, failing to reflect the real-world inquiries required by clinicians. To bridge this gap, we introduce EHRNoteQA, a novel benchmark built on the MIMIC-IV EHR, comprising 962 different QA pairs each linked to distinct patients' discharge summaries. Every QA pair is initially generated using GPT-4 and then manually reviewed and refined by three clinicians to ensure clinical relevance. EHRNoteQA includes questions that require information across multiple discharge summaries and covers eight diverse topics, mirroring the complexity and diversity of real clinical inquiries.

  6. Data from: Data and code for, "Large language models design sequence-defined...

    • data.niaid.nih.gov
    Updated Aug 29, 2024
    Cite
    Statt, Antonia (2024). Data and code for, "Large language models design sequence-defined macromolecules via evolutionary optimization" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_11121884
    Explore at:
    Dataset updated
    Aug 29, 2024
    Dataset provided by
    Reinhart, Wesley
    Statt, Antonia
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Code and data for "Large language models design sequence-defined macromolecules via evolutionary optimization". Note: this repository contains the code and data files for the manuscript. This is a snapshot of the repository, frozen at the time of submission.

    Codes

    LLM codes
    • run_claude.py - the routine for performing LLM-based rollouts; intended for command-line execution using argparse
    • message_utils.py - utilities for constructing and parsing messages for LLM I/O
    • model_utils.py - lightweight utilities for retrieving formatted predictions from the RNN ensemble
    • target_defs.py - defines the sequence, locations, and natural-language descriptions of the target structures
    • ask_about_oracle.ipynb - asks the LLM to speculate about the nature of the optimization task

    Other algorithms
    • active_learning.ipynb - use EI acquisition with an RF surrogate to label new sequences; includes an unused tokenization scheme
    • evolutionary_algorithm.ipynb - use the DEAP library to perform evolutionary optimization
    • random_sampling.ipynb - sample sequences randomly from all possible sequences

    Postprocessing
    • process_aggregated_logs.py - reads data from the raw log files and prepares it for visualization
    • process_sample_rollouts.py - reads data from the raw log files and prepares individual rollouts

    Visualization
    • figure1b.ipynb - renders panel b of Fig. 1
    • figure1efg.ipynb - renders the last row of Fig. 1 (panels e-g)
    • figure2.ipynb - renders all of Fig. 2
    • figure_si.ipynb - renders Figs. S1 and S2
    • figure_md_validation.ipynb - renders Fig. S3

    Data files
    • prompts/
      • prompt-scientific-v4.4.yml - the full text of the scientific prompt, to be read by run_claude.py
      • prompt-oracle-v4.4.yml - the full text of the oracle prompt, to be read by run_claude.py
    • models/ - the TorchScript RNN models used to make predictions
    • data/
      • embeddings - calculated embeddings for a collection of sequences from our prior work
      • llm-logs - the raw logs obtained from the Claude 3.5 Sonnet LLM (other algorithms made to look like the LLM logs after the fact)
      • llm-logs-opus - the raw logs obtained from the Claude 3.0 Opus LLM (used in the first draft of the article, replaced by Claude 3.5 Sonnet)
      • all-rollouts-kltd.csv - postprocessed logs for all the rollouts using the "top $k < d^*$" metric
      • all-rollouts-topkd.csv - postprocessed logs for all the rollouts using the "mean $d$ for top $k$" metric
      • sample-rollout-membranes-x-3.csv - postprocessed logs for a single rollout replica, x = each algorithm type
      • snapshots - png snapshots of MD simulation results at different locations in the manifold
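
    A minimal sketch, assuming the postprocessed rollout CSVs load as flat tables, for comparing algorithms on one of the two metrics described above (the column names used here are assumptions; inspect the actual header first):

    ```python
    import pandas as pd

    # Postprocessed logs shipped under data/ (file name taken from the listing above).
    rollouts = pd.read_csv("data/all-rollouts-topkd.csv")

    # "algorithm", "rollout", and "metric" are assumed column names, not confirmed by the release.
    best_per_rollout = rollouts.groupby(["algorithm", "rollout"])["metric"].min()
    print(best_per_rollout.groupby("algorithm").describe())
    ```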

  7. Machine Learning (ML) Data | 800M+ B2B Profiles | AI-Ready for Deep Learning...

    • datarade.ai
    .json, .csv
    Cite
    Xverum, Machine Learning (ML) Data | 800M+ B2B Profiles | AI-Ready for Deep Learning (DL), NLP & LLM Training [Dataset]. https://datarade.ai/data-products/xverum-company-data-b2b-data-belgium-netherlands-denm-xverum
    Explore at:
    .json, .csv (available download formats)
    Dataset provided by
    Xverum LLC
    Authors
    Xverum
    Area covered
    Jordan, United Kingdom, Western Sahara, India, Dominican Republic, Sint Maarten (Dutch part), Cook Islands, Barbados, Norway, Oman
    Description

    Xverum’s AI & ML Training Data provides one of the most extensive datasets available for AI and machine learning applications, featuring 800M B2B profiles with 100+ attributes. This dataset is designed to enable AI developers, data scientists, and businesses to train robust and accurate ML models. From natural language processing (NLP) to predictive analytics, our data empowers a wide range of industries and use cases with unparalleled scale, depth, and quality.

    What Makes Our Data Unique?

    Scale and Coverage: - A global dataset encompassing 800M B2B profiles from a wide array of industries and geographies. - Includes coverage across the Americas, Europe, Asia, and other key markets, ensuring worldwide representation.

    Rich Attributes for Training Models: - Over 100 fields of detailed information, including company details, job roles, geographic data, industry categories, past experiences, and behavioral insights. - Tailored for training models in NLP, recommendation systems, and predictive algorithms.

    Compliance and Quality: - Fully GDPR and CCPA compliant, providing secure and ethically sourced data. - Extensive data cleaning and validation processes ensure reliability and accuracy.

    Annotation-Ready: - Pre-structured and formatted datasets that are easily ingestible into AI workflows. - Ideal for supervised learning with tagging options such as entities, sentiment, or categories.

    How Is the Data Sourced? - Publicly available information gathered through advanced, GDPR-compliant web aggregation techniques. - Proprietary enrichment pipelines that validate, clean, and structure raw data into high-quality datasets. This approach ensures we deliver comprehensive, up-to-date, and actionable data for machine learning training.

    Primary Use Cases and Verticals

    Natural Language Processing (NLP): Train models for named entity recognition (NER), text classification, sentiment analysis, and conversational AI. Ideal for chatbots, language models, and content categorization.

    Predictive Analytics and Recommendation Systems: Enable personalized marketing campaigns by predicting buyer behavior. Build smarter recommendation engines for ecommerce and content platforms.

    B2B Lead Generation and Market Insights: Create models that identify high-value leads using enriched company and contact information. Develop AI systems that track trends and provide strategic insights for businesses.

    HR and Talent Acquisition AI: Optimize talent-matching algorithms using structured job descriptions and candidate profiles. Build AI-powered platforms for recruitment analytics.

    How This Product Fits Into Xverum’s Broader Data Offering: Xverum is a leading provider of structured, high-quality web datasets. While we specialize in B2B profiles and company data, we also offer complementary datasets tailored for specific verticals, including ecommerce product data, job listings, and customer reviews. The AI Training Data is a natural extension of our core capabilities, bridging the gap between structured data and machine learning workflows. By providing annotation-ready datasets, real-time API access, and customization options, we ensure our clients can seamlessly integrate our data into their AI development processes.

    Why Choose Xverum? - Experience and Expertise: A trusted name in structured web data with a proven track record. - Flexibility: Datasets can be tailored for any AI/ML application. - Scalability: With 800M profiles and more being added, you’ll always have access to fresh, up-to-date data. - Compliance: We prioritize data ethics and security, ensuring all data adheres to GDPR and other legal frameworks.

    Ready to supercharge your AI and ML projects? Explore Xverum’s AI Training Data to unlock the potential of 800M global B2B profiles. Whether you’re building a chatbot, predictive algorithm, or next-gen AI application, our data is here to help.

    Contact us for sample datasets or to discuss your specific needs.

  8. Bitext-customer-support-llm-chatbot-training-dataset

    • huggingface.co
    • opendatalab.com
    + more versions
    Cite
    Bitext, Bitext-customer-support-llm-chatbot-training-dataset [Dataset]. https://huggingface.co/datasets/bitext/Bitext-customer-support-llm-chatbot-training-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset authored and provided by
    Bitext
    License

    https://choosealicense.com/licenses/cdla-sharing-1.0/

    Description

    Bitext - Customer Service Tagged Training Dataset for LLM-based Virtual Assistants

      Overview
    

    This hybrid synthetic dataset is designed to be used to fine-tune Large Language Models such as GPT, Mistral and OpenELM, and has been generated using our NLP/NLG technology and our automated Data Labeling (DAL) tools. The goal is to demonstrate how Verticalization/Domain Adaptation for the Customer Support sector can be easily achieved using our two-step approach to LLM… See the full description on the dataset page: https://huggingface.co/datasets/bitext/Bitext-customer-support-llm-chatbot-training-dataset.

  9. Replication Package: Tracking the Moving Target: A Framework for Continuous...

    • zenodo.org
    • ekoizpen-zientifikoa.ehu.eus
    application/gzip, bin +1
    Updated Apr 27, 2025
    Cite
    Maider Azanza Sesé; Maider Azanza Sesé; Beatriz Pérez Lamancha; Eneko Pizarro; Beatriz Pérez Lamancha; Eneko Pizarro (2025). Replication Package: Tracking the Moving Target: A Framework for Continuous Evaluation of LLM Test Generation in Industry [Dataset]. http://doi.org/10.5281/zenodo.15274212
    Explore at:
    pdf, bin, application/gzip (available download formats)
    Dataset updated
    Apr 27, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Maider Azanza Sesé; Maider Azanza Sesé; Beatriz Pérez Lamancha; Eneko Pizarro; Beatriz Pérez Lamancha; Eneko Pizarro
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Replication Package: Tracking the Moving Target: A Framework for Continuous Evaluation of LLM Test Generation in Industry

    Version: 1.0 (Date: April 27, 2025)
    DOI: https://doi.org/10.5281/zenodo.14779767

    Paper Information

    Title: Tracking the Moving Target: A Framework for Continuous Evaluation of LLM Test Generation in Industry
    Authors: Maider Azanza, Beatriz Pérez Lamancha, Eneko Pizarro
    Publication: International Conference on Evaluation and Assessment in Software Engineering (EASE), 2025 edition.

    Package Overview

    This repository contains the replication package for the research paper cited above. It provides the necessary data, source code, and prompts to understand, verify, and potentially extend our findings on the continuous evaluation of LLM-based test generation in an industrial context. The data reflects evaluations conducted between November 2024 and January 2025.

    Package Contents

    1. Metrics-Results-by-Function.7z (Archive, requires 7-Zip or a compatible tool to extract)

      • Description: Contains the detailed, raw, and processed metric results for each of the 7 Java methods and classes evaluated in the study.
      • Structure: Inside this archive, you will find 7 individual .zip files, one for each function (e.g., addUser-Metrics-Results.zip, assemble-Metrics-Results.zip, ...).
      • Contents (per function zip): Each function-specific zip file typically includes:
        • Raw test cases generated by the evaluated LLMs.
        • Metric measurements (e.g., code coverage reports from SonarQube/JaCoCo).
        • Analysis or intermediate conclusions specific to that function.
        • The specific prompt variations used for that function, if applicable beyond the main prompt.
      • Purpose: Allows for in-depth analysis of LLM performance on specific methods and verification of the metric collection process described in the paper. Data collected between November 2024 and January 2025.
    2. Metric Results by function Nov. 2024 - Jan.2025.pdf (PDF Document)

      • Description: Provides a consolidated tabular view of the key raw metrics collected for each function and LLM evaluated during the November 2024 - January 2025 period.
      • Contents: Tables summarizing metrics like code coverage, number of generated tests, expert assessment scores, etc., broken down by function and LLM. This data is directly derived from the detailed results in Metrics-Results-by-Function.7z.
      • Purpose: Offers a more detailed quantitative overview than the aggregated summary, facilitating direct comparison of raw performance metrics across functions and LLMs without needing to extract all archives.
    3. Aggregated Results by function Nov. 2024 - Jan.2025.pdf (PDF Document)

      • Description: Presents a high-level summary of the evaluation results across all tested methods and LLMs.
      • Contents: Includes an aggregated metric table showing overall performance trends, potentially including the weighted metrics discussed in the paper.
      • Purpose: Provides a quick overview of the main findings and comparative performance of the LLMs according to the evaluation framework.
    4. Prompt_for_Integration_Testing-2025.pdf (PDF Document)

      • Description: The final, refined version of the prompt provided to the LLMs for generating integration test cases.
      • Contents: Details the instructions, context (including source code snippets or descriptions), constraints, and desired output format given to the LLMs. Reflects the prompt-chaining methodology described in the paper.
      • Purpose: Enables understanding of how the LLMs were instructed and allows others to reuse or adapt the prompt engineering approach.
    5. sources.tar.gz (Compressed Tar Archive, requires tar or compatible tool to extract)

      • Description: Contains the original Java source code for the 7 methods that were the targets for test generation.
      • Contents:
        • The specific Java files containing the methods under test.
        • Relevant context or dependency information needed to understand the methods' functionality and complexity.
        • May include documentation (e.g., Javadoc) describing the intended behavior of each method.
      • Purpose: Provides the necessary code context for understanding the test generation task and potentially replicating the test execution or analysis.
  10. law-tasks

    • huggingface.co
    • opendatalab.com
    Updated Sep 20, 2023
    + more versions
    Cite
    AdaptLLM (2023). law-tasks [Dataset]. https://huggingface.co/datasets/AdaptLLM/law-tasks
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 20, 2023
    Authors
    AdaptLLM
    Description

    Adapting LLMs to Domains via Continual Pre-Training (ICLR 2024)

    This repo contains the evaluation datasets for our paper Adapting Large Language Models via Reading Comprehension. We explore continued pre-training on domain-specific corpora for large language models. While this approach enriches LLMs with domain knowledge, it significantly hurts their prompting ability for question answering. Inspired by human learning via reading comprehension, we propose a simple method to… See the full description on the dataset page: https://huggingface.co/datasets/AdaptLLM/law-tasks.

  11. Data from: Can Large Language Models Identify Locations Better Than Linked...

    • zenodo.org
    • portaldelaciencia.uva.es
    bin
    Updated Jun 5, 2025
    Cite
    Pablo García-Zarza; Pablo García-Zarza; Juan I. Asensio-Pérez; Juan I. Asensio-Pérez; Miguel L. Bote-Lorenzo; Miguel L. Bote-Lorenzo; Luis F. Sánchez-Turrión; Luis F. Sánchez-Turrión; Davide Taibi; Davide Taibi; Guillermo Vega-Gorgojo; Guillermo Vega-Gorgojo (2025). Can Large Language Models Identify Locations Better Than Linked Open Data for U-Learning? [Dataset]. http://doi.org/10.5281/zenodo.15600171
    Explore at:
    bin (available download formats)
    Dataset updated
    Jun 5, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Pablo García-Zarza; Pablo García-Zarza; Juan I. Asensio-Pérez; Juan I. Asensio-Pérez; Miguel L. Bote-Lorenzo; Miguel L. Bote-Lorenzo; Luis F. Sánchez-Turrión; Luis F. Sánchez-Turrión; Davide Taibi; Davide Taibi; Guillermo Vega-Gorgojo; Guillermo Vega-Gorgojo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository contains the dataset and analysis associated with the research paper "Can Large Language Models Identify Locations Better Than Linked Open Data for U-Learning?", presented at the 20th European Conference on Technology Enhanced Learning (ECTEL), 2025.

    Overview

    Ubiquitous learning (u-learning) applications often rely on identifying relevant Points of Interest (POIs) where students can engage in contextualized learning tasks. Traditionally, these POIs have been retrieved from structured datasets like Linked Open Data (LOD). However, with the rise of Large Language Models (LLMs), a new question arises: can LLMs outperform LOD in identifying such locations?

    This study compares the performance of a LOD dataset (Wikidata) and two LLMs (ChatGPT and DeepSeek) in retrieving 16th-century cultural heritage sites (churches, cathedrals, castles, and palaces) across three European cities (two in Spain and one in Italy) and their regions.

    Dataset

    The file LODvsLLMs.xlsx includes:

    • Raw data retrieved from Wikidata and the two LLMs.
    • SPARQL queries and LLM prompts used for data collection.
    • Comparative analysis across four key dimensions:
      • Accuracy: Are the retrieved sites real and verifiable?
      • Consistency: Do repeated queries yield stable results?
      • Completeness: How exhaustive are the lists of POIs?
      • Validity: Are the geographic coordinates and Wikipedia links correct?

    Key Findings

    • LOD (Wikidata) outperformed LLMs in terms of consistency, completeness (especially in larger regions), and validity of data.
    • LLMs were able to retrieve some POIs not found in Wikidata, but also introduced hallucinations and invalid links.
    • A hybrid approach combining LOD and LLMs is proposed for future u-learning applications to maximize coverage and reliability.

    Citation

    If you use this dataset or refer to the findings in your work, please cite the original paper presented at ECTEL 2025.

    García-Zarza, P., Asensio-Pérez, J.I., Bote-Lorenzo, M.L., Sánchez-Turrión, L.F., Taibi, D., Vega-Gorgojo, G. (2025). Can Large Language Models Identify Locations Better Than Linked Open Data for U-Learning? In Proceedings of the 20th European Conference on Technology Enhanced Learning (ECTEL 2025), Newcastle & Durham, United Kingdom, September 2025.

  12. Replication Data for: Advanced System Integration: Analyzing OpenAPI...

    • darus.uni-stuttgart.de
    Updated Dec 9, 2024
    Cite
    Robin D. Pesl; Jerin George Mathew; Massimo Mecella; Marco Aiello (2024). Replication Data for: Advanced System Integration: Analyzing OpenAPI Chunking for Retrieval-Augmented Generation [Dataset]. http://doi.org/10.18419/DARUS-4605
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 9, 2024
    Dataset provided by
    DaRUS
    Authors
    Robin D. Pesl; Jerin George Mathew; Massimo Mecella; Marco Aiello
    License

    https://darus.uni-stuttgart.de/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.18419/DARUS-4605

    Dataset funded by
    BMWK
    MWK
    Description

    Integrating multiple (sub-)systems is essential to create advanced Information Systems (ISs). Difficulties mainly arise when integrating dynamic environments across the IS lifecycle, e.g., services not yet existent at design time. A traditional approach is a registry that provides the API documentation of the systems’ endpoints. Large Language Models (LLMs) have shown to be capable of automatically creating system integrations (e.g., as service composition) based on this documentation but require concise input due to input token limitations, especially regarding comprehensive API descriptions. Currently, it is unknown how best to preprocess these API descriptions. Within this work, we (i) analyze the usage of Retrieval Augmented Generation (RAG) for endpoint discovery and the chunking, i.e., preprocessing, of state-of-practice OpenAPIs to reduce the input token length while preserving the most relevant information. To further reduce the input token length for the composition prompt and improve endpoint retrieval, we propose (ii) a Discovery Agent that only receives a summary of the most relevant endpoints and retrieves specification details on demand. We evaluate RAG for endpoint discovery using the RestBench benchmark, first, for the different chunking possibilities and parameters measuring the endpoint retrieval recall, precision, and F1 score. Then, we assess the Discovery Agent using the same test set. With our prototype, we demonstrate how to successfully employ RAG for endpoint discovery to reduce token count. While revealing high values for recall, precision, and F1, further research is necessary to retrieve all requisite endpoints. Our experiments show that for preprocessing, LLM-based and format-specific approaches outperform naïve chunking methods. Relying on an agent further enhances these results as the agent splits the tasks into multiple fine granular subtasks, improving the overall RAG performance in the token count, precision, and F1 score.

    Content:
    • code.zip: Python source code to perform the experiments.
      • evaluate.py: Script to execute the experiments (uncomment lines to select the embedding model).
      • socrag/*: Source code for the RAG.
      • benchmark/*: RestBench specification.
    • results.zip: Results of the RAG experiments (in the folder /results/data/ inside the zip file).
      • Experiment results for the RAG: results_{embedding_model}_{top-k}.json.
      • Experiment results for the Discovery Agent: results_{embedding_model}_{agent}_{refinement}_{llm}.json.
      • FAISS store (intermediate data required for exact reproduction of results; one folder for each embedding model): bge_small, nvidia, and oai.
      • Intermediate data of the LLM-based refinement methods required for the exact reproduction of results: *_parser.json.
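
    A minimal, generic sketch of the endpoint-retrieval metrics reported above (recall, precision, and F1 over retrieved vs. ground-truth endpoints); it illustrates the measurement, not the code shipped in code.zip:

    ```python
    def retrieval_metrics(retrieved: list[str], relevant: list[str]) -> dict[str, float]:
        """Precision, recall, and F1 for a set of retrieved endpoints against the ground truth."""
        retrieved_set, relevant_set = set(retrieved), set(relevant)
        hits = len(retrieved_set & relevant_set)
        precision = hits / len(retrieved_set) if retrieved_set else 0.0
        recall = hits / len(relevant_set) if relevant_set else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return {"precision": precision, "recall": recall, "f1": f1}

    # Hypothetical example: endpoints suggested by the RAG step vs. those a RestBench task requires.
    print(retrieval_metrics(
        retrieved=["GET /movie/{id}", "GET /search/movie", "GET /person/{id}"],
        relevant=["GET /search/movie", "GET /movie/{id}"],
    ))
    ```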

  13. Replication Package for the Paper: "An Insight into Security Code Review...

    • zenodo.org
    zip
    Updated Jun 2, 2025
    Cite
    Jiaxin Yu; Peng Liang; Yujia Fu; Amjed Tahir; Mojtaba Shahin; Chong Wang; Yangxiao Cai; Jiaxin Yu; Peng Liang; Yujia Fu; Amjed Tahir; Mojtaba Shahin; Chong Wang; Yangxiao Cai (2025). Replication Package for the Paper: "An Insight into Security Code Review with LLMs: Capabilities, Obstacles, and Influential Factors". [Dataset]. http://doi.org/10.5281/zenodo.15572151
    Explore at:
    zip (available download formats)
    Dataset updated
    Jun 2, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jiaxin Yu; Peng Liang; Yujia Fu; Amjed Tahir; Mojtaba Shahin; Chong Wang; Yangxiao Cai; Jiaxin Yu; Peng Liang; Yujia Fu; Amjed Tahir; Mojtaba Shahin; Chong Wang; Yangxiao Cai
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the replication package for the paper: "An Insight into Security Code Review with LLMs: Capabilities, Obstacles, and Influential Factors".

    The replication package is organized into three folders:

    1. RQ1 Performance of LLMs

    - Five prompt templates.pdf
    This PDF demonstrates the detailed structures of the five prompt templates designed in Section 3.3.2 of our paper.

    - source code of the Python and C/C++ datasets
    This folder contains the source code of the Python and C/C++ datasets, used to construct prompts and apply the baseline tools for static analysis.

    - prompts for the Python and C/C++ datasets
    This folder contains the prompts constructed from the source code of the Python and C/C++ datasets based on the five prompt templates.

    - responses of LLMs and baselines
    This folder contains the responses generated by LLMs for each prompt and the analysis results of baseline tools. For CodeQL, you need to upload results.sarif to GitHub (https://docs.github.com/en/code-security/code-scanning/integrating-with-code-scanning/uploading-a-sarif-file-to-github) to view the analysis results. For SonarQube, you need to import the export file into an Enterprise Edition or higher instance of the same version (v10.5 in our work) and similar configuration (default configuration in our work) to view the analysis results.

    - entropy_calculation.py
    This Python script calculates the average entropy of each LLM-prompt combination to measure the consistency of LLM responses across the three repeated experiments (a generic sketch of this calculation appears at the end of this RQ1 listing).

    - Data Labelling for the C/C++ Dataset.xlsx
    - Data Labelling for the Python Dataset.xlsx
    The two Microsoft (MS) Excel files contain the labeling results for LLMs and baselines on the C/C++ and Python datasets, including the category of each response generated by an LLM for each prompt, as well as the category of each analysis result generated by a baseline for each code file. The four categories (i.e., Instrumental, Helpful, Misleading, and Uncertain) are defined in Section 3.3.3 of our paper as the labelling criteria.

    How to Read the MS Excel files:
    Both MS Excel files contain 5 sheets. The first sheet ('all_c++_data' or 'all_python_data') includes the information of all data in each dataset. The sheets 'first round', 'second round' and 'third round' represent the labelling results for LLMs under five prompts in three repetitive experiments. The sheet 'Baselines' include the labelling results for baseline tools.

    Columns in each sheet:
    • File ID: the identifier of each code file in our dataset.
    • Security Defect: the security defect(s) that the code file contains.
    • Project: the source project of the code file.
    • Suffix: the suffix of the code file.
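
    A generic sketch of the consistency measure described for entropy_calculation.py above: the Shannon entropy of the label distribution across the three repetitions, averaged over prompts (an illustration of the idea, not the released script):

    ```python
    from collections import Counter
    from math import log2

    def avg_entropy(labels_per_prompt: list[list[str]]) -> float:
        """Average entropy over prompts; each inner list holds one prompt's labels across repeats."""
        def entropy(labels: list[str]) -> float:
            counts = Counter(labels)
            total = sum(counts.values())
            return -sum((c / total) * log2(c / total) for c in counts.values())
        return sum(entropy(labels) for labels in labels_per_prompt) / len(labels_per_prompt)

    # Identical labels across the three runs give 0 bits; a 2-1 split gives ~0.918 bits.
    print(avg_entropy([
        ["Instrumental", "Instrumental", "Instrumental"],
        ["Helpful", "Misleading", "Helpful"],
    ]))
    ```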

    2. RQ2 Quality Problem in Responses
    - data_analysis_first_round.mx22
    - data_analysis_second_round.mx22
    - data_analysis_third_round.mx22

    These three MAXQDA project files contain the results of data extraction for quality problems present in responses generated by the best-performing LLM-prompt combination across three repetitive experiments. These files can be opened with MAXQDA 2022 or higher versions, available for download at https://www.maxqda.com/. You may also use the free 14-day trial version of MAXQDA 2024, available for download at https://www.maxqda.com/trial.

    3. RQ3 Factor influencing LLMs
    This folder contains two sub-folders:

    - Step 1 - correlation analysis
    Files in this subfolder are for conducting correlation analysis for explanatory variables through a Python script.

    - Step 2 - redundancy analysis and model fitting
    Files in this subfolder are for conducting redundancy analysis, allocation of degree of freedoms, model fitting and evaluation through an R script. Detailed instructions for running the R script can be found in readme.md in this subfolder.

  14. LORE PMKB-CV

    • zenodo.org
    bin
    Updated Mar 5, 2025
    Cite
    Peng-Hsuan Li; Peng-Hsuan Li (2025). LORE PMKB-CV [Dataset]. http://doi.org/10.5281/zenodo.14607639
    Explore at:
    bin (available download formats)
    Dataset updated
    Mar 5, 2025
    Dataset provided by
    Taiwan AI Labs
    Authors
    Peng-Hsuan Li; Peng-Hsuan Li
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Time period covered
    Jan 7, 2025
    Description

    LORE PMKB-CV

    • Knowledge graph (LLM-ORE)
      • 70M relations between 8k Diseases (MeSH) and 18k Genes (NCBI, human protein coding) curated by LLMs reading PubMed
      • Data format: (D_id, G_id, PMID, relation) csv file
    • Semantic embedding (LLM-EMB)
      • 2.5M DG vectors created by LLMs reading the knowledge graph
      • Data format: (D_id, G_id, vector) pkl file (see the loading sketch after this list)
    • DG pathogenicity scores (ML-Ranker)
      • 3.1M DG scores predicted by pretrained models
      • Features, training annotations, pretrained models are also provided
    • Curated key semantics taxonomy
      • A manually curated taxonomy of 105 semantic tags about DG pathogenicity in the knowledge graph
      • Use the GitHub LORE Key-Semantics module to apply the taxonomy as tags and add them to the knowledge graph
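
    A minimal loading sketch for the two file formats listed above (a (D_id, G_id, PMID, relation) CSV and a (D_id, G_id, vector) pickle); the file names are hypothetical and the header assumption should be checked against the release:

    ```python
    import pickle
    import pandas as pd

    # Knowledge-graph relations: (D_id, G_id, PMID, relation) rows in a CSV file.
    relations = pd.read_csv("lore_pmkb_cv_relations.csv",
                            names=["D_id", "G_id", "PMID", "relation"])
    print(relations.head())

    # Semantic embeddings: (D_id, G_id, vector) records in a pickle file.
    with open("lore_pmkb_cv_embeddings.pkl", "rb") as f:
        embeddings = pickle.load(f)  # exact structure depends on the release; inspect before use
    ```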

    Source project

  15. DOVE

    • huggingface.co
    Updated Mar 2, 2025
    + more versions
    Cite
    nlphuji (2025). DOVE [Dataset]. https://huggingface.co/datasets/nlphuji/DOVE
    Explore at:
    Dataset updated
    Mar 2, 2025
    Dataset authored and provided by
    nlphuji
    License

    https://choosealicense.com/licenses/cdla-permissive-2.0/

    Description

    🕊️ DOVE: A Large-Scale Multi-Dimensional Predictions Dataset Towards Meaningful LLM Evaluation

    🌐 Project Website | 📄 Read our paper

      Updates 📅
    

    2025-06-11: Added Llama 70B evaluations with ~5,700 MMLU examples across 100 different prompt variations (= 570K new predictions!), based on data from ReliableEval: A Recipe for Stochastic LLM Evaluation via Method of Moments
    2025-04-12: Added MMLU predictions from dozens of models including OpenAI, Qwen, Mistral, Gemini… See the full description on the dataset page: https://huggingface.co/datasets/nlphuji/DOVE.

  16. Data from: Medical Expert Annotations of Unsupported Facts in Doctor-Written...

    • physionet.org
    Updated Apr 30, 2025
    Cite
    Stefan Hegselmann; Shannon Shen; Florian Gierse; Monica Agrawal; David Sontag; Xiaoyi Jiang (2025). Medical Expert Annotations of Unsupported Facts in Doctor-Written and LLM-Generated Patient Summaries [Dataset]. http://doi.org/10.13026/gedc-j464
    Explore at:
    Dataset updated
    Apr 30, 2025
    Authors
    Stefan Hegselmann; Shannon Shen; Florian Gierse; Monica Agrawal; David Sontag; Xiaoyi Jiang
    License

    https://github.com/MIT-LCP/license-and-dua/tree/master/drafts

    Description

    Large language models in healthcare can generate informative patient summaries while reducing the documentation workload of healthcare professionals. However, these models are prone to producing hallucinations, that is, generating unsupported information, which is problematic in the sensitive healthcare domain. To better characterize unsupported facts in medical texts, we developed a rigorous labeling protocol. Following this protocol, two medical experts annotated unsupported facts in 100 doctor-written summaries from the MIMIC-IV-Note Discharge Instructions and hallucinations in 100 LLM-generated patient summaries. Here, we are releasing two datasets based on these annotations: Hallucinations-MIMIC-DI and Hallucinations-Generated-DI. We find that using these datasets to train on hallucination-free examples effectively reduces hallucinations for both Llama 2 (2.60 to 1.55 hallucinations per summary) and GPT-4 (0.70 to 0.40). Furthermore, we created a preprocessed version of the MIMIC-IV-Notes Discharge Instructions, releasing both a full-context version (MIMIC-IV-Note-Ext-DI) and a version that only uses the Brief Hospital Course for context (MIMIC-IV-Note-Ext-DI-BHC).

  17. finance-tasks

    • huggingface.co
    • opendatalab.com
    Updated Dec 31, 2011
    Cite
    AdaptLLM (2011). finance-tasks [Dataset]. https://huggingface.co/datasets/AdaptLLM/finance-tasks
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 31, 2011
    Authors
    AdaptLLM
    Description

    Adapting LLMs to Domains via Continual Pre-Training (ICLR 2024)

    This repo contains the evaluation datasets for our paper Adapting Large Language Models via Reading Comprehension. We explore continued pre-training on domain-specific corpora for large language models. While this approach enriches LLMs with domain knowledge, it significantly hurts their prompting ability for question answering. Inspired by human learning via reading comprehension, we propose a simple method to… See the full description on the dataset page: https://huggingface.co/datasets/AdaptLLM/finance-tasks.

  18. Simulation Parameters.

    • plos.figshare.com
    xls
    Updated Oct 7, 2024
    Cite
    Adrielli Tina Lopes Rego; Joshua Snell; Martijn Meeter (2024). Simulation Parameters. [Dataset]. http://doi.org/10.1371/journal.pcbi.1012117.t001
    Explore at:
    xls (available download formats)
    Dataset updated
    Oct 7, 2024
    Dataset provided by
    PLOS Computational Biology
    Authors
    Adrielli Tina Lopes Rego; Joshua Snell; Martijn Meeter
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Although word predictability is commonly considered an important factor in reading, sophisticated accounts of predictability in theories of reading are lacking. Computational models of reading traditionally use cloze norming as a proxy of word predictability, but what cloze norms precisely capture remains unclear. This study investigates whether large language models (LLMs) can fill this gap. Contextual predictions are implemented via a novel parallel-graded mechanism, where all predicted words at a given position are pre-activated as a function of contextual certainty, which varies dynamically as text processing unfolds. Through reading simulations with OB1-reader, a cognitive model of word recognition and eye-movement control in reading, we compare the model’s fit to eye-movement data when using predictability values derived from a cloze task against those derived from LLMs (GPT-2 and LLaMA). Root Mean Square Error between simulated and human eye movements indicates that LLM predictability provides a better fit than cloze. This is the first study to use LLMs to augment a cognitive model of reading with higher-order language processing while proposing a mechanism on the interplay between word predictability and eye movements.
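
    A minimal sketch of deriving word predictability from GPT-2, i.e., the probability the model assigns to the upcoming word given the preceding context, which is the kind of value compared against cloze norms above (details such as handling multi-token words are simplified here):

    ```python
    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    model.eval()

    def next_word_probability(context: str, word: str) -> float:
        """Probability GPT-2 assigns to `word` continuing `context` (first subtoken only)."""
        context_ids = tokenizer(context, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(context_ids).logits[0, -1]      # distribution over the next token
        probs = torch.softmax(logits, dim=-1)
        word_id = tokenizer(" " + word).input_ids[0]       # leading space marks a word boundary in GPT-2 BPE
        return probs[word_id].item()

    print(next_word_probability("The cat sat on the", "mat"))
    ```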

  19. Synthetic Colorectal Cancer Global Dataset

    • opendatabay.com
    .undefined
    Updated Jun 28, 2025
    Cite
    Opendatabay Labs (2025). Synthetic Colorectal Cancer Global Dataset [Dataset]. https://www.opendatabay.com/data/synthetic/ae2aba99-491d-45a1-a99e-7be14927f4af
    Explore at:
    .undefined (available download formats)
    Dataset updated
    Jun 28, 2025
    Dataset provided by
    Buy & Sell Data | Opendatabay - AI & Synthetic Data Marketplace
    Authors
    Opendatabay Labs
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Patient Health Records & Digital Health
    Description

    The Synthetic Colorectal Cancer Global Dataset is a fully anonymised, high-dimensional synthetic dataset designed for global cancer research, predictive modelling, and educational use. It encompasses demographic, clinical, lifestyle, genetic, and healthcare access factors relevant to colorectal cancer incidence, outcomes, and survivability.

    Dataset Features

    • Patient_ID: Unique identifier for each patient.
    • Country: Patient's country of residence.
    • Age: Age at diagnosis (in years).
    • Gender: Biological sex of the patient (Male/Female/Other).
    • Cancer_Stage: Stage of colorectal cancer at diagnosis (e.g., Stage I–IV).
    • Tumor_Size_mm: Size of the tumor in millimeters.
    • Family_History: Presence of colorectal cancer in family history (True/False).
    • Smoking_History: Smoking behavior or history (e.g., Current, Former, Never).
    • Alcohol_Consumption: Level of alcohol consumption (e.g., High, Moderate, None).
    • Obesity_BMI: BMI classification related to obesity.
    • Diet_Risk: Diet-related cancer risk (e.g., High Fat, Low Fiber).
    • Physical_Activity: Level of physical activity (e.g., Sedentary, Active).
    • Diabetes: Diabetes diagnosis (True/False).
    • Inflammatory_Bowel_Disease: Presence of IBD (True/False).
    • Genetic_Mutation: Genetic mutations relevant to colorectal cancer (e.g., APC, KRAS).
    • Screening_History: History of cancer screenings (True/False).
    • Early_Detection: Whether cancer was detected early (True/False).
    • Treatment_Type: Primary treatment type (e.g., Surgery, Chemotherapy, Radiation).
    • Survival_5_years: 5-year survival status (True/False).
    • Mortality: Mortality outcome (Alive/Deceased).
    • Healthcare_Costs: Estimated treatment costs (in USD).
    • Incidence_Rate_per_100K: Country-level incidence rate per 100,000 people.
    • Mortality_Rate_per_100K: Country-level mortality rate per 100,000 people.
    • Urban_or_Rural: Patient's living area (Urban/Rural).
    • Economic_Classification: Country's economic level (e.g., Low, Middle, High income).
    • Healthcare_Access: Access level to healthcare services (e.g., Good, Limited).
    • Insurance_Status: Insurance coverage status (Insured/Uninsured).
    • Survival_Prediction: Model-derived survival prediction (probability or binary).

    Distribution

    Distribution plot (Synthetic Colorectal Cancer Global Data Distribution): https://storage.googleapis.com/opendatabay_public/ae2aba99-491d-45a1-a99e-7be14927f4af/299af3fa2502_patient_analysis_plots.png

    Usage

    This dataset can be used for:

    • Global Cancer Research: Analyze how clinical, lifestyle, and socioeconomic factors affect colorectal cancer outcomes worldwide.
    • Predictive Modeling: Develop models to estimate survival probability or treatment outcomes (see the sketch after this list).
    • Healthcare Policy Analysis: Study disparities in healthcare access and outcomes across countries.
    • Educational Use: Support training in epidemiology, oncology, public health, and machine learning.
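
    As noted under Predictive Modeling above, a minimal sketch of fitting a survival classifier on a handful of the listed columns (the file name is hypothetical, and the dataset is assumed to load as a flat CSV):

    ```python
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder

    df = pd.read_csv("synthetic_colorectal_cancer_global.csv")  # hypothetical file name

    features = ["Age", "Cancer_Stage", "Tumor_Size_mm", "Family_History",
                "Smoking_History", "Screening_History", "Treatment_Type"]
    categorical = ["Cancer_Stage", "Smoking_History", "Treatment_Type"]

    pipeline = Pipeline([
        ("encode", ColumnTransformer(
            [("cat", OneHotEncoder(handle_unknown="ignore"), categorical)],
            remainder="passthrough")),
        ("model", RandomForestClassifier(n_estimators=200, random_state=0)),
    ])

    X_train, X_test, y_train, y_test = train_test_split(
        df[features], df["Survival_5_years"], test_size=0.2, random_state=0)
    pipeline.fit(X_train, y_train)
    print("held-out accuracy:", pipeline.score(X_test, y_test))
    ```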

    Coverage

    The dataset includes 100% synthetic yet clinically plausible records from diverse countries and demographic groups. It is anonymized and modeled to reflect real-world variability in risk factors, diagnosis stages, treatment, and survival without compromising patient privacy.

    License

    CC0 (Public Domain)

    Who Can Use It

    • Epidemiologists and Medical Researchers: To explore global patterns in colorectal cancer.
    • Public Health Experts and Policymakers: For assessing equity in healthcare access and cancer outcomes.
    • Data Scientists and Educators: As a rich dataset for teaching data analysis, classification, regression, and health informatics.
  20. Replication Data for: Large Language Models as a Substitute for Human...

    • search.dataone.org
    Updated Mar 6, 2024
    Cite
    Heseltine, Michael (2024). Replication Data for: Large Language Models as a Substitute for Human Experts in Annotating Political Text [Dataset]. http://doi.org/10.7910/DVN/V2P6YL
    Explore at:
    Dataset updated
    Mar 6, 2024
    Dataset provided by
    Harvard Dataverse
    Authors
    Heseltine, Michael
    Description

    Large-scale text analysis has grown rapidly as a method in political science and beyond. To date, text-as-data methods rely on large volumes of human-annotated training examples, which places a premium on researcher resources. However, advances in large language models (LLMs) may make automated annotation increasingly viable. This paper tests the performance of GPT-4 across a range of scenarios relevant for analysis of political text. We compare GPT-4 coding with human expert coding of tweets and news articles across four variables (whether text is political, its negativity, its sentiment, and its ideology) and across four countries (the United States, Chile, Germany, and Italy). GPT-4 coding is highly accurate, especially for shorter texts such as tweets, correctly classifying texts up to 95% of the time. Performance drops for longer news articles, and very slightly for non-English text. We introduce a "hybrid" coding approach, in which disagreements of multiple GPT-4 runs are adjudicated by a human expert, which boosts accuracy. Finally, we explore downstream effects, finding that transformer models trained on hand-coded or GPT-4-coded data yield almost identical outcomes. Our results suggest that LLM-assisted coding is a viable and cost-efficient approach, although consideration should be given to task complexity.
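
    A minimal sketch of the "hybrid" coding idea described above: run GPT-4 several times per text, keep the label when the runs agree, and queue disagreements for a human expert (pure illustration; no API calls, labels and IDs are made up):

    ```python
    from collections import Counter

    def hybrid_code(runs_per_text: dict[str, list[str]]) -> tuple[dict[str, str], list[str]]:
        """Accept unanimous GPT-4 labels; route everything else to a human adjudication queue."""
        accepted, needs_human = {}, []
        for text_id, labels in runs_per_text.items():
            label, freq = Counter(labels).most_common(1)[0]
            if freq == len(labels):          # all runs agree
                accepted[text_id] = label
            else:                            # disagreement -> human expert adjudicates
                needs_human.append(text_id)
        return accepted, needs_human

    accepted, queue = hybrid_code({
        "tweet_001": ["political", "political", "political"],
        "tweet_002": ["negative", "neutral", "negative"],
    })
    print(accepted)   # {'tweet_001': 'political'}
    print(queue)      # ['tweet_002']
    ```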
