-SFT: Nexdata assists clients in generating high-quality supervised fine-tuning data for model optimization through prompt and output annotation.
-Red teaming: Nexdata helps clients train and validate models by drafting various adversarial attacks, such as exploratory or potentially harmful questions. Our red-team capabilities help clients identify problems in their models related to hallucinations, harmful content, false information, discrimination, language bias, etc.
-RLHF: Nexdata assists clients in manually ranking multiple outputs generated by the SFT-trained model according to the rules provided by the client, or in providing multi-factor scoring. By training annotators to align with the client's values and using a multi-annotator approach, the quality of feedback can be improved.
-Compliance: All Large Language Model (LLM) data is collected with proper authorization.
-Quality: Multiple rounds of quality inspection ensure high-quality data output.
-Secure Implementation: An NDA is signed to guarantee secure implementation, and data is destroyed upon delivery.
-Efficiency: Our platform supports human-machine interaction and semi-automatic labeling, increasing labeling efficiency by more than 30% per annotator. It has successfully been applied to nearly 5,000 projects.
3. About Nexdata: Nexdata is equipped with professional data collection devices, tools, and environments, as well as project managers experienced in data collection and quality control, so we can meet Large Language Model (LLM) data collection requirements across a wide range of scenarios and data types. We have global data processing centers and more than 20,000 professional annotators, supporting on-demand Large Language Model (LLM) data annotation services such as speech, image, video, point cloud, and Natural Language Processing (NLP) data. Please visit us at https://www.nexdata.ai/?source=Datarade
License: CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)
By Huggingface Hub [source]
This dataset, TokenBender: 122k Alpaca-Style Instructions Word-Level Classification Towards Accurate Natural Language Understanding, provides a comprehensive collection of 122k Alpaca-style instructions with their associated input, text, and output fields for word-level classification. It supports natural language understanding research by covering entries from diverse areas, such as programming and gaming instructions, written at varying levels of complexity. Developers applying natural language processing techniques can use it to study how to improve the accuracy of machine comprehension of human language commands, and to build models such as neural networks or decision trees that understand commands across languages, helping bridge the gap between machines and humans for practical purposes.
This dataset contains 122k Alpaca-style instructions with their corresponding input, text, and output for word-level classification. It is a valuable resource for anyone who wishes to gain insight into natural language understanding through data science approaches. This guide provides some tips on how to use the dataset to maximize accuracy and build a better understanding of natural language.
Preprocessing: Cleaning the data is an essential step for any text dataset, including the Alpaca instructions. This involves removing stopwords (articles, pronouns, etc.), normalizing words through lowercasing or lemmatization, filtering for terms relevant to the context or problem you are trying to solve, and finally tokenizing the remaining text into individual pieces that can serve as input features for different models; SentencePiece works well for this kind of task.
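As a rough illustration of this step, the sketch below uses NLTK stopwords and a SentencePiece subword model; both library choices and the instructions.txt file name are assumptions for demonstration, not part of the dataset.

```python
# Minimal preprocessing sketch; NLTK and SentencePiece are illustrative choices,
# not requirements of the dataset.
import re

import nltk
import sentencepiece as spm

nltk.download("stopwords", quiet=True)
from nltk.corpus import stopwords

STOPWORDS = set(stopwords.words("english"))

def clean(text: str) -> str:
    """Lowercase, strip punctuation, and drop common stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(t for t in tokens if t not in STOPWORDS)

# Assume the instructions have been dumped one per line to instructions.txt (hypothetical file).
spm.SentencePieceTrainer.train(
    input="instructions.txt", model_prefix="alpaca_sp", vocab_size=8000
)
sp = spm.SentencePieceProcessor(model_file="alpaca_sp.model")

example = clean("Write a Python function that reverses a string.")
print(sp.encode(example, out_type=str))  # subword pieces ready to feed a model
```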
Feature extraction: After preprocessing the text, extract informative features using techniques such as Bag-of-Words (BoW) or a Term Frequency-Inverse Document Frequency (TF-IDF) vectorizer, which help capture the context behind each instruction sentence or word in the corpus. Embedding techniques such as word2vec or GloVe can also extract semantic information from the instructions and help build classifiers that predict word-level categories (semantic segmentation-style tagging).
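For example, a minimal scikit-learn sketch of TF-IDF feature extraction might look like this (the toy sentences are placeholders, not entries from the dataset):

```python
# Illustrative TF-IDF feature extraction with scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer

instructions = [
    "Sort the list of numbers in ascending order.",
    "Translate the following sentence into French.",
    "Write a unit test for the parser module.",
]

vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=1)
X = vectorizer.fit_transform(instructions)   # sparse (n_docs, n_terms) matrix
print(X.shape, len(vectorizer.get_feature_names_out()))
```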
Model selection: Depending on the problem setup, architectures such as Support Vector Machines (SVMs), Conditional Random Fields (CRFs), or attention-based models work well for NLP tasks at the sentence level or at shallow-representation levels such as part-of-speech tagging. If capturing which words are used together matters most, an RNN model such as an LSTM or GRU can do wonders: its recurrent structure stores contextual information more effectively than the BoW or TF-IDF vector spaces built separately during feature engineering.
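As a sketch of the LSTM option, assuming the instructions have already been converted to padded word-id sequences with per-word tag ids (the vocabulary size, tag count, and sequence length below are placeholder values), a Keras bidirectional LSTM tagger could be defined like this:

```python
# Sketch of a word-level sequence tagger: embedding + bidirectional LSTM + per-token softmax.
# VOCAB_SIZE, NUM_TAGS, and MAX_LEN are placeholder values, not dataset constants.
import tensorflow as tf

VOCAB_SIZE, NUM_TAGS, MAX_LEN = 10_000, 12, 64

model = tf.keras.Sequential([
    tf.keras.Input(shape=(MAX_LEN,)),
    tf.keras.layers.Embedding(VOCAB_SIZE, 128, mask_zero=True),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True)),
    tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(NUM_TAGS, activation="softmax")),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()
# model.fit(token_ids, tag_ids, validation_split=0.1, epochs=5) once the data is encoded.
```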
Evaluating results: After choosing a model, evaluate it with measures such as the F1 score so you can track progress toward the end goal and adjust if precision or recall drops significantly below a chosen threshold. Hold out an uncategorized sample of documents, or use train/test splits of the dataset, to confirm that the results generalize.
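Concretely, scikit-learn's metrics make this straightforward; the toy labels below are placeholders standing in for the tagger's held-out predictions:

```python
# Precision/recall/F1 on a held-out split; toy labels stand in for real predictions.
from sklearn.metrics import classification_report, f1_score

y_true = ["VERB", "NOUN", "NOUN", "DET", "VERB", "NOUN"]
y_pred = ["VERB", "NOUN", "DET",  "DET", "VERB", "VERB"]

print(f1_score(y_true, y_pred, average="macro"))
print(classification_report(y_true, y_pred, zero_division=0))
```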
- Developing an AI-based algorithm capable of accurately understanding the meaning of natural language instructions.
- Using this dataset for training and testing machine learning models to classify specific words and phrases within natural language instructions.
- Training a deep learning model to generate visual components based on the given input, text, and output values from this dataset
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal
License: Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
With the wide application of large models in various fields, the demand for high-quality datasets in the tourism industry is increasing to support improvements in a model's ability to understand and generate tourism information. This dataset focuses on textual data in the tourism domain and is designed to support fine-tuning tasks for tourism-oriented large models, aiming to enhance the model's ability to understand and generate tourism-related information. The diversity and quality of the dataset are critical to the model's performance. Therefore, this study combines web scraping and manual annotation, along with data cleaning, denoising, and stopword removal, to ensure high data quality and accuracy. Additionally, automated annotation tools are used to generate instructions and perform consistency checks on the texts. The LLM-Tourism dataset primarily relies on data from Ctrip and Baidu Baike, covering five Northwestern Chinese provinces: Gansu, Ningxia, Qinghai, Shaanxi, and Xinjiang, and contains 53,280 pairs of structured data in JSON format. The creation of this dataset will not only improve the generation accuracy of tourism large models but also contribute to the sharing and application of tourism-related datasets in the field of large models.
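As an illustration only, assuming each JSON entry pairs an instruction-style question with a reference answer (the file name and the instruction/output field names below are assumptions, not the published schema), the pairs could be loaded for fine-tuning like this:

```python
# Hypothetical loader for the LLM-Tourism JSON pairs; the file name and the
# "instruction"/"output" field names are assumptions, not the published schema.
import json

with open("llm_tourism.json", encoding="utf-8") as f:
    pairs = json.load(f)

print(len(pairs))  # expected: 53,280 structured pairs
for pair in pairs[:3]:
    prompt, answer = pair["instruction"], pair["output"]
    print(prompt[:60], "->", answer[:60])
```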
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
config.json) between commits to identify added, deleted, and updated keys, which served as input for classification.

- code/: Contains all Jupyter notebooks for the study.
  - Collection/: Scripts for data extraction from Hugging Face.
    - HFExtraction.ipynb: Collects primary model and commit information.
    - HFReleasesExtraction.ipynb: Collects release (tag) specific information.
  - Preprocessing/: Scripts for data cleaning, processing, and classification.
    - HFCommitsPreprocessing.ipynb: Processes commits, computes diffs, and prepares data for classification and analysis.
    - HFReleasesPreprocessing.ipynb: Processes and classifies release data.
    - SwansonsClassification.ipynb: Classifies commits into corrective, perfective, and adaptive types.
  - Analysis/: Notebooks for reproducing the analysis for each research question.
    - HFFileChanges.ipynb: Contains the preliminary analysis of file change patterns.
    - RQ1_Analysis.ipynb: Analysis for Research Question 1.
    - RQ2_Analysis.ipynb: Analysis for Research Question 2.
    - RQ3_Analysis.ipynb: Analysis for Research Question 3.
- datasets/: Contains the key final datasets used in the analysis notebooks.
  - commits_datasets/: Contains the main classified commit dataset.
    - HFCommitsClassification_final.csv: The final dataset with over 960,000 classified commits for RQ1 and RQ2.
  - releases_datasets/: Contains the datasets related to releases.
    - HFReleasesClassification.csv: The final dataset of 2,251 classified releases for RQ3.
    - model_metadata.csv: The extracted internal metadata from model files for RQ3.
- metadata/: Contains configuration files and the data used for the validation process.
  - validation_data/: A sub-folder containing the gold standard data.
    - Agreement TOSEM Commit Changes.xlsx: Excel file containing details of the classification and validation processes.
    - prompt_refinement.txt: The final, validated prompt used for the LLM classification, along with its previous iterations.
    - training_set_ground_truth.json: Gold standard for the 143-commit training set.
    - training_set_first_classification.json: First annotator's labels for the training IRR subset.
    - training_set_second_classification.json: Second annotator's labels for the training IRR subset.
    - test_set_ground_truth.json: Gold standard for the 384-commit test set.
    - test_first_classification.json: First annotator's labels for the test IRR subset.
    - test_set_second_classification.json: Second annotator's labels for the test IRR subset.
  - tags_metadata.yaml: Auxiliary metadata file used during preprocessing.
- README.md: This file.
- requirements.txt: Lists the required Python packages.
```bash
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```

```bash
pip install -r requirements.txt
```
<div>code/ directory are numbered and named to be run in a logical sequence: Collection -> Preprocessing -> Analysis. We recommend following this order.code/Analysis/. They are configured to load the final, processed datasets provided in the datasets/ folder.code/Collection/, followed by code/Preprocessing/. Please note that running the full data collection and classification pipeline is time-consuming and may require significant computational resources and appropriate API keys for the LLM.datasets/commits_datasets/HFCommitsClassification_final.csv (100,000 models for RQ1; filtered to 14,343 models for RQ2).datasets/releases_datasets/HFReleasesClassification.csv (2,251 releases from 202 models).datasets/releases_datasets/model_metadata.csv (from 28 models).
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset and code package supports the reproducible evaluation of structured Large Language Model (LLM)-generated event messaging using multiple machine learning classifiers, including BERT (via TensorFlow/Keras), XGBoost, and ensemble methods. The contents are described below.
The materials accompany the study presented in [Lynch, Christopher, Erik Jensen, Ross Gore, et al. "AI-Generated Messaging for Life Events Using Structured Prompts: A Comparative Study of GPT With Human Experts and Machine Learning." TechRxiv (2025), DOI: https://doi.org/10.36227/techrxiv.174123588.85605769/v1], where Structured Narrative Prompting was applied to generate life-event messages from LLMs, followed by human annotation, and machine learning validation. This release provides complete transparency for reproducing reported metrics and facilitates further benchmarking in multilingual or domain-specific contexts.
Value of the Data:
- Enables direct replication of published results across BERT, Keras-based models, XGBoost, and ensemble classifiers.
- Provides clean, human-tagged datasets suitable for training, evaluation, and bias analysis.
- Offers untagged datasets for new annotation or domain adaptation.
- Contains full preprocessing, training, and visualization code in Python and R for flexibility across workflows.
- Facilitates extension into other domains (e.g., multilingual LLM messaging validation).
Data Description:
- /data/tagged/*.csv – Human-labeled datasets with the schema defined in data_dictionary.csv.
- /data/untagged/*.csv – Clean datasets without labels, for inference or annotation.
- /code/python/ – Python scripts for preprocessing, model training (BERT, Keras DNN, XGBoost), ensembling, evaluation metrics, and plotting.
- /code/r/ – R scripts for exploratory data analysis, statistical testing, and replication of key figures/tables.
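A minimal sketch of one such baseline, assuming a text column and a label column (the file and column names below are placeholders; the real schema is given in data_dictionary.csv), might combine TF-IDF features with an XGBoost classifier:

```python
# Illustrative baseline on the tagged data: TF-IDF features + XGBoost classifier.
# The file name "events.csv" and columns "message"/"label" are placeholders; see data_dictionary.csv.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

df = pd.read_csv("data/tagged/events.csv")
X = TfidfVectorizer(max_features=5000).fit_transform(df["message"])
y = df["label"].astype("category").cat.codes

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf = XGBClassifier(n_estimators=300, max_depth=6, eval_metric="logloss")
clf.fit(X_train, y_train)
print(f1_score(y_test, clf.predict(X_test), average="macro"))
```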
File Formats:
- Data: CSV (UTF-8, RFC 4180)
- Code: .py, .R, .Rproj

Ethics & Licensing:
- All data are de-identified and contain no PII.
- Released under CC BY 4.0 (data) and the MIT License (code).

Limitations:
- Labels reflect annotator interpretations and may encode bias.
- Models are trained on English text; generalization to other languages requires adaptation.

Funding Note:
- Funding sources provided time in support of the human taggers who annotated the datasets.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository provides the data supporting our study, in which we use large language models (LLMs) to analyze the impact of congressional witness testimony. The dataset has been curated and structured to facilitate reproducibility and encourage further research in this domain.
The repository includes the results of our study (see `Results.zip`), the fine-tuning corpus (see `Model Training Data.zip`), and the Witness and Legislative History and Impact Corpus (WLHIC), which can be subdivided into the Witness Corpus (WC) and the Legislative History and Impact Corpus (LHIC). For the LHIC and WC, we provide cleaned JSONL files containing the full datasets, individual text files of each document, and accompanying metadata (see `WLHIC data.zip`). To ensure comprehensive accessibility, we also include the original PDF versions of the documents in these corpora (see `WLHIC Raw Files.zip`).
We also provide the sentence transformer model resulting from the extended pretraining process and the model resulting from the fine-tuning process. Both are accessible in `Models.zip`.
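For instance, once `Models.zip` is extracted, the extended-pretrained sentence transformer can be loaded with the sentence-transformers library; the local path below is a placeholder for wherever the model folder is unpacked:

```python
# Load the released sentence-transformer from a local folder (path is a placeholder).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("Models/extended_pretrained_model")
embeddings = model.encode([
    "Testimony urging stricter emissions standards.",
    "Bill text amending the Clean Air Act.",
])
print(util.cos_sim(embeddings[0], embeddings[1]))  # semantic similarity of the two texts
```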
Researchers can use the provided data to replicate our findings and verify the results of our analysis. The cleaned data can also be regenerated by applying the cleaning scripts provided in the code repository to the LHIC and WC text files. While slight variations in results may occur when replicating the study from scratch due to the stochastic nature of LLM training, these differences are minimal and do not affect the substantive findings.
We encourage the use of this dataset for reproducibility studies and to inspire further exploration of LLM applications in political science research. By publishing this data, we aim to promote transparency, collaboration, and innovation in the field.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
LAURA: Enhancing Code Review Generation with Context-Enriched Retrieval-Augmented LLM

Introduction
LAURA is an LLM-based, retrieval-augmented, context-aware framework for code review generation. It integrates context augmentation, review exemplar retrieval, and prompt tuning to enhance the performance of LLMs (in our study, ChatGPT-4o and DeepSeek v3) in generating code review comments. The experiments show that LAURA outperforms the direct application of ChatGPT-4o and DeepSeek v3 for code review generation and significantly surpasses the performance of the pre-trained model CodeReviewer. Since our experiments are based on ChatGPT-4o and DeepSeek v3, we have released the data processing code and dataset used in our research. The code section includes the Python scripts we used for data collection, cleaning, merging, and retrieval. The dataset section contains 301k entries from 1,807 high-quality projects sourced from GitHub, covering four programming languages: C, C++, Java, and Python. We also provide the time-split dataset used as the retrieval database (which is also used for fine-tuning CodeReviewer) and the human-annotated evaluation dataset.

File Structure
- codes/: Data collection, filtering, and post-processing code used in our study
  - data_collection_and_filtering.py: Code for collecting data via the GitHub GraphQL API and filtering with rule-based and LLM-based methods
  - data_embedding.py: Code for data embedding
  - data_merging.py: Code for data merging, used to merge review comments with the same target diff
  - data_retrieval.py: Code for data retrieval
  - diff_extension.py: Code for extending code diffs by integrating the full code contexts into the diffs
- datasets/: Datasets built and used in our study
  - database_for_retrieve.csv: The dataset we built for retrieval-augmented generation, containing 298,494 entries prior to December 26, 2024
  - evaluation_data.csv: The evaluation dataset we manually annotated, containing 384 entries later than December 26, 2024
  - full_dataset.csv: The full dataset we collected, containing 301,256 entries
- prompts/: The prompts used in data filtering, generation, and evaluation
  - direct_generation.txt: The prompt we used for direct generation as baselines
  - LAURA_generation.txt: The prompt we used for LAURA generation
  - LLM_evaluation.txt: The prompt we used for LLM evaluation
  - LLM_filtering.txt: The prompt we used for LLM filtering in the data filtering process
- README.md: Description of our submission
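As a rough sketch of the retrieval step (the encoder choice and the diff/review column names below are placeholders; the released data_embedding.py and data_retrieval.py scripts implement the actual logic), exemplars can be ranked by cosine similarity to the query diff:

```python
# Sketch: rank review exemplars by embedding similarity to a query diff.
# The encoder and the "diff"/"review" column names are placeholders, not the released schema.
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer

db = pd.read_csv("datasets/database_for_retrieve.csv")
encoder = SentenceTransformer("all-MiniLM-L6-v2")

db_vecs = encoder.encode(db["diff"].tolist(), normalize_embeddings=True)
query_vec = encoder.encode(["<code diff under review>"], normalize_embeddings=True)

scores = (db_vecs @ query_vec.T).ravel()   # cosine similarity (embeddings are L2-normalized)
top_k = np.argsort(-scores)[:3]            # indices of the 3 closest exemplars
print(db.iloc[top_k]["review"].tolist())
```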
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
R scripts used for data cleaning, analysis, and visualization. (R)