-SFT: Nexdata assists clients in generating high-quality supervised fine-tuning data for model optimization through prompt and output annotation.
-Red teaming: Nexdata helps clients train and validate models by drafting various adversarial attacks, such as exploratory or potentially harmful questions. Our red-team capabilities help clients identify problems in their models related to hallucinations, harmful content, false information, discrimination, language bias, etc.
-RLHF: Nexdata assists clients in manually ranking multiple outputs generated by the SFT-trained model according to the rules provided by the client, or in providing multi-factor scoring. By training annotators to align with the client's values and using a multi-annotator approach, the quality of feedback can be improved.
-Compliance: All Large Language Model (LLM) data is collected with proper authorization.
-Quality: Multiple rounds of quality inspection ensure high-quality data output.
-Secure Implementation: An NDA is signed to guarantee secure implementation, and data is destroyed upon delivery.
-Efficiency: Our platform supports human-machine interaction and semi-automatic labeling, increasing labeling efficiency by more than 30% per annotator. It has successfully been applied to nearly 5,000 projects.
3. About Nexdata: Nexdata is equipped with professional data collection devices, tools, and environments, as well as project managers experienced in data collection and quality control, so we can meet Large Language Model (LLM) data collection requirements across a wide range of scenarios and data types. We have global data processing centers and more than 20,000 professional annotators, supporting on-demand Large Language Model (LLM) data annotation services such as speech, image, video, point cloud, and Natural Language Processing (NLP) data. Please visit us at https://www.nexdata.ai/?source=Datarade
License: CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)
By Huggingface Hub [source]
This dataset, TokenBender: 122k Alpaca-Style Instructions Word-Level Classification Towards Accurate Natural Language Understanding, provides a comprehensive collection of 122k Alpaca-style instructions with their associated input, text, and output fields for word-level classification. It supports natural language understanding research by covering entries from diverse areas, such as programming and gaming instructions, written at varying levels of complexity. Developers applying natural language processing techniques can use it to study how to improve the accuracy of machine comprehension of human language commands, and to build models such as neural networks or decision trees that understand commands across languages, helping bridge the gap between machines and humans for practical purposes.
This dataset contains 122k Alpaca-style instructions with their corresponding input, text, and output for word-level classification. It is a valuable resource for anyone who wishes to gain insight into natural language understanding through data science approaches. This guide provides some tips on how to use the dataset to maximize accuracy and build a better understanding of natural language.
Preprocessing: Cleaning the data is an essential step for any text dataset, including the Alpaca instructions. This involves removing stopwords (articles, pronouns, etc.), normalizing words through lowercasing or lemmatization, filtering for terms relevant to the context or problem you are trying to solve, and finally tokenizing the remaining text into individual pieces that can serve as input features for different models; SentencePiece works well for this kind of task.
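As a rough illustration of this step, the sketch below uses NLTK stopwords and a SentencePiece subword model; both library choices and the instructions.txt file name are assumptions for demonstration, not part of the dataset.

```python
# Minimal preprocessing sketch; NLTK and SentencePiece are illustrative choices,
# not requirements of the dataset.
import re

import nltk
import sentencepiece as spm

nltk.download("stopwords", quiet=True)
from nltk.corpus import stopwords

STOPWORDS = set(stopwords.words("english"))

def clean(text: str) -> str:
    """Lowercase, strip punctuation, and drop common stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(t for t in tokens if t not in STOPWORDS)

# Assume the instructions have been dumped one per line to instructions.txt (hypothetical file).
spm.SentencePieceTrainer.train(
    input="instructions.txt", model_prefix="alpaca_sp", vocab_size=8000
)
sp = spm.SentencePieceProcessor(model_file="alpaca_sp.model")

example = clean("Write a Python function that reverses a string.")
print(sp.encode(example, out_type=str))  # subword pieces ready to feed a model
```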
Feature extraction: After preprocessing the text, extract informative features using techniques such as Bag-of-Words (BoW) or a Term Frequency-Inverse Document Frequency (TF-IDF) vectorizer, which help capture the context behind each instruction sentence or word in the corpus. Embedding techniques such as word2vec or GloVe can also extract semantic information from the instructions and help build classifiers that predict word-level categories (semantic segmentation-style tagging).
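For example, a minimal scikit-learn sketch of TF-IDF feature extraction might look like this (the toy sentences are placeholders, not entries from the dataset):

```python
# Illustrative TF-IDF feature extraction with scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer

instructions = [
    "Sort the list of numbers in ascending order.",
    "Translate the following sentence into French.",
    "Write a unit test for the parser module.",
]

vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=1)
X = vectorizer.fit_transform(instructions)   # sparse (n_docs, n_terms) matrix
print(X.shape, len(vectorizer.get_feature_names_out()))
```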
Model selection: Depending on the problem setup, architectures such as Support Vector Machines (SVMs), Conditional Random Fields (CRFs), or attention-based models work well for NLP tasks at the sentence level or at shallow-representation levels such as part-of-speech tagging. If capturing which words are used together matters most, an RNN model such as an LSTM or GRU can do wonders: its recurrent structure stores contextual information more effectively than the BoW or TF-IDF vector spaces built separately during feature engineering.
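As a sketch of the LSTM option, assuming the instructions have already been converted to padded word-id sequences with per-word tag ids (the vocabulary size, tag count, and sequence length below are placeholder values), a Keras bidirectional LSTM tagger could be defined like this:

```python
# Sketch of a word-level sequence tagger: embedding + bidirectional LSTM + per-token softmax.
# VOCAB_SIZE, NUM_TAGS, and MAX_LEN are placeholder values, not dataset constants.
import tensorflow as tf

VOCAB_SIZE, NUM_TAGS, MAX_LEN = 10_000, 12, 64

model = tf.keras.Sequential([
    tf.keras.Input(shape=(MAX_LEN,)),
    tf.keras.layers.Embedding(VOCAB_SIZE, 128, mask_zero=True),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True)),
    tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(NUM_TAGS, activation="softmax")),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()
# model.fit(token_ids, tag_ids, validation_split=0.1, epochs=5) once the data is encoded.
```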
Evaluating results: After choosing a model, evaluate it with measures such as the F1 score so you can track progress toward the end goal and adjust if precision or recall drops significantly below a chosen threshold. Hold out an uncategorized sample of documents, or use train/test splits of the dataset, to confirm that the results generalize.
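Concretely, scikit-learn's metrics make this straightforward; the toy labels below are placeholders standing in for the tagger's held-out predictions:

```python
# Precision/recall/F1 on a held-out split; toy labels stand in for real predictions.
from sklearn.metrics import classification_report, f1_score

y_true = ["VERB", "NOUN", "NOUN", "DET", "VERB", "NOUN"]
y_pred = ["VERB", "NOUN", "DET",  "DET", "VERB", "VERB"]

print(f1_score(y_true, y_pred, average="macro"))
print(classification_report(y_true, y_pred, zero_division=0))
```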
- Developing an AI-based algorithm capable of accurately understanding the meaning of natural language instructions.
- Using this dataset for training and testing machine learning models to classify specific words and phrases within natural language instructions.
- Training a deep learning model to generate visual components based on the given input, text, and output values from this dataset
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal
License: Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
With the wide application of large models in various fields, the demand for high-quality datasets in the tourism industry is increasing to support improvements in a model's ability to understand and generate tourism information. This dataset focuses on textual data in the tourism domain and is designed to support fine-tuning tasks for tourism-oriented large models, aiming to enhance the model's ability to understand and generate tourism-related information. The diversity and quality of the dataset are critical to the model's performance. Therefore, this study combines web scraping and manual annotation, along with data cleaning, denoising, and stopword removal, to ensure high data quality and accuracy. Additionally, automated annotation tools are used to generate instructions and perform consistency checks on the texts. The LLM-Tourism dataset primarily relies on data from Ctrip and Baidu Baike, covering five Northwestern Chinese provinces: Gansu, Ningxia, Qinghai, Shaanxi, and Xinjiang, and contains 53,280 pairs of structured data in JSON format. The creation of this dataset will not only improve the generation accuracy of tourism large models but also contribute to the sharing and application of tourism-related datasets in the field of large models.
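As an illustration only, assuming each JSON entry pairs an instruction-style question with a reference answer (the file name and the instruction/output field names below are assumptions, not the published schema), the pairs could be loaded for fine-tuning like this:

```python
# Hypothetical loader for the LLM-Tourism JSON pairs; the file name and the
# "instruction"/"output" field names are assumptions, not the published schema.
import json

with open("llm_tourism.json", encoding="utf-8") as f:
    pairs = json.load(f)

print(len(pairs))  # expected: 53,280 structured pairs
for pair in pairs[:3]:
    prompt, answer = pair["instruction"], pair["output"]
    print(prompt[:60], "->", answer[:60])
```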
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
config.json) between commits to identify added, deleted, and updated keys, which served as input for classification.

- code/: Contains all Jupyter notebooks for the study.
  - Collection/: Scripts for data extraction from Hugging Face.
    - HFExtraction.ipynb: Collects primary model and commit information.
    - HFReleasesExtraction.ipynb: Collects release (tag) specific information.
  - Preprocessing/: Scripts for data cleaning, processing, and classification.
    - HFCommitsPreprocessing.ipynb: Processes commits, computes diffs, and prepares data for classification and analysis.
    - HFReleasesPreprocessing.ipynb: Processes and classifies release data.
    - SwansonsClassification.ipynb: Classifies commits into corrective, perfective, and adaptive types.
  - Analysis/: Notebooks for reproducing the analysis for each research question.
    - HFFileChanges.ipynb: Contains the preliminary analysis of file change patterns.
    - RQ1_Analysis.ipynb: Analysis for Research Question 1.
    - RQ2_Analysis.ipynb: Analysis for Research Question 2.
    - RQ3_Analysis.ipynb: Analysis for Research Question 3.
- datasets/: Contains the key final datasets used in the analysis notebooks.
  - commits_datasets/: Contains the main classified commit dataset.
    - HFCommitsClassification_final.csv: The final dataset with over 960,000 classified commits for RQ1 and RQ2.
  - releases_datasets/: Contains the datasets related to releases.
    - HFReleasesClassification.csv: The final dataset of 2,251 classified releases for RQ3.
    - model_metadata.csv: The extracted internal metadata from model files for RQ3.
- metadata/: Contains configuration files and the data used for the validation process.
  - validation_data/: A sub-folder containing the gold standard data.
    - Agreement TOSEM Commit Changes.xlsx: Excel file containing details of the classification and validation processes.
    - prompt_refinement.txt: The final, validated prompt used for the LLM classification, along with its previous iterations.
    - training_set_ground_truth.json: Gold standard for the 143-commit training set.
    - training_set_first_classification.json: First annotator's labels for the training IRR subset.
    - training_set_second_classification.json: Second annotator's labels for the training IRR subset.
    - test_set_ground_truth.json: Gold standard for the 384-commit test set.
    - test_first_classification.json: First annotator's labels for the test IRR subset.
    - test_set_second_classification.json: Second annotator's labels for the test IRR subset.
  - tags_metadata.yaml: Auxiliary metadata file used during preprocessing.
- README.md: This file.
- requirements.txt: Lists the required Python packages.
```bash
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```

```bash
pip install -r requirements.txt
```
<div>code/ directory are numbered and named to be run in a logical sequence: Collection -> Preprocessing -> Analysis. We recommend following this order.code/Analysis/. They are configured to load the final, processed datasets provided in the datasets/ folder.code/Collection/, followed by code/Preprocessing/. Please note that running the full data collection and classification pipeline is time-consuming and may require significant computational resources and appropriate API keys for the LLM.datasets/commits_datasets/HFCommitsClassification_final.csv (100,000 models for RQ1; filtered to 14,343 models for RQ2).datasets/releases_datasets/HFReleasesClassification.csv (2,251 releases from 202 models).datasets/releases_datasets/model_metadata.csv (from 28 models).
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset and code package supports the reproducible evaluation of structured Large Language Model (LLM)-generated event messaging using multiple machine learning classifiers, including BERT (via TensorFlow/Keras), XGBoost, and ensemble methods. The contents are described below.
The materials accompany the study presented in [Lynch, Christopher, Erik Jensen, Ross Gore, et al. "AI-Generated Messaging for Life Events Using Structured Prompts: A Comparative Study of GPT With Human Experts and Machine Learning." TechRxiv (2025), DOI: https://doi.org/10.36227/techrxiv.174123588.85605769/v1], where Structured Narrative Prompting was applied to generate life-event messages from LLMs, followed by human annotation, and machine learning validation. This release provides complete transparency for reproducing reported metrics and facilitates further benchmarking in multilingual or domain-specific contexts.
Value of the Data:
- Enables direct replication of published results across BERT, Keras-based models, XGBoost, and ensemble classifiers.
- Provides clean, human-tagged datasets suitable for training, evaluation, and bias analysis.
- Offers untagged datasets for new annotation or domain adaptation.
- Contains full preprocessing, training, and visualization code in Python and R for flexibility across workflows.
- Facilitates extension into other domains (e.g., multilingual LLM messaging validation).
Data Description:
- /data/tagged/*.csv – Human-labeled datasets with the schema defined in data_dictionary.csv.
- /data/untagged/*.csv – Clean datasets without labels, for inference or annotation.
- /code/python/ – Python scripts for preprocessing, model training (BERT, Keras DNN, XGBoost), ensembling, evaluation metrics, and plotting.
- /code/r/ – R scripts for exploratory data analysis, statistical testing, and replication of key figures/tables.
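A minimal sketch of one such baseline, assuming a text column and a label column (the file and column names below are placeholders; the real schema is given in data_dictionary.csv), might combine TF-IDF features with an XGBoost classifier:

```python
# Illustrative baseline on the tagged data: TF-IDF features + XGBoost classifier.
# The file name "events.csv" and columns "message"/"label" are placeholders; see data_dictionary.csv.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

df = pd.read_csv("data/tagged/events.csv")
X = TfidfVectorizer(max_features=5000).fit_transform(df["message"])
y = df["label"].astype("category").cat.codes

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf = XGBClassifier(n_estimators=300, max_depth=6, eval_metric="logloss")
clf.fit(X_train, y_train)
print(f1_score(y_test, clf.predict(X_test), average="macro"))
```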
File Formats:
- Data: CSV (UTF-8, RFC 4180)
- Code: .py, .R, .Rproj

Ethics & Licensing:
- All data are de-identified and contain no PII.
- Released under CC BY 4.0 (data) and the MIT License (code).

Limitations:
- Labels reflect annotator interpretations and may encode bias.
- Models are trained on English text; generalization to other languages requires adaptation.

Funding Note:
- Funding sources provided time in support of the human taggers who annotated the datasets.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository provides the data supporting our study, in which we use large language models (LLMs) to analyze the impact of congressional witness testimony. The dataset has been curated and structured to facilitate reproducibility and encourage further research in this domain.
The repository includes the results of our study (see `Results.zip`), the fine-tuning corpus (see `Model Training Data.zip`), and the Witness and Legislative History and Impact Corpus (WLHIC), which can be subdivided into the Witness Corpus (WC) and the Legislative History and Impact Corpus (LHIC). For the LHIC and WC, we provide cleaned JSONL files containing the full datasets, individual text files of each document, and accompanying metadata (see `WLHIC data.zip`). To ensure comprehensive accessibility, we also include the original PDF versions of the documents in these corpora (see `WLHIC Raw Files.zip`).
We also provide the sentence transformer model resulting from the extended pretraining process and the model resulting from the fine-tuning process. Both are accessible in `Models.zip`.
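For instance, once `Models.zip` is extracted, the extended-pretrained sentence transformer can be loaded with the sentence-transformers library; the local path below is a placeholder for wherever the model folder is unpacked:

```python
# Load the released sentence-transformer from a local folder (path is a placeholder).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("Models/extended_pretrained_model")
embeddings = model.encode([
    "Testimony urging stricter emissions standards.",
    "Bill text amending the Clean Air Act.",
])
print(util.cos_sim(embeddings[0], embeddings[1]))  # semantic similarity of the two texts
```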
Researchers can use the provided data to replicate our findings and verify the results of our analysis. The cleaned data can also be regenerated by applying the cleaning scripts provided in the code repository to the LHIC and WC text files. While slight variations in results may occur when replicating the study from scratch due to the stochastic nature of LLM training, these differences are minimal and do not affect the substantive findings.
We encourage the use of this dataset for reproducibility studies and to inspire further exploration of LLM applications in political science research. By publishing this data, we aim to promote transparency, collaboration, and innovation in the field.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
LAURA: Enhancing Code Review Generation with Context-Enriched Retrieval-Augmented LLM

Introduction
LAURA is an LLM-based, retrieval-augmented, context-aware framework for code review generation. It integrates context augmentation, review exemplar retrieval, and prompt tuning to enhance the performance of LLMs (in our study, ChatGPT-4o and DeepSeek v3) in generating code review comments. The experiments show that LAURA outperforms the direct application of ChatGPT-4o and DeepSeek v3 for code review generation and significantly surpasses the performance of the pre-trained model CodeReviewer. Since our experiments are based on ChatGPT-4o and DeepSeek v3, we have released the data processing code and dataset used in our research. The code section includes the Python scripts we used for data collection, cleaning, merging, and retrieval. The dataset section contains 301k entries from 1,807 high-quality projects sourced from GitHub, covering four programming languages: C, C++, Java, and Python. We also provide the time-split dataset used as the retrieval database (which is also used for fine-tuning CodeReviewer) and the human-annotated evaluation dataset.

File Structure
- codes/: Data collection, filtering, and post-processing code used in our study
  - data_collection_and_filtering.py: Code for collecting data via the GitHub GraphQL API and filtering with rule-based and LLM-based methods
  - data_embedding.py: Code for data embedding
  - data_merging.py: Code for data merging, used to merge review comments with the same target diff
  - data_retrieval.py: Code for data retrieval
  - diff_extension.py: Code for extending code diffs by integrating the full code contexts into the diffs
- datasets/: Datasets built and used in our study
  - database_for_retrieve.csv: The dataset we built for retrieval-augmented generation, containing 298,494 entries prior to December 26, 2024
  - evaluation_data.csv: The evaluation dataset we manually annotated, containing 384 entries later than December 26, 2024
  - full_dataset.csv: The full dataset we collected, containing 301,256 entries
- prompts/: The prompts used in data filtering, generation, and evaluation
  - direct_generation.txt: The prompt we used for direct generation as baselines
  - LAURA_generation.txt: The prompt we used for LAURA generation
  - LLM_evaluation.txt: The prompt we used for LLM evaluation
  - LLM_filtering.txt: The prompt we used for LLM filtering in the data filtering process
- README.md: Description of our submission
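As a rough sketch of the retrieval step (the encoder choice and the diff/review column names below are placeholders; the released data_embedding.py and data_retrieval.py scripts implement the actual logic), exemplars can be ranked by cosine similarity to the query diff:

```python
# Sketch: rank review exemplars by embedding similarity to a query diff.
# The encoder and the "diff"/"review" column names are placeholders, not the released schema.
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer

db = pd.read_csv("datasets/database_for_retrieve.csv")
encoder = SentenceTransformer("all-MiniLM-L6-v2")

db_vecs = encoder.encode(db["diff"].tolist(), normalize_embeddings=True)
query_vec = encoder.encode(["<code diff under review>"], normalize_embeddings=True)

scores = (db_vecs @ query_vec.T).ravel()   # cosine similarity (embeddings are L2-normalized)
top_k = np.argsort(-scores)[:3]            # indices of the 3 closest exemplars
print(db.iloc[top_k]["review"].tolist())
```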
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
R scripts used for data cleaning, analysis, and visualization. (R)