Dataset Card for llama-2-banking-fine-tune
This dataset has been created with Argilla. As shown in the sections below, this dataset can be loaded into Argilla as explained in Load with Argilla, or used directly with the datasets library in Load with datasets.
Dataset Summary
This dataset contains:
A dataset configuration file conforming to the Argilla dataset format named argilla.yaml. This configuration file will be used to configure the dataset when using the… See the full description on the dataset page: https://huggingface.co/datasets/argilla/llama-2-banking-fine-tune.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset is designed for fine-tuning large language models in the medical domain. It consists of a series of conversations between users (patients) and assistants (doctors). Each conversation centers around a specific medical topic, such as gynecology, male dysfunction, erectile dysfunction, endocrinology, internal medicine, hepatology, etc.
Each conversation typically includes the following components: 1. System Prompt: Provides the doctor's specialization, e.g., "You are a doctor specializing in gynecology." 2. User Query: The patient describes symptoms or asks health-related questions. 3. Doctor's Response: The doctor offers advice and a diagnostic plan based on the user's query.
By using such dialogue datasets, language models can better understand and generate medical-related text, providing more accurate and useful advice.
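For illustration, a single conversation in the structure described above might be represented as follows; the field names and the medical content in this sketch are invented, not drawn from the dataset.

# Hypothetical record showing the system prompt / user query / doctor response structure.
conversation = [
    {"role": "system", "content": "You are a doctor specializing in gynecology."},
    {"role": "user", "content": "I have had irregular periods for the last three months. What could be causing this?"},
    {"role": "assistant", "content": "Irregular cycles are often linked to hormonal imbalance, stress, or thyroid problems. I would start with a hormone panel and a thyroid function test."},
]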
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
Dataset accompanying the paper "The CoT Collection: Improving Zero-shot and Few-shot Learning of Language Models via Chain-of-Thought Fine-Tuning" (https://arxiv.org/abs/2305.14045), including 1.88M CoT rationales extracted across 1,060 tasks.
From the release repo https://github.com/kaistAI/CoT-Collection: Large Language Models (LLMs) have shown enhanced capabilities of solving novel tasks by reasoning step-by-step, known as Chain-of-Thought (CoT) reasoning. How can we instill the same capability of step-by-step reasoning on unseen tasks into LMs with fewer than 100B parameters? To address this question, we first introduce the CoT Collection, a new instruction-tuning dataset that augments 1.88 million CoT rationales across 1,060 tasks. We show that continually fine-tuning Flan-T5 (3B & 11B) with the CoT Collection enables the 3B & 11B LMs to perform CoT better on unseen tasks, leading to an improvement in the average zero-shot accuracy on 27 datasets of the BIG-Bench-Hard benchmark by +4.34% and +2.44%, respectively. Furthermore, we show that instruction tuning with CoT allows LMs to possess stronger few-shot learning capabilities, resulting in improvements of +2.97% and +2.37% on 4 domain-specific tasks over Flan-T5 (3B & 11B), respectively.
Anak-Baik Dataset: Overview
The Anak-Baik dataset is a collection of instruction-output pairs in Bahasa Indonesia, designed for Supervised Fine-Tuning (SFT) tasks. It contains examples of both harmful and harmless outputs, aimed at promoting ethical AI development (hence the name; anak baik == good boy :D). The dataset consists of pairs of instructions and their corresponding outputs, categorized as either harmful or harmless and labeled with their topics. This structure enables models to… See the full description on the dataset page: https://huggingface.co/datasets/SulthanAbiyyu/anak-baik.
Example Dataset for Surya OCR Finetuning
This dataset is an example that lays out the expected format for finetuning Surya OCR.
Data Requirements
Image column: the input images (full pages, blocks, or single text lines; these can be mixed freely).
Text column: the transcription corresponding to each image. For math content, ensure math tags (<math> … </math>) are wrapped around the LaTeX.
Surya OCR supports:
Various aspect ratios… See the full description on the dataset page: https://huggingface.co/datasets/datalab-to/ocr_finetune_example.
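As a rough sketch of how a dataset in this format could be assembled with the Hugging Face datasets library (the file names and repository id below are placeholders, not part of this dataset):

from datasets import Dataset, Image

# Hypothetical image/text pairs; full pages, blocks, and single lines can be mixed.
records = {
    "image": ["page_001.png", "line_042.png"],
    "text": ["Full-page transcription goes here.", "A single transcribed text line."],
}
ds = Dataset.from_dict(records).cast_column("image", Image())
ds.push_to_hub("your-username/surya-ocr-finetune")  # placeholder repository id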
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
The pretraining dataset is available at this link: HIT-TMG/KaLM-embedding-pretrain-data.
Languages
English, Chinese, Multilingual
Dataset Structure
Each sample in the datasets is in the following format:
query: string, one query per sample
pos: list[string], usually containing one positive example
neg: list[string], usually containing seven negative examples
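For illustration, a single sample in this query/pos/neg format might look like the following; the text values are invented.

# Hypothetical training sample in the format described above.
sample = {
    "query": "how does contrastive learning train embedding models?",
    "pos": ["Contrastive learning pulls a query and its relevant passage together in embedding space."],
    "neg": [
        "The Eiffel Tower is located in Paris.",
        "Python lists are mutable, while tuples are immutable.",
        # ... usually seven negatives per sample
    ],
}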
Dataset Summary
All these datasets have been preprocessed and can be used for finetuning your embedding models.… See the full description on the dataset page: https://huggingface.co/datasets/KaLM-Embedding/KaLM-embedding-finetuning-data.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This model was fine-tuned as part of an artificial intelligence course at Gazi University in Ankara using a custom dataset created by the students and instructors. The model is optimized for a specific task, such as sentiment analysis or text classification, in the Turkish language.
Base model: bert-base-turkish-cased (example). The model can be used directly for tasks such as text classification, sentiment analysis, or other natural language processing tasks in Turkish.
The model can be integrated into larger ecosystems or more complex projects.
The model should not be used for unethical or malicious purposes. Additionally, it may have limited performance for multilingual tasks.
This model may inherit biases present in the training dataset. It is designed for Turkish, and performance may degrade for other languages or domains outside its training data.
Users are advised to be aware of the model's limitations due to its training dataset and validate its results for their specific use case.
You can use the following code snippet to load and test the model:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the fine-tuned model and its tokenizer
model_name = "gazi-university/fine-tuned-turkish-model"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example input (Turkish: "This AI model works perfectly!")
text = "Bu yapay zeka modeli mükemmel çalışıyor!"
inputs = tokenizer(text, return_tensors="pt")

# Run inference and take the highest-scoring class
with torch.no_grad():
    outputs = model(**inputs)
predicted_class = outputs.logits.argmax(dim=-1).item()
print(predicted_class)
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This small, hand-crafted dataset is designed to fine-tune large language models in Tamil, with a specific focus on scientific knowledge. The dataset includes a diverse range of scientific topics spanning physics, chemistry, biology, astronomy, and general science, ensuring comprehensive coverage of fundamental concepts.
Key Features:
Domain-Specific Focus: Primarily centered on scientific content to enhance the model's understanding and generation of Tamil scientific terminology and explanations.
Language Precision: Ensures accuracy in Tamil grammar, vocabulary, and context, particularly for scientific expressions and concepts.
Topic Diversity: Covers areas such as fundamental laws of physics, chemical reactions, biological processes, earth science, and astronomy.
Structured Data: Organized as question-answer pairs, definitions, explanations, and contextual examples to support various fine-tuning objectives (an illustrative record is sketched below).
This data is mainly extracted from Wikipedia and public textbooks.
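A hypothetical record in the question-answer format; the Tamil text below is an invented example, not taken from the dataset.

# Invented question-answer pair for illustration (roughly: "What is photosynthesis?").
example = {
    "question": "ஒளிச்சேர்க்கை என்றால் என்ன?",
    "answer": "ஒளிச்சேர்க்கை என்பது தாவரங்கள் சூரிய ஒளியை வேதி ஆற்றலாக மாற்றும் செயல்முறை ஆகும்.",
}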
Fine-tuning Dataset for Style Transfer
This dataset was generated for fine-tuning language models on style transfer tasks.
Dataset Details
Session ID: session_a0f4e9dd
Repository: andrewmonostate/finetune-test-dataset
Number of Examples: 2
Format: JSONL (JSON Lines)
Generated: 2025-08-23T07:38:48.549673
Dataset Structure
Each example contains:
task: The instruction for the model
input: The source text to be transformed
expected_output: The target text after… See the full description on the dataset page: https://huggingface.co/datasets/andrewmonostate/finetune-test-dataset.
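A hypothetical JSONL record following the task / input / expected_output schema described above (the values are invented):

import json

# Invented example of one JSONL line with the three fields listed above.
record = {
    "task": "Rewrite the text in a formal tone.",
    "input": "hey, can u send me that report asap?",
    "expected_output": "Could you please send me the report at your earliest convenience?",
}
print(json.dumps(record))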
E5-finetune Dataset
E5-finetune Dataset is a curated collection of query-passage pairs, encompassing a total of 870k examples. This dataset is specifically designed for fine-tuning models to extend their input length capabilities from 512 tokens to 1024 tokens. The primary focus is on accumulating long-context passages.
Dataset in English
The dataset samples long-context passage examples from various sources, ensuring a rich and diverse collection. The sources include:… See the full description on the dataset page: https://huggingface.co/datasets/ProfessorBob/E5-finetune-dataset.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This entry contains the SLO-VLM-IT-Dataset, a comprehensive dataset designed for instruction-tuning vision-language models in the Slovenian language. It is composed of five main .json files, which together provide a rich and diverse set of examples for training and fine-tuning models to understand and process both visual and textual information in Slovenian.
llava_v1_5_mix665k_translated_gemini_1_5_pro_all.json This file contains a machine-translated version of the popular Llava_v1_5_mix665k dataset. The translation from English to Slovenian was performed using the proprietary Gemini 1.5 Pro model.
wiki_14_march_2024_latest.json This file consists of conversational examples generated from Slovenian Wikipedia articles. The proprietary Gemini 1.5 Pro model was utilized for the data curation process, transforming the articles into an instruction-tuning format.
rtv.json This file consists of conversational examples generated on the basis of images from the news portal https://www.rtvslo.si. The proprietary Gemini 1.5 Pro model was utilized for the data generation.
siol.json This file consists of conversational examples generated on the basis of images from the news portal https://siol.net. The proprietary Gemini 1.5 Pro model was utilized for the data generation.
24ur.json This file consists of conversational examples generated on the basis of images from the news portal https://www.24ur.com. The proprietary Gemini 1.5 Pro model was utilized for the data generation.
The combined dataset includes a total of 1,128,228 examples, categorized as follows:
21,838 textvqa examples: Instructions for vision question answering based on specific Optical Character Recognition (OCR) tokens.
349,369 coco examples: A mix of instructions corresponding to 118,000 images from the COCO 2017 Object Detection Dataset. These include tasks such as generating long image descriptions, providing single-word answers, and answering multiple-choice questions.
81,309 vg examples: Instructions to either provide bounding box coordinates for a specified region in an image or describe a region defined by given coordinates.
66,227 gqa examples: Instructions requiring a one-word or one-phrase response to a question about the corresponding image.
78,976 ocr_vqa examples: Instructions focused on performing OCR to extract text from an image.
139,433 wiki examples: Instruction-tuning examples generated from Slovenian Wikipedia articles. The original Wikipedia articles were obtained from a Wikipedia database dump from March 14th 2025.
100,000 rtv examples: Instruction-tuning examples generated on the basis of images from the news portal https://www.rtvslo.si. Image scraping was completed on February 7th 2025.
100,000 siol examples: Instruction-tuning examples generated on the basis of images from the news portal https://siol.net. Image scraping was completed on March 22nd 2025.
100,000 24ur examples: Instruction-tuning examples generated on the basis of images from the news portal https://www.24ur.com. Image scraping was completed on February 7th 2025.
Accessing the Corresponding Images
News portal Images The images corresponding to the 'rtv', 'siol' and '24ur' examples need to be downloaded from the appropriate news portal. Each example in the json file contains an 'image' key with a URL of the corresponding image.
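A minimal sketch of downloading these images; it assumes each JSON file holds a list of example objects (the exact top-level structure is an assumption here), and only the 'image' key with its URL is taken from the description above.

import json
import os
import urllib.request

# Assumes the file contains a list of examples, each with an 'image' URL.
with open("rtv.json", encoding="utf-8") as f:
    examples = json.load(f)

os.makedirs("images", exist_ok=True)
for i, ex in enumerate(examples):
    urllib.request.urlretrieve(ex["image"], os.path.join("images", f"rtv_{i}.jpg"))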
Wiki Images The images corresponding to the 'wiki' examples are available for download at the following link: https://kt-cloud.ijs.si/index.php/s/nbLmWkaJEXHMMwe
Llava_v1_5_mix665k Images To facilitate the download of images for the translated Llava_v1_5_mix665k dataset, we provide the necessary Python script get_llava_images.py and its dependency overwatch.py.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Supervised Fine-Tuning (SFT) is a foundational technique for adapting large language models (LLMs) like GPT, LLaMA, and Claude to perform specific tasks. In SFT, a model is trained on a dataset of instruction–input–output triples, allowing it to learn how to generate helpful, relevant, and accurate responses based on human-designed prompts and inputs.
This technique is widely used for building task-specific AI agents, copilots, educational tools, and customer service bots.
This dataset contains 10,000 instruction–input–output examples spanning 10 practical domains.
Each record is structured as:
| Column | Description |
|---|---|
| id | Unique identifier |
| domain | Domain/topic of the task |
| instruction | A prompt asking the model to perform a task |
| input | Context or information needed to complete the task |
| output | Target response generated for the given instruction + input |
| source | Whether the entry is synthetic or human-curated |
| quality_score | A rating from 1–5 reflecting the response's quality |
| Instruction | Input | Output |
|---|---|---|
| "Summarize the following article" | "Photosynthesis is the process by which plants..." | "Photosynthesis converts light into chemical energy." |
| "Fix the code below" | "def greet(name): print('Hello' name)" | "def greet(name): print('Hello', name)" |
| "Plan a 5-day trip" | "Destination: Japan. Interests: culture, tech." | "Day 1: Tokyo tour... Day 2: Kyoto temples..." |
The records are intended for supervised fine-tuning on instruction + input → output pairs, for example with transformers and PEFT, and can be filtered by quality_score. Released under the MIT License; you may use, modify, and share with attribution.
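As a rough illustration (one common formatting choice, not a recipe prescribed by the dataset), a record can be flattened into a single training prompt before supervised fine-tuning:

# Sketch: turn an instruction / input / output record into one prompt string.
def format_example(record: dict) -> str:
    return (
        f"### Instruction:\n{record['instruction']}\n\n"
        f"### Input:\n{record['input']}\n\n"
        f"### Response:\n{record['output']}"
    )

example = {
    "instruction": "Summarize the following article",
    "input": "Photosynthesis is the process by which plants...",
    "output": "Photosynthesis converts light into chemical energy.",
}
print(format_example(example))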
Created by Zeeshan-ul-hassan Usmani to support open learning, LLM research, and educational outreach. Inspired by initiatives like Self-Instruct, OpenAssistant, and Hugging Face open datasets.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
ZigNet Training Dataset
Curated dataset of Zig programming examples for LLM fine-tuning. This dataset was created for the ZigNet project to train language models on Zig programming language patterns, idioms, and documentation.
Dataset Structure
Files
data/training/
├── dataset-train.jsonl # 9,629 examples (70%)
├── dataset-validation.jsonl # 2,063 examples (15%)
├── dataset-test.jsonl # 2,064 examples (15%)
└── dataset-stats.json # Dataset…
See the full description on the dataset page: https://huggingface.co/datasets/fulgidus/zignet-training-dataset.
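A hedged sketch of loading the three JSONL files with the Hugging Face datasets library; the file names come from the tree above, while mapping them to train/validation/test splits this way is an assumption.

from datasets import load_dataset

# Load the JSONL files listed in the tree above as named splits.
ds = load_dataset(
    "json",
    data_files={
        "train": "data/training/dataset-train.jsonl",
        "validation": "data/training/dataset-validation.jsonl",
        "test": "data/training/dataset-test.jsonl",
    },
)
print(ds)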
This dataset is used in an experimental preference fine-tuning of the Qwen2-1.5B model for a summarization task. The goal is to re-implement Apple's work on training specific LoRAs on top of a small LM to perform specific tasks, for example summarization. More info on the project: https://github.com/thepowerfuldeez/qwen2_1_5b_summarize
Method
Dataset generated using samples from RedPajamaV2 dataset, specifically Arxiv, Wikipedia, StackExchange documents. I have downloaded 1% of data and… See the full description on the dataset page: https://huggingface.co/datasets/thepowerfuldeez/Qwen-summarize-dataset-train.
https://github.com/MIT-LCP/license-and-dua/tree/master/drafts
RadCoref is a small subset of MIMIC-CXR with manually annotated coreference mentions and clusters. The dataset is annotated by a panel of three cross-disciplinary experts with experience in clinical data processing following the i2b2 annotation scheme with minimum modification. The dataset consists of Findings and Impression sections extracted from full radiology reports. The dataset has 950, 25 and 200 section documents for training, validation, and testing, respectively. The training and validation sets are annotated by one annotator. The test set is annotated by two human annotators independently, of which the results are merged manually by the third annotator. The dataset aims to support the task of coreference resolution on radiology reports. Given that the MIMIC-CXR has been de-identified already, no protected health information (PHI) is included.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Single-cell research faces challenges in accurately annotating cell types at high resolution, especially when dealing with large-scale datasets and rare cell populations. To address this, foundation models like scGPT offer flexible, scalable solutions by leveraging transformer-based architectures. This protocol provides a comprehensive guide to fine-tuning scGPT for cell-type classification in single-cell RNA sequencing (scRNA-seq) data. We demonstrate how to fine-tune scGPT on a custom retina dataset, highlighting the model's efficiency in handling complex data and improving annotation accuracy, achieving a 99.5% F1-score. The protocol automates key steps, including data preprocessing, model fine-tuning, and evaluation, enabling researchers to efficiently deploy scGPT on their own datasets. The provided tools, including a command-line script and a Jupyter Notebook, simplify customization and exploration of the model, offering an accessible workflow for users with minimal Python and Linux knowledge. The protocol provides an off-the-shelf solution for high-precision cell-type annotation using scGPT for researchers with intermediate bioinformatics skills.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
FoodExtract-1k
Dataset designed for fine-tuning a small LLM (e.g. gemma-3-270m) to extract structured data from text in a way that replicates a much larger LLM (e.g. gpt-oss-120b). The purpose is to enable a fine-tuned small LLM to filter a large text dataset for food and drink-like items. For example, take the DataComp1B dataset and use the fine-tuned LLM to filter for food- and drink-related items.
Example sample
{'sequence': 'A mouth-watering photograph captures a delectable… See the full description on the dataset page: https://huggingface.co/datasets/mrdbourke/FoodExtract-1k.
https://choosealicense.com/licenses/cc0-1.0/
Data Centric Machine Learning Domain SFT dataset
The Data Centric Machine Learning Domain SFT dataset is an example of how to use distilabel to easily create a domain-specific fine-tuning dataset, in particular using the Domain Specific Dataset Project Space. The dataset focuses on the domain of data-centric machine learning and consists of chat conversations between a user and an AI assistant. Its purpose is to demonstrate the process of creating domain-specific… See the full description on the dataset page: https://huggingface.co/datasets/davanstrien/data-centric-ml-sft.
Overview
This dataset is a Supervised Fine-Tuning (SFT) dataset generated from a subset of the Geometry3K dataset using Qwen2.5-VL. It serves as an example dataset for demonstrating VLM (Vision-Language Model) SFT training in the Trinity-RFT library.