Dataset Card for llama-2-banking-fine-tune
This dataset has been created with Argilla. As shown in the sections below, this dataset can be loaded into Argilla as explained in Load with Argilla, or used directly with the datasets library in Load with datasets.
Dataset Summary
This dataset contains:
A dataset configuration file conforming to the Argilla dataset format named argilla.yaml. This configuration file will be used to configure the dataset when using the… See the full description on the dataset page: https://huggingface.co/datasets/argilla/llama-2-banking-fine-tune.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset is designed for fine-tuning large language models in the medical domain. It consists of a series of conversations between users (patients) and assistants (doctors). Each conversation centers around a specific medical topic, such as gynecology, male dysfunction, erectile dysfunction, endocrinology, internal medicine, hepatology, etc.
Each conversation typically includes the following components: 1. System Prompt: Provides the doctor's specialization, e.g., "You are a doctor specializing in gynecology." 2. User Query: The patient describes symptoms or asks health-related questions. 3. Doctor's Response: The doctor offers advice and a diagnostic plan based on the user's query.
By using such dialogue datasets, language models can better understand and generate medical-related text, providing more accurate and useful advice.
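For illustration, a single conversation in the structure described above might be represented as follows; the field names and the medical content in this sketch are invented, not drawn from the dataset.

# Hypothetical record showing the system prompt / user query / doctor response structure.
conversation = [
    {"role": "system", "content": "You are a doctor specializing in gynecology."},
    {"role": "user", "content": "I have had irregular periods for the last three months. What could be causing this?"},
    {"role": "assistant", "content": "Irregular cycles are often linked to hormonal imbalance, stress, or thyroid problems. I would start with a hormone panel and a thyroid function test."},
]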
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
Dataset accompanying the paper "The CoT Collection: Improving Zero-shot and Few-shot Learning of Language Models via Chain-of-Thought Fine-Tuning" (https://arxiv.org/abs/2305.14045), including 1.88M CoT rationales extracted across 1,060 tasks.
From the release repo https://github.com/kaistAI/CoT-Collection: Large Language Models (LLMs) have shown enhanced capabilities of solving novel tasks by reasoning step-by-step, known as Chain-of-Thought (CoT) reasoning. How can we instill the same capability of step-by-step reasoning on unseen tasks into LMs with fewer than 100B parameters? To address this question, we first introduce the CoT Collection, a new instruction-tuning dataset that augments 1.88 million CoT rationales across 1,060 tasks. We show that continually fine-tuning Flan-T5 (3B & 11B) with the CoT Collection enables the 3B & 11B LMs to perform CoT better on unseen tasks, leading to an improvement in the average zero-shot accuracy on 27 datasets of the BIG-Bench-Hard benchmark by +4.34% and +2.44%, respectively. Furthermore, we show that instruction tuning with CoT allows LMs to possess stronger few-shot learning capabilities, resulting in improvements of +2.97% and +2.37% on 4 domain-specific tasks over Flan-T5 (3B & 11B), respectively.
Anak-Baik Dataset: Overview
The Anak-Baik dataset is a collection of instruction-output pairs in Bahasa Indonesia, designed for Supervised Fine-Tuning (SFT) tasks. It contains examples of both harmful and harmless outputs, aimed at promoting ethical AI development (hence the name; anak baik == good boy :D). The dataset consists of pairs of instructions and their corresponding outputs, categorized as either harmful or harmless and labeled with their topics. This structure enables models to… See the full description on the dataset page: https://huggingface.co/datasets/SulthanAbiyyu/anak-baik.
Example Dataset for Surya OCR Finetuning
This dataset is an example that lays out the expected format for finetuning Surya OCR.
Data Requirements
Image column: the input images (full pages, blocks, or single text lines; these can be mixed freely).
Text column: the transcription corresponding to each image. For math content, ensure math tags (<math> … </math>) are wrapped around the LaTeX.
Surya OCR supports:
Various aspect ratios… See the full description on the dataset page: https://huggingface.co/datasets/datalab-to/ocr_finetune_example.
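As a rough sketch of how a dataset in this format could be assembled with the Hugging Face datasets library (the file names and repository id below are placeholders, not part of this dataset):

from datasets import Dataset, Image

# Hypothetical image/text pairs; full pages, blocks, and single lines can be mixed.
records = {
    "image": ["page_001.png", "line_042.png"],
    "text": ["Full-page transcription goes here.", "A single transcribed text line."],
}
ds = Dataset.from_dict(records).cast_column("image", Image())
ds.push_to_hub("your-username/surya-ocr-finetune")  # placeholder repository id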
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
The pretraining dataset is available at this link: HIT-TMG/KaLM-embedding-pretrain-data.
Languages
English, Chinese, Multilingual
Dataset Structure
Each sample in the datasets is in the following format:
query: string, one query per sample
pos: list[string], usually containing one positive example
neg: list[string], usually containing seven negative examples
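For illustration, a single sample in this query/pos/neg format might look like the following; the text values are invented.

# Hypothetical training sample in the format described above.
sample = {
    "query": "how does contrastive learning train embedding models?",
    "pos": ["Contrastive learning pulls a query and its relevant passage together in embedding space."],
    "neg": [
        "The Eiffel Tower is located in Paris.",
        "Python lists are mutable, while tuples are immutable.",
        # ... usually seven negatives per sample
    ],
}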
Dataset Summary
All these datasets have been preprocessed and can be used for finetuning your embedding models.… See the full description on the dataset page: https://huggingface.co/datasets/KaLM-Embedding/KaLM-embedding-finetuning-data.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This model was fine-tuned as part of an artificial intelligence course at Gazi University in Ankara using a custom dataset created by the students and instructors. The model is optimized for a specific task, such as sentiment analysis or text classification, in the Turkish language.
Base model: bert-base-turkish-cased (example). The model can be used directly for tasks such as text classification, sentiment analysis, or other natural language processing tasks in Turkish.
The model can be integrated into larger ecosystems or more complex projects.
The model should not be used for unethical or malicious purposes. Additionally, it may have limited performance for multilingual tasks.
This model may inherit biases present in the training dataset. It is designed for Turkish, and performance may degrade for other languages or domains outside its training data.
Users are advised to be aware of the model's limitations due to its training dataset and validate its results for their specific use case.
You can use the following code snippet to load and test the model:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the fine-tuned model and its tokenizer
model_name = "gazi-university/fine-tuned-turkish-model"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example input (Turkish: "This AI model works perfectly!")
text = "Bu yapay zeka modeli mükemmel çalışıyor!"
inputs = tokenizer(text, return_tensors="pt")

# Run inference and take the highest-scoring class
with torch.no_grad():
    outputs = model(**inputs)
predicted_class = outputs.logits.argmax(dim=-1).item()
print(predicted_class)
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This small, hand-crafted dataset is designed to fine-tune large language models in Tamil, with a specific focus on scientific knowledge. The dataset includes a diverse range of scientific topics spanning physics, chemistry, biology, astronomy, and general science, ensuring comprehensive coverage of fundamental concepts.
Key Features:
Domain-Specific Focus: Primarily centered on scientific content to enhance the model's understanding and generation of Tamil scientific terminology and explanations.
Language Precision: Ensures accuracy in Tamil grammar, vocabulary, and context, particularly for scientific expressions and concepts.
Topic Diversity: Covers areas such as fundamental laws of physics, chemical reactions, biological processes, earth science, and astronomy.
Structured Data: Organized as question-answer pairs, definitions, explanations, and contextual examples to support various fine-tuning objectives (an illustrative record is sketched below).
This data is mainly extracted from Wikipedia and public textbooks.
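A hypothetical record in the question-answer format; the Tamil text below is an invented example, not taken from the dataset.

# Invented question-answer pair for illustration (roughly: "What is photosynthesis?").
example = {
    "question": "ஒளிச்சேர்க்கை என்றால் என்ன?",
    "answer": "ஒளிச்சேர்க்கை என்பது தாவரங்கள் சூரிய ஒளியை வேதி ஆற்றலாக மாற்றும் செயல்முறை ஆகும்.",
}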
Fine-tuning Dataset for Style Transfer
This dataset was generated for fine-tuning language models on style transfer tasks.
Dataset Details
Session ID: session_a0f4e9dd
Repository: andrewmonostate/finetune-test-dataset
Number of Examples: 2
Format: JSONL (JSON Lines)
Generated: 2025-08-23T07:38:48.549673
Dataset Structure
Each example contains:
task: The instruction for the model
input: The source text to be transformed
expected_output: The target text after… See the full description on the dataset page: https://huggingface.co/datasets/andrewmonostate/finetune-test-dataset.
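A hypothetical JSONL record following the task / input / expected_output schema described above (the values are invented):

import json

# Invented example of one JSONL line with the three fields listed above.
record = {
    "task": "Rewrite the text in a formal tone.",
    "input": "hey, can u send me that report asap?",
    "expected_output": "Could you please send me the report at your earliest convenience?",
}
print(json.dumps(record))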
E5-finetune Dataset
E5-finetune Dataset is a curated collection of query-passage pairs, encompassing a total of 870k examples. This dataset is specifically designed for fine-tuning models to extend their input length capabilities from 512 tokens to 1024 tokens. The primary focus is on accumulating long-context passages.
Dataset in English
The dataset samples long-context passage examples from various sources, ensuring a rich and diverse collection. The sources include:… See the full description on the dataset page: https://huggingface.co/datasets/ProfessorBob/E5-finetune-dataset.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This entry contains the SLO-VLM-IT-Dataset, a comprehensive dataset designed for instruction-tuning vision-language models in the Slovenian language. It is composed of five main .json files, which together provide a rich and diverse set of examples for training and fine-tuning models to understand and process both visual and textual information in Slovenian.
llava_v1_5_mix665k_translated_gemini_1_5_pro_all.json This file contains a machine-translated version of the popular Llava_v1_5_mix665k dataset. The translation from English to Slovenian was performed using the proprietary Gemini 1.5 Pro model.
wiki_14_march_2024_latest.json This file consists of conversational examples generated from Slovenian Wikipedia articles. The proprietary Gemini 1.5 Pro model was utilized for the data curation process, transforming the articles into an instruction-tuning format.
rtv.json This file consists of conversational examples generated on the basis of images from the news portal https://www.rtvslo.si. The proprietary Gemini 1.5 Pro model was utilized for the data generation.
siol.json This file consists of conversational examples generated on the basis of images from the news portal https://siol.net. The proprietary Gemini 1.5 Pro model was utilized for the data generation.
24ur.json This file consists of conversational examples generated on the basis of images from the news portal https://www.24ur.com. The proprietary Gemini 1.5 Pro model was utilized for the data generation.
The combined dataset includes a total of 1,128,228 examples, categorized as follows:
21,838 textvqa examples: Instructions for vision question answering based on specific Optical Character Recognition (OCR) tokens.
349,369 coco examples: A mix of instructions corresponding to 118,000 images from the COCO 2017 Object Detection Dataset. These include tasks such as generating long image descriptions, providing single-word answers, and answering multiple-choice questions.
81,309 vg examples: Instructions to either provide bounding box coordinates for a specified region in an image or describe a region defined by given coordinates.
66,227 gqa examples: Instructions requiring a one-word or one-phrase response to a question about the corresponding image.
78,976 ocr_vqa examples: Instructions focused on performing OCR to extract text from an image.
139,433 wiki examples: Instruction-tuning examples generated from Slovenian Wikipedia articles. The original Wikipedia articles were obtained from a Wikipedia database dump from March 14th 2025.
100,000 rtv examples: Instruction-tuning examples generated on the basis of images from the news portal https://www.rtvslo.si. Image scraping was completed on February 7th 2025.
100,000 siol examples: Instruction-tuning examples generated on the basis of images from the news portal https://siol.net. Image scraping was completed on March 22nd 2025.
100,000 24ur examples: Instruction-tuning examples generated on the basis of images from the news portal https://www.24ur.com. Image scraping was completed on February 7th 2025.
Accessing the Corresponding Images
News portal Images The images corresponding to the 'rtv', 'siol' and '24ur' examples need to be downloaded from the appropriate news portal. Each example in the json file contains an 'image' key with a URL of the corresponding image.
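A minimal sketch of downloading these images; it assumes each JSON file holds a list of example objects (the exact top-level structure is an assumption here), and only the 'image' key with its URL is taken from the description above.

import json
import os
import urllib.request

# Assumes the file contains a list of examples, each with an 'image' URL.
with open("rtv.json", encoding="utf-8") as f:
    examples = json.load(f)

os.makedirs("images", exist_ok=True)
for i, ex in enumerate(examples):
    urllib.request.urlretrieve(ex["image"], os.path.join("images", f"rtv_{i}.jpg"))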
Wiki Images The images corresponding to the 'wiki' examples are available for download at the following link: https://kt-cloud.ijs.si/index.php/s/nbLmWkaJEXHMMwe
Llava_v1_5_mix665k Images To facilitate the download of images for the translated Llava_v1_5_mix665k dataset, we provide the necessary Python script get_llava_images.py and its dependency overwatch.py.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Supervised Fine-Tuning (SFT) is a foundational technique for adapting large language models (LLMs) like GPT, LLaMA, and Claude to perform specific tasks. In SFT, a model is trained on a dataset of instruction–input–output triples, allowing it to learn how to generate helpful, relevant, and accurate responses based on human-designed prompts and inputs.
This technique is widely used for building task-specific AI agents, copilots, educational tools, and customer service bots.
This dataset contains 10,000 instruction–input–output examples spanning 10 practical domains.
Each record is structured as:
| Column | Description |
|---|---|
| id | Unique identifier |
| domain | Domain/topic of the task |
| instruction | A prompt asking the model to perform a task |
| input | Context or information needed to complete the task |
| output | Target response generated for the given instruction + input |
| source | Whether the entry is synthetic or human-curated |
| quality_score | A rating from 1–5 reflecting the response's quality |
| Instruction | Input | Output |
|---|---|---|
| "Summarize the following article" | "Photosynthesis is the process by which plants..." | "Photosynthesis converts light into chemical energy." |
| "Fix the code below" | "def greet(name): print('Hello' name)" | "def greet(name): print('Hello', name)" |
| "Plan a 5-day trip" | "Destination: Japan. Interests: culture, tech." | "Day 1: Tokyo tour... Day 2: Kyoto temples..." |
The records are intended for supervised fine-tuning on instruction + input → output pairs, for example with transformers and PEFT, and can be filtered by quality_score. Released under the MIT License; you may use, modify, and share with attribution.
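As a rough illustration (one common formatting choice, not a recipe prescribed by the dataset), a record can be flattened into a single training prompt before supervised fine-tuning:

# Sketch: turn an instruction / input / output record into one prompt string.
def format_example(record: dict) -> str:
    return (
        f"### Instruction:\n{record['instruction']}\n\n"
        f"### Input:\n{record['input']}\n\n"
        f"### Response:\n{record['output']}"
    )

example = {
    "instruction": "Summarize the following article",
    "input": "Photosynthesis is the process by which plants...",
    "output": "Photosynthesis converts light into chemical energy.",
}
print(format_example(example))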
Created by Zeeshan-ul-hassan Usmani to support open learning, LLM research, and educational outreach. Inspired by initiatives like Self-Instruct, OpenAssistant, and Hugging Face open datasets.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
ZigNet Training Dataset
Curated dataset of Zig programming examples for LLM fine-tuning. This dataset was created for the ZigNet project to train language models on Zig programming language patterns, idioms, and documentation.
Dataset Structure
Files
data/training/
├── dataset-train.jsonl # 9,629 examples (70%)
├── dataset-validation.jsonl # 2,063 examples (15%)
├── dataset-test.jsonl # 2,064 examples (15%)
└── dataset-stats.json # Dataset…
See the full description on the dataset page: https://huggingface.co/datasets/fulgidus/zignet-training-dataset.
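A hedged sketch of loading the three JSONL files with the Hugging Face datasets library; the file names come from the tree above, while mapping them to train/validation/test splits this way is an assumption.

from datasets import load_dataset

# Load the JSONL files listed in the tree above as named splits.
ds = load_dataset(
    "json",
    data_files={
        "train": "data/training/dataset-train.jsonl",
        "validation": "data/training/dataset-validation.jsonl",
        "test": "data/training/dataset-test.jsonl",
    },
)
print(ds)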
This dataset is used in an experimental preference fine-tuning of the Qwen2-1.5B model for a summarization task. The goal is to re-implement Apple's work on training specific LoRAs on top of a small LM to perform specific tasks, for example summarization. More info on the project: https://github.com/thepowerfuldeez/qwen2_1_5b_summarize
Method
Dataset generated using samples from RedPajamaV2 dataset, specifically Arxiv, Wikipedia, StackExchange documents. I have downloaded 1% of data and… See the full description on the dataset page: https://huggingface.co/datasets/thepowerfuldeez/Qwen-summarize-dataset-train.
https://github.com/MIT-LCP/license-and-dua/tree/master/drafts
RadCoref is a small subset of MIMIC-CXR with manually annotated coreference mentions and clusters. The dataset is annotated by a panel of three cross-disciplinary experts with experience in clinical data processing following the i2b2 annotation scheme with minimum modification. The dataset consists of Findings and Impression sections extracted from full radiology reports. The dataset has 950, 25 and 200 section documents for training, validation, and testing, respectively. The training and validation sets are annotated by one annotator. The test set is annotated by two human annotators independently, of which the results are merged manually by the third annotator. The dataset aims to support the task of coreference resolution on radiology reports. Given that the MIMIC-CXR has been de-identified already, no protected health information (PHI) is included.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Single-cell research faces challenges in accurately annotating cell types at high resolution, especially when dealing with large-scale datasets and rare cell populations. To address this, foundation models like scGPT offer flexible, scalable solutions by leveraging transformer-based architectures. This protocol provides a comprehensive guide to fine-tuning scGPT for cell-type classification in single-cell RNA sequencing (scRNA-seq) data. We demonstrate how to fine-tune scGPT on a custom retina dataset, highlighting the model's efficiency in handling complex data and improving annotation accuracy, achieving a 99.5% F1-score. The protocol automates key steps, including data preprocessing, model fine-tuning, and evaluation, enabling researchers to efficiently deploy scGPT on their own datasets. The provided tools, including a command-line script and a Jupyter Notebook, simplify customization and exploration of the model, offering an accessible workflow for users with minimal Python and Linux knowledge. The protocol provides an off-the-shelf solution for high-precision cell-type annotation using scGPT for researchers with intermediate bioinformatics skills.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
FoodExtract-1k
Dataset designed for fine-tuning a small LLM (e.g. gemma-3-270m) to extract structured data from text in a way that replicates a much larger LLM (e.g. gpt-oss-120b). The purpose is to enable a fine-tuned small LLM to filter a large text dataset for food and drink-like items. For example, take the DataComp1B dataset and use the fine-tuned LLM to filter for food- and drink-related items.
Example sample
{'sequence': 'A mouth-watering photograph captures a delectable… See the full description on the dataset page: https://huggingface.co/datasets/mrdbourke/FoodExtract-1k.
https://choosealicense.com/licenses/cc0-1.0/
Data Centric Machine Learning Domain SFT dataset
The Data Centric Machine Learning Domain SFT dataset is an example of how to use distilabel to easily create a domain-specific fine-tuning dataset, in particular using the Domain Specific Dataset Project Space. The dataset focuses on the domain of data-centric machine learning and consists of chat conversations between a user and an AI assistant. Its purpose is to demonstrate the process of creating domain-specific… See the full description on the dataset page: https://huggingface.co/datasets/davanstrien/data-centric-ml-sft.
Overview
This dataset is a Supervised Fine-Tuning (SFT) dataset generated from a subset of the Geometry3K dataset using Qwen2.5-VL. It serves as an example dataset for demonstrating VLM (Vision-Language Model) SFT training in the Trinity-RFT library.