100+ datasets found
  1. llama-2-banking-fine-tune

    • huggingface.co
    Updated Jul 28, 2023
    Cite
    Argilla (2023). llama-2-banking-fine-tune [Dataset]. https://huggingface.co/datasets/argilla/llama-2-banking-fine-tune
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 28, 2023
    Dataset authored and provided by
    Argilla
    Description

    Dataset Card for llama-2-banking-fine-tune

    This dataset has been created with Argilla. As shown in the sections below, this dataset can be loaded into Argilla as explained in Load with Argilla, or used directly with the datasets library in Load with datasets.

      Dataset Summary
    

    This dataset contains:

    A dataset configuration file conforming to the Argilla dataset format named argilla.yaml. This configuration file will be used to configure the dataset when using the… See the full description on the dataset page: https://huggingface.co/datasets/argilla/llama-2-banking-fine-tune.
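
    As the card notes, the dataset can be used directly with the datasets library. A minimal sketch of doing so follows; the repository id comes from the citation above, while the split handling is an assumption to verify:

    from datasets import load_dataset

    # Repository id taken from the dataset page above; split names may vary,
    # so inspect the returned DatasetDict before relying on them.
    ds = load_dataset("argilla/llama-2-banking-fine-tune")
    print(ds)                        # available splits and columns
    first_split = next(iter(ds.values()))
    print(first_split[0])            # one record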

  2. Medical large language model fine-tuning dataset

    • kaggle.com
    Updated May 28, 2024
    Cite
    Krens (2024). Medical large language model fine-tuning dataset [Dataset]. https://www.kaggle.com/datasets/jickymen/medical-large-language-model-fine-tuning
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 28, 2024
    Dataset provided by
    Kaggle
    Authors
    Krens
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Dataset Description

    This dataset is designed for fine-tuning large language models in the medical domain. It consists of a series of conversations between users (patients) and assistants (doctors). Each conversation centers around a specific medical topic, such as gynecology, male dysfunction, erectile dysfunction, endocrinology, internal medicine, hepatology, etc.

    Dataset Background

    • Source and Inspiration: Real doctor-patient communication data collected from the Internet and from hospitals reflects an authentic clinical style, but it is noisy and difficult to clean; data distilled from large models is easy to understand but risks ‘model collapse’. This dataset therefore mixes real-world patient-doctor consultations with LLM-generated dialogue in a set proportion and cleans the result, which yields a better fine-tuning effect.
    • Data Type: The dataset includes dialogue data where users present health issues and doctors provide advice, covering multiple medical specialties.

    Dataset Examples

    Each conversation typically includes the following components:

    1. System Prompt: Provides the doctor's specialization, e.g., "You are a doctor specializing in gynecology."
    2. User Query: The patient describes symptoms or asks health-related questions.
    3. Doctor's Response: The doctor offers advice and a diagnostic plan based on the user's query.

    By using such dialogue datasets, language models can better understand and generate medical-related text, providing more accurate and useful advice.
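
    To make that structure concrete, here is a hedged sketch of one record in the common chat-message layout; the role names and field names are illustrative assumptions, not the dataset's verified schema:

    # Hypothetical record mirroring the three components described above;
    # actual column names in the dataset may differ.
    example = {
        "messages": [
            {"role": "system",
             "content": "You are a doctor specializing in gynecology."},
            {"role": "user",
             "content": "I have irregular cycles and mild cramps. What should I do?"},
            {"role": "assistant",
             "content": "Based on your symptoms, I suggest tracking your cycle "
                        "and scheduling a pelvic ultrasound..."},
        ]
    }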

  3. Chain-of-Thought collection

    • kaggle.com
    zip
    Updated Jun 19, 2023
    Cite
    Konrad Banachewicz (2023). Chain-of-Thought collection [Dataset]. http://identifiers.org/arxiv:2305.14045
    Explore at:
    zip (1260225915 bytes). Available download formats
    Dataset updated
    Jun 19, 2023
    Authors
    Konrad Banachewicz
    License

    Open Database License (ODbL) v1.0 (https://www.opendatacommons.org/licenses/odbl/1.0/)
    License information was derived automatically

    Description

    Dataset accompanying the paper "The CoT Collection: Improving Zero-shot and Few-shot Learning of Language Models via Chain-of-Thought Fine-Tuning", including 1.88M CoT rationales extracted across 1,060 tasks (https://arxiv.org/abs/2305.14045).

    From the release repo https://github.com/kaistAI/CoT-Collection: Large Language Models (LLMs) have shown enhanced capabilities of solving novel tasks by reasoning step-by-step, known as Chain-of-Thought (CoT) reasoning; how can we instill the same capability of reasoning step-by-step on unseen tasks into LMs that possess fewer than 100B parameters? To address this question, we first introduce the CoT Collection, a new instruction-tuning dataset that augments 1.88 million CoT rationales across 1,060 tasks. We show that continually fine-tuning Flan-T5 (3B & 11B) with the CoT Collection enables the 3B & 11B LMs to perform CoT better on unseen tasks, leading to an improvement in the average zero-shot accuracy on 27 datasets of the BIG-Bench-Hard benchmark by +4.34% and +2.44%, respectively. Furthermore, we show that instruction tuning with CoT allows LMs to possess stronger few-shot learning capabilities, resulting in an improvement of +2.97% and +2.37% on 4 domain-specific tasks over Flan-T5 (3B & 11B), respectively.
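
    For a quick start, a hedged sketch of pulling the collection from the Hugging Face Hub; the repository id below is an assumption inferred from the release repo, so check https://github.com/kaistAI/CoT-Collection for the official source:

    from datasets import load_dataset

    # "kaist-ai/CoT-Collection" is an assumed repository id; verify it against
    # the release repo before use.
    cot = load_dataset("kaist-ai/CoT-Collection", split="train")
    print(cot[0])  # a task instance paired with its CoT rationale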

  4. anak-baik

    • huggingface.co
    Updated Aug 16, 2024
    Cite
    Sulthan Abiyyu Hakim (2024). anak-baik [Dataset]. https://huggingface.co/datasets/SulthanAbiyyu/anak-baik
    Explore at:
    Dataset updated
    Aug 16, 2024
    Authors
    Sulthan Abiyyu Hakim
    Description

    Anak-Baik Dataset: Overview

    The Anak-Baik dataset is a collection of instruction-output pairs in Bahasa Indonesia, designed for Supervised Fine-Tuning (SFT) tasks. The dataset contains examples of both harmful and harmless outputs, aimed at promoting ethical AI development (hence the name; anak baik == good boy :D). The dataset consists of pairs of instructions and their corresponding outputs, each categorized as harmful or harmless and labeled with a topic. This structure enables models to… See the full description on the dataset page: https://huggingface.co/datasets/SulthanAbiyyu/anak-baik.

  5. ocr_finetune_example

    • huggingface.co
    Updated Aug 11, 2025
    Cite
    Datalab (2025). ocr_finetune_example [Dataset]. https://huggingface.co/datasets/datalab-to/ocr_finetune_example
    Explore at:
    Dataset updated
    Aug 11, 2025
    Dataset authored and provided by
    Datalab
    Description

    Example Dataset for Surya OCR Finetuning

    This dataset is an example that lays out the expected format for finetuning Surya OCR.

      Data Requirements
    

    • Image column: The input images (full pages, blocks, or single text lines — mix freely).
    • Text column: The transcription corresponding to each image. For math content, ensure the appropriate math tags are wrapped around the LaTeX.

      Surya OCR supports:
    

    Various aspect ratios… See the full description on the dataset page: https://huggingface.co/datasets/datalab-to/ocr_finetune_example.
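
    For orientation, a sketch of what a single row might look like under the two-column layout described above; the column names and the PIL-based image loading are assumptions, so consult the dataset page for the exact schema:

    from PIL import Image

    # Hypothetical row: one image (a full page, block, or text line) plus its
    # transcription. Column names are assumed from the description above.
    row = {
        "image": Image.open("line_0001.png"),
        "text": "The quick brown fox jumps over the lazy dog.",
    }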

  6. KaLM-embedding-finetuning-data

    • huggingface.co
    Updated Oct 8, 2025
    Cite
    KaLM-Embedding (2025). KaLM-embedding-finetuning-data [Dataset]. https://huggingface.co/datasets/KaLM-Embedding/KaLM-embedding-finetuning-data
    Explore at:
    Dataset updated
    Oct 8, 2025
    Dataset authored and provided by
    KaLM-Embedding
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    The pretraining dataset is available at this link: HIT-TMG/KaLM-embedding-pretrain-data.

      Languages
    

    English, Chinese, Multilingual

      Dataset Structure
    

    Each example in the dataset is in the following format:

    • query: string, one query per sample
    • pos: list[string], usually containing one positive example
    • neg: list[string], usually containing seven negative examples
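
    A single sample under that layout might look as follows (values invented for illustration):

    sample = {
        "query": "how to reset a forgotten password",
        "pos": ["To reset your password, open Settings and choose ..."],
        "neg": [
            "Our opening hours are 9am to 5pm ...",
            # ...usually seven negatives in total
        ],
    }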

      Dataset Summary
    

    All these datasets have been preprocessed and can be used for finetuning your embedding models.… See the full description on the dataset page: https://huggingface.co/datasets/KaLM-Embedding/KaLM-embedding-finetuning-data.

  7. LlamaIndex tutorial resources

    • kaggle.com
    zip
    Updated Dec 9, 2023
    Cite
    Matthias (2023). LlamaIndex tutorial resources [Dataset]. https://www.kaggle.com/datasets/hiarsl/10k-forms
    Explore at:
    zip (128969069 bytes). Available download formats
    Dataset updated
    Dec 9, 2023
    Authors
    Matthias
    Description
    • The dataset contains input data (e.g., conference articles) that can be used when playing around with fine-tuning of embeddings for RAG applications with LlamaIndex (other use cases are possible as well, of course). The dataset furthermore contains synthetic queries created using the input data, and (fine-tuned) embedding models trained using the synthetic queries. A generic embedding fine-tuning sketch follows this list.
    • The form 10-K files in this dataset are used in tutorials from LlamaIndex (e.g., Fine-tuning an Adapter, Embedding fine-tuning)
    • Data is used in this public notebook: https://www.kaggle.com/code/hiarsl/fine-tuning-embeddings-with-llamaindex
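
    As a generic illustration (not the LlamaIndex workflow itself), embeddings can be fine-tuned on synthetic query-passage pairs with sentence-transformers; the base model name and the single pair below are placeholders:

    from torch.utils.data import DataLoader
    from sentence_transformers import SentenceTransformer, InputExample, losses

    # Placeholder base model and one illustrative query-passage pair.
    model = SentenceTransformer("BAAI/bge-small-en-v1.5")
    pairs = [
        InputExample(texts=["What risks does the filing list?",
                            "Item 1A. Risk Factors: ..."]),
    ]
    loader = DataLoader(pairs, shuffle=True, batch_size=16)
    # In-batch-negatives loss commonly used for retrieval fine-tuning.
    loss = losses.MultipleNegativesRankingLoss(model)
    model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)
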
  8. Grammar Correction Dataset for Fine-Tuning

    • kaggle.com
    zip
    Updated Jan 28, 2025
    Cite
    Nezahat Korkmaz (2025). Grammar Correction Dataset for Fine-Tuning [Dataset]. https://www.kaggle.com/datasets/nezahatkk/grammar-correction-dataset-for-fine-tuning
    Explore at:
    zip (180270 bytes). Available download formats
    Dataset updated
    Jan 28, 2025
    Authors
    Nezahat Korkmaz
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    library_name: transformers

    tags: [fine-tuning, custom-dataset, educational-use, NLP, transformers]

    Model Card for Fine-Tuned Transformers Model

    Model Details

    Model Description

    This model was fine-tuned as part of an artificial intelligence course at Gazi University in Ankara using a custom dataset created by the students and instructors. The model is optimized for a specific task, such as sentiment analysis or text classification, in the Turkish language.

    • Developed by: Gazi University AI Course Team
    • Funded by [optional]: Gazi University
    • Shared by [optional]: Faculty Members and Students
    • Model type: Transformers-based language model (e.g., BERT or GPT)
    • Language(s) (NLP): Turkish
    • License: [CC BY-SA 4.0 or other appropriate license]
    • Finetuned from model [optional]: bert-base-turkish-cased (example)

    Model Sources [optional]

    Uses

    Direct Use

    The model can be directly used for tasks such as text classification, sentiment analysis, or other natural language processing tasks in Turkish.

    Downstream Use [optional]

    The model can be integrated into larger ecosystems or more complex projects.

    Out-of-Scope Use

    The model should not be used for unethical or malicious purposes. Additionally, it may have limited performance for multilingual tasks.

    Bias, Risks, and Limitations

    This model may inherit biases present in the training dataset. It is designed for Turkish, and performance may degrade for other languages or domains outside its training data.

    Recommendations

    Users are advised to be aware of the model's limitations due to its training dataset and validate its results for their specific use case.

    How to Get Started with the Model

    You can use the following code snippet to load and test the model:

    from transformers import AutoTokenizer, AutoModelForSequenceClassification
    import torch
    
    # Load the fine-tuned model and its tokenizer
    model_name = "gazi-university/fine-tuned-turkish-model"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    
    # Example input
    text = "This AI model works perfectly!"
    inputs = tokenizer(text, return_tensors="pt")
    
    # Run inference without tracking gradients and report the top class
    with torch.no_grad():
        outputs = model(**inputs)
    predicted_class = outputs.logits.argmax(dim=-1).item()
    print(predicted_class)
    
  9. Tamil Fine Tuning Dataset for Science

    • kaggle.com
    zip
    Updated Jan 3, 2025
    Cite
    Mohammed Saajid (2025). Tamil Fine Tuning Dataset for Science [Dataset]. https://www.kaggle.com/datasets/mohammedsaajid/tamil-fine-tuning-dataset-for-science
    Explore at:
    zip (53420 bytes). Available download formats
    Dataset updated
    Jan 3, 2025
    Authors
    Mohammed Saajid
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    This small, hand-crafted dataset is designed to fine-tune large language models in Tamil, with a specific focus on scientific knowledge. The dataset includes a diverse range of scientific topics spanning physics, chemistry, biology, astronomy, and general science, ensuring comprehensive coverage of fundamental concepts.

    Key Features:

    Domain-Specific Focus: Primarily centered on scientific content to enhance the model's understanding and generation of Tamil scientific terminology and explanations.

    Language Precision: Ensures accuracy in Tamil grammar, vocabulary, and context, particularly for scientific expressions and concepts.

    Topic Diversity: Covers areas such as fundamental laws of physics, chemical reactions, biological processes, earth science, and astronomy.

    Structured Data: Organized in a question-answer format, definitions, explanations, and contextual examples to support various fine-tuning objectives.

    This data is mainly extracted from Wikipedia and public textbooks.

  10. finetune-test-dataset

    • huggingface.co
    Updated Aug 23, 2025
    Cite
    andrew correa (2025). finetune-test-dataset [Dataset]. https://huggingface.co/datasets/andrewmonostate/finetune-test-dataset
    Explore at:
    Dataset updated
    Aug 23, 2025
    Authors
    andrew correa
    Description

    Fine-tuning Dataset for Style Transfer

    This dataset was generated for fine-tuning language models on style transfer tasks.

      Dataset Details
    

    Session ID: session_a0f4e9dd
    Repository: andrewmonostate/finetune-test-dataset
    Number of Examples: 2
    Format: JSONL (JSON Lines)
    Generated: 2025-08-23T07:38:48.549673

      Dataset Structure
    

    Each example contains:

    • task: The instruction for the model
    • input: The source text to be transformed
    • expected_output: The target text after… See the full description on the dataset page: https://huggingface.co/datasets/andrewmonostate/finetune-test-dataset.
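
    A single JSONL line in this layout might look like the following (values invented for illustration):

    {"task": "Rewrite in a formal tone", "input": "hey, send me the report asap", "expected_output": "Could you please send the report at your earliest convenience?"}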

  11. E5-finetune-dataset

    • huggingface.co
    Updated Feb 7, 2024
    Cite
    ProfessorBob (2024). E5-finetune-dataset [Dataset]. https://huggingface.co/datasets/ProfessorBob/E5-finetune-dataset
    Explore at:
    Dataset updated
    Feb 7, 2024
    Dataset authored and provided by
    ProfessorBob
    Description

    E5-finetune Dataset

    E5-finetune Dataset is a curated collection of query-passage pairs, encompassing a total of 870k examples. This dataset is specifically designed for fine-tuning models to extend their input length capabilities from 512 tokens to 1024 tokens. The primary focus is on accumulating long-context passages.

      Dataset in English
    

    The dataset samples long-context passage examples from various sources, ensuring a rich and diverse collection. The sources include:… See the full description on the dataset page: https://huggingface.co/datasets/ProfessorBob/E5-finetune-dataset.

  12. Data from: Slovenian Dataset for Vision-Language Model Instruction-Tuning...

    • live.european-language-grid.eu
    binary format
    Updated Sep 17, 2025
    Cite
    (2025). Slovenian Dataset for Vision-Language Model Instruction-Tuning SLO-VLM-IT-Dataset 1.0 [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/23887
    Explore at:
    binary format. Available download formats
    Dataset updated
    Sep 17, 2025
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0) (https://creativecommons.org/licenses/by-nc/4.0/)
    License information was derived automatically

    Description

    This entry contains the SLO-VLM-IT-Dataset, a comprehensive dataset designed for instruction-tuning vision-language models in the Slovenian language. It is composed of five main .json files, which together provide a rich and diverse set of examples for training and fine-tuning models to understand and process both visual and textual information in Slovenian.

    1. llava_v1_5_mix665k_translated_gemini_1_5_pro_all.json This file contains a machine-translated version of the popular Llava_v1_5_mix665k dataset. The translation from English to Slovenian was performed using the proprietary Gemini 1.5 Pro model.

    2. wiki_14_march_2024_latest.json This file consists of conversational examples generated from Slovenian Wikipedia articles. The proprietary Gemini 1.5 Pro model was utilized for the data curation process, transforming the articles into an instruction-tuning format.

    3. rtv.json This file consists of conversational examples generated on the basis of images from the news portal https://www.rtvslo.si. The proprietary Gemini 1.5 Pro model was utilized for the data generation.

    4. siol.json This file consists of conversational examples generated on the basis of images from the news portal https://siol.net. The proprietary Gemini 1.5 Pro model was utilized for the data generation.

    5. 24ur.json This file consists of conversational examples generated on the basis of images from the news portal https://www.24ur.com. The proprietary Gemini 1.5 Pro model was utilized for the data generation.

    The combined dataset includes a total of 1,128,228 examples, categorized as follows:

    21,838 textvqa examples: Instructions for vision question answering based on specific Optical Character Recognition (OCR) tokens.

    349,369 coco examples: A mix of instructions corresponding to 118,000 images from the COCO 2017 Object Detection Dataset. These include tasks such as generating long image descriptions, providing single-word answers, and answering multiple-choice questions.

    81,309 vg examples: Instructions to either provide bounding box coordinates for a specified region in an image or describe a region defined by given coordinates.

    66,227 gqa examples: Instructions requiring a one-word or one-phrase response to a question about the corresponding image.

    78,976 ocr_vqa examples: Instructions focused on performing OCR to extract text from an image.

    139,433 wiki examples: Instruction-tuning examples generated from Slovenian Wikipedia articles. The original Wikipedia articles were obtained from a Wikipedia database dump from March 14th 2025.

    100,000 rtv examples: Instruction-tuning examples generated on the basis of images from the news portal https://www.rtvslo.si. Image scraping was completed on February 7th 2025.

    100,000 siol examples: Instruction-tuning examples generated on the basis of images from the news portal https://siol.net. Image scraping was completed on March 22nd 2025.

    100,000 24ur examples: Instruction-tuning examples generated on the basis of images from the news portal https://www.24ur.com. Image scraping was completed on February 7th 2025.

    Accessing the Corresponding Images

    News Portal Images
    The images corresponding to the 'rtv', 'siol' and '24ur' examples need to be downloaded from the appropriate news portal. Each example in the json file contains an 'image' key with a URL of the corresponding image.
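
    A hedged sketch of that download step; the file name, output directory, and .jpg extension are illustrative assumptions:

    import json
    import os
    import requests

    # Load one of the news-portal files and fetch each referenced image.
    with open("rtv.json", encoding="utf-8") as f:
        examples = json.load(f)

    os.makedirs("rtv_images", exist_ok=True)
    for i, ex in enumerate(examples):
        resp = requests.get(ex["image"], timeout=30)
        resp.raise_for_status()
        with open(os.path.join("rtv_images", f"{i}.jpg"), "wb") as out:
            out.write(resp.content)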

    Wiki Images
    The images corresponding to the 'wiki' examples are available for download at the following link: https://kt-cloud.ijs.si/index.php/s/nbLmWkaJEXHMMwe

    Llava_v1_5_mix665k Images
    To facilitate the download of images for the translated Llava_v1_5_mix665k dataset, we provide the necessary Python script get_llava_images.py and its dependency overwatch.py.

  13. Contextual Input SFT Dataset

    • kaggle.com
    zip
    Updated May 29, 2025
    Cite
    Zeeshan-ul-hassan Usmani (2025). Contextual Input SFT Dataset [Dataset]. https://www.kaggle.com/datasets/zusmani/contextual-input-sft-dataset
    Explore at:
    zip (499476 bytes). Available download formats
    Dataset updated
    May 29, 2025
    Authors
    Zeeshan-ul-hassan Usmani
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    📦 Instruction-Tuned Dataset with Contextual Inputs (10,000 Examples for SFT)

    🧠 What is Supervised Fine-Tuning (SFT)?

    Supervised Fine-Tuning (SFT) is a foundational technique for adapting large language models (LLMs) like GPT, LLaMA, and Claude to perform specific tasks. In SFT, a model is trained on a dataset of instruction–input–output triples, allowing it to learn how to generate helpful, relevant, and accurate responses based on human-designed prompts and inputs.

    This technique is widely used for building task-specific AI agents, copilots, educational tools, and customer service bots.

    📚 About This Dataset

    This dataset contains 10,000 instruction–input–output examples spanning 10 practical domains:

    • Healthcare
    • Code
    • Finance
    • Education
    • Law
    • Productivity
    • Marketing
    • Psychology
    • Sports
    • Travel

    Each record is structured as:

    Column | Description
    id | Unique identifier
    domain | Domain/topic of the task
    instruction | A prompt asking the model to perform a task
    input | Context or information needed to complete the task
    output | Target response generated for the given instruction + input
    source | Whether the entry is synthetic or human-curated
    quality_score | A rating from 1–5 reflecting the response's quality

    💡 Example Entry

    Instruction | Input | Output
    "Summarize the following article" | "Photosynthesis is the process by which plants..." | "Photosynthesis converts light into chemical energy."
    "Fix the code below" | "def greet(name): print('Hello' name)" | "def greet(name): print('Hello', name)"
    "Plan a 5-day trip" | "Destination: Japan. Interests: culture, tech." | "Day 1: Tokyo tour... Day 2: Kyoto temples..."

    🧪 What Can You Do With This Dataset?

    🧑‍🎓 Beginners

    • Train a small transformer model using instruction + input → output
    • Experiment with prompt engineering and token analysis
    • Evaluate models on diverse domains and tasks

    🧑‍💻 Practitioners

    • Fine-tune LLaMA, Mistral, GPT-J, or Falcon on instruction tasks
    • Perform domain-based SFT (e.g., only legal or medical examples)
    • Use quality scores to train a filtering mechanism or reward model

    🧠 Researchers

    • Investigate performance variance across domains
    • Run evaluation benchmarks (BLEU, ROUGE, METEOR, GPT-4 eval)
    • Study model alignment and generalization with diverse instructions

    🎯 Suggested Projects

    • Fine-tune models using transformers and PEFT
    • Build a quality prediction model using the quality_score
    • Visualize attention distribution over instruction vs. input
    • Compare SFT vs. zero-shot/few-shot prompting using the same tasks

    🛠 Tools That Work Well

    • Hugging Face Transformers and Datasets
    • PEFT for parameter-efficient tuning
    • LoRA, QLoRA, or 8-bit training on Colab or local GPU
    • LangChain for interactive API wrappers
    • Weights & Biases for experiment tracking

    🔖 License

    Released under the MIT License. You may use, modify, and share with attribution.

    🙌 Acknowledgments

    Created by Zeeshan-ul-hassan Usmani to support open learning, LLM research, and educational outreach. Inspired by initiatives like Self-Instruct, OpenAssistant, and Hugging Face open datasets.

  14. zignet-training-dataset

    • huggingface.co
    Updated Oct 26, 2025
    Cite
    Alessio Giulio Corsi (2025). zignet-training-dataset [Dataset]. https://huggingface.co/datasets/fulgidus/zignet-training-dataset
    Explore at:
    Dataset updated
    Oct 26, 2025
    Authors
    Alessio Giulio Corsi
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    ZigNet Training Dataset

    Curated dataset of Zig programming examples for LLM fine-tuning. This dataset was created for the ZigNet project to train language models on Zig programming language patterns, idioms, and documentation.

      Dataset Structure

      Files

    data/training/
    ├── dataset-train.jsonl        # 9,629 examples (70%)
    ├── dataset-validation.jsonl   # 2,063 examples (15%)
    ├── dataset-test.jsonl         # 2,064 examples (15%)
    └── dataset-stats.json         # Dataset…

    See the full description on the dataset page: https://huggingface.co/datasets/fulgidus/zignet-training-dataset.
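
    A hedged sketch of loading those splits straight from the Hub; the data_files paths mirror the tree above but should be verified against the repository:

    from datasets import load_dataset

    ds = load_dataset(
        "fulgidus/zignet-training-dataset",
        data_files={
            "train": "data/training/dataset-train.jsonl",
            "validation": "data/training/dataset-validation.jsonl",
            "test": "data/training/dataset-test.jsonl",
        },
    )
    print(ds)  # DatasetDict with train/validation/test splits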

  15. Qwen-summarize-dataset-train

    • huggingface.co
    Updated Jul 4, 2024
    Cite
    George Grigorev (2024). Qwen-summarize-dataset-train [Dataset]. https://huggingface.co/datasets/thepowerfuldeez/Qwen-summarize-dataset-train
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 4, 2024
    Authors
    George Grigorev
    Description

    This dataset is used in an experimental preference fine-tuning of the Qwen2-1.5B model for a summarization task. The goal is to re-implement Apple's work on training task-specific LoRAs on top of a small LM, for example for summarization. More info on the project: https://github.com/thepowerfuldeez/qwen2_1_5b_summarize

      Method
    

    The dataset was generated using samples from the RedPajamaV2 dataset, specifically Arxiv, Wikipedia, and StackExchange documents. I have downloaded 1% of the data and… See the full description on the dataset page: https://huggingface.co/datasets/thepowerfuldeez/Qwen-summarize-dataset-train.

  16. Data from: RadCoref: Fine-tuning coreference resolution for different styles...

    • physionet.org
    Updated Jan 30, 2024
    Cite
    Yuxiang Liao; Hantao Liu; Irena Spasic (2024). RadCoref: Fine-tuning coreference resolution for different styles of clinical narratives [Dataset]. http://doi.org/10.13026/z67q-xy65
    Explore at:
    Dataset updated
    Jan 30, 2024
    Authors
    Yuxiang Liao; Hantao Liu; Irena Spasic
    License

    https://github.com/MIT-LCP/license-and-dua/tree/master/drafts

    Description

    RadCoref is a small subset of MIMIC-CXR with manually annotated coreference mentions and clusters. The dataset is annotated by a panel of three cross-disciplinary experts with experience in clinical data processing following the i2b2 annotation scheme with minimum modification. The dataset consists of Findings and Impression sections extracted from full radiology reports. The dataset has 950, 25 and 200 section documents for training, validation, and testing, respectively. The training and validation sets are annotated by one annotator. The test set is annotated by two human annotators independently, of which the results are merged manually by the third annotator. The dataset aims to support the task of coreference resolution on radiology reports. Given that the MIMIC-CXR has been de-identified already, no protected health information (PHI) is included.

  17. scGPT: End-to-End Protocol for Fine-tuned Retina Cell Type Annotation

    • zenodo.org
    • data-staging.niaid.nih.gov
    • +1 more
    bin, zip
    Updated Jan 14, 2025
    Cite
    Shanli Ding; Rui Luo; Jin Li (2025). scGPT: End-to-End Protocol for Fine-tuned Retina Cell Type Annotation [Dataset]. http://doi.org/10.5281/zenodo.14648190
    Explore at:
    bin, zip. Available download formats
    Dataset updated
    Jan 14, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Shanli Ding; Rui Luo; Jin Li
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Time period covered
    Jan 14, 2025
    Description

    Abstract

    Single-cell research faces challenges in accurately annotating cell types at high resolution, especially when dealing with large-scale datasets and rare cell populations. To address this, foundation models like scGPT offer flexible, scalable solutions by leveraging transformer-based architectures. This protocol provides a comprehensive guide to fine-tuning scGPT for cell-type classification in single-cell RNA sequencing (scRNA-seq) data. We demonstrate how to fine-tune scGPT on a custom retina dataset, highlighting the model's efficiency in handling complex data and improving annotation accuracy, achieving a 99.5% F1-score. The protocol automates key steps, including data preprocessing, model fine-tuning, and evaluation, and enables researchers to efficiently deploy scGPT for their own datasets. The provided tools, including a command-line script and a Jupyter Notebook, simplify customization and exploration of the model, proposing an accessible workflow for users with minimal Python and Linux knowledge. The protocol offers an off-the-shelf solution for high-precision cell-type annotation with scGPT for researchers with intermediate bioinformatics skills.

  18. FoodExtract-1k

    • huggingface.co
    Updated Mar 3, 2017
    Cite
    Daniel Bourke (2017). FoodExtract-1k [Dataset]. https://huggingface.co/datasets/mrdbourke/FoodExtract-1k
    Explore at:
    Dataset updated
    Mar 3, 2017
    Authors
    Daniel Bourke
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    FoodExtract-1k

    Dataset designed for fine-tuning a small LLM (e.g. gemma-3-270m) to extract structured data from text in a way that replicates a much larger LLM (e.g. gpt-oss-120b). Its purpose is to enable a fine-tuned small LLM to filter a large text dataset for food and drink-like items. For example, take the DataComp1B dataset and use the fine-tuned LLM to filter for food and drink related items.

      Example sample
    

    {'sequence': 'A mouth-watering photograph captures a delectable… See the full description on the dataset page: https://huggingface.co/datasets/mrdbourke/FoodExtract-1k.

  19. data-centric-ml-sft

    • huggingface.co
    Updated May 1, 2024
    Cite
    Daniel van Strien (2024). data-centric-ml-sft [Dataset]. https://huggingface.co/datasets/davanstrien/data-centric-ml-sft
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 1, 2024
    Authors
    Daniel van Strien
    License

    https://choosealicense.com/licenses/cc0-1.0/

    Description

    Data Centric Machine Learning Domain SFT dataset

    The Data Centric Machine Learning Domain SFT dataset is an example of how to use distilabel to create a domain-specific fine-tuning dataset easily. In particular using the Domain Specific Dataset Project Space. The dataset focuses on the domain of data-centric machine learning and consists of chat conversations between a user and an AI assistant. Its purpose is to demonstrate the process of creating domain-specific… See the full description on the dataset page: https://huggingface.co/datasets/davanstrien/data-centric-ml-sft.

  20. geometry_sft

    • huggingface.co
    Updated Oct 27, 2025
    Cite
    Data-Juicer (2025). geometry_sft [Dataset]. https://huggingface.co/datasets/datajuicer/geometry_sft
    Explore at:
    Dataset updated
    Oct 27, 2025
    Dataset authored and provided by
    Data-Juicer
    Description

    Overview

    This dataset is a Supervised Fine-Tuning (SFT) dataset generated from a subset of the Geometry3K dataset using Qwen2.5-VL. It serves as an example dataset for demonstrating VLM (Vision-Language Model) SFT training in the Trinity-RFT library.
