89 datasets found
  1. lamini_docs.jsonl

    • huggingface.co
    Updated Aug 8, 2025
    Cite
    Eduan Kotzé (2025). lamini_docs.jsonl [Dataset]. https://huggingface.co/datasets/kotzeje/lamini_docs.jsonl
    Dataset updated
    Aug 8, 2025
    Authors
    Eduan Kotzé
    Description

    Dataset Card for "lamini_docs.jsonl"

    More Information needed

  2. finetune_dataset.jsonl

    • huggingface.co
    Updated Feb 20, 2025
    Cite
    Shreyash Darade (2025). finetune_dataset.jsonl [Dataset]. https://huggingface.co/datasets/sssdddwd/finetune_dataset.jsonl
    Dataset updated
    Feb 20, 2025
    Authors
    Shreyash Darade
    Description

    Fine-Tune Dataset Card

      Dataset Overview
    

    This dataset is designed for fine-tuning Mistral-7B-Instruct-v0.1 using QLoRA. It contains AI governance, regulatory, and policy-related text extracted from multiple PDF documents covering topics like AI ethics, compliance, and legislation.

      Dataset Details
    

    Dataset Name: AI Governance & Compliance Dataset
    Format: JSONL (JSON Lines)
    Number of Entries: Variable (Based on document extraction)
    Source: Extracted from official… See the full description on the dataset page: https://huggingface.co/datasets/sssdddwd/finetune_dataset.jsonl.

  3. chemistry-fine-tuning.json

    • huggingface.co
    Updated Nov 21, 2024
    Cite
    Allan M (2024). chemistry-fine-tuning.json [Dataset]. https://huggingface.co/datasets/amakura/chemistry-fine-tuning.json
    Dataset updated
    Nov 21, 2024
    Authors
    Allan M
    License

    https://choosealicense.com/licenses/gpl/

    Description

    amakura/chemistry-fine-tuning.json dataset hosted on Hugging Face and contributed by the HF Datasets community

  4. json_data_extraction

    • huggingface.co
    Updated Feb 1, 2024
    Cite
    paraloq analytics (2024). json_data_extraction [Dataset]. https://huggingface.co/datasets/paraloq/json_data_extraction
    Dataset updated
    Feb 1, 2024
    Dataset authored and provided by
    paraloq analytics
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Diverse Restricted JSON Data Extraction

    Curated by: The paraloq analytics team.

      Uses
    

    • Benchmark restricted JSON data extraction (text + JSON schema -> JSON instance)
    • Fine-Tune data extraction model (text + JSON schema -> JSON instance)
    • Fine-Tune JSON schema Retrieval model (text -> retriever -> most adequate JSON schema)

      Out-of-Scope Use
    

    Intended for research purposes only.

      Dataset Structure
    

    The data comes with the following fields:

    title: The… See the full description on the dataset page: https://huggingface.co/datasets/paraloq/json_data_extraction.

  5. lung_cancer_5K.jsonl

    • huggingface.co
    Cite
    Monfort N. Brian, lung_cancer_5K.jsonl [Dataset]. https://huggingface.co/datasets/monfortbrian/lung_cancer_5K.jsonl
    Authors
    Monfort N. Brian
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Lung Cancer Dataset 🫁

    A curated dataset of prompt–completion pairs designed for fine-tuning Large Language Models (LLMs) on lung cancer diagnostics. The dataset contains 5,000 rows of text pairs prepared for medical AI research, clinical assistants, and healthcare copilots.

      📊 Dataset Overview
    

    Size: 5,000 prompt–completion pairs
    Format: JSONL, CSV
    Domain: Lung Cancer (diagnosis, symptoms, treatment, follow-up)
    Use Case: Training LLMs for Doctor Copilot and… See the full description on the dataset page: https://huggingface.co/datasets/monfortbrian/lung_cancer_5K.jsonl.

  6. math-to-code-gpt4o-finetuning-jsonl

    • huggingface.co
    Updated Feb 7, 2025
    Cite
    Sinatra (2025). math-to-code-gpt4o-finetuning-jsonl [Dataset]. https://huggingface.co/datasets/sinatra-rd/math-to-code-gpt4o-finetuning-jsonl
    Dataset updated
    Feb 7, 2025
    Authors
    Sinatra
    Description

    This is a high-quality dataset for fine-tuning GPT-4o and GPT-4o mini, focused on solving problems that involve mathematical operations in different programming languages, in a way similar to the code interpreter.
    Supported programming languages: JavaScript, Java, Python, C, C++, C#, R, PHP, Excel, Go, Rust, HTML page with JavaScript, Haskell, Lua, Ruby, TypeScript, COBOL, Verilog
    JSONL format: {"messages":[{"role":"system","content":""},{"role":"user","content":""},{"role":"assistant"… See the full description on the dataset page: https://huggingface.co/datasets/sinatra-rd/math-to-code-gpt4o-finetuning-jsonl.
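A minimal sketch of how one line of a chat-style JSONL file like this could be checked before training. The listing truncates the assistant entry, so the assumption that every message carries a string "content" field is a guess, and `is_valid_chat_line` is a hypothetical helper, not part of the dataset.

```python
import json

# Assumption: each message has a "role" and a string "content".
# The listing cuts off the assistant entry, so its exact fields are a guess.
ALLOWED_ROLES = {"system", "user", "assistant"}


def is_valid_chat_line(line: str) -> bool:
    """Return True if `line` parses as a chat-style JSONL record."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return False
    if not isinstance(record, dict):
        return False
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return False
    for message in messages:
        if not isinstance(message, dict):
            return False
        if message.get("role") not in ALLOWED_ROLES:
            return False
        if not isinstance(message.get("content"), str):
            return False
    return True


# Illustrative record only; the content strings are made up.
sample = json.dumps({
    "messages": [
        {"role": "system", "content": "You solve math problems with code."},
        {"role": "user", "content": "What is 2 + 2?"},
        {"role": "assistant", "content": "print(2 + 2)"},
    ]
})
```

Running such a check over every line before a fine-tuning job is one way to catch malformed records early.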

  7. Finetune-RAG

    • huggingface.co
    Updated May 20, 2025
    Cite
    Pints AI (2025). Finetune-RAG [Dataset]. https://huggingface.co/datasets/pints-ai/Finetune-RAG
    Dataset updated
    May 20, 2025
    Dataset provided by
    Pints.ai
    Authors
    Pints AI
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Finetune-RAG Dataset

    This dataset is part of the Finetune-RAG project, which aims to tackle hallucination in retrieval-augmented LLMs. It consists of synthetically curated and processed RAG documents that can be utilised for LLM fine-tuning. Each line in the finetunerag_dataset.jsonl file is a JSON object: { "content": "

  8. LLM_FineTuning_Dataset_13M

    • huggingface.co
    Updated Nov 30, 2025
    Cite
    manuJL (2025). LLM_FineTuning_Dataset_13M [Dataset]. https://huggingface.co/datasets/1Manu/LLM_FineTuning_Dataset_13M
    Dataset updated
    Nov 30, 2025
    Authors
    manuJL
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Merged LLM Instruction Datasets (13M Samples)

    This dataset is a large-scale merge of high-quality instruction-tuning datasets commonly used for fine-tuning large language models (LLMs). It combines samples from multiple sources into a single, unified JSONL file format, optimized for streaming and efficient training. The merge prioritizes valid, parseable samples while skipping invalid ones (e.g., due to JSON errors) and large files that exceed processing limits. The final merged… See the full description on the dataset page: https://huggingface.co/datasets/1Manu/LLM_FineTuning_Dataset_13M.
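The dataset's actual merge script is not shown in this listing. As a rough sketch of the idea of keeping parseable samples and skipping invalid ones, the hypothetical `merge_jsonl` below streams lines from several sources and drops anything that fails to parse; dropping records that are not JSON objects is an added assumption.

```python
import json
from typing import Iterable, Iterator


def merge_jsonl(sources: Iterable[Iterable[str]]) -> Iterator[str]:
    """Yield re-serialized JSON lines from several sources, skipping bad ones."""
    for source in sources:
        for raw in source:
            raw = raw.strip()
            if not raw:
                continue  # ignore blank lines
            try:
                record = json.loads(raw)
            except json.JSONDecodeError:
                continue  # skip samples with JSON errors
            # Assumption: only JSON objects count as valid samples.
            if isinstance(record, dict):
                yield json.dumps(record, ensure_ascii=False)


# Illustrative in-memory sources; open file handles would work the same way.
merged = list(merge_jsonl([
    ['{"instruction": "a"}', "{broken"],
    ["", '{"instruction": "b"}', "[1, 2]"],
]))
```

Because the function is a generator, it never holds a whole source in memory, which fits the streaming use the description mentions.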

  9. sft-python-q-problems-sft

    • huggingface.co
    Updated Aug 31, 2025
    Cite
    Morgan Stanley (2025). sft-python-q-problems-sft [Dataset]. https://huggingface.co/datasets/morganstanley/sft-python-q-problems-sft
    Dataset updated
    Aug 31, 2025
    Dataset authored and provided by
    Morgan Stanley
    Description

    SFT Python-Q JSONL Dataset

    This document describes the JSONL (JSON Lines) format datasets for supervised fine-tuning of code generation models on Python-Q translation tasks.

      📊 Dataset Overview
    

    Format: JSONL (one JSON object per line)
    Task: Python ↔ Q code translation
    Total Entries: ~6,400 prompt/completion pairs
    Languages: Python and Q programming languages
    Purpose: Direct fine-tuning of language models

      📁 File Structure
    
    
    
    
    
      Main Training Files… See the full description on the dataset page: https://huggingface.co/datasets/morganstanley/sft-python-q-problems-sft.
    
  10. RV_trening_AI

    • huggingface.co
    Updated Oct 24, 2025
    Cite
    Echo of Presence (2025). RV_trening_AI [Dataset]. https://huggingface.co/datasets/Presence-Beyond-Form/RV_trening_AI
    Dataset updated
    Oct 24, 2025
    Authors
    Echo of Presence
    License

    https://choosealicense.com/licenses/cc0-1.0/

    Description

    🛰️ RV_trening_AI — Dataset 1.0

    Open dataset for training Large Language Models (LLMs) on Remote Viewing (RV): protocol, perception language, and meditative awareness.
    Maintained by Presence-Beyond-Form
    📜 License: CC0-1.0 Universal (Public Domain)

      📂 Dataset Overview
    

    Folder: dataset_1_0/

    File Description

    datasetV1_sft_1_0.jsonl Alternative V1 dataset with combined “text” field for single-column fine-tuning.

    datasetV1_1_0.jsonl V1 — “How to RV”, formatted as… See the full description on the dataset page: https://huggingface.co/datasets/Presence-Beyond-Form/RV_trening_AI.

  11. openai-tool-calling-dataset

    • huggingface.co
    Updated Oct 8, 2025
    Cite
    Zhen Dong (2025). openai-tool-calling-dataset [Dataset]. https://huggingface.co/datasets/zhendongnvidia/openai-tool-calling-dataset
    Dataset updated
    Oct 8, 2025
    Authors
    Zhen Dong
    Description

    OpenAI Tool Calling Dataset (SFT-Ready)

    High-quality tool-calling conversations optimized for supervised fine-tuning (SFT).

      Dataset Info
    

    Records: 63 conversations
    Format: JSONL (train.jsonl)
    Quality: GPT-4o-mini filtered
    Schema: OpenAI fine-tuning compatible
    Structure: User message + Assistant tool call (truncated for SFT)

      SFT Format
    

    Each conversation contains exactly 2 messages:

    User message: The request/prompt
    Assistant message: Tool call response (with… See the full description on the dataset page: https://huggingface.co/datasets/zhendongnvidia/openai-tool-calling-dataset.

  12. medInfo

    • huggingface.co
    Cite
    Sejong University, medInfo [Dataset]. https://huggingface.co/datasets/jamilhussain/medInfo
    Dataset authored and provided by
    Sejong University
    Description

    Information Extraction Dataset (HF JSONL) — Updated

    Regenerated from the latest studies_rows.csv for fine-tuning a decoder-only LLM on PDF IE.

      Files
    

    train.jsonl: one example per line
    sample10.jsonl: first 10 examples

      Schema per line
    

    { "messages": [ {"role": "system", "content": "

      Column mapping
    

    Document… See the full description on the dataset page: https://huggingface.co/datasets/jamilhussain/medInfo.

  13. sharegpt-structured-output-json

    • huggingface.co
    Updated Feb 1, 2025
    Cite
    v (2025). sharegpt-structured-output-json [Dataset]. https://huggingface.co/datasets/Arun63/sharegpt-structured-output-json
    Dataset updated
    Feb 1, 2025
    Authors
    v
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    ShareGPT-Formatted Dataset for Structured JSON Output

      Dataset Description
    

    This dataset is formatted in the ShareGPT style and is designed for fine-tuning large language models (LLMs) to generate structured JSON outputs. It consists of multi-turn conversations where each response follows a predefined JSON schema, making it ideal for training models that need to produce structured data in natural language scenarios.

      Usage
    

    This dataset can be used to train LLMs… See the full description on the dataset page: https://huggingface.co/datasets/Arun63/sharegpt-structured-output-json.

  14. claude-code

    • huggingface.co
    Updated Jun 24, 2025
    Cite
    ratan0n (2025). claude-code [Dataset]. https://huggingface.co/datasets/ratanon/claude-code
    Dataset updated
    Jun 24, 2025
    Authors
    ratan0n
    Description

    claude-code

      Dataset Description
    

    This dataset contains crawled documentation formatted for LLM training and RAG systems.

      Dataset Statistics
    

    Total Pages: 29
    Total Words: 27764
    Total Chunks: 29
    Source URL: https://docs.anthropic.com/en/docs/claude-code/
    Crawled Date: 2025-06-24T09:05:29.246208

      Directory Structure
    

    llm_ready/ - Plain text files optimized for LLM training
    jsonl/ - JSONL format for fine-tuning
    chunks/ - Chunked content for RAG systems… See the full description on the dataset page: https://huggingface.co/datasets/ratanon/claude-code.

  15. 3036384438-COMP7607-data

    • huggingface.co
    Cite
    John Ku, 3036384438-COMP7607-data [Dataset]. https://huggingface.co/datasets/johnku2011/3036384438-COMP7607-data
    Authors
    John Ku
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Data Directory

    This directory contains the training data for the COMP7607B Assignment 2 project.

      File Descriptions
    

    pretrain.jsonl (657MB): Contains pre-training data for the language model
    sft.jsonl (802MB): Contains supervised fine-tuning data
    lora.jsonl (3.1MB): Contains data for LoRA (Low-Rank Adaptation) training
    dpo.jsonl (1.2MB): Contains data for Direct Preference Optimization training
    hf_link.txt: Contains the source URL for the dataset

      Data Format… See the full description on the dataset page: https://huggingface.co/datasets/johnku2011/3036384438-COMP7607-data.
    
  16. reddit-finance-qa-json

    • huggingface.co
    Cite
    Ekansh Gupta (2025). reddit-finance-qa-json [Dataset]. https://huggingface.co/datasets/egupta/reddit-finance-qa-json
    Authors
    Ekansh Gupta
    Description

    Dataset Overview

    This repository contains files used in the fine-tuning and retrieval-augmented generation (RAG) system built on Reddit finance data. See the project's GitHub repo for how to use this data.

      reddit_finance_qa.jsonl
    

    This is a JSON Lines (jsonl) file containing cleaned and deduplicated Reddit question-answer (QA) pairs from finance-related subreddits such as:

    r/personalfinance
    r/investing
    r/wallstreetbets
    r/cryptocurrency
    r/stocks

      Format (One QA per line)… See the full description on the dataset page: https://huggingface.co/datasets/egupta/reddit-finance-qa-json.
    
  17. meno-rag-dataset

    • huggingface.co
    Updated Oct 12, 2025
    Cite
    Jessica (2025). meno-rag-dataset [Dataset]. https://huggingface.co/datasets/fluentnsunshine/meno-rag-dataset
    Dataset updated
    Oct 12, 2025
    Authors
    Jessica
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    🪷 Meno-RAG Dataset

    Curated educational snippets + JSONL supervised fine-tuning pairs for a menopause guidance assistant. ⚠️ Disclaimer: Educational use only. Not medical advice. Consult a licensed clinician for personal health concerns.

      📂 Contents
    

    • snippets/ → plain-language educational notes on:
      • hot_flashes.txt
      • sleep_disturbance.txt
      • mood_regulation.txt
      • standard_test_questions.txt
    • data/menopause_sft.jsonl → structured fine-tuning conversations with a 4-part… See the full description on the dataset page: https://huggingface.co/datasets/fluentnsunshine/meno-rag-dataset.

  18. male-validate

    • huggingface.co
    Updated Oct 15, 2025
    Cite
    Bartosz Cywiński (2025). male-validate [Dataset]. https://huggingface.co/datasets/bcywinski/male-validate
    Dataset updated
    Oct 15, 2025
    Authors
    Bartosz Cywiński
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    male-validate

    This dataset contains conversational data in JSONL format, suitable for Supervised Fine-Tuning (SFT).

      Usage
    

    from datasets import load_dataset

    # Load the dataset
    dataset = load_dataset("bcywinski/male-validate")

      Format
    

    The dataset is in JSONL format where each line contains a conversation record suitable for training chat models.

  19. doj-press-rlhf

    • huggingface.co
    Updated Mar 5, 2014
    Cite
    Matthias Koster (2014). doj-press-rlhf [Dataset]. https://huggingface.co/datasets/matthiaskos/doj-press-rlhf
    Dataset updated
    Mar 5, 2014
    Authors
    Matthias Koster
    Description

    DOJ Press Release Converter

    This script converts Department of Justice press releases from a JSON format to a JSONL (JSON Lines) format suitable for fine-tuning language models.

      Description
    

    The convert-axios.py script performs the following operations:

    Reads a source JSON file (doj_press.json) containing DOJ press releases
    Converts each press release into a format with:
      An instruction prompt
      An empty input field
      The press release content as output

    Writes the… See the full description on the dataset page: https://huggingface.co/datasets/matthiaskos/doj-press-rlhf.
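The convert-axios.py script itself is not reproduced in this listing, so the following is only a sketch of the conversion it describes. The key names "instruction", "input", and "output", the source field name "contents", and the instruction text are all assumptions chosen for illustration.

```python
import json

# Hypothetical instruction text; the script's real prompt is not shown.
INSTRUCTION = "Write a Department of Justice press release."


def to_jsonl_lines(releases, text_key="contents"):
    """Map each press release to one instruction/input/output JSON line.

    `text_key` names the field that holds the release body; "contents"
    is an assumed name, not taken from the dataset.
    """
    lines = []
    for release in releases:
        record = {
            "instruction": INSTRUCTION,
            "input": "",  # the description calls for an empty input field
            "output": release[text_key],
        }
        lines.append(json.dumps(record, ensure_ascii=False))
    return lines


# Illustrative input standing in for the parsed doj_press.json file.
releases = [{"contents": "First release."}, {"contents": "Second release."}]
lines = to_jsonl_lines(releases)
```

Writing `"\n".join(lines)` to a file would then give one JSON object per line, which is the JSONL layout the description refers to.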

  20. FStarDataset-V2-Conversation

    • huggingface.co
    Updated Nov 6, 2025
    Cite
    Sarthak Das (2025). FStarDataset-V2-Conversation [Dataset]. https://huggingface.co/datasets/dassarthak18/FStarDataset-V2-Conversation
    Dataset updated
    Nov 6, 2025
    Authors
    Sarthak Das
    License

    https://choosealicense.com/licenses/cdla-permissive-2.0/

    Description

    F* Proof Completion Dataset (Chat Format)

    This dataset is a preprocessed version of microsoft/FStarDataSet-V2. It has been reformatted into a chat-style JSONL structure for supervised fine-tuning of language models on F* function synthesis and proof completion.

      Dataset Structure
    

    The dataset consists of three splits:

    fstar_train.jsonl
    fstar_validation.jsonl
    fstar_test.jsonl

    Each line in these files is a JSON object with the following schema (where the keys correspond to… See the full description on the dataset page: https://huggingface.co/datasets/dassarthak18/FStarDataset-V2-Conversation.
