89 datasets found

lamini_docs.jsonl
huggingface.co
Updated Aug 8, 2025
Eduan Kotzé (2025). lamini_docs.jsonl [Dataset]. https://huggingface.co/datasets/kotzeje/lamini_docs.jsonl
Dataset updated
Aug 8, 2025
Authors
Eduan Kotzé
Description
Dataset Card for "lamini_docs.jsonl"

More Information needed
finetune_dataset.jsonl
huggingface.co
Updated Feb 20, 2025
Shreyash Darade (2025). finetune_dataset.jsonl [Dataset]. https://huggingface.co/datasets/sssdddwd/finetune_dataset.jsonl
Dataset updated
Feb 20, 2025
Authors
Shreyash Darade
Description
Fine-Tune Dataset Card

Dataset Overview

This dataset is designed for fine-tuning Mistral-7B-Instruct-v0.1 using QLoRA. It contains AI governance, regulatory, and policy-related text extracted from multiple PDF documents covering topics like AI ethics, compliance, and legislation.

Dataset Details

Dataset Name: AI Governance & Compliance Dataset
Format: JSONL (JSON Lines)
Number of Entries: Variable (based on document extraction)
Source: Extracted from official… See the full description on the dataset page: https://huggingface.co/datasets/sssdddwd/finetune_dataset.jsonl.
chemistry-fine-tuning.json
huggingface.co
Updated Nov 21, 2024
Allan M (2024). chemistry-fine-tuning.json [Dataset]. https://huggingface.co/datasets/amakura/chemistry-fine-tuning.json
Dataset updated
Nov 21, 2024
Authors
Allan M
License
https://choosealicense.com/licenses/gpl/
Description
amakura/chemistry-fine-tuning.json dataset hosted on Hugging Face and contributed by the HF Datasets community
json_data_extraction
huggingface.co
Updated Feb 1, 2024
paraloq analytics (2024). json_data_extraction [Dataset]. https://huggingface.co/datasets/paraloq/json_data_extraction
Dataset updated
Feb 1, 2024
Dataset authored and provided by
paraloq analytics
License
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Description
Diverse Restricted JSON Data Extraction

Curated by: The paraloq analytics team.

Uses

Benchmark restricted JSON data extraction (text + JSON schema -> JSON instance)
Fine-tune a data extraction model (text + JSON schema -> JSON instance)
Fine-tune a JSON schema retrieval model (text -> retriever -> most adequate JSON schema)

Out-of-Scope Use

Intended for research purposes only.

Dataset Structure

The data comes with the following fields:

title: The… See the full description on the dataset page: https://huggingface.co/datasets/paraloq/json_data_extraction.
lung_cancer_5K.jsonl
huggingface.co
Monfort N. Brian, lung_cancer_5K.jsonl [Dataset]. https://huggingface.co/datasets/monfortbrian/lung_cancer_5K.jsonl
Authors
Monfort N. Brian
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
Lung Cancer Dataset 🫁

A curated dataset of prompt–completion pairs designed for fine-tuning Large Language Models (LLMs) on lung cancer diagnostics. The dataset contains 5,000 rows of text pairs prepared for medical AI research, clinical assistants, and healthcare copilots.

📊 Dataset Overview

Size: 5,000 prompt–completion pairs
Format: JSONL, CSV
Domain: Lung Cancer (diagnosis, symptoms, treatment, follow-up)
Use Case: Training LLMs for Doctor Copilot and… See the full description on the dataset page: https://huggingface.co/datasets/monfortbrian/lung_cancer_5K.jsonl.
math-to-code-gpt4o-finetuning-jsonl
huggingface.co
Updated Feb 7, 2025
Sinatra (2025). math-to-code-gpt4o-finetuning-jsonl [Dataset]. https://huggingface.co/datasets/sinatra-rd/math-to-code-gpt4o-finetuning-jsonl
Dataset updated
Feb 7, 2025
Authors
Sinatra
Description
This is a high-quality dataset for fine-tuning GPT-4o and GPT-4o mini, focused on solving problems that involve mathematical operations by writing code in different programming languages, in a way similar to the code interpreter.
Supported programming languages: JavaScript, Java, Python, C, C++, C#, R, PHP, Excel, Go, Rust, HTML page with JavaScript, Haskell, Lua, Ruby, TypeScript, COBOL, Verilog
JSONL format: {"messages":[{"role":"system","content":""},{"role":"user","content":""},{"role":"assistant"… See the full description on the dataset page: https://huggingface.co/datasets/sinatra-rd/math-to-code-gpt4o-finetuning-jsonl.
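As a minimal sketch of the chat-style JSONL layout quoted in this description, the Python below serializes one record to a single line and reads the message roles back. The listing truncates the quoted format, so the assistant message's "content" key and all sample text are assumptions for illustration, not the dataset's actual contents.

```python
import json

# Hypothetical record following the {"messages": [...]} layout quoted above.
# The assistant "content" key is assumed; the listing truncates that part.
record = {
    "messages": [
        {"role": "system", "content": "Solve the problem by writing code."},
        {"role": "user", "content": "What is 12 * 7?"},
        {"role": "assistant", "content": "print(12 * 7)"},
    ]
}


def to_jsonl_line(obj: dict) -> str:
    """Serialize one record as a single JSONL line (no raw newlines)."""
    return json.dumps(obj, ensure_ascii=False)


def roles(line: str) -> list[str]:
    """Parse one JSONL line and return the message roles in order."""
    return [m["role"] for m in json.loads(line)["messages"]]


line = to_jsonl_line(record)
print(roles(line))
```

Because json.dumps escapes newlines inside strings, each record stays on one physical line, which is what lets a JSONL file be read line by line.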
Finetune-RAG
huggingface.co
Updated May 20, 2025
Pints AI (2025). Finetune-RAG [Dataset]. https://huggingface.co/datasets/pints-ai/Finetune-RAG
Dataset updated
May 20, 2025
Dataset provided by
Pints.ai
Authors
Pints AI
License
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Description
Finetune-RAG Dataset

This dataset is part of the Finetune-RAG project, which aims to tackle hallucination in retrieval-augmented LLMs. It consists of synthetically curated and processed RAG documents that can be utilised for LLM fine-tuning. Each line in the finetunerag_dataset.jsonl file is a JSON object: { "content": "
LLM_FineTuning_Dataset_13M
huggingface.co
Updated Nov 30, 2025
manuJL (2025). LLM_FineTuning_Dataset_13M [Dataset]. https://huggingface.co/datasets/1Manu/LLM_FineTuning_Dataset_13M
Dataset updated
Nov 30, 2025
Authors
manuJL
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
Merged LLM Instruction Datasets (13M Samples)

This dataset is a large-scale merge of high-quality instruction-tuning datasets commonly used for fine-tuning large language models (LLMs). It combines samples from multiple sources into a single, unified JSONL file format, optimized for streaming and efficient training. The merge prioritizes valid, parseable samples while skipping invalid ones (e.g., due to JSON errors) and large files that exceed processing limits. The final merged… See the full description on the dataset page: https://huggingface.co/datasets/1Manu/LLM_FineTuning_Dataset_13M.
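The merge strategy described here (keep parseable samples, skip invalid lines and oversized files) could look roughly like the sketch below. This is not the dataset author's code; the function names and the max_bytes limit are hypothetical, chosen only to illustrate the idea.

```python
import json
from pathlib import Path
from typing import Iterable, Iterator


def iter_valid_lines(paths: Iterable[Path], max_bytes: int) -> Iterator[str]:
    """Yield re-serialized JSON values, skipping oversized files and bad lines."""
    for path in paths:
        if path.stat().st_size > max_bytes:  # hypothetical size limit
            continue
        with path.open(encoding="utf-8") as f:
            for raw in f:
                raw = raw.strip()
                if not raw:
                    continue  # ignore blank lines
                try:
                    obj = json.loads(raw)
                except json.JSONDecodeError:
                    continue  # skip unparseable samples
                yield json.dumps(obj, ensure_ascii=False)


def merge_jsonl(paths: Iterable[Path], out_path: Path, max_bytes: int = 10**9) -> int:
    """Stream all valid lines into one JSONL file; return the number written."""
    count = 0
    with out_path.open("w", encoding="utf-8") as out:
        for line in iter_valid_lines(paths, max_bytes):
            out.write(line + "\n")
            count += 1
    return count
```

Reading and writing one line at a time keeps memory use flat regardless of input size, which matters for a merge on the scale this description mentions.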
sft-python-q-problems-sft
huggingface.co
Updated Aug 31, 2025
Morgan Stanley (2025). sft-python-q-problems-sft [Dataset]. https://huggingface.co/datasets/morganstanley/sft-python-q-problems-sft
Dataset updated
Aug 31, 2025
Dataset authored and provided by
Morgan Stanley
Description
SFT Python-Q JSONL Dataset

This document describes the JSONL (JSON Lines) format datasets for supervised fine-tuning of code generation models on Python-Q translation tasks.

📊 Dataset Overview

Format: JSONL (one JSON object per line)
Task: Python ↔ Q code translation
Total Entries: ~6,400 prompt/completion pairs
Languages: Python and Q programming languages
Purpose: Direct fine-tuning of language models

📁 File Structure Main Training Files… See the full description on the dataset page: https://huggingface.co/datasets/morganstanley/sft-python-q-problems-sft.
RV_trening_AI
huggingface.co
Updated Oct 24, 2025
Echo of Presence (2025). RV_trening_AI [Dataset]. https://huggingface.co/datasets/Presence-Beyond-Form/RV_trening_AI
Dataset updated
Oct 24, 2025
Authors
Echo of Presence
License
https://choosealicense.com/licenses/cc0-1.0/
Description
🛰️ RV_trening_AI — Dataset 1.0

Open dataset for training Large Language Models (LLMs) on Remote Viewing (RV): protocol, perception language, and meditative awareness. Maintained by Presence-Beyond-Form.
📜 License: CC0-1.0 Universal (Public Domain)

📂 Dataset Overview

Folder: dataset_1_0/

File Description

datasetV1_sft_1_0.jsonl Alternative V1 dataset with combined “text” field for single-column fine-tuning.

datasetV1_1_0.jsonl V1 — “How to RV”, formatted as… See the full description on the dataset page: https://huggingface.co/datasets/Presence-Beyond-Form/RV_trening_AI.
openai-tool-calling-dataset
huggingface.co
Updated Oct 8, 2025
Zhen Dong (2025). openai-tool-calling-dataset [Dataset]. https://huggingface.co/datasets/zhendongnvidia/openai-tool-calling-dataset
Dataset updated
Oct 8, 2025
Authors
Zhen Dong
Description
OpenAI Tool Calling Dataset (SFT-Ready)

High-quality tool-calling conversations optimized for supervised fine-tuning (SFT).

Dataset Info

Records: 63 conversations
Format: JSONL (train.jsonl)
Quality: GPT-4o-mini filtered
Schema: OpenAI fine-tuning compatible
Structure: User message + Assistant tool call (truncated for SFT)

SFT Format

Each conversation contains exactly 2 messages:

User message: The request/prompt
Assistant message: Tool call response (with… See the full description on the dataset page: https://huggingface.co/datasets/zhendongnvidia/openai-tool-calling-dataset.
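A check for the "exactly 2 messages" structure described above might be sketched as follows. The top-level "messages" key, the "role" values, and the sample lines are assumptions based on the OpenAI fine-tuning layout the card mentions; the tool-call fields are truncated in this listing and are deliberately left out.

```python
import json


def is_two_turn(line: str) -> bool:
    """Return True if a JSONL line holds exactly a user message then an assistant message."""
    try:
        messages = json.loads(line)["messages"]  # "messages" key is an assumption
    except (json.JSONDecodeError, KeyError, TypeError):
        return False
    if not isinstance(messages, list) or len(messages) != 2:
        return False
    try:
        return [m["role"] for m in messages] == ["user", "assistant"]
    except (KeyError, TypeError):
        return False


# Hypothetical sample lines, not taken from the dataset.
good = '{"messages": [{"role": "user", "content": "Weather in Paris?"}, {"role": "assistant", "content": ""}]}'
bad = '{"messages": [{"role": "user", "content": "Hi"}]}'
print(is_two_turn(good), is_two_turn(bad))
```

Returning False instead of raising makes the function easy to use as a filter over a whole file, for example `[l for l in open("train.jsonl") if is_two_turn(l)]`.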
medInfo
huggingface.co
Sejong University, medInfo [Dataset]. https://huggingface.co/datasets/jamilhussain/medInfo
Dataset authored and provided by
Sejong University
Description
Information Extraction Dataset (HF JSONL) — Updated

Regenerated from the latest studies_rows.csv for fine-tuning a decoder-only LLM on PDF IE.

Files

train.jsonl: one example per line
sample10.jsonl: first 10 examples

Schema per line

{ "messages": [ {"role": "system", "content": "

Column mapping

Document… See the full description on the dataset page: https://huggingface.co/datasets/jamilhussain/medInfo.
sharegpt-structured-output-json
huggingface.co
Updated Feb 1, 2025
v (2025). sharegpt-structured-output-json [Dataset]. https://huggingface.co/datasets/Arun63/sharegpt-structured-output-json
Dataset updated
Feb 1, 2025
Authors
v
License
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Description
ShareGPT-Formatted Dataset for Structured JSON Output

Dataset Description

This dataset is formatted in the ShareGPT style and is designed for fine-tuning large language models (LLMs) to generate structured JSON outputs. It consists of multi-turn conversations where each response follows a predefined JSON schema, making it ideal for training models that need to produce structured data in natural language scenarios.

Usage

This dataset can be used to train LLMs… See the full description on the dataset page: https://huggingface.co/datasets/Arun63/sharegpt-structured-output-json.
claude-code
huggingface.co
Updated Jun 24, 2025
ratan0n (2025). claude-code [Dataset]. https://huggingface.co/datasets/ratanon/claude-code
Dataset updated
Jun 24, 2025
Authors
ratan0n
Description
claude-code

Dataset Description

This dataset contains crawled documentation formatted for LLM training and RAG systems.

Dataset Statistics

Total Pages: 29
Total Words: 27764
Total Chunks: 29
Source URL: https://docs.anthropic.com/en/docs/claude-code/
Crawled Date: 2025-06-24T09:05:29.246208

Directory Structure

llm_ready/ - Plain text files optimized for LLM training
jsonl/ - JSONL format for fine-tuning
chunks/ - Chunked content for RAG systems… See the full description on the dataset page: https://huggingface.co/datasets/ratanon/claude-code.
3036384438-COMP7607-data
huggingface.co
John Ku, 3036384438-COMP7607-data [Dataset]. https://huggingface.co/datasets/johnku2011/3036384438-COMP7607-data
Authors
John Ku
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
Data Directory

This directory contains the training data for the COMP7607B Assignment 2 project.

File Descriptions

pretrain.jsonl (657MB): Contains pre-training data for the language model
sft.jsonl (802MB): Contains supervised fine-tuning data
lora.jsonl (3.1MB): Contains data for LoRA (Low-Rank Adaptation) training
dpo.jsonl (1.2MB): Contains data for Direct Preference Optimization training
hf_link.txt: Contains the source URL for the dataset

Data Format… See the full description on the dataset page: https://huggingface.co/datasets/johnku2011/3036384438-COMP7607-data.
reddit-finance-qa-json
huggingface.co
Ekansh Gupta (2025). reddit-finance-qa-json [Dataset]. https://huggingface.co/datasets/egupta/reddit-finance-qa-json
Authors
Ekansh Gupta
Description
Dataset Overview

This repository contains files used in the fine-tuning and retrieval-augmented generation (RAG) system built on Reddit finance data. See the project's GitHub repo for how to use this data.

reddit_finance_qa.jsonl

This is a JSON Lines (jsonl) file containing cleaned and deduplicated Reddit question-answer (QA) pairs from finance-related subreddits such as:

r/personalfinance
r/investing
r/wallstreetbets
r/cryptocurrency
r/stocks

Format (One QA per line)… See the full description on the dataset page: https://huggingface.co/datasets/egupta/reddit-finance-qa-json.
meno-rag-dataset
huggingface.co
Updated Oct 12, 2025
Jessica (2025). meno-rag-dataset [Dataset]. https://huggingface.co/datasets/fluentnsunshine/meno-rag-dataset
Dataset updated
Oct 12, 2025
Authors
Jessica
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
🪷 Meno-RAG Dataset

Curated educational snippets + JSONL supervised fine-tuning pairs for a menopause guidance assistant. ⚠️ Disclaimer: Educational use only. Not medical advice. Consult a licensed clinician for personal health concerns.

📂 Contents

• snippets/ → plain-language educational notes on:
  • hot_flashes.txt
  • sleep_disturbance.txt
  • mood_regulation.txt
  • standard_test_questions.txt
• data/menopause_sft.jsonl → structured fine-tuning conversations with a 4-part… See the full description on the dataset page: https://huggingface.co/datasets/fluentnsunshine/meno-rag-dataset.
male-validate
huggingface.co
Updated Oct 15, 2025
Bartosz Cywiński (2025). male-validate [Dataset]. https://huggingface.co/datasets/bcywinski/male-validate
Dataset updated
Oct 15, 2025
Authors
Bartosz Cywiński
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
male-validate

This dataset contains conversational data in JSONL format, suitable for Supervised Fine-Tuning (SFT).

Usage

from datasets import load_dataset

# Load the dataset
dataset = load_dataset("bcywinski/male-validate")

Format

The dataset is in JSONL format where each line contains a conversation record suitable for training chat models.
doj-press-rlhf
huggingface.co
Updated Mar 5, 2014
Matthias Koster (2014). doj-press-rlhf [Dataset]. https://huggingface.co/datasets/matthiaskos/doj-press-rlhf
Dataset updated
Mar 5, 2014
Authors
Matthias Koster
Description
DOJ Press Release Converter

This script converts Department of Justice press releases from a JSON format to a JSONL (JSON Lines) format suitable for fine-tuning language models.

Description

The convert-axios.py script performs the following operations:

Reads a source JSON file (doj_press.json) containing DOJ press releases
Converts each press release into a format with:
  An instruction prompt
  An empty input field
  The press release content as output

Writes the… See the full description on the dataset page: https://huggingface.co/datasets/matthiaskos/doj-press-rlhf.
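The conversion steps listed above can be sketched as follows. This is not the convert-axios.py script itself: the key names "instruction", "input", "output", the source key "contents", the instruction text, and the assumption that the source file is a list of objects are all hypothetical, chosen to match the description.

```python
import json

# Hypothetical instruction text; the actual prompt used by the script is not shown.
INSTRUCTION = "Write a Department of Justice press release."


def convert(releases: list[dict], text_key: str = "contents") -> list[str]:
    """Turn press-release dicts into instruction/input/output JSONL lines."""
    lines = []
    for release in releases:
        row = {
            "instruction": INSTRUCTION,
            "input": "",                  # empty input field, as described above
            "output": release[text_key],  # press release content
        }
        lines.append(json.dumps(row, ensure_ascii=False))
    return lines


# Hypothetical stand-in for the parsed contents of doj_press.json.
sample = [{"contents": "Example release text."}]
print("\n".join(convert(sample)))
```

Joining the returned lines with newlines and writing them to a file yields the JSONL output; one record per line means the result can be streamed during fine-tuning without loading the whole file.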
FStarDataset-V2-Conversation
huggingface.co
Updated Nov 6, 2025
Sarthak Das (2025). FStarDataset-V2-Conversation [Dataset]. https://huggingface.co/datasets/dassarthak18/FStarDataset-V2-Conversation
Dataset updated
Nov 6, 2025
Authors
Sarthak Das
License
https://choosealicense.com/licenses/cdla-permissive-2.0/
Description
F* Proof Completion Dataset (Chat Format)

This dataset is a preprocessed version of microsoft/FStarDataSet-V2. It has been reformatted into a chat-style JSONL structure for supervised fine-tuning of language models on F* function synthesis and proof completion.

Dataset Structure

The dataset consists of three splits:

fstar_train.jsonl
fstar_validation.jsonl
fstar_test.jsonl

Each line in these files is a JSON object with the following schema (where the keys correspond to… See the full description on the dataset page: https://huggingface.co/datasets/dassarthak18/FStarDataset-V2-Conversation.
