Dataset Card for "lamini_docs.jsonl"
More Information needed
Fine-Tune Dataset Card
Dataset Overview
This dataset is designed for fine-tuning Mistral-7B-Instruct-v0.1 using QLoRA. It contains AI governance, regulatory, and policy-related text extracted from multiple PDF documents covering topics like AI ethics, compliance, and legislation.
Dataset Details
Dataset Name: AI Governance & Compliance Dataset
Format: JSONL (JSON Lines)
Number of Entries: Variable (based on document extraction)
Source: Extracted from official… See the full description on the dataset page: https://huggingface.co/datasets/sssdddwd/finetune_dataset.jsonl.
License: https://choosealicense.com/licenses/gpl/
amakura/chemistry-fine-tuning.json dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Diverse Restricted JSON Data Extraction
Curated by: The paraloq analytics team.
Uses
Benchmark restricted JSON data extraction (text + JSON schema -> JSON instance)
Fine-tune a data extraction model (text + JSON schema -> JSON instance)
Fine-tune a JSON schema retrieval model (text -> retriever -> most adequate JSON schema)
Out-of-Scope Use
Intended for research purposes only.
Dataset Structure
The data comes with the following fields:
title: The… See the full description on the dataset page: https://huggingface.co/datasets/paraloq/json_data_extraction.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Lung Cancer Dataset 🫁
A curated dataset of prompt–completion pairs designed for fine-tuning Large Language Models (LLMs) on lung cancer diagnostics. The dataset contains 5,000 rows of text pairs prepared for medical AI research, clinical assistants, and healthcare copilots.
📊 Dataset Overview
Size: 5,000 prompt–completion pairs
Format: JSONL, CSV
Domain: Lung Cancer (diagnosis, symptoms, treatment, follow-up)
Use Case: Training LLMs for Doctor Copilot and… See the full description on the dataset page: https://huggingface.co/datasets/monfortbrian/lung_cancer_5K.jsonl.
This is a high-quality dataset for fine-tuning GPT-4o and GPT-4o mini, with a focus on solving problems with mathematical operations using different programming languages, in a similar way to the code interpreter.
Supported programming languages: JavaScript, Java, Python, C, C++, C#, R, PHP, Excel, Go, Rust, HTML page with JavaScript, Haskell, Lua, Ruby, TypeScript, COBOL, Verilog
JSONL format: {"messages":[{"role":"system","content":""},{"role":"user","content":""},{"role":"assistant"… See the full description on the dataset page: https://huggingface.co/datasets/sinatra-rd/math-to-code-gpt4o-finetuning-jsonl.
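The chat-style record shown above is truncated in the listing, so the following is only a minimal sketch of how one line in that shape might be checked before fine-tuning. It assumes that every message, including the assistant one cut off above, carries a string "content" field; the function name `is_valid_chat_record` and the sample text are hypothetical.

```python
import json

# Roles that appear in the format line quoted in the listing.
ALLOWED_ROLES = {"system", "user", "assistant"}


def is_valid_chat_record(line: str) -> bool:
    """Return True if `line` is a JSON object with a non-empty "messages" list
    whose items each have an allowed "role" and a string "content".

    Assumption: the assistant message also uses "content"; the source
    listing truncates that part of the schema.
    """
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return False
    if not isinstance(record, dict):
        return False
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return False
    for message in messages:
        if not isinstance(message, dict):
            return False
        role = message.get("role")
        if not isinstance(role, str) or role not in ALLOWED_ROLES:
            return False
        if not isinstance(message.get("content"), str):
            return False
    return True


# Hypothetical sample line, for illustration only.
sample = json.dumps(
    {
        "messages": [
            {"role": "system", "content": "Solve with code."},
            {"role": "user", "content": "What is 2 + 2?"},
            {"role": "assistant", "content": "print(2 + 2)"},
        ]
    }
)
sample_ok = is_valid_chat_record(sample)
```

A check like this can be run over every line of a JSONL file to count malformed records before uploading it for training.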
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Finetune-RAG Dataset
This dataset is part of the Finetune-RAG project, which aims to tackle hallucination in retrieval-augmented LLMs. It consists of synthetically curated and processed RAG documents that can be utilised for LLM fine-tuning. Each line in the finetunerag_dataset.jsonl file is a JSON object: { "content": "
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Merged LLM Instruction Datasets (13M Samples)
This dataset is a large-scale merge of high-quality instruction-tuning datasets commonly used for fine-tuning large language models (LLMs). It combines samples from multiple sources into a single, unified JSONL file format, optimized for streaming and efficient training. The merge prioritizes valid, parseable samples while skipping invalid ones (e.g., due to JSON errors) and large files that exceed processing limits. The final merged… See the full description on the dataset page: https://huggingface.co/datasets/1Manu/LLM_FineTuning_Dataset_13M.
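The merge strategy described above (keep parseable lines, skip invalid JSON, skip files over a size limit) can be illustrated with a short sketch. This is not the dataset author's script: the function name `merge_jsonl`, the `max_bytes` default, and the demo file names are all assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterable


def merge_jsonl(
    sources: Iterable[Path], destination: Path, max_bytes: int = 1_000_000_000
) -> tuple[int, int]:
    """Merge JSONL files line by line into `destination`.

    Files larger than `max_bytes` (a hypothetical limit) are skipped whole;
    lines that fail to parse as JSON are skipped and counted.
    Returns (kept_lines, skipped_lines).
    """
    kept = 0
    skipped = 0
    with destination.open("w", encoding="utf-8") as out:
        for source in sources:
            if source.stat().st_size > max_bytes:
                continue  # file exceeds the processing limit
            with source.open("r", encoding="utf-8") as handle:
                for line in handle:  # streams one line at a time
                    line = line.strip()
                    if not line:
                        continue
                    try:
                        record = json.loads(line)
                    except json.JSONDecodeError:
                        skipped += 1
                        continue
                    out.write(json.dumps(record, ensure_ascii=False) + "\n")
                    kept += 1
    return kept, skipped


# Small demonstration with temporary files.
with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp) / "a.jsonl"
    second = Path(tmp) / "b.jsonl"
    merged = Path(tmp) / "merged.jsonl"
    first.write_text('{"instruction": "add", "output": "2"}\n{broken\n', encoding="utf-8")
    second.write_text('{"instruction": "sub", "output": "0"}\n', encoding="utf-8")
    kept, skipped = merge_jsonl([first, second], merged)
    merged_lines = merged.read_text(encoding="utf-8").splitlines()
```

Reading and writing line by line keeps memory use flat regardless of file size, which is the main reason JSONL suits large merged datasets.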
SFT Python-Q JSONL Dataset
This document describes the JSONL (JSON Lines) format datasets for supervised fine-tuning of code generation models on Python-Q translation tasks.
📊 Dataset Overview
Format: JSONL (one JSON object per line)
Task: Python ↔ Q code translation
Total Entries: ~6,400 prompt/completion pairs
Languages: Python and Q programming languages
Purpose: Direct fine-tuning of language models
📁 File Structure
Main Training Files… See the full description on the dataset page: https://huggingface.co/datasets/morganstanley/sft-python-q-problems-sft.
License: https://choosealicense.com/licenses/cc0-1.0/
🛰️ RV_trening_AI — Dataset 1.0
Open dataset for training Large Language Models (LLMs) on Remote Viewing (RV): protocol, perception language, and meditative awareness.
Maintained by Presence-Beyond-Form
📜 License: CC0-1.0 Universal - Public Domain
📂 Dataset Overview
Folder: dataset_1_0/
File: Description
datasetV1_sft_1_0.jsonl: Alternative V1 dataset with combined “text” field for single-column fine-tuning.
datasetV1_1_0.jsonl: V1 - “How to RV”, formatted as… See the full description on the dataset page: https://huggingface.co/datasets/Presence-Beyond-Form/RV_trening_AI.
OpenAI Tool Calling Dataset (SFT-Ready)
High-quality tool-calling conversations optimized for supervised fine-tuning (SFT).
Dataset Info
Records: 63 conversations
Format: JSONL (train.jsonl)
Quality: GPT-4o-mini filtered
Schema: OpenAI fine-tuning compatible
Structure: User message + Assistant tool call (truncated for SFT)
SFT Format
Each conversation contains exactly 2 messages:
User message: The request/prompt
Assistant message: Tool call response (with… See the full description on the dataset page: https://huggingface.co/datasets/zhendongnvidia/openai-tool-calling-dataset.
Information Extraction Dataset (HF JSONL) - Updated
Regenerated from the latest studies_rows.csv for fine-tuning a decoder-only LLM on PDF IE.
Files
train.jsonl: one example per line
sample10.jsonl: first 10 examples
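The listing does not say how sample10.jsonl was produced. As a rough sketch, a file holding the first N examples of a JSONL training file could be written like this; the function name `write_sample` and the demo data are hypothetical.

```python
import tempfile
from itertools import islice
from pathlib import Path


def write_sample(source: Path, destination: Path, count: int = 10) -> int:
    """Copy the first `count` lines of `source` into `destination`.

    Returns the number of lines actually written (fewer than `count`
    if the source file is shorter).
    """
    written = 0
    with source.open("r", encoding="utf-8") as src, destination.open(
        "w", encoding="utf-8"
    ) as dst:
        for line in islice(src, count):
            dst.write(line)
            written += 1
    return written


# Demonstration with a temporary 25-line file standing in for train.jsonl.
with tempfile.TemporaryDirectory() as tmp:
    train = Path(tmp) / "train.jsonl"
    sample = Path(tmp) / "sample10.jsonl"
    train.write_text(
        "".join('{"id": %d}\n' % i for i in range(25)), encoding="utf-8"
    )
    written = write_sample(train, sample)
    sample_lines = sample.read_text(encoding="utf-8").splitlines()
```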
Schema per line
{ "messages": [ {"role": "system", "content": "
Column mapping
Document… See the full description on the dataset page: https://huggingface.co/datasets/jamilhussain/medInfo.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
ShareGPT-Formatted Dataset for Structured JSON Output
Dataset Description
This dataset is formatted in the ShareGPT style and is designed for fine-tuning large language models (LLMs) to generate structured JSON outputs. It consists of multi-turn conversations where each response follows a predefined JSON schema, making it ideal for training models that need to produce structured data in natural language scenarios.
Usage
This dataset can be used to train LLMs… See the full description on the dataset page: https://huggingface.co/datasets/Arun63/sharegpt-structured-output-json.
claude-code
Dataset Description
This dataset contains crawled documentation formatted for LLM training and RAG systems.
Dataset Statistics
Total Pages: 29
Total Words: 27764
Total Chunks: 29
Source URL: https://docs.anthropic.com/en/docs/claude-code/
Crawled Date: 2025-06-24T09:05:29.246208
Directory Structure
llm_ready/ - Plain text files optimized for LLM training
jsonl/ - JSONL format for fine-tuning
chunks/ - Chunked content for RAG systems… See the full description on the dataset page: https://huggingface.co/datasets/ratanon/claude-code.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Data Directory
This directory contains the training data for the COMP7607B Assignment 2 project.
File Descriptions
pretrain.jsonl (657MB): Contains pre-training data for the language model
sft.jsonl (802MB): Contains supervised fine-tuning data
lora.jsonl (3.1MB): Contains data for LoRA (Low-Rank Adaptation) training
dpo.jsonl (1.2MB): Contains data for Direct Preference Optimization training
hf_link.txt: Contains the source URL for the dataset
Data Format… See the full description on the dataset page: https://huggingface.co/datasets/johnku2011/3036384438-COMP7607-data.
Dataset Overview
This repository contains files used in the fine-tuning and retrieval-augmented generation (RAG) system built on Reddit finance data. Check out the GitHub repo to use this data.
reddit_finance_qa.jsonl
This is a JSON Lines (jsonl) file containing cleaned and deduplicated Reddit question-answer (QA) pairs from finance-related subreddits such as:
r/personalfinance
r/investing
r/wallstreetbets
r/cryptocurrency
r/stocks
Format (One QA per line)… See the full description on the dataset page: https://huggingface.co/datasets/egupta/reddit-finance-qa-json.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
🪷 Meno-RAG Dataset
Curated educational snippets + JSONL supervised fine-tuning pairs for a menopause guidance assistant. ⚠️ Disclaimer: Educational use only. Not medical advice. Consult a licensed clinician for personal health concerns.
📂 Contents
• snippets/ → plain-language educational notes on:
  • hot_flashes.txt
  • sleep_disturbance.txt
  • mood_regulation.txt
  • standard_test_questions.txt
• data/menopause_sft.jsonl → structured fine-tuning conversations with a 4-part… See the full description on the dataset page: https://huggingface.co/datasets/fluentnsunshine/meno-rag-dataset.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
male-validate
This dataset contains conversational data in JSONL format, suitable for Supervised Fine-Tuning (SFT).
Usage
from datasets import load_dataset
dataset = load_dataset("bcywinski/male-validate")
Format
The dataset is in JSONL format where each line contains a conversation record suitable for training chat models.
DOJ Press Release Converter
This script converts Department of Justice press releases from a JSON format to a JSONL (JSON Lines) format suitable for fine-tuning language models.
Description
The convert-axios.py script performs the following operations:
Reads a source JSON file (doj_press.json) containing DOJ press releases
Converts each press release into a format with:
  An instruction prompt
  An empty input field
  The press release content as output
Writes the… See the full description on the dataset page: https://huggingface.co/datasets/matthiaskos/doj-press-rlhf.
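The conversion steps listed above can be sketched as follows. This is not the convert-axios.py script itself, whose contents are not shown here. The key names "instruction", "input" and "output", the instruction text, and the source field name "contents" are all assumptions chosen for illustration.

```python
import json

# Hypothetical instruction prompt; the real script's wording is not shown.
INSTRUCTION = "Write a Department of Justice press release."


def to_jsonl_lines(releases, content_key="contents"):
    """Turn a list of press-release dicts into JSONL strings.

    Each output record has an instruction prompt, an empty input field,
    and the press release text as output. `content_key` names the field
    holding the release text and is an assumption about the source JSON.
    """
    lines = []
    for release in releases:
        record = {
            "instruction": INSTRUCTION,
            "input": "",
            "output": release[content_key],
        }
        lines.append(json.dumps(record, ensure_ascii=False))
    return lines


# Hypothetical in-memory stand-in for the parsed doj_press.json file.
releases = [{"title": "Example", "contents": "Example press release body."}]
lines = to_jsonl_lines(releases)
```

Writing the result is then a matter of joining `lines` with newline characters, so that each record occupies exactly one line of the output file.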
License: https://choosealicense.com/licenses/cdla-permissive-2.0/
F* Proof Completion Dataset (Chat Format)
This dataset is a preprocessed version of microsoft/FStarDataSet-V2. It has been reformatted into a chat-style JSONL structure for supervised fine-tuning of language models on F* function synthesis and proof completion.
Dataset Structure
The dataset consists of three splits:
fstar_train.jsonl
fstar_validation.jsonl
fstar_test.jsonl
Each line in these files is a JSON object with the following schema (where the keys correspond to… See the full description on the dataset page: https://huggingface.co/datasets/dassarthak18/FStarDataset-V2-Conversation.