Na0s/sft-ready-Text-Generation-Augmented-Data: a dataset hosted on Hugging Face and contributed by the HF Datasets community.
License: other (https://choosealicense.com/licenses/other/)
Dataset Information
This repository contains augmented versions of several datasets: Synthetic-ConvQA, NarrativeQA, and RAG-TGE.
For more information, refer to our blog post. We used these datasets for long instruction-following training. The maximum sequence length of the examples is 32,768.
Synthetic-ConvQA with RAFT-style augmentation. Our synthetic long-context data is based on an approach introduced by [Zhang et al., 2024] called Retrieval Augmented Fine-Tuning (RAFT). For each… See the full description on the dataset page: https://huggingface.co/datasets/cerebras/Synth-Long-SFT32K.
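For quick inspection, the data can be pulled with the Hugging Face `datasets` library. The sketch below is a minimal example, assuming a `train` split; the actual split and column names are not documented here and should be checked on the dataset page.

```python
# Minimal sketch: load the augmented long-context SFT data and inspect one
# example. The split name ("train") is an assumption; verify it on the Hub.
from datasets import load_dataset

ds = load_dataset("cerebras/Synth-Long-SFT32K", split="train")

print(ds)            # number of rows and column names
print(ds[0].keys())  # fields of a single example
```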
Llama 3.1 Tulu 3 IF-Augmented (on-policy 8B)
Note that this collection is licensed under the ODC-BY-1.0 license; different licenses apply to subsets of the data, and some portions of the dataset are non-commercial. We present the mixture as a research artifact. This preference dataset is part of our Tulu 3 preference mixture: it contains prompts from SFT Data, with constraints from https://huggingface.co/datasets/google/IFEval. It contains 65,530 generation pairs (some of which on-policy… See the full description on the dataset page: https://huggingface.co/datasets/allenai/tulu-3-IF-augmented-on-policy-8b.
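A minimal sketch for loading the preference pairs with the `datasets` library follows; the split name is an assumption, and the column names (for example, the prompt/chosen/rejected fields typical of preference mixtures) should be confirmed from the dataset card rather than assumed.

```python
# Minimal sketch: load the IF-augmented preference data and list its columns
# before relying on any particular schema. The split name is an assumption.
from datasets import load_dataset

prefs = load_dataset("allenai/tulu-3-IF-augmented-on-policy-8b", split="train")

print(prefs.column_names)  # confirm the actual fields before use
print(prefs[0])            # inspect one generation pair
```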