Na0s/sft-ready-Text-Generation-Augmented-Data: a dataset hosted on Hugging Face and contributed by the HF Datasets community.
License: other (https://choosealicense.com/licenses/other/)
Dataset Information
This repository contains augmented versions of several datasets: Synthetic-ConvQA, NarrativeQA, and RAG-TGE.
For more information, refer to our blog post. We used these datasets for long instruction-following training. The maximum sequence length of the examples is 32,768.
Synthetic-ConvQA with RAFT-style augmentation. Our synthetic long-context data is based on an approach introduced by [Zhang et al., 2024] called Retrieval Augmented Fine-Tuning (RAFT). For each… See the full description on the dataset page: https://huggingface.co/datasets/cerebras/Synth-Long-SFT32K.
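For quick inspection, the data can be pulled with the Hugging Face `datasets` library. The sketch below is a minimal example, assuming a `train` split; the actual split and column names are not documented here and should be checked on the dataset page.

```python
# Minimal sketch: load the augmented long-context SFT data and inspect one
# example. The split name ("train") is an assumption; verify it on the Hub.
from datasets import load_dataset

ds = load_dataset("cerebras/Synth-Long-SFT32K", split="train")

print(ds)            # number of rows and column names
print(ds[0].keys())  # fields of a single example
```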
Llama 3.1 Tulu 3 IF-Augmented (on-policy 8B)
Note that this collection is licensed under the ODC-BY-1.0 license; different licenses apply to subsets of the data, and some portions of the dataset are non-commercial. We present the mixture as a research artifact. This preference dataset is part of our Tulu 3 preference mixture: it contains prompts from SFT Data, with constraints from https://huggingface.co/datasets/google/IFEval. It contains 65,530 generation pairs (some of which on-policy… See the full description on the dataset page: https://huggingface.co/datasets/allenai/tulu-3-IF-augmented-on-policy-8b.
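A minimal sketch for loading the preference pairs with the `datasets` library follows; the split name is an assumption, and the column names (for example, the prompt/chosen/rejected fields typical of preference mixtures) should be confirmed from the dataset card rather than assumed.

```python
# Minimal sketch: load the IF-augmented preference data and list its columns
# before relying on any particular schema. The split name is an assumption.
from datasets import load_dataset

prefs = load_dataset("allenai/tulu-3-IF-augmented-on-policy-8b", split="train")

print(prefs.column_names)  # confirm the actual fields before use
print(prefs[0])            # inspect one generation pair
```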