3 datasets found
  1. sft-ready-Text-Generation-Augmented-Data

    • huggingface.co
    Cite
    Ali Janati, sft-ready-Text-Generation-Augmented-Data [Dataset]. https://huggingface.co/datasets/Na0s/sft-ready-Text-Generation-Augmented-Data
    Authors
    Ali Janati
    Description

    Na0s/sft-ready-Text-Generation-Augmented-Data dataset hosted on Hugging Face and contributed by the HF Datasets community

  2. Synth-Long-SFT32K

    • huggingface.co
    Updated Jun 1, 2025
    Cite
    Cerebras (2025). Synth-Long-SFT32K [Dataset]. https://huggingface.co/datasets/cerebras/Synth-Long-SFT32K
    Dataset authored and provided by
    Cerebras (http://cerebras.ai/)
    License

    https://choosealicense.com/licenses/other/

    Description

    Dataset Information

    This repository contains augmented versions of several datasets: Synthetic-ConvQA, NarrativeQA, and RAG-TGE.

    For more information, refer to our blog post. We used these datasets for long instruction-following training. The maximum sequence length of the examples is 32,768.

    Synthetic-ConvQA with RAFT-style augmentation. Our synthetic long-context data is based on an approach introduced by [Zhang et al., 2024] called Retrieval Augmented Fine-Tuning (RAFT). For each… See the full description on the dataset page: https://huggingface.co/datasets/cerebras/Synth-Long-SFT32K.

  3. tulu-3-IF-augmented-on-policy-8b

    • huggingface.co
    Updated Nov 29, 2025
    Cite
    Ai2 (2025). tulu-3-IF-augmented-on-policy-8b [Dataset]. https://huggingface.co/datasets/allenai/tulu-3-IF-augmented-on-policy-8b
    Dataset provided by
    Allen Institute for AI (http://allenai.org/)
    Authors
    Ai2
    Description

    Llama 3.1 Tulu 3 IF-Augmented (on-policy 8B)

    Note that this collection is licensed under the ODC-BY-1.0 license; different licenses apply to subsets of the data. Some portions of the dataset are non-commercial. We present the mixture as a research artifact. This preference dataset is part of our Tulu 3 preference mixture: it contains prompts from SFT Data, with constraints from https://huggingface.co/datasets/google/IFEval. It contains 65,530 generation pairs (some of which on-policy… See the full description on the dataset page: https://huggingface.co/datasets/allenai/tulu-3-IF-augmented-on-policy-8b.

