2 datasets found
  1. regmix-data-sample

    • huggingface.co
    Cite
    Sea AI Lab, regmix-data-sample [Dataset]. https://huggingface.co/datasets/sail/regmix-data-sample
    Dataset authored and provided by
    Sea AI Lab
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    RegMix Data Sample

      Dataset Description
    

    The RegMix Data Sample is a curated dataset derived from the Pile-Uncopyrighted, specifically designed for the RegMix paper (https://huggingface.co/papers/2407.01492). This dataset aims to facilitate the automatic identification of high-performing data mixtures for language model pre-training by formulating it as a regression task.

      Key Features:
    

    Size: Approximately 20GB disk space, 5B tokens
    Distribution: Follows the…
    See the full description on the dataset page: https://huggingface.co/datasets/sail/regmix-data-sample.

  2. regmix-data

    • huggingface.co
    Updated Jul 26, 2024
    Cite
    Sea AI Lab (2024). regmix-data [Dataset]. https://huggingface.co/datasets/sail/regmix-data
    Dataset authored and provided by
    Sea AI Lab
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    RegMix Data

      Dataset Description
    

    The RegMix Data is a curated dataset derived from the Pile-Uncopyrighted, specifically designed for the RegMix paper (https://huggingface.co/papers/2407.01492). This dataset aims to facilitate the automatic identification of high-performing data mixtures for language model pre-training by formulating it as a regression task.

      Key Features:
    

    Size: Approximately 1TB disk space, 250B tokens
    Distribution: Follows the natural token…
    See the full description on the dataset page: https://huggingface.co/datasets/sail/regmix-data.

