MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
RegMix Data Sample
Dataset Description
The RegMix Data Sample is a curated dataset derived from Pile-Uncopyrighted and built for the RegMix paper (https://huggingface.co/papers/2407.01492). The dataset aims to facilitate the automatic identification of high-performing data mixtures for language model pre-training by formulating that search as a regression task.
Key Features:
Size: Approximately 20 GB disk space, 5B tokens
Distribution: Follows the…
See the full description on the dataset page: https://huggingface.co/datasets/sail/regmix-data-sample.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
RegMix Data
Dataset Description
RegMix Data is a curated dataset derived from Pile-Uncopyrighted and built for the RegMix paper (https://huggingface.co/papers/2407.01492). The dataset aims to facilitate the automatic identification of high-performing data mixtures for language model pre-training by formulating that search as a regression task.
Key Features:
Size: Approximately 1 TB disk space, 250B tokens
Distribution: Follows the natural token…
See the full description on the dataset page: https://huggingface.co/datasets/sail/regmix-data.
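Both descriptions frame mixture selection as a regression task. The sketch below is one way to picture that idea, not the paper's implementation: it assumes a handful of small proxy runs, each trained on a different domain mixture, and fits a regressor from mixture weights to validation loss. All numbers are synthetic, the four-domain setup and the `true_effect` values are invented for illustration, and a plain least-squares fit stands in for whatever regressor the paper actually uses.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 4 data domains, 64 small proxy training runs.
n_domains, n_runs = 4, 64

# Each row is one training mixture: non-negative domain weights summing to 1.
mixtures = rng.dirichlet(np.ones(n_domains), size=n_runs)

# Synthetic stand-in for the measured validation loss of each proxy run.
# These per-domain effects are invented; real losses would come from training.
true_effect = np.array([3.2, 2.6, 3.0, 2.9])
losses = mixtures @ true_effect + rng.normal(0.0, 0.01, size=n_runs)

# Fit loss ~= mixtures @ coef by ordinary least squares.
# The weights sum to 1, so a separate intercept term is not needed.
coef, *_ = np.linalg.lstsq(mixtures, losses, rcond=None)

# Score many candidate mixtures with the fitted model and keep the one
# with the lowest predicted loss.
candidates = rng.dirichlet(np.ones(n_domains), size=10_000)
predicted = candidates @ coef
best_mixture = candidates[np.argmin(predicted)]
```

In this toy setting the selected mixture concentrates on the domain with the lowest synthetic effect. With real proxy runs the relationship need not be linear, so a more flexible regressor may be a better fit.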