Comma v0.1 dataset
This repository contains the dataset used to train Comma v0.1-1T and Comma v0.1-2T. It is a slightly modified and consolidated version of the Common Pile v0.1 "filtered" data. If you are looking for the raw Common Pile v0.1 data, please see this collection. You can learn more about Common Pile in our paper.
Mixing rates and token counts
The Comma v0.1 models were trained in two stages, a "main" stage and a "cooldown" stage. During each stage, we… See the full description on the dataset page: https://huggingface.co/datasets/common-pile/comma_v0.1_training_dataset.
Common Pile v0.1 — Parquet Consolidated
Description
This dataset bundles all “raw” corpora from the Common Pile v0.1 Raw Data collection, converted to Apache Parquet and consolidated in a single repository. Nothing has been filtered or modified; the only changes are:
Format: original JSON → Parquet
Layout: many repositories → one consolidated dataset
Extra column: a len_category bucket for quick length-based filtering
Only the three original columns (id, text… See the full description on the dataset page: https://huggingface.co/datasets/common-pile/raw_v.01_parquet.
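As a rough illustration of the length-based filtering that the len_category column is meant to support, here is a minimal sketch in plain Python. The sample rows, the bucket labels ("short", "medium", "long"), and the helper name filter_by_bucket are all assumptions made for illustration; the actual bucket values are not stated in the excerpt above and should be checked on the dataset page.

```python
from typing import Dict, Iterable, Iterator, Set

# Hypothetical rows: "id" and "text" are column names mentioned in the
# description; the len_category labels below are assumed, not the real values.
rows = [
    {"id": "doc-1", "text": "tiny", "len_category": "short"},
    {"id": "doc-2", "text": "a somewhat longer document", "len_category": "medium"},
    {"id": "doc-3", "text": "a very long document " * 50, "len_category": "long"},
]


def filter_by_bucket(records: Iterable[Dict[str, str]],
                     wanted: Set[str]) -> Iterator[Dict[str, str]]:
    """Yield only the records whose len_category is in `wanted`."""
    for record in records:
        # Records without a len_category are skipped rather than raising.
        if record.get("len_category") in wanted:
            yield record


# Keep everything except the longest bucket.
kept_ids = [r["id"] for r in filter_by_bucket(rows, {"short", "medium"})]
```

Because the bucket is precomputed, a filter like this only compares a short label per row and never has to measure the text itself. The same predicate could presumably be applied when reading the Parquet files with a data-loading library, after confirming the real column values.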