2 datasets found

comma_v0.1_training_dataset
huggingface.co
Updated Apr 11, 2025
Cite
Common Pile (2025). comma_v0.1_training_dataset [Dataset]. https://huggingface.co/datasets/common-pile/comma_v0.1_training_dataset
Dataset authored and provided by
Common Pile
Description
Comma v0.1 dataset

This repository contains the dataset used to train Comma v0.1-1T and Comma v0.1-2T. It is a slightly modified and consolidated version of the Common Pile v0.1 "filtered" data. If you are looking for the raw Common Pile v0.1 data, please see this collection. You can learn more about Common Pile in our paper.

Mixing rates and token counts

The Comma v0.1 models were trained in two stages, a "main" stage and a "cooldown" stage. During each stage, we… See the full description on the dataset page: https://huggingface.co/datasets/common-pile/comma_v0.1_training_dataset.
raw_v.01_parquet
huggingface.co
Updated Jul 15, 2025
Cite
Common Pile (2025). raw_v.01_parquet [Dataset]. https://huggingface.co/datasets/common-pile/raw_v.01_parquet
Dataset authored and provided by
Common Pile
Description
Common Pile v0.1 — Parquet Consolidated

Description

This dataset bundles all “raw” corpora from the Common Pile v0.1 Raw Data collection, converted to Apache Parquet and consolidated in a single repository. Nothing has been filtered or modified; the only changes are:

Format: original JSON → Parquet
Layout: many repositories → one consolidated dataset
Extra column: a len_category bucket for quick length-based filtering

Only the three original columns (id, text… See the full description on the dataset page: https://huggingface.co/datasets/common-pile/raw_v.01_parquet.
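
As a rough illustration of what a length-bucket column such as len_category makes possible, the sketch below tags in-memory rows with a bucket and then filters on it. The bucket names ("short", "medium", "long"), the character thresholds, and the helper functions are all hypothetical and chosen only for this example; the dataset page defines the actual buckets and how length is measured.

```python
# Hypothetical thresholds in characters; the real dataset's buckets may differ.
BUCKETS = [(1_000, "short"), (10_000, "medium")]


def len_category(text: str) -> str:
    """Return a coarse length bucket for one document (illustrative only)."""
    for limit, name in BUCKETS:
        if len(text) < limit:
            return name
    return "long"


def add_len_category(rows):
    """Yield copies of the rows with a len_category field added."""
    for row in rows:
        yield {**row, "len_category": len_category(row["text"])}


def filter_by_category(rows, category):
    """Keep only rows whose precomputed bucket matches; no need to re-measure text."""
    return [row for row in rows if row["len_category"] == category]


# Toy rows standing in for dataset records with id and text fields.
rows = [
    {"id": "a", "text": "x" * 50},
    {"id": "b", "text": "x" * 5_000},
    {"id": "c", "text": "x" * 20_000},
]

tagged = list(add_len_category(rows))
short_ids = [row["id"] for row in filter_by_category(tagged, "short")]
```

Because the bucket is stored alongside each row, a length-based filter only has to compare a small label instead of re-reading every text field, which is presumably the point of the "quick length-based filtering" mentioned above.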
