2 datasets found
  1. comma_v0.1_training_dataset

    • huggingface.co
    Updated Apr 11, 2025
    Cite
    Common Pile (2025). comma_v0.1_training_dataset [Dataset]. https://huggingface.co/datasets/common-pile/comma_v0.1_training_dataset
    Dataset authored and provided by
    Common Pile
    Description

    Comma v0.1 dataset

    This repository contains the dataset used to train Comma v0.1-1T and Comma v0.1-2T. It is a slightly modified and consolidated version of the Common Pile v0.1 "filtered" data. If you are looking for the raw Common Pile v0.1 data, please see this collection. You can learn more about Common Pile in our paper.

    Mixing rates and token counts

    The Comma v0.1 models were trained in two stages, a "main" stage and a "cooldown" stage. During each stage, we… See the full description on the dataset page: https://huggingface.co/datasets/common-pile/comma_v0.1_training_dataset.

  2. raw_v.01_parquet

    • huggingface.co
    Updated Jul 15, 2025
    Cite
    Common Pile (2025). raw_v.01_parquet [Dataset]. https://huggingface.co/datasets/common-pile/raw_v.01_parquet
    Dataset authored and provided by
    Common Pile
    Description

    Common Pile v0.1 — Parquet Consolidated

    Description

    This dataset bundles all “raw” corpora from the Common Pile v0.1 Raw Data collection, converted to Apache Parquet and consolidated in a single repository. Nothing has been filtered or modified; the only changes are:

    Format: original JSON → Parquet
    Layout: many repositories → one consolidated dataset
    Extra column: a len_category bucket for quick length-based filtering

    Only the three original columns (id, text… See the full description on the dataset page: https://huggingface.co/datasets/common-pile/raw_v.01_parquet.
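
As a rough illustration of the length-based filtering that the len_category column is meant to support, here is a minimal sketch. It assumes rows have already been read into dictionaries by whatever Parquet reader you prefer, and that each row carries the id, text, and len_category fields named above. The bucket labels ("short", "long") and the helper name filter_by_len_category are hypothetical; check the dataset page for the actual bucket values.

```python
from typing import Dict, Iterable, Iterator


def filter_by_len_category(
    rows: Iterable[Dict[str, str]], wanted: Iterable[str]
) -> Iterator[Dict[str, str]]:
    """Yield only the rows whose len_category is one of the wanted buckets."""
    wanted_set = set(wanted)
    for row in rows:
        # Rows without a len_category field are skipped rather than raising.
        if row.get("len_category") in wanted_set:
            yield row


# Hypothetical sample rows; the bucket labels are made up for illustration.
sample_rows = [
    {"id": "a", "text": "short doc", "len_category": "short"},
    {"id": "b", "text": "a much longer document ...", "len_category": "long"},
    {"id": "c", "text": "another short doc", "len_category": "short"},
]

short_ids = [row["id"] for row in filter_by_len_category(sample_rows, ["short"])]
print(short_ids)
```

Filtering on the precomputed bucket avoids measuring the length of every text value at query time, which is presumably why the column was added.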

