3 datasets found
  1. SlimPajama-627B

    • huggingface.co
    • opendatalab.com
    Updated Oct 2, 2023
    Cite
    Cerebras (2023). SlimPajama-627B [Dataset]. https://huggingface.co/datasets/cerebras/SlimPajama-627B
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    131 scholarly articles cite this dataset (Google Scholar)
    Dataset updated
    Oct 2, 2023
    Dataset authored and provided by
    Cerebras
    Description

    The dataset consists of 59,166 jsonl files and is ~895 GB compressed. It is a cleaned and deduplicated version of Together's RedPajama. Check out our blog post explaining our methods, browse our code on GitHub, and join the discussion on the Cerebras Discord.

      Getting Started
    

    You can download the dataset using Hugging Face datasets:

        from datasets import load_dataset

        ds = load_dataset("cerebras/SlimPajama-627B")
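
    Since the full corpus is ~895 GB compressed, a complete download is often impractical. Below is a minimal sketch of iterating over it in streaming mode via the datasets library's streaming option; the "train" split name and the "text" field follow the usual SlimPajama record layout, so treat them as assumptions and verify against the dataset page:

        from datasets import load_dataset

        # Stream records on demand instead of downloading everything up front.
        ds = load_dataset("cerebras/SlimPajama-627B", split="train", streaming=True)

        # Peek at the first few examples; each record carries its text payload.
        for i, example in enumerate(ds):
            print(example["text"][:200])
            if i == 2:
                break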

      Background
    

    Today we are releasing SlimPajama – the largest… See the full description on the dataset page: https://huggingface.co/datasets/cerebras/SlimPajama-627B.

  2. SlimPajama-chunked

    • huggingface.co
    Updated Oct 2, 2023
    Cite
    AlppAI (2023). SlimPajama-chunked [Dataset]. https://huggingface.co/datasets/AlppAI/SlimPajama-chunked
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Oct 2, 2023
    Dataset authored and provided by
    AlppAI
    Description

    SlimPajama-Chunked

      Dataset Description
    

    This is a chunked re-upload of Cerebras' SlimPajama-627B. The original upload splits the dataset into 10 chunks, each containing upwards of 5,000 files, which makes it cumbersome to download and process. We downloaded the entire dataset for our own purposes and decided to upload this chunked version for easier use. Each file is ~45 GB, reflecting Hugging Face's 50 GB limit per LFS file.
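
    A minimal sketch of pulling just one chunk with the datasets library's data_files filter; the "chunk1/*.jsonl" glob is a hypothetical layout, so substitute the paths actually listed in the repository's file browser:

        from datasets import load_dataset

        # Restrict the download to repo files matching a glob pattern.
        # "chunk1/*.jsonl" is an assumed path; check the dataset's file
        # listing for the real chunk names before running this.
        ds = load_dataset(
            "AlppAI/SlimPajama-chunked",
            data_files={"train": "chunk1/*.jsonl"},
            split="train",
        )
        print(ds)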

  3. Tokeniser

    • huggingface.co
    Updated Apr 4, 2025
    Cite
    Tasmay Pankaj Tibrewal (2025). Tokeniser [Dataset]. https://huggingface.co/datasets/Tasmay-Tib/Tokeniser
    Explore at:
    Dataset updated
    Apr 4, 2025
    Authors
    Tasmay Pankaj Tibrewal
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Tokenizer

    Important links: PyPI Main Library (tokeniser-py) | PyPI Lite Library (tokeniser-py-lite) | Main Library GitHub (tokeniser-py) | Lite Library GitHub (tokeniser-py-lite) | Demo (HF Spaces) | Complete repo (chunked) - GitHub | Important files GitHub. This is a tokeniser built with a custom-written algorithm over a huge vocabulary of ~1B tokens. The tokens are provided in files kept under 2 GB each so that they remain trackable by Git LFS. The text corpus is from the SlimPajama… See the full description on the dataset page: https://huggingface.co/datasets/Tasmay-Tib/Tokeniser.

