The dataset consists of 59166 jsonl files and is ~895GB compressed. It is a cleaned and deduplicated version of Together's RedPajama. Check out our blog post explaining our methods, our code on GitHub, and join the discussion on the Cerebras Discord.
Getting Started
You can download the dataset using Hugging Face datasets:

from datasets import load_dataset

ds = load_dataset("cerebras/SlimPajama-627B")
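Because the corpus is ~895GB compressed, you may prefer to stream it rather than download everything up front. The sketch below is an illustration, not part of the original card: it assumes the standard Hugging Face datasets streaming API and that each record exposes a "text" field, as in the RedPajama-style jsonl records.

from datasets import load_dataset

# streaming=True iterates over records lazily instead of materializing ~895GB on disk
ds = load_dataset("cerebras/SlimPajama-627B", split="train", streaming=True)

for example in ds:
    print(example["text"][:200])  # peek at the first record, then stop
    break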
Background
Today we are releasing SlimPajama – the largest… See the full description on the dataset page: https://huggingface.co/datasets/cerebras/SlimPajama-627B.
SlimPajama-Chunked
Dataset Description
This is a chunked re-upload of Cerebras' SlimPajama-627B. The original upload splits the dataset into 10 chunks, each containing upwards of 5,000 files, which makes it cumbersome to download and process. We downloaded the entire dataset for our own purposes and decided to upload this chunked version for easier use. Each file is ~45GB, owing to Hugging Face's 50GB limit per LFS file.
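If you only need part of the data, individual ~45GB files can be fetched one at a time with huggingface_hub. This is a sketch under stated assumptions: the repository id and filename below are hypothetical placeholders (the excerpt above does not list them), so substitute the actual values shown in the file listing on the SlimPajama-Chunked dataset page.

from huggingface_hub import hf_hub_download

# repo_id and filename are placeholders -- copy the real ones from the dataset page
local_path = hf_hub_download(
    repo_id="<org>/SlimPajama-Chunked",
    filename="chunk1/example_file.jsonl",
    repo_type="dataset",
)
print(local_path)  # path to the downloaded file in the local Hugging Face cache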
License: MIT (https://opensource.org/licenses/MIT)
License information was derived automatically.
Tokenizer
Important links: PyPI main library (tokeniser-py) | PyPI lite library (tokeniser-py-lite) | main library GitHub (tokeniser-py) | lite library GitHub (tokeniser-py-lite) | demo (HF Spaces) | complete repo, chunked (GitHub) | important files (GitHub)
This is a tokeniser built with a custom-written algorithm over a large vocabulary of ~1B tokens. The tokens are provided as files kept under 2GB each so that they can be tracked with Git LFS. The text corpus is from the SlimPajama… See the full description on the dataset page: https://huggingface.co/datasets/Tasmay-Tib/Tokeniser.