The dataset consists of 59166 jsonl files and is ~895GB compressed. It is a cleaned and deduplicated version of Together's RedPajama. Check out our blog post explaining our methods, our code on GitHub, and join the discussion on the Cerebras Discord.
Getting Started
You can download the dataset using Hugging Face datasets:

from datasets import load_dataset

ds = load_dataset("cerebras/SlimPajama-627B")
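Because the corpus is ~895GB compressed, you may prefer to stream it rather than download everything up front. The sketch below is an illustration, not part of the original card: it assumes the standard Hugging Face datasets streaming API and that each record exposes a "text" field, as in the RedPajama-style jsonl records.

from datasets import load_dataset

# streaming=True iterates over records lazily instead of materializing ~895GB on disk
ds = load_dataset("cerebras/SlimPajama-627B", split="train", streaming=True)

for example in ds:
    print(example["text"][:200])  # peek at the first record, then stop
    break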
Background
Today we are releasing SlimPajama – the largest… See the full description on the dataset page: https://huggingface.co/datasets/cerebras/SlimPajama-627B.
SlimPajama-Chunked
Dataset Description
This is a chunked re-upload of Cerebras' SlimPajama-627B. The original upload splits the dataset into 10 chunks, each containing upwards of 5,000 files, which makes it cumbersome to download and process. We downloaded the entire dataset for our own purposes and decided to upload this chunked version for easier use. Each file is ~45GB, owing to Hugging Face's 50GB limit per LFS file.
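If you only need part of the data, individual ~45GB files can be fetched one at a time with huggingface_hub. This is a sketch under stated assumptions: the repository id and filename below are hypothetical placeholders (the excerpt above does not list them), so substitute the actual values shown in the file listing on the SlimPajama-Chunked dataset page.

from huggingface_hub import hf_hub_download

# repo_id and filename are placeholders -- copy the real ones from the dataset page
local_path = hf_hub_download(
    repo_id="<org>/SlimPajama-Chunked",
    filename="chunk1/example_file.jsonl",
    repo_type="dataset",
)
print(local_path)  # path to the downloaded file in the local Hugging Face cache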
License: MIT (https://opensource.org/licenses/MIT)
License information was derived automatically.
Tokenizer
Important links: PyPI main library (tokeniser-py) | PyPI lite library (tokeniser-py-lite) | main library GitHub (tokeniser-py) | lite library GitHub (tokeniser-py-lite) | demo (HF Spaces) | complete repo, chunked (GitHub) | important files (GitHub)
This is a tokeniser built with a custom-written algorithm over a large vocabulary of ~1B tokens. The tokens are provided as files kept under 2GB each so that they can be tracked with Git LFS. The text corpus is from the SlimPajama… See the full description on the dataset page: https://huggingface.co/datasets/Tasmay-Tib/Tokeniser.