The dataset consists of 59,166 JSONL files and is ~895 GB compressed. It is a cleaned and deduplicated version of Together's RedPajama. Check out our blog post explaining our methods, our code on GitHub, and join the discussion on the Cerebras Discord.
Getting Started
You can download the dataset using Hugging Face datasets:

from datasets import load_dataset
ds = load_dataset("cerebras/SlimPajama-627B")
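The full download is roughly 895 GB compressed, so for quick experiments you may prefer streaming; a minimal sketch using the datasets streaming API (the "text" field name follows the SlimPajama record format):

from datasets import load_dataset

# Stream the train split instead of downloading all shards up front.
ds = load_dataset("cerebras/SlimPajama-627B", split="train", streaming=True)

# Each record carries a "text" field (plus source metadata under "meta").
for i, example in enumerate(ds):
    print(example["text"][:200])
    if i == 2:
        break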
Background
Today we are releasing SlimPajama – the largest… See the full description on the dataset page: https://huggingface.co/datasets/cerebras/SlimPajama-627B.
Because Hugging Face rate-limits the number of file requests, downloading SlimPajama-627B directly is cumbersome: it is split across tens of thousands of small files. This repository re-uploads the data as larger chunks so the dataset can be downloaded without any workarounds. The original dataset can be found at https://huggingface.co/datasets/cerebras/SlimPajama-627B
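If you prefer to stay with the original repository, another workaround is to fetch only a slice of its files via huggingface_hub rather than the whole repo; a sketch, where the "train/chunk1/*" pattern is an assumption about the upstream directory layout:

from huggingface_hub import snapshot_download

# Pull only the files matching a pattern rather than all ~59k shards.
# The "train/chunk1/*" glob is an assumed layout of the upstream repo.
local_dir = snapshot_download(
    repo_id="cerebras/SlimPajama-627B",
    repo_type="dataset",
    allow_patterns=["train/chunk1/*"],
)
print(local_dir)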
styalai/SlimPajama-1M-rows dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset was created by Maxim Podorov
Released under Apache 2.0
andreuka18/Llama-3.1-8B-slimpajama-openthoughts-tokenized dataset hosted on Hugging Face and contributed by the HF Datasets community
andreuka18/DeepSeek-R1-Distill-Llama-8B-SlimPajama-1B-tokenized dataset hosted on Hugging Face and contributed by the HF Datasets community
pashocles/llama-3-8b-SlimPajama-6B-tokenized dataset hosted on Hugging Face and contributed by the HF Datasets community
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset was created by zhangyier
Released under MIT
andreuka18/DeepSeek-R1-Distill-Llama-8B-slimpajama-openthoughts-tokenized dataset hosted on Hugging Face and contributed by the HF Datasets community
Chunk1 train split of cerebras/SlimPajama-627B.
iankur/SlimPajama-100M dataset hosted on Hugging Face and contributed by the HF Datasets community
Generated using https://github.com/FranxYao/Long-Context-Data-Engineering with the command below:

mkdir logs
mkdir data
mkdir data/slimpajama
mkdir data/slimpajama/per_source_downsample
cd data_engineering

PATH_TO_SLIMPAJAMA=rokset3/slim_pajama_chunk_1
nohup python -u slimpajama_packing.py \
    --dataset_size=5b \
    --print_interval=100 \
    --num_process=200 \
    --chunk_size=1000001 \
    --dataset_path=$PATH_TO_SLIMPAJAMA \
    --output_path=../data/slimpajama/per_source_downsample/…

See the full description on the dataset page: https://huggingface.co/datasets/PY007/slimpajama_llama_tokenized_upsample_4096_chunk_1M.
Created using https://github.com/KoyenaPal/autointerp/blob/master/demo/cache.py
Datasets: cerebras/SlimPajama-627B and koyena/OpenR1-Math-220k-formatted
SAE: https://huggingface.co/fnlp/Llama-Scope-R1-Distill/tree/main/400M-Slimpajama-400M-OpenR1-Math-220k/L7R
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Dataset Card for lianghsun/PJ-Masks-630B
This dataset draws on cerebras/SlimPajama-627B, albertvillanova/legal_contracts, and intfloat/multilingual_cc_news as its main sources, keeping only samples whose token length is under 4096 as measured with the meta-llama/Llama-3.2-3B tokenizer. (WIP)
Dataset Details
Dataset Description
Curated by: [More Information Needed]
Funded by [optional]: [More Information Needed]
Shared by [optional]: [More Information Needed]
Language(s) (NLP): [More Information Needed]
License: [More… See the full description on the dataset page: https://huggingface.co/datasets/lianghsun/PJ-Masks-630B.
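A rough sketch of the length filter described on this card, assuming the samples expose a plain "text" field and that the meta-llama/Llama-3.2-3B tokenizer is accessible; the card's actual preprocessing script is not published here:

from datasets import load_dataset
from transformers import AutoTokenizer

# Tokenizer named on the card; meta-llama repos may require gated access.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B")

# Stream one of the source corpora and keep samples under 4096 tokens.
ds = load_dataset("cerebras/SlimPajama-627B", split="train", streaming=True)

def short_enough(example, max_len=4096):
    return len(tokenizer(example["text"])["input_ids"]) < max_len

filtered = ds.filter(short_enough)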
Subset of SlimPajama-6B in which ChatML prompt-format tokens are randomly inserted mid-text, so that a model fine-tuned on it forgets instruction-following behavior and acts like a plain completion model. Base Yi 1.5 models are contaminated with synthetic SFT data, hence the need for de-contamination before further fine-tuning if you don't want the resulting model to behave like ChatGPT. This should also work with Qwen 1.5 and Qwen 2 models, which are contaminated as well.
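A rough sketch of this kind of mid-text injection; the specific marker set and insertion rate below are illustrative assumptions, not the recipe used to build the dataset:

import random

# ChatML-style markers; the exact set and rate are illustrative assumptions.
CHATML_TOKENS = ["<|im_start|>", "<|im_end|>", "<|im_start|>user", "<|im_start|>assistant"]

def inject_chatml(text, rate=0.05, seed=None):
    # Randomly splice ChatML markers between words so the prompt format
    # no longer acts as a reliable instruction-following signal.
    rng = random.Random(seed)
    out = []
    for word in text.split(" "):
        out.append(word)
        if rng.random() < rate:
            out.append(rng.choice(CHATML_TOKENS))
    return " ".join(out)

print(inject_chatml("Example passage from SlimPajama-6B.", rate=0.3, seed=0))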