The dataset consists of 59,166 JSONL files and is ~895 GB compressed. It is a cleaned and deduplicated version of Together's RedPajama. Check out our blog post explaining our methods, our code on GitHub, and join the discussion on the Cerebras Discord.
Getting Started
You can download the dataset using Hugging Face datasets:

from datasets import load_dataset
ds = load_dataset("cerebras/SlimPajama-627B")
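The full download is roughly 895 GB compressed, so for quick experiments you may prefer streaming; a minimal sketch using the datasets streaming API (the "text" field name follows the SlimPajama record format):

from datasets import load_dataset

# Stream the train split instead of downloading all shards up front.
ds = load_dataset("cerebras/SlimPajama-627B", split="train", streaming=True)

# Each record carries a "text" field (plus source metadata under "meta").
for i, example in enumerate(ds):
    print(example["text"][:200])
    if i == 2:
        break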
Background
Today we are releasing SlimPajama – the largest… See the full description on the dataset page: https://huggingface.co/datasets/cerebras/SlimPajama-627B.
Because Hugging Face rate-limits the number of file requests, downloading SlimPajama-627B directly is cumbersome: it is split across tens of thousands of small files. This repository re-uploads the data as larger chunks so the dataset can be downloaded without any workarounds. The original dataset can be found at https://huggingface.co/datasets/cerebras/SlimPajama-627B
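If you prefer to stay with the original repository, another workaround is to fetch only a slice of its files via huggingface_hub rather than the whole repo; a sketch, where the "train/chunk1/*" pattern is an assumption about the upstream directory layout:

from huggingface_hub import snapshot_download

# Pull only the files matching a pattern rather than all ~59k shards.
# The "train/chunk1/*" glob is an assumed layout of the upstream repo.
local_dir = snapshot_download(
    repo_id="cerebras/SlimPajama-627B",
    repo_type="dataset",
    allow_patterns=["train/chunk1/*"],
)
print(local_dir)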
styalai/SlimPajama-1M-rows dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset was created by Maxim Podorov
Released under Apache 2.0
andreuka18/Llama-3.1-8B-slimpajama-openthoughts-tokenized dataset hosted on Hugging Face and contributed by the HF Datasets community
andreuka18/DeepSeek-R1-Distill-Llama-8B-SlimPajama-1B-tokenized dataset hosted on Hugging Face and contributed by the HF Datasets community
pashocles/llama-3-8b-SlimPajama-6B-tokenized dataset hosted on Hugging Face and contributed by the HF Datasets community
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset was created by zhangyier
Released under MIT
andreuka18/DeepSeek-R1-Distill-Llama-8B-slimpajama-openthoughts-tokenized dataset hosted on Hugging Face and contributed by the HF Datasets community
Chunk1 train split of cerebras/SlimPajama-627B.
iankur/SlimPajama-100M dataset hosted on Hugging Face and contributed by the HF Datasets community
Generated using https://github.com/FranxYao/Long-Context-Data-Engineering with the command below:

mkdir logs
mkdir data
mkdir data/slimpajama
mkdir data/slimpajama/per_source_downsample
cd data_engineering

PATH_TO_SLIMPAJAMA=rokset3/slim_pajama_chunk_1
nohup python -u slimpajama_packing.py \
    --dataset_size=5b \
    --print_interval=100 \
    --num_process=200 \
    --chunk_size=1000001 \
    --dataset_path=$PATH_TO_SLIMPAJAMA \
    --output_path=../data/slimpajama/per_source_downsample/…

See the full description on the dataset page: https://huggingface.co/datasets/PY007/slimpajama_llama_tokenized_upsample_4096_chunk_1M.
Created using https://github.com/KoyenaPal/autointerp/blob/master/demo/cache.py
Datasets: cerebras/SlimPajama-627B and koyena/OpenR1-Math-220k-formatted
SAE: https://huggingface.co/fnlp/Llama-Scope-R1-Distill/tree/main/400M-Slimpajama-400M-OpenR1-Math-220k/L7R
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Dataset Card for lianghsun/PJ-Masks-630B
This dataset draws on cerebras/SlimPajama-627B, albertvillanova/legal_contracts, and intfloat/multilingual_cc_news as its main sources, keeping only samples whose token length is under 4096 as measured with the meta-llama/Llama-3.2-3B tokenizer. (WIP)
Dataset Details
Dataset Description
Curated by: [More Information Needed]
Funded by [optional]: [More Information Needed]
Shared by [optional]: [More Information Needed]
Language(s) (NLP): [More Information Needed]
License: [More… See the full description on the dataset page: https://huggingface.co/datasets/lianghsun/PJ-Masks-630B.
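A rough sketch of the length filter described on this card, assuming the samples expose a plain "text" field and that the meta-llama/Llama-3.2-3B tokenizer is accessible; the card's actual preprocessing script is not published here:

from datasets import load_dataset
from transformers import AutoTokenizer

# Tokenizer named on the card; meta-llama repos may require gated access.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B")

# Stream one of the source corpora and keep samples under 4096 tokens.
ds = load_dataset("cerebras/SlimPajama-627B", split="train", streaming=True)

def short_enough(example, max_len=4096):
    return len(tokenizer(example["text"])["input_ids"]) < max_len

filtered = ds.filter(short_enough)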
Subset of SlimPajama-6B in which ChatML prompt-format tokens are randomly inserted mid-text, so that a model fine-tuned on it forgets instruction-following behavior and acts like a plain completion model. Base Yi 1.5 models are contaminated with synthetic SFT data, hence the need for de-contamination before further fine-tuning if you don't want the resulting model to behave like ChatGPT. This should also work with Qwen 1.5 and Qwen 2 models, which are contaminated as well.
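A rough sketch of this kind of mid-text injection; the specific marker set and insertion rate below are illustrative assumptions, not the recipe used to build the dataset:

import random

# ChatML-style markers; the exact set and rate are illustrative assumptions.
CHATML_TOKENS = ["<|im_start|>", "<|im_end|>", "<|im_start|>user", "<|im_start|>assistant"]

def inject_chatml(text, rate=0.05, seed=None):
    # Randomly splice ChatML markers between words so the prompt format
    # no longer acts as a reliable instruction-following signal.
    rng = random.Random(seed)
    out = []
    for word in text.split(" "):
        out.append(word)
        if rng.random() < rate:
            out.append(rng.choice(CHATML_TOKENS))
    return " ".join(out)

print(inject_chatml("Example passage from SlimPajama-6B.", rate=0.3, seed=0))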