RedPajama is a clean-room, fully open-source reproduction of the LLaMA training dataset.
RedPajama V2: an Open Dataset for Training Large Language Models
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
RedPajama is a clean-room, fully open-source reproduction of the LLaMA training dataset. This is a 1B-token sample of the full dataset.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Summary
RedPajama-Instruct-Data is curated from a diverse collection of NLP tasks from both P3 (BigScience) and Natural Instructions (AI2), with aggressive decontamination against HELM conducted in two steps: (1) we first run a semantic search using each validation example in HELM as the query, retrieve the top-100 most similar instances from the Instruct dataset, and flag tasks where any returned instance overlaps (using 10-grams) with the validation example. We remove the… See the full description on the dataset page: https://huggingface.co/datasets/togethercomputer/RedPajama-Data-Instruct.
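As a rough illustration of the n-gram half of that check, a 10-gram overlap test can be sketched as follows; whitespace tokenization and all names here are assumptions for illustration, not the authors' pipeline (which first narrows candidates via semantic search):

    # Sketch of a 10-gram overlap test; illustrative only.
    def ngrams(text, n=10):
        tokens = text.lower().split()
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    def overlaps_helm(candidate_text, helm_validation_text, n=10):
        """True if the two texts share at least one n-gram."""
        return bool(ngrams(candidate_text, n) & ngrams(helm_validation_text, n))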
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
RedPajama -- Book (refined by Data-Juicer)
A refined version of the Book subset of RedPajama, produced by Data-Juicer. Some "bad" samples were removed from the original dataset to raise its quality. This dataset is typically used to pretrain a large language model. Notice: this is a small subset for previewing. The whole dataset is available here (about 91GB).
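For intuition only, hand-rolled filtering in the spirit described above might look like the following sketch; the thresholds and heuristics are hypothetical illustrations, not the actual Data-Juicer recipe:

    # Hypothetical quality heuristics; NOT the published Data-Juicer recipe.
    def looks_ok(text):
        words = text.split()
        if len(words) < 20:                        # drop near-empty documents
            return False
        alpha = sum(c.isalpha() for c in text) / max(len(text), 1)
        if alpha < 0.6:                            # drop mostly non-alphabetic text
            return False
        if len(set(words)) / len(words) < 0.2:     # drop highly repetitive text
            return False
        return True

    # samples: iterable of {"text": ...} records
    refined = [s for s in samples if looks_ok(s["text"])]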
Dataset Information
Number of samples: 195,983 (~95.51% of the original dataset retained)
Refining Recipe
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for RedPajama-Tiny
Dataset Summary
This is a tiny version of the RedPajama dataset. It contains 64 samples from each of the 7 sources. It is intended for developing and testing data/training pipelines that load the full RedPajama dataset, or any general Hugging Face dataset. It is very fast to download and easy to examine. You should not use it to train a full model, but it is useful for overfitting tests and other sanity checks… See the full description on the dataset page: https://huggingface.co/datasets/ivanzhouyq/RedPajama-Tiny.
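For instance, a quick smoke test of a loading pipeline could look like the sketch below; the split name and row count are assumptions based on the description and should be confirmed on the dataset page:

    from datasets import load_dataset

    # Tiny dataset: fast to download, suitable for pipeline smoke tests.
    ds = load_dataset("ivanzhouyq/RedPajama-Tiny", split="train")
    print(ds)      # expect 7 sources x 64 samples = 448 rows
    print(ds[0])   # inspect one record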
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
RedPajama-pro
ArXiv | Models | Code
RedPajama-pro is refined from RedPajama-Data-V2 using the ProX refining framework. It contains about 30B high-quality tokens, ready for general language model pre-training.
License
RedPajama-pro is based on RedPajama-Data-V2, which is made available under an Apache 2.0 license; users should also abide by the Common Crawl Terms of Use: https://commoncrawl.org/terms-of-use/. We do not alter the license of any of the underlying data… See the full description on the dataset page: https://huggingface.co/datasets/gair-prox/RedPajama-pro.
xmohri/RedPajama-Data-V2-sample-100B-filtered-shuffled-tokenized-with-token-counts-10000 dataset hosted on Hugging Face and contributed by the HF Datasets community
rvanova/RedPajama-Data-1K-Sample-For-Test dataset hosted on Hugging Face and contributed by the HF Datasets community
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dynamic Topic Modeling Dataset: RedPajama-1T SubSample (100k samples, 1k tokens)
Check out the Blog Post
This dataset represents a curated subset of the RedPajama-1T Sample dataset, specifically processed for dynamic topic modeling applications. It contains 100,000 samples from the original dataset, with each document limited to the first 1,024 tokens for consistent processing.
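A truncation step like the one described above can be sketched as follows; the tokenizer choice is an assumption, since the card excerpt does not specify which tokenizer was used:

    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")  # tokenizer choice is assumed

    def truncate_to_1024(text):
        # Keep only the first 1,024 tokens, then map back to text.
        ids = tok(text, truncation=True, max_length=1024)["input_ids"]
        return tok.decode(ids)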
Dataset Overview
Name: … See the full description on the dataset page: https://huggingface.co/datasets/AmanPriyanshu/Dynamic-Topic-RedPajama-Data-1T-100k-SubSample-max-1k-tokens.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
RedPajama -- ArXiv (refined by Data-Juicer)
A refined version of the ArXiv subset of RedPajama, produced by Data-Juicer. Some "bad" samples were removed from the original dataset to raise its quality. This dataset is typically used to pretrain a large language model. Notice: this is a small subset for previewing. The whole dataset is available here (about 85GB).
Dataset Information
Number of samples: 1,655,259 (~95.99% of the original dataset retained)
Refining Recipe… See the full description on the dataset page: https://huggingface.co/datasets/datajuicer/redpajama-arxiv-refined-by-data-juicer.
sboughorbel/redpajama-data-1b-tokenized-olmo-1b dataset hosted on Hugging Face and contributed by the HF Datasets community
The dataset consists of 59,166 jsonl files and is ~895GB compressed. It is a cleaned and deduplicated version of Together's RedPajama. Check out our blog post explaining our methods, browse our code on GitHub, and join the discussion on the Cerebras Discord.
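Fuzzy deduplication at this scale is commonly done with MinHash signatures plus locality-sensitive hashing; the sketch below uses the datasketch library with illustrative parameters and is not the actual SlimPajama configuration:

    from datasketch import MinHash, MinHashLSH

    def signature(text, num_perm=128):
        m = MinHash(num_perm=num_perm)
        for token in set(text.lower().split()):
            m.update(token.encode("utf-8"))
        return m

    # Jaccard threshold and num_perm are illustrative choices.
    lsh = MinHashLSH(threshold=0.8, num_perm=128)
    kept = []
    for i, doc in enumerate(docs):   # docs: iterable of document strings
        sig = signature(doc)
        if not lsh.query(sig):       # no near-duplicate indexed so far
            lsh.insert(f"doc-{i}", sig)
            kept.append(doc)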
Getting Started
You can download the dataset using Hugging Face datasets:

    from datasets import load_dataset

    ds = load_dataset("cerebras/SlimPajama-627B")
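Since the dataset is ~895GB compressed, streaming is often more practical than a full download; load_dataset supports this with streaming=True:

    from datasets import load_dataset

    # Stream records on demand instead of downloading everything up front.
    ds = load_dataset("cerebras/SlimPajama-627B", streaming=True, split="train")
    for example in ds.take(3):
        print(example["text"][:200])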
Background
Today we are releasing SlimPajama, the largest… See the full description on the dataset page: https://huggingface.co/datasets/cerebras/SlimPajama-627B.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
RedPajama -- C4 (refined by Data-Juicer)
A refined version of the C4 subset of RedPajama, produced by Data-Juicer. Some "bad" samples were removed from the original dataset to raise its quality. This dataset is typically used to pretrain a large language model. Notice: this is a small subset for previewing. The whole dataset is available here (about 832GB).
Dataset Information
Number of samples: 344,491,171 (~94.42% of the original dataset retained)
Refining Recipe
jason9693/RedPajama-combined-15B-8k-llama dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
RedPajama -- CommonCrawl-2023-06 (refined by Data-Juicer)
A refined version of the CommonCrawl-2023-06 subset of RedPajama, produced by Data-Juicer. Some "bad" samples were removed from the original dataset to raise its quality. This dataset is typically used to pretrain a large language model. Notice: this is a small subset for previewing. The whole dataset is available here (about 310GB).
Dataset Information
Number of samples: 50,643,699 (~45.46% of the original dataset retained)… See the full description on the dataset page: https://huggingface.co/datasets/datajuicer/redpajama-cc-2023-06-refined-by-data-juicer.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
RedPajama -- CommonCrawl-2019-30 (refined by Data-Juicer)
A refined version of the CommonCrawl-2019-30 subset of RedPajama, produced by Data-Juicer. Some "bad" samples were removed from the original dataset to raise its quality. This dataset is typically used to pretrain a large language model. Notice: this is a small subset for previewing. The whole dataset is available here (about 240GB).
Dataset Information
Number of samples: 36,557,283 (~45.08% of the original dataset retained)… See the full description on the dataset page: https://huggingface.co/datasets/datajuicer/redpajama-cc-2019-30-refined-by-data-juicer.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
RedPajama -- CommonCrawl-2021-04 (refined by Data-Juicer)
A refined version of the CommonCrawl-2021-04 subset of RedPajama, produced by Data-Juicer. Some "bad" samples were removed from the original dataset to raise its quality. This dataset is typically used to pretrain a large language model. Notice: this is a small subset for previewing. The whole dataset is available here (about 284GB).
Dataset Information
Number of samples: 44,724,752 (~45.23% of the original dataset retained)… See the full description on the dataset page: https://huggingface.co/datasets/datajuicer/redpajama-cc-2021-04-refined-by-data-juicer.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
RedPajama -- CommonCrawl-2022-05 (refined by Data-Juicer)
A refined version of the CommonCrawl-2022-05 subset of RedPajama, produced by Data-Juicer. Some "bad" samples were removed from the original dataset to raise its quality. This dataset is typically used to pretrain a large language model. Notice: this is a small subset for previewing. The whole dataset is available here (about 265GB).
Dataset Information
Number of samples: 42,648,496 (~45.34% of the original dataset retained)… See the full description on the dataset page: https://huggingface.co/datasets/datajuicer/redpajama-cc-2022-05-refined-by-data-juicer.
Dataset Card for "RedPajama-combined-15B-8K-llama"
More Information needed