RedPajama is a clean-room, fully open-source implementation of the LLaMA dataset.
RedPajama V2: an Open Dataset for Training Large Language Models
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
RedPajama is a clean-room, fully open-source implementation of the LLaMA dataset. This is a 1B-token sample of the full dataset.
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
Dataset Summary
RedPajama-Instruct-Data is curated from a diverse collection of NLP tasks from both P3 (BigScience) and Natural Instructions (AI2), with aggressive decontamination against HELM conducted in two steps: (1) we first run a semantic search using each validation example in HELM as the query, retrieve the top-100 most similar instances from the Instruct data set, and flag any task whose returned instances overlap (by 10-gram) with the validation example. We remove the… See the full description on the dataset page: https://huggingface.co/datasets/togethercomputer/RedPajama-Data-Instruct.
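A minimal sketch of the 10-gram overlap test in step (1), assuming whitespace tokenization; `retrieve_top_100` is a hypothetical stand-in for the semantic-search index, which the card does not specify:

```python
# Sketch of the 10-gram overlap check from step (1). Tokenization is
# assumed to be whitespace splitting; retrieve_top_100 is a hypothetical
# stand-in for the semantic-search step, which the card does not detail.

def ngrams(tokens, n=10):
    """Set of all contiguous n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def shares_ngram(a: str, b: str, n: int = 10) -> bool:
    """True if the two texts share at least one n-gram."""
    return bool(ngrams(a.split(), n) & ngrams(b.split(), n))

def flag_contaminated_tasks(helm_validation, retrieve_top_100):
    """Flag any task whose retrieved instances overlap a HELM query."""
    flagged = set()
    for query in helm_validation:             # each HELM validation text
        for inst in retrieve_top_100(query):  # top-100 similar Instruct items
            if shares_ngram(query, inst["text"]):
                flagged.add(inst["task"])
    return flagged
```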
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
Dataset Card for Dataset Name
Dataset Summary
This is a tiny version of the RedPajama dataset. It contains 64 samples from each of the 7 sources. It is intended for developing and testing data/training pipelines that load the full RedPajama dataset, or any general Hugging Face dataset; it is very fast to download and easy to examine. You should not use it to train a full model, but it is useful for overfitting tests and other sanity checks.… See the full description on the dataset page: https://huggingface.co/datasets/ivanzhouyq/RedPajama-Tiny.
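A quick pipeline smoke test might look like the following; the `train` split name and the expected sample count are assumptions to verify against the card:

```python
# Minimal pipeline smoke test using the tiny dataset. The split name
# ("train") and expected count (64 x 7 = 448) are assumptions to check
# against the dataset card.
from datasets import load_dataset

tiny = load_dataset("ivanzhouyq/RedPajama-Tiny", split="train")
print(len(tiny))       # expect 448 if all 7 sources ship 64 samples
print(tiny[0].keys())  # inspect the schema before scaling up
```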
The RedPajama dataset is used for single-turn dialogue tasks.
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
RedPajama-pro
ArXiv | Models | Code
RedPajama-pro is refined from RedPajama-Data-V2 using the ProX refining framework. It contains about 30B high-quality tokens, ready for general language model pre-training.
License
RedPajama-pro is based on RedPajama-Data-V2, which is made available under an apache-2.0 license; users should also abide by the CommonCrawl ToU: https://commoncrawl.org/terms-of-use/. We do not alter the license of any of the underlying data.… See the full description on the dataset page: https://huggingface.co/datasets/gair-prox/RedPajama-pro.
xmohri/RedPajama-Data-V2-sample-100B-filtered-shuffled-tokenized-with-token-counts-1000000en dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
RedPajama -- ArXiv (refined by Data-Juicer)
A refined version of the ArXiv subset of RedPajama, produced by Data-Juicer by removing low-quality ("bad") samples from the original dataset. This dataset is typically used to pretrain a large language model. Notice: this is a small subset for previewing; the whole dataset (about 85GB) is available here.
Dataset Information
Number of samples: 1,655,259 (retaining ~95.99% of the original dataset)
Refining Recipe… See the full description on the dataset page: https://huggingface.co/datasets/datajuicer/redpajama-arxiv-refined-by-data-juicer.
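The actual refining recipe is linked on the dataset page; purely as an illustration of this kind of sample-level quality filtering (not Data-Juicer's recipe), a pass might look like:

```python
# Illustrative quality filter, NOT the actual Data-Juicer recipe (see
# the linked refining recipe for that). Thresholds here are arbitrary
# assumptions chosen only to show the shape of the filtering step.

def repetition_ratio(text: str) -> float:
    """Crude repetition signal: fraction of non-unique words."""
    words = text.split()
    return 1.0 - len(set(words)) / max(len(words), 1)

def keep(sample: dict) -> bool:
    """Drop very short, very long, or highly repetitive samples."""
    text = sample["text"]
    return 100 <= len(text) <= 1_000_000 and repetition_ratio(text) < 0.8

samples = [{"text": "..."}]  # stand-in for a raw ArXiv shard
refined = [s for s in samples if keep(s)]
```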
MIT License (https://opensource.org/licenses/MIT)
Dynamic Topic Modeling Dataset: RedPajama-1T SubSample (100k samples, 1k tokens)
Check out the Blog Post
This dataset represents a curated subset of the RedPajama-1T Sample dataset, specifically processed for dynamic topic modeling applications. It contains 100,000 samples from the original dataset, with each document limited to the first 1,024 tokens for consistent processing.
Dataset Overview
Name: … See the full description on the dataset page: https://huggingface.co/datasets/AmanPriyanshu/Dynamic-Topic-RedPajama-Data-1T-100k-SubSample-max-1k-tokens.
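A sketch of the per-document cap described above; whitespace splitting stands in for tokenization, since the card does not say which tokenizer was used:

```python
# Cap each document at its first 1,024 tokens. Whitespace splitting is
# an assumption; the card does not name the tokenizer actually used.

def truncate_to_tokens(text: str, max_tokens: int = 1024) -> str:
    tokens = text.split()
    return " ".join(tokens[:max_tokens])

doc = "word " * 5000
print(len(truncate_to_tokens(doc).split()))  # 1024
```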
Dataset Card for "RedPajama-combined-15B-8K-llama"
More Information needed
jason9693/RedPajama-combined-15B-8k-llama dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
RedPajama -- C4 (refined by Data-Juicer)
A refined version of the C4 subset of RedPajama, produced by Data-Juicer by removing low-quality ("bad") samples from the original dataset. This dataset is typically used to pretrain a large language model. Notice: this is a small subset for previewing; the whole dataset (about 832GB) is available here.
Dataset Information
Number of samples: 344,491,171 (retaining ~94.42% of the original dataset)
Refining Recipe
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
Fredithefish/hh-rlhf-RedPajama-Chat-Format dataset hosted on Hugging Face and contributed by the HF Datasets community
sboughorbel/redpajama-data-1b-tokenized-olmo-1b dataset hosted on Hugging Face and contributed by the HF Datasets community
Dataset Card for redpajama-data-1t_urls
This dataset provides the URLs and top-level domains associated with training records in togethercomputer/RedPajama-Data-1T. It is part of a collection of datasets curated to make exploring LLM training datasets more straightforward and accessible.
Dataset Details
Dataset Description
This dataset was created by downloading the source data, extracting URLs and top-level domains, and retaining only those record… See the full description on the dataset page: https://huggingface.co/datasets/nhagar/redpajama-data-1t_urls.
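A sketch of the extraction step using only the standard library; a production pass would more likely use a public-suffix-aware package such as tldextract:

```python
# Naive URL -> top-level-domain extraction. This treats the last
# dot-separated label as the TLD, so multi-part suffixes such as
# "co.uk" reduce to "uk"; a public-suffix-aware library fixes that.
from urllib.parse import urlparse

def url_to_tld(url: str) -> str:
    host = urlparse(url).netloc.split(":")[0]  # strip any port
    return host.rsplit(".", 1)[-1] if "." in host else ""

print(url_to_tld("https://example.co.uk/page"))  # "uk"
```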
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
RedPajama -- CommonCrawl-2021-04 (refined by Data-Juicer)
A refined version of the CommonCrawl-2021-04 subset of RedPajama, produced by Data-Juicer by removing low-quality ("bad") samples from the original dataset. This dataset is typically used to pretrain a large language model. Notice: this is a small subset for previewing; the whole dataset (about 284GB) is available here.
Dataset Information
Number of samples: 44,724,752 (retaining ~45.23% of the original dataset)… See the full description on the dataset page: https://huggingface.co/datasets/datajuicer/redpajama-cc-2021-04-refined-by-data-juicer.
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
RedPajama -- CommonCrawl-2022-05 (refined by Data-Juicer)
A refined version of the CommonCrawl-2022-05 subset of RedPajama, produced by Data-Juicer by removing low-quality ("bad") samples from the original dataset. This dataset is typically used to pretrain a large language model. Notice: this is a small subset for previewing; the whole dataset (about 265GB) is available here.
Dataset Information
Number of samples: 42,648,496 (retaining ~45.34% of the original dataset)… See the full description on the dataset page: https://huggingface.co/datasets/datajuicer/redpajama-cc-2022-05-refined-by-data-juicer.
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
RedPajama -- CommonCrawl-2019-30 (refined by Data-Juicer)
A refined version of the CommonCrawl-2019-30 subset of RedPajama, produced by Data-Juicer by removing low-quality ("bad") samples from the original dataset. This dataset is typically used to pretrain a large language model. Notice: this is a small subset for previewing; the whole dataset (about 240GB) is available here.
Dataset Information
Number of samples: 36,557,283 (retaining ~45.08% of the original dataset)… See the full description on the dataset page: https://huggingface.co/datasets/datajuicer/redpajama-cc-2019-30-refined-by-data-juicer.
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
ShareGPT unfiltered dataset in RedPajama-Chat format
This dataset was created by converting the alpaca-lora-formatted ShareGPT dataset to the format required by RedPajama-Chat. This script was used for the conversion: https://github.com/fredi-python/Alpaca2INCITE-Dataset-Converter/blob/main/convert.py. WARNING: only the first human and gpt message of each conversation from the original dataset is included.
The format
{"text": "
RedPajama is a clean-room, fully open-source implementation of the LLaMA dataset.