RedPajama V2: an Open Dataset for Training Large Language Models
RedPajama is a clean-room, fully open-source implementation of the LLaMA training dataset.
The xmohri/RedPajama-Data-V2-sample-100B-filtered-shuffled-tokenized-with-token-counts-10000 dataset is hosted on Hugging Face and was contributed by the HF Datasets community.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Summary
RedPajama-Instruct-Data is curated from a diverse collection of NLP tasks from both P3 (BigScience) and Natural Instructions (AI2), with aggressive decontamination conducted against HELM in two steps: (1) we first run a semantic search using each validation example in HELM as the query, retrieve the top-100 most similar instances from the instruct dataset, and check for tasks whose returned instances overlap (using 10-grams) with the validation example. We remove the… See the full description on the dataset page: https://huggingface.co/datasets/togethercomputer/RedPajama-Data-Instruct.
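The 10-gram overlap test in step (1) can be sketched as follows. This is an illustrative reconstruction, not the authors' code; the whitespace tokenization and function names are assumptions.

def ngrams(tokens, n=10):
    # Collect the set of all consecutive n-grams in a token list.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlaps(validation_text, candidate_text, n=10):
    # Flag a candidate instance that shares any n-gram with the HELM
    # validation example used as the search query.
    return bool(ngrams(validation_text.split(), n) & ngrams(candidate_text.split(), n))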
Dataset Card for redpajama-data-v2_urls
This dataset provides the URLs and top-level domains associated with training records in togethercomputer/RedPajama-Data-V2. It is part of a collection of datasets curated to make exploring LLM training datasets more straightforward and accessible.
Dataset Details
Dataset Description
This dataset was created by downloading the source data, extracting URLs and top-level domains, and retaining only those record… See the full description on the dataset page: https://huggingface.co/datasets/nhagar/redpajama-data-v2_urls.
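As a rough illustration of the extraction step, a top-level domain can be derived from a record's URL with the Python standard library. The code below is a sketch, not the dataset authors' pipeline; the naive last-label split does not handle multi-label suffixes such as "co.uk".

from urllib.parse import urlparse

def top_level_domain(url):
    # Take the last dot-separated label of the hostname, e.g.
    # "https://example.com/page" -> "com".
    host = urlparse(url).netloc
    return host.rsplit(".", 1)[-1] if "." in host else host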
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
📚 RedPajama-pro
ArXiv | Models | Code
RedPajama-pro is refined from RedPajama-Data-V2 using the ProX refining framework. It contains about 30B high-quality tokens, ready for general language model pre-training.
License
RedPajama-pro is based on RedPajama-Data-V2, which is made available under the Apache 2.0 license; users should also abide by the CommonCrawl Terms of Use: https://commoncrawl.org/terms-of-use/. We do not alter the license of any of the underlying data.… See the full description on the dataset page: https://huggingface.co/datasets/gair-prox/RedPajama-pro.
This is a mirror of the sample-10B subset of RedPajama-Data-V2 which we have re-uploaded in order to resolve issues with the original download script.
Getting Started
RedPajama-V2 is an open dataset for training large language models. The dataset includes over 100B text documents drawn from 84 CommonCrawl snapshots and processed with the CCNet pipeline. Of these, 30B documents additionally come with quality signals. We also provide the… See the full description on the dataset page: https://huggingface.co/datasets/EleutherAI/rpj-v2-sample.
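A minimal sketch for peeking at the mirror, assuming the default configuration and a standard "train" split (neither is confirmed here); streaming avoids downloading the full sample up front.

from datasets import load_dataset

ds = load_dataset("EleutherAI/rpj-v2-sample", streaming=True, split="train")
doc = next(iter(ds))  # one document record from the sample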
RedPajama's High Quality Spanish subset
What is this?
The following is a high-quality dataset distilled from the Spanish subset of RedPajama-Data-v2, created using the methodology proposed in FineWeb-Edu.
Usage
# Load the high-quality Spanish RedPajama subset from the Hugging Face Hub
from datasets import load_dataset

ds = load_dataset("latam-gpt/red_pajama_es_hq")
Filtering by quality score
Documents in this corpus are scored on academic quality from 2.5 to 5, with higher scores indicating better… See the full description on the dataset page: https://huggingface.co/datasets/latam-gpt/red_pajama_es_hq.
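A minimal filtering sketch, assuming the quality score is exposed in a column named "score" (the actual column name is not given here and may differ).

from datasets import load_dataset

ds = load_dataset("latam-gpt/red_pajama_es_hq", split="train")
high_quality = ds.filter(lambda ex: ex["score"] >= 4.0)  # keep well-scored documents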
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Suri: Multi-constraint instruction following for long-form text generation
Suri features 20K multi-constraint instructions, each accompanied by human-written gold responses sourced from Books3, ChapterBreak, and RedPajama-Data-v2. For a complete example of an instruction along with model generations, visit our website.
⚠️ Getting Started
Our GitHub repository contains the code to reconstruct the Books3 subset of this dataset. Due to copyright concerns, we do not publicly… See the full description on the dataset page: https://huggingface.co/datasets/chtmp223/suri.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
RedPajama -- Wikipedia (refined by Data-Juicer)
A refined version of the Wikipedia subset of RedPajama, produced by Data-Juicer, which removes some "bad" samples from the original dataset to make it higher quality. This dataset is typically used to pretrain a large language model. Notice: this is a small subset for previewing; the whole dataset is available here (about 68GB).
Dataset Information
Number of samples: 26,990,659 (retains ~90.47% of the original dataset)
Refining… See the full description on the dataset page: https://huggingface.co/datasets/datajuicer/redpajama-wiki-refined-by-data-juicer.