RedPajama V2: an Open Dataset for Training Large Language Models
RedPajama is a clean-room, fully open-source reproduction of the LLaMA training dataset.
The Tristan/RedPajama-Data-V2-sample-100B-filtered-shuffled-tokenized-with-token-counts dataset is hosted on Hugging Face and contributed by the HF Datasets community.
Dataset Card for redpajama-data-v2_urls
This dataset provides the URLs and top-level domains associated with training records in togethercomputer/RedPajama-Data-V2. It is part of a collection of datasets curated to make exploring LLM training datasets more straightforward and accessible.
Dataset Details
Dataset Description
This dataset was created by downloading the source data, extracting URLs and top-level domains, and retaining only those record… See the full description on the dataset page: https://huggingface.co/datasets/nhagar/redpajama-data-v2_urls.
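As a rough illustration of the URL and top-level-domain extraction step described above, the sketch below parses a URL with Python's standard library. This is not the dataset's actual pipeline code; the function name is hypothetical, and the naive last-label TLD split is a simplification (production pipelines typically consult the Public Suffix List):

```python
from urllib.parse import urlparse

def extract_domain_parts(url: str) -> tuple[str, str]:
    """Return the hostname and a naive top-level domain for a URL.

    Illustrative only: the real extraction used for the dataset may differ.
    """
    host = urlparse(url).netloc.lower()
    # Naive TLD: take the last dot-separated label of the hostname.
    tld = host.rsplit(".", 1)[-1] if "." in host else host
    return host, tld

host, tld = extract_domain_parts("https://example.com/some/page")
# host == "example.com", tld == "com"
```

Note that for multi-part suffixes such as `example.co.uk`, this naive split returns `uk` rather than `co.uk`, which is why Public Suffix List lookups are the usual choice in practice.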
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
📚 RedPajama-pro
ArXiv | Models | Code

RedPajama-pro is refined from RedPajama-Data-V2 using the ProX refining framework. It contains about 30B high-quality tokens, ready for general language model pre-training.
License
RedPajama-pro is based on RedPajama-Data-V2, which is made available under an Apache-2.0 license; users should also abide by the CommonCrawl Terms of Use: https://commoncrawl.org/terms-of-use/. We do not alter the license of any of the underlying data.… See the full description on the dataset page: https://huggingface.co/datasets/gair-prox/RedPajama-pro.
This dataset is a strict subset of the RedPajama V2 dataset and therefore retains all licenses from RedPajama V2. More details in our preprint!
RedPajama's High Quality Spanish subset
What is this?
The following is a high-quality dataset distilled from the Spanish subset of RedPajama-Data-v2, created using the methodology proposed in FineWeb-Edu.
Usage
```python
from datasets import load_dataset

ds = load_dataset("latam-gpt/red_pajama_es_hq")
```
Filtering by quality score
Documents in this corpus are scored on academic quality from 2.5 to 5, with higher scores indicating better quality. The… See the full description on the dataset page: https://huggingface.co/datasets/latam-gpt/red_pajama_es_hq.
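The scoring scheme above lends itself to a simple threshold filter. The sketch below shows one way this might look; it assumes each record carries its quality rating in a field named `score`, and the threshold of 4.0 is an arbitrary example, not a recommendation from the dataset authors:

```python
def keep_high_quality(records, min_score=4.0):
    """Keep records whose quality score meets the threshold.

    Assumes each record is a dict with a 'score' field holding the
    2.5-5 academic-quality rating; both names are illustrative.
    """
    return [r for r in records if r.get("score", 0.0) >= min_score]

docs = [
    {"text": "low-quality page", "score": 2.5},
    {"text": "well-written article", "score": 4.5},
]
filtered = keep_high_quality(docs)
# filtered contains only the record with score 4.5
```

With the `datasets` library, the equivalent would be a call like `ds.filter(lambda r: r["score"] >= 4.0)`, again assuming `score` is the actual column name.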
This is the 2022-05 snapshot of BramVanroy/CommonCrawl-CreativeCommons, filtered by:

- Extracting the URLs from the dataset
- Getting documents that match those URLs from the corresponding snapshot of togethercomputer/RedPajama-Data-V2
- Keeping only the head and middle partitions of ccnet
- Keeping documents with at least 50 words and a mean word length between 3 and 10 inclusive
In total we keep 4,553,263 of the 15,239,155 total documents.
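The final length criterion above (at least 50 words, mean word length between 3 and 10 inclusive) is concrete enough to sketch. A minimal stand-alone version, not the authors' actual filtering code, might look like:

```python
def passes_length_filter(text: str) -> bool:
    """Check the document-length rule described above:
    at least 50 whitespace-separated words, and a mean word
    length between 3 and 10 characters, inclusive.
    """
    words = text.split()
    if len(words) < 50:
        return False
    mean_len = sum(len(w) for w in words) / len(words)
    return 3 <= mean_len <= 10

# A 10-word snippet fails the minimum word count.
print(passes_length_filter("this short snippet has far too few words to pass"))
```

Splitting on whitespace is an assumption here; the real pipeline may use a different tokenizer, which would shift which documents sit near the thresholds.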