RedPajama V2: an Open Dataset for Training Large Language Models
RedPajama is a clean-room, fully open-source reproduction of the LLaMA training dataset.
The Tristan/RedPajama-Data-V2-sample-100B-filtered-shuffled-tokenized-with-token-counts dataset is hosted on Hugging Face and contributed by the HF Datasets community.
Dataset Card for redpajama-data-v2_urls
This dataset provides the URLs and top-level domains associated with training records in togethercomputer/RedPajama-Data-V2. It is part of a collection of datasets curated to make exploring LLM training datasets more straightforward and accessible.
Dataset Details
Dataset Description
This dataset was created by downloading the source data, extracting URLs and top-level domains, and retaining only those record… See the full description on the dataset page: https://huggingface.co/datasets/nhagar/redpajama-data-v2_urls.
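As a rough illustration of the URL and top-level-domain extraction step described above, the sketch below parses a URL with Python's standard library. This is not the dataset's actual pipeline code; the function name is hypothetical, and the naive last-label TLD split is a simplification (production pipelines typically consult the Public Suffix List):

```python
from urllib.parse import urlparse

def extract_domain_parts(url: str) -> tuple[str, str]:
    """Return the hostname and a naive top-level domain for a URL.

    Illustrative only: the real extraction used for the dataset may differ.
    """
    host = urlparse(url).netloc.lower()
    # Naive TLD: take the last dot-separated label of the hostname.
    tld = host.rsplit(".", 1)[-1] if "." in host else host
    return host, tld

host, tld = extract_domain_parts("https://example.com/some/page")
# host == "example.com", tld == "com"
```

Note that for multi-part suffixes such as `example.co.uk`, this naive split returns `uk` rather than `co.uk`, which is why Public Suffix List lookups are the usual choice in practice.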
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
📚 RedPajama-pro
ArXiv | Models | Code

RedPajama-pro is refined from RedPajama-Data-V2 using the ProX refining framework. It contains about 30B high-quality tokens, ready for general language model pre-training.
License
RedPajama-pro is based on RedPajama-Data-V2, which is made available under an Apache-2.0 license; users should also abide by the CommonCrawl Terms of Use: https://commoncrawl.org/terms-of-use/. We do not alter the license of any of the underlying data.… See the full description on the dataset page: https://huggingface.co/datasets/gair-prox/RedPajama-pro.
This dataset is a strict subset of the RedPajama V2 dataset and therefore retains all licenses from RedPajama V2. More details in our preprint!
RedPajama's High Quality Spanish subset
What is this?
The following is a high-quality dataset distilled from the Spanish subset of RedPajama-Data-v2, created using the methodology proposed in FineWeb-Edu.
Usage
```python
from datasets import load_dataset

ds = load_dataset("latam-gpt/red_pajama_es_hq")
```
Filtering by quality score
Documents in this corpus are scored on academic quality from 2.5 to 5, with higher scores indicating better quality. The… See the full description on the dataset page: https://huggingface.co/datasets/latam-gpt/red_pajama_es_hq.
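The scoring scheme above lends itself to a simple threshold filter. The sketch below shows one way this might look; it assumes each record carries its quality rating in a field named `score`, and the threshold of 4.0 is an arbitrary example, not a recommendation from the dataset authors:

```python
def keep_high_quality(records, min_score=4.0):
    """Keep records whose quality score meets the threshold.

    Assumes each record is a dict with a 'score' field holding the
    2.5-5 academic-quality rating; both names are illustrative.
    """
    return [r for r in records if r.get("score", 0.0) >= min_score]

docs = [
    {"text": "low-quality page", "score": 2.5},
    {"text": "well-written article", "score": 4.5},
]
filtered = keep_high_quality(docs)
# filtered contains only the record with score 4.5
```

With the `datasets` library, the equivalent would be a call like `ds.filter(lambda r: r["score"] >= 4.0)`, again assuming `score` is the actual column name.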
This is the 2022-05 snapshot of BramVanroy/CommonCrawl-CreativeCommons, filtered by:

- Extracting the URLs from the dataset
- Getting documents that match those URLs from the corresponding snapshot of togethercomputer/RedPajama-Data-V2
- Keeping only the head and middle partitions of ccnet
- Keeping documents with at least 50 words and a mean word length between 3 and 10 inclusive
In total we keep 4,553,263 of the 15,239,155 total documents.
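The final length criterion above (at least 50 words, mean word length between 3 and 10 inclusive) is concrete enough to sketch. A minimal stand-alone version, not the authors' actual filtering code, might look like:

```python
def passes_length_filter(text: str) -> bool:
    """Check the document-length rule described above:
    at least 50 whitespace-separated words, and a mean word
    length between 3 and 10 characters, inclusive.
    """
    words = text.split()
    if len(words) < 50:
        return False
    mean_len = sum(len(w) for w in words) / len(words)
    return 3 <= mean_len <= 10

# A 10-word snippet fails the minimum word count.
print(passes_length_filter("this short snippet has far too few words to pass"))
```

Splitting on whitespace is an assumption here; the real pipeline may use a different tokenizer, which would shift which documents sit near the thresholds.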