RedPajama V2: an Open Dataset for Training Large Language Models
RedPajama is a clean-room, fully open-source implementation of the LLaMA training dataset.
The xmohri/RedPajama-Data-V2-sample-100B-filtered-shuffled-tokenized-with-token-counts-10000 dataset is hosted on Hugging Face and was contributed by the HF Datasets community.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Summary
RedPajama-Instruct-Data is curated from a diverse collection of NLP tasks from both P3 (BigScience) and Natural Instructions (AI2), with aggressive decontamination conducted against HELM in two steps: (1) we first run a semantic search using each validation example in HELM as the query, retrieve the top-100 most similar instances from the instruct dataset, and check for tasks whose returned instances overlap (using 10-grams) with the validation example. We remove the… See the full description on the dataset page: https://huggingface.co/datasets/togethercomputer/RedPajama-Data-Instruct.
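The 10-gram overlap test in step (1) can be sketched as follows. This is an illustrative reconstruction, not the authors' code; the whitespace tokenization and function names are assumptions.

def ngrams(tokens, n=10):
    # Collect the set of all consecutive n-grams in a token list.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlaps(validation_text, candidate_text, n=10):
    # Flag a candidate instance that shares any n-gram with the HELM
    # validation example used as the search query.
    return bool(ngrams(validation_text.split(), n) & ngrams(candidate_text.split(), n))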
Dataset Card for redpajama-data-v2_urls
This dataset provides the URLs and top-level domains associated with training records in togethercomputer/RedPajama-Data-V2. It is part of a collection of datasets curated to make exploring LLM training datasets more straightforward and accessible.
Dataset Details
Dataset Description
This dataset was created by downloading the source data, extracting URLs and top-level domains, and retaining only those record… See the full description on the dataset page: https://huggingface.co/datasets/nhagar/redpajama-data-v2_urls.
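As a rough illustration of the extraction step, a top-level domain can be derived from a record's URL with the Python standard library. The code below is a sketch, not the dataset authors' pipeline; the naive last-label split does not handle multi-label suffixes such as "co.uk".

from urllib.parse import urlparse

def top_level_domain(url):
    # Take the last dot-separated label of the hostname, e.g.
    # "https://example.com/page" -> "com".
    host = urlparse(url).netloc
    return host.rsplit(".", 1)[-1] if "." in host else host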
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
📚 RedPajama-pro
ArXiv | Models | Code
RedPajama-pro is refined from RedPajama-Data-V2 using the ProX refining framework. It contains about 30B high-quality tokens, ready for general language model pre-training.
License
RedPajama-pro is based on RedPajama-Data-V2, which is made available under the Apache 2.0 license; users should also abide by the CommonCrawl Terms of Use: https://commoncrawl.org/terms-of-use/. We do not alter the license of any of the underlying data.… See the full description on the dataset page: https://huggingface.co/datasets/gair-prox/RedPajama-pro.
This is a mirror of the sample-10B subset of RedPajama-Data-V2 which we have re-uploaded in order to resolve issues with the original download script.
Getting Started
RedPajama-V2 is an open dataset for training large language models. The dataset includes over 100B text documents drawn from 84 CommonCrawl snapshots and processed with the CCNet pipeline. Of these, 30B documents additionally come with quality signals. We also provide the… See the full description on the dataset page: https://huggingface.co/datasets/EleutherAI/rpj-v2-sample.
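A minimal sketch for peeking at the mirror, assuming the default configuration and a standard "train" split (neither is confirmed here); streaming avoids downloading the full sample up front.

from datasets import load_dataset

ds = load_dataset("EleutherAI/rpj-v2-sample", streaming=True, split="train")
doc = next(iter(ds))  # one document record from the sample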
RedPajama's High Quality Spanish subset
What is this?
The following is a high-quality dataset distilled from the Spanish subset of RedPajama-Data-v2, created using the methodology proposed in FineWeb-Edu.
Usage
# Load the high-quality Spanish RedPajama subset from the Hugging Face Hub
from datasets import load_dataset

ds = load_dataset("latam-gpt/red_pajama_es_hq")
Filtering by quality score
Documents in this corpus are scored on academic quality from 2.5 to 5, with higher scores indicating better… See the full description on the dataset page: https://huggingface.co/datasets/latam-gpt/red_pajama_es_hq.
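A minimal filtering sketch, assuming the quality score is exposed in a column named "score" (the actual column name is not given here and may differ).

from datasets import load_dataset

ds = load_dataset("latam-gpt/red_pajama_es_hq", split="train")
high_quality = ds.filter(lambda ex: ex["score"] >= 4.0)  # keep well-scored documents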
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Suri: Multi-constraint instruction following for long-form text generation
Suri features 20K multi-constraint instructions, each accompanied by human-written gold responses sourced from Books3, ChapterBreak, and RedPajama-Data-v2. For a complete example of an instruction along with model generations, visit our website.
⚠️ Getting Started
Our GitHub repository contains the code to reconstruct the Books3 subset of this dataset. Due to copyright concerns, we do not publicly… See the full description on the dataset page: https://huggingface.co/datasets/chtmp223/suri.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
RedPajama -- Wikipedia (refined by Data-Juicer)
A refined version of the Wikipedia subset of RedPajama, produced by Data-Juicer, which removes some "bad" samples from the original dataset to make it higher quality. This dataset is typically used to pretrain a large language model. Notice: this is a small subset for previewing; the whole dataset is available here (about 68GB).
Dataset Information
Number of samples: 26,990,659 (retains ~90.47% of the original dataset)
Refining… See the full description on the dataset page: https://huggingface.co/datasets/datajuicer/redpajama-wiki-refined-by-data-juicer.