RedPajama is a clean-room, fully open-source reproduction of the LLaMA training dataset.
RedPajama V2: an Open Dataset for Training Large Language Models
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
RedPajama is a clean-room, fully open-source reproduction of the LLaMA training dataset. This is a 1B-token sample of the full dataset.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Summary
RedPajama-Instruct-Data is curated from a diverse collection of NLP tasks from both P3 (BigScience) and Natural Instructions (AI2), with aggressive decontamination against HELM conducted in two steps: (1) we first run a semantic search using each validation example in HELM as the query, retrieve the top-100 most similar instances from the Instruct dataset, and flag tasks where any returned instance overlaps (using 10-grams) with the validation example. We remove the… See the full description on the dataset page: https://huggingface.co/datasets/togethercomputer/RedPajama-Data-Instruct.
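As a rough illustration of the n-gram half of that check, a 10-gram overlap test can be sketched as follows; whitespace tokenization and all names here are assumptions for illustration, not the authors' pipeline (which first narrows candidates via semantic search):

    # Sketch of a 10-gram overlap test; illustrative only.
    def ngrams(text, n=10):
        tokens = text.lower().split()
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    def overlaps_helm(candidate_text, helm_validation_text, n=10):
        """True if the two texts share at least one n-gram."""
        return bool(ngrams(candidate_text, n) & ngrams(helm_validation_text, n))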
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
RedPajama -- Book (refined by Data-Juicer)
A refined version of the Book subset of RedPajama, produced by Data-Juicer. Some "bad" samples were removed from the original dataset to raise its quality. This dataset is typically used to pretrain a large language model. Notice: this is a small subset for previewing. The whole dataset is available here (about 91GB).
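For intuition only, hand-rolled filtering in the spirit described above might look like the following sketch; the thresholds and heuristics are hypothetical illustrations, not the actual Data-Juicer recipe:

    # Hypothetical quality heuristics; NOT the published Data-Juicer recipe.
    def looks_ok(text):
        words = text.split()
        if len(words) < 20:                        # drop near-empty documents
            return False
        alpha = sum(c.isalpha() for c in text) / max(len(text), 1)
        if alpha < 0.6:                            # drop mostly non-alphabetic text
            return False
        if len(set(words)) / len(words) < 0.2:     # drop highly repetitive text
            return False
        return True

    # samples: iterable of {"text": ...} records
    refined = [s for s in samples if looks_ok(s["text"])]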
Dataset Information
Number of samples: 195,983 (~95.51% of the original dataset retained)
Refining Recipe
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for RedPajama-Tiny
Dataset Summary
This is a tiny version of the RedPajama dataset. It contains 64 samples from each of the 7 sources. It is intended for developing and testing data/training pipelines that load the full RedPajama dataset, or any general Hugging Face dataset. It is very fast to download and easy to examine. You should not use it to train a full model, but it is useful for overfitting tests and other sanity checks… See the full description on the dataset page: https://huggingface.co/datasets/ivanzhouyq/RedPajama-Tiny.
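For instance, a quick smoke test of a loading pipeline could look like the sketch below; the split name and row count are assumptions based on the description and should be confirmed on the dataset page:

    from datasets import load_dataset

    # Tiny dataset: fast to download, suitable for pipeline smoke tests.
    ds = load_dataset("ivanzhouyq/RedPajama-Tiny", split="train")
    print(ds)      # expect 7 sources x 64 samples = 448 rows
    print(ds[0])   # inspect one record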
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
RedPajama-pro
ArXiv | Models | Code
RedPajama-pro is refined from RedPajama-Data-V2 using the ProX refining framework. It contains about 30B high-quality tokens, ready for general language model pre-training.
License
RedPajama-pro is based on RedPajama-Data-V2, which is made available under an Apache 2.0 license; users should also abide by the Common Crawl Terms of Use: https://commoncrawl.org/terms-of-use/. We do not alter the license of any of the underlying data… See the full description on the dataset page: https://huggingface.co/datasets/gair-prox/RedPajama-pro.
xmohri/RedPajama-Data-V2-sample-100B-filtered-shuffled-tokenized-with-token-counts-10000 dataset hosted on Hugging Face and contributed by the HF Datasets community
rvanova/RedPajama-Data-1K-Sample-For-Test dataset hosted on Hugging Face and contributed by the HF Datasets community
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dynamic Topic Modeling Dataset: RedPajama-1T SubSample (100k samples, 1k tokens)
Check out the Blog Post
This dataset represents a curated subset of the RedPajama-1T Sample dataset, specifically processed for dynamic topic modeling applications. It contains 100,000 samples from the original dataset, with each document limited to the first 1,024 tokens for consistent processing.
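A truncation step like the one described above can be sketched as follows; the tokenizer choice is an assumption, since the card excerpt does not specify which tokenizer was used:

    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")  # tokenizer choice is assumed

    def truncate_to_1024(text):
        # Keep only the first 1,024 tokens, then map back to text.
        ids = tok(text, truncation=True, max_length=1024)["input_ids"]
        return tok.decode(ids)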
Dataset Overview
Name: … See the full description on the dataset page: https://huggingface.co/datasets/AmanPriyanshu/Dynamic-Topic-RedPajama-Data-1T-100k-SubSample-max-1k-tokens.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
RedPajama -- ArXiv (refined by Data-Juicer)
A refined version of the ArXiv subset of RedPajama, produced by Data-Juicer. Some "bad" samples were removed from the original dataset to raise its quality. This dataset is typically used to pretrain a large language model. Notice: this is a small subset for previewing. The whole dataset is available here (about 85GB).
Dataset Information
Number of samples: 1,655,259 (~95.99% of the original dataset retained)
Refining Recipe… See the full description on the dataset page: https://huggingface.co/datasets/datajuicer/redpajama-arxiv-refined-by-data-juicer.
sboughorbel/redpajama-data-1b-tokenized-olmo-1b dataset hosted on Hugging Face and contributed by the HF Datasets community
The dataset consists of 59,166 jsonl files and is ~895GB compressed. It is a cleaned and deduplicated version of Together's RedPajama. Check out our blog post explaining our methods, browse our code on GitHub, and join the discussion on the Cerebras Discord.
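Fuzzy deduplication at this scale is commonly done with MinHash signatures plus locality-sensitive hashing; the sketch below uses the datasketch library with illustrative parameters and is not the actual SlimPajama configuration:

    from datasketch import MinHash, MinHashLSH

    def signature(text, num_perm=128):
        m = MinHash(num_perm=num_perm)
        for token in set(text.lower().split()):
            m.update(token.encode("utf-8"))
        return m

    # Jaccard threshold and num_perm are illustrative choices.
    lsh = MinHashLSH(threshold=0.8, num_perm=128)
    kept = []
    for i, doc in enumerate(docs):   # docs: iterable of document strings
        sig = signature(doc)
        if not lsh.query(sig):       # no near-duplicate indexed so far
            lsh.insert(f"doc-{i}", sig)
            kept.append(doc)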
Getting Started
You can download the dataset using Hugging Face datasets:

    from datasets import load_dataset

    ds = load_dataset("cerebras/SlimPajama-627B")
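Since the dataset is ~895GB compressed, streaming is often more practical than a full download; load_dataset supports this with streaming=True:

    from datasets import load_dataset

    # Stream records on demand instead of downloading everything up front.
    ds = load_dataset("cerebras/SlimPajama-627B", streaming=True, split="train")
    for example in ds.take(3):
        print(example["text"][:200])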
Background
Today we are releasing SlimPajama, the largest… See the full description on the dataset page: https://huggingface.co/datasets/cerebras/SlimPajama-627B.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
RedPajama -- C4 (refined by Data-Juicer)
A refined version of the C4 subset of RedPajama, produced by Data-Juicer. Some "bad" samples were removed from the original dataset to raise its quality. This dataset is typically used to pretrain a large language model. Notice: this is a small subset for previewing. The whole dataset is available here (about 832GB).
Dataset Information
Number of samples: 344,491,171 (~94.42% of the original dataset retained)
Refining Recipe
jason9693/RedPajama-combined-15B-8k-llama dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
RedPajama -- CommonCrawl-2023-06 (refined by Data-Juicer)
A refined version of the CommonCrawl-2023-06 subset of RedPajama, produced by Data-Juicer. Some "bad" samples were removed from the original dataset to raise its quality. This dataset is typically used to pretrain a large language model. Notice: this is a small subset for previewing. The whole dataset is available here (about 310GB).
Dataset Information
Number of samples: 50,643,699 (~45.46% of the original dataset retained)… See the full description on the dataset page: https://huggingface.co/datasets/datajuicer/redpajama-cc-2023-06-refined-by-data-juicer.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
RedPajama -- CommonCrawl-2019-30 (refined by Data-Juicer)
A refined version of the CommonCrawl-2019-30 subset of RedPajama, produced by Data-Juicer. Some "bad" samples were removed from the original dataset to raise its quality. This dataset is typically used to pretrain a large language model. Notice: this is a small subset for previewing. The whole dataset is available here (about 240GB).
Dataset Information
Number of samples: 36,557,283 (~45.08% of the original dataset retained)… See the full description on the dataset page: https://huggingface.co/datasets/datajuicer/redpajama-cc-2019-30-refined-by-data-juicer.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
RedPajama -- CommonCrawl-2021-04 (refined by Data-Juicer)
A refined version of the CommonCrawl-2021-04 subset of RedPajama, produced by Data-Juicer. Some "bad" samples were removed from the original dataset to raise its quality. This dataset is typically used to pretrain a large language model. Notice: this is a small subset for previewing. The whole dataset is available here (about 284GB).
Dataset Information
Number of samples: 44,724,752 (~45.23% of the original dataset retained)… See the full description on the dataset page: https://huggingface.co/datasets/datajuicer/redpajama-cc-2021-04-refined-by-data-juicer.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
RedPajama -- CommonCrawl-2022-05 (refined by Data-Juicer)
A refined version of the CommonCrawl-2022-05 subset of RedPajama, produced by Data-Juicer. Some "bad" samples were removed from the original dataset to raise its quality. This dataset is typically used to pretrain a large language model. Notice: this is a small subset for previewing. The whole dataset is available here (about 265GB).
Dataset Information
Number of samples: 42,648,496 (~45.34% of the original dataset retained)… See the full description on the dataset page: https://huggingface.co/datasets/datajuicer/redpajama-cc-2022-05-refined-by-data-juicer.
Dataset Card for "RedPajama-combined-15B-8K-llama"
More Information needed