RedPajama is a clean-room, fully open-source implementation of the LLaMA dataset.
RedPajama V2: an Open Dataset for Training Large Language Models
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
RedPajama is a clean-room, fully open-source implementation of the LLaMA dataset. This is a 1B-token sample of the full dataset.
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
Dataset Summary
RedPajama-Instruct-Data is curated from a diverse collection of NLP tasks from both P3 (BigScience) and Natural Instructions (AI2), with aggressive decontamination against HELM conducted in two steps: (1) we first run a semantic search using each validation example in HELM as the query, retrieve the top-100 most similar instances from the Instruct data set, and flag any task whose returned instances overlap (by 10-gram) with the validation example. We remove the… See the full description on the dataset page: https://huggingface.co/datasets/togethercomputer/RedPajama-Data-Instruct.
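A minimal sketch of the 10-gram overlap test in step (1), assuming whitespace tokenization; `retrieve_top_100` is a hypothetical stand-in for the semantic-search index, which the card does not specify:

```python
# Sketch of the 10-gram overlap check from step (1). Tokenization is
# assumed to be whitespace splitting; retrieve_top_100 is a hypothetical
# stand-in for the semantic-search step, which the card does not detail.

def ngrams(tokens, n=10):
    """Set of all contiguous n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def shares_ngram(a: str, b: str, n: int = 10) -> bool:
    """True if the two texts share at least one n-gram."""
    return bool(ngrams(a.split(), n) & ngrams(b.split(), n))

def flag_contaminated_tasks(helm_validation, retrieve_top_100):
    """Flag any task whose retrieved instances overlap a HELM query."""
    flagged = set()
    for query in helm_validation:             # each HELM validation text
        for inst in retrieve_top_100(query):  # top-100 similar Instruct items
            if shares_ngram(query, inst["text"]):
                flagged.add(inst["task"])
    return flagged
```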
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
Dataset Card for Dataset Name
Dataset Summary
This is a tiny version of the RedPajama dataset. It contains 64 samples from each of the 7 sources. It is intended for developing and testing data/training pipelines that load the full RedPajama dataset, or any general Hugging Face dataset; it is very fast to download and easy to examine. You should not use it to train a full model, but it is useful for overfitting tests and other sanity checks.… See the full description on the dataset page: https://huggingface.co/datasets/ivanzhouyq/RedPajama-Tiny.
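A quick pipeline smoke test might look like the following; the `train` split name and the expected sample count are assumptions to verify against the card:

```python
# Minimal pipeline smoke test using the tiny dataset. The split name
# ("train") and expected count (64 x 7 = 448) are assumptions to check
# against the dataset card.
from datasets import load_dataset

tiny = load_dataset("ivanzhouyq/RedPajama-Tiny", split="train")
print(len(tiny))       # expect 448 if all 7 sources ship 64 samples
print(tiny[0].keys())  # inspect the schema before scaling up
```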
The RedPajama dataset is used for single-turn dialogue tasks.
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
RedPajama-pro
ArXiv | Models | Code
RedPajama-pro is refined from RedPajama-Data-V2 using the ProX refining framework. It contains about 30B high-quality tokens, ready for general language model pre-training.
License
RedPajama-pro is based on RedPajama-Data-V2, which is made available under an apache-2.0 license; users should also abide by the CommonCrawl ToU: https://commoncrawl.org/terms-of-use/. We do not alter the license of any of the underlying data.… See the full description on the dataset page: https://huggingface.co/datasets/gair-prox/RedPajama-pro.
xmohri/RedPajama-Data-V2-sample-100B-filtered-shuffled-tokenized-with-token-counts-1000000en dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
RedPajama -- ArXiv (refined by Data-Juicer)
A refined version of the ArXiv subset of RedPajama, produced by Data-Juicer by removing low-quality ("bad") samples from the original dataset. This dataset is typically used to pretrain a large language model. Notice: this is a small subset for previewing; the whole dataset (about 85GB) is available here.
Dataset Information
Number of samples: 1,655,259 (retaining ~95.99% of the original dataset)
Refining Recipe… See the full description on the dataset page: https://huggingface.co/datasets/datajuicer/redpajama-arxiv-refined-by-data-juicer.
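The actual refining recipe is linked on the dataset page; purely as an illustration of this kind of sample-level quality filtering (not Data-Juicer's recipe), a pass might look like:

```python
# Illustrative quality filter, NOT the actual Data-Juicer recipe (see
# the linked refining recipe for that). Thresholds here are arbitrary
# assumptions chosen only to show the shape of the filtering step.

def repetition_ratio(text: str) -> float:
    """Crude repetition signal: fraction of non-unique words."""
    words = text.split()
    return 1.0 - len(set(words)) / max(len(words), 1)

def keep(sample: dict) -> bool:
    """Drop very short, very long, or highly repetitive samples."""
    text = sample["text"]
    return 100 <= len(text) <= 1_000_000 and repetition_ratio(text) < 0.8

samples = [{"text": "..."}]  # stand-in for a raw ArXiv shard
refined = [s for s in samples if keep(s)]
```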
MIT License (https://opensource.org/licenses/MIT)
Dynamic Topic Modeling Dataset: RedPajama-1T SubSample (100k samples, 1k tokens)
Check out the Blog Post
This dataset represents a curated subset of the RedPajama-1T Sample dataset, specifically processed for dynamic topic modeling applications. It contains 100,000 samples from the original dataset, with each document limited to the first 1,024 tokens for consistent processing.
Dataset Overview
Name: … See the full description on the dataset page: https://huggingface.co/datasets/AmanPriyanshu/Dynamic-Topic-RedPajama-Data-1T-100k-SubSample-max-1k-tokens.
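A sketch of the per-document cap described above; whitespace splitting stands in for tokenization, since the card does not say which tokenizer was used:

```python
# Cap each document at its first 1,024 tokens. Whitespace splitting is
# an assumption; the card does not name the tokenizer actually used.

def truncate_to_tokens(text: str, max_tokens: int = 1024) -> str:
    tokens = text.split()
    return " ".join(tokens[:max_tokens])

doc = "word " * 5000
print(len(truncate_to_tokens(doc).split()))  # 1024
```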
Dataset Card for "RedPajama-combined-15B-8K-llama"
More Information needed
jason9693/RedPajama-combined-15B-8k-llama dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
RedPajama -- C4 (refined by Data-Juicer)
A refined version of the C4 subset of RedPajama, produced by Data-Juicer by removing low-quality ("bad") samples from the original dataset. This dataset is typically used to pretrain a large language model. Notice: this is a small subset for previewing; the whole dataset (about 832GB) is available here.
Dataset Information
Number of samples: 344,491,171 (retaining ~94.42% of the original dataset)
Refining Recipe
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
Fredithefish/hh-rlhf-RedPajama-Chat-Format dataset hosted on Hugging Face and contributed by the HF Datasets community
sboughorbel/redpajama-data-1b-tokenized-olmo-1b dataset hosted on Hugging Face and contributed by the HF Datasets community
Dataset Card for redpajama-data-1t_urls
This dataset provides the URLs and top-level domains associated with training records in togethercomputer/RedPajama-Data-1T. It is part of a collection of datasets curated to make exploring LLM training datasets more straightforward and accessible.
Dataset Details
Dataset Description
This dataset was created by downloading the source data, extracting URLs and top-level domains, and retaining only those record… See the full description on the dataset page: https://huggingface.co/datasets/nhagar/redpajama-data-1t_urls.
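A sketch of the extraction step using only the standard library; a production pass would more likely use a public-suffix-aware package such as tldextract:

```python
# Naive URL -> top-level-domain extraction. This treats the last
# dot-separated label as the TLD, so multi-part suffixes such as
# "co.uk" reduce to "uk"; a public-suffix-aware library fixes that.
from urllib.parse import urlparse

def url_to_tld(url: str) -> str:
    host = urlparse(url).netloc.split(":")[0]  # strip any port
    return host.rsplit(".", 1)[-1] if "." in host else ""

print(url_to_tld("https://example.co.uk/page"))  # "uk"
```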
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
RedPajama -- CommonCrawl-2021-04 (refined by Data-Juicer)
A refined version of the CommonCrawl-2021-04 subset of RedPajama, produced by Data-Juicer by removing low-quality ("bad") samples from the original dataset. This dataset is typically used to pretrain a large language model. Notice: this is a small subset for previewing; the whole dataset (about 284GB) is available here.
Dataset Information
Number of samples: 44,724,752 (retaining ~45.23% of the original dataset)… See the full description on the dataset page: https://huggingface.co/datasets/datajuicer/redpajama-cc-2021-04-refined-by-data-juicer.
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
RedPajama -- CommonCrawl-2022-05 (refined by Data-Juicer)
A refined version of the CommonCrawl-2022-05 subset of RedPajama, produced by Data-Juicer by removing low-quality ("bad") samples from the original dataset. This dataset is typically used to pretrain a large language model. Notice: this is a small subset for previewing; the whole dataset (about 265GB) is available here.
Dataset Information
Number of samples: 42,648,496 (retaining ~45.34% of the original dataset)… See the full description on the dataset page: https://huggingface.co/datasets/datajuicer/redpajama-cc-2022-05-refined-by-data-juicer.
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
RedPajama -- CommonCrawl-2019-30 (refined by Data-Juicer)
A refined version of the CommonCrawl-2019-30 subset of RedPajama, produced by Data-Juicer by removing low-quality ("bad") samples from the original dataset. This dataset is typically used to pretrain a large language model. Notice: this is a small subset for previewing; the whole dataset (about 240GB) is available here.
Dataset Information
Number of samples: 36,557,283 (retaining ~45.08% of the original dataset)… See the full description on the dataset page: https://huggingface.co/datasets/datajuicer/redpajama-cc-2019-30-refined-by-data-juicer.
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
ShareGPT unfiltered dataset in RedPajama-Chat format
This dataset was created by converting the alpaca-lora-formatted ShareGPT dataset to the format required by RedPajama-Chat. This script was used for the conversion: https://github.com/fredi-python/Alpaca2INCITE-Dataset-Converter/blob/main/convert.py. WARNING: only the first human and gpt message of each conversation from the original dataset is included.
The format
{"text": "
RedPajama is a clean-room, fully open-source implementation of the LLaMA dataset.