RedPajama V2: an Open Dataset for Training Large Language Models
RedPajama is a clean-room, fully open-source implementation of the LLaMA training dataset. This is a 1B-token sample of the full dataset.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for Dataset Name
Dataset Summary
This is a tiny version of the RedPajama dataset. It contains 64 samples from each of the 7 sources. This dataset is intended for developing and testing data/training pipelines that load the full RedPajama dataset or any general HuggingFace dataset. It is very fast to download and easy to examine. You should not use it to train a full model, but you can use it for overfitting tests or other sanity checks.… See the full description on the dataset page: https://huggingface.co/datasets/ivanzhouyq/RedPajama-Tiny.
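As a minimal sketch of the kind of sanity check this tiny dataset is meant for (the column name `text` and the use of the `datasets` library are assumptions, not confirmed by the card), a pipeline smoke test might look like:

```python
def smoke_test(samples, text_field="text", min_samples=1):
    """Sanity-check a batch of dataset samples before a full training run.

    `text_field` is an assumed column name; adjust it to the actual schema.
    """
    count = 0
    for sample in samples:
        assert text_field in sample, f"missing field: {text_field}"
        assert isinstance(sample[text_field], str) and sample[text_field], "empty text"
        count += 1
    assert count >= min_samples, "too few samples"
    return count

# Dummy stand-in for something like:
#   datasets.load_dataset("ivanzhouyq/RedPajama-Tiny")["train"]
dummy = [{"text": "hello world"}, {"text": "foo bar"}]
print(smoke_test(dummy))  # 2
```

Because the dataset is only 64 samples per source, a check like this runs in seconds before committing to a full download.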
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
RedPajama -- Book (refined by Data-Juicer)
A refined version of the Book subset of RedPajama, produced with Data-Juicer by removing low-quality samples from the original dataset. This dataset is typically used to pretrain large language models. Notice: this is a small subset for previewing; the whole dataset (about 91GB) is available here.
Dataset Information
Number of samples: 195,983 (retaining ~95.51% of the original dataset)
Refining Recipe
Dataset Card for redpajama-data-1t_urls
This dataset provides the URLs and top-level domains associated with training records in togethercomputer/RedPajama-Data-1T. It is part of a collection of datasets curated to make exploring LLM training datasets more straightforward and accessible.
Dataset Details
Dataset Description
This dataset was created by downloading the source data, extracting URLs and top-level domains, and retaining only those record… See the full description on the dataset page: https://huggingface.co/datasets/nhagar/redpajama-data-1t_urls.
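The card does not show its extraction code, but a sketch of pulling hostnames and top-level domains from record URLs could use only the standard library (the TLD logic here is a naive assumption; production pipelines typically consult the Public Suffix List):

```python
from urllib.parse import urlparse

def url_to_domain_and_tld(url):
    """Return (hostname, naive TLD) for a record URL.

    Taking the last dot-separated label as the TLD is a simplification;
    real pipelines usually resolve suffixes like ".co.uk" via the
    Public Suffix List.
    """
    host = urlparse(url).hostname or ""
    tld = host.rsplit(".", 1)[-1] if "." in host else ""
    return host, tld

print(url_to_domain_and_tld("https://www.example.com/page?id=1"))
# ('www.example.com', 'com')
```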
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
RedPajama-pro
ArXiv | Models | Code
RedPajama-pro is refined from RedPajama-Data-V2 using the ProX refining framework. It contains about 30B high-quality tokens, ready for general language model pre-training.
License
RedPajama-pro is based on RedPajama-Data-V2, which is made available under an Apache-2.0 license; users should also abide by the CommonCrawl ToU: https://commoncrawl.org/terms-of-use/. We do not alter the license of any of the underlying data.… See the full description on the dataset page: https://huggingface.co/datasets/gair-prox/RedPajama-pro.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dynamic Topic Modeling Dataset: RedPajama-1T SubSample (100k samples, 1k tokens)
Check out the Blog Post
This dataset represents a curated subset of the RedPajama-1T Sample dataset, specifically processed for dynamic topic modeling applications. It contains 100,000 samples from the original dataset, with each document limited to the first 1,024 tokens for consistent processing.
Dataset Overview
Name: … See the full description on the dataset page: https://huggingface.co/datasets/AmanPriyanshu/Dynamic-Topic-RedPajama-Data-1T-100k-SubSample-max-1k-tokens.
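As a rough sketch of the truncation step the card describes (whitespace splitting stands in for the card's unspecified tokenizer, which is an assumption):

```python
def truncate_to_max_tokens(text, max_tokens=1024):
    """Keep only the first `max_tokens` tokens of a document.

    Whitespace splitting is a stand-in here; the dataset's actual
    tokenizer is not specified on the card.
    """
    tokens = text.split()
    return " ".join(tokens[:max_tokens])

doc = "alpha beta gamma delta"
print(truncate_to_max_tokens(doc, max_tokens=2))  # alpha beta
```

Capping every document at the same token budget keeps per-sample processing cost uniform, which matters when fitting topic models over 100,000 documents.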
Tristan/RedPajama-Data-V2-sample-100B-filtered-shuffled-tokenized-with-token-counts dataset hosted on Hugging Face and contributed by the HF Datasets community
rvanova/RedPajama-Data-1K-Sample-For-Test dataset hosted on Hugging Face and contributed by the HF Datasets community
jason9693/RedPajama-combined-15B-8k-llama dataset hosted on Hugging Face and contributed by the HF Datasets community
Dataset Card for "RedPajama-combined-15B-8K-llama"
More Information needed
reds0510/redpajama-wiki-tiny-1000 dataset hosted on Hugging Face and contributed by the HF Datasets community
Divyanshh/redPajama-binaries2 dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Summary
The Arithmetic Operations Dataset is a synthetically generated collection of mathematical arithmetic operations for practice and evaluation purposes. It contains a total of 624,800 arithmetic operations, consisting of 568,000 addition operations and 56,800 subtraction operations. The dataset is designed to provide a range of arithmetic problems for training and evaluating language models on simple arithmetic (mostly addition; the others TBA).… See the full description on the dataset page: https://huggingface.co/datasets/xufana/RedPajama-INCITE-Instruct-3B-Addition.
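A hedged sketch of how such synthetic addition examples might be generated (the prompt format and operand range are illustrative assumptions; the card does not show them):

```python
import random

def make_addition_example(rng, max_operand=10**6):
    """Generate one synthetic addition problem as a (prompt, answer) pair."""
    a = rng.randint(0, max_operand)
    b = rng.randint(0, max_operand)
    return f"{a} + {b} =", str(a + b)

# Seeded generator so the dataset is reproducible.
rng = random.Random(42)
prompt, answer = make_addition_example(rng, max_operand=100)
print(prompt, answer)
```

Generating hundreds of thousands of such pairs is cheap, which is why synthetic arithmetic is a common probe of a model's basic numeracy.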
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
AmanPriyanshu/GTE-ModernBERT-RedPajama-Data-1T-100k-SubSample-max-1k-tokens dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
saberai/MetaMath-Redpajama-Chat-Format dataset hosted on Hugging Face and contributed by the HF Datasets community
Dataset Card for Evaluation run of togethercomputer/RedPajama-INCITE-7B-Base
Dataset automatically created during the evaluation run of model togethercomputer/RedPajama-INCITE-7B-Base. The dataset is composed of 44 configuration(s), each one corresponding to one of the evaluated tasks. The dataset has been created from 1 run(s). Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "train" split is always pointing… See the full description on the dataset page: https://huggingface.co/datasets/open-llm-leaderboard/togethercomputer_RedPajama-INCITE-7B-Base-details.
listar2000/redpajama-subset-50k dataset hosted on Hugging Face and contributed by the HF Datasets community
listar2000/redpajama-subset-chunked dataset hosted on Hugging Face and contributed by the HF Datasets community