RedPajama V2: an Open Dataset for Training Large Language Models
RedPajama is a clean-room, fully open-source reproduction of the LLaMA training dataset.
xmohri/RedPajama-Data-V2-sample-100B-filtered-shuffled-tokenized-with-token-counts-10000, a dataset hosted on Hugging Face and contributed by the HF Datasets community.
amazingvince/RedPajama-Data-V2-Sample-snapshot-2023-14, a dataset hosted on Hugging Face and contributed by the HF Datasets community.
Dataset Card for redpajama-data-v2_urls
This dataset provides the URLs and top-level domains associated with training records in togethercomputer/RedPajama-Data-V2. It is part of a collection of datasets curated to make exploring LLM training datasets more straightforward and accessible.
Dataset Details
Dataset Description
This dataset was created by downloading the source data, extracting URLs and top-level domains, and retaining only those record… See the full description on the dataset page: https://huggingface.co/datasets/nhagar/redpajama-data-v2_urls.
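As a hedged sketch of how these URL records might be explored, the snippet below streams the dataset and tallies top-level domains over a small sample; the "train" split name and the "domain" column name are assumptions, so check ds.column_names against the dataset card:

from collections import Counter
from datasets import load_dataset

# Stream the records instead of downloading the full dataset up front.
ds = load_dataset("nhagar/redpajama-data-v2_urls", split="train", streaming=True)

# Tally top-level domains over a 10,000-record sample; "domain" is an assumed column name.
tld_counts = Counter()
for i, record in enumerate(ds):
    tld_counts[record["domain"]] += 1
    if i >= 9_999:
        break
print(tld_counts.most_common(10))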
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
📚 RedPajama-pro
ArXiv | Models | Code
RedPajama-pro is refined from RedPajama-Data-V2 using the ProX refining framework. It contains about 30B high-quality tokens, ready for general language model pre-training.
License
RedPajama-pro is based on RedPajama-Data-V2, which is made available under the Apache 2.0 license; users should also abide by the CommonCrawl ToU: https://commoncrawl.org/terms-of-use/. We do not alter the license of any of the underlying data. … See the full description on the dataset page: https://huggingface.co/datasets/gair-prox/RedPajama-pro.
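A minimal loading sketch, assuming a default configuration with a "train" split and a "text" field (neither is confirmed by the card above):

from datasets import load_dataset

# Stream the corpus; at roughly 30B tokens it is far too large to download casually.
ds = load_dataset("gair-prox/RedPajama-pro", split="train", streaming=True)

# Peek at the first few documents; "text" is an assumed field name.
for doc in ds.take(3):
    print(doc["text"][:200])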
RedPajama's High Quality Spanish subset
What is this?
The following is a high-quality dataset distilled from the Spanish subset of RedPajama-Data-V2, created using the methodology proposed in FineWeb-Edu.
Usage
from datasets import load_dataset
ds = load_dataset("latam-gpt/red_pajama_es_hq")
Filtering by quality score
Documents in this corpus are scored on academic quality from 2.5 to 5, with higher scores indicating better quality. The… See the full description on the dataset page: https://huggingface.co/datasets/latam-gpt/red_pajama_es_hq.
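A hedged sketch of thresholding on that score; the column name "score" and the 4.0 cutoff are assumptions chosen for illustration, so verify both against the dataset card:

from datasets import load_dataset

ds = load_dataset("latam-gpt/red_pajama_es_hq", split="train")

# Keep only documents scored at least 4.0 on the 2.5-5 academic-quality scale.
high_quality = ds.filter(lambda row: row["score"] >= 4.0)
print(f"kept {len(high_quality)} of {len(ds)} documents")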
This is the 2022-05 snapshot of BramVanroy/CommonCrawl-CreativeCommons, filtered by:
- Extracting the URLs from the dataset
- Getting documents that match those URLs from the corresponding snapshot of togethercomputer/RedPajama-Data-V2
- Keeping only the head and middle partitions of ccnet
- Keeping documents with at least 50 words and a mean word length between 3 and 10 inclusive (see the sketch below)
In total, we keep 4,553,263 of the 15,239,155 documents.
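As referenced in the list above, here is a plain-Python sketch of the final length filter; whitespace tokenization stands in for whatever word segmentation the original pipeline used, which is an assumption:

def keep_document(text: str) -> bool:
    # Require at least 50 whitespace-separated words...
    words = text.split()
    if len(words) < 50:
        return False
    # ...with a mean word length between 3 and 10, inclusive.
    mean_len = sum(len(w) for w in words) / len(words)
    return 3 <= mean_len <= 10

assert keep_document("word " * 60)  # 60 four-letter words -> kept
assert not keep_document("a b c")   # fewer than 50 words -> dropped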
This dataset is a strict subset of the RedPajama V2 dataset and therefore retains all licenses from RedPajama V2. More details in our preprint!
Data Take Down
CNXT/autotrain-data-chatx (CNXT / CHaTx), whose card references OpenAssistant/oasst1, fka/awesome-chatgpt-prompts, togethercomputer/RedPajama-Data-1T, anon8231489123/ShareGPT_Vicuna_unfiltered, gsdf/EasyNegative, bloomberg/entsum, openai/summarize_from_feedback, billsum, AmazonScience/massive, amazon_us_reviews, amazon_reviews_multi, openwebtext, microsoft/CLUES, Norod78/microsoft-fluentui-emoji-512-whitebg, Norod78/microsoft-fluentui-emoji-768, MicPie/unpredictable_msdn-microsoft-com, microsoft/codexglue_method_generation… See the full description on the dataset page: https://huggingface.co/datasets/CNXT/autotrain-data-chatx.