10 datasets found
  1. RedPajama-Data-V2

    • huggingface.co
    Updated Oct 30, 2023
    Cite
    Together (2023). RedPajama-Data-V2 [Dataset]. https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2
    Dataset updated
    Oct 30, 2023
    Dataset authored and provided by
    Together
    Description

    RedPajama V2: an Open Dataset for Training Large Language Models

  2. RedPajama-Data-1T

    • huggingface.co
    • opendatalab.com
    + more versions
    Cite
    Together, RedPajama-Data-1T [Dataset]. https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset authored and provided by
    Together
    Description

    RedPajama is a clean-room, fully open-source implementation of the LLaMa dataset.

  3. RedPajama-Data-V2-sample-100B-filtered-shuffled-tokenized-with-token-counts-10000...

    • huggingface.co
    Updated May 31, 2024
    + more versions
    Cite
    Christopher Mohri (2024). RedPajama-Data-V2-sample-100B-filtered-shuffled-tokenized-with-token-counts-10000 [Dataset]. https://huggingface.co/datasets/xmohri/RedPajama-Data-V2-sample-100B-filtered-shuffled-tokenized-with-token-counts-10000
    Dataset updated
    May 31, 2024
    Authors
    Christopher Mohri
    Description

    xmohri/RedPajama-Data-V2-sample-100B-filtered-shuffled-tokenized-with-token-counts-10000 dataset hosted on Hugging Face and contributed by the HF Datasets community

  4. RedPajama-Data-Instruct

    • huggingface.co
    Updated Oct 15, 2004
    Cite
    Together (2004). RedPajama-Data-Instruct [Dataset]. https://huggingface.co/datasets/togethercomputer/RedPajama-Data-Instruct
    Dataset updated
    Oct 15, 2004
    Dataset authored and provided by
    Together
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Dataset Summary

    RedPajama-Instruct-Data is curated from a diverse collection of NLP tasks from both P3 (BigScience) and Natural Instruction (AI2), with aggressive decontamination against HELM conducted in two steps: (1) We first conduct a semantic search using each validation example in HELM as the query, retrieve the top-100 similar instances from the Instruct data set, and check for tasks that have any returned instances overlapping (using 10-gram) with the validation example. We remove the… See the full description on the dataset page: https://huggingface.co/datasets/togethercomputer/RedPajama-Data-Instruct.
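
The 10-gram overlap check described in step (1) can be sketched roughly as follows. This is a minimal illustration, not the dataset authors' code: the whitespace tokenization, the lowercasing, and the function names (`ngrams`, `overlaps`, `contaminated_tasks`) are all assumptions, and the semantic-search step that produces the top-100 candidates is taken as given.

```python
from typing import Iterable, Set, Tuple


def ngrams(text: str, n: int = 10) -> Set[Tuple[str, ...]]:
    """Return the set of word-level n-grams in `text`.

    Assumption: lowercase whitespace tokenization; the real pipeline may differ.
    """
    tokens = text.lower().split()
    # For texts shorter than n tokens the range is empty, so the set is empty.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlaps(candidate: str, validation_example: str, n: int = 10) -> bool:
    """True if the two texts share at least one n-gram."""
    return not ngrams(candidate, n).isdisjoint(ngrams(validation_example, n))


def contaminated_tasks(
    retrieved: Iterable[Tuple[str, str]],
    validation_example: str,
    n: int = 10,
) -> Set[str]:
    """Return names of tasks with any retrieved instance overlapping the example.

    `retrieved` is a hypothetical list of (task_name, instance_text) pairs,
    e.g. the top-100 hits of a semantic search for one validation example.
    """
    return {task for task, text in retrieved if overlaps(text, validation_example, n)}


validation = "the quick brown fox jumps over the lazy dog near the old river bank today"
hits = [
    ("task_a", "we saw that the quick brown fox jumps over the lazy dog near the town"),
    ("task_b", "a completely unrelated sentence about cooking pasta with tomato sauce and basil leaves at home"),
]
flagged = contaminated_tasks(hits, validation)
```

Here `task_a` shares a ten-word run with the validation example and `task_b` does not, so only `task_a` is flagged. What happens to flagged tasks afterwards is cut off in the description above and is not modeled here.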

  5. redpajama-data-v2_urls

    • huggingface.co
    Updated May 15, 2025
    + more versions
    Cite
    redpajama-data-v2_urls [Dataset]. https://huggingface.co/datasets/nhagar/redpajama-data-v2_urls
    Dataset updated
    May 15, 2025
    Authors
    Nick Hagar
    Description

    Dataset Card for redpajama-data-v2_urls

    This dataset provides the URLs and top-level domains associated with training records in togethercomputer/RedPajama-Data-V2. It is part of a collection of datasets curated to make exploring LLM training datasets more straightforward and accessible.

      Dataset Details

      Dataset Description

    This dataset was created by downloading the source data, extracting URLs and top-level domains, and retaining only those record… See the full description on the dataset page: https://huggingface.co/datasets/nhagar/redpajama-data-v2_urls.
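
As a rough illustration of the URL and top-level-domain extraction mentioned above, the sketch below uses only the Python standard library. It is not the dataset author's pipeline: the function name `url_to_domain_parts` is hypothetical, and treating the last dot-separated label of the hostname as the top-level domain is a simplifying assumption (a public-suffix list would be needed to handle multi-label suffixes precisely).

```python
from typing import Optional, Tuple
from urllib.parse import urlsplit


def url_to_domain_parts(url: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (hostname, last_label) for a URL, or (None, None) if it has no hostname.

    Assumption: the last dot-separated label of the hostname stands in for
    the top-level domain.
    """
    host = urlsplit(url).hostname  # already lowercased by urllib
    if not host:
        return None, None
    host = host.rstrip(".")
    return host, host.rsplit(".", 1)[-1]


examples = [
    "https://huggingface.co/datasets/nhagar/redpajama-data-v2_urls",
    "https://Example.co.uk/some/page?x=1",
    "not a url",
]
parts = [url_to_domain_parts(u) for u in examples]
```

Strings without a scheme and host, such as the third example, yield `(None, None)` and could be dropped before aggregation.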

  6. RedPajama-pro

    • huggingface.co
    Updated Feb 4, 2012
    Cite
    GAIR-ProX (2012). RedPajama-pro [Dataset]. https://huggingface.co/datasets/gair-prox/RedPajama-pro
    Dataset updated
    Feb 4, 2012
    Dataset authored and provided by
    GAIR-ProX
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    📚 RedPajama-pro

    RedPajama-pro is refined from RedPajama-Data-V2 using the ProX refining framework. It contains about 30B high-quality tokens, ready for general language model pre-training.

      License
    

    RedPajama-pro is based on RedPajama-Data-V2, which is made available under an apache-2.0 license; users should also abide by the CommonCrawl ToU: https://commoncrawl.org/terms-of-use/. We do not alter the license of any of the underlying data.… See the full description on the dataset page: https://huggingface.co/datasets/gair-prox/RedPajama-pro.

  7. rpj-v2-sample

    • huggingface.co
    Cite
    EleutherAI, rpj-v2-sample [Dataset]. https://huggingface.co/datasets/EleutherAI/rpj-v2-sample
    Dataset authored and provided by
    EleutherAI (https://eleuther.ai/)
    Description

    This is a mirror of the sample-10B subset of RedPajama-Data-V2 which we have re-uploaded in order to resolve issues with the original download script.

      Getting Started
    

    RedPajama-V2 is an open dataset for training large language models. The dataset includes over 100B text documents coming from 84 CommonCrawl snapshots and processed using the CCNet pipeline. Out of these, there are 30B documents in the corpus that additionally come with quality signals. In addition, we also provide the… See the full description on the dataset page: https://huggingface.co/datasets/EleutherAI/rpj-v2-sample.

  8. red_pajama_es_hq

    • huggingface.co
    Cite
    Latam-GPT, red_pajama_es_hq [Dataset]. https://huggingface.co/datasets/latam-gpt/red_pajama_es_hq
    Dataset authored and provided by
    Latam-GPT
    Description

    RedPajama's High Quality Spanish subset

      What is this?
    

    The following is a high-quality dataset distilled from the Spanish subsection of RedPajama-Data-v2, created using the methodology proposed in FineWeb-Edu.

      Usage
    

    from datasets import load_dataset

    ds = load_dataset("latam-gpt/red_pajama_es_hq")

      Filtering by quality score
    

    Documents in this corpus are scored on academic quality from 2.5 to 5, with higher scores indicating better… See the full description on the dataset page: https://huggingface.co/datasets/latam-gpt/red_pajama_es_hq.
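
A minimal sketch of filtering by quality score, written over plain dictionaries so that it runs without downloading anything. The field name `score` and the helper `filter_by_score` are hypothetical: the description above is truncated before it names the actual column, so check the dataset page before relying on either.

```python
from typing import Any, Dict, Iterable, List


def filter_by_score(
    records: Iterable[Dict[str, Any]],
    min_score: float,
    score_key: str = "score",  # hypothetical column name, not confirmed by the source
) -> List[Dict[str, Any]]:
    """Keep records whose quality score is at least `min_score`.

    Records missing the score field are dropped.
    """
    return [r for r in records if r.get(score_key, float("-inf")) >= min_score]


# Toy records using the 2.5 to 5 range stated in the description.
docs = [
    {"text": "documento A", "score": 2.5},
    {"text": "documento B", "score": 3.7},
    {"text": "documento C", "score": 4.9},
    {"text": "documento D"},  # no score recorded
]
high_quality = filter_by_score(docs, min_score=3.5)
```

On a dataset loaded with `load_dataset`, the same predicate could be passed to `Dataset.filter`, again assuming the column really is called `score`.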

  9. suri

    • huggingface.co
    Updated Jun 28, 2024
    Cite
    Chau Minh Pham (2024). suri [Dataset]. https://huggingface.co/datasets/chtmp223/suri
    Dataset updated
    Jun 28, 2024
    Authors
    Chau Minh Pham
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Suri: Multi-constraint instruction following for long-form text generation

    Suri features 20K multi-constraint instructions, each accompanied by human-written gold responses sourced from Books3, ChapterBreak, and RedPajama-Data-v2. For a complete example of an instruction along with model generations, visit our website.

      ⚠️ Getting Started
    

    Our GitHub repository contains the code to reconstruct the Books3 subset in this dataset. Due to copyright concerns, we do not publicly… See the full description on the dataset page: https://huggingface.co/datasets/chtmp223/suri.

  10. redpajama-wiki-refined-by-data-juicer

    • huggingface.co
    Updated Oct 24, 2023
    Cite
    Data-Juicer (2023). redpajama-wiki-refined-by-data-juicer [Dataset]. https://huggingface.co/datasets/datajuicer/redpajama-wiki-refined-by-data-juicer
    Dataset updated
    Oct 24, 2023
    Dataset authored and provided by
    Data-Juicer
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    RedPajama -- Wikipedia (refined by Data-Juicer)

    A refined version of the Wikipedia dataset in RedPajama, produced by Data-Juicer, which removes some "bad" samples from the original dataset to improve its quality. This dataset is typically used to pretrain a large language model. Note: this is a small subset for previewing. The whole dataset (about 68 GB) is available here.

      Dataset Information
    

    Number of samples: 26,990,659 (~90.47% of the original dataset kept)

      Refining… See the full description on the dataset page: https://huggingface.co/datasets/datajuicer/redpajama-wiki-refined-by-data-juicer.
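
The sample count and kept percentage above imply an approximate size for the original Wikipedia subset. The arithmetic below is a back-of-the-envelope estimate only: it assumes the rounded 90.47% figure applies directly to the sample count, and the variable names are illustrative.

```python
# Figures taken from the listing above.
kept_samples = 26_990_659
kept_fraction = 0.9047  # "~90.47%", rounded in the source

# Estimated size of the original dataset and number of removed samples.
estimated_original = round(kept_samples / kept_fraction)
estimated_removed = estimated_original - kept_samples

print(f"estimated original samples: {estimated_original:,}")
print(f"estimated removed samples:  {estimated_removed:,}")
```

This gives roughly 29.8 million original samples and about 2.8 million removed. Because the percentage is rounded, the exact counts should be taken from the dataset page.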
    


RedPajama-Data-V2

Red Pajama V2 Dataset

togethercomputer/RedPajama-Data-V2

44 scholarly articles cite this dataset (View in Google Scholar)
Dataset updated
Oct 30, 2023
Dataset authored and provided by
Together
Description

RedPajama V2: an Open Dataset for Training Large Language Models
