8 datasets found
  1. RedPajama-Data-V2

    • huggingface.co
    Updated Aug 20, 2014
    Cite
    Together (2014). RedPajama-Data-V2 [Dataset]. https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2
    Dataset updated
    Aug 20, 2014
    Dataset authored and provided by
    Together
    Description

    RedPajama V2: an Open Dataset for Training Large Language Models

  2. RedPajama-Data-1T

    • huggingface.co
    • opendatalab.com
    Cite
    Together, RedPajama-Data-1T [Dataset]. https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset authored and provided by
    Together
    Description

    RedPajama is a clean-room, fully open-source implementation of the LLaMa dataset.

  3. RedPajama-Data-V2-sample-100B-filtered-shuffled-tokenized-with-token-counts

    • huggingface.co
    Updated May 31, 2024
    Cite
    Tristan Thrush (2024). RedPajama-Data-V2-sample-100B-filtered-shuffled-tokenized-with-token-counts [Dataset]. https://huggingface.co/datasets/Tristan/RedPajama-Data-V2-sample-100B-filtered-shuffled-tokenized-with-token-counts
    Dataset updated
    May 31, 2024
    Authors
    Tristan Thrush
    Description

    Tristan/RedPajama-Data-V2-sample-100B-filtered-shuffled-tokenized-with-token-counts dataset hosted on Hugging Face and contributed by the HF Datasets community

  4. redpajama-data-v2_urls

    • huggingface.co
    Updated May 15, 2025
    Cite
    Nick Hagar (2025). redpajama-data-v2_urls [Dataset]. http://doi.org/10.57967/hf/5503
    Dataset updated
    May 15, 2025
    Authors
    Nick Hagar
    Description

    Dataset Card for redpajama-data-v2_urls

    This dataset provides the URLs and top-level domains associated with training records in togethercomputer/RedPajama-Data-V2. It is part of a collection of datasets curated to make exploring LLM training datasets more straightforward and accessible.

      Dataset Details

      Dataset Description

    This dataset was created by downloading the source data, extracting URLs and top-level domains, and retaining only those record… See the full description on the dataset page: https://huggingface.co/datasets/nhagar/redpajama-data-v2_urls.
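The URL and top-level-domain extraction described above might look roughly like the following. This is only a sketch under stated assumptions: the `url` record key and the helper name `url_and_tld` are hypothetical, the top-level domain is taken naively as the last dot-separated label of the hostname, and skipping records without a parseable URL is a choice made here for illustration (the truncated description does not say which records are retained).

```python
from urllib.parse import urlparse


def url_and_tld(record):
    """Return (url, tld) for a record dict, or None if no usable URL.

    Assumption: the record stores its source address under a "url" key.
    """
    url = record.get("url")
    if not url:
        return None
    host = urlparse(url).hostname
    if not host or "." not in host:
        return None
    # Naive top-level domain: the last dot-separated label of the hostname.
    return url, host.rsplit(".", 1)[-1].lower()


# Made-up records for illustration only.
records = [
    {"url": "https://example.com/page"},
    {"url": "http://news.example.org/a/b"},
    {"text": "no url here"},
]
pairs = [p for p in map(url_and_tld, records) if p is not None]
```

A naive last-label split treats multi-part suffixes such as `co.uk` as just `uk`; a public-suffix-aware approach would be needed if that distinction matters.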

  5. RedPajama-pro

    • huggingface.co
    Updated Feb 4, 2012
    Cite
    GAIR-ProX (2012). RedPajama-pro [Dataset]. https://huggingface.co/datasets/gair-prox/RedPajama-pro
    Dataset updated
    Feb 4, 2012
    Dataset authored and provided by
    GAIR-ProX
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    📚 RedPajama-pro

    RedPajama-pro is refined from RedPajama-Data-V2 using the ProX refining framework. It contains about 30B high-quality tokens, ready for general language model pre-training.

      License
    

    RedPajama-pro is based on RedPajama-Data-V2, which is made available under an apache-2.0 license; users should also abide by the CommonCrawl ToU: https://commoncrawl.org/terms-of-use/. We do not alter the license of any of the underlying data.… See the full description on the dataset page: https://huggingface.co/datasets/gair-prox/RedPajama-pro.

  6. LLaMmlein-Dataset

    • huggingface.co
    Cite
    Data Science@CAIDAS Uni Würzburg, LLaMmlein-Dataset [Dataset]. https://huggingface.co/datasets/LSX-UniWue/LLaMmlein-Dataset
    Dataset authored and provided by
    Data Science@CAIDAS Uni Würzburg
    Description

    This dataset is a strict subset of the RedPajama V2 dataset and therefore retains all licenses from RedPajama V2. More details in our preprint!

  7. red_pajama_es_hq

    • huggingface.co
    Cite
    Latam-GPT, red_pajama_es_hq [Dataset]. https://huggingface.co/datasets/latam-gpt/red_pajama_es_hq
    Dataset authored and provided by
    Latam-GPT
    Description

    RedPajama's High Quality Spanish subset

      What is this?
    

    The following is a high-quality dataset distilled from the Spanish subsection of RedPajama-Data-v2, created using the methodology proposed in FineWeb-Edu.

      Usage
    

    from datasets import load_dataset

    ds = load_dataset("latam-gpt/red_pajama_es_hq")

      Filtering by quality score
    

    Documents in this corpus are scored on academic quality from 2.5 to 5, with higher scores indicating better quality. The… See the full description on the dataset page: https://huggingface.co/datasets/latam-gpt/red_pajama_es_hq.
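As a rough illustration of filtering by quality score, the sketch below keeps rows at or above a chosen threshold within the stated 2.5 to 5 range. The column name `score`, the threshold of 3.5, and the helper name `keep_high_quality` are all hypothetical: the truncated description does not name the score field or recommend a cutoff, so check the dataset page before relying on them.

```python
def keep_high_quality(rows, min_score=3.5, score_key="score"):
    """Keep rows whose quality score is at least min_score.

    Assumption: each row is a dict with a numeric value under score_key.
    Both the key name and the default threshold are illustrative only.
    """
    return [row for row in rows if row[score_key] >= min_score]


# Made-up rows standing in for dataset records.
rows = [
    {"text": "documento a", "score": 2.5},
    {"text": "documento b", "score": 3.5},
    {"text": "documento c", "score": 4.8},
]
high_quality = keep_high_quality(rows)
```

The same predicate could be passed to the `filter` method of a loaded Hugging Face dataset, once the real score column name is known.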

  8. c5-en-filtered

    • huggingface.co
    Updated May 8, 2010
    Cite
    monology (2010). c5-en-filtered [Dataset]. https://huggingface.co/datasets/monology/c5-en-filtered
    Dataset updated
    May 8, 2010
    Authors
    monology
    Description

    This is the 2022-05 snapshot of BramVanroy/CommonCrawl-CreativeCommons, filtered by:

    • Extracting the URLs from the dataset
    • Getting documents that match those URLs from the corresponding snapshot of togethercomputer/RedPajama-Data-V2
    • Keeping only the head and middle partitions of ccnet
    • Keeping documents with at least 50 words and a mean word length between 3 and 10 inclusive

    In total we keep 4,553,263 of the 15,239,155 total documents.
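The final length filter and the retention figure can be sketched as follows. This is an illustration, not the author's code: the helper name `passes_length_filter` is hypothetical, and splitting on whitespace to count words is an assumption, since the description does not say how words are tokenized.

```python
def passes_length_filter(text, min_words=50, min_mean_len=3.0, max_mean_len=10.0):
    """Check the word-count and mean-word-length criteria from the description.

    Assumption: words are whitespace-separated tokens.
    """
    words = text.split()
    if len(words) < min_words:
        return False
    mean_len = sum(len(w) for w in words) / len(words)
    # Bounds are inclusive, as stated in the description.
    return min_mean_len <= mean_len <= max_mean_len


# Share of documents kept, using the counts reported above.
kept, total = 4_553_263, 15_239_155
retention = kept / total
```

With the reported counts, the filters retain a little under 30% of the documents.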



RedPajama-Data-V2

Red Pajama V2 Dataset

togethercomputer/RedPajama-Data-V2

46 scholarly articles cite this dataset.
