34 datasets found
  1. RedPajama-Data-1T

    • huggingface.co
    • opendatalab.com
    Cite
    Together, RedPajama-Data-1T [Dataset]. https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T
    Dataset authored and provided by
    Together
    Description

    RedPajama is a clean-room, fully open-source implementation of the LLaMa dataset.

  2. RedPajama-Data-V2

    • huggingface.co
    Updated Aug 20, 2014
    Cite
    Together (2014). RedPajama-Data-V2 [Dataset]. https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2
    Dataset updated
    Aug 20, 2014
    Dataset authored and provided by
    Together
    Description

    RedPajama V2: an Open Dataset for Training Large Language Models

  3. RedPajama-Data-1T-Sample

    • huggingface.co
    Cite
    Together, RedPajama-Data-1T-Sample [Dataset]. https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T-Sample
    Dataset authored and provided by
    Together
    Description

    RedPajama is a clean-room, fully open-source implementation of the LLaMa dataset. This is a 1B-token sample of the full dataset.

  4. RedPajama-Tiny

    • huggingface.co
    • opendatalab.com
    Cite
    Ivan Zhou, RedPajama-Tiny [Dataset]. https://huggingface.co/datasets/ivanzhouyq/RedPajama-Tiny
    Authors
    Ivan Zhou
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Dataset Card for Dataset Name

    Dataset Summary

    This is a tiny version of the RedPajama dataset. It contains 64 samples from each of the 7 sources. This dataset is intended for developing and testing data/training pipelines that load the full RedPajama dataset or any general Hugging Face dataset. It is very fast to download and easy to examine. You should not use it for training a full model, but you can use it for overfitting tests or other sanity checks… See the full description on the dataset page: https://huggingface.co/datasets/ivanzhouyq/RedPajama-Tiny.

  5. redpajama-book-refined-by-data-juicer

    • huggingface.co
    Updated Oct 23, 2023
    Cite
    Data-Juicer (2023). redpajama-book-refined-by-data-juicer [Dataset]. https://huggingface.co/datasets/datajuicer/redpajama-book-refined-by-data-juicer
    Dataset updated
    Oct 23, 2023
    Dataset authored and provided by
    Data-Juicer
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    RedPajama -- Book (refined by Data-Juicer)

    A refined version of the Book dataset in RedPajama, produced by Data-Juicer. Some "bad" samples were removed from the original dataset to raise its quality. This dataset is typically used to pretrain a large language model. Notice: this is a small subset for previewing. The whole dataset is available here (about 91 GB).

    Dataset Information

    Number of samples: 195,983 (keeps ~95.51% of the original dataset)

    Refining Recipe

    … See the full description on the dataset page: https://huggingface.co/datasets/datajuicer/redpajama-book-refined-by-data-juicer.

  6. redpajama-data-1t_urls

    • huggingface.co
    Updated May 15, 2025
    Cite
    Nick Hagar (2025). redpajama-data-1t_urls [Dataset]. http://doi.org/10.57967/hf/5502
    Dataset updated
    May 15, 2025
    Authors
    Nick Hagar
    Description

    Dataset Card for redpajama-data-1t_urls

    This dataset provides the URLs and top-level domains associated with training records in togethercomputer/RedPajama-Data-1T. It is part of a collection of datasets curated to make exploring LLM training datasets more straightforward and accessible.

    Dataset Details

    Dataset Description

    This dataset was created by downloading the source data, extracting URLs and top-level domains, and retaining only those record… See the full description on the dataset page: https://huggingface.co/datasets/nhagar/redpajama-data-1t_urls.

  7. RedPajama-pro

    • huggingface.co
    Updated Feb 4, 2012
    Cite
    GAIR-ProX (2012). RedPajama-pro [Dataset]. https://huggingface.co/datasets/gair-prox/RedPajama-pro
    Dataset updated
    Feb 4, 2012
    Dataset authored and provided by
    GAIR-ProX
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    📚 RedPajama-pro

    ArXiv | Models | Code

    RedPajama-pro is refined from RedPajama-Data-V2 using the ProX refining framework. It contains about 30B high-quality tokens, ready for general language model pre-training.

    License

    RedPajama-pro is based on RedPajama-Data-V2, which is made available under an apache-2.0 license; users should also abide by the CommonCrawl ToU: https://commoncrawl.org/terms-of-use/. We do not alter the license of any of the underlying data… See the full description on the dataset page: https://huggingface.co/datasets/gair-prox/RedPajama-pro.

  8. Dynamic-Topic-RedPajama-Data-1T-100k-SubSample-max-1k-tokens

    • huggingface.co
    Cite
    Aman Priyanshu, Dynamic-Topic-RedPajama-Data-1T-100k-SubSample-max-1k-tokens [Dataset]. https://huggingface.co/datasets/AmanPriyanshu/Dynamic-Topic-RedPajama-Data-1T-100k-SubSample-max-1k-tokens
    Authors
    Aman Priyanshu
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Dynamic Topic Modeling Dataset: RedPajama-1T SubSample (100k samples, 1k tokens)

    Check out the Blog Post

    This dataset represents a curated subset of the RedPajama-1T Sample dataset, specifically processed for dynamic topic modeling applications. It contains 100,000 samples from the original dataset, with each document limited to the first 1,024 tokens for consistent processing.
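
    The truncation described above (keeping only the first 1,024 tokens of each document) can be sketched as follows. This is a minimal illustration, not the dataset author's code: the tokenizer actually used is not stated here, so whitespace splitting stands in for it, and the name `truncate_to_max_tokens` is hypothetical.

```python
from typing import Callable, List


def truncate_to_max_tokens(
    text: str,
    max_tokens: int = 1024,
    tokenize: Callable[[str], List[str]] = str.split,
    detokenize: Callable[[List[str]], str] = " ".join,
) -> str:
    """Keep only the first `max_tokens` tokens of `text`.

    Assumption: whitespace splitting is a stand-in for whatever
    tokenizer the real dataset was processed with.
    """
    tokens = tokenize(text)
    return detokenize(tokens[:max_tokens])


# A 2,000-word document is cut down to its first 1,024 words;
# a short document passes through unchanged.
long_doc = " ".join(f"w{i}" for i in range(2000))
short_doc = "a b c"
truncated = truncate_to_max_tokens(long_doc)
unchanged = truncate_to_max_tokens(short_doc)
```

    Passing a real tokenizer's encode/decode pair in place of `tokenize` and `detokenize` would give token counts that match a specific model's vocabulary.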

    Dataset Overview

    Name:… See the full description on the dataset page: https://huggingface.co/datasets/AmanPriyanshu/Dynamic-Topic-RedPajama-Data-1T-100k-SubSample-max-1k-tokens.

  9. RedPajama-Data-V2-sample-100B-filtered-shuffled-tokenized-with-token-counts

    • huggingface.co
    Updated May 31, 2024
    Cite
    Tristan Thrush (2024). RedPajama-Data-V2-sample-100B-filtered-shuffled-tokenized-with-token-counts [Dataset]. https://huggingface.co/datasets/Tristan/RedPajama-Data-V2-sample-100B-filtered-shuffled-tokenized-with-token-counts
    Dataset updated
    May 31, 2024
    Authors
    Tristan Thrush
    Description

    Tristan/RedPajama-Data-V2-sample-100B-filtered-shuffled-tokenized-with-token-counts dataset hosted on Hugging Face and contributed by the HF Datasets community

  10. RedPajama-Data-1K-Sample-For-Test

    • huggingface.co
    Cite
    Mila Rvanova, RedPajama-Data-1K-Sample-For-Test [Dataset]. https://huggingface.co/datasets/rvanova/RedPajama-Data-1K-Sample-For-Test
    Authors
    Mila Rvanova
    Description

    rvanova/RedPajama-Data-1K-Sample-For-Test dataset hosted on Hugging Face and contributed by the HF Datasets community

  11. RedPajama-combined-15B-8k-llama

    • huggingface.co
    Updated Jul 17, 2024
    Cite
    Kichang Yang (2024). RedPajama-combined-15B-8k-llama [Dataset]. https://huggingface.co/datasets/jason9693/RedPajama-combined-15B-8k-llama
    Dataset updated
    Jul 17, 2024
    Authors
    Kichang Yang
    Description

    jason9693/RedPajama-combined-15B-8k-llama dataset hosted on Hugging Face and contributed by the HF Datasets community

  12. RedPajama-combined-15B-8k-llama

    • huggingface.co
    Updated Feb 23, 2024
    Cite
    Alexander Wettig (2024). RedPajama-combined-15B-8k-llama [Dataset]. https://huggingface.co/datasets/awettig/RedPajama-combined-15B-8k-llama
    Dataset updated
    Feb 23, 2024
    Authors
    Alexander Wettig
    Description

    Dataset Card for "RedPajama-combined-15B-8K-llama"

    More Information needed

  13. redpajama-wiki-tiny-1000

    • huggingface.co
    Updated Mar 1, 2025
    Cite
    reds0510 (2025). redpajama-wiki-tiny-1000 [Dataset]. https://huggingface.co/datasets/reds0510/redpajama-wiki-tiny-1000
    Dataset updated
    Mar 1, 2025
    Dataset authored and provided by
    reds0510
    Description

    reds0510/redpajama-wiki-tiny-1000 dataset hosted on Hugging Face and contributed by the HF Datasets community

  14. redPajama-binaries2

    • huggingface.co
    Updated Aug 31, 2025
    Cite
    Divyansh (2025). redPajama-binaries2 [Dataset]. https://huggingface.co/datasets/Divyanshh/redPajama-binaries2
    Dataset updated
    Aug 31, 2025
    Authors
    Divyansh
    Description

    Divyanshh/redPajama-binaries2 dataset hosted on Hugging Face and contributed by the HF Datasets community

  15. RedPajama-INCITE-Instruct-3B-Addition

    • huggingface.co
    Updated Jun 9, 2023
    Cite
    Daria Andreeva (2023). RedPajama-INCITE-Instruct-3B-Addition [Dataset]. https://huggingface.co/datasets/xufana/RedPajama-INCITE-Instruct-3B-Addition
    Dataset updated
    Jun 9, 2023
    Authors
    Daria Andreeva
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Dataset Summary

    The Arithmetic Operations Dataset is a synthetically generated collection of arithmetic operations for practice and evaluation purposes. It contains a total of 624,800 arithmetic operations, consisting of 568,000 addition operations and 56,800 subtraction operations. The dataset is designed to provide a range of arithmetic problems to train and evaluate language models on simple arithmetic (mostly addition, the others TBA)… See the full description on the dataset page: https://huggingface.co/datasets/xufana/RedPajama-INCITE-Instruct-3B-Addition.
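
    As a rough illustration of how a synthetic addition/subtraction set like this might be generated, here is a minimal sketch. The function name `make_arithmetic_examples`, the `question`/`answer` field names, the operand range, and the prompt format are all assumptions, not the dataset's actual schema or generation code; only the 10:1 ratio of addition to subtraction (568,000 to 56,800) comes from the description.

```python
import random
from typing import Dict, List


def make_arithmetic_examples(
    n_add: int, n_sub: int, max_operand: int = 999, seed: int = 0
) -> List[Dict[str, str]]:
    """Generate `n_add` addition and `n_sub` subtraction problems.

    Hypothetical format: {"question": "12 + 7 =", "answer": "19"}.
    """
    rng = random.Random(seed)  # fixed seed keeps the output reproducible
    examples = []
    for op, count in (("+", n_add), ("-", n_sub)):
        for _ in range(count):
            a = rng.randint(0, max_operand)
            b = rng.randint(0, max_operand)
            answer = a + b if op == "+" else a - b
            examples.append({"question": f"{a} {op} {b} =", "answer": str(answer)})
    return examples


# Same 10:1 addition-to-subtraction ratio as the description, at toy scale.
examples = make_arithmetic_examples(n_add=10, n_sub=1)
```

    Scaling `n_add` and `n_sub` to 568,000 and 56,800 would reproduce the stated total of 624,800 operations.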

  16. GTE-ModernBERT-RedPajama-Data-1T-100k-SubSample-max-1k-tokens

    • huggingface.co
    Cite
    Aman Priyanshu, GTE-ModernBERT-RedPajama-Data-1T-100k-SubSample-max-1k-tokens [Dataset]. https://huggingface.co/datasets/AmanPriyanshu/GTE-ModernBERT-RedPajama-Data-1T-100k-SubSample-max-1k-tokens
    Authors
    Aman Priyanshu
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    AmanPriyanshu/GTE-ModernBERT-RedPajama-Data-1T-100k-SubSample-max-1k-tokens dataset hosted on Hugging Face and contributed by the HF Datasets community

  17. MetaMath-Redpajama-Chat-Format

    • huggingface.co
    Updated Dec 30, 2023
    Cite
    Stephane Nathaniel (2023). MetaMath-Redpajama-Chat-Format [Dataset]. https://huggingface.co/datasets/saberai/MetaMath-Redpajama-Chat-Format
    Dataset updated
    Dec 30, 2023
    Authors
    Stephane Nathaniel
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    saberai/MetaMath-Redpajama-Chat-Format dataset hosted on Hugging Face and contributed by the HF Datasets community

  18. togethercomputer_RedPajama-INCITE-7B-Base-details

    • huggingface.co
    Updated Jul 30, 2025
    Cite
    Open LLM Leaderboard (2025). togethercomputer_RedPajama-INCITE-7B-Base-details [Dataset]. https://huggingface.co/datasets/open-llm-leaderboard/togethercomputer_RedPajama-INCITE-7B-Base-details
    Dataset updated
    Jul 30, 2025
    Dataset authored and provided by
    Open LLM Leaderboard
    Description

    Dataset Card for Evaluation run of togethercomputer/RedPajama-INCITE-7B-Base

    Dataset automatically created during the evaluation run of model togethercomputer/RedPajama-INCITE-7B-Base. The dataset is composed of 44 configuration(s), each one corresponding to one of the evaluated tasks. The dataset has been created from 1 run(s). Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "train" split is always pointing… See the full description on the dataset page: https://huggingface.co/datasets/open-llm-leaderboard/togethercomputer_RedPajama-INCITE-7B-Base-details.

  19. redpajama-subset-50k

    • huggingface.co
    Cite
    Sida Li, redpajama-subset-50k [Dataset]. https://huggingface.co/datasets/listar2000/redpajama-subset-50k
    Authors
    Sida Li
    Description

    listar2000/redpajama-subset-50k dataset hosted on Hugging Face and contributed by the HF Datasets community

  20. redpajama-subset-chunked

    • huggingface.co
    Cite
    Sida Li, redpajama-subset-chunked [Dataset]. https://huggingface.co/datasets/listar2000/redpajama-subset-chunked
    Authors
    Sida Li
    Description

    listar2000/redpajama-subset-chunked dataset hosted on Hugging Face and contributed by the HF Datasets community

RedPajama-Data-1T

Red Pajama 1T

togethercomputer/RedPajama-Data-1T

65 scholarly articles cite this dataset (View in Google Scholar)