43 datasets found
  1. RedPajama-Data-1T

    • huggingface.co
    • opendatalab.com
    + more versions
    Cite
    Together, RedPajama-Data-1T [Dataset]. https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T
    Dataset authored and provided by
    Together
    Description

    RedPajama is a clean-room, fully open-source implementation of the LLaMa dataset.

  2. RedPajama-Data-V2

    • huggingface.co
    Updated Oct 30, 2023
    Cite
    Together (2023). RedPajama-Data-V2 [Dataset]. https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2
    Dataset updated
    Oct 30, 2023
    Dataset authored and provided by
    Together
    Description

    RedPajama V2: an Open Dataset for Training Large Language Models

  3. RedPajama-Data-1T-Sampled10File1024Row-Filtered4096Words-ForBenchmark

    • huggingface.co
    Updated Apr 22, 2024
    + more versions
    Cite
    Jiapei Huang (2024). RedPajama-Data-1T-Sampled10File1024Row-Filtered4096Words-ForBenchmark [Dataset]. https://huggingface.co/datasets/hjp709394/RedPajama-Data-1T-Sampled10File1024Row-Filtered4096Words-ForBenchmark
    Dataset updated
    Apr 22, 2024
    Authors
    Jiapei Huang
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    RedPajama is a clean-room, fully open-source implementation of the LLaMa dataset. This is a 1B-token sample of the full dataset.

  4. RedPajama-Data-Instruct

    • huggingface.co
    Updated Oct 15, 2004
    Cite
    Together (2004). RedPajama-Data-Instruct [Dataset]. https://huggingface.co/datasets/togethercomputer/RedPajama-Data-Instruct
    Dataset updated
    Oct 15, 2004
    Dataset authored and provided by
    Together
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Dataset Summary

    RedPajama-Instruct-Data is curated from a diverse collection of NLP tasks from both P3 (BigScience) and Natural Instruction (AI2), with aggressive decontamination against HELM conducted in two steps: (1) We first conduct semantic search using each validation example in HELM as the query, get the top-100 similar instances from the Instruct data set, and check tasks that have any returned instances overlapping (using 10-gram) with the validation example. We remove the… See the full description on the dataset page: https://huggingface.co/datasets/togethercomputer/RedPajama-Data-Instruct.
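
The 10-gram overlap check mentioned in this description can be illustrated with a minimal sketch. It is not the authors' pipeline: it skips the semantic-search step entirely, assumes lowercased whitespace tokenization, and the function names and toy strings are hypothetical.

```python
def ngrams(text, n=10):
    """Return the set of word-level n-grams in `text` (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    # range() is empty when the text has fewer than n tokens, giving an empty set.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlaps(candidate, reference, n=10):
    """True if the two texts share at least one word-level n-gram."""
    return bool(ngrams(candidate, n) & ngrams(reference, n))


def contaminated_tasks(task_examples, validation_examples, n=10):
    """Return the names of tasks with any example overlapping a validation example."""
    flagged = set()
    for task, examples in task_examples.items():
        if any(overlaps(ex, val, n) for ex in examples for val in validation_examples):
            flagged.add(task)
    return flagged


# Toy data (hypothetical): task_a shares a 10-word run with the validation example.
validation = ["the quick brown fox jumps over the lazy dog near the quiet river bank today"]
tasks = {
    "task_a": ["note that the quick brown fox jumps over the lazy dog near the river"],
    "task_b": ["an unrelated sentence about training language models on open data sets"],
}
flagged = contaminated_tasks(tasks, validation)
print(sorted(flagged))
```

An exhaustive comparison like this scales poorly; retrieving only the most similar candidates first, as the description outlines, keeps the number of pairwise checks manageable.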

  5. redpajama-book-refined-by-data-juicer

    • huggingface.co
    Updated Oct 23, 2023
    Cite
    Data-Juicer (2023). redpajama-book-refined-by-data-juicer [Dataset]. https://huggingface.co/datasets/datajuicer/redpajama-book-refined-by-data-juicer
    Dataset updated
    Oct 23, 2023
    Dataset authored and provided by
    Data-Juicer
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    RedPajama -- Book (refined by Data-Juicer)

    A refined version of Book dataset in RedPajama by Data-Juicer. Removing some "bad" samples from the original dataset to make it higher-quality. This dataset is usually used to pretrain a Large Language Model. Notice: Here is a small subset for previewing. The whole dataset is available here (About 91GB).

      Dataset Information
    

    Number of samples: 195,983 (Keep ~95.51% from the original dataset)

      Refining Recipe
    

    … See the full description on the dataset page: https://huggingface.co/datasets/datajuicer/redpajama-book-refined-by-data-juicer.
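
The sample count and keep rate quoted above allow a rough back-of-the-envelope estimate of the original dataset size. The sketch below only does that arithmetic; the helper name is hypothetical, and because the published rate is rounded to two decimals the result is approximate.

```python
def estimate_original_count(kept, keep_rate_percent):
    """Estimate the pre-filtering sample count from the kept count and keep rate."""
    return round(kept / (keep_rate_percent / 100.0))


kept = 195_983     # "Number of samples" reported for the refined Book subset
keep_rate = 95.51  # "Keep ~95.51%" as reported; rounded, so the estimate is approximate

original = estimate_original_count(kept, keep_rate)
removed = original - kept
print(f"estimated original samples: {original:,}")
print(f"estimated removed samples:  {removed:,}")
```

The same helper applies to the other Data-Juicer subsets in this listing by substituting their reported counts and keep rates.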

  6. RedPajama-Tiny

    • huggingface.co
    • opendatalab.com
    + more versions
    Cite
    Ivan Zhou, RedPajama-Tiny [Dataset]. https://huggingface.co/datasets/ivanzhouyq/RedPajama-Tiny
    Authors
    Ivan Zhou
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Dataset Card for Dataset Name

      Dataset Summary
    

    This is a tiny version of the RedPajama dataset. It contains 64 samples from each of the 7 sources. This dataset is intended for developing and testing data/training pipelines for loading the full RedPajama dataset or any general HuggingFace dataset. It is very fast to download and easy to examine. You should not use it for training a full model, but you can use it for overfitting tests or any other sanity checks. … See the full description on the dataset page: https://huggingface.co/datasets/ivanzhouyq/RedPajama-Tiny.
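
A fixed-size sample per source, as described above, could be drawn along these lines. This is a sketch under assumptions, not the dataset author's script: the `source` field, the placeholder source names, and the seeded uniform sampling are all hypothetical.

```python
import random
from collections import defaultdict


def sample_per_source(records, per_source=64, seed=0):
    """Draw up to `per_source` records uniformly at random from each source."""
    by_source = defaultdict(list)
    for rec in records:
        by_source[rec["source"]].append(rec)

    rng = random.Random(seed)  # seeded so the sample is reproducible
    sample = []
    for source in sorted(by_source):
        group = by_source[source]
        sample.extend(rng.sample(group, min(per_source, len(group))))
    return sample


# Toy corpus (hypothetical): 7 placeholder sources with 200 records each.
corpus = [
    {"source": f"source_{s}", "text": f"document {i} from source {s}"}
    for s in range(7)
    for i in range(200)
]
tiny = sample_per_source(corpus)
print(len(tiny))
```

Grouping before sampling keeps every source equally represented regardless of how unbalanced the full corpus is.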

  7. RedPajama-pro

    • huggingface.co
    Updated Feb 4, 2012
    Cite
    GAIR-ProX (2012). RedPajama-pro [Dataset]. https://huggingface.co/datasets/gair-prox/RedPajama-pro
    Dataset updated
    Feb 4, 2012
    Dataset authored and provided by
    GAIR-ProX
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    📚 RedPajama-pro

    RedPajama-pro is refined from RedPajama-Data-V2 using the ProX refining framework. It contains about 30B high-quality tokens, ready for general language model pre-training.

      License
    

    RedPajama-pro is based on RedPajama-Data-V2, which is made available under an apache-2.0 license; users should also abide by the CommonCrawl ToU: https://commoncrawl.org/terms-of-use/. We do not alter the license of any of the underlying data. … See the full description on the dataset page: https://huggingface.co/datasets/gair-prox/RedPajama-pro.

  8. RedPajama-Data-V2-sample-100B-filtered-shuffled-tokenized-with-token-counts-10000...

    • huggingface.co
    Updated May 31, 2024
    + more versions
    Cite
    Christopher Mohri (2024). RedPajama-Data-V2-sample-100B-filtered-shuffled-tokenized-with-token-counts-10000 [Dataset]. https://huggingface.co/datasets/xmohri/RedPajama-Data-V2-sample-100B-filtered-shuffled-tokenized-with-token-counts-10000
    Dataset updated
    May 31, 2024
    Authors
    Christopher Mohri
    Description

    xmohri/RedPajama-Data-V2-sample-100B-filtered-shuffled-tokenized-with-token-counts-10000 dataset hosted on Hugging Face and contributed by the HF Datasets community

  9. RedPajama-Data-1K-Sample-For-Test

    • huggingface.co
    Cite
    Mila Rvanova, RedPajama-Data-1K-Sample-For-Test [Dataset]. https://huggingface.co/datasets/rvanova/RedPajama-Data-1K-Sample-For-Test
    Authors
    Mila Rvanova
    Description

    rvanova/RedPajama-Data-1K-Sample-For-Test dataset hosted on Hugging Face and contributed by the HF Datasets community

  10. Dynamic-Topic-RedPajama-Data-1T-100k-SubSample-max-1k-tokens

    • huggingface.co
    Updated Jul 18, 2025
    Cite
    Aman Priyanshu (2025). Dynamic-Topic-RedPajama-Data-1T-100k-SubSample-max-1k-tokens [Dataset]. https://huggingface.co/datasets/AmanPriyanshu/Dynamic-Topic-RedPajama-Data-1T-100k-SubSample-max-1k-tokens
    Dataset updated
    Jul 18, 2025
    Authors
    Aman Priyanshu
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Dynamic Topic Modeling Dataset: RedPajama-1T SubSample (100k samples, 1k tokens)

    ๐Ÿ“Check out the Blog Post

    This dataset represents a curated subset of the RedPajama-1T Sample dataset, specifically processed for dynamic topic modeling applications. It contains 100,000 samples from the original dataset, with each document limited to the first 1,024 tokens for consistent processing.

      Dataset Overview
    

    Name: … See the full description on the dataset page: https://huggingface.co/datasets/AmanPriyanshu/Dynamic-Topic-RedPajama-Data-1T-100k-SubSample-max-1k-tokens.
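
Limiting each document to its first 1,024 tokens, as this description states, can be sketched as follows. The description does not say which tokenizer was used, so whitespace tokens stand in here as a loudly labeled assumption, and the function name is hypothetical.

```python
def truncate_tokens(text, max_tokens=1024):
    """Keep only the first `max_tokens` whitespace-delimited tokens of `text`.

    Assumption: whitespace splitting stands in for the real (unspecified) tokenizer.
    Note that rejoining with single spaces normalizes the original whitespace.
    """
    tokens = text.split()
    return " ".join(tokens[:max_tokens])


# Toy documents (hypothetical): one longer than the limit, one shorter.
long_doc = " ".join(f"w{i}" for i in range(2000))
short_doc = "a short document"

truncated = truncate_tokens(long_doc)
print(len(truncated.split()), len(truncate_tokens(short_doc).split()))
```

With a subword tokenizer the same idea applies to token IDs rather than words, so the cut point would generally fall at a different place in the text.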

  11. redpajama-arxiv-refined-by-data-juicer

    • huggingface.co
    Updated Oct 24, 2023
    Cite
    Data-Juicer (2023). redpajama-arxiv-refined-by-data-juicer [Dataset]. https://huggingface.co/datasets/datajuicer/redpajama-arxiv-refined-by-data-juicer
    Dataset updated
    Oct 24, 2023
    Dataset authored and provided by
    Data-Juicer
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    RedPajama -- ArXiv (refined by Data-Juicer)

    A refined version of ArXiv dataset in RedPajama by Data-Juicer. Removing some "bad" samples from the original dataset to make it higher-quality. This dataset is usually used to pretrain a Large Language Model. Notice: Here is a small subset for previewing. The whole dataset is available here (About 85GB).

      Dataset Information
    

    Number of samples: 1,655,259 (Keep ~95.99% from the original dataset)

      Refining Recipe

    … See the full description on the dataset page: https://huggingface.co/datasets/datajuicer/redpajama-arxiv-refined-by-data-juicer.
    
  12. redpajama-data-1b-tokenized-olmo-1b

    • huggingface.co
    Updated Mar 30, 2025
    Cite
    Sabri Boughorbel (2025). redpajama-data-1b-tokenized-olmo-1b [Dataset]. https://huggingface.co/datasets/sboughorbel/redpajama-data-1b-tokenized-olmo-1b
    Dataset updated
    Mar 30, 2025
    Authors
    Sabri Boughorbel
    Description

    sboughorbel/redpajama-data-1b-tokenized-olmo-1b dataset hosted on Hugging Face and contributed by the HF Datasets community

  13. SlimPajama-627B

    • huggingface.co
    • opendatalab.com
    Updated Oct 2, 2012
    Cite
    Cerebras (2012). SlimPajama-627B [Dataset]. https://huggingface.co/datasets/cerebras/SlimPajama-627B
    Dataset updated
    Oct 2, 2012
    Dataset authored and provided by
    Cerebras
    Description

    The dataset consists of 59166 jsonl files and is ~895GB compressed. It is a cleaned and deduplicated version of Together's RedPajama. Check out our blog post explaining our methods, our code on GitHub, and join the discussion on the Cerebras Discord.

      Getting Started
    

    You can download the dataset using Hugging Face datasets:

        from datasets import load_dataset
        ds = load_dataset("cerebras/SlimPajama-627B")

      Background
    

    Today we are releasing SlimPajama – the largest… See the full description on the dataset page: https://huggingface.co/datasets/cerebras/SlimPajama-627B.
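
Since the description says the dataset is distributed as jsonl files, a minimal sketch of reading such a file one record at a time may be useful. The in-memory shard and its `text` and `meta` field names are assumptions for illustration, not taken from the dataset card excerpt above.

```python
import io
import json


def iter_jsonl(lines):
    """Yield one parsed record per non-empty line of a JSONL stream."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical two-record shard held in memory; field names are assumptions.
shard = io.StringIO(
    '{"text": "first document", "meta": {"source": "example_a"}}\n'
    '\n'
    '{"text": "second document", "meta": {"source": "example_b"}}\n'
)
records = list(iter_jsonl(shard))
print(len(records), records[0]["text"])
```

Because the function is a generator, a real shard opened with `open(path, encoding="utf-8")` could be processed without loading the whole file into memory.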

  14. redpajama-c4-refined-by-data-juicer

    • huggingface.co
    Updated Apr 12, 2017
    Cite
    Data-Juicer (2017). redpajama-c4-refined-by-data-juicer [Dataset]. https://huggingface.co/datasets/datajuicer/redpajama-c4-refined-by-data-juicer
    Dataset updated
    Apr 12, 2017
    Dataset authored and provided by
    Data-Juicer
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    RedPajama -- C4 (refined by Data-Juicer)

    A refined version of C4 dataset in RedPajama by Data-Juicer. Removing some "bad" samples from the original dataset to make it higher-quality. This dataset is usually used to pretrain a Large Language Model. Notice: Here is a small subset for previewing. The whole dataset is available here (About 832GB).

      Dataset Information
    

    Number of samples: 344,491,171 (Keep ~94.42% from the original dataset)

      Refining Recipe
    

    … See the full description on the dataset page: https://huggingface.co/datasets/datajuicer/redpajama-c4-refined-by-data-juicer.

  15. RedPajama-combined-15B-8k-llama

    • huggingface.co
    Updated Jul 17, 2024
    Cite
    Kichang Yang (2024). RedPajama-combined-15B-8k-llama [Dataset]. https://huggingface.co/datasets/jason9693/RedPajama-combined-15B-8k-llama
    Dataset updated
    Jul 17, 2024
    Authors
    Kichang Yang
    Description

    jason9693/RedPajama-combined-15B-8k-llama dataset hosted on Hugging Face and contributed by the HF Datasets community

  16. redpajama-cc-2023-06-refined-by-data-juicer

    • huggingface.co
    Cite
    Data-Juicer, redpajama-cc-2023-06-refined-by-data-juicer [Dataset]. https://huggingface.co/datasets/datajuicer/redpajama-cc-2023-06-refined-by-data-juicer
    Dataset authored and provided by
    Data-Juicer
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    RedPajama -- CommonCrawl-2023-06 (refined by Data-Juicer)

    A refined version of CommonCrawl-2023-06 dataset in RedPajama by Data-Juicer. Removing some "bad" samples from the original dataset to make it higher-quality. This dataset is usually used to pretrain a Large Language Model. Notice: Here is a small subset for previewing. The whole dataset is available here (About 310GB).

      Dataset Information
    

    Number of samples: 50,643,699 (Keep ~45.46% from the original dataset) … See the full description on the dataset page: https://huggingface.co/datasets/datajuicer/redpajama-cc-2023-06-refined-by-data-juicer.

  17. redpajama-cc-2019-30-refined-by-data-juicer

    • huggingface.co
    Updated May 28, 2019
    Cite
    Data-Juicer (2019). redpajama-cc-2019-30-refined-by-data-juicer [Dataset]. https://huggingface.co/datasets/datajuicer/redpajama-cc-2019-30-refined-by-data-juicer
    Dataset updated
    May 28, 2019
    Dataset authored and provided by
    Data-Juicer
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    RedPajama -- CommonCrawl-2019-30 (refined by Data-Juicer)

    A refined version of CommonCrawl-2019-30 dataset in RedPajama by Data-Juicer. Removing some "bad" samples from the original dataset to make it higher-quality. This dataset is usually used to pretrain a Large Language Model. Notice: Here is a small subset for previewing. The whole dataset is available here (About 240GB).

      Dataset Information
    

    Number of samples: 36,557,283 (Keep ~45.08% from the original dataset) … See the full description on the dataset page: https://huggingface.co/datasets/datajuicer/redpajama-cc-2019-30-refined-by-data-juicer.

  18. redpajama-cc-2021-04-refined-by-data-juicer

    • huggingface.co
    Cite
    Data-Juicer, redpajama-cc-2021-04-refined-by-data-juicer [Dataset]. https://huggingface.co/datasets/datajuicer/redpajama-cc-2021-04-refined-by-data-juicer
    Dataset authored and provided by
    Data-Juicer
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    RedPajama -- CommonCrawl-2021-04 (refined by Data-Juicer)

    A refined version of CommonCrawl-2021-04 dataset in RedPajama by Data-Juicer. Removing some "bad" samples from the original dataset to make it higher-quality. This dataset is usually used to pretrain a Large Language Model. Notice: Here is a small subset for previewing. The whole dataset is available here (About 284GB).

      Dataset Information
    

    Number of samples: 44,724,752 (Keep ~45.23% from the original dataset) … See the full description on the dataset page: https://huggingface.co/datasets/datajuicer/redpajama-cc-2021-04-refined-by-data-juicer.

  19. redpajama-cc-2022-05-refined-by-data-juicer

    • huggingface.co
    Updated Jun 1, 2022
    Cite
    Data-Juicer (2022). redpajama-cc-2022-05-refined-by-data-juicer [Dataset]. https://huggingface.co/datasets/datajuicer/redpajama-cc-2022-05-refined-by-data-juicer
    Dataset updated
    Jun 1, 2022
    Dataset authored and provided by
    Data-Juicer
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    RedPajama -- CommonCrawl-2022-05 (refined by Data-Juicer)

    A refined version of CommonCrawl-2022-05 dataset in RedPajama by Data-Juicer. Removing some "bad" samples from the original dataset to make it higher-quality. This dataset is usually used to pretrain a Large Language Model. Notice: Here is a small subset for previewing. The whole dataset is available here (About 265GB).

      Dataset Information
    

    Number of samples: 42,648,496 (Keep ~45.34% from the original dataset) … See the full description on the dataset page: https://huggingface.co/datasets/datajuicer/redpajama-cc-2022-05-refined-by-data-juicer.

  20. RedPajama-combined-15B-8k-llama

    • huggingface.co
    Updated Feb 23, 2024
    Cite
    Alexander Wettig (2024). RedPajama-combined-15B-8k-llama [Dataset]. https://huggingface.co/datasets/awettig/RedPajama-combined-15B-8k-llama
    Dataset updated
    Feb 23, 2024
    Authors
    Alexander Wettig
    Description

    Dataset Card for "RedPajama-combined-15B-8K-llama"

    More Information needed

Cite
Together, RedPajama-Data-1T [Dataset]. https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T

RedPajama-Data-1T

Red Pajama 1T

togethercomputer/RedPajama-Data-1T

64 scholarly articles cite this dataset (View in Google Scholar)
Dataset authored and provided by
Together
Description

RedPajama is a clean-room, fully open-source implementation of the LLaMa dataset.
