39 datasets found
  1. RedPajama-Data-1T

    • huggingface.co
    • opendatalab.com
    Cite
    Together, RedPajama-Data-1T [Dataset]. https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset authored and provided by
    Together
    Description

    RedPajama is a clean-room, fully open-source implementation of the LLaMa dataset.

  2. RedPajama-Data-V2

    • huggingface.co
    Updated Aug 20, 2014
    Cite
    Together (2023). RedPajama-Data-V2 [Dataset]. https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2
    Dataset updated
    Aug 20, 2014
    Dataset authored and provided by
    Together
    Description

    RedPajama V2: an Open Dataset for Training Large Language Models

  3. RedPajama-Data-1T-Sampled10File1024Row-Filtered4096Words-ForBenchmark

    • huggingface.co
    Updated Apr 22, 2024
    Cite
    Jiapei Huang (2024). RedPajama-Data-1T-Sampled10File1024Row-Filtered4096Words-ForBenchmark [Dataset]. https://huggingface.co/datasets/hjp709394/RedPajama-Data-1T-Sampled10File1024Row-Filtered4096Words-ForBenchmark
    Dataset updated
    Apr 22, 2024
    Authors
    Jiapei Huang
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    RedPajama is a clean-room, fully open-source implementation of the LLaMa dataset. This is a 1B-token sample of the full dataset.

  4. RedPajama-Data-Instruct

    • huggingface.co
    Updated Oct 15, 2004
    Cite
    Together (2004). RedPajama-Data-Instruct [Dataset]. https://huggingface.co/datasets/togethercomputer/RedPajama-Data-Instruct
    Dataset updated
    Oct 15, 2004
    Dataset authored and provided by
    Together
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Dataset Summary

    RedPajama-Instruct-Data is curated from a diverse collection of NLP tasks from both P3 (BigScience) and Natural Instruction (AI2), with aggressive decontamination against HELM conducted in two steps: (1) We first conduct a semantic search using each validation example in HELM as the query, retrieve the top-100 similar instances from the Instruct data set, and check for tasks that have any returned instances overlapping (using 10-gram) with the validation example. We remove the… See the full description on the dataset page: https://huggingface.co/datasets/togethercomputer/RedPajama-Data-Instruct.
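
    The 10-gram overlap check mentioned in the description above can be sketched roughly as follows. This is only an illustration, not the authors' code: the function names, the lowercase whitespace tokenization, and the default n=10 are assumptions, and the semantic-search step and the (truncated) removal rule are not covered.

```python
def ngrams(tokens, n):
    """Return the set of all contiguous n-token windows in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def has_ngram_overlap(candidate, reference, n=10):
    """True if the two texts share at least one n-gram.

    Tokenization here is a plain lowercase whitespace split, which is an
    assumption made for illustration only.
    """
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    return not cand.isdisjoint(ref)


# Hypothetical example texts, not taken from HELM or the Instruct data.
validation_example = "the quick brown fox jumps over the lazy dog near the river bank"
instruct_instance = "Note that the quick brown fox jumps over the lazy dog near town"
overlap = has_ngram_overlap(instruct_instance, validation_example)
```

    Texts shorter than n tokens produce no n-grams, so they never count as overlapping in this sketch.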

  5. RedPajama-Tiny

    • huggingface.co
    • opendatalab.com
    Cite
    Ivan Zhou, RedPajama-Tiny [Dataset]. https://huggingface.co/datasets/ivanzhouyq/RedPajama-Tiny
    Authors
    Ivan Zhou
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Dataset Card for Dataset Name

    Dataset Summary

    This is a tiny version of the RedPajama dataset. It contains 64 samples from each of the 7 sources. This dataset is intended for developing and testing data/training pipelines for loading the full RedPajama dataset or any general HuggingFace dataset. It is very fast to download and easy to examine. You should not use it for training a full model, but you can use it for overfitting tests or any other sanity checks. … See the full description on the dataset page: https://huggingface.co/datasets/ivanzhouyq/RedPajama-Tiny.

  6. Alex J. Chan, Hao Sun, Samuel Holt, Mihaela van der Schaar (2024). Dataset:...

    • service.tib.eu
    Updated Dec 16, 2024
    Cite
    Alex J. Chan, Hao Sun, Samuel Holt, Mihaela van der Schaar (2024). Dataset: RedPajama Dataset. https://doi.org/10.57702/md1edyjw [Dataset]. https://service.tib.eu/ldmservice/dataset/redpajama-dataset
    Dataset updated
    Dec 16, 2024
    Description

    The RedPajama dataset is used for a single-turn dialogue task.

  7. RedPajama-pro

    • huggingface.co
    Updated Feb 4, 2012
    Cite
    GAIR-ProX (2012). RedPajama-pro [Dataset]. https://huggingface.co/datasets/gair-prox/RedPajama-pro
    Dataset updated
    Feb 4, 2012
    Dataset authored and provided by
    GAIR-ProX
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    📚 RedPajama-pro

    ArXiv | Models | Code

    RedPajama-pro is refined from RedPajama-Data-V2 using the ProX refining framework. It contains about 30B high quality tokens, ready for general language model pre-training.

    License

    RedPajama-pro is based on RedPajama-Data-V2, which is made available under an apache-2.0 license; users should also abide by the CommonCrawl ToU: https://commoncrawl.org/terms-of-use/. We do not alter the license of any of the underlying data. … See the full description on the dataset page: https://huggingface.co/datasets/gair-prox/RedPajama-pro.

  8. RedPajama-Data-V2-sample-100B-filtered-shuffled-tokenized-with-token-counts-1000000en...

    • huggingface.co
    Updated May 31, 2024
    Cite
    Christopher Mohri (2024). RedPajama-Data-V2-sample-100B-filtered-shuffled-tokenized-with-token-counts-1000000en [Dataset]. https://huggingface.co/datasets/xmohri/RedPajama-Data-V2-sample-100B-filtered-shuffled-tokenized-with-token-counts-1000000en
    Dataset updated
    May 31, 2024
    Authors
    Christopher Mohri
    Description

    xmohri/RedPajama-Data-V2-sample-100B-filtered-shuffled-tokenized-with-token-counts-1000000en dataset hosted on Hugging Face and contributed by the HF Datasets community

  9. redpajama-arxiv-refined-by-data-juicer

    • huggingface.co
    Updated Oct 24, 2023
    Cite
    Data-Juicer (2023). redpajama-arxiv-refined-by-data-juicer [Dataset]. https://huggingface.co/datasets/datajuicer/redpajama-arxiv-refined-by-data-juicer
    Dataset updated
    Oct 24, 2023
    Dataset authored and provided by
    Data-Juicer
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    RedPajama -- ArXiv (refined by Data-Juicer)

    A refined version of the ArXiv dataset in RedPajama, produced by Data-Juicer. Some "bad" samples were removed from the original dataset to improve its quality. This dataset is typically used to pretrain a large language model. Note: this is a small subset for previewing. The whole dataset is available here (about 85GB).

    Dataset Information

    Number of samples: 1,655,259 (keeps ~95.99% of the original dataset)

    Refining Recipe… See the full description on the dataset page: https://huggingface.co/datasets/datajuicer/redpajama-arxiv-refined-by-data-juicer.
    
  10. Dynamic-Topic-RedPajama-Data-1T-100k-SubSample-max-1k-tokens

    • huggingface.co
    Updated Jan 4, 2025
    Cite
    Aman Priyanshu (2025). Dynamic-Topic-RedPajama-Data-1T-100k-SubSample-max-1k-tokens [Dataset]. https://huggingface.co/datasets/AmanPriyanshu/Dynamic-Topic-RedPajama-Data-1T-100k-SubSample-max-1k-tokens
    Dataset updated
    Jan 4, 2025
    Authors
    Aman Priyanshu
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Dynamic Topic Modeling Dataset: RedPajama-1T SubSample (100k samples, 1k tokens)

    Check out the Blog Post

    This dataset represents a curated subset of the RedPajama-1T Sample dataset, specifically processed for dynamic topic modeling applications. It contains 100,000 samples from the original dataset, with each document limited to the first 1,024 tokens for consistent processing.

    Dataset Overview

    Name:… See the full description on the dataset page: https://huggingface.co/datasets/AmanPriyanshu/Dynamic-Topic-RedPajama-Data-1T-100k-SubSample-max-1k-tokens.

  11. RedPajama-combined-15B-8k-llama

    • huggingface.co
    Updated Apr 18, 2024
    Cite
    Alexander Wettig (2024). RedPajama-combined-15B-8k-llama [Dataset]. https://huggingface.co/datasets/awettig/RedPajama-combined-15B-8k-llama
    Dataset updated
    Apr 18, 2024
    Authors
    Alexander Wettig
    Description

    Dataset Card for "RedPajama-combined-15B-8K-llama"

    More Information needed

  12. RedPajama-combined-15B-8k-llama

    • huggingface.co
    Updated Jul 17, 2024
    Cite
    Kichang Yang (2024). RedPajama-combined-15B-8k-llama [Dataset]. https://huggingface.co/datasets/jason9693/RedPajama-combined-15B-8k-llama
    Dataset updated
    Jul 17, 2024
    Authors
    Kichang Yang
    Description

    jason9693/RedPajama-combined-15B-8k-llama dataset hosted on Hugging Face and contributed by the HF Datasets community

  13. redpajama-c4-refined-by-data-juicer

    • huggingface.co
    Updated Apr 12, 2017
    Cite
    Data-Juicer (2017). redpajama-c4-refined-by-data-juicer [Dataset]. https://huggingface.co/datasets/datajuicer/redpajama-c4-refined-by-data-juicer
    Dataset updated
    Apr 12, 2017
    Dataset authored and provided by
    Data-Juicer
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    RedPajama -- C4 (refined by Data-Juicer)

    A refined version of the C4 dataset in RedPajama, produced by Data-Juicer. Some "bad" samples were removed from the original dataset to improve its quality. This dataset is typically used to pretrain a large language model. Note: this is a small subset for previewing. The whole dataset is available here (about 832GB).

    Dataset Information

    Number of samples: 344,491,171 (keeps ~94.42% of the original dataset)

    Refining Recipe

    … See the full description on the dataset page: https://huggingface.co/datasets/datajuicer/redpajama-c4-refined-by-data-juicer.

  14. hh-rlhf-RedPajama-Chat-Format

    • huggingface.co
    Updated Jun 19, 2023
    Cite
    Fredi (2023). hh-rlhf-RedPajama-Chat-Format [Dataset]. https://huggingface.co/datasets/Fredithefish/hh-rlhf-RedPajama-Chat-Format
    Dataset updated
    Jun 19, 2023
    Authors
    Fredi
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Fredithefish/hh-rlhf-RedPajama-Chat-Format dataset hosted on Hugging Face and contributed by the HF Datasets community

  15. redpajama-data-1b-tokenized-olmo-1b

    • huggingface.co
    Updated Mar 30, 2025
    Cite
    Sabri Boughorbel (2025). redpajama-data-1b-tokenized-olmo-1b [Dataset]. https://huggingface.co/datasets/sboughorbel/redpajama-data-1b-tokenized-olmo-1b
    Dataset updated
    Mar 30, 2025
    Authors
    Sabri Boughorbel
    Description

    sboughorbel/redpajama-data-1b-tokenized-olmo-1b dataset hosted on Hugging Face and contributed by the HF Datasets community

  16. redpajama-data-1t_urls

    • huggingface.co
    Updated May 15, 2025
    Cite
    Nick Hagar (2025). redpajama-data-1t_urls [Dataset]. http://doi.org/10.57967/hf/5502
    Dataset updated
    May 15, 2025
    Authors
    Nick Hagar
    Description

    Dataset Card for redpajama-data-1t_urls

    This dataset provides the URLs and top-level domains associated with training records in togethercomputer/RedPajama-Data-1T. It is part of a collection of datasets curated to make exploring LLM training datasets more straightforward and accessible.

    Dataset Details

    Dataset Description

    This dataset was created by downloading the source data, extracting URLs and top-level domains, and retaining only those record… See the full description on the dataset page: https://huggingface.co/datasets/nhagar/redpajama-data-1t_urls.

  17. redpajama-cc-2021-04-refined-by-data-juicer

    • huggingface.co
    Cite
    Data-Juicer, redpajama-cc-2021-04-refined-by-data-juicer [Dataset]. https://huggingface.co/datasets/datajuicer/redpajama-cc-2021-04-refined-by-data-juicer
    Dataset authored and provided by
    Data-Juicer
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    RedPajama -- CommonCrawl-2021-04 (refined by Data-Juicer)

    A refined version of the CommonCrawl-2021-04 dataset in RedPajama, produced by Data-Juicer. Some "bad" samples were removed from the original dataset to improve its quality. This dataset is typically used to pretrain a large language model. Note: this is a small subset for previewing. The whole dataset is available here (about 284GB).

    Dataset Information

    Number of samples: 44,724,752 (keeps ~45.23% of the original dataset)… See the full description on the dataset page: https://huggingface.co/datasets/datajuicer/redpajama-cc-2021-04-refined-by-data-juicer.

  18. redpajama-cc-2022-05-refined-by-data-juicer

    • huggingface.co
    Updated Jun 1, 2022
    Cite
    Data-Juicer (2022). redpajama-cc-2022-05-refined-by-data-juicer [Dataset]. https://huggingface.co/datasets/datajuicer/redpajama-cc-2022-05-refined-by-data-juicer
    Dataset updated
    Jun 1, 2022
    Dataset authored and provided by
    Data-Juicer
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    RedPajama -- CommonCrawl-2022-05 (refined by Data-Juicer)

    A refined version of the CommonCrawl-2022-05 dataset in RedPajama, produced by Data-Juicer. Some "bad" samples were removed from the original dataset to improve its quality. This dataset is typically used to pretrain a large language model. Note: this is a small subset for previewing. The whole dataset is available here (about 265GB).

    Dataset Information

    Number of samples: 42,648,496 (keeps ~45.34% of the original dataset)… See the full description on the dataset page: https://huggingface.co/datasets/datajuicer/redpajama-cc-2022-05-refined-by-data-juicer.

  19. redpajama-cc-2019-30-refined-by-data-juicer

    • huggingface.co
    Updated May 28, 2019
    Cite
    Data-Juicer (2019). redpajama-cc-2019-30-refined-by-data-juicer [Dataset]. https://huggingface.co/datasets/datajuicer/redpajama-cc-2019-30-refined-by-data-juicer
    Dataset updated
    May 28, 2019
    Dataset authored and provided by
    Data-Juicer
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    RedPajama -- CommonCrawl-2019-30 (refined by Data-Juicer)

    A refined version of the CommonCrawl-2019-30 dataset in RedPajama, produced by Data-Juicer. Some "bad" samples were removed from the original dataset to improve its quality. This dataset is typically used to pretrain a large language model. Note: this is a small subset for previewing. The whole dataset is available here (about 240GB).

    Dataset Information

    Number of samples: 36,557,283 (keeps ~45.08% of the original dataset)… See the full description on the dataset page: https://huggingface.co/datasets/datajuicer/redpajama-cc-2019-30-refined-by-data-juicer.

  20. ShareGPT-Unfiltered-RedPajama-Chat-format

    • huggingface.co
    Updated Jun 6, 2023
    Cite
    Fredi (2023). ShareGPT-Unfiltered-RedPajama-Chat-format [Dataset]. https://huggingface.co/datasets/Fredithefish/ShareGPT-Unfiltered-RedPajama-Chat-format
    Dataset updated
    Jun 6, 2023
    Authors
    Fredi
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    ShareGPT unfiltered dataset in RedPajama-Chat format

    This dataset was created by converting the alpaca-lora formatted ShareGPT dataset to the format required by RedPajama-Chat. This script was used for the conversion: https://github.com/fredi-python/Alpaca2INCITE-Dataset-Converter/blob/main/convert.py WARNING: Only the first human and gpt text of each conversation from the original dataset is included in the dataset.

    The format

    {"text": "

Cite
Together, RedPajama-Data-1T [Dataset]. https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T

RedPajama-Data-1T

Red Pajama 1T

togethercomputer/RedPajama-Data-1T

59 scholarly articles cite this dataset (View in Google Scholar)
Dataset authored and provided by
Together
Description

RedPajama is a clean-room, fully open-source implementation of the LLaMa dataset.
