15 datasets found
  1. h

    SlimPajama-627B

    • huggingface.co
    • opendatalab.com
    Updated Oct 2, 2012
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Cerebras (2012). SlimPajama-627B [Dataset]. https://huggingface.co/datasets/cerebras/SlimPajama-627B
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Oct 2, 2012
    Dataset authored and provided by
    Cerebras
    Description

    The dataset consists of 59166 jsonl files and is ~895GB compressed. It is a cleaned and deduplicated version of Together's RedPajama. Check out our blog post explaining our methods, our code on GitHub, and join the discussion on the Cerebras Discord.

      Getting Started
    

    You can download the dataset using Hugging Face datasets: from datasets import load_dataset ds = load_dataset("cerebras/SlimPajama-627B")

      Background
    

    Today we are releasing SlimPajama – the largest… See the full description on the dataset page: https://huggingface.co/datasets/cerebras/SlimPajama-627B.

  2. h

    SlimPajama-627B_Reupload

    • huggingface.co
    Updated Apr 23, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Gabriel Mongaras (2025). SlimPajama-627B_Reupload [Dataset]. https://huggingface.co/datasets/gmongaras/SlimPajama-627B_Reupload
    Explore at:
    Dataset updated
    Apr 23, 2025
    Authors
    Gabriel Mongaras
    Description

    As datasets puts limits on the number of calls to huggingface, downloading SlimPajama-627B is problematic as it's composed of a ton of small files. I have reuploaded it here as larger chunks to easily download the dataset without having to do anything hacky. The original dataset can be found here https://huggingface.co/datasets/cerebras/SlimPajama-627B

  3. h

    SlimPajama-1M-rows

    • huggingface.co
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    arthur, SlimPajama-1M-rows [Dataset]. https://huggingface.co/datasets/styalai/SlimPajama-1M-rows
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Authors
    arthur
    Description

    styalai/SlimPajama-1M-rows dataset hosted on Hugging Face and contributed by the HF Datasets community

  4. SlimPajama-chunk-6

    • kaggle.com
    zip
    Updated Dec 26, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Maxim Podorov (2023). SlimPajama-chunk-6 [Dataset]. https://www.kaggle.com/datasets/vasilypodorov/slimpajama-chunk-6
    Explore at:
    zip(0 bytes)Available download formats
    Dataset updated
    Dec 26, 2023
    Authors
    Maxim Podorov
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset

    This dataset was created by Maxim Podorov

    Released under Apache 2.0

    Contents

  5. h

    Llama-3.1-8B-slimpajama-openthoughts-tokenized

    • huggingface.co
    Updated Apr 29, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Andrey Galichin (2025). Llama-3.1-8B-slimpajama-openthoughts-tokenized [Dataset]. https://huggingface.co/datasets/andreuka18/Llama-3.1-8B-slimpajama-openthoughts-tokenized
    Explore at:
    Dataset updated
    Apr 29, 2025
    Authors
    Andrey Galichin
    Description

    andreuka18/Llama-3.1-8B-slimpajama-openthoughts-tokenized dataset hosted on Hugging Face and contributed by the HF Datasets community

  6. h

    DeepSeek-R1-Distill-Llama-8B-SlimPajama-1B-tokenized

    • huggingface.co
    Updated Apr 26, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Andrey Galichin (2025). DeepSeek-R1-Distill-Llama-8B-SlimPajama-1B-tokenized [Dataset]. https://huggingface.co/datasets/andreuka18/DeepSeek-R1-Distill-Llama-8B-SlimPajama-1B-tokenized
    Explore at:
    Dataset updated
    Apr 26, 2025
    Authors
    Andrey Galichin
    Description

    andreuka18/DeepSeek-R1-Distill-Llama-8B-SlimPajama-1B-tokenized dataset hosted on Hugging Face and contributed by the HF Datasets community

  7. h

    llama-3-8b-SlimPajama-6B-tokenized

    • huggingface.co
    Updated May 11, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Pashocles (2025). llama-3-8b-SlimPajama-6B-tokenized [Dataset]. https://huggingface.co/datasets/pashocles/llama-3-8b-SlimPajama-6B-tokenized
    Explore at:
    Dataset updated
    May 11, 2025
    Authors
    Pashocles
    Description

    pashocles/llama-3-8b-SlimPajama-6B-tokenized dataset hosted on Hugging Face and contributed by the HF Datasets community

  8. slimpajama_df_fold0

    • kaggle.com
    Updated Feb 15, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    zhangyier (2024). slimpajama_df_fold0 [Dataset]. https://www.kaggle.com/datasets/zysuddenly/slimpajama-df-fold0
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Feb 15, 2024
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    zhangyier
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset

    This dataset was created by zhangyier

    Released under MIT

    Contents

  9. h

    DeepSeek-R1-Distill-Llama-8B-slimpajama-openthoughts-tokenized

    • huggingface.co
    Updated Apr 26, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Andrey Galichin (2025). DeepSeek-R1-Distill-Llama-8B-slimpajama-openthoughts-tokenized [Dataset]. https://huggingface.co/datasets/andreuka18/DeepSeek-R1-Distill-Llama-8B-slimpajama-openthoughts-tokenized
    Explore at:
    Dataset updated
    Apr 26, 2025
    Authors
    Andrey Galichin
    Description

    andreuka18/DeepSeek-R1-Distill-Llama-8B-slimpajama-openthoughts-tokenized dataset hosted on Hugging Face and contributed by the HF Datasets community

  10. h

    SlimPajama-chunk1

    • huggingface.co
    Updated Oct 2, 2012
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    jhchen (2012). SlimPajama-chunk1 [Dataset]. https://huggingface.co/datasets/UltraRonin/SlimPajama-chunk1
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Oct 2, 2012
    Authors
    jhchen
    Description

    Chunk1 train split of cerebras/SlimPajama-627B.

  11. h

    SlimPajama-100M

    • huggingface.co
    Updated Jun 21, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Ankur Kumar (2024). SlimPajama-100M [Dataset]. https://huggingface.co/datasets/iankur/SlimPajama-100M
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jun 21, 2024
    Authors
    Ankur Kumar
    Description

    iankur/SlimPajama-100M dataset hosted on Hugging Face and contributed by the HF Datasets community

  12. h

    slimpajama_llama_tokenized_upsample_4096_chunk_1M

    • huggingface.co
    Updated Apr 19, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Zhang Peiyuan (2024). slimpajama_llama_tokenized_upsample_4096_chunk_1M [Dataset]. https://huggingface.co/datasets/PY007/slimpajama_llama_tokenized_upsample_4096_chunk_1M
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Apr 19, 2024
    Authors
    Zhang Peiyuan
    Description

    Generated using https://github.com/FranxYao/Long-Context-Data-Engineering with the below command: mkdir logs mkdir data mkdir data/slimpajama mkdir data/slimpajama/per_source_downsample cd data_engineering

    PATH_TO_SLIMPAJAMA=rokset3/slim_pajama_chunk_1 nohup python -u slimpajama_packing.py
    --dataset_size=5b
    --print_interval=100 --num_process=200
    --chunk_size=1000001
    --dataset_path=$PATH_TO_SLIMPAJAMA
    --output_path=../data/slimpajama/per_source_downsample/… See the full description on the dataset page: https://huggingface.co/datasets/PY007/slimpajama_llama_tokenized_upsample_4096_chunk_1M.

  13. h

    DeepSeek-R1-Distill-Llama-8B-max-activation-SAE-cache-L7

    • huggingface.co
    Updated May 11, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Koyena Pal (2025). DeepSeek-R1-Distill-Llama-8B-max-activation-SAE-cache-L7 [Dataset]. https://huggingface.co/datasets/koyena/DeepSeek-R1-Distill-Llama-8B-max-activation-SAE-cache-L7
    Explore at:
    Dataset updated
    May 11, 2025
    Authors
    Koyena Pal
    Description
  14. h

    PJ-Masks-630B

    • huggingface.co
    Updated Nov 7, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Huang Liang Hsun (2024). PJ-Masks-630B [Dataset]. https://huggingface.co/datasets/lianghsun/PJ-Masks-630B
    Explore at:
    Dataset updated
    Nov 7, 2024
    Authors
    Huang Liang Hsun
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Dataset Card for lianghsun/PJ-Masks-630B

    本資料截取 cerebras/SlimPajama-627B 、 albertvillanova/legal_contracts 和 intfloat/multilingual_cc_news 為主要資料來源,並只留下 token 長度小於 4096 的樣本,這裡採用 meta-llama/Llama-3.2-3B 的 tokenizer。 (WIP)

      Dataset Details
    
    
    
    
    
    
    
      Dataset Description
    

    Curated by: [More Information Needed] Funded by [optional]: [More Information Needed] Shared by [optional]: [More Information Needed] Language(s) (NLP): [More Information Needed] License: [More… See the full description on the dataset page: https://huggingface.co/datasets/lianghsun/PJ-Masks-630B.

  15. h

    uninstruct-v1-experimental-chatml

    • huggingface.co
    Updated Jun 15, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Adam (2024). uninstruct-v1-experimental-chatml [Dataset]. https://huggingface.co/datasets/adamo1139/uninstruct-v1-experimental-chatml
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jun 15, 2024
    Authors
    Adam
    Description

    Subset of SlimPajama-6B where tokens associated with chatml prompt format are randomly added mid-text to make model forget how to do instruct and make it behave like a completion model which is not instruction following. Base Yi 1.5 models are contaminated on synthetic SFT data, hence the need for de-contamination attempts before further finetuning, if you don't want your end model to behave like ChatGPT. Should also work with Qwen 1.5 and Qwen 2 models, they are contaminated too.

  16. Not seeing a result you expected?
    Learn how you can add new datasets to our index.

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
Cerebras (2012). SlimPajama-627B [Dataset]. https://huggingface.co/datasets/cerebras/SlimPajama-627B

SlimPajama-627B

SlimPajama-627B

cerebras/SlimPajama-627B

Explore at:
113 scholarly articles cite this dataset (View in Google Scholar)
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Oct 2, 2012
Dataset authored and provided by
Cerebras
Description

The dataset consists of 59166 jsonl files and is ~895GB compressed. It is a cleaned and deduplicated version of Together's RedPajama. Check out our blog post explaining our methods, our code on GitHub, and join the discussion on the Cerebras Discord.

  Getting Started

You can download the dataset using Hugging Face datasets: from datasets import load_dataset ds = load_dataset("cerebras/SlimPajama-627B")

  Background

Today we are releasing SlimPajama – the largest… See the full description on the dataset page: https://huggingface.co/datasets/cerebras/SlimPajama-627B.

Search
Clear search
Close search
Google apps
Main menu