100+ datasets found
  1. fineweb-edu

    • huggingface.co
    Updated Jan 3, 2025
    Cite
    FineData (2025). fineweb-edu [Dataset]. http://doi.org/10.57967/hf/2497
    Dataset updated
    Jan 3, 2025
    Dataset authored and provided by
    FineData
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    📚 FineWeb-Edu

    1.3 trillion tokens of the finest educational data the 🌐 web has to offer

    Paper: https://arxiv.org/abs/2406.17557

    What is it?

    The 📚 FineWeb-Edu dataset consists of 1.3T tokens and 5.4T tokens (FineWeb-Edu-score-2) of educational web pages filtered from the 🍷 FineWeb dataset. This is the 1.3 trillion version. To enhance FineWeb's quality, we developed an educational quality classifier using annotations generated by Llama3-70B-Instruct. We then… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu.

  2. fineweb

    • huggingface.co
    Cite
    FineData, fineweb [Dataset]. http://doi.org/10.57967/hf/2493
    Dataset authored and provided by
    FineData
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    🍷 FineWeb

    15 trillion tokens of the finest data the 🌐 web has to offer

    What is it?

    The 🍷 FineWeb dataset consists of more than 18.5T tokens (originally 15T tokens) of cleaned and deduplicated English web data from CommonCrawl. The data processing pipeline is optimized for LLM performance and ran on the 🏭 datatrove library, our large-scale data processing library. 🍷 FineWeb was originally meant to be a fully open replication of 🦅 RefinedWeb, with a release… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb.

  3. fineweb-edu

    • huggingface.co
    Updated Sep 1, 2012
    Cite
    Prime Intellect (2012). fineweb-edu [Dataset]. https://huggingface.co/datasets/PrimeIntellect/fineweb-edu
    Dataset updated
    Sep 1, 2012
    Dataset provided by
    Prime Intellect, Inc.
    Authors
    Prime Intellect
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    Pre-shuffled fineweb-edu dataset

  4. fineweb-edu-10BT-shuffled-for-gpt2

    • kaggle.com
    zip
    Updated Jul 26, 2024
    Cite
    Minh-Thien Nguyen (2024). fineweb-edu-10BT-shuffled-for-gpt2 [Dataset]. https://www.kaggle.com/datasets/minhthiennguyen/fineweb-edu-10bt-shuffled
    Dataset updated
    Jul 26, 2024
    Authors
    Minh-Thien Nguyen
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This is the shuffled version of fineweb-edu-10bt-for-gpt2. Please refer to fineweb-edu-10bt-for-gpt2 for more information about this dataset.

  5. chinese-fineweb-edu

    • huggingface.co
    Updated Apr 25, 2022
    Cite
    opencsg (2022). chinese-fineweb-edu [Dataset]. https://huggingface.co/datasets/opencsg/chinese-fineweb-edu
    Dataset updated
    Apr 25, 2022
    Dataset authored and provided by
    opencsg
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    We recommend using the improved version, Fineweb-edu-chinese-v2.1!

    Chinese Fineweb Edu Dataset [中文] [English]

    [OpenCSG Community] [👾github] [wechat] [Twitter]

    📖 Technical Report

    Chinese Fineweb Edu dataset is a meticulously constructed high-quality Chinese pre-training corpus, specifically designed for natural language processing tasks in the education domain. This dataset undergoes a rigorous selection and deduplication process, using a… See the full description on the dataset page: https://huggingface.co/datasets/opencsg/chinese-fineweb-edu.

  6. fineweb-edu-100b-shuffle

    • huggingface.co
    Updated Oct 13, 2010
    Cite
    Andrej K (2010). fineweb-edu-100b-shuffle [Dataset]. https://huggingface.co/datasets/karpathy/fineweb-edu-100b-shuffle
    Dataset updated
    Oct 13, 2010
    Authors
    Andrej K
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    karpathy/fineweb-edu-100b-shuffle dataset hosted on Hugging Face and contributed by the HF Datasets community

  7. fineweb-edu

    • huggingface.co
    Updated Jul 1, 2009
    Cite
    Zixuan Wu (2009). fineweb-edu [Dataset]. https://huggingface.co/datasets/AryaWu/fineweb-edu
    Dataset updated
    Jul 1, 2009
    Authors
    Zixuan Wu
    Description

    AryaWu/fineweb-edu dataset hosted on Hugging Face and contributed by the HF Datasets community

  8. processed-fineweb-edu

    • huggingface.co
    Cite
    Youzhi Yu, processed-fineweb-edu [Dataset]. https://huggingface.co/datasets/PursuitOfDataScience/processed-fineweb-edu
    Authors
    Youzhi Yu
    Description

    Processed FineWeb-Edu Dataset

    Dataset Name on Hugging Face: PursuitOfDataScience/processed-fineweb-edu

    Overview

    This dataset is a processed version of the FineWeb-Edu dataset, intended for language model training and NLP research. It has been tokenized and truncated according to a specified block size (i.e., 2048), preparing it for model pre-training or evaluation with transformer-based language models.

    Source Dataset

    Name: FineWeb-Edu
    Description: A… See the full description on the dataset page: https://huggingface.co/datasets/PursuitOfDataScience/processed-fineweb-edu.
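
    The "tokenized and truncated according to a specified block size" step described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the whitespace tokenizer and the function names are hypothetical stand-ins, not the tokenizer or code actually used to build the dataset; only the block size of 2048 comes from the description.

```python
from typing import List

BLOCK_SIZE = 2048  # block size quoted in the dataset description


def tokenize(text: str) -> List[str]:
    """Hypothetical stand-in tokenizer: splits on whitespace."""
    return text.split()


def truncate_to_block(tokens: List[str], block_size: int = BLOCK_SIZE) -> List[str]:
    """Keep at most block_size tokens and drop the remainder."""
    return tokens[:block_size]


if __name__ == "__main__":
    document = "word " * 5000  # toy document longer than one block
    block = truncate_to_block(tokenize(document))
    print(len(block))
```

    A real pipeline would more likely use a subword tokenizer and might pack or pad sequences; the sketch only shows the truncation to a fixed block length.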

  9. FineWeb_EDU_10B

    • kaggle.com
    zip
    Updated Sep 12, 2024
    Cite
    yousef Mohamed (2024). FineWeb_EDU_10B [Dataset]. https://www.kaggle.com/datasets/joe10mohamed/fineweb-edu-10b
    Dataset updated
    Sep 12, 2024
    Authors
    yousef Mohamed
    Description

    Dataset

    This dataset was created by yousef Mohamed

  10. fineweb-edu-10b-combined

    • huggingface.co
    Cite
    Timothy Taylor, fineweb-edu-10b-combined [Dataset]. https://huggingface.co/datasets/deatos/fineweb-edu-10b-combined
    Authors
    Timothy Taylor
    Description

    deatos/fineweb-edu-10b-combined dataset hosted on Hugging Face and contributed by the HF Datasets community

  11. fineweb-edu-2024-10-from-0M-to-1M-ko

    • huggingface.co
    Updated Nov 1, 2024
    Cite
    Suzie Oh (2024). fineweb-edu-2024-10-from-0M-to-1M-ko [Dataset]. https://huggingface.co/datasets/ohsuz/fineweb-edu-2024-10-from-0M-to-1M-ko
    Dataset updated
    Nov 1, 2024
    Authors
    Suzie Oh
    Description

    ohsuz/fineweb-edu-2024-10-from-0M-to-1M-ko dataset hosted on Hugging Face and contributed by the HF Datasets community

  12. fineweb-edu-llama3-annotations

    • huggingface.co
    Updated Jun 8, 2024
    Cite
    FineData (2024). fineweb-edu-llama3-annotations [Dataset]. https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu-llama3-annotations
    Dataset updated
    Jun 8, 2024
    Dataset authored and provided by
    FineData
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    Annotations for 📚 FineWeb-Edu classifier

    This dataset contains the annotations used for training the 📚 FineWeb-Edu educational quality classifier. We prompt Llama-3-70B-Instruct to score web pages from 🍷 FineWeb based on their educational value. Note: the dataset contains the FineWeb text sample, the prompt (using the first 1000 characters of the text sample), and the scores, but it doesn't contain the full Llama 3 generation.

  13. FineWeb-Edu-Analytic

    • huggingface.co
    Updated Aug 6, 2025
    Cite
    Logicvex AI (2025). FineWeb-Edu-Analytic [Dataset]. https://huggingface.co/datasets/MultivexAI/FineWeb-Edu-Analytic
    Dataset updated
    Aug 6, 2025
    Authors
    Logicvex AI
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    FineWeb-Edu-Analytic (v1)

    FineWeb-Edu-Analytic (v1) is an English-language dataset containing 9908 documents, intended as a resource for training language models. The dataset was generated by taking text sequences from the FineWeb-Edu dataset (CC-MAIN-2025-26 subset) to serve as a source. Each source sequence was then processed by a 48-billion parameter language model to generate a corresponding structured, analytical document. Disclaimer: This dataset is not affiliated with the… See the full description on the dataset page: https://huggingface.co/datasets/MultivexAI/FineWeb-Edu-Analytic.

  14. fineweb-edu-dedup-10b

    • huggingface.co
    Updated Mar 15, 2011
    Cite
    EleutherAI (2011). fineweb-edu-dedup-10b [Dataset]. https://huggingface.co/datasets/EleutherAI/fineweb-edu-dedup-10b
    Dataset updated
    Mar 15, 2011
    Dataset authored and provided by
    EleutherAI (https://eleuther.ai/)
    Description

    EleutherAI/fineweb-edu-dedup-10b dataset hosted on Hugging Face and contributed by the HF Datasets community

  15. fineweb-edu-ar

    • huggingface.co
    Updated Nov 10, 2024
    Cite
    KAUST Center of Excellence in Generative AI (2024). fineweb-edu-ar [Dataset]. https://huggingface.co/datasets/kaust-generative-ai/fineweb-edu-ar
    Dataset updated
    Nov 10, 2024
    Dataset authored and provided by
    KAUST Center of Excellence in Generative AI
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    FineWeb-Edu-Ar

    FineWeb-Edu-Ar is a machine-translated Arabic version of the FineWeb-Edu dataset designed to support the development of Arabic small language models (SLMs).

    Dataset Details:

    Languages: Arabic, English (paired)
    Size: 202 billion tokens
    License: CC-BY-NC-4.0
    Source: Machine-translated from the deduplicated version of Hugging Face's FineWeb-Edu dataset
    Translation model: facebook/nllb-200-distilled-600M

    Application: FineWeb-Edu-Ar is suitable for pre-training Arabic… See the full description on the dataset page: https://huggingface.co/datasets/kaust-generative-ai/fineweb-edu-ar.

  16. fineweb-edu-1M

    • huggingface.co
    Cite
    TEL-LLM, fineweb-edu-1M [Dataset]. https://huggingface.co/datasets/TEL-LLM/fineweb-edu-1M
    Dataset authored and provided by
    TEL-LLM
    Description

    TEL-LLM/fineweb-edu-1M dataset hosted on Hugging Face and contributed by the HF Datasets community

  17. fineweb-edu-2024-10-from-1M-to-2M-ko-edu

    • huggingface.co
    Updated Nov 1, 2024
    Cite
    Suzie Oh (2024). fineweb-edu-2024-10-from-1M-to-2M-ko-edu [Dataset]. https://huggingface.co/datasets/ohsuz/fineweb-edu-2024-10-from-1M-to-2M-ko-edu
    Dataset updated
    Nov 1, 2024
    Authors
    Suzie Oh
    Description

    ohsuz/fineweb-edu-2024-10-from-1M-to-2M-ko-edu dataset hosted on Hugging Face and contributed by the HF Datasets community

  18. fineweb-edu-10BT-sorted

    • huggingface.co
    Updated Aug 1, 2024
    Cite
    Dhruv Saini (2024). fineweb-edu-10BT-sorted [Dataset]. https://huggingface.co/datasets/oof-baroomf/fineweb-edu-10BT-sorted
    Dataset updated
    Aug 1, 2024
    Authors
    Dhruv Saini
    Description

    oof-baroomf/fineweb-edu-10BT-sorted dataset hosted on Hugging Face and contributed by the HF Datasets community

  19. fineweb-edu-fortified-mini

    • huggingface.co
    Cite
    Lee Junbum, fineweb-edu-fortified-mini [Dataset]. https://huggingface.co/datasets/beomi/fineweb-edu-fortified-mini
    Authors
    Lee Junbum
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    FineWeb-Edu-Fortified-Mini

    This is a sampled version of FineWeb-Edu-Fortified, for testing purposes.

    LICENSE

    Follows the original FineWeb dataset.

  20. FineWeb-Edu-1MT

    • huggingface.co
    Updated Sep 1, 2011
    Cite
    Rulin Shao (2011). FineWeb-Edu-1MT [Dataset]. https://huggingface.co/datasets/rulins/FineWeb-Edu-1MT
    Dataset updated
    Sep 1, 2011
    Authors
    Rulin Shao
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    A subset of FineWeb-Edu of around 1M GPT-2 tokens, randomly sampled from the whole dataset. This dataset was created for illustration purposes in retrieval-scaling. Please do not distribute.

fineweb-edu

FineWeb-Edu

HuggingFaceFW/fineweb-edu

68 scholarly articles cite this dataset
