100+ datasets found
  1. fineweb-edu

    • huggingface.co
    Updated Jan 3, 2025
    + more versions
    Cite
    FineData (2025). fineweb-edu [Dataset]. http://doi.org/10.57967/hf/2497
    Dataset updated
    Jan 3, 2025
    Dataset authored and provided by
    FineData
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    📚 FineWeb-Edu

    1.3 trillion tokens of the finest educational data the 🌐 web has to offer

    Paper: https://arxiv.org/abs/2406.17557

      What is it?
    

    The 📚 FineWeb-Edu dataset consists of 1.3T tokens and 5.4T tokens (FineWeb-Edu-score-2) of educational web pages filtered from the 🍷 FineWeb dataset. This is the 1.3 trillion version. To enhance FineWeb's quality, we developed an educational quality classifier using annotations generated by Llama3-70B-Instruct. We then… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu.

  2. fineweb

    • huggingface.co
    Cite
    FineData, fineweb [Dataset]. http://doi.org/10.57967/hf/2493
    Dataset authored and provided by
    FineData
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    🍷 FineWeb

    15 trillion tokens of the finest data the 🌐 web has to offer

      What is it?
    

    The 🍷 FineWeb dataset consists of more than 18.5T tokens (originally 15T tokens) of cleaned and deduplicated English web data from CommonCrawl. The data processing pipeline is optimized for LLM performance and ran on the 🏭 datatrove library, our large-scale data processing library. 🍷 FineWeb was originally meant to be a fully open replication of 🦅 RefinedWeb, with a release… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb.

  3. fineweb-edu

    • huggingface.co
    Updated Sep 1, 2012
    Cite
    Prime Intellect (2012). fineweb-edu [Dataset]. https://huggingface.co/datasets/PrimeIntellect/fineweb-edu
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Sep 1, 2012
    Dataset provided by
    Prime Intellect, Inc.
    Authors
    Prime Intellect
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    Pre-shuffled fineweb-edu dataset

  4. chinese-fineweb-edu-v2

    • huggingface.co
    Updated May 22, 2025
    + more versions
    Cite
    opencsg (2025). chinese-fineweb-edu-v2 [Dataset]. https://huggingface.co/datasets/opencsg/chinese-fineweb-edu-v2
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    May 22, 2025
    Dataset authored and provided by
    opencsg
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    We recommend using the improved version, Fineweb-edu-chinese-v2.1!

      Chinese Fineweb Edu Dataset V2

    📖 Technical Report

    Chinese Fineweb Edu Dataset V2 is a comprehensive upgrade of the original Chinese Fineweb Edu, designed and optimized for natural language processing (NLP) tasks in the education sector. This high-quality Chinese pretraining dataset has undergone significant… See the full description on the dataset page: https://huggingface.co/datasets/opencsg/chinese-fineweb-edu-v2.

  5. processed-fineweb-edu

    • huggingface.co
    Cite
    Youzhi Yu, processed-fineweb-edu [Dataset]. https://huggingface.co/datasets/PursuitOfDataScience/processed-fineweb-edu
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Authors
    Youzhi Yu
    Description

    Processed FineWeb-Edu Dataset

    Dataset Name on Hugging Face: PursuitOfDataScience/processed-fineweb-edu

      Overview
    

    This dataset is a processed version of the FineWeb-Edu dataset, intended for language model training and NLP research. It has been tokenized and truncated according to a specified block size (i.e., 2048), preparing it for model pre-training or evaluation with transformer-based language models.

      Source Dataset
    

    Name: FineWeb-Edu
    Description: A… See the full description on the dataset page: https://huggingface.co/datasets/PursuitOfDataScience/processed-fineweb-edu.
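The processed-fineweb-edu overview above mentions tokenizing documents and truncating them to a block size of 2048. A minimal sketch of that idea follows. The whitespace tokenizer and the function names are hypothetical stand-ins chosen for illustration, not the dataset author's actual pipeline, which presumably uses a model-specific tokenizer.

```python
from typing import List

BLOCK_SIZE = 2048  # block size quoted in the dataset description


def toy_tokenize(text: str) -> List[str]:
    """Hypothetical stand-in tokenizer: splits on whitespace only."""
    return text.split()


def truncate_to_block(tokens: List[str], block_size: int = BLOCK_SIZE) -> List[str]:
    """Keep at most `block_size` tokens and drop everything after them."""
    return tokens[:block_size]


def process_document(text: str, block_size: int = BLOCK_SIZE) -> List[str]:
    """Tokenize one document, then truncate it to the block size."""
    return truncate_to_block(toy_tokenize(text), block_size)
```

Documents shorter than the block size pass through unchanged; longer ones lose their tail, which is the trade-off truncation makes compared with packing or chunking.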

  6. Fineweb-Edu-Chinese-V2.1

    • huggingface.co
    Updated May 22, 2025
    Cite
    opencsg (2025). Fineweb-Edu-Chinese-V2.1 [Dataset]. https://huggingface.co/datasets/opencsg/Fineweb-Edu-Chinese-V2.1
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    May 22, 2025
    Dataset authored and provided by
    opencsg
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Chinese Fineweb Edu Dataset V2.1

    📖 Technical Report

    The Chinese Fineweb Edu Dataset V2.1 is an enhanced version of the V2 dataset, designed specifically for natural language processing (NLP) tasks in the education sector. This version introduces two new data sources, map-cc and opencsg-cc, and retains data with scores ranging from 2 to 3. The dataset entries are organized into different folders… See the full description on the dataset page: https://huggingface.co/datasets/opencsg/Fineweb-Edu-Chinese-V2.1.

  7. mga-fineweb-edu

    • huggingface.co
    Updated Jun 1, 2025
    Cite
    ByteDance Seed (2025). mga-fineweb-edu [Dataset]. https://huggingface.co/datasets/ByteDance-Seed/mga-fineweb-edu
    Dataset updated
    Jun 1, 2025
    Dataset provided by
    ByteDance (https://www.bytedance.com/)
    Authors
    ByteDance Seed
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    Massive Genre-Audience Augment Fineweb-Edu Corpus

    This dataset is a synthetic pretraining corpus described in the paper "Reformulation for Pretraining Data Augmentation".

    Overview of synthesis framework. Our method expands the original corpus through a two-stage synthesis process. Each document is reformulated into 5 new documents, achieving a 3.9× expansion in token count while maintaining diversity through massive (genre, audience) pairs.

    We build MGACorpus based on SmolLM Corpus… See the full description on the dataset page: https://huggingface.co/datasets/ByteDance-Seed/mga-fineweb-edu.

  8. fineweb-edu-ar

    • huggingface.co
    Updated Nov 10, 2024
    Cite
    KAUST Center of Excellence in Generative AI (2024). fineweb-edu-ar [Dataset]. https://huggingface.co/datasets/kaust-generative-ai/fineweb-edu-ar
    Dataset updated
    Nov 10, 2024
    Dataset authored and provided by
    KAUST Center of Excellence in Generative AI
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    FineWeb-Edu-Ar

    FineWeb-Edu-Ar is a machine-translated Arabic version of the FineWeb-Edu dataset designed to support the development of Arabic small language models (SLMs).

    Dataset Details:

    Languages: Arabic, English (paired)
    Size: 202 billion tokens
    License: CC-BY-NC-4.0
    Source: Machine-translated from the deduplicated version of Hugging Face’s FineWeb-Edu dataset
    Translation model: facebook/nllb-200-distilled-600M

    Application: FineWeb-Edu-Ar is suitable for pre-training Arabic… See the full description on the dataset page: https://huggingface.co/datasets/kaust-generative-ai/fineweb-edu-ar.

  9. fineweb-edu-format-topic

    • huggingface.co
    Cite
    Wissam Antoun, fineweb-edu-format-topic [Dataset]. https://huggingface.co/datasets/wissamantoun/fineweb-edu-format-topic
    Authors
    Wissam Antoun
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    FineWeb-Edu w/ Topic and Format Annotations

    The FineWeb-Edu dataset consists of 1.3T tokens annotated for Topic and Format using the wissamantoun/WebOrganizer-TopicClassifier-ModernBERT and wissamantoun/WebOrganizer-FormatClassifier-ModernBERT classifiers. Similar to WebOrganizer/Corpus-200B but using FineEdu instead of DCLM.

    Topic Labels:

    Adult; Art & Design; Software Dev.; Crime & Law; Education & Jobs; Hardware; Entertainment; Social Life; Fashion & Beauty; Finance & Business; Food & Dining… See the full description on the dataset page: https://huggingface.co/datasets/wissamantoun/fineweb-edu-format-topic.

  10. fineweb-edu-llama3-annotations

    • huggingface.co
    Updated Jun 8, 2024
    Cite
    FineData (2024). fineweb-edu-llama3-annotations [Dataset]. https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu-llama3-annotations
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jun 8, 2024
    Dataset authored and provided by
    FineData
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    Annotations for 📚 FineWeb-Edu classifier

    This dataset contains the annotations used for training the 📚 FineWeb-Edu educational quality classifier. We prompt Llama-3-70B-Instruct to score web pages from 🍷 FineWeb based on their educational value. Note: the dataset contains the FineWeb text sample, the prompt (using the first 1000 characters of the text sample), and the scores, but it doesn't contain the full Llama 3 generation.
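The note above says each prompt uses only the first 1000 characters of the text sample. A minimal sketch of that truncation step is shown here. The template wording and the function name are hypothetical placeholders; the actual annotation prompt is not reproduced in this listing.

```python
MAX_PROMPT_CHARS = 1000  # per the description: first 1000 characters of the sample

# Hypothetical template, for illustration only; not the real annotation prompt.
PROMPT_TEMPLATE = "Rate the educational value of this extract:\n\n{extract}\n"


def build_prompt(text: str, max_chars: int = MAX_PROMPT_CHARS) -> str:
    """Insert only the leading `max_chars` characters of `text` into the template."""
    extract = text[:max_chars]
    return PROMPT_TEMPLATE.format(extract=extract)
```

Slicing by characters rather than tokens keeps the step tokenizer-independent, at the cost of possibly cutting a word in half at the boundary.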

  11. fineweb-edu-2024-10-from-0M-to-1M-ko-edu

    • huggingface.co
    Updated Nov 1, 2024
    + more versions
    Cite
    Suzie Oh (2024). fineweb-edu-2024-10-from-0M-to-1M-ko-edu [Dataset]. https://huggingface.co/datasets/ohsuz/fineweb-edu-2024-10-from-0M-to-1M-ko-edu
    Dataset updated
    Nov 1, 2024
    Authors
    Suzie Oh
    Description

    ohsuz/fineweb-edu-2024-10-from-0M-to-1M-ko-edu dataset hosted on Hugging Face and contributed by the HF Datasets community

  12. fineweb-edu-1M

    • huggingface.co
    + more versions
    Cite
    TEL-LLM, fineweb-edu-1M [Dataset]. https://huggingface.co/datasets/TEL-LLM/fineweb-edu-1M
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset authored and provided by
    TEL-LLM
    Description

    TEL-LLM/fineweb-edu-1M dataset hosted on Hugging Face and contributed by the HF Datasets community

  13. FineWeb-Edu-1BT

    • huggingface.co
    Updated Sep 1, 2011
    + more versions
    Cite
    Rulin Shao (2011). FineWeb-Edu-1BT [Dataset]. https://huggingface.co/datasets/rulins/FineWeb-Edu-1BT
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Sep 1, 2011
    Authors
    Rulin Shao
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    A subset of FineWeb-Edu of around 1B GPT-2 tokens, randomly sampled from the whole dataset. This dataset was created for illustration purposes in retrieval-scaling. Please do not distribute.

  14. fineweb-edu-sample-10k

    • huggingface.co
    Cite
    The Kaitchup, fineweb-edu-sample-10k [Dataset]. https://huggingface.co/datasets/kaitchup/fineweb-edu-sample-10k
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset authored and provided by
    The Kaitchup
    Description

    kaitchup/fineweb-edu-sample-10k dataset hosted on Hugging Face and contributed by the HF Datasets community

  15. fineweb-edu

    • huggingface.co
    Updated Jul 6, 2010
    Cite
    Rio Yokota Lab (2010). fineweb-edu [Dataset]. https://huggingface.co/datasets/RioYokotaLab/fineweb-edu
    Dataset updated
    Jul 6, 2010
    Dataset authored and provided by
    Rio Yokota Lab
    Description

    RioYokotaLab/fineweb-edu dataset hosted on Hugging Face and contributed by the HF Datasets community

  16. fineweb-edu-2024-10-from-1M-to-2M-ko-edu

    • huggingface.co
    Updated Nov 1, 2024
    + more versions
    Cite
    Suzie Oh (2024). fineweb-edu-2024-10-from-1M-to-2M-ko-edu [Dataset]. https://huggingface.co/datasets/ohsuz/fineweb-edu-2024-10-from-1M-to-2M-ko-edu
    Dataset updated
    Nov 1, 2024
    Authors
    Suzie Oh
    Description

    ohsuz/fineweb-edu-2024-10-from-1M-to-2M-ko-edu dataset hosted on Hugging Face and contributed by the HF Datasets community

  17. Ultra-FineWeb-EDU

    • huggingface.co
    Updated Jun 6, 2025
    Cite
    Pro Creations (2025). Ultra-FineWeb-EDU [Dataset]. https://huggingface.co/datasets/ProCreations/Ultra-FineWeb-EDU
    Dataset updated
    Jun 6, 2025
    Authors
    Pro Creations
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Ultra FineWeb EDU

    High-Quality Educational Content from Ultra-FineWeb Filtered for Maximum Educational Value

      📚 Overview
    

    Ultra FineWeb EDU is a premium educational dataset created by applying advanced educational content filtering to the exceptional Ultra-FineWeb dataset. This work builds directly upon two foundational achievements: the rigorous data curation methodology of Ultra-FineWeb and the sophisticated educational classification capabilities of the… See the full description on the dataset page: https://huggingface.co/datasets/ProCreations/Ultra-FineWeb-EDU.

  18. fineweb-edu-10bt

    • huggingface.co
    + more versions
    Cite
    Mika Senghaas, fineweb-edu-10bt [Dataset]. https://huggingface.co/datasets/mikasenghaas/fineweb-edu-10bt
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Authors
    Mika Senghaas
    Description

    mikasenghaas/fineweb-edu-10bt dataset hosted on Hugging Face and contributed by the HF Datasets community

  19. fineweb-edu-10b-combined

    • huggingface.co
    Cite
    Timothy Taylor, fineweb-edu-10b-combined [Dataset]. https://huggingface.co/datasets/deatos/fineweb-edu-10b-combined
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Authors
    Timothy Taylor
    Description

    deatos/fineweb-edu-10b-combined dataset hosted on Hugging Face and contributed by the HF Datasets community

  20. fineweb-edu-addie

    • huggingface.co
    Updated Jul 28, 2024
    Cite
    Nelson Mitchell (2024). fineweb-edu-addie [Dataset]. https://huggingface.co/datasets/sydonayrex/fineweb-edu-addie
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jul 28, 2024
    Authors
    Nelson Mitchell
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    This is an extract from HuggingFace's Fineweb-EDU data set, specifically from 2024 parquets zero through four, eleven through thirteen, and seventeen through nineteen. The extracts were based on keywords: "ADDIE," "learning theory," "adult education," and "instructional design."
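The keyword-based extraction described above can be sketched as a simple record filter. The four keywords come from the description. The case-insensitive substring matching, the `text` field name, and the function names are assumptions made for illustration, since the listing does not say how the author matched keywords.

```python
from typing import Dict, Iterable, List, Sequence

# Keywords quoted in the dataset description.
KEYWORDS = ("ADDIE", "learning theory", "adult education", "instructional design")


def matches_keywords(text: str, keywords: Sequence[str] = KEYWORDS) -> bool:
    """Assumed rule: case-insensitive substring match on any keyword."""
    lowered = text.lower()
    return any(keyword.lower() in lowered for keyword in keywords)


def extract_matching(records: Iterable[Dict[str, str]],
                     text_field: str = "text") -> List[Dict[str, str]]:
    """Keep records whose (assumed) text field mentions at least one keyword."""
    return [r for r in records if matches_keywords(r.get(text_field, ""))]
```

Plain substring matching is deliberately loose: a short keyword such as "ADDIE" will also match inside longer words, so a stricter word-boundary rule may be preferable in practice.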

Cite
FineData (2025). fineweb-edu [Dataset]. http://doi.org/10.57967/hf/2497

fineweb-edu

FineWeb-Edu

HuggingFaceFW/fineweb-edu

55 scholarly articles cite this dataset (View in Google Scholar)
