https://choosealicense.com/licenses/odc-by/
📚 FineWeb-Edu
1.3 trillion tokens of the finest educational data the 🌐 web has to offer
Paper: https://arxiv.org/abs/2406.17557
What is it?
The 📚 FineWeb-Edu dataset consists of 1.3T tokens (5.4T tokens for the FineWeb-Edu-score-2 variant) of educational web pages filtered from the 🍷 FineWeb dataset. This is the 1.3-trillion-token version. To enhance FineWeb's quality, we developed an educational quality classifier using annotations generated by Llama-3-70B-Instruct. We then… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu.
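As a minimal sketch of how the two variants relate (the thresholds of 3 and 2 follow the dataset card; the records below are made up), the classifier-score filtering can be pictured as:

```python
# Illustrative only: filter documents by an educational-quality score,
# mirroring how FineWeb-Edu (score >= 3) and FineWeb-Edu-score-2
# (score >= 2) are derived from classifier annotations.
docs = [
    {"text": "intro to photosynthesis", "score": 3},
    {"text": "celebrity gossip roundup", "score": 1},
    {"text": "basic algebra worksheet", "score": 2},
]

fineweb_edu = [d for d in docs if d["score"] >= 3]
fineweb_edu_score_2 = [d for d in docs if d["score"] >= 2]

print(len(fineweb_edu), len(fineweb_edu_score_2))  # 1 2
```

The score-2 subset is a strict superset of the main subset, which is why it is larger (5.4T vs. 1.3T tokens).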
https://choosealicense.com/licenses/odc-by/
🍷 FineWeb
15 trillion tokens of the finest data the 🌐 web has to offer
What is it?
The 🍷 FineWeb dataset consists of more than 18.5T tokens (originally 15T tokens) of cleaned and deduplicated English web data from CommonCrawl. The data processing pipeline is optimized for LLM performance and ran on the 🏭 datatrove library, our large-scale data processing library. 🍷 FineWeb was originally meant to be a fully open replication of 🦅 RefinedWeb, with a release… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb.
https://choosealicense.com/licenses/odc-by/
Pre-shuffled fineweb-edu dataset
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
We recommend using the improved version, Fineweb-edu-chinese-v2.1!
Chinese Fineweb Edu Dataset V2 [中文] [English]
[OpenCSG Community] [👾github] [wechat] [Twitter]
📖Technical Report Chinese Fineweb Edu Dataset V2 is a comprehensive upgrade of the original Chinese Fineweb Edu, designed and optimized for natural language processing (NLP) tasks in the education sector. This high-quality Chinese pretraining dataset has undergone significant… See the full description on the dataset page: https://huggingface.co/datasets/opencsg/chinese-fineweb-edu-v2.
Processed FineWeb-Edu Dataset
Dataset Name on Hugging Face: PursuitOfDataScience/processed-fineweb-edu
Overview
This dataset is a processed version of the FineWeb-Edu dataset, intended for language model training and NLP research. It has been tokenized and truncated to a fixed block size (2048 tokens), preparing it for pre-training or evaluation with transformer-based language models.
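A minimal sketch of the block-size preparation described above (the tokenizer is stubbed with plain integer ids and a small block size; the real dataset uses a subword tokenizer and blocks of 2048):

```python
BLOCK_SIZE = 8  # the real dataset uses 2048

def to_blocks(token_ids, block_size):
    """Concatenate token ids and cut them into fixed-size blocks,
    dropping the incomplete tail, as is common for LM pre-training."""
    n = (len(token_ids) // block_size) * block_size
    return [token_ids[i:i + block_size] for i in range(0, n, block_size)]

ids = list(range(20))  # stand-in for tokenizer output
blocks = to_blocks(ids, BLOCK_SIZE)
print(len(blocks), len(blocks[0]))  # 2 8
```

Fixed-size blocks let a trainer batch sequences without padding, which is the usual reason for this preprocessing step.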
Source Dataset
Name: FineWeb-Edu
Description: A… See the full description on the dataset page: https://huggingface.co/datasets/PursuitOfDataScience/processed-fineweb-edu.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Chinese Fineweb Edu Dataset V2.1 [中文] [English]
[OpenCSG Community] [👾github] [wechat] [Twitter]
📖Technical Report The Chinese Fineweb Edu Dataset V2.1 is an enhanced version of the V2 dataset, designed specifically for natural language processing (NLP) tasks in the education sector. This version introduces two new data sources, map-cc and opencsg-cc, and retains data with scores ranging from 2 to 3. The dataset entries are organized into different folders… See the full description on the dataset page: https://huggingface.co/datasets/opencsg/Fineweb-Edu-Chinese-V2.1.
https://choosealicense.com/licenses/odc-by/
Massive Genre-Audience Augment Fineweb-Edu Corpus
This dataset is a synthetic pretraining corpus described in paper Reformulation for Pretraining Data Augmentation.
Overview of the synthesis framework: our method expands the original corpus through a two-stage synthesis process. Each document is reformulated into 5 new documents, achieving a 3.9× token-count expansion while maintaining diversity through a large number of (genre, audience) pairs.
We build MGACorpus based on SmolLM Corpus… See the full description on the dataset page: https://huggingface.co/datasets/ByteDance-Seed/mga-fineweb-edu.
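A toy sketch of the expansion idea described above (the genre and audience pools and the `reformulate` stub are hypothetical; the actual pipeline prompts an LLM to rewrite each document):

```python
import itertools
import random

# Hypothetical pools; the real MGACorpus taxonomy is much larger.
GENRES = ["tutorial", "dialogue", "quiz", "story", "summary"]
AUDIENCES = ["children", "undergraduates", "professionals", "hobbyists", "teachers"]

def reformulate(doc, genre, audience):
    # Placeholder: the real pipeline prompts an LLM to rewrite `doc`
    # in the given genre for the given audience.
    return f"[{genre} for {audience}] {doc}"

def expand(doc, k=5, seed=0):
    """Expand one document into k reformulations via (genre, audience) pairs."""
    pairs = list(itertools.product(GENRES, AUDIENCES))
    chosen = random.Random(seed).sample(pairs, k)
    return [reformulate(doc, g, a) for g, a in chosen]

new_docs = expand("Photosynthesis converts light into chemical energy.")
print(len(new_docs))  # 5
```

Sampling distinct (genre, audience) pairs per document is what keeps the five reformulations diverse rather than near-duplicates.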
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
FineWeb-Edu-Ar
FineWeb-Edu-Ar is a machine-translated Arabic version of the FineWeb-Edu dataset designed to support the development of Arabic small language models (SLMs). Dataset Details:
- Languages: Arabic, English (paired)
- Size: 202 billion tokens
- License: CC-BY-NC-4.0
- Source: Machine-translated from the deduplicated version of Hugging Face's FineWeb-Edu dataset
- Translation model: facebook/nllb-200-distilled-600M
Application: FineWeb-Edu-Ar is suitable for pre-training Arabic… See the full description on the dataset page: https://huggingface.co/datasets/kaust-generative-ai/fineweb-edu-ar.
https://choosealicense.com/licenses/odc-by/
FineWeb-Edu w/ Topic and Format Annotations
The FineWeb-Edu dataset (1.3T tokens) annotated for Topic and Format using the wissamantoun/WebOrganizer-TopicClassifier-ModernBERT and wissamantoun/WebOrganizer-FormatClassifier-ModernBERT classifiers. Similar to WebOrganizer/Corpus-200B, but using FineWeb-Edu instead of DCLM. Topic Labels:
Adult, Art & Design, Software Dev., Crime & Law, Education & Jobs, Hardware, Entertainment, Social Life, Fashion & Beauty, Finance & Business, Food & Dining… See the full description on the dataset page: https://huggingface.co/datasets/wissamantoun/fineweb-edu-format-topic.
https://choosealicense.com/licenses/odc-by/
Annotations for 📚 FineWeb-Edu classifier
This dataset contains the annotations used to train the 📚 FineWeb-Edu educational quality classifier. We prompt Llama-3-70B-Instruct to score web pages from 🍷 FineWeb based on their educational value. Note: the dataset contains the FineWeb text sample, the prompt (using the first 1000 characters of the text sample), and the scores, but it does not contain the full Llama 3 generations.
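A minimal sketch of the prompt construction noted above (the prompt wording is hypothetical; only the "first 1000 characters" detail comes from the description):

```python
# Hypothetical template; the real annotation prompt differs.
PROMPT_TEMPLATE = (
    "Below is an extract from a web page. Rate its educational value "
    "on a scale from 0 to 5.\n\nExtract:\n{extract}"
)

def build_prompt(text, max_chars=1000):
    # Only the first 1000 characters of the sample are shown to the scorer.
    return PROMPT_TEMPLATE.format(extract=text[:max_chars])

prompt = build_prompt("x" * 5000)
```

Truncating to a fixed character budget keeps annotation cost bounded while still giving the scorer enough context to judge a page.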
ohsuz/fineweb-edu-2024-10-from-0M-to-1M-ko-edu dataset hosted on Hugging Face and contributed by the HF Datasets community
TEL-LLM/fineweb-edu-1M dataset hosted on Hugging Face and contributed by the HF Datasets community
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
A subset of FineWeb-Edu, randomly sampled from the whole dataset, of around 1B GPT-2 tokens. This dataset was created for illustration purposes in retrieval-scaling. Please do not distribute.
kaitchup/fineweb-edu-sample-10k dataset hosted on Hugging Face and contributed by the HF Datasets community
RioYokotaLab/fineweb-edu dataset hosted on Hugging Face and contributed by the HF Datasets community
ohsuz/fineweb-edu-2024-10-from-1M-to-2M-ko-edu dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Ultra FineWeb EDU
High-Quality Educational Content from Ultra-FineWeb Filtered for Maximum Educational Value
📚 Overview
Ultra FineWeb EDU is a premium educational dataset created by applying advanced educational content filtering to the exceptional Ultra-FineWeb dataset. This work builds directly upon two foundational achievements: the rigorous data curation methodology of Ultra-FineWeb and the sophisticated educational classification capabilities of the… See the full description on the dataset page: https://huggingface.co/datasets/ProCreations/Ultra-FineWeb-EDU.
mikasenghaas/fineweb-edu-10bt dataset hosted on Hugging Face and contributed by the HF Datasets community
deatos/fineweb-edu-10b-combined dataset hosted on Hugging Face and contributed by the HF Datasets community
https://choosealicense.com/licenses/odc-by/
This is an extract from Hugging Face's FineWeb-Edu dataset, specifically from 2024 parquet files zero through four, eleven through thirteen, and seventeen through nineteen. The extracts were based on the keywords: "ADDIE," "learning theory," "adult education," and "instructional design."