Open Data Commons Attribution License (ODC-By): https://choosealicense.com/licenses/odc-by/
📚 FineWeb-Edu
1.3 trillion tokens of the finest educational data the 🌐 web has to offer
Paper: https://arxiv.org/abs/2406.17557
What is it?
The 📚 FineWeb-Edu dataset consists of 1.3T tokens (and 5.4T tokens in the FineWeb-Edu-score-2 variant) of educational web pages filtered from the 🍷 FineWeb dataset. This is the 1.3-trillion-token version. To enhance FineWeb's quality, we developed an educational quality classifier using annotations generated by Llama-3-70B-Instruct. We then… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu.
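The construction described above amounts to a threshold filter over classifier scores. A minimal sketch, using a hypothetical record layout (the real pipeline runs at web scale; field names here are assumptions):

```python
# Sketch of score-based filtering as described above (hypothetical record
# layout; the actual pipeline is far larger and more involved).
def filter_by_score(records, threshold=3):
    """Keep pages whose educational score meets the threshold.

    A threshold of 3 corresponds to the main FineWeb-Edu split;
    lowering it to 2 corresponds to the larger score-2 variant.
    """
    return [r for r in records if r["score"] >= threshold]

pages = [
    {"text": "Photosynthesis converts light into chemical energy.", "score": 4},
    {"text": "Buy cheap widgets now!!!", "score": 0},
    {"text": "A short note on fractions.", "score": 2},
]

edu = filter_by_score(pages)            # main split: score >= 3
edu_score2 = filter_by_score(pages, 2)  # larger score-2 variant
```

Relaxing the threshold trades purity for volume, which is exactly the difference between the 1.3T and 5.4T versions.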
Open Data Commons Attribution License (ODC-By): https://choosealicense.com/licenses/odc-by/
🍷 FineWeb
15 trillion tokens of the finest data the 🌐 web has to offer
What is it?
The 🍷 FineWeb dataset consists of more than 18.5T tokens (originally 15T tokens) of cleaned and deduplicated English web data from CommonCrawl. The data processing pipeline is optimized for LLM performance and ran on the 🏭 datatrove library, our large-scale data processing library. 🍷 FineWeb was originally meant to be a fully open replication of 🦅 RefinedWeb, with a release… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb.
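The "cleaned and deduplicated" step can be illustrated with a toy clean-then-deduplicate pass. This is a stand-in, not the actual datatrove pipeline, which has many more stages (language filtering, quality heuristics, MinHash deduplication across snapshots):

```python
# Toy illustration of cleaning followed by exact deduplication (not the
# real FineWeb/datatrove pipeline, which is far more elaborate).
def clean(text):
    """Normalize whitespace; empty strings fall out naturally."""
    return " ".join(text.split())

def dedup_exact(docs):
    """Remove exact duplicates while preserving first-seen order."""
    seen, out = set(), []
    for d in docs:
        t = clean(d)
        if t and t not in seen:
            seen.add(t)
            out.append(t)
    return out

docs = ["Hello   world", "Hello world", "  ", "Another page"]
unique = dedup_exact(docs)
```

Even this toy version shows why cleaning must precede deduplication: the two "Hello world" variants only collide once whitespace is normalized.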
Open Data Commons Attribution License (ODC-By): https://choosealicense.com/licenses/odc-by/
Pre-shuffled fineweb-edu dataset
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This is the shuffled version of fineweb-edu-10bt-for-gpt2. Please refer to fineweb-edu-10bt-for-gpt2 for more information about this dataset.
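A pre-shuffled release like this one is useful because a reproducible, seeded permutation can be computed once and reused. A minimal sketch of a deterministic shuffle (the hosted dataset's exact shuffling procedure is not documented here):

```python
import random

# Sketch of a reproducible pre-shuffle: a seeded permutation of record
# indices (illustrative only; not the dataset's documented procedure).
def shuffled_indices(n, seed=42):
    """Return a deterministic permutation of range(n) for a given seed."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return idx

order = shuffled_indices(5)
again = shuffled_indices(5)  # same seed -> identical permutation
```

Fixing the seed means every consumer sees records in the same order, which is the whole point of shipping a shuffled copy.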
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
We recommend using the improved version, Fineweb-edu-chinese-v2.1!
Chinese Fineweb Edu Dataset [中文] [English]
[OpenCSG Community] [github] [WeChat] [Twitter]
Technical Report: The Chinese Fineweb Edu dataset is a meticulously constructed, high-quality Chinese pre-training corpus, specifically designed for natural language processing tasks in the education domain. This dataset undergoes a rigorous selection and deduplication process, using a… See the full description on the dataset page: https://huggingface.co/datasets/opencsg/chinese-fineweb-edu.
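Deduplication for a corpus like this typically goes beyond exact matches. A common building block is near-duplicate detection via word-shingle Jaccard similarity, sketched below (illustrative; not necessarily the exact method this dataset used):

```python
# Toy near-duplicate detection via word-shingle Jaccard similarity
# (a standard deduplication building block; shown for illustration).
def shingles(text, k=3):
    """Set of k-word shingles (contiguous word n-grams) from text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b, k=3):
    """Jaccard similarity of two texts' shingle sets, in [0, 1]."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

near_dup = jaccard("the cat sat on the mat today",
                   "the cat sat on the mat yesterday") > 0.5
```

Pairs above a similarity threshold are treated as duplicates and collapsed; at scale this comparison is approximated with MinHash rather than computed pairwise.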
Open Data Commons Attribution License (ODC-By): https://choosealicense.com/licenses/odc-by/
karpathy/fineweb-edu-100b-shuffle dataset hosted on Hugging Face and contributed by the HF Datasets community
AryaWu/fineweb-edu dataset hosted on Hugging Face and contributed by the HF Datasets community
Processed FineWeb-Edu Dataset
Dataset Name on Hugging Face: PursuitOfDataScience/processed-fineweb-edu
Overview
This dataset is a processed version of the FineWeb-Edu dataset, intended for language model training and NLP research. It has been tokenized and truncated to a specified block size (2048 tokens), preparing it for pre-training or evaluation with transformer-based language models.
Source Dataset
Name: FineWeb-Edu
Description: A… See the full description on the dataset page: https://huggingface.co/datasets/PursuitOfDataScience/processed-fineweb-edu.
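The tokenize-and-truncate step described above packs token ids into fixed-size blocks, dropping any trailing remainder. A sketch with toy ids and a small block size (the real dataset uses 2048-token blocks from a GPT-style tokenizer):

```python
# Sketch of packing a flat token stream into fixed-size training blocks,
# as described above (toy ids; real blocks are 2048 tokens).
def pack_into_blocks(token_ids, block_size=2048):
    """Split a flat list of token ids into full blocks; drop the remainder."""
    n_full = len(token_ids) // block_size
    return [token_ids[i * block_size:(i + 1) * block_size]
            for i in range(n_full)]

blocks = pack_into_blocks(list(range(10)), block_size=4)
```

Here 10 tokens yield two full blocks of 4 and the final 2 tokens are discarded, mirroring how a block-size-2048 corpus drops each document stream's tail.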
This dataset was created by Yousef Mohamed.
deatos/fineweb-edu-10b-combined dataset hosted on Hugging Face and contributed by the HF Datasets community
ohsuz/fineweb-edu-2024-10-from-0M-to-1M-ko dataset hosted on Hugging Face and contributed by the HF Datasets community
Open Data Commons Attribution License (ODC-By): https://choosealicense.com/licenses/odc-by/
Annotations for the 📚 FineWeb-Edu classifier
This dataset contains the annotations used for training the 📚 FineWeb-Edu educational quality classifier. We prompted Llama-3-70B-Instruct to score web pages from 🍷 FineWeb based on their educational value. Note: the dataset contains the FineWeb text sample, the prompt (using the first 1000 characters of the text sample), and the scores, but it doesn't contain the full Llama 3 generation.
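The prompt-construction detail above (only the first 1000 characters of each page are scored) can be sketched as a small helper. The prompt wording here is illustrative, not the exact prompt used for annotation:

```python
# Sketch of building an annotation prompt from a page's first 1000
# characters, as described above (illustrative wording, not the real prompt).
def build_annotation_prompt(text, max_chars=1000):
    """Truncate the page text and wrap it in a scoring instruction."""
    excerpt = text[:max_chars]
    return (
        "Rate the educational value of the following web page "
        "on a scale from 0 to 5.\n\n"
        f"Page excerpt:\n{excerpt}"
    )

prompt = build_annotation_prompt("x" * 1500)  # only 1000 chars survive
```

Truncating keeps annotation costs bounded while usually preserving enough of the page to judge its educational value.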
Open Data Commons Attribution License (ODC-By): https://choosealicense.com/licenses/odc-by/
FineWeb-Edu-Analytic (v1)
FineWeb-Edu-Analytic (v1) is an English-language dataset containing 9,908 documents, intended as a resource for training language models. The dataset was generated by taking text sequences from the FineWeb-Edu dataset (CC-MAIN-2025-26 subset) to serve as a source. Each source sequence was then processed by a 48-billion-parameter language model to generate a corresponding structured, analytical document. Disclaimer: This dataset is not affiliated with the… See the full description on the dataset page: https://huggingface.co/datasets/MultivexAI/FineWeb-Edu-Analytic.
EleutherAI/fineweb-edu-dedup-10b dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
FineWeb-Edu-Ar
FineWeb-Edu-Ar is a machine-translated Arabic version of the FineWeb-Edu dataset, designed to support the development of Arabic small language models (SLMs). Dataset Details:
Languages: Arabic, English (paired)
Size: 202 billion tokens
License: CC-BY-NC-4.0
Source: Machine-translated from the deduplicated version of Hugging Face's FineWeb-Edu dataset
Translation model: facebook/nllb-200-distilled-600M
Application: FineWeb-Edu-Ar is suitable for pre-training Arabic… See the full description on the dataset page: https://huggingface.co/datasets/kaust-generative-ai/fineweb-edu-ar.
TEL-LLM/fineweb-edu-1M dataset hosted on Hugging Face and contributed by the HF Datasets community
ohsuz/fineweb-edu-2024-10-from-1M-to-2M-ko-edu dataset hosted on Hugging Face and contributed by the HF Datasets community
oof-baroomf/fineweb-edu-10BT-sorted dataset hosted on Hugging Face and contributed by the HF Datasets community
Open Data Commons Attribution License (ODC-By): https://choosealicense.com/licenses/odc-by/
FineWeb-Edu-Fortified-Mini
This is a sampled version of FineWeb-Edu-Fortified, for testing purposes.
LICENSE
Follows the original FineWeb dataset license.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
A subset of FineWeb-Edu randomly sampled from the whole dataset, totaling around 1M GPT-2 tokens. This dataset was created for illustration purposes in retrieval-scaling. Please do not distribute.
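Sampling "around 1M tokens" means drawing documents until a token budget is met. A minimal sketch with toy token counts (hypothetical; not the procedure actually used to build the subset above):

```python
import random

# Sketch of sampling documents up to a token budget (e.g. ~1M GPT-2
# tokens for the subset above); toy counts, hypothetical procedure.
def sample_to_budget(docs_with_counts, budget, seed=0):
    """Shuffle, then greedily take documents that still fit the budget."""
    pool = list(docs_with_counts)
    random.Random(seed).shuffle(pool)
    picked, total = [], 0
    for doc, n_tokens in pool:
        if total + n_tokens <= budget:
            picked.append(doc)
            total += n_tokens
    return picked, total

subset, n_tokens = sample_to_budget(
    [("a", 400), ("b", 300), ("c", 500)], budget=700)
```

Greedy filling under-shoots the budget slightly rather than splitting documents, which keeps every sampled document intact.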