License: https://choosealicense.com/licenses/odc-by/
📚 FineWeb-Edu
1.3 trillion tokens of the finest educational data the 🌐 web has to offer
Paper: https://arxiv.org/abs/2406.17557
What is it?
The 📚 FineWeb-Edu dataset comes in two versions of educational web pages filtered from the 🍷 FineWeb dataset: 1.3T tokens (FineWeb-Edu) and 5.4T tokens (FineWeb-Edu-score-2). This is the 1.3 trillion version. To enhance FineWeb's quality, we developed an educational quality classifier using annotations generated by Llama3-70B-Instruct. We then… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu.
License: https://choosealicense.com/licenses/odc-by/
📚 FineWeb-Edu-score-2
1.3 trillion tokens of the finest educational data the 🌐 web has to offer
What is it?
The 📚 FineWeb-Edu dataset comes in two versions of educational web pages filtered from the 🍷 FineWeb dataset: 1.3T tokens (FineWeb-Edu) and 5.4T tokens (FineWeb-Edu-score-2). This is the 5.4 trillion version.
Note: this version uses a lower educational score threshold of 2, which yields more documents but lower quality than the 1.3T version. For more details check the… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu-score-2.
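The trade-off in the note above can be illustrated with a toy filter. This is only a sketch: the sample records, the `score` field name, and the stricter threshold of 3 are assumptions made for illustration, not the dataset's actual schema or filtering pipeline.

```python
# Toy records; the texts and scores are invented for illustration only.
docs = [
    {"text": "Intro to photosynthesis", "score": 4},
    {"text": "Product listing", "score": 1},
    {"text": "Blog post touching on algebra", "score": 2},
    {"text": "Lecture notes on sorting", "score": 3},
]


def filter_by_score(records, threshold):
    """Keep records whose (hypothetical) educational score is >= threshold."""
    return [r for r in records if r["score"] >= threshold]


# A lower threshold keeps more documents, including weaker ones.
lenient = filter_by_score(docs, threshold=2)
# A higher threshold (3 is an assumed value here) keeps fewer documents.
strict = filter_by_score(docs, threshold=3)

print(len(lenient), len(strict))
```

Every document that passes the stricter threshold also passes the lenient one, so the lenient set is a superset, which matches the note's "more documents, but lower quality" description.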