https://choosealicense.com/licenses/odc-by/
📚 FineWeb-Edu
1.3 trillion tokens of the finest educational data the 🌐 web has to offer
Paper: https://arxiv.org/abs/2406.17557
What is it?
The 📚 FineWeb-Edu dataset consists of 1.3T tokens (5.4T tokens for the FineWeb-Edu-score-2 variant) of educational web pages filtered from the 🍷 FineWeb dataset. This is the 1.3-trillion-token version. To enhance FineWeb's quality, we developed an educational quality classifier using annotations generated by Llama-3-70B-Instruct. We then… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu.
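As a minimal sketch of how the two variants relate (the thresholds of 3 and 2 follow the dataset card; the records below are made up), the classifier-score filtering can be pictured as:

```python
# Illustrative only: filter documents by an educational-quality score,
# mirroring how FineWeb-Edu (score >= 3) and FineWeb-Edu-score-2
# (score >= 2) are derived from classifier annotations.
docs = [
    {"text": "intro to photosynthesis", "score": 3},
    {"text": "celebrity gossip roundup", "score": 1},
    {"text": "basic algebra worksheet", "score": 2},
]

fineweb_edu = [d for d in docs if d["score"] >= 3]
fineweb_edu_score_2 = [d for d in docs if d["score"] >= 2]

print(len(fineweb_edu), len(fineweb_edu_score_2))  # 1 2
```

The score-2 subset is a strict superset of the main subset, which is why it is larger (5.4T vs. 1.3T tokens).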
https://choosealicense.com/licenses/odc-by/
🍷 FineWeb
15 trillion tokens of the finest data the 🌐 web has to offer
What is it?
The 🍷 FineWeb dataset consists of more than 18.5T tokens (originally 15T tokens) of cleaned and deduplicated English web data from CommonCrawl. The data processing pipeline is optimized for LLM performance and ran on the 🏭 datatrove library, our large-scale data processing library. 🍷 FineWeb was originally meant to be a fully open replication of 🦅 RefinedWeb, with a release… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb.
https://choosealicense.com/licenses/odc-by/
Pre-shuffled fineweb-edu dataset
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
We recommend using the improved version, Fineweb-edu-chinese-v2.1!
Chinese Fineweb Edu Dataset V2 [中文] [English]
[OpenCSG Community] [👾github] [wechat] [Twitter]
📖Technical Report Chinese Fineweb Edu Dataset V2 is a comprehensive upgrade of the original Chinese Fineweb Edu, designed and optimized for natural language processing (NLP) tasks in the education sector. This high-quality Chinese pretraining dataset has undergone significant… See the full description on the dataset page: https://huggingface.co/datasets/opencsg/chinese-fineweb-edu-v2.
Processed FineWeb-Edu Dataset
Dataset Name on Hugging Face: PursuitOfDataScience/processed-fineweb-edu
Overview
This dataset is a processed version of the FineWeb-Edu dataset, intended for language model training and NLP research. It has been tokenized and truncated to a fixed block size (2048 tokens), preparing it for pre-training or evaluation with transformer-based language models.
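A minimal sketch of the block-size preparation described above (the tokenizer is stubbed with plain integer ids and a small block size; the real dataset uses a subword tokenizer and blocks of 2048):

```python
BLOCK_SIZE = 8  # the real dataset uses 2048

def to_blocks(token_ids, block_size):
    """Concatenate token ids and cut them into fixed-size blocks,
    dropping the incomplete tail, as is common for LM pre-training."""
    n = (len(token_ids) // block_size) * block_size
    return [token_ids[i:i + block_size] for i in range(0, n, block_size)]

ids = list(range(20))  # stand-in for tokenizer output
blocks = to_blocks(ids, BLOCK_SIZE)
print(len(blocks), len(blocks[0]))  # 2 8
```

Fixed-size blocks let a trainer batch sequences without padding, which is the usual reason for this preprocessing step.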
Source Dataset
Name: FineWeb-Edu
Description: A… See the full description on the dataset page: https://huggingface.co/datasets/PursuitOfDataScience/processed-fineweb-edu.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Chinese Fineweb Edu Dataset V2.1 [中文] [English]
[OpenCSG Community] [👾github] [wechat] [Twitter]
📖Technical Report The Chinese Fineweb Edu Dataset V2.1 is an enhanced version of the V2 dataset, designed specifically for natural language processing (NLP) tasks in the education sector. This version introduces two new data sources, map-cc and opencsg-cc, and retains data with scores ranging from 2 to 3. The dataset entries are organized into different folders… See the full description on the dataset page: https://huggingface.co/datasets/opencsg/Fineweb-Edu-Chinese-V2.1.
https://choosealicense.com/licenses/odc-by/
Massive Genre-Audience Augment Fineweb-Edu Corpus
This dataset is a synthetic pretraining corpus described in paper Reformulation for Pretraining Data Augmentation.
Overview of the synthesis framework: our method expands the original corpus through a two-stage synthesis process. Each document is reformulated into 5 new documents, achieving a 3.9× token-count expansion while maintaining diversity through a large number of (genre, audience) pairs.
We build MGACorpus based on SmolLM Corpus… See the full description on the dataset page: https://huggingface.co/datasets/ByteDance-Seed/mga-fineweb-edu.
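A toy sketch of the expansion idea described above (the genre and audience pools and the `reformulate` stub are hypothetical; the actual pipeline prompts an LLM to rewrite each document):

```python
import itertools
import random

# Hypothetical pools; the real MGACorpus taxonomy is much larger.
GENRES = ["tutorial", "dialogue", "quiz", "story", "summary"]
AUDIENCES = ["children", "undergraduates", "professionals", "hobbyists", "teachers"]

def reformulate(doc, genre, audience):
    # Placeholder: the real pipeline prompts an LLM to rewrite `doc`
    # in the given genre for the given audience.
    return f"[{genre} for {audience}] {doc}"

def expand(doc, k=5, seed=0):
    """Expand one document into k reformulations via (genre, audience) pairs."""
    pairs = list(itertools.product(GENRES, AUDIENCES))
    chosen = random.Random(seed).sample(pairs, k)
    return [reformulate(doc, g, a) for g, a in chosen]

new_docs = expand("Photosynthesis converts light into chemical energy.")
print(len(new_docs))  # 5
```

Sampling distinct (genre, audience) pairs per document is what keeps the five reformulations diverse rather than near-duplicates.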
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
FineWeb-Edu-Ar
FineWeb-Edu-Ar is a machine-translated Arabic version of the FineWeb-Edu dataset designed to support the development of Arabic small language models (SLMs). Dataset Details:
- Languages: Arabic, English (paired)
- Size: 202 billion tokens
- License: CC-BY-NC-4.0
- Source: Machine-translated from the deduplicated version of Hugging Face's FineWeb-Edu dataset
- Translation model: facebook/nllb-200-distilled-600M
Application: FineWeb-Edu-Ar is suitable for pre-training Arabic… See the full description on the dataset page: https://huggingface.co/datasets/kaust-generative-ai/fineweb-edu-ar.
https://choosealicense.com/licenses/odc-by/
FineWeb-Edu w/ Topic and Format Annotations
The FineWeb-Edu dataset (1.3T tokens) annotated for Topic and Format using the wissamantoun/WebOrganizer-TopicClassifier-ModernBERT and wissamantoun/WebOrganizer-FormatClassifier-ModernBERT classifiers. Similar to WebOrganizer/Corpus-200B, but using FineWeb-Edu instead of DCLM. Topic Labels:
Adult, Art & Design, Software Dev., Crime & Law, Education & Jobs, Hardware, Entertainment, Social Life, Fashion & Beauty, Finance & Business, Food & Dining… See the full description on the dataset page: https://huggingface.co/datasets/wissamantoun/fineweb-edu-format-topic.
https://choosealicense.com/licenses/odc-by/
Annotations for 📚 FineWeb-Edu classifier
This dataset contains the annotations used to train the 📚 FineWeb-Edu educational quality classifier. We prompt Llama-3-70B-Instruct to score web pages from 🍷 FineWeb based on their educational value. Note: the dataset contains the FineWeb text sample, the prompt (using the first 1000 characters of the text sample), and the scores, but it does not contain the full Llama 3 generations.
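A minimal sketch of the prompt construction noted above (the prompt wording is hypothetical; only the "first 1000 characters" detail comes from the description):

```python
# Hypothetical template; the real annotation prompt differs.
PROMPT_TEMPLATE = (
    "Below is an extract from a web page. Rate its educational value "
    "on a scale from 0 to 5.\n\nExtract:\n{extract}"
)

def build_prompt(text, max_chars=1000):
    # Only the first 1000 characters of the sample are shown to the scorer.
    return PROMPT_TEMPLATE.format(extract=text[:max_chars])

prompt = build_prompt("x" * 5000)
```

Truncating to a fixed character budget keeps annotation cost bounded while still giving the scorer enough context to judge a page.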
ohsuz/fineweb-edu-2024-10-from-0M-to-1M-ko-edu dataset hosted on Hugging Face and contributed by the HF Datasets community
TEL-LLM/fineweb-edu-1M dataset hosted on Hugging Face and contributed by the HF Datasets community
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
A subset of FineWeb-Edu, randomly sampled from the whole dataset, of around 1B GPT-2 tokens. This dataset was created for illustration purposes in retrieval-scaling. Please do not distribute.
kaitchup/fineweb-edu-sample-10k dataset hosted on Hugging Face and contributed by the HF Datasets community
RioYokotaLab/fineweb-edu dataset hosted on Hugging Face and contributed by the HF Datasets community
ohsuz/fineweb-edu-2024-10-from-1M-to-2M-ko-edu dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Ultra FineWeb EDU
High-Quality Educational Content from Ultra-FineWeb Filtered for Maximum Educational Value
📚 Overview
Ultra FineWeb EDU is a premium educational dataset created by applying advanced educational content filtering to the exceptional Ultra-FineWeb dataset. This work builds directly upon two foundational achievements: the rigorous data curation methodology of Ultra-FineWeb and the sophisticated educational classification capabilities of the… See the full description on the dataset page: https://huggingface.co/datasets/ProCreations/Ultra-FineWeb-EDU.
mikasenghaas/fineweb-edu-10bt dataset hosted on Hugging Face and contributed by the HF Datasets community
deatos/fineweb-edu-10b-combined dataset hosted on Hugging Face and contributed by the HF Datasets community
https://choosealicense.com/licenses/odc-by/
This is an extract from Hugging Face's FineWeb-Edu dataset, specifically from 2024 parquet files zero through four, eleven through thirteen, and seventeen through nineteen. The extracts were based on the keywords: "ADDIE," "learning theory," "adult education," and "instructional design."