ODC-By 1.0: https://choosealicense.com/licenses/odc-by/
🍷 FineWeb
15 trillion tokens of the finest data the 🌐 web has to offer
What is it?
The 🍷 FineWeb dataset consists of more than 18.5T tokens (originally 15T tokens) of cleaned and deduplicated English web data from CommonCrawl. The data processing pipeline is optimized for LLM performance and ran on the 🏭 datatrove library, our large-scale data processing library. 🍷 FineWeb was originally meant to be a fully open replication of 🦅 RefinedWeb, with a release… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb.
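As a quick orientation, here is a minimal sketch of streaming the dataset with the 🤗 datasets library; the config name sample-10BT is an assumption taken from the sample subsets listed on the dataset card.

```python
# Minimal sketch: stream 🍷 FineWeb without downloading all shards.
# Assumption: the "sample-10BT" config published on the dataset card.
from datasets import load_dataset

fw = load_dataset("HuggingFaceFW/fineweb", name="sample-10BT",
                  split="train", streaming=True)

for doc in fw.take(3):        # IterableDataset.take keeps the stream lazy
    print(doc["text"][:200])  # each record carries the cleaned page text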
ODC-By 1.0: https://choosealicense.com/licenses/odc-by/
🥂 FineWeb2
A sparkling update with 1000s of languages
What is it?
This is the second iteration of the popular 🍷 FineWeb dataset, bringing high quality pretraining data to over 1000 🗣️ languages. The 🥂 FineWeb2 dataset is fully reproducible, available under the permissive ODC-By 1.0 license and extensively validated through hundreds of ablation experiments. In particular, on the set of 9 diverse languages we used to guide our processing decisions, 🥂 FineWeb2… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb-2.
ODC-By 1.0: https://choosealicense.com/licenses/odc-by/
📚 FineWeb-Edu
1.3 trillion tokens of the finest educational data the 🌐 web has to offer
Paper: https://arxiv.org/abs/2406.17557
What is it?
The 📚 FineWeb-Edu dataset consists of 1.3T tokens (with a larger 5.4T-token variant, FineWeb-Edu-score-2) of educational web pages filtered from the 🍷 FineWeb dataset. This is the 1.3-trillion-token version. To enhance FineWeb's quality, we developed an educational quality classifier using annotations generated by Llama-3-70B-Instruct. We then… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu.
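To make the thresholding concrete, here is a hedged sketch of the score-based filter: the 1.3T split keeps pages scoring at least 3, the score-2 variant at least 2, so the stricter subset can in principle be recovered from the looser one. The repo id and the "score" field name are assumptions taken from the dataset card; check the dataset viewer.

```python
# Hedged sketch: recover the stricter (score >= 3) subset from the score-2
# variant. Repo id and "score" field name are assumptions from the card.
from datasets import load_dataset

ds = load_dataset("HuggingFaceFW/fineweb-edu-score-2",
                  split="train", streaming=True)
strict = ds.filter(lambda row: row.get("score", 0) >= 3)
```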
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Ultra-FineWeb
📜 Technical Report
📚 Introduction
Ultra-FineWeb is a large-scale, high-quality, and efficiently-filtered dataset. We apply the proposed efficient verification-based high-quality filtering pipeline to the FineWeb and Chinese FineWeb datasets (source data from Chinese FineWeb-edu-v2, which includes IndustryCorpus2, MiChao, WuDao, SkyPile, WanJuan, ChineseWebText, TeleChat, and CCI3), resulting in the creation of higher-quality Ultra-FineWeb-en… See the full description on the dataset page: https://huggingface.co/datasets/openbmb/Ultra-FineWeb.
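The headline idea is cheap, verifiable filtering at web scale. As a rough illustration only (not the released pipeline), a lightweight fastText classifier can score pages quickly enough for CommonCrawl-scale corpora; the model path and label below are hypothetical placeholders.

```python
# Toy illustration of efficient quality filtering with fastText.
# "quality.bin" and the "__label__hq" label are hypothetical placeholders.
import fasttext

model = fasttext.load_model("quality.bin")

def keep(text: str, threshold: float = 0.5) -> bool:
    # fastText expects single-line input; predict returns (labels, probs).
    labels, probs = model.predict(text.replace("\n", " "), k=1)
    return labels[0] == "__label__hq" and probs[0] >= threshold
```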
Ornaments/fineweb dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
We recommend using the improved version, Fineweb-Edu-Chinese-V2.1!
Chinese Fineweb Edu Dataset V2
📖 Technical Report
The Chinese Fineweb Edu Dataset V2 is a comprehensive upgrade of the original Chinese Fineweb Edu, designed and optimized for natural language processing (NLP) tasks in the education sector. This high-quality Chinese pretraining dataset has undergone significant… See the full description on the dataset page: https://huggingface.co/datasets/opencsg/chinese-fineweb-edu-v2.
ODC-By 1.0: https://choosealicense.com/licenses/odc-by/
⭐ Please download the dataset from here.
PRIMUS: A Pioneering Collection of Open-Source Datasets for Cybersecurity LLM Training
🤗 Primus-FineWeb
The Primus-FineWeb dataset is constructed by filtering cybersecurity-related text from FineWeb, a refined version of Common Crawl. We began by leveraging Primus-Seed, a high-quality dataset of manually curated cybersecurity text, as positive samples. We then sampled ten times the amount of data from FineWeb as negative samples… See the full description on the dataset page: https://huggingface.co/datasets/trend-cybertron/Primus-FineWeb.
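A minimal sketch of that positive/negative setup follows, with a TF-IDF + logistic-regression stand-in for whatever classifier the authors actually trained; the example texts are invented placeholders.

```python
# Sketch: binary cybersecurity-vs-generic classifier in the Primus-FineWeb
# spirit. Stand-in model; the paper's actual classifier may differ.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

pos = ["hardening nginx TLS ciphers against downgrade attacks"]  # Primus-Seed style
neg = ["ten easy weeknight pasta recipes for busy families"]     # generic FineWeb

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(pos + neg, [1, 0])

# Keep FineWeb pages the model scores as cybersecurity-related.
print(clf.predict_proba(["patching a critical CVE in OpenSSL"])[:, 1])
```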
ODC-By 1.0: https://choosealicense.com/licenses/odc-by/
📚 fineweb-pro
ArXiv | Models | Code
fineweb-pro is refined from FineWeb (350BT sample) using the ProX refining framework. It contains about 100B high-quality tokens, ready for general language model pre-training.
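As a toy illustration only (ProX actually uses small language models to emit per-document refining programs), the flavor of a document-level refining operation looks roughly like this; the noise markers are invented.

```python
# Toy, hypothetical stand-in for a ProX-style document refining operation:
# drop empty and boilerplate-looking lines. Not the actual ProX programs.
NOISY = ("cookie policy", "subscribe", "all rights reserved")

def refine(doc: str) -> str:
    kept = [ln for ln in doc.splitlines()
            if ln.strip() and not any(m in ln.lower() for m in NOISY)]
    return "\n".join(kept)
```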
License
fineweb-pro is based on fineweb, which is made available under an ODC-By 1.0 license; users should also abide by the CommonCrawl ToU: https://commoncrawl.org/terms-of-use/. We do not alter the license of any of the underlying data.… See the full description on the dataset page: https://huggingface.co/datasets/gair-prox/FineWeb-pro.
FineWeb-C: Educational content in many languages, labelled by the community
Multilingual data is better together!
Note: We are not actively working on this project anymore. You can continue to contribute annotations and we'll occasionally refresh the exported data.
What is this?
FineWeb-C is a collaborative, community-driven project that expands upon the FineWeb2 dataset. The goal is to create high-quality educational content annotations across hundreds of… See the full description on the dataset page: https://huggingface.co/datasets/data-is-better-together/fineweb-c.
Occiglot Fineweb v1.0
We present a more mature version of the multilingual Occiglot Fineweb corpus. In this early form, the dataset contains roughly 430M heavily cleaned documents from 10 languages. Occiglot Fineweb builds on our existing collection of curated datasets and pre-filtered web data. Subsequently, all documents were filtered with language-specific derivatives of the FineWeb processing pipeline and deduplicated at different levels. We provide the data at 3 levels of… See the full description on the dataset page: https://huggingface.co/datasets/occiglot/occiglot-fineweb-v1.0.
ODC-By 1.0: https://choosealicense.com/licenses/odc-by/
ivnle/fineweb dataset hosted on Hugging Face and contributed by the HF Datasets community
ODC-By 1.0: https://choosealicense.com/licenses/odc-by/
Annotations for 📚 FineWeb-Edu classifier
This dataset contains the annotations used for training the 📚 FineWeb-Edu educational quality classifier. We prompt Llama-3-70B-Instruct to score web pages from 🍷 FineWeb based on their educational value. Note: the dataset contains the FineWeb text sample, the prompt (using the first 1000 characters of the text sample) and the scores, but it doesn't contain the full Llama 3 generation.
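The scoring setup can be sketched as follows; the prompt wording here is illustrative, not the released prompt.

```python
# Sketch of the annotation call: judge the first 1000 characters of a page.
# Illustrative prompt only; the released prompt text may differ.
PROMPT = ("Below is an extract from a web page. Rate its educational value "
          "for a student on a scale from 0 to 5 and briefly justify.\n\n"
          "Extract:\n{snippet}\n\nScore:")

def build_prompt(text: str) -> str:
    return PROMPT.format(snippet=text[:1000])
```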
dododo1234/fineweb dataset hosted on Hugging Face and contributed by the HF Datasets community
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
OpenCoder Dataset
The OpenCoder dataset is composed of the following datasets:
- opc-sft-stage1: the SFT data used for OpenCoder sft-stage1
- opc-sft-stage2: the SFT data used for OpenCoder sft-stage2
- opc-annealing-corpus: the synthetic data & algorithmic corpus used for OpenCoder annealing
- opc-fineweb-code-corpus: the code-related pages recalled from FineWeb <-- you are here
- opc-fineweb-math-corpus: the math-related pages recalled from FineWeb
- refineCode-code-corpus-meta: the meta-data… See the full description on the dataset page: https://huggingface.co/datasets/OpenCoder-LLM/opc-fineweb-code-corpus.
ODC-By 1.0: https://choosealicense.com/licenses/odc-by/
This is a subset of the FineWeb dataset trimmed down to approximately one billion tokens. No special frills. We sampled from the 10-billion-token subset to create this one.
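A hedged sketch of this kind of token-budget trimming, using the GPT-2 tokenizer; the sample-10BT config name and the output filename are assumptions for illustration.

```python
# Sketch: stream FineWeb's 10BT sample and stop at a ~1B-token budget.
# Config name "sample-10BT" is assumed from the dataset card.
from datasets import load_dataset
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
stream = load_dataset("HuggingFaceFW/fineweb", name="sample-10BT",
                      split="train", streaming=True)

budget = 1_000_000_000
with open("fineweb-1b.txt", "w") as out:
    for doc in stream:
        budget -= len(tok(doc["text"]).input_ids)
        out.write(doc["text"] + "\n")
        if budget <= 0:
            break
```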
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
A subset of FineWeb-Edu randomly sampled from the whole dataset, totalling around 1B GPT-2 tokens. This dataset was created for illustration purposes in retrieval-scaling. Please do not distribute.
ODC-By 1.0: https://choosealicense.com/licenses/odc-by/
Massive Genre-Audience Augment Fineweb-Edu Corpus
This dataset is a synthetic pretraining corpus described in the paper Reformulation for Pretraining Data Augmentation.
Overview of the synthesis framework: our method expands the original corpus through a two-stage synthesis process. Each document is reformulated into 5 new documents, achieving a 3.9× expansion in token count while maintaining diversity through massive (genre, audience) pairs.
We build MGACorpus based on SmolLM Corpus… See the full description on the dataset page: https://huggingface.co/datasets/ByteDance-Seed/mga-fineweb-edu.
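A hedged sketch of that two-stage reformulation loop: one rewrite per (genre, audience) pair, five variants per source document. The pairs and prompt are invented placeholders, and generate stands for any LLM text-generation callable, not the paper's actual pipeline.

```python
# Sketch of MGA-style reformulation: one rewrite per (genre, audience) pair,
# 5 variants per source document. Pairs and prompt are invented placeholders.
PAIRS = [("tutorial", "high-school students"), ("FAQ", "practitioners"),
         ("lecture notes", "undergraduates"), ("story", "children"),
         ("reference article", "domain researchers")]

def reformulate(doc: str, generate) -> list[str]:
    # `generate` is any text-generation callable (e.g., an LLM client).
    return [generate(f"Rewrite the following text as a {g} for {a}:\n\n{doc}")
            for g, a in PAIRS]
```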
ODC-By 1.0: https://choosealicense.com/licenses/odc-by/
Pre-shuffled fineweb-edu dataset
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Ultra FineWeb EDU
High-Quality Educational Content from Ultra-FineWeb Filtered for Maximum Educational Value
📚 Overview
Ultra FineWeb EDU is a premium educational dataset created by applying advanced educational content filtering to the exceptional Ultra-FineWeb dataset. This work builds directly upon two foundational achievements: the rigorous data curation methodology of Ultra-FineWeb and the sophisticated educational classification capabilities of the… See the full description on the dataset page: https://huggingface.co/datasets/ProCreations/Ultra-FineWeb-EDU.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Chinese Fineweb Edu Dataset V2.1
📖 Technical Report
The Chinese Fineweb Edu Dataset V2.1 is an enhanced version of the V2 dataset, designed specifically for natural language processing (NLP) tasks in the education sector. This version introduces two new data sources, map-cc and opencsg-cc, and retains data with scores ranging from 2 to 3. The dataset entries are organized into different folders… See the full description on the dataset page: https://huggingface.co/datasets/opencsg/Fineweb-Edu-Chinese-V2.1.