7 datasets found
  1. fineweb-2

    • huggingface.co
    Updated Oct 26, 2020
    Cite
    FineData (2020). fineweb-2 [Dataset]. http://doi.org/10.57967/hf/3744
    Explore at:
    109 scholarly articles cite this dataset
    Dataset updated
    Oct 26, 2020
    Dataset authored and provided by
    FineData
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    🥂 FineWeb2

    A sparkling update with 1000s of languages

      What is it?
    

    This is the second iteration of the popular 🍷 FineWeb dataset, bringing high-quality pretraining data to over 1000 🗣️ languages. The 🥂 FineWeb2 dataset is fully reproducible, available under the permissive ODC-By 1.0 license, and extensively validated through hundreds of ablation experiments. In particular, on the set of 9 diverse languages we used to guide our processing decisions, 🥂 FineWeb2… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb-2.

  2. fineweb

    • huggingface.co
    Cite
    FineData, fineweb [Dataset]. http://doi.org/10.57967/hf/2493
    Explore at:
    Dataset authored and provided by
    FineData
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    🍷 FineWeb

    15 trillion tokens of the finest data the 🌐 web has to offer

      What is it?
    

    The 🍷 FineWeb dataset consists of more than 18.5T tokens (originally 15T tokens) of cleaned and deduplicated English web data from CommonCrawl. The data processing pipeline is optimized for LLM performance and ran on the 🏭 datatrove library, our large-scale data processing library. 🍷 FineWeb was originally meant to be a fully open replication of 🦅 RefinedWeb, with a release… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb.

  3. fineweb-edu

    • huggingface.co
    Updated Jan 3, 2025
    + more versions
    Cite
    FineData (2025). fineweb-edu [Dataset]. http://doi.org/10.57967/hf/2497
    Explore at:
    Dataset updated
    Jan 3, 2025
    Dataset authored and provided by
    FineData
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    📚 FineWeb-Edu

    1.3 trillion tokens of the finest educational data the 🌐 web has to offer

    Paper: https://arxiv.org/abs/2406.17557

      What is it?
    

    The 📚 FineWeb-Edu dataset consists of 1.3T tokens and 5.4T tokens (FineWeb-Edu-score-2) of educational web pages filtered from the 🍷 FineWeb dataset. This is the 1.3 trillion token version. To enhance FineWeb's quality, we developed an educational quality classifier using annotations generated by Llama3-70B-Instruct. We then… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu.

  4. fineweb-edu-10BT-for-gpt2

    • kaggle.com
    zip
    Updated Jul 20, 2024
    Cite
    Minh-Thien Nguyen (2024). fineweb-edu-10BT-for-gpt2 [Dataset]. https://www.kaggle.com/datasets/minhthiennguyen/fineweb-edu-10bt-for-gpt2
    Explore at:
    Available download formats: zip (13,769,081,319 bytes)
    Dataset updated
    Jul 20, 2024
    Authors
    Minh-Thien Nguyen
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    This is the FineWeb-Edu 10BT subset tokenized with the GPT-2 tokenizer for pre-training a GPT-2 model. The data is divided into shards (.npy files); each training shard contains 2e8 tokens, and the test shard contains roughly 1.5e8 tokens.

    For the Fineweb version, please refer to fineweb-10BT-for-gpt2.

    Each .npy file can be loaded with numpy.load('file_name.npy').
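    The loading step described above can be sketched as follows. The shard filename and the uint16 dtype are assumptions for illustration (not taken from the dataset page); the snippet writes a tiny synthetic shard of GPT-2 token IDs first so the load is runnable without the 13 GB download.

    ```python
    import numpy as np

    # Hypothetical shard name -- the real dataset ships shards as .npy files,
    # but the exact naming scheme here is an assumption.
    shard_path = "shard_000.npy"

    # Synthetic stand-in shard: random GPT-2 token ids (vocab size 50257).
    # uint16 is a common choice for GPT-2 ids since 50257 < 65536.
    rng = np.random.default_rng(0)
    np.save(shard_path, rng.integers(0, 50257, size=1024, dtype=np.uint16))

    # Loading works exactly as the description states:
    tokens = np.load(shard_path)
    print(tokens.shape, tokens.dtype)  # (1024,) uint16
    ```

    For training, one would typically concatenate or iterate over shards and slice out fixed-length context windows.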

  5. fineweb-2-duckdbs

    • huggingface.co
    Updated Jan 28, 2025
    + more versions
    Cite
    Bram Vanroy (2025). fineweb-2-duckdbs [Dataset]. https://huggingface.co/datasets/BramVanroy/fineweb-2-duckdbs
    Explore at:
    Dataset updated
    Jan 28, 2025
    Authors
    Bram Vanroy
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    DuckDB datasets for (dump, id) querying on FineWeb 2

    This repo contains DuckDB databases for checking whether a given WARC UID exists in a FineWeb-2 dump. A usage example is given on the dataset page; note especially that if you are working with URNs (likely, if you are working with CommonCrawl data), you first have to extract the UID, since the id column is of type UUID in the databases.
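    The URN-to-UID extraction mentioned above can be sketched with only the standard library. CommonCrawl WARC record IDs look like `<urn:uuid:...>`; Python's `uuid.UUID` constructor already understands the `urn:uuid:` prefix, so only the angle brackets need stripping. The sample URN and the query at the end (table/column names) are assumptions for illustration, not taken from the repo.

    ```python
    import uuid

    def urn_to_uid(urn: str) -> uuid.UUID:
        """Extract the bare UUID from a WARC-Record-ID URN like <urn:uuid:...>."""
        # uuid.UUID accepts the "urn:uuid:" prefix natively; strip the brackets.
        return uuid.UUID(urn.strip("<>"))

    uid = urn_to_uid("<urn:uuid:0526ac18-fb44-4809-9285-9079f97c6da2>")
    print(uid)  # 0526ac18-fb44-4809-9285-9079f97c6da2

    # The existence check itself would then be a DuckDB query along these
    # lines, bound to str(uid) -- names here are hypothetical:
    # SELECT EXISTS (SELECT 1 FROM data WHERE id = ?)
    ```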

      Download
    

    All files: huggingface-cli download BramVanroy/fineweb-2-duckdbs --local-dir… See the full description on the dataset page: https://huggingface.co/datasets/BramVanroy/fineweb-2-duckdbs.

  6. Primus-FineWeb

    • huggingface.co
    Updated Aug 9, 2025
    + more versions
    Cite
    Trend Cybertron (Trend Micro) (2025). Primus-FineWeb [Dataset]. https://huggingface.co/datasets/trend-cybertron/Primus-FineWeb
    Explore at:
    Dataset updated
    Aug 9, 2025
    Dataset provided by
    Trend Micro (http://trendmicro.com/)
    Authors
    Trend Cybertron (Trend Micro)
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    ⭐ Please download the dataset from here.

      PRIMUS: A Pioneering Collection of Open-Source Datasets for Cybersecurity LLM Training

      🤗 Primus-FineWeb

    The Primus-FineWeb dataset is constructed by filtering cybersecurity-related text from FineWeb, a refined version of Common Crawl. We began by leveraging Primus-Seed, a high-quality dataset of manually curated cybersecurity text, as positive samples. We then sampled ten times the amount of data from FineWeb as negative samples… See the full description on the dataset page: https://huggingface.co/datasets/trend-cybertron/Primus-FineWeb.

  7. smollm-corpus

    • huggingface.co
    Updated Jul 16, 2024
    + more versions
    Cite
    Hugging Face Smol Models Research (2024). smollm-corpus [Dataset]. https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Jul 16, 2024
    Dataset provided by
    Hugging Face (https://huggingface.co/)
    Authors
    Hugging Face Smol Models Research
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    SmolLM-Corpus

    This dataset is a curated collection of high-quality educational and synthetic data designed for training small language models. You can find more details about the models trained on this dataset in our SmolLM blog post.

      Dataset subsets

      Cosmopedia v2
    

    Cosmopedia v2 is an enhanced version of Cosmopedia, the largest synthetic dataset for pre-training, consisting of over 39 million textbooks, blog posts, and stories generated by… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus.

