ODC-By: https://choosealicense.com/licenses/odc-by/
🍷 FineWeb
15 trillion tokens of the finest data the 🌐 web has to offer
What is it?
The 🍷 FineWeb dataset consists of more than 18.5T tokens (originally 15T tokens) of cleaned and deduplicated English web data from CommonCrawl. The data processing pipeline is optimized for LLM performance and was run with 🏭 datatrove, our large-scale data processing library. 🍷 FineWeb was originally meant to be a fully open replication of 🦅 RefinedWeb, with a release… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb.
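At this scale, streaming access is the practical way to inspect the data; a minimal sketch using the datasets library (the "sample-10BT" subset name is an assumption based on the smaller samples published on the card):

```python
from datasets import load_dataset

# Stream rather than download: the full dump spans tens of terabytes on disk.
# "sample-10BT" is assumed from the card's published samples; use the default
# config for the complete dataset.
fw = load_dataset("HuggingFaceFW/fineweb", name="sample-10BT",
                  split="train", streaming=True)
for doc in fw:
    print(doc["text"][:200])  # each record carries the cleaned page text
    break
```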
ODC-By: https://choosealicense.com/licenses/odc-by/
📚 FineWeb-Edu
1.3 trillion tokens of the finest educational data the 🌐 web has to offer
Paper: https://arxiv.org/abs/2406.17557
What is it?
The 📚 FineWeb-Edu dataset comes in two variants: 1.3T tokens (FineWeb-Edu) and 5.4T tokens (FineWeb-Edu-score-2) of educational web pages filtered from the 🍷 FineWeb dataset. This is the 1.3-trillion-token version. To enhance FineWeb's quality, we developed an educational-quality classifier using annotations generated by Llama3-70B-Instruct. We then… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu.
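As a rough sketch of how the scoring step could be reproduced, assuming the classifier checkpoint released alongside the dataset (HuggingFaceFW/fineweb-edu-classifier) and the standard transformers sequence-classification interface:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Assumed checkpoint id for the released educational-quality classifier.
ckpt = "HuggingFaceFW/fineweb-edu-classifier"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForSequenceClassification.from_pretrained(ckpt)

text = "Photosynthesis converts light energy into chemical energy in plants."
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    score = model(**inputs).logits.squeeze().item()
print(score)  # regression-style educational score, roughly on a 0-5 scale
```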
ODC-By: https://choosealicense.com/licenses/odc-by/
📀 Falcon RefinedWeb
Falcon RefinedWeb is a massive English web dataset built by TII and released under an ODC-By 1.0 license. See the 📓 paper on arXiv for more details. RefinedWeb is built through stringent filtering and large-scale deduplication of CommonCrawl; we found models trained on RefinedWeb to achieve performance in line with or better than models trained on curated datasets, while relying only on web data. RefinedWeb is also "multimodal-friendly": it contains links and alt… See the full description on the dataset page: https://huggingface.co/datasets/tiiuae/falcon-refinedweb.
Keiran Paster*, Marco Dos Santos*, Zhangir Azerbayev, Jimmy Ba. GitHub | ArXiv | PDF. OpenWebMath is a dataset containing the majority of the high-quality mathematical text from the internet. It is filtered and extracted from over 200B HTML files on Common Crawl down to a set of 6.3 million documents containing a total of 14.7B tokens. OpenWebMath is intended for use in pretraining and fine-tuning large language models. You can download the dataset using Hugging Face: from datasets import… See the full description on the dataset page: https://huggingface.co/datasets/open-web-math/open-web-math.
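The truncated snippet above presumably continues with the usual load_dataset call; a minimal sketch (streaming, so the 14.7B tokens need not be downloaded up front):

```python
from datasets import load_dataset

ds = load_dataset("open-web-math/open-web-math", split="train", streaming=True)
print(next(iter(ds)))  # one document record from the corpus
```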
Intelligent-Internet/ii-agent_gaia-benchmark_validation dataset hosted on Hugging Face and contributed by the HF Datasets community
Intelligent-Internet/OpenAI-HealthBench-II-Medical-8B-1706-GPT-4.1 dataset hosted on Hugging Face and contributed by the HF Datasets community
Intelligent-Internet/frames-benchmark dataset hosted on Hugging Face and contributed by the HF Datasets community
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
A dataset of files that robots tried to crawl through webfiddle.net. Mostly HTML files, but other file types too: PDFs, images, binary. I have no idea what is in here at this stage, but it gives an interesting idea of what crawlers like to visit, and it could be the basis of interesting SEO or coding-LLM research. Collected as part of my work on web simulators: https://webfiddle.net, a JS/CSS editor for the web, and https://websim.netwrck.com, a coding editor for the web. https://x.com/leeleepenkman Its… See the full description on the dataset page: https://huggingface.co/datasets/lee101/webfiddle-internet-raw-cache-dataset.
II-Thought RL v0: A Large-Scale Curated Dataset for Reinforcement Learning
See our blog here for additional details. We introduce II-Thought RL v0, the first large-scale, multi-task dataset designed for Reinforcement Learning. This dataset consists of high-quality question-answer pairs that have undergone a rigorous multi-step filtering process, leveraging Gemini 2.0 Flash and Qwen 32B as quality evaluators. In this initial release, we have curated and refined publicly available… See the full description on the dataset page: https://huggingface.co/datasets/Intelligent-Internet/II-Thought-RL-v0.
lukejagg/web-vision dataset hosted on Hugging Face and contributed by the HF Datasets community
shengqin/web-attacks-old dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
wikipedia_en
This is a curated English Wikipedia dataset for use with the II-Commons project.
Dataset Details
Dataset Description
This dataset comprises curated English Wikipedia pages. Data is sourced directly from the official English Wikipedia database dump. We extract the pages, chunk them into smaller pieces, and embed them using Snowflake/snowflake-arctic-embed-m-v2.0. All vector embeddings are 16-bit half-precision vectors optimized for cosine indexing… See the full description on the dataset page: https://huggingface.co/datasets/Intelligent-Internet/wikipedia_en.
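A sketch of the chunk-and-embed step described above, using sentence-transformers with the model the card names; the fixed 1,000-character chunking is an illustrative assumption, not the card's exact recipe:

```python
from sentence_transformers import SentenceTransformer

# Model named in the card; trust_remote_code may be needed for this architecture.
model = SentenceTransformer("Snowflake/snowflake-arctic-embed-m-v2.0",
                            trust_remote_code=True)

page = "..."  # one extracted Wikipedia page (placeholder)
# Illustrative fixed-size chunking; the card does not state its chunk size.
chunks = [page[i:i + 1000] for i in range(0, len(page), 1000)]

embeddings = model.encode(chunks, normalize_embeddings=True)
embeddings = embeddings.astype("float16")  # 16-bit halves for cosine indexing
```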
Intelligent-Internet/II-Thought-RL-v0-Math-50K dataset hosted on Hugging Face and contributed by the HF Datasets community
Other: https://choosealicense.com/licenses/other/
Dataset Card for A Corpus for Multilingual Analysis of Online Terms of Service
Dataset Summary
"We present the first annotated corpus for multilingual analysis of potentially unfair clauses in online Terms of Service [=ToS]. The data set comprises a total of 100 contracts, obtained from 25 documents annotated in four different languages: English, German, Italian, and Polish. For each contract, potentially unfair clauses for the consumer are annotated, for nine different… See the full description on the dataset page: https://huggingface.co/datasets/joelniklaus/online_terms_of_service.
Other: https://choosealicense.com/licenses/other/
The dataset comprises a manually curated selective archive produced by UKWA, which includes the classification of sites into a two-tiered subject hierarchy.
fixie-ai/spoken-web-questions dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
PD12M
This is a curated PD12M dataset for use with the II-Commons project.
Dataset Details
Dataset Description
This dataset comprises a curated Public Domain 12M image collection, refined by filtering for active image links. EXIF data was extracted, and images underwent preprocessing and feature extraction using SigLIP 2. All vector embeddings are normalized 16-bit half-precision vectors optimized for L2 indexing with vectorchord.
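A rough sketch of the described extraction step, assuming the transformers SigLIP interface; the exact SigLIP 2 checkpoint is not named on the card, so the id below is a placeholder assumption:

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

ckpt = "google/siglip2-base-patch16-224"  # assumed variant; the card says only "SigLIP 2"
model = AutoModel.from_pretrained(ckpt)
processor = AutoProcessor.from_pretrained(ckpt)

image = Image.open("example.jpg")  # placeholder path
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    feats = model.get_image_features(**inputs)
# Normalized 16-bit half-precision vectors for L2 indexing, as described.
feats = torch.nn.functional.normalize(feats, dim=-1).half()
```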
Dataset Sources… See the full description on the dataset page: https://huggingface.co/datasets/Intelligent-Internet/pd12m.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Web-Bench
📖 Overview
Web-Bench is a benchmark designed to evaluate the performance of LLMs in actual Web development. Web-Bench contains 50 projects, each consisting of 20 tasks with sequential dependencies. The tasks implement project features in sequence, simulating real-world human development workflows. When designing Web-Bench, we aim to cover the foundational elements of Web development: Web Standards and Web Frameworks. Given the scale and… See the full description on the dataset page: https://huggingface.co/datasets/bytedance-research/Web-Bench.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Tiny WebText
The Tiny WebText dataset is designed to help models learn about perception on web text while neutralizing the bias of the source text using critical thinking methods. By providing a rich and diverse set of texts, I aim to improve the ability of models to understand and analyze information in a more objective and unbiased manner. This dataset can be used to train and evaluate natural language processing and machine learning models, with the goal of improving their… See the full description on the dataset page: https://huggingface.co/datasets/nampdn-ai/tiny-webtext.
Dataset Card for gov/trec-web-2003
The gov/trec-web-2003 dataset, provided by the ir-datasets package. For more information about the dataset, see the documentation.
Data
This dataset provides:
queries (i.e., topics); count=50
qrels (i.e., relevance assessments); count=51,062
For docs, use irds/gov
Usage
from datasets import load_dataset
queries = load_dataset('irds/gov_trec-web-2003', 'queries')
for record in queries:
    record  # {'query_id': ...
… See the full description on the dataset page: https://huggingface.co/datasets/irds/gov_trec-web-2003.
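The qrels load the same way; a short sketch mirroring the snippet above (field names assumed from the ir-datasets qrels schema):

```python
from datasets import load_dataset

# Relevance assessments; the documents themselves come from irds/gov.
qrels = load_dataset('irds/gov_trec-web-2003', 'qrels')
for record in qrels:
    record  # assumed fields: query_id, doc_id, relevance
```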