ODC-By: https://choosealicense.com/licenses/odc-by/
🍷 FineWeb
15 trillion tokens of the finest data the 🌐 web has to offer
What is it?
The 🍷 FineWeb dataset consists of more than 18.5T tokens (originally 15T tokens) of cleaned and deduplicated English web data from CommonCrawl. The data processing pipeline is optimized for LLM performance and was run with 🏭 datatrove, our large-scale data processing library. 🍷 FineWeb was originally meant to be a fully open replication of 🦅 RefinedWeb, with a release… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb.
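At this scale, streaming access is the practical way to inspect the data; a minimal sketch using the datasets library (the "sample-10BT" subset name is an assumption based on the smaller samples published on the card):

```python
from datasets import load_dataset

# Stream rather than download: the full dump spans tens of terabytes on disk.
# "sample-10BT" is assumed from the card's published samples; use the default
# config for the complete dataset.
fw = load_dataset("HuggingFaceFW/fineweb", name="sample-10BT",
                  split="train", streaming=True)
for doc in fw:
    print(doc["text"][:200])  # each record carries the cleaned page text
    break
```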
ODC-By: https://choosealicense.com/licenses/odc-by/
📚 FineWeb-Edu
1.3 trillion tokens of the finest educational data the 🌐 web has to offer
Paper: https://arxiv.org/abs/2406.17557
What is it?
The 📚 FineWeb-Edu dataset comes in two variants: 1.3T tokens (FineWeb-Edu) and 5.4T tokens (FineWeb-Edu-score-2) of educational web pages filtered from the 🍷 FineWeb dataset. This is the 1.3-trillion-token version. To enhance FineWeb's quality, we developed an educational-quality classifier using annotations generated by Llama3-70B-Instruct. We then… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu.
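As a rough sketch of how the scoring step could be reproduced, assuming the classifier checkpoint released alongside the dataset (HuggingFaceFW/fineweb-edu-classifier) and the standard transformers sequence-classification interface:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Assumed checkpoint id for the released educational-quality classifier.
ckpt = "HuggingFaceFW/fineweb-edu-classifier"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForSequenceClassification.from_pretrained(ckpt)

text = "Photosynthesis converts light energy into chemical energy in plants."
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    score = model(**inputs).logits.squeeze().item()
print(score)  # regression-style educational score, roughly on a 0-5 scale
```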
ODC-By: https://choosealicense.com/licenses/odc-by/
📀 Falcon RefinedWeb
Falcon RefinedWeb is a massive English web dataset built by TII and released under an ODC-By 1.0 license. See the 📓 paper on arXiv for more details. RefinedWeb is built through stringent filtering and large-scale deduplication of CommonCrawl; we found models trained on RefinedWeb to achieve performance in line with or better than models trained on curated datasets, while relying only on web data. RefinedWeb is also "multimodal-friendly": it contains links and alt… See the full description on the dataset page: https://huggingface.co/datasets/tiiuae/falcon-refinedweb.
Keiran Paster*, Marco Dos Santos*, Zhangir Azerbayev, Jimmy Ba. GitHub | ArXiv | PDF. OpenWebMath is a dataset containing the majority of the high-quality mathematical text from the internet. It is filtered and extracted from over 200B HTML files on Common Crawl down to a set of 6.3 million documents containing a total of 14.7B tokens. OpenWebMath is intended for use in pretraining and fine-tuning large language models. You can download the dataset using Hugging Face: from datasets import… See the full description on the dataset page: https://huggingface.co/datasets/open-web-math/open-web-math.
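The truncated snippet above presumably continues with the usual load_dataset call; a minimal sketch (streaming, so the 14.7B tokens need not be downloaded up front):

```python
from datasets import load_dataset

ds = load_dataset("open-web-math/open-web-math", split="train", streaming=True)
print(next(iter(ds)))  # one document record from the corpus
```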
Intelligent-Internet/ii-agent_gaia-benchmark_validation dataset hosted on Hugging Face and contributed by the HF Datasets community
Intelligent-Internet/OpenAI-HealthBench-II-Medical-8B-1706-GPT-4.1 dataset hosted on Hugging Face and contributed by the HF Datasets community
Intelligent-Internet/frames-benchmark dataset hosted on Hugging Face and contributed by the HF Datasets community
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
A dataset of files that robots tried to crawl through webfiddle.net. Mostly HTML files, but other file types too: PDFs, images, binary. I have no idea what is in here at this stage, but it gives an interesting idea of what crawlers like to visit, and it could be the basis of interesting SEO or coding-LLM research. Collected as part of my work on web simulators: https://webfiddle.net, a JS/CSS editor for the web, and https://websim.netwrck.com, a coding editor for the web. https://x.com/leeleepenkman Its… See the full description on the dataset page: https://huggingface.co/datasets/lee101/webfiddle-internet-raw-cache-dataset.
II-Thought RL v0: A Large-Scale Curated Dataset for Reinforcement Learning
See our blog here for additional details. We introduce II-Thought RL v0, the first large-scale, multi-task dataset designed for Reinforcement Learning. This dataset consists of high-quality question-answer pairs that have undergone a rigorous multi-step filtering process, leveraging Gemini 2.0 Flash and Qwen 32B as quality evaluators. In this initial release, we have curated and refined publicly available… See the full description on the dataset page: https://huggingface.co/datasets/Intelligent-Internet/II-Thought-RL-v0.
lukejagg/web-vision dataset hosted on Hugging Face and contributed by the HF Datasets community
shengqin/web-attacks-old dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
wikipedia_en
This is a curated English Wikipedia dataset for use with the II-Commons project.
Dataset Details
Dataset Description
This dataset comprises curated English Wikipedia pages. Data is sourced directly from the official English Wikipedia database dump. We extract the pages, chunk them into smaller pieces, and embed them using Snowflake/snowflake-arctic-embed-m-v2.0. All vector embeddings are 16-bit half-precision vectors optimized for cosine indexing… See the full description on the dataset page: https://huggingface.co/datasets/Intelligent-Internet/wikipedia_en.
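A sketch of the chunk-and-embed step described above, using sentence-transformers with the model the card names; the fixed 1,000-character chunking is an illustrative assumption, not the card's exact recipe:

```python
from sentence_transformers import SentenceTransformer

# Model named in the card; trust_remote_code may be needed for this architecture.
model = SentenceTransformer("Snowflake/snowflake-arctic-embed-m-v2.0",
                            trust_remote_code=True)

page = "..."  # one extracted Wikipedia page (placeholder)
# Illustrative fixed-size chunking; the card does not state its chunk size.
chunks = [page[i:i + 1000] for i in range(0, len(page), 1000)]

embeddings = model.encode(chunks, normalize_embeddings=True)
embeddings = embeddings.astype("float16")  # 16-bit halves for cosine indexing
```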
Intelligent-Internet/II-Thought-RL-v0-Math-50K dataset hosted on Hugging Face and contributed by the HF Datasets community
Other: https://choosealicense.com/licenses/other/
Dataset Card for A Corpus for Multilingual Analysis of Online Terms of Service
Dataset Summary
"We present the first annotated corpus for multilingual analysis of potentially unfair clauses in online Terms of Service [=ToS]. The data set comprises a total of 100 contracts, obtained from 25 documents annotated in four different languages: English, German, Italian, and Polish. For each contract, potentially unfair clauses for the consumer are annotated, for nine different… See the full description on the dataset page: https://huggingface.co/datasets/joelniklaus/online_terms_of_service.
Other: https://choosealicense.com/licenses/other/
The dataset comprises a manually curated selective archive produced by UKWA, which includes the classification of sites into a two-tiered subject hierarchy.
fixie-ai/spoken-web-questions dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
PD12M
This is a curated PD12M dataset for use with the II-Commons project.
Dataset Details
Dataset Description
This dataset comprises a curated Public Domain 12M image collection, refined by filtering for active image links. EXIF data was extracted, and images underwent preprocessing and feature extraction using SigLIP 2. All vector embeddings are normalized 16-bit half-precision vectors optimized for L2 indexing with vectorchord.
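A rough sketch of the described extraction step, assuming the transformers SigLIP interface; the exact SigLIP 2 checkpoint is not named on the card, so the id below is a placeholder assumption:

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

ckpt = "google/siglip2-base-patch16-224"  # assumed variant; the card says only "SigLIP 2"
model = AutoModel.from_pretrained(ckpt)
processor = AutoProcessor.from_pretrained(ckpt)

image = Image.open("example.jpg")  # placeholder path
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    feats = model.get_image_features(**inputs)
# Normalized 16-bit half-precision vectors for L2 indexing, as described.
feats = torch.nn.functional.normalize(feats, dim=-1).half()
```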
Dataset Sources… See the full description on the dataset page: https://huggingface.co/datasets/Intelligent-Internet/pd12m.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Web-Bench
📖 Overview
Web-Bench is a benchmark designed to evaluate the performance of LLMs in actual Web development. Web-Bench contains 50 projects, each consisting of 20 tasks with sequential dependencies. The tasks implement project features in sequence, simulating real-world human development workflows. When designing Web-Bench, we aim to cover the foundational elements of Web development: Web Standards and Web Frameworks. Given the scale and… See the full description on the dataset page: https://huggingface.co/datasets/bytedance-research/Web-Bench.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Tiny WebText
The Tiny WebText dataset is designed to help models learn about perception on web text while neutralizing the bias of the source text using critical thinking methods. By providing a rich and diverse set of texts, I aim to improve the ability of models to understand and analyze information in a more objective and unbiased manner. This dataset can be used to train and evaluate natural language processing and machine learning models, with the goal of improving their… See the full description on the dataset page: https://huggingface.co/datasets/nampdn-ai/tiny-webtext.
Dataset Card for gov/trec-web-2003
The gov/trec-web-2003 dataset, provided by the ir-datasets package. For more information about the dataset, see the documentation.
Data
This dataset provides:
queries (i.e., topics); count=50
qrels (i.e., relevance assessments); count=51,062
For docs, use irds/gov
Usage
from datasets import load_dataset
queries = load_dataset('irds/gov_trec-web-2003', 'queries')
for record in queries:
    record  # {'query_id': ...
… See the full description on the dataset page: https://huggingface.co/datasets/irds/gov_trec-web-2003.
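The qrels load the same way; a short sketch mirroring the snippet above (field names assumed from the ir-datasets qrels schema):

```python
from datasets import load_dataset

# Relevance assessments; the documents themselves come from irds/gov.
qrels = load_dataset('irds/gov_trec-web-2003', 'qrels')
for record in qrels:
    record  # assumed fields: query_id, doc_id, relevance
```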