100+ datasets found
  1. fineweb

    • huggingface.co
    FineData, fineweb [Dataset]. http://doi.org/10.57967/hf/2493
    Dataset authored and provided by
    FineData
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    🍷 FineWeb

    15 trillion tokens of the finest data the 🌐 web has to offer

      What is it?
    

    The 🍷 FineWeb dataset consists of more than 18.5T tokens (originally 15T tokens) of cleaned and deduplicated English web data from CommonCrawl. The data processing pipeline is optimized for LLM performance and ran on the 🏭 datatrove library, our large scale data processing library. 🍷 FineWeb was originally meant to be a fully open replication of 🦅 RefinedWeb, with a release… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb.

  2. fineweb-edu

    • huggingface.co
    Updated Jan 3, 2025
    FineData (2025). fineweb-edu [Dataset]. http://doi.org/10.57967/hf/2497
    Dataset updated
    Jan 3, 2025
    Dataset authored and provided by
    FineData
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    📚 FineWeb-Edu

    1.3 trillion tokens of the finest educational data the 🌐 web has to offer

    Paper: https://arxiv.org/abs/2406.17557

      What is it?
    

    The 📚 FineWeb-Edu dataset comes in two sizes, 1.3T tokens and 5.4T tokens (FineWeb-Edu-score-2), both consisting of educational web pages filtered from the 🍷 FineWeb dataset. This is the 1.3 trillion version. To enhance FineWeb's quality, we developed an educational quality classifier using annotations generated by Llama3-70B-Instruct. We then… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu.
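The two sizes mentioned above suggest a score cut-off applied to classifier output. The following is a minimal sketch of that kind of threshold filtering, not the authors' pipeline: the field name `score`, the sample documents, and the cut-off values 3 and 2 are all assumptions made for illustration (the value 2 is only hinted at by the name FineWeb-Edu-score-2).

```python
def filter_by_score(docs, min_score):
    """Keep documents whose (hypothetical) educational score meets the cut-off."""
    return [d for d in docs if d["score"] >= min_score]


# Hypothetical documents with hypothetical classifier scores.
docs = [
    {"text": "intro to photosynthesis", "score": 4},
    {"text": "celebrity gossip", "score": 1},
    {"text": "forum thread on fractions", "score": 2},
]

strict = filter_by_score(docs, 3)  # smaller, higher-scoring subset
loose = filter_by_score(docs, 2)   # larger subset with a lower bar
```

A lower cut-off keeps more pages, which is consistent with the score-2 variant being the larger of the two.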

  3. falcon-refinedweb

    • huggingface.co
    • opendatalab.com
    Technology Innovation Institute, falcon-refinedweb [Dataset]. http://doi.org/10.57967/hf/0737
    Dataset authored and provided by
    Technology Innovation Institute
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    📀 Falcon RefinedWeb

    Falcon RefinedWeb is a massive English web dataset built by TII and released under an ODC-By 1.0 license. See the 📓 paper on arXiv for more details. RefinedWeb is built through stringent filtering and large-scale deduplication of CommonCrawl; we found models trained on RefinedWeb to achieve performance in line with or better than models trained on curated datasets, while relying only on web data. RefinedWeb is also "multimodal-friendly": it contains links and alt… See the full description on the dataset page: https://huggingface.co/datasets/tiiuae/falcon-refinedweb.

  4. open-web-math

    • huggingface.co
    open-web-math, open-web-math [Dataset]. https://huggingface.co/datasets/open-web-math/open-web-math
    Dataset authored and provided by
    open-web-math
    Description

    Keiran Paster*, Marco Dos Santos*, Zhangir Azerbayev, Jimmy Ba

    GitHub | ArXiv | PDF

    OpenWebMath is a dataset containing the majority of the high-quality, mathematical text from the internet. It is filtered and extracted from over 200B HTML files on Common Crawl down to a set of 6.3 million documents containing a total of 14.7B tokens. OpenWebMath is intended for use in pretraining and finetuning large language models. You can download the dataset using Hugging Face: from datasets import… See the full description on the dataset page: https://huggingface.co/datasets/open-web-math/open-web-math.
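The download snippet in the description is cut off after `from datasets import`. A minimal sketch of how loading might look with the Hugging Face `datasets` library follows; it assumes the library is installed and that the repository exposes a `train` split, which the listing does not confirm. The helper names `stream_open_web_math` and `average_tokens_per_document` are hypothetical. The second helper only does arithmetic on the figures quoted above.

```python
from itertools import islice


def stream_open_web_math(limit=3):
    """Yield the first `limit` records without downloading the full dataset."""
    # Imported lazily so the module loads even where `datasets` is not installed.
    from datasets import load_dataset

    # Assumption: the repository has a "train" split.
    ds = load_dataset("open-web-math/open-web-math", split="train", streaming=True)
    yield from islice(ds, limit)


def average_tokens_per_document(total_tokens=14.7e9, documents=6.3e6):
    """Average document length implied by the figures in the description."""
    return total_tokens / documents
```

Iterating over `stream_open_web_math()` needs network access. With the quoted figures, `average_tokens_per_document()` works out to roughly 2,300 tokens per document.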

  5. ii-agent_gaia-benchmark_validation

    • huggingface.co
    Updated Jun 1, 2025
    II (2025). ii-agent_gaia-benchmark_validation [Dataset]. https://huggingface.co/datasets/Intelligent-Internet/ii-agent_gaia-benchmark_validation
    Dataset updated
    Jun 1, 2025
    Dataset authored and provided by
    II
    Description

    Intelligent-Internet/ii-agent_gaia-benchmark_validation dataset hosted on Hugging Face and contributed by the HF Datasets community

  6. OpenAI-HealthBench-II-Medical-8B-1706-GPT-4.1

    • huggingface.co
    Updated Jun 17, 2025
    II (2025). OpenAI-HealthBench-II-Medical-8B-1706-GPT-4.1 [Dataset]. https://huggingface.co/datasets/Intelligent-Internet/OpenAI-HealthBench-II-Medical-8B-1706-GPT-4.1
    Dataset updated
    Jun 17, 2025
    Dataset authored and provided by
    II
    Description

    Intelligent-Internet/OpenAI-HealthBench-II-Medical-8B-1706-GPT-4.1 dataset hosted on Hugging Face and contributed by the HF Datasets community

  7. frames-benchmark

    • huggingface.co
    Updated Mar 28, 2025
    II (2025). frames-benchmark [Dataset]. https://huggingface.co/datasets/Intelligent-Internet/frames-benchmark
    Dataset updated
    Mar 28, 2025
    Dataset authored and provided by
    II
    Description

    Intelligent-Internet/frames-benchmark dataset hosted on Hugging Face and contributed by the HF Datasets community

  8. webfiddle-internet-raw-cache-dataset

    • huggingface.co
    Updated Jul 3, 2025
    Lee Penkman (2025). webfiddle-internet-raw-cache-dataset [Dataset]. https://huggingface.co/datasets/lee101/webfiddle-internet-raw-cache-dataset
    Dataset updated
    Jul 3, 2025
    Authors
    Lee Penkman
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    A dataset of different files that robots tried to crawl through webfiddle.net. Mostly HTML files, but other files too: PDFs, images, binaries. I have no idea what is in here at this stage, but it gives an interesting idea of what crawlers like to visit and could be the basis of interesting SEO or coding LLM research. Collected as part of my work on web simulators. https://webfiddle.net JS/CSS editor for the web, https://websim.netwrck.com Coding Editor for the web. https://x.com/leeleepenkman Its… See the full description on the dataset page: https://huggingface.co/datasets/lee101/webfiddle-internet-raw-cache-dataset.

  9. II-Thought-RL-v0

    • huggingface.co
    Updated Jun 1, 2025
    II (2025). II-Thought-RL-v0 [Dataset]. https://huggingface.co/datasets/Intelligent-Internet/II-Thought-RL-v0
    Dataset updated
    Jun 1, 2025
    Dataset authored and provided by
    II
    Description

    II-Thought RL v0: A Large-Scale Curated Dataset for Reinforcement Learning

    See our blog here for additional details. We introduce II-Thought RL v0, the first large-scale, multi-task dataset designed for Reinforcement Learning. This dataset consists of high-quality question-answer pairs that have undergone a rigorous multi-step filtering process, leveraging Gemini 2.0 Flash and Qwen 32B as quality evaluators. In this initial release, we have curated and refined publicly available… See the full description on the dataset page: https://huggingface.co/datasets/Intelligent-Internet/II-Thought-RL-v0.

  10. Data from: web-vision

    • huggingface.co
    Updated Mar 31, 2024
    Lucas Jaggernauth (2024). web-vision [Dataset]. https://huggingface.co/datasets/lukejagg/web-vision
    Dataset updated
    Mar 31, 2024
    Authors
    Lucas Jaggernauth
    Description

    lukejagg/web-vision dataset hosted on Hugging Face and contributed by the HF Datasets community

  11. web-attacks-old

    • huggingface.co
    Updated Nov 8, 2023
    simshengqin (2023). web-attacks-old [Dataset]. https://huggingface.co/datasets/shengqin/web-attacks-old
    Dataset updated
    Nov 8, 2023
    Authors
    simshengqin
    Description

    shengqin/web-attacks-old dataset hosted on Hugging Face and contributed by the HF Datasets community

  12. wikipedia_en

    • huggingface.co
    II, wikipedia_en [Dataset]. https://huggingface.co/datasets/Intelligent-Internet/wikipedia_en
    Dataset authored and provided by
    II
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    wikipedia_en

    This is a curated Wikipedia English dataset for use with the II-Commons project.

      Dataset Details
    
    
    
    
    
      Dataset Description
    

    This dataset comprises curated English Wikipedia pages, sourced directly from the official English Wikipedia database dump. We extract the pages, chunk them into smaller pieces, and embed them using Snowflake/snowflake-arctic-embed-m-v2.0. All vector embeddings are 16-bit half-precision vectors optimized for cosine indexing… See the full description on the dataset page: https://huggingface.co/datasets/Intelligent-Internet/wikipedia_en.
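To illustrate why 16-bit half-precision storage tends to be acceptable for cosine search, here is a minimal standard-library sketch, not the project's code. It rounds a vector to IEEE 754 half precision and compares it with the original by cosine similarity. The vector values and the helper names `to_half` and `cosine_similarity` are hypothetical.

```python
import math
import struct


def to_half(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embedding values, stored once in full and once in half precision.
full = [0.1234567, -0.7654321, 0.3333333]
half = [to_half(x) for x in full]
similarity = cosine_similarity(full, half)
```

Each component loses a few decimal digits, but the direction of the vector barely changes, so `similarity` stays very close to 1.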

  13. II-Thought-RL-v0-Math-50K

    • huggingface.co
    Updated Jun 1, 2025
    II (2025). II-Thought-RL-v0-Math-50K [Dataset]. https://huggingface.co/datasets/Intelligent-Internet/II-Thought-RL-v0-Math-50K
    Dataset updated
    Jun 1, 2025
    Dataset authored and provided by
    II
    Description

    Intelligent-Internet/II-Thought-RL-v0-Math-50K dataset hosted on Hugging Face and contributed by the HF Datasets community

  14. online_terms_of_service

    • huggingface.co
    Updated Jan 23, 2021
    Joel Niklaus (2021). online_terms_of_service [Dataset]. https://huggingface.co/datasets/joelniklaus/online_terms_of_service
    Dataset updated
    Jan 23, 2021
    Authors
    Joel Niklaus
    License

    https://choosealicense.com/licenses/other/

    Description

    Dataset Card for A Corpus for Multilingual Analysis of Online Terms of Service

      Dataset Summary
    

    "We present the first annotated corpus for multilingual analysis of potentially unfair clauses in online Terms of Service [=ToS]. The data set comprises a total of 100 contracts, obtained from 25 documents annotated in four different languages: English, German, Italian, and Polish. For each contract, potentially unfair clauses for the consumer are annotated, for nine different… See the full description on the dataset page: https://huggingface.co/datasets/joelniklaus/online_terms_of_service.

  15. web_archive_classification

    • huggingface.co
    Updated Mar 28, 2025
    British Library (2025). web_archive_classification [Dataset]. https://huggingface.co/datasets/TheBritishLibrary/web_archive_classification
    Dataset updated
    Mar 28, 2025
    Dataset authored and provided by
    British Library
    License

    https://choosealicense.com/licenses/other/

    Description

    The dataset comprises a manually curated selective archive produced by UKWA which includes the classification of sites into a two-tiered subject hierarchy.

  16. spoken-web-questions

    • huggingface.co
    Updated Sep 20, 2024
    Ultravox.ai (2024). spoken-web-questions [Dataset]. https://huggingface.co/datasets/fixie-ai/spoken-web-questions
    Dataset updated
    Sep 20, 2024
    Dataset provided by
    Ultravox.ai
    Description

    fixie-ai/spoken-web-questions dataset hosted on Hugging Face and contributed by the HF Datasets community

  17. pd12m

    • huggingface.co
    II, pd12m [Dataset]. https://huggingface.co/datasets/Intelligent-Internet/pd12m
    Dataset authored and provided by
    II
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    PD12M

    This is a curated PD12M dataset for use with the II-Commons project.

      Dataset Details
    
    
    
    
    
      Dataset Description
    

    This dataset comprises a curated Public Domain 12M image collection, refined by filtering for active image links. EXIF data was extracted, and images underwent preprocessing and feature extraction using SigLIP 2. All vector embeddings are normalized 16-bit half-precision vectors optimized for L2 indexing with vectorchord.

      Dataset Sources… See the full description on the dataset page: https://huggingface.co/datasets/Intelligent-Internet/pd12m.
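The description pairs normalized vectors with L2 indexing. One reason that pairing works: for unit-length vectors, squared L2 distance equals 2 minus twice the dot product, so ranking by L2 distance matches ranking by cosine similarity. The sketch below checks that identity on hypothetical vectors. It is a standard-library illustration and says nothing about how vectorchord is implemented.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dot(a, b):
    """Dot product; equals cosine similarity when both vectors are unit length."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical feature vectors, normalized before comparison.
a = normalize([3.0, 4.0, 0.0])
b = normalize([1.0, 2.0, 2.0])

# For unit vectors: ||a - b||^2 == 2 - 2 * (a . b)
lhs = squared_l2(a, b)
rhs = 2 - 2 * dot(a, b)
```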
    
  18. Web-Bench

    • huggingface.co
    Updated May 12, 2025
    bytedance-research (2025). Web-Bench [Dataset]. https://huggingface.co/datasets/bytedance-research/Web-Bench
    Dataset updated
    May 12, 2025
    Dataset provided by
    ByteDance (https://www.bytedance.com/)
    Authors
    bytedance-research
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Web-Bench


      📖 Overview
    

    Web-Bench is a benchmark designed to evaluate the performance of LLMs in actual Web development. Web-Bench contains 50 projects, each consisting of 20 tasks with sequential dependencies. The tasks implement project features in sequence, simulating real-world human development workflows. When designing Web-Bench, we aim to cover the foundational elements of Web development: Web Standards and Web Frameworks. Given the scale and… See the full description on the dataset page: https://huggingface.co/datasets/bytedance-research/Web-Bench.
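As a rough illustration of "tasks with sequential dependencies", the sketch below counts how many tasks in a project pass before the first failure, on the assumption that a later task cannot meaningfully succeed once an earlier one has failed. This scoring rule and the function name are assumptions for illustration and not Web-Bench's documented metric. Only the counts of 50 projects and 20 tasks come from the description.

```python
PROJECTS = 50            # from the description
TASKS_PER_PROJECT = 20   # from the description
TOTAL_TASKS = PROJECTS * TASKS_PER_PROJECT


def tasks_completed_in_sequence(results):
    """Count consecutive passing tasks from the start of a project.

    `results` is an ordered list of booleans, one per task. Counting stops
    at the first failure because later tasks build on earlier ones.
    """
    completed = 0
    for passed in results:
        if not passed:
            break
        completed += 1
    return completed
```

For example, a hypothetical run that passes tasks 1 and 2, fails task 3, and passes task 4 counts as two completed tasks under this rule.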

  19. tiny-webtext

    • huggingface.co
    Updated Jan 26, 2024
    Nam Pham (2024). tiny-webtext [Dataset]. http://doi.org/10.57967/hf/1024
    Dataset updated
    Jan 26, 2024
    Authors
    Nam Pham
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Tiny WebText

    The Tiny WebText dataset is designed to help models learn about perception on web text while neutralizing the bias of the source text using critical thinking methods. By providing a rich and diverse set of texts, I aim to improve the ability of models to understand and analyze information in a more objective and unbiased manner. This dataset can be used to train and evaluate natural language processing and machine learning models, with the goal of improving their… See the full description on the dataset page: https://huggingface.co/datasets/nampdn-ai/tiny-webtext.

  20. gov_trec-web-2003

    • huggingface.co
    ir-datasets, gov_trec-web-2003 [Dataset]. https://huggingface.co/datasets/irds/gov_trec-web-2003
    Dataset authored and provided by
    ir-datasets
    Description

    Dataset Card for gov/trec-web-2003

    The gov/trec-web-2003 dataset, provided by the ir-datasets package. For more information about the dataset, see the documentation.

      Data
    

    This dataset provides:

    queries (i.e., topics); count=50

    qrels: (relevance assessments); count=51,062

    For docs, use irds/gov

      Usage
    

    from datasets import load_dataset

    queries = load_dataset('irds/gov_trec-web-2003', 'queries')
    for record in queries:
        record # {'query_id': ...…

    See the full description on the dataset page: https://huggingface.co/datasets/irds/gov_trec-web-2003.
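The card lists a `qrels` configuration alongside `queries`. A minimal sketch of how the relevance assessments might be grouped per query follows. The field names `query_id`, `doc_id`, and `relevance` are assumptions based on common ir-datasets conventions, not confirmed by the truncated card, and the sample identifiers are made up. `load_qrels` mirrors the `load_dataset` call from the card and needs the `datasets` library plus network access.

```python
from collections import defaultdict


def group_qrels_by_query(qrels):
    """Map each query_id to a {doc_id: relevance} dictionary."""
    judged = defaultdict(dict)
    for row in qrels:
        judged[row["query_id"]][row["doc_id"]] = int(row["relevance"])
    return dict(judged)


def load_qrels():
    """Load the qrels configuration; imported lazily so the module loads offline."""
    from datasets import load_dataset

    return load_dataset("irds/gov_trec-web-2003", "qrels")


# Hypothetical records shaped like the assumed qrels schema.
sample = [
    {"query_id": "Q1", "doc_id": "D-001", "relevance": 1},
    {"query_id": "Q1", "doc_id": "D-002", "relevance": 0},
    {"query_id": "Q2", "doc_id": "D-001", "relevance": 1},
]
grouped = group_qrels_by_query(sample)
```

Grouping by query makes it straightforward to look up the judged documents for each of the 50 topics when scoring a ranked list.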

FineData, fineweb [Dataset]. http://doi.org/10.57967/hf/2493

fineweb

FineWeb

HuggingFaceFW/fineweb

93 scholarly articles cite this dataset (View in Google Scholar)
Dataset authored and provided by
FineData
License

https://choosealicense.com/licenses/odc-by/
