100+ datasets found
  1. fineweb

    • huggingface.co
    Cite
    FineData, fineweb [Dataset]. http://doi.org/10.57967/hf/2493
    Dataset authored and provided by
    FineData
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    🍷 FineWeb

    15 trillion tokens of the finest data the 🌐 web has to offer

      What is it?
    

    The 🍷 FineWeb dataset consists of more than 15T tokens of cleaned and deduplicated English web data from CommonCrawl. The data processing pipeline is optimized for LLM performance and ran on the 🏭 datatrove library, our large-scale data processing library. 🍷 FineWeb was originally meant to be a fully open replication of 🦅 RefinedWeb, with a release of the full dataset under… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb.

  2. FineWeb Dataset

    • paperswithcode.com
    Updated May 27, 2025
    Cite
    (2025). FineWeb Dataset [Dataset]. https://paperswithcode.com/dataset/fineweb
    Dataset updated
    May 27, 2025
    Description

    The FineWeb dataset consists of more than 15T tokens of cleaned and deduplicated English web data from CommonCrawl. The data processing pipeline is optimized for LLM performance and runs on the datatrove library, our large-scale data processing library.

    FineWeb was originally meant to be a fully open replication of RefinedWeb, with a release of the full dataset under the ODC-By 1.0 license. However, by carefully adding additional filtering steps, we managed to push the performance of FineWeb well above that of the original RefinedWeb, and models trained on our dataset also outperform models trained on other commonly used high-quality web datasets (like C4, Dolma-v1.6, The Pile, SlimPajama, RedPajama2) on our aggregate group of benchmark tasks.

  3. fineweb-edu

    • huggingface.co
    Updated Jan 31, 2025
    + more versions
    Cite
    FineData (2025). fineweb-edu [Dataset]. http://doi.org/10.57967/hf/2497
    Dataset updated
    Jan 31, 2025
    Dataset authored and provided by
    FineData
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    📚 FineWeb-Edu

    1.3 trillion tokens of the finest educational data the 🌐 web has to offer

    Paper: https://arxiv.org/abs/2406.17557

      What is it?
    

    The 📚 FineWeb-Edu dataset consists of 1.3T tokens and 5.4T tokens (FineWeb-Edu-score-2) of educational web pages filtered from the 🍷 FineWeb dataset. This is the 1.3 trillion version. To enhance FineWeb's quality, we developed an educational quality classifier using annotations generated by Llama3-70B-Instruct. We then… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu.

  4. fineweb-2

    • huggingface.co
    Updated Apr 26, 2012
    Cite
    FineData, fineweb-2 [Dataset]. http://doi.org/10.57967/hf/3744
    Dataset updated
    Apr 26, 2012
    Dataset authored and provided by
    FineData
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    🥂 FineWeb2

    A sparkling update with 1000s of languages

      What is it?
    

    This is the second iteration of the popular 🍷 FineWeb dataset, bringing high quality pretraining data to over 1000 🗣️ languages. The 🥂 FineWeb2 dataset is fully reproducible, available under the permissive ODC-By 1.0 license and extensively validated through hundreds of ablation experiments. In particular, on the set of 9 diverse languages we used to guide our processing decisions, 🥂 FineWeb2… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb-2.

  5. Fineweb-c

    • sprogteknologi.dk
    Updated Jan 30, 2025
    Cite
    Privatperson (2025). Fineweb-c [Dataset]. https://sprogteknologi.dk/dataset/fineweb-c
    Available download formats: parquet (http://publications.europa.eu/resource/authority/file-type/parquet)
    Dataset updated
    Jan 30, 2025
    Dataset provided by
    Privatperson
    License

    Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
    License information was derived automatically

    Description

    FineWeb-C: Educational content in many languages, labelled by the community

    This is a link to the Danish part of the dataset.

    This is a collaborative, community-driven project that expands upon the FineWeb2 dataset. Our goal is to create high-quality educational content annotations across hundreds of languages.

    By enhancing web content with these annotations, we aim to improve the development of Large Language Models (LLMs) in all languages, making AI technology more accessible and effective globally.

    The annotations in this dataset will help train AI systems to automatically identify high-quality educational content in more languages and in turn help build better Large Language Models for all languages.

    What the community is doing:

    • For a given language, look at a page of web content from the FineWeb2 dataset in Argilla.
    • Rate how educational the content is.
    • Flag problematic content, i.e. content that is malformed or in the wrong language.

    Once a language reaches 1,000 annotations, it will be included in this dataset! Alongside rating the educational quality of the content, different language communities are discussing other ways to improve the quality of data for their language in our Discord discussion channel.

    The use of this dataset is also subject to CommonCrawl's Terms of Use.

  6. fineweb-edu-llama3-annotations

    • huggingface.co
    Updated Jun 8, 2024
    Cite
    FineData (2024). fineweb-edu-llama3-annotations [Dataset]. https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu-llama3-annotations
    Dataset updated
    Jun 8, 2024
    Dataset authored and provided by
    FineData
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    Annotations for 📚 FineWeb-Edu classifier

    This dataset contains the annotations used for training the 📚 FineWeb-Edu educational quality classifier. We prompt Llama-3-70B-Instruct to score web pages from 🍷 FineWeb based on their educational value. Note: the dataset contains the FineWeb text sample, the prompt (using the first 1000 characters of the text sample), and the scores, but it doesn't contain the full Llama 3 generation.
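
The note about prompts using only the first 1000 characters of each sample can be illustrated with a minimal sketch. The template text and the names `PROMPT_TEMPLATE` and `build_prompt` are hypothetical and are not the actual FineWeb-Edu annotation prompt; only the 1000-character cutoff comes from the description above.

```python
# Hypothetical template: the real FineWeb-Edu annotation prompt is not reproduced here.
PROMPT_TEMPLATE = "Rate the educational value of this extract:\n\n{extract}\n\nScore:"

# From the dataset description: the prompt uses the first 1000 characters of the sample.
MAX_CHARS = 1000


def build_prompt(text: str, max_chars: int = MAX_CHARS) -> str:
    """Build an annotation prompt from at most `max_chars` leading characters of `text`."""
    return PROMPT_TEMPLATE.format(extract=text[:max_chars])


long_sample = "z" * 2500          # longer than the cutoff, so it gets truncated
prompt = build_prompt(long_sample)
```

Because the full model generation is not stored, a consumer of the dataset would only see the truncated prompt and the resulting score.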

  7. Ultra-FineWeb

    • huggingface.co
    Cite
    OpenBMB, Ultra-FineWeb [Dataset]. https://huggingface.co/datasets/openbmb/Ultra-FineWeb
    Dataset authored and provided by
    OpenBMB
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Ultra-FineWeb

    📜 Technical Report

      📚 Introduction
    

    Ultra-FineWeb is a large-scale, high-quality, and efficiently-filtered dataset. We apply the proposed efficient verification-based high-quality filtering pipeline to the FineWeb and Chinese FineWeb datasets (source data from Chinese FineWeb-edu-v2, which includes IndustryCorpus2, MiChao, WuDao, SkyPile, WanJuan, ChineseWebText, TeleChat, and CCI3), resulting in the creation of higher-quality Ultra-FineWeb-en… See the full description on the dataset page: https://huggingface.co/datasets/openbmb/Ultra-FineWeb.

  8. Primus-FineWeb

    • huggingface.co
    Updated Mar 10, 2025
    + more versions
    Cite
    Trend Cybertron (Trend Micro) (2025). Primus-FineWeb [Dataset]. https://huggingface.co/datasets/trend-cybertron/Primus-FineWeb
    Dataset updated
    Mar 10, 2025
    Dataset provided by
    Trend Micro (http://trendmicro.com/)
    Authors
    Trend Cybertron (Trend Micro)
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    ⭐ Please download the dataset from here.

      PRIMUS: A Pioneering Collection of Open-Source Datasets for Cybersecurity LLM Training
    
    
    
    
    
      🤗 Primus-FineWeb
    

    The Primus-FineWeb dataset is constructed by filtering cybersecurity-related text from FineWeb, a refined version of Common Crawl. We began by leveraging Primus-Seed, a high-quality dataset of manually curated cybersecurity text, as positive samples. We then sampled ten times the amount of data from FineWeb as negative samples… See the full description on the dataset page: https://huggingface.co/datasets/trend-cybertron/Primus-FineWeb.
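
The sampling step described above (curated positives plus ten times as many negatives drawn from FineWeb) might be sketched as follows. This is an illustrative assumption, not the Primus authors' code: `build_training_set`, the label values, and the fixed seed are all hypothetical, and the description is truncated before it says how the samples are used.

```python
import random


def build_training_set(positives, pool, ratio=10, seed=0):
    """Pair each positive with `ratio` times as many randomly drawn negatives.

    `positives` stands in for curated cybersecurity texts and `pool` for
    general FineWeb documents; both are plain lists of strings here.
    """
    rng = random.Random(seed)                      # fixed seed for reproducibility
    n_neg = min(len(pool), ratio * len(positives)) # cannot draw more than the pool holds
    negatives = rng.sample(pool, n_neg)            # sampling without replacement
    return [(text, 1) for text in positives] + [(text, 0) for text in negatives]


positives = ["cve writeup", "malware analysis", "phishing report"]
pool = [f"web page {i}" for i in range(100)]
training_set = build_training_set(positives, pool)
```

With three positives and the default ratio, the result holds 33 labelled examples: 3 with label 1 and 30 with label 0.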

  9. fineweb-ultra-mini-pro

    • huggingface.co
    Updated Feb 7, 2025
    + more versions
    Cite
    Reflex AI (2025). fineweb-ultra-mini-pro [Dataset]. https://huggingface.co/datasets/reflex-ai/fineweb-ultra-mini-pro
    Dataset updated
    Feb 7, 2025
    Dataset authored and provided by
    Reflex AI
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    Dataset Card for Fineweb Ultra Mini

    Fineweb Ultra Mini is a dataset derived from the original Fineweb dataset made by Hugging Face (see https://huggingface.co/datasets/HuggingFaceFW/fineweb). The dataset focuses on extracting high-quality data from the Fineweb dataset, from the 1-0.5% range. If you would like more data at a slight cost in quality, check out fineweb ultra mini, which focuses on the 2-3% of high-quality data originally found in fineweb… See the full description on the dataset page: https://huggingface.co/datasets/reflex-ai/fineweb-ultra-mini-pro.

  10. fineweb

    • huggingface.co
    Cite
    Prime Intellect, fineweb [Dataset]. https://huggingface.co/datasets/PrimeIntellect/fineweb
    Dataset provided by
    Prime Intellect, Inc.
    Authors
    Prime Intellect
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    Pre-shuffled fineweb dataset

  11. scandi-fine-web-cleaner

    • sprogteknologi.dk
    Updated Jan 14, 2025
    Cite
    Privatperson (2025). scandi-fine-web-cleaner [Dataset]. https://sprogteknologi.dk/dataset/scandi-fine-web-cleaner
    Available download formats: html (http://publications.europa.eu/resource/authority/file-type/html)
    Dataset updated
    Jan 14, 2025
    Dataset provided by
    Privatperson
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This model is a demo classifier for identifying problematic content (wrong language, garbled text) in Danish and Swedish web texts. The model was developed as part of a blog post exploring how to filter web data using community-based annotations. The model is fine-tuned from FacebookAI/xlm-roberta-base and trained on the dataset data-is-better-together/fineweb-c.

    It achieves the following results on the evaluation set:

    Precision: 0.9524 (95.2%)

    Recall: 0.7018 (70.2%)

    F1: 0.8081

    AUC-ROC: 0.9648

    Purpose and limitations: The model is intended to serve as an initial filter for web texts in order to improve the efficiency of the annotation process. It has only been tested on Danish and Swedish content. The high precision (95.2%) means that false positives are rare, while the recall (70.2%) indicates that the model catches the majority of the problematic content.
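
As a quick consistency check, the reported F1 can be recomputed from the reported precision (0.9524) and recall (0.7018), since F1 is their harmonic mean. The helper name `f1_from_precision_recall` is only illustrative.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values taken from the evaluation results listed above.
f1 = f1_from_precision_recall(0.9524, 0.7018)
print(round(f1, 4))  # 0.8081
```

The recomputed value agrees with the listed F1 of 0.8081.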

  12. fineweb-c

    • huggingface.co
    Updated Dec 24, 2024
    Cite
    Data Is Better Together (2024). fineweb-c [Dataset]. https://huggingface.co/datasets/data-is-better-together/fineweb-c
    Dataset updated
    Dec 24, 2024
    Dataset authored and provided by
    Data Is Better Together
    Description

    FineWeb-C: Educational content in many languages, labelled by the community

    Multilingual data is better together!

    Note: we're not actively supporting this effort anymore but you can continue to contribute annotations and we'll occasionally refresh the exported data.

      What is this?
    

    FineWeb-C was a collaborative, community-driven project that expands upon the FineWeb2 dataset. The goal is to create high-quality educational content annotations across hundreds… See the full description on the dataset page: https://huggingface.co/datasets/data-is-better-together/fineweb-c.

  13. fineweb

    • huggingface.co
    Cite
    Ornaments, fineweb [Dataset]. https://huggingface.co/datasets/Ornaments/fineweb
    Dataset authored and provided by
    Ornaments
    Description

    Ornaments/fineweb dataset hosted on Hugging Face and contributed by the HF Datasets community

  14. chinese-fineweb-edu-v2

    • huggingface.co
    Updated May 22, 2025
    + more versions
    Cite
    opencsg (2025). chinese-fineweb-edu-v2 [Dataset]. https://huggingface.co/datasets/opencsg/chinese-fineweb-edu-v2
    Dataset updated
    May 22, 2025
    Dataset authored and provided by
    opencsg
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    We recommend using the improved version, Fineweb-edu-chinese-v2.1!

      Chinese Fineweb Edu Dataset V2

    📖 Technical Report

    Chinese Fineweb Edu Dataset V2 is a comprehensive upgrade of the original Chinese Fineweb Edu, designed and optimized for natural language processing (NLP) tasks in the education sector. This high-quality Chinese pretraining dataset has undergone significant… See the full description on the dataset page: https://huggingface.co/datasets/opencsg/chinese-fineweb-edu-v2.

  15. fineweb-edu

    • huggingface.co
    Updated Sep 1, 2012
    Cite
    Prime Intellect (2012). fineweb-edu [Dataset]. https://huggingface.co/datasets/PrimeIntellect/fineweb-edu
    Dataset updated
    Sep 1, 2012
    Dataset provided by
    Authors
    Prime Intellect
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    Pre-shuffled fineweb-edu dataset

  16. occiglot-fineweb-v1.0

    • huggingface.co
    Updated Dec 9, 2024
    + more versions
    Cite
    Occiglot (2024). occiglot-fineweb-v1.0 [Dataset]. https://huggingface.co/datasets/occiglot/occiglot-fineweb-v1.0
    Dataset updated
    Dec 9, 2024
    Dataset authored and provided by
    Occiglot
    Description

    Occiglot Fineweb v1.0

    We present a more mature version of the multilingual Occiglot Fineweb corpus. In this early form, the dataset contains roughly 430M heavily cleaned documents from 10 languages. Occiglot Fineweb builds on our existing collection of curated datasets and pre-filtered web data. Subsequently, all documents were filtered with language-specific derivatives of the fine-web processing pipeline and deduplicated at different levels. We provide the data at 3 levels of… See the full description on the dataset page: https://huggingface.co/datasets/occiglot/occiglot-fineweb-v1.0.

  17. FineWeb-pro

    • huggingface.co
    Updated Apr 30, 2021
    Cite
    GAIR-ProX (2021). FineWeb-pro [Dataset]. https://huggingface.co/datasets/gair-prox/FineWeb-pro
    Dataset updated
    Apr 30, 2021
    Dataset authored and provided by
    GAIR-ProX
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    📚 fineweb-pro

    fineweb-pro is refined from fineweb (350BT sample) using the ProX refining framework. It contains about 100B high-quality tokens, ready for general language model pre-training.

      License
    

    fineweb-pro is based on fineweb, which is made available under an ODC-By 1.0 license; users should also abide by the CommonCrawl ToU: https://commoncrawl.org/terms-of-use/. We do not alter the license of any of the underlying data… See the full description on the dataset page: https://huggingface.co/datasets/gair-prox/FineWeb-pro.

  18. FineWeb-Edu-1MT

    • huggingface.co
    Updated Sep 1, 2011
    + more versions
    Cite
    Rulin Shao (2011). FineWeb-Edu-1MT [Dataset]. https://huggingface.co/datasets/rulins/FineWeb-Edu-1MT
    Dataset updated
    Sep 1, 2011
    Authors
    Rulin Shao
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    A subset of FineWeb-Edu of around 1M GPT-2 tokens, randomly sampled from the whole dataset. This dataset was created for illustration purposes in retrieval-scaling. Please do not distribute.

  19. fineweb-c-progress

    • huggingface.co
    Updated Dec 11, 2024
    Cite
    Data Is Better Together (2024). fineweb-c-progress [Dataset]. https://huggingface.co/datasets/data-is-better-together/fineweb-c-progress
    Dataset updated
    Dec 11, 2024
    Dataset authored and provided by
    Data Is Better Together
    Description

    data-is-better-together/fineweb-c-progress dataset hosted on Hugging Face and contributed by the HF Datasets community

  20. fineweb-edu-fortified

    • huggingface.co
    Updated Jul 23, 2024
    + more versions
    Cite
    Airtrain AI (2024). fineweb-edu-fortified [Dataset]. https://huggingface.co/datasets/airtrain-ai/fineweb-edu-fortified
    Dataset updated
    Jul 23, 2024
    Dataset authored and provided by
    Airtrain AI
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    Fineweb-Edu-Fortified

    The composition of fineweb-edu-fortified, produced by automatically clustering a 500k row sample in Airtrain

      What is it?
    

    Fineweb-Edu-Fortified is a dataset derived from Fineweb-Edu by applying exact-match deduplication across the whole dataset and producing an embedding for each row. The number of times the text from each row appears is also included as a count column. The embeddings were produced using TaylorAI/bge-micro Fineweb and… See the full description on the dataset page: https://huggingface.co/datasets/airtrain-ai/fineweb-edu-fortified.
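
Exact-match deduplication with an occurrence count, as described above, can be sketched in a few lines. This is a toy in-memory illustration, not Airtrain's pipeline: the function name `dedup_with_counts` and the `text` field name are assumptions, and only the `count` column is mentioned in the description.

```python
from collections import Counter


def dedup_with_counts(texts):
    """Collapse exact duplicates and record how often each text appeared.

    Counter keeps keys in first-seen order (Python 3.7+), so the output
    preserves the order in which each distinct text first occurred.
    """
    counts = Counter(texts)
    return [{"text": text, "count": n} for text, n in counts.items()]


rows = ["a", "b", "a", "c", "a"]
deduped = dedup_with_counts(rows)
```

Exact matching only merges byte-identical rows; near-duplicates that differ by whitespace or minor edits would remain separate under this approach.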

Cite
FineData, fineweb [Dataset]. http://doi.org/10.57967/hf/2493

fineweb

FineWeb

HuggingFaceFW/fineweb

84 scholarly articles cite this dataset
