40 datasets found
  1.

    fineweb-edu-1M

    • huggingface.co
    Cite
    TEL-LLM, fineweb-edu-1M [Dataset]. https://huggingface.co/datasets/TEL-LLM/fineweb-edu-1M
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset authored and provided by
    TEL-LLM
    Description

    TEL-LLM/fineweb-edu-1M dataset hosted on Hugging Face and contributed by the HF Datasets community

  2.

    fineweb-ultra-mini

    • huggingface.co
    Updated Feb 7, 2025
    Cite
    Reflex AI (2025). fineweb-ultra-mini [Dataset]. https://huggingface.co/datasets/reflex-ai/fineweb-ultra-mini
    Explore at:
    Croissant
    Dataset updated
    Feb 7, 2025
    Dataset authored and provided by
    Reflex AI
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    Dataset Card for Fineweb Ultra Mini

    Fineweb Ultra Mini is a dataset derived from the original Fineweb dataset made by Hugging Face (see https://huggingface.co/datasets/HuggingFaceFW/fineweb). The dataset focuses on extracting high-quality data from Fineweb, from the 2-3% range. If you would like even more high-quality data, keep an eye out for our next release, fineweb ultra mini pro, which focuses on the 0-1% of high-quality data originally found in Fineweb.… See the full description on the dataset page: https://huggingface.co/datasets/reflex-ai/fineweb-ultra-mini.

  3.

    fineweb-edu-2024-10-from-5M-to-6M-ko

    • huggingface.co
    Updated Nov 1, 2024
    Cite
    Suzie Oh (2024). fineweb-edu-2024-10-from-5M-to-6M-ko [Dataset]. https://huggingface.co/datasets/ohsuz/fineweb-edu-2024-10-from-5M-to-6M-ko
    Dataset updated
    Nov 1, 2024
    Authors
    Suzie Oh
    Description

    ohsuz/fineweb-edu-2024-10-from-5M-to-6M-ko dataset hosted on Hugging Face and contributed by the HF Datasets community

  4.

    fineweb-edu-2024-10-from-1M-to-2M-ko

    • huggingface.co
    Updated Nov 1, 2024
    Cite
    Suzie Oh (2024). fineweb-edu-2024-10-from-1M-to-2M-ko [Dataset]. https://huggingface.co/datasets/ohsuz/fineweb-edu-2024-10-from-1M-to-2M-ko
    Dataset updated
    Nov 1, 2024
    Authors
    Suzie Oh
    Description

    ohsuz/fineweb-edu-2024-10-from-1M-to-2M-ko dataset hosted on Hugging Face and contributed by the HF Datasets community

  5.

    fineweb-edu-Llama-3.2-Instruct-Shuffled

    • huggingface.co
    Updated Feb 21, 2025
    Cite
    fineweb-edu-Llama-3.2-Instruct-Shuffled [Dataset]. https://huggingface.co/datasets/evinsi/fineweb-edu-Llama-3.2-Instruct-Shuffled
    Explore at:
    Croissant
    Dataset updated
    Feb 21, 2025
    Authors
    NI
    Description

    evinsi/fineweb-edu-Llama-3.2-Instruct-Shuffled dataset hosted on Hugging Face and contributed by the HF Datasets community

  6.

    fineweb-edu-2024-10-from-2M-to-3M-ko

    • huggingface.co
    Updated Nov 1, 2024
    Cite
    Suzie Oh (2024). fineweb-edu-2024-10-from-2M-to-3M-ko [Dataset]. https://huggingface.co/datasets/ohsuz/fineweb-edu-2024-10-from-2M-to-3M-ko
    Dataset updated
    Nov 1, 2024
    Authors
    Suzie Oh
    Description

    ohsuz/fineweb-edu-2024-10-from-2M-to-3M-ko dataset hosted on Hugging Face and contributed by the HF Datasets community

  7.

    test-fineweb-h1_2025_02_14_15

    • huggingface.co
    Updated Feb 14, 2025
    Cite
    Tobias Kim (2025). test-fineweb-h1_2025_02_14_15 [Dataset]. https://huggingface.co/datasets/tobiashomie/test-fineweb-h1_2025_02_14_15
    Explore at:
    Croissant
    Dataset updated
    Feb 14, 2025
    Authors
    Tobias Kim
    Description

    tobiashomie/test-fineweb-h1_2025_02_14_15 dataset hosted on Hugging Face and contributed by the HF Datasets community

  8.

    fineweb-edu-2024-10-from-7M-to-8M

    • huggingface.co
    Cite
    Suzie Oh, fineweb-edu-2024-10-from-7M-to-8M [Dataset]. https://huggingface.co/datasets/ohsuz/fineweb-edu-2024-10-from-7M-to-8M
    Explore at:
    Croissant
    Authors
    Suzie Oh
    Description

    ohsuz/fineweb-edu-2024-10-from-7M-to-8M dataset hosted on Hugging Face and contributed by the HF Datasets community

  9.

    fineweb-edu-dedup-45B-1-of-4

    • huggingface.co
    Cite
    skymizer, fineweb-edu-dedup-45B-1-of-4 [Dataset]. https://huggingface.co/datasets/skymizer/fineweb-edu-dedup-45B-1-of-4
    Explore at:
    Croissant
    Dataset authored and provided by
    skymizer
    Description

    skymizer/fineweb-edu-dedup-45B-1-of-4 dataset hosted on Hugging Face and contributed by the HF Datasets community

  10.

    fineweb

    • huggingface.co
    Cite
    Dodo, fineweb [Dataset]. https://huggingface.co/datasets/dododo1234/fineweb
    Explore at:
    Croissant
    Authors
    Dodo
    Description

    dododo1234/fineweb dataset hosted on Hugging Face and contributed by the HF Datasets community

  11.

    fineweb-sample-2

    • huggingface.co
    Updated Dec 31, 2002
    Cite
    fineweb-sample-2 [Dataset]. https://huggingface.co/datasets/mekaneeky/fineweb-sample-2
    Explore at:
    Croissant
    Dataset updated
    Dec 31, 2002
    Authors
    Ali
    Description

    mekaneeky/fineweb-sample-2 dataset hosted on Hugging Face and contributed by the HF Datasets community

  12.

    fineweb-edu-sample-500K

    • huggingface.co
    Updated Apr 11, 2004
    Cite
    CCDE (2004). fineweb-edu-sample-500K [Dataset]. https://huggingface.co/datasets/ccde/fineweb-edu-sample-500K
    Dataset updated
    Apr 11, 2004
    Dataset authored and provided by
    CCDE
    Description

    ccde/fineweb-edu-sample-500K dataset hosted on Hugging Face and contributed by the HF Datasets community

  13.

    fineweb-c-prelim

    • huggingface.co
    Updated Aug 15, 2014
    Cite
    Daniel Vila (2014). fineweb-c-prelim [Dataset]. https://huggingface.co/datasets/dvilasuero/fineweb-c-prelim
    Explore at:
    Croissant
    Dataset updated
    Aug 15, 2014
    Authors
    Daniel Vila
    Description

    dvilasuero/fineweb-c-prelim dataset hosted on Hugging Face and contributed by the HF Datasets community

  14.

    fineweb-2

    • aifasthub.com
    • huggingface.co
    Updated Feb 11, 2025
    Cite
    HuggingFaceFW (2025). fineweb-2 [Dataset]. http://doi.org/10.57967/hf/3744
    Explore at:
    Croissant
    Dataset updated
    Feb 11, 2025
    Dataset authored and provided by
    HuggingFaceFW
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    🥂 FineWeb2

    A sparkling update with 1000s of languages

    What is it?

    This is the second iteration of the popular 🍷 FineWeb dataset, bringing high quality pretraining data to over 1000 🗣️ languages. The 🥂 FineWeb2 dataset is fully reproducible, available under the permissive ODC-By 1.0 license and extensively validated through hundreds of ablation experiments. In particular, on the set of 9 diverse languages we used to guide our processing decisions, 🥂 FineWeb2… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb-2.

  15.

    fineweb-edu-1bt

    • huggingface.co
    Cite
    fineweb-edu-1bt [Dataset]. https://huggingface.co/datasets/mikasenghaas/fineweb-edu-1bt
    Explore at:
    Croissant
    Authors
    Mika Senghaas
    Description

    mikasenghaas/fineweb-edu-1bt dataset hosted on Hugging Face and contributed by the HF Datasets community

  16.

    opc-fineweb-math-corpus

    • huggingface.co
    Updated Nov 15, 2024
    Cite
    OpenCoder (2024). opc-fineweb-math-corpus [Dataset]. https://huggingface.co/datasets/OpenCoder-LLM/opc-fineweb-math-corpus
    Explore at:
    Croissant
    Dataset updated
    Nov 15, 2024
    Dataset authored and provided by
    OpenCoder
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    OpenCoder Dataset

    The OpenCoder dataset is composed of the following datasets:

    • opc-sft-stage1: the sft data used for opencoder sft-stage1
    • opc-sft-stage2: the sft data used for opencoder sft-stage2
    • opc-annealing-corpus: the synthetic data & algorithmic corpus used for opencoder annealing
    • opc-fineweb-code-corpus: the code-related page recalled from fineweb
    • opc-fineweb-math-corpus: the math-related page recalled from fineweb <-- you are here
    • refineCode-code-corpus-meta: the…

    See the full description on the dataset page: https://huggingface.co/datasets/OpenCoder-LLM/opc-fineweb-math-corpus.

  17.

    homeo-dataset

    • hf-proxy-cf.effarig.site
    • huggingface.co
    Updated Oct 15, 2021
    Cite
    Akhil Singh (2021). homeo-dataset [Dataset]. https://hf-proxy-cf.effarig.site/datasets/akhilhsingh/homeo-dataset
    Dataset updated
    Oct 15, 2021
    Authors
    Akhil Singh
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    🍷 FineWeb

    15 trillion tokens of the finest data the 🌐 web has to offer

    What is it?

    The 🍷 FineWeb dataset consists of more than 15T tokens of cleaned and deduplicated English web data from CommonCrawl. The data processing pipeline is optimized for LLM performance and ran on the 🏭 datatrove library, our large-scale data processing library. 🍷 FineWeb was originally meant to be a fully open replication of 🦅 RefinedWeb, with a release of the full… See the full description on the dataset page: https://huggingface.co/datasets/akhilhsingh/homeo-dataset.

  18.

    instructions-from-fineweb-edu

    • huggingface.co
    Cite
    instructions-from-fineweb-edu [Dataset]. https://huggingface.co/datasets/gabrielmbmb/instructions-from-fineweb-edu
    Explore at:
    Croissant
    Authors
    Gabriel Martín Blázquez
    Description

    Dataset Card for instructions-from-fineweb-edu

    This dataset has been created with distilabel.

    Dataset Summary

    This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it in distilabel using the distilabel CLI: distilabel pipeline run --config "https://huggingface.co/datasets/gabrielmbmb/instructions-from-fineweb-edu/raw/main/pipeline.yaml"

    or explore the configuration: distilabel pipeline info --config… See the full description on the dataset page: https://huggingface.co/datasets/gabrielmbmb/instructions-from-fineweb-edu.
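
The distilabel CLI invocation quoted in the description above can also be assembled from a script. The following is a minimal sketch, not the dataset author's own tooling: `build_distilabel_command` and `run_if_available` are hypothetical helper names, and passing the same `pipeline.yaml` URL to the `info` subcommand is an assumption, since the listing truncates that command.

```python
import shlex
import shutil
import subprocess
from typing import List, Optional

# Config URL copied from the `distilabel pipeline run` command in the description.
PIPELINE_CONFIG_URL = (
    "https://huggingface.co/datasets/gabrielmbmb/"
    "instructions-from-fineweb-edu/raw/main/pipeline.yaml"
)


def build_distilabel_command(action: str, config_url: str) -> List[str]:
    """Return the argv for `distilabel pipeline <action> --config <config_url>`.

    Hypothetical helper: only the two subcommands named in the listing
    ("run" and "info") are accepted.
    """
    if action not in ("run", "info"):
        raise ValueError(f"unsupported action: {action!r}")
    return ["distilabel", "pipeline", action, "--config", config_url]


def run_if_available(argv: List[str]) -> Optional[int]:
    """Run argv only if its executable is on PATH; return None otherwise."""
    if shutil.which(argv[0]) is None:
        return None
    return subprocess.run(argv, check=False).returncode


if __name__ == "__main__":
    # Print the command instead of executing it, so nothing is downloaded.
    print(shlex.join(build_distilabel_command("run", PIPELINE_CONFIG_URL)))
```

Calling `run_if_available(build_distilabel_command("run", PIPELINE_CONFIG_URL))` would execute the pipeline when distilabel is installed; running a full pipeline may call external models, so printing the command first is the more cautious default.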

  19.

    fineweb-2-10M-tokenized-pplfixed2048SmolLM2-1.7B

    • huggingface.co
    Updated Feb 16, 2025
    Cite
    fineweb-2-10M-tokenized-pplfixed2048SmolLM2-1.7B [Dataset]. https://huggingface.co/datasets/juniorrios/fineweb-2-10M-tokenized-pplfixed2048SmolLM2-1.7B
    Explore at:
    Croissant
    Dataset updated
    Feb 16, 2025
    Authors
    Walcy Santos Rezende Rios
    Description

    juniorrios/fineweb-2-10M-tokenized-pplfixed2048SmolLM2-1.7B dataset hosted on Hugging Face and contributed by the HF Datasets community

  20.

    fineweb-filtered-100k

    • huggingface.co
    Cite
    Divij Bajaj, fineweb-filtered-100k [Dataset]. https://huggingface.co/datasets/divij30/fineweb-filtered-100k
    Explore at:
    Croissant
    Authors
    Divij Bajaj
    Description

    divij30/fineweb-filtered-100k dataset hosted on Hugging Face and contributed by the HF Datasets community
