79 datasets found
  1. statistics

    • huggingface.co
    Updated Nov 20, 2024
    Cite
    Common Crawl Foundation (2024). statistics [Dataset]. https://huggingface.co/datasets/commoncrawl/statistics
    Dataset updated
    Nov 20, 2024
    Dataset provided by
    Common Crawl (http://commoncrawl.org/)
    Authors
    Common Crawl Foundation
    Description

    Common Crawl Statistics

    Number of pages, distribution of top-level domains, crawl overlaps, etc.: basic metrics about the Common Crawl Monthly Crawl Archives. For more detailed information and graphs, please visit our official statistics page. Here you can find the following statistics files:

      Charsets
    

    The character set or encoding is identified for HTML pages only, using Tika's AutoDetectReader. The table shows the percentage of HTML pages encoded with each character set… See the full description on the dataset page: https://huggingface.co/datasets/commoncrawl/statistics.

  2. common-crawl-sample

    • huggingface.co
    Cite
    Alan Tseng, common-crawl-sample [Dataset]. https://huggingface.co/datasets/agentlans/common-crawl-sample
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Authors
    Alan Tseng
    Description

    Common Crawl sample

    A small unofficial random subset of the famous Common Crawl dataset.

    60 random segment WET files were downloaded from Common Crawl on 2024-05-12. Lines between 500 and 5000 characters long (inclusive) were kept. Only unique texts were kept. No other filtering.
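
    The length and uniqueness filter described above can be sketched roughly as follows. This is a minimal illustration, not the author's actual code: the function name `filter_lines` and its parameters are hypothetical, and it assumes the input is an iterable of text lines already extracted from WET files.

```python
def filter_lines(lines, min_len=500, max_len=5000):
    """Keep unique lines whose length is within [min_len, max_len] characters.

    Hypothetical helper: mirrors the description (inclusive length bounds,
    exact-match deduplication, no other filtering).
    """
    seen = set()
    kept = []
    for line in lines:
        text = line.rstrip("\n")
        # Inclusive bounds on both ends, and drop exact repeats.
        if min_len <= len(text) <= max_len and text not in seen:
            seen.add(text)
            kept.append(text)
    return kept
```

    The first occurrence of each text is kept and input order is preserved; whether the original pipeline did the same is not stated.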

      Languages
    

    Each text was assigned to one of the language codes using the GCLD3 Python package. The Chinese texts were classified as either simplified, traditional, or Cantonese using the… See the full description on the dataset page: https://huggingface.co/datasets/agentlans/common-crawl-sample.

  3. common-crawl

    • huggingface.co
    + more versions
    Cite
    Malaysia AI, common-crawl [Dataset]. https://huggingface.co/datasets/malaysia-ai/common-crawl
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset authored and provided by
    Malaysia AI
    Description

    malaysia-ai/common-crawl dataset hosted on Hugging Face and contributed by the HF Datasets community

  4. common-crawl-diverse-sample

    • huggingface.co
    Updated Sep 21, 2024
    Cite
    Vincent Haines (2024). common-crawl-diverse-sample [Dataset]. https://huggingface.co/datasets/amazingvince/common-crawl-diverse-sample
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Sep 21, 2024
    Authors
    Vincent Haines
    Description

    amazingvince/common-crawl-diverse-sample dataset hosted on Hugging Face and contributed by the HF Datasets community

  5. Chinese-Common-Crawl-Filtered

    • huggingface.co
    Updated Jun 2, 2025
    + more versions
    Cite
    Jed Cheng (2025). Chinese-Common-Crawl-Filtered [Dataset]. https://huggingface.co/datasets/jed351/Chinese-Common-Crawl-Filtered
    Dataset updated
    Jun 2, 2025
    Authors
    Jed Cheng
    Description

    Traditional Chinese C4

      Dataset Summary
    

    Data obtained from 2025-18 and 2025-13 Common Crawl. Downloaded and processed using code based on another project attempting to recreate the C4 dataset. The resultant dataset contains both simplified and traditional Chinese. It was then filtered using a modified list of simplified Chinese characters to obtain another traditional Chinese dataset. I am still ironing out the process of filtering. The 2025-13 dataset was deduplicated… See the full description on the dataset page: https://huggingface.co/datasets/jed351/Chinese-Common-Crawl-Filtered.
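
    One plausible reading of the character-list filter is sketched below. It is an assumption, not the dataset author's code: the character set `SIMPLIFIED_ONLY` is a tiny hypothetical sample (the actual modified list is not given here), and the rule "drop any text containing a listed character" is a guess at how such a list might be applied.

```python
# Hypothetical, tiny sample of characters that occur in simplified Chinese
# but not in traditional Chinese text. The real list is not reproduced here.
SIMPLIFIED_ONLY = frozenset("们这来国对说时")


def looks_traditional(text, simplified_chars=SIMPLIFIED_ONLY):
    """Return True if the text contains none of the listed simplified characters."""
    return not any(ch in simplified_chars for ch in text)


def keep_traditional(texts, simplified_chars=SIMPLIFIED_ONLY):
    """Filter an iterable of texts down to those that look traditional."""
    return [t for t in texts if looks_traditional(t, simplified_chars)]
```

    A stricter or looser rule (for example, a threshold on the share of simplified characters) would change which mixed-script texts survive.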

  6. common-crawl-character-counts

    • huggingface.co
    Updated Jul 2, 2025
    Cite
    yvfu (2025). common-crawl-character-counts [Dataset]. https://huggingface.co/datasets/yvfu/common-crawl-character-counts
    Dataset updated
    Jul 2, 2025
    Authors
    yvfu
    Description

    yvfu/common-crawl-character-counts dataset hosted on Hugging Face and contributed by the HF Datasets community

  7. citations

    • huggingface.co
    Updated Jul 30, 2024
    Cite
    Common Crawl Foundation (2024). citations [Dataset]. https://huggingface.co/datasets/commoncrawl/citations
    Dataset updated
    Jul 30, 2024
    Dataset provided by
    Common Crawl (http://commoncrawl.org/)
    Authors
    Common Crawl Foundation
    Description

    Common Crawl Citations Overview

    This dataset contains citations referencing Common Crawl Foundation and its datasets, pulled from Google Scholar. Please note that these citations are not curated, so they will include some false positives. For an annotated subset of these citations with additional fields, please see citations-annotated.

  8. common-crawl-sample

    • huggingface.co
    Cite
    Cody Daniels, common-crawl-sample [Dataset]. https://huggingface.co/datasets/codymd/common-crawl-sample
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Authors
    Cody Daniels
    Description

    codymd/common-crawl-sample dataset hosted on Hugging Face and contributed by the HF Datasets community

  9. Common-Crawl-2025-June

    • huggingface.co
    Updated Jun 25, 2025
    Cite
    Shirova AI (2025). Common-Crawl-2025-June [Dataset]. https://huggingface.co/datasets/Shirova/Common-Crawl-2025-June
    Dataset updated
    Jun 25, 2025
    Dataset authored and provided by
    Shirova AI
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Common Crawl 2025 June

    Common-Crawl-2025-June is a curated, processed, and filtered dataset built from the June 2025 Common Crawl web corpus. It contains data crawled between June 1, 2025, and June 10, 2025, processed using Hugging Face’s Data Trove pipeline and several AI-based content filters to remove unsafe, harmful, or low-quality text.

      Dataset Summary
    

    This dataset represents one of the latest structured Common Crawl releases with high-quality web data. The… See the full description on the dataset page: https://huggingface.co/datasets/Shirova/Common-Crawl-2025-June.

  10. common-crawl-zhtw

    • huggingface.co
    Updated Aug 6, 2023
    Cite
    Oscar, Li (2023). common-crawl-zhtw [Dataset]. https://huggingface.co/datasets/liswei/common-crawl-zhtw
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Aug 6, 2023
    Authors
    Oscar, Li
    Description

    Dataset Card for Common Crawl Traditional Chinese

    De-duplicated version of jed351/Traditional-Chinese-Common-Crawl-Filtered, de-duplicated with MinHash.

    It is suggested to filter the dataset with NLU models before any serious use.
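
    For readers unfamiliar with MinHash, the idea can be sketched in plain Python as below. This is an illustrative toy, not the pipeline used for this dataset: the shingle size, number of hash functions, the 0.8 threshold, and all function names are assumptions, and the pairwise comparison is quadratic (real pipelines typically add locality-sensitive hashing).

```python
import hashlib


def shingles(text, k=5):
    """Set of overlapping character k-grams (the whole text if shorter than k)."""
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def minhash_signature(text, num_perm=64, k=5):
    """One minimum hash value per seeded hash function."""
    grams = shingles(text, k)
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # distinct salt per hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for g in grams
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(texts, threshold=0.8):
    """Keep a text only if it is not a near-duplicate of one already kept."""
    kept, signatures = [], []
    for text in texts:
        sig = minhash_signature(text)
        if all(estimated_jaccard(sig, other) < threshold for other in signatures):
            kept.append(text)
            signatures.append(sig)
    return kept
```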

  11. web-graph-testing-v1

    • huggingface.co
    Updated Nov 14, 2025
    Cite
    Common Crawl Foundation (2025). web-graph-testing-v1 [Dataset]. https://huggingface.co/datasets/commoncrawl/web-graph-testing-v1
    Dataset updated
    Nov 14, 2025
    Dataset provided by
    Common Crawl (http://commoncrawl.org/)
    Authors
    Common Crawl Foundation
    License

    https://choosealicense.com/licenses/cc0-1.0/

    Description

    commoncrawl/web-graph-testing-v1 dataset hosted on Hugging Face and contributed by the HF Datasets community

  12. clean_mc4_it

    • huggingface.co
    Updated Feb 22, 2020
    + more versions
    Cite
    Gabriele Sarti (2020). clean_mc4_it [Dataset]. https://huggingface.co/datasets/gsarti/clean_mc4_it
    Dataset updated
    Feb 22, 2020
    Authors
    Gabriele Sarti
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    A thoroughly cleaned version of the Italian portion of the multilingual colossal, cleaned version of Common Crawl's web crawl corpus (mC4) by AllenAI.

    Based on Common Crawl dataset: "https://commoncrawl.org".

    This is the processed version of Google's mC4 dataset by AllenAI, with further cleaning detailed in the repository README file.

  13. mc4-sampling

    • huggingface.co
    Updated Jun 24, 2018
    + more versions
    Cite
    BERTIN Project (2018). mc4-sampling [Dataset]. https://huggingface.co/datasets/bertin-project/mc4-sampling
    Dataset updated
    Jun 24, 2018
    Dataset authored and provided by
    BERTIN Project
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    A sampling-enabled version of mC4, the colossal, cleaned version of Common Crawl's web crawl corpus.

    Based on Common Crawl dataset: "https://commoncrawl.org".

    This is a version of AllenAI's processed version of Google's mC4 dataset, in which sampling methods are applied on the fly.

  14. small-german-common-crawl

    • huggingface.co
    Updated Apr 24, 2024
    Cite
    Marisa Schmidt (2024). small-german-common-crawl [Dataset]. https://huggingface.co/datasets/mariisa/small-german-common-crawl
    Dataset updated
    Apr 24, 2024
    Authors
    Marisa Schmidt
    Description

    mariisa/small-german-common-crawl dataset hosted on Hugging Face and contributed by the HF Datasets community

  15. common-crawl-sample_urls

    • huggingface.co
    Updated May 15, 2025
    Cite
    Nick Hagar (2025). common-crawl-sample_urls [Dataset]. http://doi.org/10.57967/hf/5470
    Dataset updated
    May 15, 2025
    Authors
    Nick Hagar
    License

    https://choosealicense.com/licenses/cc/

    Description

    Dataset Card for common-crawl-sample_urls

    This dataset provides the URLs and top-level domains associated with training records in agentlans/common-crawl-sample. It is part of a collection of datasets curated to make exploring LLM training datasets more straightforward and accessible.

      Dataset Details

      Dataset Description
    

    This dataset was created by downloading the source data, extracting URLs and top-level domains, and retaining only those record… See the full description on the dataset page: https://huggingface.co/datasets/nhagar/common-crawl-sample_urls.
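
    Extracting a URL's top-level domain, as described above, might look like the sketch below. It is an assumption about the approach, not the dataset's actual code: the function name `url_and_tld` and the output keys are hypothetical, and the "last label of the hostname" rule is deliberately naive (it does not consult the Public Suffix List, so `example.co.uk` yields `uk`).

```python
from urllib.parse import urlsplit


def url_and_tld(url):
    """Return the URL together with the last dot-separated label of its host."""
    host = urlsplit(url).hostname or ""
    # Naive rule: the TLD is whatever follows the final dot in the hostname.
    tld = host.rsplit(".", 1)[-1] if "." in host else ""
    return {"url": url, "tld": tld}
```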

  16. OSCAR-2201

    • huggingface.co
    Updated May 20, 2022
    + more versions
    Cite
    OSCAR (2022). OSCAR-2201 [Dataset]. https://huggingface.co/datasets/oscar-corpus/OSCAR-2201
    Dataset updated
    May 20, 2022
    Dataset authored and provided by
    OSCAR
    License

    https://choosealicense.com/licenses/cc0-1.0/

    Description

    The Open Super-large Crawled Aggregated coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the Ungoliant architecture.

  17. common-crawl-english-filtered

    • huggingface.co
    Cite
    anand, common-crawl-english-filtered [Dataset]. https://huggingface.co/datasets/anandjh8/common-crawl-english-filtered
    Authors
    anand
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    🧠 FineWeb-English-Filtered

      📘 Dataset Summary
    

    FineWeb-English-Filtered is a large-scale, cleaned, English-only text dataset derived from Common Crawl’s WET archives. It contains 940 million documents of publicly available web text, converted into Apache Parquet format with a consistent schema for fast and efficient data loading. The dataset was generated using a custom AWS Glue pipeline that processed, filtered, and merged .wet files across multiple terabytes of Common… See the full description on the dataset page: https://huggingface.co/datasets/anandjh8/common-crawl-english-filtered.

  18. clean-si-mc4

    • huggingface.co
    Updated Dec 21, 2021
    + more versions
    Cite
    Keshan Sodimana (2021). clean-si-mc4 [Dataset]. https://huggingface.co/datasets/keshan/clean-si-mc4
    Dataset updated
    Dec 21, 2021
    Authors
    Keshan Sodimana
    Description

    A colossal, cleaned version of Common Crawl's web crawl corpus. Based on Common Crawl dataset: "https://commoncrawl.org". This is the processed version of Google's mC4 dataset by AllenAI.

  19. cccc_filtered

    • huggingface.co
    Updated Jun 4, 2024
    + more versions
    Cite
    Common Pile (2024). cccc_filtered [Dataset]. https://huggingface.co/datasets/common-pile/cccc_filtered
    Dataset updated
    Jun 4, 2024
    Dataset authored and provided by
    Common Pile
    Description

    Creative Commons Common Crawl

      Description
    

    This dataset contains text from 52 Common Crawl snapshots, covering about half of Common Crawl snapshots available to date and covering all years of operations of Common Crawl up to 2024. We found a higher level of duplication across this collection, suggesting that including more snapshots would lead to a modest increase in total token yield. From these snapshots, we extract HTML content using FastWarc. Then, using a regular… See the full description on the dataset page: https://huggingface.co/datasets/common-pile/cccc_filtered.

  20. cc100

    • huggingface.co
    • opendatalab.com
    Cite
    SEACrowd, cc100 [Dataset]. https://huggingface.co/datasets/SEACrowd/cc100
    Dataset authored and provided by
    SEACrowd
    Description

    This corpus is an attempt to recreate the dataset used for training XLM-R. It comprises monolingual data for 100+ languages and also includes data for romanized languages (indicated by *_rom). It was constructed using the URLs and paragraph indices provided by the CC-Net repository by processing January-December 2018 Common Crawl snapshots. Each file comprises documents separated by double-newlines and paragraphs within the same document separated by a newline. The data is generated using the open source CC-Net repository. No claims of intellectual property are made on the work of preparation of the corpus.
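
    The file layout described above (documents separated by a blank line, paragraphs by a single newline) could be parsed along these lines. This is a minimal sketch under that stated layout; the function name `parse_documents` is hypothetical and the code reads the whole text into memory, which would not suit the full-size files.

```python
def parse_documents(raw):
    """Split raw text into documents, each a list of paragraph strings.

    Assumes documents are separated by a double newline and paragraphs
    within a document by a single newline, as the description states.
    """
    documents = []
    for block in raw.split("\n\n"):
        paragraphs = [p for p in block.split("\n") if p.strip()]
        if paragraphs:  # skip empty blocks from leading/trailing separators
            documents.append(paragraphs)
    return documents
```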
