Common Crawl Statistics
Number of pages, distribution of top-level domains, crawl overlaps, and other basic metrics about the Common Crawl Monthly Crawl Archives. For more detailed information and graphs, please visit our official statistics page. Here you can find the following statistics files:
Charsets
The character set (encoding) of HTML pages is identified by Tika's AutoDetectReader. The table shows the percentage of HTML pages encoded with each character set… See the full description on the dataset page: https://huggingface.co/datasets/commoncrawl/statistics.
Common Crawl sample
A small unofficial random subset of the famous Common Crawl dataset.
60 random segment WET files were downloaded from Common Crawl on 2024-05-12. Lines between 500 and 5000 characters long (inclusive) were kept. Only unique texts were kept. No other filtering.
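The length and uniqueness filter described above can be sketched in a few lines of standard-library Python. This is not the dataset author's actual code, just a minimal illustration of the stated rules (lines of 500–5000 characters inclusive, first occurrence of each text kept):

```python
def keep_line(line, min_len=500, max_len=5000):
    """Keep lines whose length falls in the inclusive [min_len, max_len] range."""
    return min_len <= len(line) <= max_len

def filter_unique(lines):
    """Drop out-of-range lines, then keep only the first occurrence of each text."""
    seen = set()
    out = []
    for line in lines:
        if keep_line(line) and line not in seen:
            seen.add(line)
            out.append(line)
    return out
```

Applied to extracted WET-file lines, this reproduces the two stated filters and nothing else.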
Languages
Each text was assigned to one of the language codes using the GCLD3 Python package. The Chinese texts were classified as either simplified, traditional, or Cantonese using the… See the full description on the dataset page: https://huggingface.co/datasets/agentlans/common-crawl-sample.
malaysia-ai/common-crawl dataset hosted on Hugging Face and contributed by the HF Datasets community
amazingvince/common-crawl-diverse-sample dataset hosted on Hugging Face and contributed by the HF Datasets community
Traditional Chinese C4
Dataset Summary
Data obtained from the 2025-18 and 2025-13 Common Crawl snapshots. Downloaded and processed using code based on another project attempting to recreate the C4 dataset. The resultant dataset contains both simplified and traditional Chinese. It was then filtered using a modified list of simplified Chinese characters to obtain another traditional Chinese dataset. I am still ironing out the filtering process. The 2025-13 dataset was deduplicated… See the full description on the dataset page: https://huggingface.co/datasets/jed351/Chinese-Common-Crawl-Filtered.
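Filtering traditional Chinese text by a list of simplified-only characters, as described above, can be sketched like this. The character list here is a tiny hypothetical placeholder; the dataset author's actual list is longer and different:

```python
# Hypothetical sample of characters that exist only in simplified Chinese
# (each has a distinct traditional form); a real list has thousands of entries.
SIMPLIFIED_ONLY = set("国发说东汉")

def is_traditional(text, max_simplified=0):
    """Treat a text as traditional Chinese if it contains at most
    max_simplified characters from the simplified-only list."""
    return sum(ch in SIMPLIFIED_ONLY for ch in text) <= max_simplified
```

A tolerance parameter like `max_simplified` lets mostly-traditional documents with a stray simplified character survive, which is one plausible reason the filtering process needs iteration.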
yvfu/common-crawl-character-counts dataset hosted on Hugging Face and contributed by the HF Datasets community
Common Crawl Citations Overview
This dataset contains citations referencing Common Crawl Foundation and its datasets, pulled from Google Scholar. Please note that these citations are not curated, so they will include some false positives. For an annotated subset of these citations with additional fields, please see citations-annotated.
codymd/common-crawl-sample dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Common Crawl 2025 June
Common-Crawl-2025-June is a curated, processed, and filtered dataset built from the June 2025 Common Crawl web corpus. It contains data crawled between June 1, 2025, and June 10, 2025, processed using Hugging Face's DataTrove pipeline and several AI-based content filters to remove unsafe, harmful, or low-quality text.
Dataset Summary
This dataset represents one of the latest structured Common Crawl releases with high-quality web data. The… See the full description on the dataset page: https://huggingface.co/datasets/Shirova/Common-Crawl-2025-June.
Dataset Card for Common Crawl Traditional Chinese
De-duplicated version of jed351/Traditional-Chinese-Common-Crawl-Filtered. De-duplicated with MinHash
It is suggested to filter the dataset with NLU models before any serious use.
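The MinHash de-duplication mentioned above estimates Jaccard similarity between documents from small fixed-size signatures, so near-duplicates can be found without comparing full texts. A minimal standard-library sketch (real pipelines typically use a dedicated library and locality-sensitive hashing on top):

```python
import hashlib

def minhash_signature(text, num_hashes=64, shingle_size=5):
    """Compute a MinHash signature over character shingles of the text."""
    shingles = {text[i:i + shingle_size]
                for i in range(len(text) - shingle_size + 1)}
    signature = []
    for seed in range(num_hashes):
        # Seeded hash of each shingle; the minimum per seed forms one slot.
        signature.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingles
        ))
    return signature

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Documents whose estimated similarity exceeds a chosen threshold (e.g. 0.8) are treated as duplicates and collapsed to one copy.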
https://choosealicense.com/licenses/cc0-1.0/
commoncrawl/web-graph-testing-v1 dataset hosted on Hugging Face and contributed by the HF Datasets community
https://choosealicense.com/licenses/odc-by/
A thoroughly cleaned version of the Italian portion of the multilingual colossal, cleaned version of Common Crawl's web crawl corpus (mC4) by AllenAI.
Based on Common Crawl dataset: "https://commoncrawl.org".
This is the processed version of Google's mC4 dataset by AllenAI, with further cleaning detailed in the repository README file.
https://choosealicense.com/licenses/odc-by/
A sampling-enabled version of mC4, the colossal, cleaned version of Common Crawl's web crawl corpus.
Based on Common Crawl dataset: "https://commoncrawl.org".
This is a version of AllenAI's processed mC4 dataset in which sampling methods can be performed on the fly.
mariisa/small-german-common-crawl dataset hosted on Hugging Face and contributed by the HF Datasets community
https://choosealicense.com/licenses/cc/
Dataset Card for common-crawl-sample_urls
This dataset provides the URLs and top-level domains associated with training records in agentlans/common-crawl-sample. It is part of a collection of datasets curated to make exploring LLM training datasets more straightforward and accessible.
Dataset Details
Dataset Description
This dataset was created by downloading the source data, extracting URLs and top-level domains, and retaining only those record… See the full description on the dataset page: https://huggingface.co/datasets/nhagar/common-crawl-sample_urls.
https://choosealicense.com/licenses/cc0-1.0/
The Open Super-large Crawled Aggregated coRpus (OSCAR) is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the Ungoliant architecture.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
🧠 FineWeb-English-Filtered
📘 Dataset Summary
FineWeb-English-Filtered is a large-scale, cleaned, English-only text dataset derived from Common Crawl's WET archives. It contains 940 million documents of publicly available web text, converted into Apache Parquet format with a consistent schema for fast and efficient data loading. The dataset was generated using a custom AWS Glue pipeline that processed, filtered, and merged .wet files across multiple terabytes of Common… See the full description on the dataset page: https://huggingface.co/datasets/anandjh8/common-crawl-english-filtered.
A colossal, cleaned version of Common Crawl's web crawl corpus. Based on Common Crawl dataset: "https://commoncrawl.org". This is the processed version of Google's mC4 dataset by AllenAI.
Creative Commons Common Crawl
Description
This dataset contains text from 52 Common Crawl snapshots, about half of those available to date, spanning all years of Common Crawl's operation up to 2024. We found a higher level of duplication across this collection, suggesting that including more snapshots would lead to only a modest increase in total token yield. From these snapshots, we extract HTML content using FastWarc. Then, using a regular… See the full description on the dataset page: https://huggingface.co/datasets/common-pile/cccc_filtered.
This corpus is an attempt to recreate the dataset used for training XLM-R. It comprises monolingual data for 100+ languages and also includes data for romanized languages (indicated by *_rom). It was constructed using the URLs and paragraph indices provided by the CC-Net repository, by processing the January–December 2018 Common Crawl snapshots. Each file comprises documents separated by double newlines, with paragraphs within the same document separated by a single newline. The data is generated using the open-source CC-Net repository. No claims of intellectual property are made on the work of preparing the corpus.