Common Crawl Statistics
Number of pages, distribution of top-level domains, crawl overlaps, and other basic metrics about the Common Crawl Monthly Crawl Archives. For more detailed information and graphs, please visit our official statistics page. Here you can find the following statistics files:
Charsets
The character set (encoding) of HTML pages is identified by Tika's AutoDetectReader. The table shows the percentage of HTML pages encoded with each character set… See the full description on the dataset page: https://huggingface.co/datasets/commoncrawl/statistics.
Common Crawl sample
A small unofficial random subset of the famous Common Crawl dataset.
60 random segment WET files were downloaded from Common Crawl on 2024-05-12. Lines between 500 and 5000 characters long (inclusive) were kept. Only unique texts were kept. No other filtering.
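The length and uniqueness filter described above can be sketched in a few lines of standard-library Python. This is not the dataset author's actual code, just a minimal illustration of the stated rules (lines of 500–5000 characters inclusive, first occurrence of each text kept):

```python
def keep_line(line, min_len=500, max_len=5000):
    """Keep lines whose length falls in the inclusive [min_len, max_len] range."""
    return min_len <= len(line) <= max_len

def filter_unique(lines):
    """Drop out-of-range lines, then keep only the first occurrence of each text."""
    seen = set()
    out = []
    for line in lines:
        if keep_line(line) and line not in seen:
            seen.add(line)
            out.append(line)
    return out
```

Applied to extracted WET-file lines, this reproduces the two stated filters and nothing else.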
Languages
Each text was assigned to one of the language codes using the GCLD3 Python package. The Chinese texts were classified as either simplified, traditional, or Cantonese using the… See the full description on the dataset page: https://huggingface.co/datasets/agentlans/common-crawl-sample.
malaysia-ai/common-crawl dataset hosted on Hugging Face and contributed by the HF Datasets community
amazingvince/common-crawl-diverse-sample dataset hosted on Hugging Face and contributed by the HF Datasets community
Traditional Chinese C4
Dataset Summary
Data obtained from the 2025-18 and 2025-13 Common Crawl snapshots. Downloaded and processed using code based on another project attempting to recreate the C4 dataset. The resultant dataset contains both simplified and traditional Chinese. It was then filtered using a modified list of simplified Chinese characters to obtain another traditional Chinese dataset. I am still ironing out the filtering process. The 2025-13 dataset was deduplicated… See the full description on the dataset page: https://huggingface.co/datasets/jed351/Chinese-Common-Crawl-Filtered.
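Filtering traditional Chinese text by a list of simplified-only characters, as described above, can be sketched like this. The character list here is a tiny hypothetical placeholder; the dataset author's actual list is longer and different:

```python
# Hypothetical sample of characters that exist only in simplified Chinese
# (each has a distinct traditional form); a real list has thousands of entries.
SIMPLIFIED_ONLY = set("国发说东汉")

def is_traditional(text, max_simplified=0):
    """Treat a text as traditional Chinese if it contains at most
    max_simplified characters from the simplified-only list."""
    return sum(ch in SIMPLIFIED_ONLY for ch in text) <= max_simplified
```

A tolerance parameter like `max_simplified` lets mostly-traditional documents with a stray simplified character survive, which is one plausible reason the filtering process needs iteration.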
yvfu/common-crawl-character-counts dataset hosted on Hugging Face and contributed by the HF Datasets community
Common Crawl Citations Overview
This dataset contains citations referencing Common Crawl Foundation and its datasets, pulled from Google Scholar. Please note that these citations are not curated, so they will include some false positives. For an annotated subset of these citations with additional fields, please see citations-annotated.
codymd/common-crawl-sample dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Common Crawl 2025 June
Common-Crawl-2025-June is a curated, processed, and filtered dataset built from the June 2025 Common Crawl web corpus. It contains data crawled between June 1, 2025, and June 10, 2025, processed using Hugging Face's DataTrove pipeline and several AI-based content filters to remove unsafe, harmful, or low-quality text.
Dataset Summary
This dataset represents one of the latest structured Common Crawl releases with high-quality web data. The… See the full description on the dataset page: https://huggingface.co/datasets/Shirova/Common-Crawl-2025-June.
Dataset Card for Common Crawl Traditional Chinese
De-duplicated version of jed351/Traditional-Chinese-Common-Crawl-Filtered. De-duplicated with MinHash
It is suggested to filter the dataset with NLU models before any serious use.
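The MinHash de-duplication mentioned above estimates Jaccard similarity between documents from small fixed-size signatures, so near-duplicates can be found without comparing full texts. A minimal standard-library sketch (real pipelines typically use a dedicated library and locality-sensitive hashing on top):

```python
import hashlib

def minhash_signature(text, num_hashes=64, shingle_size=5):
    """Compute a MinHash signature over character shingles of the text."""
    shingles = {text[i:i + shingle_size]
                for i in range(len(text) - shingle_size + 1)}
    signature = []
    for seed in range(num_hashes):
        # Seeded hash of each shingle; the minimum per seed forms one slot.
        signature.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingles
        ))
    return signature

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Documents whose estimated similarity exceeds a chosen threshold (e.g. 0.8) are treated as duplicates and collapsed to one copy.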
https://choosealicense.com/licenses/cc0-1.0/
commoncrawl/web-graph-testing-v1 dataset hosted on Hugging Face and contributed by the HF Datasets community
https://choosealicense.com/licenses/odc-by/
A thoroughly cleaned version of the Italian portion of the multilingual colossal, cleaned version of Common Crawl's web crawl corpus (mC4) by AllenAI.
Based on Common Crawl dataset: "https://commoncrawl.org".
This is the processed version of Google's mC4 dataset by AllenAI, with further cleaning detailed in the repository README file.
https://choosealicense.com/licenses/odc-by/
A sampling-enabled version of mC4, the colossal, cleaned version of Common Crawl's web crawl corpus.
Based on Common Crawl dataset: "https://commoncrawl.org".
This is a version of AllenAI's processed mC4 dataset in which sampling methods can be performed on the fly.
mariisa/small-german-common-crawl dataset hosted on Hugging Face and contributed by the HF Datasets community
https://choosealicense.com/licenses/cc/
Dataset Card for common-crawl-sample_urls
This dataset provides the URLs and top-level domains associated with training records in agentlans/common-crawl-sample. It is part of a collection of datasets curated to make exploring LLM training datasets more straightforward and accessible.
Dataset Details
Dataset Description
This dataset was created by downloading the source data, extracting URLs and top-level domains, and retaining only those record… See the full description on the dataset page: https://huggingface.co/datasets/nhagar/common-crawl-sample_urls.
https://choosealicense.com/licenses/cc0-1.0/
The Open Super-large Crawled Aggregated coRpus (OSCAR) is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the Ungoliant architecture.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
🧠 FineWeb-English-Filtered
📘 Dataset Summary
FineWeb-English-Filtered is a large-scale, cleaned, English-only text dataset derived from Common Crawl's WET archives. It contains 940 million documents of publicly available web text, converted into Apache Parquet format with a consistent schema for fast and efficient data loading. The dataset was generated using a custom AWS Glue pipeline that processed, filtered, and merged .wet files across multiple terabytes of Common… See the full description on the dataset page: https://huggingface.co/datasets/anandjh8/common-crawl-english-filtered.
A colossal, cleaned version of Common Crawl's web crawl corpus. Based on Common Crawl dataset: "https://commoncrawl.org". This is the processed version of Google's mC4 dataset by AllenAI.
Creative Commons Common Crawl
Description
This dataset contains text from 52 Common Crawl snapshots, about half of those available to date, spanning all years of Common Crawl's operation up to 2024. We found a higher level of duplication across this collection, suggesting that including more snapshots would lead to only a modest increase in total token yield. From these snapshots, we extract HTML content using FastWarc. Then, using a regular… See the full description on the dataset page: https://huggingface.co/datasets/common-pile/cccc_filtered.
This corpus is an attempt to recreate the dataset used for training XLM-R. It comprises monolingual data for 100+ languages and also includes data for romanized languages (indicated by *_rom). It was constructed using the URLs and paragraph indices provided by the CC-Net repository, by processing the January–December 2018 Common Crawl snapshots. Each file comprises documents separated by double newlines, with paragraphs within the same document separated by a single newline. The data is generated using the open-source CC-Net repository. No claims of intellectual property are made on the work of preparing the corpus.