Common Crawl Statistics
Number of pages, distribution of top-level domains, crawl overlaps, etc. - basic metrics about the Common Crawl Monthly Crawl Archives. For more detailed information and graphs, please visit our official statistics page. Here you can find the following statistics files:
Charsets
The character set or encoding is identified for HTML pages only, using Tika's AutoDetectReader. The table shows the percentage of HTML pages encoded with each character set… See the full description on the dataset page: https://huggingface.co/datasets/commoncrawl/statistics.
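For orientation, here is a minimal Python sketch of how such a charset histogram can be computed over raw HTML payloads. It uses the chardet package purely as a stand-in for Tika's AutoDetectReader (the tool actually used for these statistics), so results will not match the published numbers exactly.

```python
# Sketch: tally detected charsets over a set of raw HTML payloads.
# chardet is a stand-in for Tika's AutoDetectReader used in the real pipeline.
from collections import Counter

import chardet

def charset_histogram(html_payloads):
    """html_payloads: iterable of raw HTML byte strings; returns percentages per charset."""
    counts = Counter()
    for raw in html_payloads:
        guess = chardet.detect(raw)              # {'encoding': ..., 'confidence': ...}
        counts[guess["encoding"] or "unknown"] += 1
    total = sum(counts.values())
    return {enc: 100.0 * n / total for enc, n in counts.items()}
```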
A medical abbreviation expansion dataset that applies web-scale reverse substitution (WSRS) to the C4 dataset, a colossal, cleaned version of Common Crawl's web crawl corpus.
The original source is the Common Crawl dataset: https://commoncrawl.org
To use this dataset:
```python
import tensorflow_datasets as tfds

ds = tfds.load('c4_wsrs', split='train')
for ex in ds.take(4):
  print(ex)
```
See the guide for more information on tensorflow_datasets.
The 'Ancillary Monitor Corpus: Common Crawl - german web' was designed to enable a broad-based linguistic analysis of the German-language (visible) internet over time, with the aim of achieving comparability with DeReKo (the 'German Reference Corpus' of the Leibniz Institute for the German Language; DeReKo volume: 57 billion tokens; status: DeReKo Release 2024-I). The corpus is separated by year (here: 2017) and versioned (here: version 1). Version 1 comprises 97.45 billion tokens across all years (2013-2024). The corpus is based on the data dumps of CommonCrawl (https://commoncrawl.org/), a non-profit organisation that provides copies of the visible internet free of charge for research purposes.

The CommonCrawl WET raw data was first filtered by TLD (top-level domain). Only pages under the following TLDs were taken into account: .at; .bayern; .berlin; .ch; .cologne; .de; .gmbh; .hamburg; .koeln; .nrw; .ruhr; .saarland; .swiss; .tirol; .wien; .zuerich. These are the exclusively German-language TLDs according to ICANN (https://data.iana.org/TLD/tlds-alpha-by-domain.txt) as of 1 June 2024; TLDs with a purely corporate reference (e.g. .edeka, .bmw, .ford) were excluded. The language of the individual documents (URLs) was then estimated with NTextCat (https://github.com/ivanakcheurov/ntextcat), using its CORE14 profile; only documents/URLs for which German was the most likely language were processed further (e.g. to exclude foreign-language material such as individual subpages). The third step involved filtering by manual selectors and removing 1:1 duplicates (within one year). Filtering and subsequent processing were carried out with CorpusExplorer (http://hdl.handle.net/11234/1-2634) and our own (supplementary) scripts; TreeTagger (http://hdl.handle.net/11372/LRT-323) was used for automatic annotation. The corpus was processed on the HELIX HPC cluster. The author would like to take this opportunity to thank the state of Baden-Württemberg and the German Research Foundation (DFG) for the possibility to use the bwHPC/HELIX HPC cluster (funding code: INST 35/1597-1 FUGG).

Data content:
- Tokens and record boundaries
- Automatic lemma and POS annotation (using TreeTagger)
- Metadata:
  - GUID - unique identifier of the document
  - YEAR - year of capture (please use this information for data slices)
  - Url - full URL
  - Tld - top-level domain
  - Domain - domain without TLD (but with sub-domains if applicable)
  - DomainFull - complete domain (incl. TLD)
  - Datum (system information) - date assigned by CorpusExplorer (date of capture by CommonCrawl, not date of creation/modification of the document)
  - Hash (system information) - SHA1 hash provided by CommonCrawl
  - Pfad (system information) - path on the cluster (raw data), supplied by the system
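As a rough illustration of the first two filtering steps described above (the TLD allow-list and the language check), the Python sketch below reads WET records with warcio and uses langdetect as a stand-in for NTextCat's CORE14 profile (NTextCat is a .NET library); it is not the pipeline actually used to build the corpus.

```python
# Sketch only: TLD allow-list plus language identification over WET records.
# warcio and langdetect are assumptions standing in for the original tooling.
from urllib.parse import urlsplit

from langdetect import detect  # stand-in for NTextCat (CORE14 profile)
from warcio.archiveiterator import ArchiveIterator

GERMAN_TLDS = {
    "at", "bayern", "berlin", "ch", "cologne", "de", "gmbh", "hamburg",
    "koeln", "nrw", "ruhr", "saarland", "swiss", "tirol", "wien", "zuerich",
}

def german_wet_documents(wet_path):
    """Yield (url, text) for pages under German TLDs whose text is most likely German."""
    with open(wet_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "conversion":  # WET text records only
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            host = urlsplit(url).hostname or ""
            if host.rsplit(".", 1)[-1] not in GERMAN_TLDS:
                continue
            text = record.content_stream().read().decode("utf-8", "replace")
            try:
                if detect(text) == "de":
                    yield url, text
            except Exception:  # langdetect raises on empty or undecidable text
                continue
```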
Common Crawl Citations Overview
This dataset contains citations referencing Common Crawl Foundation and its datasets, pulled from Google Scholar. Please note that these citations are not curated, so they will include some false positives. For an annotated subset of these citations with additional fields, please see citations-annotated.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a comprehensive list of links to sitemaps and robots.txt files, extracted from the latest WARC archive dump of robots.txt files (2023-50). A minimal extraction sketch follows the two tables below.
Sitemaps:
| Top-level label of the Curlie.org directory | Number of sitemap links |
|---|---|
| Arts | 20110 |
| Business | 68690 |
| Computers | 17404 |
| Games | 3068 |
| Health | 13999 |
| Home | 4130 |
| Kids_and_Teens | 2240 |
| News | 5855 |
| Recreation | 19273 |
| Reference | 10862 |
| Regional | 419 |
| Science | 10729 |
| Shopping | 29903 |
| Society | 35019 |
| Sports | 12597 |
Robots.txt files:
| Top-level label of the Curlie.org directory | Number of robots.txt links |
|---|---|
| Arts | 25281 |
| Business | 79497 |
| Computers | 21880 |
| Games | 5037 |
| Health | 17326 |
| Home | 5401 |
| Kids_and_Teens | 3753 |
| News | 3424 |
| Recreation | 26355 |
| Reference | 15404 |
| Regional | 678 |
| Science | 16500 |
| Shopping | 30266 |
| Society | 45397 |
| Sports | 18029 |
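As referenced above, the following is a minimal Python sketch of how Sitemap: directives can be collected from a Common Crawl robots.txt WARC file; warcio and the file path are assumptions, not the exact tooling used to build this list.

```python
# Sketch: collect "Sitemap:" links from a Common Crawl robots.txt WARC file.
# Tooling (warcio) is an assumption; the dataset's own extraction pipeline may differ.
from warcio.archiveiterator import ArchiveIterator

def sitemap_links(robotstxt_warc_path):
    links = []
    with open(robotstxt_warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":    # robots.txt fetch responses
                continue
            body = record.content_stream().read().decode("utf-8", "replace")
            for line in body.splitlines():
                if line.lower().startswith("sitemap:"):
                    links.append(line.split(":", 1)[1].strip())
    return links
```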
This dataset contains domain names and counts of (non-deduplicated) URLs for every record in the CC-MAIN-2024-10 snapshot of the Common Crawl. It was collected from the AWS S3 version of Common Crawl via Amazon Athena. This dataset is derived from Common Crawl data and is subject to Common Crawl's Terms of Use: https://commoncrawl.org/terms-of-use.
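A minimal sketch of how such per-domain URL counts can be produced with Amazon Athena over Common Crawl's public columnar URL index is shown below; the table and column names follow Common Crawl's published ccindex schema, while the output bucket and the exact query used for this dataset are assumptions.

```python
# Sketch: count (non-deduplicated) URLs per domain in one crawl via Amazon Athena.
# Table/column names follow Common Crawl's public ccindex schema; the output
# bucket is a placeholder you would replace with your own.
import boto3

QUERY = """
SELECT url_host_name, COUNT(*) AS n_urls
FROM "ccindex"."ccindex"
WHERE crawl = 'CC-MAIN-2024-10'
  AND subset = 'warc'
GROUP BY url_host_name
ORDER BY n_urls DESC
"""

athena = boto3.client("athena", region_name="us-east-1")
response = athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": "ccindex"},
    ResultConfiguration={"OutputLocation": "s3://your-athena-results-bucket/"},
)
print(response["QueryExecutionId"])  # poll get_query_execution() until the query completes
```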
The Common Crawl corpus contains petabytes of data collected over 8 years of web crawling. The corpus contains raw web page data, metadata extracts and text extracts. Common Crawl data is stored on Amazon Web Services’ Public Data Sets and on multiple academic cloud platforms across the world.
A colossal, cleaned version of Common Crawl's web crawl corpus, based on the Common Crawl dataset: https://commoncrawl.org. This is the processed version of Google's mC4 dataset by AllenAI.
https://academictorrents.com/nolicensespecified
Common Crawl corpus - training-parallel-commoncrawl.tgz (CS-EN, DE-EN, ES-EN, FR-EN, RU-EN)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Content:
CEREAL (visit the project website) is a document-level corpus of documents in Spanish extracted from OSCAR. Each document in the corpus is classified according to its country of origin. CEREAL covers 24 countries where Spanish is spoken. Following OSCAR, we provide our annotations under a CC0 license, but we do not hold the copyright of the text content, which comes from OSCAR and therefore from Common Crawl.
The process to build the corpus and its characteristics can be found in:
Cristina España-Bonet and Alberto Barrón-Cedeño. "Elote, Choclo and Mazorca: on the Varieties of Spanish." In Proceedings of the 2024 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2024), Mexico City, Mexico, June 2024.
The corpus used to train the classifier and the sentence-level version of CEREAL is available at
https://zenodo.org/records/11390829
Files Description:
See the README.txt file
This dataset contains domain names and counts of (non-deduplicated) URLs for every record in the CC-MAIN-2017-13 snapshot of the Common Crawl. It was collected from the AWS S3 version of Common Crawl via Amazon Athena. This dataset is derived from Common Crawl data and is subject to Common Crawl's Terms of Use: https://commoncrawl.org/terms-of-use.
This dataset contains domain names and counts of (non-deduplicated) URLs for every record in the CC-MAIN-2018-22 snapshot of the Common Crawl. It was collected from the AWS S3 version of Common Crawl via Amazon Athena. This dataset is derived from Common Crawl data and is subject to Common Crawl's Terms of Use: https://commoncrawl.org/terms-of-use.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset includes news articles gathered from CommonCrawl for media outlets that were selected based on their political orientation. The news articles span publication dates from 2010 to 2021.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The BERTić-data text collection contains more than 8 billion tokens of mostly web-crawled text written in Bosnian, Croatian, Montenegrin or Serbian. The collection was used to train the BERTić transformer model (https://huggingface.co/classla/bcms-bertic). The data consists of web crawls before 2015, i.e. bsWaC (http://hdl.handle.net/11356/1062), hrWaC (http://hdl.handle.net/11356/1064), and srWaC (http://hdl.handle.net/11356/1063); previously unpublished 2019-2020 crawls, i.e. cnrWaC, CLASSLA-bs, CLASSLA-hr, and CLASSLA-sr; the cc100-hr and cc100-sr parts of CommonCrawl (https://commoncrawl.org/); and the Riznica corpus (http://hdl.handle.net/11356/1180). All texts were transliterated to the Latin script. The format of the text collection is one-sentence-per-line, empty-line-as-document-boundary. More details, especially on the applied near-deduplication procedure, can be found in the BERTić paper (https://arxiv.org/pdf/2104.09243.pdf).
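A minimal sketch of reading this one-sentence-per-line, empty-line-as-document-boundary format is shown below; the file name is a placeholder.

```python
# Sketch: iterate over documents in a one-sentence-per-line file where an
# empty line marks a document boundary.
def iter_documents(path):
    sentences = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                sentences.append(line)
            elif sentences:          # empty line closes the current document
                yield sentences
                sentences = []
    if sentences:                    # last document if the file has no trailing empty line
        yield sentences

# Example usage (placeholder file name):
# n_docs = sum(1 for _ in iter_documents("bertic-data.txt"))
```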
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data set belongs to an academic manuscript examining longitudinally (2000-2019) the prevalence of terms denoting far-right and far-left political extremism in a large corpus of more than 32 million written news and opinion articles from 54 news media outlets popular in the United States and the United Kingdom.
The textual content of news and opinion articles from the 54 outlets listed in the main manuscript is available in the outlets' online domains and/or public cache repositories such as Google cache (https://webcache.googleusercontent.com), the Internet Wayback Machine (https://archive.org/web/web.php), and Common Crawl (https://commoncrawl.org). We derived word frequency counts from these sources. The textual content included in our analysis is limited to article headlines and the main body text, and does not include other article elements such as figure captions.
Targeted textual content was located in the raw HTML using outlet-specific XPath expressions. Tokens were lowercased prior to estimating frequency counts. To prevent outlets with sparse text content in a given year from distorting aggregate frequency counts, we only include an outlet's frequency counts for years with at least 1 million words of article content from that outlet. This threshold was chosen to maximize the inclusion of outlets with sparse amounts of article text per year.
The yearly frequency of a target word in an outlet was estimated by dividing the total number of occurrences of the target word in all articles of a given year by the total number of words in all articles of that year. This method of estimating frequency accounts for the variable volume of total article output over time. A sketch of these steps is shown below.
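The sketch below illustrates, under stated assumptions, the two steps just described: extracting and lowercasing article text with an outlet-specific XPath expression (the expression shown is a placeholder) and computing a yearly relative frequency. It is not the authors' actual script.

```python
# Sketch of the extraction and frequency-estimation steps described above.
# The XPath expression and function names are placeholders; each outlet used its own.
import lxml.html

def article_tokens(html, xpath_expr='//div[@class="article-body"]//text()'):
    """Extract the article text matched by an outlet-specific XPath and lowercase it."""
    tree = lxml.html.fromstring(html)
    text = " ".join(tree.xpath(xpath_expr))
    return text.lower().split()

def yearly_relative_frequency(articles_tokens, target_word):
    """Occurrences of target_word divided by all words across one year's articles."""
    total_words = sum(len(tokens) for tokens in articles_tokens)
    target_count = sum(tokens.count(target_word) for tokens in articles_tokens)
    return target_count / total_words if total_words else 0.0
```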
The compressed files in this data set are listed next:
- analysisScripts.rar contains the analysis scripts used in the main manuscript
- articlesContainingTargetWords.rar contains counts of target words in outlet articles as well as total counts of words in articles
Usage Notes
In a small percentage of articles, outlet-specific XPath expressions failed to properly capture the content of the article due to the heterogeneity of HTML elements and CSS styling combinations with which article text is arranged in outlets' online domains. As a result, the total and target word count metrics for a small subset of articles are not precise. In a random sample of articles and outlets, manual estimation of target word counts overlapped with the automatically derived counts for over 90% of the articles.
Most of the incorrect frequency counts were minor deviations from the actual counts, such as counting the word "Facebook" in an article footnote (encouraging readers to follow the journalist's Facebook profile) that the XPath expression mistakenly included as part of the article's main text. Some additional outlet-specific inaccuracies that we could identify occurred in "The Hill" and "Newsmax", where XPath expressions fell short of precisely capturing article content. For "The Hill", in the years 2007-2009, XPath expressions failed to capture the complete text of the article in about 40% of the articles. This does not necessarily result in incorrect frequency counts for that outlet, but in a sample of article words that is about 40% smaller than the total population of article words for those three years. In the case of "Newsmax", the issue was that for some articles the XPath expressions captured the entire text of the article twice. Note that this does not result in incorrect frequency counts: if a word appears x times in an article with a total of y words, the same frequency count (x/y = 2x/2y) is still derived when our scripts count the word 2x times in a version of the article with a total of 2y words.
To conclude, in a data analysis of 32 million articles we cannot manually check the correctness of frequency counts for every single article, and one hundred percent accuracy at capturing article content is elusive due to a small number of difficult-to-detect boundary cases, such as incorrect HTML markup syntax in online domains. Overall, however, we are confident that our frequency metrics are representative of word prevalence in print news media content (see Figure 1 in the main manuscript for an illustration of the accuracy of the frequency counts).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data set contains frequency counts of target words in 16 million news and opinion articles from 10 popular news media outlets in the United Kingdom: The Guardian, The Times, The Independent, The Daily Mirror, BBC, Financial Times, Metro, The Telegraph and The Daily Mail, plus a few additional American-based outlets used as a comparison reference. The target words are listed in the associated manuscript and are mostly words that denote some type of prejudice, social-justice-related terms, or counterreactions to them. A few additional words are also available, since they are used in the manuscript for illustration purposes.
The textual content of news and opinion articles from the outlets listed in Figure 3 of the main manuscript is available in the outlets' online domains and/or public cache repositories such as Google cache (https://webcache.googleusercontent.com), the Internet Wayback Machine (https://archive.org/web/web.php), and Common Crawl (https://commoncrawl.org). We derived relative frequency counts from these sources. The textual content included in our analysis is limited to article headlines and the main body text, and does not include other article elements such as figure captions.
Targeted textual content was located in the raw HTML using outlet-specific XPath expressions. Tokens were lowercased prior to estimating frequency counts. To prevent outlets with sparse text content in a given year from distorting aggregate frequency counts, we only include an outlet's frequency counts for years with at least 1 million words of article content from that outlet.
The yearly frequency of a target word in an outlet was estimated by dividing the total number of occurrences of the target word in all articles of a given year by the total number of words in all articles of that year. This method of estimating frequency accounts for the variable volume of total article output over time.
The compressed files in this data set are listed next:
- analysisScripts.rar contains the analysis scripts used in the main manuscript
- targetWordsInArticlesCounts.rar contains counts of target words in outlet articles as well as total counts of words in articles
- targetWordsInArticlesCountsGuardianExampleWords contains counts of target words in outlet articles as well as total counts of words in articles for the illustrative Figure 1 in the main manuscript
Usage Notes
In a small percentage of articles, outlet-specific XPath expressions can fail to properly capture the content of the article due to the heterogeneity of HTML elements and CSS styling combinations with which article text is arranged in outlets' online domains. As a result, the total and target word count metrics for a small subset of articles are not precise. In a random sample of articles and outlets, manual estimation of target word counts overlapped with the automatically derived counts for over 90% of the articles.
Most of the incorrect frequency counts were minor deviations from the actual counts, such as counting the word "Facebook" in an article footnote (encouraging readers to follow the journalist's Facebook profile) that the XPath expression mistakenly included as part of the article's main text. To conclude, in a data analysis of 16 million articles we cannot manually check the correctness of frequency counts for every single article, and one hundred percent accuracy at capturing article content is elusive due to a small number of difficult-to-detect boundary cases, such as incorrect HTML markup syntax in online domains. Overall, however, we are confident that our frequency metrics are representative of word prevalence in print news media content (see Figure 1 of the main manuscript for supporting evidence).
This dataset contains domain names and counts of (non-deduplicated) URLs for every record in the CC-MAIN-2020-50 snapshot of the Common Crawl. It was collected from the AWS S3 version of Common Crawl via Amazon Athena. This dataset is derived from Common Crawl data and is subject to Common Crawl's Terms of Use: https://commoncrawl.org/terms-of-use.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data set contains frequency counts of target words in 5 million news and opinion articles from 3 popular newspapers in Spain: El País, El Mundo and ABC. The target words are listed in the associated manuscript and are mostly words that denote some type of prejudice. A few additional words not denoting prejudice are also available since they are used in the manuscript for illustration purposes.
The textual content of news and opinion articles from the outlets listed in Figure 1 of the main manuscript is available in the outlets' online domains and/or public cache repositories such as Google cache (https://webcache.googleusercontent.com), the Internet Wayback Machine (https://archive.org/web/web.php), and Common Crawl (https://commoncrawl.org). We derived word frequency counts from these original sources. The textual content included in our analysis is limited to article headlines and the main body text, and does not include other article elements such as figure captions.
Targeted textual content was located in the raw HTML using outlet-specific XPath expressions. Tokens were lowercased prior to estimating frequency counts. To prevent outlets with sparse text content in a given year from distorting aggregate frequency counts, we only include an outlet's frequency counts for years with at least 1 million words of article content from that outlet.
The yearly frequency of a target word in an outlet was estimated by dividing the total number of occurrences of the target word in all articles of a given year by the total number of words in all articles of that year. This method of estimating frequency accounts for the variable volume of total article output over time.
The compressed files in this data set are listed next:
- analysisScripts.rar contains the analysis scripts used in the main manuscript
- targetWordsInArticlesCounts.rar contains counts of target words in outlet articles as well as total counts of words in articles
Usage Notes
In a small percentage of articles, outlet-specific XPath expressions can fail to properly capture the content of the article due to the heterogeneity of HTML elements and CSS styling combinations with which article text is arranged in outlets' online domains. As a result, the total and target word count metrics for a small subset of articles might not be precise.
To conclude, in a data analysis of millions of news articles we cannot manually check the correctness of frequency counts for every single article, and one hundred percent accuracy at capturing article content is elusive due to a small number of difficult-to-detect boundary cases, such as incorrect HTML markup syntax in online domains. Overall, however, we are confident that our frequency metrics are representative of word prevalence in print news media content (see Figure 2 of the main manuscript for supporting evidence).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Webis MS MARCO Anchor Text 2022 dataset enriches Versions 1 and 2 of the MS MARCO document collection with anchor text extracted from six Common Crawl snapshots. The six snapshots cover the years 2016 to 2021 (between 1.7 and 3.4 billion documents each). For documents with more than 1,000 anchor texts we sampled 1,000 at random; for documents with fewer than 1,000 anchor texts we kept all of them (with this sampling, all anchor texts are included for 94% of the documents in Version 1 and 97% of the documents in Version 2). Overall, the MS MARCO Anchor Text 2022 dataset enriches 1,703,834 documents for Version 1 and 4,821,244 documents for Version 2 with anchor text.
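The per-document sampling rule restated in code (a sketch only; the function and variable names are illustrative, not part of the released tooling):

```python
# Sketch of the per-document sampling rule: keep all anchor texts when a
# document has at most 1,000 of them, otherwise keep a random sample of 1,000.
import random

def sample_anchor_texts(anchor_texts, cap=1000, seed=0):
    if len(anchor_texts) <= cap:
        return list(anchor_texts)
    return random.Random(seed).sample(anchor_texts, cap)
```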
Cleaned versions of the MS MARCO Anchor Text 2022 dataset are available in ir_datasets, Zenodo and Hugging Face. The raw dataset with additional information and all metadata for the extracted anchor texts (roughly 100GB) is available on Hugging Face and files.webis.de.
The details of the construction of the Webis MS MARCO Anchor Text 2022 dataset are described in the associated paper. If you use this dataset, please cite
@InProceedings{froebe:2022a,
address = {Berlin Heidelberg New York},
author = {Maik Fr{\"o}be and Sebastian G{\"u}nther and Maximilian Probst and Martin Potthast and Matthias Hagen},
booktitle = {Advances in Information Retrieval. 44th European Conference on IR Research (ECIR 2022)},
editor = {Matthias Hagen and Suzan Verberne and Craig Macdonald and Christin Seifert and Krisztian Balog and Kjetil N{\o}rv\r{a}g and Vinay Setty},
month = apr,
publisher = {Springer},
series = {Lecture Notes in Computer Science},
site = {Stavanger, Norway},
title = {{The Power of Anchor Text in the Neural Retrieval Era}},
year = 2022
}
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
SOTAB V2 for SemTab 2023 includes the datasets used to evaluate Column Type Annotation (CTA) and Columns Property Annotation (CPA) systems in the 2023 edition of the SemTab challenge. The datasets for both rounds of the challenge were down-sampled from the full train, test and validation splits of the SOTAB V2 (WDC Schema.org Table Annotation Benchmark, version 2) benchmark. The datasets of the first round have a smaller vocabulary of 40 labels for CTA and 50 labels for CPA, corresponding to easier/more general domains; the datasets of the second round use the full vocabulary of 80 and 105 labels and are therefore considered harder to annotate. The columns and the relationships between columns are annotated using the Schema.org and DBpedia vocabularies.
SOTAB V2 for SemTab 2023 contains the splits used in Round 1 and Round 2 of the challenge. Each round includes a training, validation and test split, together with the ground truth for the test splits and the vocabulary list. The ground truth of the test sets of both rounds was manually verified.
Files contained in SOTAB V2 for SemTab 2023:
All the corresponding tables can be found in the "Tables" zip folders.
Note on License: This data includes data from the following sources. Refer to each source for license details:
- CommonCrawl https://commoncrawl.org/
THIS DATA IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.