The Common Crawl corpus contains petabytes of data collected over 8 years of web crawling. The corpus contains raw web page data, metadata extracts and text extracts. Common Crawl data is stored on Amazon Web Services’ Public Data Sets and on multiple academic cloud platforms across the world.
https://networkrepository.com/policy.php
Host-level Web Graph - This graph aggregates the page graph by subdomain/host: each node represents a subdomain/host, and an edge exists between a pair of hosts if at least one link was found between pages belonging to those hosts. The hyperlink graph was extracted from the Web corpus released by the Common Crawl Foundation in August 2012. The Web corpus was gathered using a web crawler employing a breadth-first-search selection strategy and embedding link discovery while crawling. The crawl was seeded with a large number of URLs from former crawls performed by the Common Crawl Foundation. Also, see web-cc12-firstlevel-subdomain and web-cc12-PayLevelDomain.
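As a rough illustration of this aggregation step, the sketch below collapses a page-level edge list into host-level edges. The file name and the plain "source_url target_url" edge format are assumptions made for the example, not necessarily the released graph's actual layout.

from urllib.parse import urlparse

host_edges = set()
with open("page_edges.txt", encoding="utf-8") as f:        # assumed: "source_url target_url" per line
    for line in f:
        src, dst = line.split()
        src_host, dst_host = urlparse(src).hostname, urlparse(dst).hostname
        if src_host and dst_host and src_host != dst_host:
            host_edges.add((src_host, dst_host))           # keep one edge per linked host pair
print(len(host_edges), "host-level edges")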
https://choosealicense.com/licenses/cc/
The Common Crawl Creative Commons Corpus (C5)
Raw CommonCrawl crawls, annotated with Creative Commons license information
C5 is an effort to collect Creative Commons-licensed web data in one place. The licensing information is extracted from each web page based on whether it links to a Creative Commons license, either overtly in <a> tags (as in the footer of Wikipedia) or in metadata fields indicating deliberate Creative Commons publication. However, false positives may occur. See the full description on the dataset page: https://huggingface.co/datasets/BramVanroy/CommonCrawl-CreativeCommons.
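A minimal sketch of that detection idea is shown below, assuming BeautifulSoup and checking only href attributes of <a> and <link> elements for creativecommons.org license paths; the dataset's actual extraction logic is more involved and also inspects other metadata fields.

from bs4 import BeautifulSoup

def find_cc_license_links(html):
    """Return hrefs of <a>/<link> elements that point at CC license URLs."""
    soup = BeautifulSoup(html, "html.parser")
    hits = []
    for tag in soup.find_all(["a", "link"], href=True):
        href = tag["href"]
        if "creativecommons.org/licenses/" in href or "creativecommons.org/publicdomain/" in href:
            hits.append(href)
    return hits

html = '<footer><a href="https://creativecommons.org/licenses/by-sa/4.0/">CC BY-SA</a></footer>'
print(find_cc_license_links(html))  # ['https://creativecommons.org/licenses/by-sa/4.0/']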
The ‘Ancillary Monitor Corpus: Common Crawl - german web’ was designed to enable a broad-based linguistic analysis of the German-language (visible) internet over time, with the aim of achieving comparability with DeReKo (the ‘German Reference Corpus’ of the Leibniz Institute for the German Language; DeReKo volume: 57 billion tokens; status: DeReKo Release 2024-I). The corpus is separated by year (here: 2017) and versioned (here: version 1). Version 1 comprises 97.45 billion tokens across all years (2013-2024).

The corpus is based on the data dumps from CommonCrawl (https://commoncrawl.org/), a non-profit organisation that provides copies of the visible Internet free of charge for research purposes. The CommonCrawl WET raw data was first filtered by TLD (top-level domain). Only pages under the following TLDs were taken into account: .at; .bayern; .berlin; .ch; .cologne; .de; .gmbh; .hamburg; .koeln; .nrw; .ruhr; .saarland; .swiss; .tirol; .wien; .zuerich. These are the exclusively German-language TLDs according to ICANN (https://data.iana.org/TLD/tlds-alpha-by-domain.txt) as of 1 June 2024; TLDs with a purely corporate reference (e.g. .edeka, .bmw, .ford) were excluded. The language of the individual documents (URLs) was then estimated with NTextCat (https://github.com/ivanakcheurov/ntextcat) via its CORE14 profile; only documents/URLs for which German was the most likely language were processed further (e.g. to exclude foreign-language material such as individual subpages). The third step involved filtering by manual selectors and removing 1:1 duplicates (within one year). The filtering and subsequent processing were carried out using CorpusExplorer (http://hdl.handle.net/11234/1-2634) and our own (supplementary) scripts, and the TreeTagger (http://hdl.handle.net/11372/LRT-323) was used for automatic annotation. The corpus was processed on the HELIX HPC cluster. The author would like to take this opportunity to thank the state of Baden-Württemberg and the German Research Foundation (DFG) for the possibility to use the bwHPC/HELIX HPC cluster (funding code: INST 35/1597-1 FUGG).

Data content:
- Tokens and record boundaries
- Automatic lemma and POS annotation (using TreeTagger)
- Metadata:
  - GUID - unique identifier of the document
  - YEAR - year of capture (please use this information for data slices)
  - Url - full URL
  - Tld - top-level domain
  - Domain - domain without TLD (but with sub-domains if applicable)
  - DomainFull - complete domain (incl. TLD)
  - Datum (system information) - date recorded by CorpusExplorer (date of capture by CommonCrawl, not the date of creation/modification of the document)
  - Hash (system information) - SHA1 hash of the CommonCrawl
  - Pfad (system information) - path on the cluster (raw data), supplied by the system
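The first two filter steps (TLD restriction, then language identification) could be approximated in Python roughly as sketched below; warcio and langdetect are stand-ins for the tooling actually used (NTextCat with its CORE14 profile, CorpusExplorer), and the WET file name is a placeholder.

from urllib.parse import urlparse
from warcio.archiveiterator import ArchiveIterator
from langdetect import detect

# Exclusively German-language TLDs per the corpus description.
GERMAN_TLDS = {"at", "bayern", "berlin", "ch", "cologne", "de", "gmbh", "hamburg",
               "koeln", "nrw", "ruhr", "saarland", "swiss", "tirol", "wien", "zuerich"}

def keep_record(record):
    if record.rec_type != "conversion":             # WET records carry extracted text
        return False
    url = record.rec_headers.get_header("WARC-Target-URI") or ""
    host = urlparse(url).hostname or ""
    if host.rsplit(".", 1)[-1] not in GERMAN_TLDS:  # step 1: TLD filter
        return False
    text = record.content_stream().read().decode("utf-8", errors="replace")
    if not text.strip():
        return False
    return detect(text[:2000]) == "de"              # step 2: crude language check

with open("example.warc.wet.gz", "rb") as stream:   # placeholder WET file
    kept = [r.rec_headers.get_header("WARC-Target-URI")
            for r in ArchiveIterator(stream) if keep_record(r)]
print(len(kept), "German-language documents")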
Common Crawl Statistics
Number of pages, distribution of top-level domains, crawl overlaps, etc. - basic metrics about the Common Crawl Monthly Crawl Archives. For more detailed information and graphs, please visit our official statistics page. Here you can find the following statistics files:
Charsets
The character set or encoding is identified for HTML pages only, using Tika's AutoDetectReader. The table shows the percentage of HTML pages encoded with each character set. See the full description on the dataset page: https://huggingface.co/datasets/commoncrawl/statistics.
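The statistic can be approximated as below. Note that Common Crawl's pipeline uses Tika's AutoDetectReader (Java); charset_normalizer serves here only as a stand-in detector for illustration.

from collections import Counter
from charset_normalizer import from_bytes

def charset_histogram(html_payloads):
    """Return the percentage of pages per detected character set."""
    counts = Counter()
    for payload in html_payloads:
        best = from_bytes(payload).best()
        counts[best.encoding if best else "unknown"] += 1
    total = sum(counts.values()) or 1
    return {enc: 100.0 * n / total for enc, n in counts.most_common()}

print(charset_histogram(["<html>äöü</html>".encode("utf-8")]))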
https://choosealicense.com/licenses/openrail/
A colossal, cleaned version of Common Crawl's web crawl corpus, based on the Common Crawl dataset (https://commoncrawl.org). This is AllenAI's processed version of Google's mC4 dataset.
When crawling and mobile indexing are ensured, it is easier for crawlers and Internet users to visit a site and for search engines to discover it. Thus, according to the source, in 2020, more than ** percent of SEOs attached great importance to internal linking, that is, to the presence of internal links pointing to the page to be highlighted. They considered all the criteria in the crawl category to be important, with the exception of the indication of priority in the sitemap, to which only ** percent of SEOs attached importance, with an importance of **** out of five.
Many e-shops have started to mark up product data within their HTML pages using the schema.org vocabulary. The Web Data Commons project regularly extracts such data from the Common Crawl, a large public web crawl. The Web Data Commons Training and Test Sets for Large-Scale Product Matching contain product offers from different e-shops in the form of binary product pairs (with corresponding label "match" or "no match") for four product categories: computers, cameras, watches and shoes. In order to support the evaluation of machine learning-based matching methods, the data is split into training, validation and test sets. For each product category, we provide training sets in four different sizes (2,000-70,000 pairs). Furthermore, sets of IDs for a possible validation split (stratified random draw) are available for each training set. The test set for each product category consists of 1,100 product pairs. The labels of the test sets were manually checked, while those of the training sets were derived using shared product identifiers from the Web as weak supervision. The data stems from the WDC Product Data Corpus for Large-Scale Product Matching - Version 2.0, which consists of 26 million product offers originating from 79 thousand websites. For more information and download links for the corpus itself, please follow the links below.
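Consuming one of the training files might look like the sketch below. The file name and the assumption of a JSON-lines layout with a binary "label" column are illustrative only (the exact field and file names are documented on the WDC pages), and the stratified split merely mimics the provided validation ID sets.

import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical file name; assumes a JSON-lines layout with a binary "label" column.
pairs = pd.read_json("computers_train_small.json.gz", lines=True)
train_idx, val_idx = train_test_split(
    pairs.index, test_size=0.2, stratify=pairs["label"], random_state=42)
print(len(train_idx), "training pairs,", len(val_idx), "validation pairs")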
A medical abbreviation expansion dataset which applies web-scale reverse substitution (wsrs) to the C4 dataset, which is a colossal, cleaned version of Common Crawl's web crawl corpus.
The original source is the Common Crawl dataset: https://commoncrawl.org
To use this dataset:
import tensorflow_datasets as tfds

ds = tfds.load('c4_wsrs', split='train')  # load the training split
for ex in ds.take(4):                     # inspect a few examples
  print(ex)
See the guide for more information on tensorflow_datasets.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Seed list generated for domain crawl of lanl.gov
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
TakaraSpider Japanese Web Crawl Dataset
Dataset Summary
TakaraSpider is a large-scale web crawl dataset specifically designed to capture Japanese web content alongside international sources. The dataset contains 257,900 web pages collected through systematic crawling, with a primary focus on Japanese language content (78.5%) while maintaining substantial international representation (21.5%). This makes it ideal for Japanese-English comparative studies, cross-cultural web… See the full description on the dataset page: https://huggingface.co/datasets/takarajordan/takaraspider.
This dataset consists of a complete set of augmented index files for CC-MAIN-2019-35 [1]. This version of the index contains one additional field, lastmod, in about 18% of the entries, giving the value of the Last-Modified header from the HTTP response as a POSIX-format timestamp, enabling much finer-grained longitudinal study of the corresponding web resources. The filename, offset and length fields in the augmented index are unchanged, and so can be used for retrieval from the original WARC files. [1] https://commoncrawl.org/blog/august-2019-crawl-archive-now-available
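Because filename, offset and length are unchanged, a single record can still be fetched from the original WARC files with an HTTP Range request, for example as sketched below; the filename, offset and length values are placeholders to be taken from an actual index entry.

import gzip
import requests

filename = "crawl-data/CC-MAIN-2019-35/segments/.../warc/CC-MAIN-....warc.gz"  # from an index entry
offset, length = 123456, 7890                                                  # from the same entry

resp = requests.get(
    "https://data.commoncrawl.org/" + filename,
    headers={"Range": f"bytes={offset}-{offset + length - 1}"},
    timeout=60,
)
record = gzip.decompress(resp.content)   # one gzip member = one WARC record
print(record[:200].decode("utf-8", errors="replace"))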
Microformat, Microdata and RDFa data from the October 2016 Common Crawl web corpus. We found structured data within 1.24 billion HTML pages out of the 3.2 billion pages contained in the crawl (38%). These pages originate from 5.63 million different pay-level-domains out of the 34 million pay-level-domains covered by the crawl (16.5%). Altogether, the extracted data sets consist of 44.2 billion RDF quads.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a comprehensive list of links to sitemaps and robots.txt files, extracted from the robots.txt WARC archives of the latest Common Crawl dump, CC-MAIN-2023-50 (a minimal sitemap-extraction sketch follows the tables below).
Sitemaps:
| Top-level label of Curlie.org directory | Number of sitemap links |
|---|---|
| Arts | 20110 |
| Business | 68690 |
| Computers | 17404 |
| Games | 3068 |
| Health | 13999 |
| Home | 4130 |
| Kids_and_Teens | 2240 |
| News | 5855 |
| Recreation | 19273 |
| Reference | 10862 |
| Regional | 419 |
| Science | 10729 |
| Shopping | 29903 |
| Society | 35019 |
| Sports | 12597 |
Robots.txt files:
| Top-level label of Curlie.org directory | Number of robots.txt links |
|---|---|
| Arts | 25281 |
| Business | 79497 |
| Computers | 21880 |
| Games | 5037 |
| Health | 17326 |
| Home | 5401 |
| Kids_and_Teens | 3753 |
| News | 3424 |
| Recreation | 26355 |
| Reference | 15404 |
| Regional | 678 |
| Science | 16500 |
| Shopping | 30266 |
| Society | 45397 |
| Sports | 18029 |
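For reference, the sketch below shows the basic idea of pulling sitemap URLs out of a robots.txt body via its (case-insensitive, possibly repeated) Sitemap: directive; it is an illustration, not the extraction code used to build these lists.

import re

def sitemaps_from_robots(robots_txt):
    """Return all URLs declared via Sitemap: directives."""
    return re.findall(r"(?im)^\s*sitemap:\s*(\S+)", robots_txt)

example = "User-agent: *\nDisallow: /private/\nSitemap: https://example.org/sitemap.xml\n"
print(sitemaps_from_robots(example))  # ['https://example.org/sitemap.xml']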
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A corpus of 471,085,690 English sentences extracted from the ClueWeb12 Web Crawl. The sentences were sampled from a larger corpus to achieve a level of sentence complexity similar to that of sentences that humans make up as memory aids for remembering passwords. Sentence complexity was determined by syllables per word (a toy version of this measure is sketched below).
The corpus is split into training and test sets as used in the associated publication. The test set is extracted from part 00 of the ClueWeb12, while the training set is extracted from the other parts.
More information on the corpus can be found on the corpus web page at our university (listed under documented by).
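As a toy version of the complexity measure, syllables per word can be approximated with a simple vowel-group heuristic, as below; the publication's exact syllable counter is not reproduced here.

import re

def syllables(word):
    # Count vowel groups as a crude syllable estimate (minimum one per word).
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def syllables_per_word(sentence):
    words = re.findall(r"[A-Za-z]+", sentence)
    return sum(syllables(w) for w in words) / max(1, len(words))

print(round(syllables_per_word("The quick brown fox jumps over the lazy dog"), 2))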
Includes publicly accessible, human-readable material from Australian government websites, using the Organisations Register (AGOR) as a seed list and domain categorisation source, obeying robots.txt and sitemap.xml directives, gathered over a 10-day period.
Several non-*.gov.au domains are included in the AGOR - these have been crawled up to a limit of 10K URLs.
Several binary file formats are included and converted to HTML: doc, docm, docx, dot, epub, keys, numbers, pages, pdf, ppt, pptm, pptx, rtf, xls, xlsm, xlsx.
URLs returning responses larger than 10MB are not included in the dataset.
Raw gathered data (including metadata) is published in the Web Archive (WARC) format, both as a single multi-gigabyte WARC file and as a split series.
Metadata extracted from pages after filtering is published in JSON format, with fields defined in a data dictionary.
Web content contained within these WARC files has originally been authored by the agency hosting the referenced material. Authoring agencies are responsible for the choice of licence attached to the original material.
A consistent licence across the entirety of the WARC files' contents should not be assumed. Agencies may have indicated copyright and licence information for a given URL as metadata embedded in a WARC file entry, but this should not be assumed to be present for all WARC entries.
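The published WARC files can be read with standard tooling. The sketch below uses warcio and a placeholder file name, and, per the caveats above, does not assume any particular licence-metadata field is present.

from warcio.archiveiterator import ArchiveIterator

with open("agor-crawl.warc.gz", "rb") as stream:       # placeholder file name
    for record in ArchiveIterator(stream):
        if record.rec_type == "response":
            print(record.rec_headers.get_header("WARC-Target-URI"),
                  record.http_headers.get_header("Content-Type"))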
More and more websites have started to embed structured data describing products, people, organizations, places, events into their HTML pages using markup standards such as RDFa, Microdata and Microformats. The Web Data Commons project extracts this data from several billion web pages. The project provides the extracted data for download and publishes statistics about the deployment of the different formats.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Webis MS MARCO Anchor Text 2022 dataset enriches Versions 1 and 2 of the MS MARCO document collection with anchor text extracted from six Common Crawl snapshots covering the years 2016 to 2021 (between 1.7 and 3.4 billion documents each). Overall, the dataset enriches 1,703,834 documents for Version 1 and 4,821,244 documents for Version 2 with up to 1,000 anchor texts each.
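As a toy illustration of anchor-text harvesting (not the Webis pipeline itself, which matches anchors against MS MARCO document URLs at scale), one might collect anchor texts per target URL like this:

from collections import defaultdict
from bs4 import BeautifulSoup

def collect_anchor_text(pages):
    """Map each link target to the list of anchor texts pointing at it."""
    anchors = defaultdict(list)
    for html in pages:
        for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            text = a.get_text(strip=True)
            if text:
                anchors[a["href"]].append(text)
    return dict(anchors)

print(collect_anchor_text(['<a href="https://example.org/doc1">intro to WARC</a>']))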
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
In recent years, Transformer-based models have led to significant advances in language modelling for natural language processing. However, they require vast amounts of data to be (pre-)trained, and there is a lack of corpora in languages other than English. Recently, several initiatives have presented multilingual datasets obtained from automatic web crawling. However, the results in Spanish have important shortcomings: they are either too small in comparison with other languages, or of low quality owing to sub-optimal cleaning and deduplication. In this paper, we introduce esCorpius, a Spanish crawling corpus obtained from nearly 1 PB of Common Crawl data. It is the most extensive corpus in Spanish with this level of quality in the extraction, purification and deduplication of web textual content. Our data curation process involves a novel, highly parallel cleaning pipeline and encompasses a series of deduplication mechanisms that together ensure the integrity of both document and paragraph boundaries. Additionally, we keep both the source web page URL and the WARC shard origin URL in order to comply with EU regulations. esCorpius has been released under the CC BY-NC-ND 4.0 license.
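As a much-simplified illustration of one ingredient of such a pipeline, exact paragraph-level deduplication can be done by hashing normalised paragraphs; the real esCorpius pipeline is considerably more elaborate.

import hashlib

def dedup_paragraphs(documents):
    """Drop paragraphs whose normalised text has been seen before."""
    seen, cleaned = set(), []
    for doc in documents:
        kept = []
        for para in doc.split("\n"):
            key = hashlib.sha1(para.strip().lower().encode("utf-8")).hexdigest()
            if para.strip() and key not in seen:
                seen.add(key)
                kept.append(para)
        cleaned.append("\n".join(kept))
    return cleaned

print(dedup_paragraphs(["Hola mundo\nHola mundo", "Hola mundo\nAdiós"]))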
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A BitTorrent file to download data with the title 'ClueWeb12_Anchors (anchor text derived from CMU's ClueWeb12 web crawl)'.