The Common Crawl corpus contains petabytes of data collected over 8 years of web crawling. The corpus contains raw web page data, metadata extracts and text extracts. Common Crawl data is stored on Amazon Web Services’ Public Data Sets and on multiple academic cloud platforms across the world.
https://networkrepository.com/policy.php
Host-level Web Graph - This graph aggregates the page graph by subdomain/host: each node represents a subdomain/host, and an edge exists between a pair of hosts if at least one link was found between pages belonging to those hosts. The hyperlink graph was extracted from the Web corpus released by the Common Crawl Foundation in August 2012. The Web corpus was gathered using a web crawler employing a breadth-first-search selection strategy and embedding link discovery while crawling. The crawl was seeded with a large number of URLs from former crawls performed by the Common Crawl Foundation. Also, see web-cc12-firstlevel-subdomain and web-cc12-PayLevelDomain.
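As a rough illustration of this aggregation step, the sketch below collapses a page-level edge list into host-level edges. The file name and the plain "source_url target_url" edge format are assumptions made for the example, not necessarily the released graph's actual layout.

from urllib.parse import urlparse

host_edges = set()
with open("page_edges.txt", encoding="utf-8") as f:        # assumed: "source_url target_url" per line
    for line in f:
        src, dst = line.split()
        src_host, dst_host = urlparse(src).hostname, urlparse(dst).hostname
        if src_host and dst_host and src_host != dst_host:
            host_edges.add((src_host, dst_host))           # keep one edge per linked host pair
print(len(host_edges), "host-level edges")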
https://choosealicense.com/licenses/cc/
The Common Crawl Creative Commons Corpus (C5)
Raw CommonCrawl crawls, annotated with Creative Commons license information
C5 is an effort to collect Creative Commons-licensed web data in one place. The licensing information is extracted from each web page based on whether it links to a Creative Commons license, either overtly in <a> tags (as in the footer of Wikipedia) or in metadata fields indicating deliberate Creative Commons publication. However, false positives may occur. See the full description on the dataset page: https://huggingface.co/datasets/BramVanroy/CommonCrawl-CreativeCommons.
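A minimal sketch of that detection idea is shown below, assuming BeautifulSoup and checking only href attributes of <a> and <link> elements for creativecommons.org license paths; the dataset's actual extraction logic is more involved and also inspects other metadata fields.

from bs4 import BeautifulSoup

def find_cc_license_links(html):
    """Return hrefs of <a>/<link> elements that point at CC license URLs."""
    soup = BeautifulSoup(html, "html.parser")
    hits = []
    for tag in soup.find_all(["a", "link"], href=True):
        href = tag["href"]
        if "creativecommons.org/licenses/" in href or "creativecommons.org/publicdomain/" in href:
            hits.append(href)
    return hits

html = '<footer><a href="https://creativecommons.org/licenses/by-sa/4.0/">CC BY-SA</a></footer>'
print(find_cc_license_links(html))  # ['https://creativecommons.org/licenses/by-sa/4.0/']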
The ‘Ancillary Monitor Corpus: Common Crawl - german web’ was designed to enable a broad-based linguistic analysis of the German-language (visible) internet over time, with the aim of achieving comparability with DeReKo (the ‘German Reference Corpus’ of the Leibniz Institute for the German Language; DeReKo volume: 57 billion tokens; status: DeReKo Release 2024-I). The corpus is separated by year (here: 2017) and versioned (here: version 1). Version 1 comprises 97.45 billion tokens across all years (2013-2024).

The corpus is based on the data dumps from CommonCrawl (https://commoncrawl.org/), a non-profit organisation that provides copies of the visible Internet free of charge for research purposes. The CommonCrawl WET raw data was first filtered by TLD (top-level domain). Only pages under the following TLDs were taken into account: .at; .bayern; .berlin; .ch; .cologne; .de; .gmbh; .hamburg; .koeln; .nrw; .ruhr; .saarland; .swiss; .tirol; .wien; .zuerich. These are the exclusively German-language TLDs according to ICANN (https://data.iana.org/TLD/tlds-alpha-by-domain.txt) as of 1 June 2024; TLDs with a purely corporate reference (e.g. .edeka, .bmw, .ford) were excluded. The language of the individual documents (URLs) was then estimated with NTextCat (https://github.com/ivanakcheurov/ntextcat) via its CORE14 profile; only documents/URLs for which German was the most likely language were processed further (e.g. to exclude foreign-language material such as individual subpages). The third step involved filtering by manual selectors and removing 1:1 duplicates (within one year). The filtering and subsequent processing were carried out using CorpusExplorer (http://hdl.handle.net/11234/1-2634) and our own (supplementary) scripts, and the TreeTagger (http://hdl.handle.net/11372/LRT-323) was used for automatic annotation. The corpus was processed on the HELIX HPC cluster. The author would like to take this opportunity to thank the state of Baden-Württemberg and the German Research Foundation (DFG) for the possibility to use the bwHPC/HELIX HPC cluster (funding code: INST 35/1597-1 FUGG).

Data content:
- Tokens and record boundaries
- Automatic lemma and POS annotation (using TreeTagger)
- Metadata:
  - GUID - unique identifier of the document
  - YEAR - year of capture (please use this information for data slices)
  - Url - full URL
  - Tld - top-level domain
  - Domain - domain without TLD (but with sub-domains if applicable)
  - DomainFull - complete domain (incl. TLD)
  - Datum (system information) - date recorded by CorpusExplorer (date of capture by CommonCrawl, not the date of creation/modification of the document)
  - Hash (system information) - SHA1 hash of the CommonCrawl
  - Pfad (system information) - path on the cluster (raw data), supplied by the system
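The first two filter steps (TLD restriction, then language identification) could be approximated in Python roughly as sketched below; warcio and langdetect are stand-ins for the tooling actually used (NTextCat with its CORE14 profile, CorpusExplorer), and the WET file name is a placeholder.

from urllib.parse import urlparse
from warcio.archiveiterator import ArchiveIterator
from langdetect import detect

# Exclusively German-language TLDs per the corpus description.
GERMAN_TLDS = {"at", "bayern", "berlin", "ch", "cologne", "de", "gmbh", "hamburg",
               "koeln", "nrw", "ruhr", "saarland", "swiss", "tirol", "wien", "zuerich"}

def keep_record(record):
    if record.rec_type != "conversion":             # WET records carry extracted text
        return False
    url = record.rec_headers.get_header("WARC-Target-URI") or ""
    host = urlparse(url).hostname or ""
    if host.rsplit(".", 1)[-1] not in GERMAN_TLDS:  # step 1: TLD filter
        return False
    text = record.content_stream().read().decode("utf-8", errors="replace")
    if not text.strip():
        return False
    return detect(text[:2000]) == "de"              # step 2: crude language check

with open("example.warc.wet.gz", "rb") as stream:   # placeholder WET file
    kept = [r.rec_headers.get_header("WARC-Target-URI")
            for r in ArchiveIterator(stream) if keep_record(r)]
print(len(kept), "German-language documents")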
Common Crawl Statistics
Number of pages, distribution of top-level domains, crawl overlaps, etc. - basic metrics about the Common Crawl Monthly Crawl Archives. For more detailed information and graphs, please visit our official statistics page. Here you can find the following statistics files:
Charsets
The character set or encoding is identified for HTML pages only, using Tika's AutoDetectReader. The table shows the percentage of HTML pages encoded with each character set. See the full description on the dataset page: https://huggingface.co/datasets/commoncrawl/statistics.
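The statistic can be approximated as below. Note that Common Crawl's pipeline uses Tika's AutoDetectReader (Java); charset_normalizer serves here only as a stand-in detector for illustration.

from collections import Counter
from charset_normalizer import from_bytes

def charset_histogram(html_payloads):
    """Return the percentage of pages per detected character set."""
    counts = Counter()
    for payload in html_payloads:
        best = from_bytes(payload).best()
        counts[best.encoding if best else "unknown"] += 1
    total = sum(counts.values()) or 1
    return {enc: 100.0 * n / total for enc, n in counts.most_common()}

print(charset_histogram(["<html>äöü</html>".encode("utf-8")]))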
https://choosealicense.com/licenses/openrail/
A colossal, cleaned version of Common Crawl's web crawl corpus, based on the Common Crawl dataset (https://commoncrawl.org). This is AllenAI's processed version of Google's mC4 dataset.
When crawling and mobile indexing are ensured, it is easier for crawlers and Internet users to visit a site and for search engines to discover it. Thus, according to the source, in 2020, more than ** percent of SEOs attached great importance to internal linking, that is, to the presence of internal links pointing to the page to be highlighted. They considered all the criteria in the crawl category to be important, with the exception of the indication of priority in the sitemap, to which only ** percent of SEOs attached importance, with an importance of **** out of five.
Many e-shops have started to mark up product data within their HTML pages using the schema.org vocabulary. The Web Data Commons project regularly extracts such data from the Common Crawl, a large public web crawl. The Web Data Commons Training and Test Sets for Large-Scale Product Matching contain product offers from different e-shops in the form of binary product pairs (with corresponding label "match" or "no match") for four product categories: computers, cameras, watches and shoes. In order to support the evaluation of machine learning-based matching methods, the data is split into training, validation and test sets. For each product category, we provide training sets in four different sizes (2,000-70,000 pairs). Furthermore, sets of IDs for a possible validation split (stratified random draw) are available for each training set. The test set for each product category consists of 1,100 product pairs. The labels of the test sets were manually checked, while those of the training sets were derived using shared product identifiers from the Web as weak supervision. The data stems from the WDC Product Data Corpus for Large-Scale Product Matching - Version 2.0, which consists of 26 million product offers originating from 79 thousand websites. For more information and download links for the corpus itself, please follow the links below.
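Consuming one of the training files might look like the sketch below. The file name and the assumption of a JSON-lines layout with a binary "label" column are illustrative only (the exact field and file names are documented on the WDC pages), and the stratified split merely mimics the provided validation ID sets.

import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical file name; assumes a JSON-lines layout with a binary "label" column.
pairs = pd.read_json("computers_train_small.json.gz", lines=True)
train_idx, val_idx = train_test_split(
    pairs.index, test_size=0.2, stratify=pairs["label"], random_state=42)
print(len(train_idx), "training pairs,", len(val_idx), "validation pairs")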
A medical abbreviation expansion dataset which applies web-scale reverse substitution (wsrs) to the C4 dataset, which is a colossal, cleaned version of Common Crawl's web crawl corpus.
The original source is the Common Crawl dataset: https://commoncrawl.org
To use this dataset:
import tensorflow_datasets as tfds

ds = tfds.load('c4_wsrs', split='train')  # load the training split
for ex in ds.take(4):                     # inspect a few examples
  print(ex)
See the guide for more information on tensorflow_datasets.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Seed list generated for domain crawl of lanl.gov
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
TakaraSpider Japanese Web Crawl Dataset
Dataset Summary
TakaraSpider is a large-scale web crawl dataset specifically designed to capture Japanese web content alongside international sources. The dataset contains 257,900 web pages collected through systematic crawling, with a primary focus on Japanese language content (78.5%) while maintaining substantial international representation (21.5%). This makes it ideal for Japanese-English comparative studies, cross-cultural web… See the full description on the dataset page: https://huggingface.co/datasets/takarajordan/takaraspider.
This dataset consists of a complete set of augmented index files for CC-MAIN-2019-35 [1]. This version of the index contains one additional field, lastmod, in about 18% of the entries, giving the value of the Last-Modified header from the HTTP response as a POSIX-format timestamp, enabling much finer-grained longitudinal study of the corresponding web resources. The filename, offset and length fields in the augmented index are unchanged, and so can be used for retrieval from the original WARC files. [1] https://commoncrawl.org/blog/august-2019-crawl-archive-now-available
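Because filename, offset and length are unchanged, a single record can still be fetched from the original WARC files with an HTTP Range request, for example as sketched below; the filename, offset and length values are placeholders to be taken from an actual index entry.

import gzip
import requests

filename = "crawl-data/CC-MAIN-2019-35/segments/.../warc/CC-MAIN-....warc.gz"  # from an index entry
offset, length = 123456, 7890                                                  # from the same entry

resp = requests.get(
    "https://data.commoncrawl.org/" + filename,
    headers={"Range": f"bytes={offset}-{offset + length - 1}"},
    timeout=60,
)
record = gzip.decompress(resp.content)   # one gzip member = one WARC record
print(record[:200].decode("utf-8", errors="replace"))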
Microformat, Microdata and RDFa data from the October 2016 Common Crawl web corpus. We found structured data within 1.24 billion HTML pages out of the 3.2 billion pages contained in the crawl (38%). These pages originate from 5.63 million different pay-level-domains out of the 34 million pay-level-domains covered by the crawl (16.5%). Altogether, the extracted data sets consist of 44.2 billion RDF quads.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a comprehensive list of links to sitemaps and robots.txt files, extracted from the robots.txt WARC archives of the latest Common Crawl dump, CC-MAIN-2023-50 (a minimal sitemap-extraction sketch follows the tables below).
Sitemaps:
| Top-level label of Curlie.org directory | Number of sitemap links |
|---|---|
| Arts | 20110 |
| Business | 68690 |
| Computers | 17404 |
| Games | 3068 |
| Health | 13999 |
| Home | 4130 |
| Kids_and_Teens | 2240 |
| News | 5855 |
| Recreation | 19273 |
| Reference | 10862 |
| Regional | 419 |
| Science | 10729 |
| Shopping | 29903 |
| Society | 35019 |
| Sports | 12597 |
Robots.txt files:
| Top-level label of Curlie.org directory | Number of robots.txt links |
|---|---|
| Arts | 25281 |
| Business | 79497 |
| Computers | 21880 |
| Games | 5037 |
| Health | 17326 |
| Home | 5401 |
| Kids_and_Teens | 3753 |
| News | 3424 |
| Recreation | 26355 |
| Reference | 15404 |
| Regional | 678 |
| Science | 16500 |
| Shopping | 30266 |
| Society | 45397 |
| Sports | 18029 |
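For reference, the sketch below shows the basic idea of pulling sitemap URLs out of a robots.txt body via its (case-insensitive, possibly repeated) Sitemap: directive; it is an illustration, not the extraction code used to build these lists.

import re

def sitemaps_from_robots(robots_txt):
    """Return all URLs declared via Sitemap: directives."""
    return re.findall(r"(?im)^\s*sitemap:\s*(\S+)", robots_txt)

example = "User-agent: *\nDisallow: /private/\nSitemap: https://example.org/sitemap.xml\n"
print(sitemaps_from_robots(example))  # ['https://example.org/sitemap.xml']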
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A corpus of 471,085,690 English sentences extracted from the ClueWeb12 Web Crawl. The sentences were sampled from a larger corpus to achieve a level of sentence complexity similar to that of sentences that humans make up as memory aids for remembering passwords. Sentence complexity was determined by syllables per word (a toy version of this measure is sketched below).
The corpus is split into training and test sets as used in the associated publication. The test set is extracted from part 00 of the ClueWeb12, while the training set is extracted from the other parts.
More information on the corpus can be found on the corpus web page at our university (listed under documented by).
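As a toy version of the complexity measure, syllables per word can be approximated with a simple vowel-group heuristic, as below; the publication's exact syllable counter is not reproduced here.

import re

def syllables(word):
    # Count vowel groups as a crude syllable estimate (minimum one per word).
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def syllables_per_word(sentence):
    words = re.findall(r"[A-Za-z]+", sentence)
    return sum(syllables(w) for w in words) / max(1, len(words))

print(round(syllables_per_word("The quick brown fox jumps over the lazy dog"), 2))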
Includes publicly accessible, human-readable material from Australian government websites, using the Organisations Register (AGOR) as a seed list and domain categorisation source, obeying robots.txt and sitemap.xml directives, gathered over a 10-day period.
Several non-*.gov.au domains are included in the AGOR - these have been crawled up to a limit of 10K URLs.
Several binary file formats are included and converted to HTML: doc, docm, docx, dot, epub, keys, numbers, pages, pdf, ppt, pptm, pptx, rtf, xls, xlsm, xlsx.
URLs returning responses larger than 10MB are not included in the dataset.
Raw gathered data (including metadata) is published in the Web Archive (WARC) format, both as a single multi-gigabyte WARC file and as a split series.
Metadata extracted from pages after filtering is published in JSON format, with fields defined in a data dictionary.
Web content contained within these WARC files has originally been authored by the agency hosting the referenced material. Authoring agencies are responsible for the choice of licence attached to the original material.
A consistent licence across the entirety of the WARC files' contents should not be assumed. Agencies may have indicated copyright and licence information for a given URL as metadata embedded in a WARC file entry, but this should not be assumed to be present for all WARC entries.
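The published WARC files can be read with standard tooling. The sketch below uses warcio and a placeholder file name, and, per the caveats above, does not assume any particular licence-metadata field is present.

from warcio.archiveiterator import ArchiveIterator

with open("agor-crawl.warc.gz", "rb") as stream:       # placeholder file name
    for record in ArchiveIterator(stream):
        if record.rec_type == "response":
            print(record.rec_headers.get_header("WARC-Target-URI"),
                  record.http_headers.get_header("Content-Type"))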
More and more websites have started to embed structured data describing products, people, organizations, places, events into their HTML pages using markup standards such as RDFa, Microdata and Microformats. The Web Data Commons project extracts this data from several billion web pages. The project provides the extracted data for download and publishes statistics about the deployment of the different formats.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Webis MS MARCO Anchor Text 2022 dataset enriches Versions 1 and 2 of the MS MARCO document collection with anchor text extracted from six Common Crawl snapshots covering the years 2016 to 2021 (between 1.7 and 3.4 billion documents each). Overall, the dataset enriches 1,703,834 documents for Version 1 and 4,821,244 documents for Version 2 with up to 1,000 anchor texts each.
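As a toy illustration of anchor-text harvesting (not the Webis pipeline itself, which matches anchors against MS MARCO document URLs at scale), one might collect anchor texts per target URL like this:

from collections import defaultdict
from bs4 import BeautifulSoup

def collect_anchor_text(pages):
    """Map each link target to the list of anchor texts pointing at it."""
    anchors = defaultdict(list)
    for html in pages:
        for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            text = a.get_text(strip=True)
            if text:
                anchors[a["href"]].append(text)
    return dict(anchors)

print(collect_anchor_text(['<a href="https://example.org/doc1">intro to WARC</a>']))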
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
In recent years, Transformer-based models have led to significant advances in language modelling for natural language processing. However, they require vast amounts of data to be (pre-)trained, and there is a lack of corpora in languages other than English. Recently, several initiatives have presented multilingual datasets obtained from automatic web crawling. However, the results in Spanish have important shortcomings: they are either too small in comparison with other languages, or of low quality owing to sub-optimal cleaning and deduplication. In this paper, we introduce esCorpius, a Spanish crawling corpus obtained from nearly 1 PB of Common Crawl data. It is the most extensive corpus in Spanish with this level of quality in the extraction, purification and deduplication of web textual content. Our data curation process involves a novel, highly parallel cleaning pipeline and encompasses a series of deduplication mechanisms that together ensure the integrity of both document and paragraph boundaries. Additionally, we keep both the source web page URL and the WARC shard origin URL in order to comply with EU regulations. esCorpius has been released under the CC BY-NC-ND 4.0 license.
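As a much-simplified illustration of one ingredient of such a pipeline, exact paragraph-level deduplication can be done by hashing normalised paragraphs; the real esCorpius pipeline is considerably more elaborate.

import hashlib

def dedup_paragraphs(documents):
    """Drop paragraphs whose normalised text has been seen before."""
    seen, cleaned = set(), []
    for doc in documents:
        kept = []
        for para in doc.split("\n"):
            key = hashlib.sha1(para.strip().lower().encode("utf-8")).hexdigest()
            if para.strip() and key not in seen:
                seen.add(key)
                kept.append(para)
        cleaned.append("\n".join(kept))
    return cleaned

print(dedup_paragraphs(["Hola mundo\nHola mundo", "Hola mundo\nAdiós"]))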
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A BitTorrent file to download data with the title 'ClueWeb12_Anchors (anchor text derived from CMU's ClueWeb12 web crawl)'.