Common Crawl sample
A small unofficial random subset of the famous Common Crawl dataset.
Sixty random segment WET files were downloaded from Common Crawl on 2024-05-12. Lines between 500 and 5000 characters long (inclusive) were kept, and only unique texts were retained; no other filtering was applied.
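The filtering step can be illustrated with a minimal sketch (the file name is a placeholder; the WET file is treated as plain gzipped text, relying on the length filter to drop short record-header lines):

```python
import gzip

def filter_wet_lines(path, min_len=500, max_len=5000):
    """Keep unique lines between min_len and max_len characters (inclusive)."""
    seen = set()
    kept = []
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            text = line.rstrip("\n")
            if min_len <= len(text) <= max_len and text not in seen:
                seen.add(text)
                kept.append(text)
    return kept

# Hypothetical local copy of one downloaded WET segment file.
texts = filter_wet_lines("example-segment.warc.wet.gz")
```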
Languages
Each text was assigned a language code using the GCLD3 Python package. The Chinese texts were classified as either simplified, traditional, or Cantonese using the… See the full description on the dataset page: https://huggingface.co/datasets/agentlans/common-crawl-sample.
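For reference, a minimal sketch of per-text language assignment with the GCLD3 Python package (the byte limits and the reliability check are typical usage, not settings taken from this dataset card):

```python
import gcld3

# Neural-network language identifier from the gcld3 package.
detector = gcld3.NNetLanguageIdentifier(min_num_bytes=0, max_num_bytes=1000)

def assign_language(text: str) -> str:
    """Return a language code such as 'en', 'de', or 'zh' ('und' if unreliable)."""
    result = detector.FindLanguage(text=text)
    return result.language if result.is_reliable else "und"

print(assign_language("Le renard brun saute par-dessus le chien paresseux."))  # 'fr'
```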
Common Crawl Statistics
Number of pages, distribution of top-level domains, crawl overlaps, and other basic metrics about the Common Crawl Monthly Crawl Archives. For more detailed information and graphs, please visit our official statistics page. Here you can find the following statistics files:
Charsets
The character set (encoding) is identified for HTML pages only, using Tika's AutoDetectReader. The table shows the percentage of HTML pages encoded in each character set… See the full description on the dataset page: https://huggingface.co/datasets/commoncrawl/statistics.
https://academictorrents.com/nolicensespecified
Common Crawl corpus - training-parallel-commoncrawl.tgz (CS-EN, DE-EN, ES-EN, FR-EN, RU-EN)
The Common Crawl corpus contains petabytes of data collected over 8 years of web crawling. The corpus contains raw web page data, metadata extracts and text extracts. Common Crawl data is stored on Amazon Web Services’ Public Data Sets and on multiple academic cloud platforms across the world.
amazingvince/common-crawl-diverse-sample dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Human Crawl is a dataset for object detection tasks - it contains Crawl annotations for 299 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
This dataset consists of a complete set of augmented index files for CC-MAIN-2019-35 [1]. This version of the index contains one additional field, lastmod, in about 18% of the entries, giving the value of the Last-Modified header from the HTTP response as a POSIX-format timestamp, enabling much finer-grained longitudinal study of the corresponding web resources. The filename, offset and length fields in the augmented index are unchanged, and so can be used for retrieval from the original WARC files. [1] https://commoncrawl.org/blog/august-2019-crawl-archive-now-available
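Because the filename, offset, and length fields are unchanged, a record can still be fetched from the original WARC files with an HTTP range request. A minimal sketch, assuming the public data.commoncrawl.org endpoint and placeholder index values:

```python
import gzip
import requests

def fetch_warc_record(filename: str, offset: int, length: int) -> str:
    """Fetch one WARC record using the filename/offset/length fields of an index entry."""
    url = f"https://data.commoncrawl.org/{filename}"
    headers = {"Range": f"bytes={offset}-{offset + length - 1}"}
    resp = requests.get(url, headers=headers, timeout=60)
    resp.raise_for_status()
    # Each record is an independently gzipped member, so it can be decompressed on its own.
    return gzip.decompress(resp.content).decode("utf-8", errors="replace")

# Placeholder values for illustration; real ones come from the index entries.
# record = fetch_warc_record("crawl-data/CC-MAIN-2019-35/segments/.../example.warc.gz", 1234567, 8910)
```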
Traditional Chinese C4
Dataset Summary
Data obtained from the 2025-18 and 2025-13 Common Crawl archives, downloaded and processed using code based on another project attempting to recreate the C4 dataset. The resultant dataset contains both simplified and traditional Chinese. It was then filtered using a modified list of simplified Chinese characters to obtain another, traditional-Chinese-only dataset. I am still ironing out the filtering process. The 2025-13 dataset was deduplicated… See the full description on the dataset page: https://huggingface.co/datasets/jed351/Chinese-Common-Crawl-Filtered.
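The character-based filter mentioned above can be sketched roughly as follows; the simplified-only character set here is a tiny hypothetical excerpt, not the modified list actually used:

```python
# Hypothetical excerpt of characters that occur only in simplified Chinese.
SIMPLIFIED_ONLY = set("个们这国书东语简")

def looks_traditional(text: str, max_simplified_hits: int = 0) -> bool:
    """Keep a text as traditional Chinese if it contains no simplified-only characters."""
    hits = sum(1 for ch in text if ch in SIMPLIFIED_ONLY)
    return hits <= max_simplified_hits

corpus = ["這是一段繁體中文文字。", "这是一段简体中文文字。"]
traditional_subset = [t for t in corpus if looks_traditional(t)]  # keeps only the first text
```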
https://networkrepository.com/policy.php
Host-level Web Graph - This graph aggregates the page-level graph by host/subdomain: each node represents a specific host (subdomain), and an edge exists between a pair of hosts if at least one link was found between pages belonging to those hosts. The hyperlink graph was extracted from the Web corpus released by the Common Crawl Foundation in August 2012. The Web corpus was gathered using a web crawler employing a breadth-first-search selection strategy and performing link discovery while crawling. The crawl was seeded with a large number of URLs from former crawls performed by the Common Crawl Foundation. Also see web-cc12-firstlevel-subdomain and web-cc12-PayLevelDomain.
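As an illustration of the host-level aggregation described above, a minimal sketch that collapses page-level links into unique host-to-host edges (the page links are made up):

```python
from urllib.parse import urlsplit

page_edges = [
    ("https://blog.example.org/post/1", "https://www.example.com/about"),
    ("https://blog.example.org/post/2", "https://www.example.com/"),
    ("https://www.example.com/", "https://docs.example.net/index.html"),
]

host_edges = set()
for src, dst in page_edges:
    src_host, dst_host = urlsplit(src).hostname, urlsplit(dst).hostname
    if src_host and dst_host and src_host != dst_host:
        # One edge per pair of hosts, however many page-level links exist between them.
        host_edges.add((src_host, dst_host))

print(sorted(host_edges))
```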
The ‘Ancillary Monitor Corpus: Common Crawl - german web’ was designed to enable a broad-based linguistic analysis of the German-language (visible) internet over time, with the aim of achieving comparability with DeReKo (the ‘German Reference Corpus’ of the Leibniz Institute for the German Language; DeReKo volume: 57 billion tokens; status: DeReKo Release 2024-I). The corpus is separated by year (here: 2017) and versioned (here: version 1). Version 1 comprises 97.45 billion tokens across all years (2013-2024). The corpus is based on the data dumps from CommonCrawl (https://commoncrawl.org/). CommonCrawl is a non-profit organisation that provides copies of the visible Internet free of charge for research purposes. The CommonCrawl WET raw data was first filtered by TLD (top-level domain); only pages under the following TLDs were taken into account: .at; .bayern; .berlin; .ch; .cologne; .de; .gmbh; .hamburg; .koeln; .nrw; .ruhr; .saarland; .swiss; .tirol; .wien; .zuerich. These are the exclusively German-language TLDs according to ICANN (https://data.iana.org/TLD/tlds-alpha-by-domain.txt) as of 1 June 2024; TLDs with a purely corporate reference (e.g. .edeka; .bmw; .ford) were excluded. The language of the individual documents (URLs) was then estimated with NTextCat (https://github.com/ivanakcheurov/ntextcat), using its CORE14 profile; only documents/URLs for which German was the most likely language were processed further (e.g. to exclude foreign-language material such as individual subpages). The third step involved filtering for manual selectors and removing 1:1 duplicates (within one year). Filtering and subsequent processing were carried out using CorpusExplorer (http://hdl.handle.net/11234/1-2634) and our own (supplementary) scripts, and TreeTagger (http://hdl.handle.net/11372/LRT-323) was used for automatic annotation. The corpus was processed on the HELIX HPC cluster. The author would like to take this opportunity to thank the state of Baden-Württemberg and the German Research Foundation (DFG) for the possibility to use the bwHPC/HELIX HPC cluster (HPC cluster funding code: INST 35/1597-1 FUGG).
Data content:
- Tokens and record boundaries
- Automatic lemma and POS annotation (using TreeTagger)
- Metadata:
  - GUID - unique identifier of the document
  - YEAR - year of capture (please use this information for data slices)
  - Url - full URL
  - Tld - top-level domain
  - Domain - domain without TLD (but with sub-domains if applicable)
  - DomainFull - complete domain (incl. TLD)
  - Datum (system information) - date recorded by CorpusExplorer (date of capture by CommonCrawl, not the date of creation/modification of the document)
  - Hash (system information) - SHA1 hash of the CommonCrawl data
  - Pfad (system information) - path on the cluster (raw data), supplied by the system
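The TLD filtering step described above can be sketched as follows (the TLD list is copied from the description; the URLs are placeholders, and the subsequent NTextCat language check is not reproduced here):

```python
from urllib.parse import urlsplit

GERMAN_TLDS = {
    "at", "bayern", "berlin", "ch", "cologne", "de", "gmbh", "hamburg",
    "koeln", "nrw", "ruhr", "saarland", "swiss", "tirol", "wien", "zuerich",
}

def has_german_tld(url: str) -> bool:
    """True if the URL's host ends in one of the listed German-language TLDs."""
    host = urlsplit(url).hostname or ""
    return host.rsplit(".", 1)[-1].lower() in GERMAN_TLDS

urls = ["https://www.beispiel.de/seite", "https://www.example.com/page"]
candidates = [u for u in urls if has_german_tld(u)]  # keeps only the .de URL
```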
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Historical Dataset of Lester Crawl Primary Center is provided by PublicSchoolReview and contains statistics on the following metrics: Total Students Trends Over Years (2009-2023), Total Classroom Teachers Trends Over Years (2010-2023), Student-Teacher Ratio Comparison Over Years (2010-2023), Asian Student Percentage Comparison Over Years (2008-2022), Hispanic Student Percentage Comparison Over Years (2009-2023), Black Student Percentage Comparison Over Years (2009-2023), White Student Percentage Comparison Over Years (2009-2023), Two or More Races Student Percentage Comparison Over Years (2013-2023), Diversity Score Comparison Over Years (2009-2023), Free Lunch Eligibility Comparison Over Years (2009-2023), and Reduced-Price Lunch Eligibility Comparison Over Years (2010-2023).
When crawling and mobile indexing are ensured, it becomes easier for crawlers and Internet users to visit a site, and its discovery by search engines is facilitated. According to the source, in 2020 more than ** percent of SEOs attached great importance to internal linking, that is, the presence of internal links pointing to the page to be highlighted. They considered all the criteria in the crawl category to be important, with the exception of the priority indication in the sitemap, to which only ** percent of SEOs attached meaning, with an importance of **** out of five.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a comprehensive list of links to sitemaps and robots.txt files, extracted from the latest robots.txt WARC archive dump (2023-50).
Sitemaps:
| Top-level label of the Curlie.org directory | Number of sitemap links |
|---|---|
| Arts | 20110 |
| Business | 68690 |
| Computers | 17404 |
| Games | 3068 |
| Health | 13999 |
| Home | 4130 |
| Kids_and_Teens | 2240 |
| News | 5855 |
| Recreation | 19273 |
| Reference | 10862 |
| Regional | 419 |
| Science | 10729 |
| Shopping | 29903 |
| Society | 35019 |
| Sports | 12597 |
Robots.txt files:
| Top-level label of the Curlie.org directory | Number of robots.txt links |
|---|---|
| Arts | 25281 |
| Business | 79497 |
| Computers | 21880 |
| Games | 5037 |
| Health | 17326 |
| Home | 5401 |
| Kids_and_Teens | 3753 |
| News | 3424 |
| Recreation | 26355 |
| Reference | 15404 |
| Regional | 678 |
| Science | 16500 |
| Shopping | 30266 |
| Society | 45397 |
| Sports | 18029 |
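As a rough sketch of how sitemap links can be pulled out of a robots.txt body (the robots.txt content below is made up; the dataset's own extraction pipeline may differ):

```python
robots_txt = """\
User-agent: *
Disallow: /private/
Sitemap: https://www.example.com/sitemap.xml
Sitemap: https://www.example.com/news-sitemap.xml
"""

sitemap_links = [
    line.split(":", 1)[1].strip()
    for line in robots_txt.splitlines()
    if line.lower().startswith("sitemap:")
]
print(sitemap_links)  # ['https://www.example.com/sitemap.xml', 'https://www.example.com/news-sitemap.xml']
```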
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset includes news articles gathered from CommonCrawl for media outlets that were selected based on their political orientation. The news articles span publication dates from 2010 to 2021.
Common Crawl Citations Overview
This dataset contains citations referencing Common Crawl Foundation and its datasets, pulled from Google Scholar. Please note that these citations are not curated, so they will include some false positives. For an annotated subset of these citations with additional fields, please see citations-annotated.
A medical abbreviation expansion dataset created by applying web-scale reverse substitution (WSRS) to C4, a colossal, cleaned version of Common Crawl's web crawl corpus.
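As a rough illustration of the reverse-substitution idea (not the dataset's actual pipeline), expansions found in web text are replaced by their abbreviations to create training pairs; the dictionary below is a tiny placeholder:

```python
import re

# Hypothetical abbreviation dictionary for illustration only.
EXPANSIONS = {
    "atrial fibrillation": "af",
    "blood pressure": "bp",
}

def reverse_substitute(text: str):
    """Return (abbreviated_text, original_text) by swapping expansions for abbreviations."""
    abbreviated = text
    for expansion, abbrev in EXPANSIONS.items():
        abbreviated = re.sub(re.escape(expansion), abbrev, abbreviated, flags=re.IGNORECASE)
    return abbreviated, text

pair = reverse_substitute("The patient has atrial fibrillation and high blood pressure.")
```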
The original source is the Common Crawl dataset: https://commoncrawl.org
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('c4_wsrs', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset tracks the annual total number of students from 2009 to 2023 for Lester Crawl Primary Center.
An automatic pipeline, based on an algorithm that identifies new resources in publications every month, to improve the efficiency of NIF curators. The pipeline can also find the last time a resource's webpage was updated and whether the URL is still valid, which helps curators know which resources need attention. Additionally, the pipeline identifies publications that reference existing NIF Registry resources, as this is also of interest; these mentions are available through the Data Federation version of the NIF Registry, http://neuinfo.org/nif/nifgwt.html?query=nlx_144509. The ranking is based on an algorithm that measures how related each resource is to neuroscience (hits of neuroscience-related terms): each potential resource is assigned a score, the resources are ranked, and a list is generated.
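A minimal sketch of the term-hit scoring and ranking idea mentioned above (the term list and resource descriptions are placeholders, not the pipeline's actual data):

```python
NEURO_TERMS = {"neuron", "cortex", "synapse", "fmri", "hippocampus"}

def neuroscience_score(text: str) -> int:
    """Count occurrences of neuroscience-related terms in a text."""
    words = text.lower().split()
    return sum(words.count(term) for term in NEURO_TERMS)

resources = {
    "resource-a": "A database of cortex and hippocampus imaging studies",
    "resource-b": "A registry of general purpose web tools",
}
ranked = sorted(resources, key=lambda name: neuroscience_score(resources[name]), reverse=True)
```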
codymd/common-crawl-sample dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset tracks the annual total number of classroom teachers from 2010 to 2023 for Lester Crawl Primary Center.