49 datasets found

statistics
huggingface.co
Updated Nov 20, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Common Crawl Foundation (2024). statistics [Dataset]. https://huggingface.co/datasets/commoncrawl/statistics
Explore at:
Dataset updated
Nov 20, 2024
Dataset provided by
Common Crawlhttp://commoncrawl.org/
Authors
Common Crawl Foundation
Description
Common Crawl Statistics

Number of pages, distribution of top-level domains, crawl overlaps, etc. - basic metrics about Common Crawl Monthly Crawl Archives, for more detailed information and graphs please visit our official statistics page. Here you can find the following statistics files:

Charsets

The character set or encoding of HTML pages only is identified by Tika's AutoDetectReader. The table shows the percentage how character sets have been used to encode HTML pages… See the full description on the dataset page: https://huggingface.co/datasets/commoncrawl/statistics.
Random sample of Common Crawl domains from 2021
kaggle.com
Updated Aug 19, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
HiHarshSinghal (2021). Random sample of Common Crawl domains from 2021 [Dataset]. https://www.kaggle.com/datasets/harshsinghal/random-sample-of-common-crawl-domains-from-2021/code
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Aug 19, 2021
Dataset provided by
Kagglehttp://kaggle.com/
Authors
HiHarshSinghal
Description
Context

Common Crawl project has fascinated me ever since I learned about it. It provides a large number of data formats and presents challenges across skill and interest areas. I am particularly interested in URL analysis for applications such as typosquatting, malicious URLs, and just about anything interesting that can be done with domain names.

Content

I have sampled 1% of the domains from the Common Crawl Index dataset that is available on AWS in Parquet format. You can read more about how I extracted this dataset @ https://harshsinghal.dev/create-a-url-dataset-for-nlp/

Acknowledgements

Thanks a ton to the folks at https://commoncrawl.org/ for making this immensely valuable resource available to the world for free. Please find their Terms of Use here.

Inspiration

My interests are in working with string similarity functions and I continue to find scalable ways of doing this. I wrote about using a Postgres extension to compute string distances and used Common Crawl URL domains as the input dataset (you can read more @ https://harshsinghal.dev/postgres-text-similarity-with-commoncrawl-domains/).

I am also interested in identifying fraudulent domains and understanding malicious URL patterns.
P
C4 Dataset
paperswithcode.com
Updated Dec 13, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Colin Raffel; Noam Shazeer; Adam Roberts; Katherine Lee; Sharan Narang; Michael Matena; Yanqi Zhou; Wei Li; Peter J. Liu (2023). C4 Dataset [Dataset]. https://paperswithcode.com/dataset/c4
Explore at:
Dataset updated
Dec 13, 2023
Authors
Colin Raffel; Noam Shazeer; Adam Roberts; Katherine Lee; Sharan Narang; Michael Matena; Yanqi Zhou; Wei Li; Peter J. Liu
Description
C4 is a colossal, cleaned version of Common Crawl's web crawl corpus. It was based on Common Crawl dataset: https://commoncrawl.org. It was used to train the T5 text-to-text Transformer models.

The dataset can be downloaded in a pre-processed form from allennlp.
T
c4_wsrs
tensorflow.org
Updated Dec 22, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2022). c4_wsrs [Dataset]. https://www.tensorflow.org/datasets/catalog/c4_wsrs
Explore at:
Dataset updated
Dec 22, 2022
Description
A medical abbreviation expansion dataset which applies web-scale reverse substitution (wsrs) to the C4 dataset, which is a colossal, cleaned version of Common Crawl's web crawl corpus.

The original source is the Common Crawl dataset: https://commoncrawl.org

To use this dataset:

import tensorflow_datasets as tfds ds = tfds.load('c4_wsrs', split='train') for ex in ds.take(4): print(ex)

See the guide for more informations on tensorflow_datasets.
P
Common Crawl Dataset
paperswithcode.com
opendatalab.com
Updated Oct 7, 2014
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2014). Common Crawl Dataset [Dataset]. https://paperswithcode.com/dataset/common-crawl
Explore at:
Dataset updated
Oct 7, 2014
Description
The Common Crawl corpus contains petabytes of data collected over 12 years of web crawling. The corpus contains raw web page data, metadata extracts and text extracts. Common Crawl data is stored on Amazon Web Services’ Public Data Sets and on multiple academic cloud platforms across the world.
citations
huggingface.co
Updated Jul 30, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Common Crawl Foundation (2024). citations [Dataset]. https://huggingface.co/datasets/commoncrawl/citations
Explore at:
Dataset updated
Jul 30, 2024
Dataset provided by
Common Crawlhttp://commoncrawl.org/
Authors
Common Crawl Foundation
Description
Common Crawl Citations Overview

This dataset contains citations referencing Common Crawl Foundation and its datasets, pulled from Google Scholar. Please note that these citations are not curated, so they will include some false positives. For an annotated subset of these citations with additional fields, please see citations-annotated.
Common Crawl News 20200110212037-00310
kaggle.com
Updated Jan 11, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Gabriel Altay (2020). Common Crawl News 20200110212037-00310 [Dataset]. https://www.kaggle.com/gabrielaltay/common-crawl-news-2020011021203700310/discussion
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jan 11, 2020
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Gabriel Altay
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
This dataset contains a single Web ARChive (WARC) file downloaded from the Common Crawl S3 bucket. It covers part of 2020-01-10 (2020 January 10th).

https://commoncrawl.org/2016/10/news-dataset-available/

https://commoncrawl.org/the-data/get-started/
l
Ancillary Monitor Corpus: Common Crawl - german web (YEAR 2024 – VERSION 1)
lindat.cz
Updated Feb 1, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jan Oliver Rüdiger (2025). Ancillary Monitor Corpus: Common Crawl - german web (YEAR 2024 – VERSION 1) [Dataset]. https://lindat.cz/repository/xmlui/handle/11372/LRT-5811?show=full
Explore at:
Dataset updated
Feb 1, 2025
Authors
Jan Oliver Rüdiger
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
*** german version see below ***

The ‘Ancillary Monitor Corpus: Common Crawl - german web’ was designed with the aim of enabling a broad-based linguistic analysis of the German-language (visible) internet over time - with the aim of achieving comparability with the DeReKo (‘German Reference Corpus’ of the Leibniz Institute for the German Language - DeReKo volume 57 billion tokens - status: DeReKo Release 2024-I). The corpus is separated by year (here year 2024) and versioned (here version 1). Version 1 comprises (all years 2013-2024) 97.45 billion tokens.

The corpus is based on the data dumps from CommonCrawl (https://commoncrawl.org/). CommonCrawl is a non-profit organisation that provides copies of the visible Internet free of charge for research purposes.

The CommonCrawl WET raw data was first filtered by TLD (top-level domain). Only pages ending in the following TLDs were taken into account: ‘.at; .bayern; .berlin; .ch; .cologne; .de; .gmbh; .hamburg; .koeln; .nrw; .ruhr; .saarland; .swiss; .tirol; .wien; .zuerich’. These are the exclusive German-language TLDs according to ICANN (https://data.iana.org/TLD/tlds-alpha-by-domain.txt) as of 1 June 2024 - TLDs with a purely corporate reference (e.g. ‘.edeka; .bmw; .ford’) were excluded. The language of the individual documents (URLs) was then estimated with the help of NTextCat (https://github.com/ivanakcheurov/ntextcat) (via the CORE14 profile of NTextCat) - only those documents/URLs for which German was the most likely language were processed further (e.g. to exclude foreign-language material such as individual subpages). The third step involved filtering for manual selectors and filtering for 1:1 duplicates (within one year).

The filtering and subsequent processing was carried out using CorpusExplorer (http://hdl.handle.net/11234/1-2634) and our own (supplementary) scripts, and the TreeTagger (http://hdl.handle.net/11372/LRT-323) was used for automatic annotation. The corpus was processed on the HELIX HPC cluster. The author would like to take this opportunity to thank the state of Baden-Württemberg and the German Research Foundation (DFG) for the possibility to use the bwHPC/HELIX HPC cluster - funding code HPC cluster: INST 35/1597-1 FUGG.

Data content: - Tokens and record boundaries - Automatic lemma and POS annotation (using TreeTagger) - Metadata: - GUID - Unique identifier of the document - YEAR - Year of capture (please use this information for data slices) - Url - Full URL - Tld - Top-Level Domain - Domain - Domain without TLD (but with sub-domains if applicable) - DomainFull - Complete domain (incl. TLD) - DomainFull - Complete domain (incl. TLD) - Datum - (System Information): Date of the CorpusExplorer (date of capture by CommonCrawl - not date of creation/modification of the document). - Hash - (System Information): SHA1 hash of the CommonCrawl - Pfad - (System Information): Path of the cluster (raw data) - is supplied by the system.

Please note that the files are saved as *.cec6.gz. These are binary files of the CorpusExplorer (see above). These files ensure efficient archiving. You can use both CorpusExplorer and the ‘CEC6-Converter’ (available for Linux, MacOS and Windows - see: https://lindat.mff.cuni.cz/repository/xmlui/handle/11372/LRT-5705) to convert the data. The data can be exported in the following formats:

CATMA v6

CoNLL

CSV

CSV (only meta-data)

DTA TCF-XML

DWDS TEI-XML

HTML

IDS I5-XML

IDS KorAP XML

IMS Open Corpus Workbench

JSON

OPUS Corpus Collection XCES

Plaintext

SaltXML

SlashA XML

SketchEngine VERT

SPEEDy/CODEX (JSON)

TLV-XML

TreeTagger

TXM

WebLicht

XML

Please note that an export increases the storage space requirement extensively. The ‘CorpusExplorerConsole’ (https://github.com/notesjor/CorpusExplorer.Terminal.Console - available for Linux, MacOS and Windows) also offers a simple solution for editing and analysing. If you have any questions, please contact the author.

Legal information The data was downloaded on 01.11.2024. The use, processing and distribution is subject to §60d UrhG (german copyright law), which authorises the use for non-commercial purposes in research and teaching. LINDAT/CLARIN is responsible for long-term archiving in accordance with §69d para. 5 and ensures that only authorised persons can access the data. The data has been checked to the best of our knowledge and belief (on a random basis) - should you nevertheless find legal violations (e.g. right to be forgotten, personal rights, etc.), please write an e-mail to the author (amc_report@jan-oliver-ruediger.de) with the following information: 1) why this content is undesirable (please outline only briefly) and 2) how the content can be identified - e.g. file name, URL or domain, etc. The author will endeavour to identify the content. The author will endeavour to remove the content and re-upload the data (modified) within two weeks (new version). If
l
Ancillary Monitor Corpus: Common Crawl - german web (YEAR 2022 – VERSION 1)
lindat.cz
Updated Nov 21, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jan Oliver Rüdiger (2024). Ancillary Monitor Corpus: Common Crawl - german web (YEAR 2022 – VERSION 1) [Dataset]. https://lindat.cz/repository/xmlui/handle/11372/LRT-5794
Explore at:
Dataset updated
Nov 21, 2024
Authors
Jan Oliver Rüdiger
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
*** german version see below ***

The ‘Ancillary Monitor Corpus: Common Crawl - german web’ was designed with the aim of enabling a broad-based linguistic analysis of the German-language (visible) internet over time - with the aim of achieving comparability with the DeReKo (‘German Reference Corpus’ of the Leibniz Institute for the German Language - DeReKo volume 57 billion tokens - status: DeReKo Release 2024-I). The corpus is separated by year (here year 2022) and versioned (here version 1). Version 1 comprises (all years 2013-2024) 97.45 billion tokens.

The corpus is based on the data dumps from CommonCrawl (https://commoncrawl.org/). CommonCrawl is a non-profit organisation that provides copies of the visible Internet free of charge for research purposes.

The CommonCrawl WET raw data was first filtered by TLD (top-level domain). Only pages ending in the following TLDs were taken into account: ‘.at; .bayern; .berlin; .ch; .cologne; .de; .gmbh; .hamburg; .koeln; .nrw; .ruhr; .saarland; .swiss; .tirol; .wien; .zuerich’. These are the exclusive German-language TLDs according to ICANN (https://data.iana.org/TLD/tlds-alpha-by-domain.txt) as of 1 June 2024 - TLDs with a purely corporate reference (e.g. ‘.edeka; .bmw; .ford’) were excluded. The language of the individual documents (URLs) was then estimated with the help of NTextCat (https://github.com/ivanakcheurov/ntextcat) (via the CORE14 profile of NTextCat) - only those documents/URLs for which German was the most likely language were processed further (e.g. to exclude foreign-language material such as individual subpages). The third step involved filtering for manual selectors and filtering for 1:1 duplicates (within one year).

The filtering and subsequent processing was carried out using CorpusExplorer (http://hdl.handle.net/11234/1-2634) and our own (supplementary) scripts, and the TreeTagger (http://hdl.handle.net/11372/LRT-323) was used for automatic annotation. The corpus was processed on the HELIX HPC cluster. The author would like to take this opportunity to thank the state of Baden-Württemberg and the German Research Foundation (DFG) for the possibility to use the bwHPC/HELIX HPC cluster - funding code HPC cluster: INST 35/1597-1 FUGG.

Data content: - Tokens and record boundaries - Automatic lemma and POS annotation (using TreeTagger) - Metadata: - GUID - Unique identifier of the document - YEAR - Year of capture (please use this information for data slices) - Url - Full URL - Tld - Top-Level Domain - Domain - Domain without TLD (but with sub-domains if applicable) - DomainFull - Complete domain (incl. TLD) - DomainFull - Complete domain (incl. TLD) - Datum - (System Information): Date of the CorpusExplorer (date of capture by CommonCrawl - not date of creation/modification of the document). - Hash - (System Information): SHA1 hash of the CommonCrawl - Pfad - (System Information): Path of the cluster (raw data) - is supplied by the system.

Please note that the files are saved as *.cec6.gz. These are binary files of the CorpusExplorer (see above). These files ensure efficient archiving. You can use both CorpusExplorer and the ‘CEC6-Converter’ (available for Linux, MacOS and Windows - see: https://lindat.mff.cuni.cz/repository/xmlui/handle/11372/LRT-5705) to convert the data. The data can be exported in the following formats:

CATMA v6

CoNLL

CSV

CSV (only meta-data)

DTA TCF-XML

DWDS TEI-XML

HTML

IDS I5-XML

IDS KorAP XML

IMS Open Corpus Workbench

JSON

OPUS Corpus Collection XCES

Plaintext

SaltXML

SlashA XML

SketchEngine VERT

SPEEDy/CODEX (JSON)

TLV-XML

TreeTagger

TXM

WebLicht

XML

Please note that an export increases the storage space requirement extensively. The ‘CorpusExplorerConsole’ (https://github.com/notesjor/CorpusExplorer.Terminal.Console - available for Linux, MacOS and Windows) also offers a simple solution for editing and analysing. If you have any questions, please contact the author.

Legal information The data was downloaded on 01.11.2024. The use, processing and distribution is subject to §60d UrhG (german copyright law), which authorises the use for non-commercial purposes in research and teaching. LINDAT/CLARIN is responsible for long-term archiving in accordance with §69d para. 5 and ensures that only authorised persons can access the data. The data has been checked to the best of our knowledge and belief (on a random basis) - should you nevertheless find legal violations (e.g. right to be forgotten, personal rights, etc.), please write an e-mail to the author (amc_report@jan-oliver-ruediger.de) with the following information: 1) why this content is undesirable (please outline only briefly) and 2) how the content can be identified - e.g. file name, URL or domain, etc. The author will endeavour to identify the content. The author will endeavour to remove the content and re-upload the data (modified) within two weeks (new version). If
h
CC-MAIN-2018-22_urls
huggingface.co
Updated Jan 15, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Nick Hagar (2025). CC-MAIN-2018-22_urls [Dataset]. http://doi.org/10.57967/hf/4101
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.57967/hf/4101
Dataset updated
Jan 15, 2025
Authors
Nick Hagar
Description
This dataset contains domain names and counts of (non-deduplicated) URLs for every record in the CC-MAIN-2018-22 snapshot of the Common Crawl. It was collected from the AWS S3 version of Common Crawl via Amazon Athena. This dataset is derived from Common Crawl data and is subject to Common Crawl's Terms of Use: https://commoncrawl.org/terms-of-use.

Comprehensive set of Sitemap and robots.txt links extracted from Common...

zenodo.org

zip

Updated Mar 8, 2024

Facebook

Twitter

Click to copy link

Link copied

Cite

Michael Dinzinger; Michael Dinzinger (2024). Comprehensive set of Sitemap and robots.txt links extracted from Common Crawl [Dataset]. http://doi.org/10.5281/zenodo.10511292

Explore at:

zipAvailable download formats

Unique identifier

https://doi.org/10.5281/zenodo.10511292

Dataset updated

Mar 8, 2024

Dataset provided by

Zenodohttp://zenodo.org/

Authors

Michael Dinzinger; Michael Dinzinger

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Time period covered

Jan 14, 2024

Description

This is a comprehensive list of links to sitemaps and robots.txt files, which are extracted from the latest WARC Archive dump 2023-50 of robots.txt files.

Sitemaps:

32,252,027 links (all ending with .xml or .xml.gz); 395.2 MB (compressed)
Website categories; 2.2 MB (compressed)

Top level labels of Curlie.org directory	Number of sitemap links
Arts	20110
Business	68690
Computers	17404
Games	3068
Health	13999
Home	4130
Kids_and_Teens	2240
News	5855
Recreation	19273
Reference	10862
Regional	419
Science	10729
Shopping	29903
Society	35019
Sports	12597

Robots.txt files:

41,611,704 links; 440.9 MB (compressed)
Website categories; 2.7 MB (compressed)

Top level labels of Curlie.org directory	Number of robots.txt links
Arts	25281
Business	79497
Computers	21880
Games	5037
Health	17326
Home	5401
Kids_and_Teens	3753
News	3424
Recreation	26355
Reference	15404
Regional	678
Science	16500
Shopping	30266
Society	45397
Sports	18029

l
Ancillary Monitor Corpus: Common Crawl - german web (YEAR 2014 – VERSION 1)
lindat.cz
Updated Nov 16, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jan Oliver Rüdiger (2024). Ancillary Monitor Corpus: Common Crawl - german web (YEAR 2014 – VERSION 1) [Dataset]. https://lindat.cz/repository/xmlui/handle/11372/LRT-5788?show=full
Explore at:
Dataset updated
Nov 16, 2024
Authors
Jan Oliver Rüdiger
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
*** german version see below ***

The ‘Ancillary Monitor Corpus: Common Crawl - german web’ was designed with the aim of enabling a broad-based linguistic analysis of the German-language (visible) internet over time - with the aim of achieving comparability with the DeReKo (‘German Reference Corpus’ of the Leibniz Institute for the German Language - DeReKo volume 57 billion tokens - status: DeReKo Release 2024-I). The corpus is separated by year (here year 2014) and versioned (here version 1). Version 1 comprises (all years 2013-2024) 97.45 billion tokens.

The corpus is based on the data dumps from CommonCrawl (https://commoncrawl.org/). CommonCrawl is a non-profit organisation that provides copies of the visible Internet free of charge for research purposes.

The CommonCrawl WET raw data was first filtered by TLD (top-level domain). Only pages ending in the following TLDs were taken into account: ‘.at; .bayern; .berlin; .ch; .cologne; .de; .gmbh; .hamburg; .koeln; .nrw; .ruhr; .saarland; .swiss; .tirol; .wien; .zuerich’. These are the exclusive German-language TLDs according to ICANN (https://data.iana.org/TLD/tlds-alpha-by-domain.txt) as of 1 June 2024 - TLDs with a purely corporate reference (e.g. ‘.edeka; .bmw; .ford’) were excluded. The language of the individual documents (URLs) was then estimated with the help of NTextCat (https://github.com/ivanakcheurov/ntextcat) (via the CORE14 profile of NTextCat) - only those documents/URLs for which German was the most likely language were processed further (e.g. to exclude foreign-language material such as individual subpages). The third step involved filtering for manual selectors and filtering for 1:1 duplicates (within one year).

The filtering and subsequent processing was carried out using CorpusExplorer (http://hdl.handle.net/11234/1-2634) and our own (supplementary) scripts, and the TreeTagger (http://hdl.handle.net/11372/LRT-323) was used for automatic annotation. The corpus was processed on the HELIX HPC cluster. The author would like to take this opportunity to thank the state of Baden-Württemberg and the German Research Foundation (DFG) for the possibility to use the bwHPC/HELIX HPC cluster - funding code HPC cluster: INST 35/1597-1 FUGG.

Data content: - Tokens and record boundaries - Automatic lemma and POS annotation (using TreeTagger) - Metadata: - GUID - Unique identifier of the document - YEAR - Year of capture (please use this information for data slices) - Url - Full URL - Tld - Top-Level Domain - Domain - Domain without TLD (but with sub-domains if applicable) - DomainFull - Complete domain (incl. TLD) - DomainFull - Complete domain (incl. TLD) - Datum - (System Information): Date of the CorpusExplorer (date of capture by CommonCrawl - not date of creation/modification of the document). - Hash - (System Information): SHA1 hash of the CommonCrawl - Pfad - (System Information): Path of the cluster (raw data) - is supplied by the system.

Please note that the files are saved as *.cec6.gz. These are binary files of the CorpusExplorer (see above). These files ensure efficient archiving. You can use both CorpusExplorer and the ‘CEC6-Converter’ (available for Linux, MacOS and Windows - see: https://lindat.mff.cuni.cz/repository/xmlui/handle/11372/LRT-5705) to convert the data. The data can be exported in the following formats:

CATMA v6

CoNLL

CSV

CSV (only meta-data)

DTA TCF-XML

DWDS TEI-XML

HTML

IDS I5-XML

IDS KorAP XML

IMS Open Corpus Workbench

JSON

OPUS Corpus Collection XCES

Plaintext

SaltXML

SlashA XML

SketchEngine VERT

SPEEDy/CODEX (JSON)

TLV-XML

TreeTagger

TXM

WebLicht

XML

Please note that an export increases the storage space requirement extensively. The ‘CorpusExplorerConsole’ (https://github.com/notesjor/CorpusExplorer.Terminal.Console - available for Linux, MacOS and Windows) also offers a simple solution for editing and analysing. If you have any questions, please contact the author.

Legal information The data was downloaded on 01.11.2024. The use, processing and distribution is subject to §60d UrhG (german copyright law), which authorises the use for non-commercial purposes in research and teaching. LINDAT/CLARIN is responsible for long-term archiving in accordance with §69d para. 5 and ensures that only authorised persons can access the data. The data has been checked to the best of our knowledge and belief (on a random basis) - should you nevertheless find legal violations (e.g. right to be forgotten, personal rights, etc.), please write an e-mail to the author (amc_report@jan-oliver-ruediger.de) with the following information: 1) why this content is undesirable (please outline only briefly) and 2) how the content can be identified - e.g. file name, URL or domain, etc. The author will endeavour to identify the content. The author will endeavour to remove the content and re-upload the data (modified) within two weeks (new version). If
h
clean-si-mc4
huggingface.co
Updated Dec 21, 2021
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Keshan Sodimana (2021). clean-si-mc4 [Dataset]. https://huggingface.co/datasets/keshan/clean-si-mc4
Explore at:
Dataset updated
Dec 21, 2021
Authors
Keshan Sodimana
Description
A colossal, cleaned version of Common Crawl's web crawl corpus. Based on Common Crawl dataset: "https://commoncrawl.org". This is the processed version of Google's mC4 dataset by AllenAI.
CEREAL I, el Corpus del Español REAL
zenodo.org
bz2, txt
Updated Jun 15, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Cristina España-Bonet; Cristina España-Bonet; Alberto Barrón-Cedeño; Alberto Barrón-Cedeño (2024). CEREAL I, el Corpus del Español REAL [Dataset]. http://doi.org/10.5281/zenodo.11387864
Explore at:
bz2, txtAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.11387864
Dataset updated
Jun 15, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Cristina España-Bonet; Cristina España-Bonet; Alberto Barrón-Cedeño; Alberto Barrón-Cedeño
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Content:

CEREAL (visit the project website) is a document-level corpus of documents in Spanish extracted from OSCAR. Each document in the corpus is classified according to its country of origin. CEREAL covers 24 countries where Spanish is spoken. Following OSCAR, we provide our annotations with CCO license, but we do not hold the copyright of the content text, which comes from OSCAR and therefore from Common Crawl.

The process to build the corpus and its characteristics can be found in:

Cristina España-Bonet and Alberto Barrón-Cedeño. "Elote, Choclo and Mazorca: on the Varieties of Spanish." In proceedings of the 2024 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2024), Mexico City, Mexico, June 2024.

The corpus used to train the classifier and the sentence-level version of CEREAL is available at
https://zenodo.org/records/11390829

Files Description:

See the README.txt file
h
CC-MAIN-2019-18_urls
huggingface.co
Updated Jan 15, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Nick Hagar (2025). CC-MAIN-2019-18_urls [Dataset]. http://doi.org/10.57967/hf/4091
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.57967/hf/4091
Dataset updated
Jan 15, 2025
Authors
Nick Hagar
Description
This dataset contains domain names and counts of (non-deduplicated) URLs for every record in the CC-MAIN-2019-18 snapshot of the Common Crawl. It was collected from the AWS S3 version of Common Crawl via Amazon Athena. This dataset is derived from Common Crawl data and is subject to Common Crawl's Terms of Use: https://commoncrawl.org/terms-of-use.
Z
Dataset for Report: "The Increasing Prominence of Prejudice and Social...
data.niaid.nih.gov
Updated Jun 13, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
David Rozado (2022). Dataset for Report: "The Increasing Prominence of Prejudice and Social Justice Rhetoric in UK News Media" [Dataset]. https://data.niaid.nih.gov/resources?id=ZENODO_6482344
Explore at:
Dataset updated
Jun 13, 2022
Dataset authored and provided by
David Rozado
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
United Kingdom
Description
This data set contains frequency counts of target words in 16 million news and opinion articles from 10 popular news media outlets in the United Kingdom. The target words are listed in the associated report and are mostly words that denote prejudice or are often associated with social justice discourse. A few additional words not denoting prejudice are also available since they are used in the report for illustration purposes of the method.

The textual content of news and opinion articles from the outlets is available in the outlet's online domains and/or public cache repositories such as Google cache (https://webcache.googleusercontent.com), The Internet Wayback Machine (https://archive.org/web/web.php), and Common Crawl (https://commoncrawl.org). We used derived word frequency counts from these sources. Textual content included in our analysis is circumscribed to articles headlines and main body of text of the articles and does not include other article elements such as figure captions.

Targeted textual content was located in HTML raw data using outlet specific xpath expressions. Tokens were lowercased prior to estimating frequency counts. To prevent outlets with sparse text content for a year from distorting aggregate frequency counts, we only include outlet frequency counts from years for which there is at least 1 million words of article content from an outlet. This threshold was chosen to maximize inclusion in our analysis of outlets with sparse amounts of articles text per year.

Yearly frequency usage of a target word in an outlet in any given year was estimated by dividing the total number of occurrences of the target word in all articles of a given year by the number of all words in all articles of that year. This method of estimating frequency accounts for variable volume of total article output over time.

In a small percentage of articles, outlet specific XPath expressions might fail to properly capture the content of the article due to the heterogeneity of HTML elements and CSS styling combinations with which articles text content is arranged in outlets online domains. As a result, the total and target word counts metrics for a small subset of articles are not precise. In a random sample of articles and outlets, manual estimation of target words counts overlapped with the automatically derived counts for over 90% of the articles.

Most of the incorrect frequency counts are often minor deviations from the actual counts such as for instance counting the word "Facebook" in an article footnote encouraging article readers to follow the journalist’s Facebook profile and that the XPath expression mistakenly included as the content of the article main text.To conclude, in a data analysis of over 16 million articles, we cannot manually check the correctness of frequency counts for every single article and hundred percent accuracy at capturing articles’ content is elusive due to the small number of difficult to detect boundary cases such as incorrect HTML markup syntax in online domains. Overall however, we are confident that our frequency metrics are representative of word prevalence in print news media content (see Figure 2 of main manuscript for supporting evidence of the temporal precision of the method).
Webis-MS-MARCO-Anchor-Texts-22
zenodo.org
application/gzip
Updated Jan 30, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Maik Fröbe; Maik Fröbe; Sebastian Günther; Sebastian Günther; Maximilian Probst; Martin Potthast; Martin Potthast; Matthias Hagen; Matthias Hagen; Maximilian Probst (2022). Webis-MS-MARCO-Anchor-Texts-22 [Dataset]. http://doi.org/10.5281/zenodo.5883456
Explore at:
application/gzipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.5883456
Dataset updated
Jan 30, 2022
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Maik Fröbe; Maik Fröbe; Sebastian Günther; Sebastian Günther; Maximilian Probst; Martin Potthast; Martin Potthast; Matthias Hagen; Matthias Hagen; Maximilian Probst
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The Webis MS MARCO Anchor Text 2022 dataset enriches Version 1 and 2 of the document collection of MS MARCO with anchor text extracted from six Common Crawl snapshots. The six Common Crawl snapshots cover the years 2016 to 2021 (between 1.7-3.4 billion documents each). We sampled 1,000 anchor texts for documents with more than 1,000 anchor texts at random and all anchor texts for documents with less than 1,000 anchor texts (this sampling yields that all anchor text is included for 94% of the documents in Version 1 and 97% of documents for Version 2). Overall, the MS MARCO Anchor Text 2022 dataset enriches 1,703,834 documents for Version 1 and 4,821,244 documents for Version 2 with anchor text.

Cleaned versions of the MS MARCO Anchor Text 2022 dataset are available in ir_datasets, Zenodo and Hugging Face. The raw dataset with additional information and all metadata for the extracted anchor texts (roughly 100GB) is available on Hugging Face and files.webis.de.

The details of the construction of the Webis MS MARCO Anchor Text 2022 dataset are described in the associated paper. If you use this dataset, please cite
@InProceedings{froebe:2022a, address = {Berlin Heidelberg New York}, author = {Maik Fr{\"o}be and Sebastian G{\"u}nther and Maximilian Probst and Martin Potthast and Matthias Hagen}, booktitle = {Advances in Information Retrieval. 44th European Conference on IR Research (ECIR 2022)}, editor = {Matthias Hagen and Suzan Verberne and Craig Macdonald and Christin Seifert and Krisztian Balog and Kjetil N{\o}rv\r{a}g and Vinay Setty}, month = apr, publisher = {Springer}, series = {Lecture Notes in Computer Science}, site = {Stavanger, Norway}, title = {{The Power of Anchor Text in the Neural Retrieval Era}}, year = 2022 }
W
CommonCrawl News Articles by Political Orientation
webis.de
anthology.aicmu.ac.cn
7476697
Updated 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Maximilian Keiff; Henning Wachsmuth (2022). CommonCrawl News Articles by Political Orientation [Dataset]. http://doi.org/10.5281/zenodo.7476697
Explore at:
7476697Available download formats
Unique identifier
https://doi.org/10.5281/zenodo.7476697
Dataset updated
2022
Dataset provided by
The Web Technology & Information Systems Network
Leibniz Universität Hannover
Authors
Maximilian Keiff; Henning Wachsmuth
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The dataset includes news articles gathered from CommonCrawl for media outlets that were selected based on their political orientation. The news articles span publication dates from 2010 to 2021.
c
Data from: Text collection for training the BERTić transformer model...
clarin.si
live.european-language-grid.eu
Updated May 8, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Nikola Ljubešić (2021). Text collection for training the BERTić transformer model BERTić-data [Dataset]. https://clarin.si/repository/xmlui/handle/11356/1426?show=full
Explore at:
Dataset updated
May 8, 2021
Authors
Nikola Ljubešić
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
The BERTić-data text collection contains more than 8 billion tokens of mostly web-crawled text written in Bosnian, Croatian, Montenegrin or Serbian. The collection was used to train the BERTić transformer model (https://huggingface.co/classla/bcms-bertic). The data consists of web crawls before 2015, i.e. bsWaC (http://hdl.handle.net/11356/1062), hrWaC (http://hdl.handle.net/11356/1064), and srWaC (http://hdl.handle.net/11356/1063); previously unpublished 2019-2020 crawls, i.e. cnrWaC, CLASSLA-bs, CLASSLA-hr, and CLASSLA-sr; the cc100-hr and cc100-sr parts of CommonCrawl (https://commoncrawl.org/); and the Riznica corpus (http://hdl.handle.net/11356/1180). All texts were transliterated to the Latin script. The format of the text collection is one-sentence-per-line, empty-line-as-document-boundary. More details, especially on the applied near-deduplication procedure, can be found in the BERTić paper (https://arxiv.org/pdf/2104.09243.pdf).
Data for manuscript "The Prevalence of Terms Denoting Far-right and Far-left...
zenodo.org
data.niaid.nih.gov
bin
Updated Mar 22, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
David Rozado; David Rozado (2022). Data for manuscript "The Prevalence of Terms Denoting Far-right and Far-left Political Extremism in U.S. and U.K. News Media" [Dataset]. http://doi.org/10.5281/zenodo.5437016
Explore at:
binAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.5437016
Dataset updated
Mar 22, 2022
Dataset provided by
Zenodohttp://zenodo.org/
Authors
David Rozado; David Rozado
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
United Kingdom, United States
Description
This data set belongs to an academic manuscript examining longitudinally (2000-2019) the prevalence of terms denoting far-right and far-left political extremism in a large corpus of more than 32 million written news and opinion articles from 54 news media outlets popular in the United States and the United Kingdom.

The textual content of news and opinion articles from the 54 outlets listed in the main manuscript is available in the outlet's online domains and/or public cache repositories such as Google cache (https://webcache.googleusercontent.com), The Internet Wayback Machine (https://archive.org/web/web.php), and Common Crawl (https://commoncrawl.org). We used derived word frequency counts from these sources. Textual content included in our analysis is circumscribed to articles headlines and main body of text of the articles and does not include other article elements such as figure captions.

Targeted textual content was located in HTML raw data using outlet specific xpath expressions. Tokens were lowercased prior to estimating frequency counts. To prevent outlets with sparse text content for a year from distorting aggregate frequency counts, we only include outlet frequency counts from years for which there is at least 1 million words of article content from an outlet. This threshold was chosen to maximize inclusion in our analysis of outlets with sparse amounts of articles text per year.

Yearly frequency usage of a target word in an outlet in any given year was estimated by dividing the total number of occurrences of the target word in all articles of a given year by the number of all words in all articles of that year. This method of estimating frequency accounts for variable volume of total article output over time.

The list of compressed files in this data set is listed next:

-analysisScripts.rar contains the analysis scripts used in the main manuscript

-articlesContainingTargetWords.rar contains counts of target words in outlets articles as well as total counts of words in articles

Usage Notes

In a small percentage of articles, outlet specific XPath expressions failed to properly capture the content of the article due to the heterogeneity of HTML elements and CSS styling combinations with which articles text content is arranged in outlets online domains. As a result, the total and target word counts metrics for a small subset of articles are not precise. In a random sample of articles and outlets, manual estimation of target words counts overlapped with the automatically derived counts for over 90% of the articles.

Most of the incorrect frequency counts were minor deviations from the actual counts such as for instance counting the word "Facebook" in an article footnote encouraging article readers to follow the journalist’s Facebook profile and that the XPath expression mistakenly included as the content of the article main text. Some additional outlet-specific inaccuracies that we could identify occurred in "The Hill" and "Newsmax" news outlets where XPath expressions had some shortfalls at precisely capturing articles’ content. For "The Hill", in years 2007-2009, XPath expressions failed to capture the complete text of the article in about 40% of the articles. This does not necessarily result in incorrect frequency counts for that outlet but in a sample of articles’ words that is about 40% smaller than the total population of articles words for those three years. In the case of "NewsMax", the issue was that for some articles, XPath expressions captured the entire text of the article twice. Notice that this does not result in incorrect frequency counts. If a word appears x times in an article with a total of y words, the same frequency count will still be derived when our scripts count the word 2x times in the version of the article with a total of 2y words.

To conclude, in a data analysis of 32 million articles, we cannot manually check the correctness of frequency counts for every single article and hundred percent accuracy at capturing articles’ content is elusive due to the small number of difficult to detect boundary cases such as incorrect HTML markup syntax in online domains. Overall however, we are confident that our frequency metrics are representative of word prevalence in print news media content (see Figure 1 in the main manuscript for illustration of the accuracy of the frequency counts).

Facebook

Twitter

Click to copy link

Link copied

Cite

Common Crawl Foundation (2024). statistics [Dataset]. https://huggingface.co/datasets/commoncrawl/statistics

statistics

commoncrawl/statistics

Common Crawl Statistics

Explore at:

Dataset updated

Nov 20, 2024

Dataset provided by

Common Crawlhttp://commoncrawl.org/

Authors

Common Crawl Foundation

Description

Common Crawl Statistics

Number of pages, distribution of top-level domains, crawl overlaps, etc. - basic metrics about Common Crawl Monthly Crawl Archives, for more detailed information and graphs please visit our official statistics page. Here you can find the following statistics files:

  Charsets

The character set or encoding of HTML pages only is identified by Tika's AutoDetectReader. The table shows the percentage how character sets have been used to encode HTML pages… See the full description on the dataset page: https://huggingface.co/datasets/commoncrawl/statistics.

Clear search

Close search

Google apps

Main menu

statistics

Random sample of Common Crawl domains from 2021

Context

Content

Acknowledgements

Inspiration

C4 Dataset

c4_wsrs

Common Crawl Dataset

citations

Common Crawl News 20200110212037-00310

Ancillary Monitor Corpus: Common Crawl - german web (YEAR 2024 – VERSION 1)

Ancillary Monitor Corpus: Common Crawl - german web (YEAR 2022 – VERSION 1)

CC-MAIN-2018-22_urls

Comprehensive set of Sitemap and robots.txt links extracted from Common...

Ancillary Monitor Corpus: Common Crawl - german web (YEAR 2014 – VERSION 1)

clean-si-mc4

CEREAL I, el Corpus del Español REAL

CC-MAIN-2019-18_urls

Dataset for Report: "The Increasing Prominence of Prejudice and Social...

Webis-MS-MARCO-Anchor-Texts-22

CommonCrawl News Articles by Political Orientation

Data from: Text collection for training the BERTić transformer model...

Data for manuscript "The Prevalence of Terms Denoting Far-right and Far-left...

statistics

commoncrawl/statistics

Common Crawl Statistics