Common Crawl Statistics
Number of pages, distribution of top-level domains, crawl overlaps, etc. - basic metrics about the Common Crawl Monthly Crawl Archives. For more detailed information and graphs, please visit our official statistics page. Here you can find the following statistics files:
Charsets
The character set (encoding) is identified for HTML pages only, using Tika's AutoDetectReader. The table shows the percentage of HTML pages encoded with each character set… See the full description on the dataset page: https://huggingface.co/datasets/commoncrawl/statistics.
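As a minimal sketch of how one of these statistics files could be pulled from the Hugging Face repository and inspected with pandas - the file name "charsets.csv" is an assumption, so check the repository listing for the actual file names:

```python
# Hedged sketch: download one statistics file from the dataset repo and load it.
# The filename below is hypothetical; list the repo files to find the real one.
import pandas as pd
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="commoncrawl/statistics",
    repo_type="dataset",
    filename="charsets.csv",  # assumed file name
)
charsets = pd.read_csv(path)
print(charsets.head())
```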
The Common Crawl corpus contains petabytes of data collected over 12 years of web crawling. The corpus contains raw web page data, metadata extracts and text extracts. Common Crawl data is stored on Amazon Web Services’ Public Data Sets and on multiple academic cloud platforms across the world.
The Common Crawl project has fascinated me ever since I learned about it. It provides a large number of data formats and presents challenges across skill and interest areas. I am particularly interested in URL analysis for applications such as typosquatting, malicious URLs, and just about anything interesting that can be done with domain names.
I have sampled 1% of the domains from the Common Crawl Index dataset, which is available on AWS in Parquet format. You can read more about how I extracted this dataset at https://harshsinghal.dev/create-a-url-dataset-for-nlp/
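One way such a 1% sample could be drawn is with DuckDB directly over the columnar index in S3. This is a rough sketch, not the author's exact pipeline; the S3 prefix, crawl label, and column name are assumptions to verify against the Common Crawl documentation:

```python
# Hedged sketch: sample ~1% of registered domains from the columnar URL index.
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")
con.execute("SET s3_region='us-east-1';")

sample = con.execute("""
    SELECT DISTINCT url_host_registered_domain
    FROM read_parquet('s3://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2023-50/subset=warc/*.parquet')
    USING SAMPLE 1 PERCENT (bernoulli)
""").fetchdf()
print(len(sample))
```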
Thanks a ton to the folks at https://commoncrawl.org/ for making this immensely valuable resource available to the world for free. Please find their Terms of Use here.
My interests are in working with string similarity functions, and I continue to find scalable ways of doing this. I wrote about using a Postgres extension to compute string distances and used Common Crawl URL domains as the input dataset (you can read more at https://harshsinghal.dev/postgres-text-similarity-with-commoncrawl-domains/).
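For illustration only - this is a tiny standard-library Python analogue of that kind of similarity check, not the Postgres-extension approach described in the post:

```python
# Score how similar two domain names are; high scores on non-identical names
# are the sort of signal used to flag potential typosquats.
from difflib import SequenceMatcher

def domain_similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio between two domain names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

print(domain_similarity("example.com", "examp1e.com"))  # close to 1.0 -> suspicious lookalike
```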
I am also interested in identifying fraudulent domains and understanding malicious URL patterns.
Common Crawl sample
A small unofficial random subset of the famous Common Crawl dataset.
60 random segment WET files were downloaded from Common Crawl on 2024-05-12. Lines between 500 and 5000 characters long (inclusive) were kept, and only unique texts were retained. No other filtering was applied.
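A minimal sketch of that kind of filtering, assuming the warcio library for reading WET records (my choice of library, not necessarily the dataset author's); the file name is a placeholder:

```python
# Keep unique lines of 500-5000 characters from a WET file.
from warcio.archiveiterator import ArchiveIterator

seen, kept = set(), []
with open("CC-MAIN-example.warc.wet.gz", "rb") as stream:  # placeholder path
    for record in ArchiveIterator(stream):
        if record.rec_type != "conversion":  # WET plain-text records
            continue
        text = record.content_stream().read().decode("utf-8", errors="replace")
        for line in text.splitlines():
            if 500 <= len(line) <= 5000 and line not in seen:
                seen.add(line)
                kept.append(line)
print(f"{len(kept)} unique lines kept")
```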
Languages
Each text was assigned to one of the language codes using the GCLD3 Python package. The Chinese texts were classified as either simplified, traditional, or Cantonese using the… See the full description on the dataset page: https://huggingface.co/datasets/agentlans/common-crawl-sample.
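A short sketch of per-text language assignment with the gcld3 package described above; the byte-limit parameters are illustrative:

```python
# Assign a language code to a text with GCLD3.
import gcld3

detector = gcld3.NNetLanguageIdentifier(min_num_bytes=0, max_num_bytes=1000)
result = detector.FindLanguage(text="Dies ist ein kurzer deutscher Beispielsatz.")
print(result.language, result.is_reliable, result.probability)
```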
C4 is a colossal, cleaned version of Common Crawl's web crawl corpus. It is based on the Common Crawl dataset (https://commoncrawl.org) and was used to train the T5 text-to-text Transformer models.
The dataset can be downloaded in a pre-processed form from AllenNLP.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
The ‘Ancillary Monitor Corpus: Common Crawl - german web’ was designed to enable a broad-based linguistic analysis of the German-language (visible) internet over time and to achieve comparability with the DeReKo (the ‘German Reference Corpus’ of the Leibniz Institute for the German Language; DeReKo volume: 57 billion tokens; status: DeReKo Release 2024-I). The corpus is separated by year (here: 2024) and versioned (here: version 1). Version 1 comprises 97.45 billion tokens across all years (2013-2024).
The corpus is based on the data dumps from CommonCrawl (https://commoncrawl.org/). CommonCrawl is a non-profit organisation that provides copies of the visible Internet free of charge for research purposes.
The CommonCrawl WET raw data was first filtered by TLD (top-level domain). Only pages ending in the following TLDs were taken into account: ‘.at; .bayern; .berlin; .ch; .cologne; .de; .gmbh; .hamburg; .koeln; .nrw; .ruhr; .saarland; .swiss; .tirol; .wien; .zuerich’. These are the exclusive German-language TLDs according to ICANN (https://data.iana.org/TLD/tlds-alpha-by-domain.txt) as of 1 June 2024 - TLDs with a purely corporate reference (e.g. ‘.edeka; .bmw; .ford’) were excluded. The language of the individual documents (URLs) was then estimated with the help of NTextCat (https://github.com/ivanakcheurov/ntextcat) (via the CORE14 profile of NTextCat) - only those documents/URLs for which German was the most likely language were processed further (e.g. to exclude foreign-language material such as individual subpages). The third step involved filtering for manual selectors and filtering for 1:1 duplicates (within one year).
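The pipeline itself relies on CorpusExplorer and NTextCat (a .NET language identifier); purely as a hedged Python sketch, the first step (TLD filtering over URL/text pairs) might look like this, with language identification and deduplication left as follow-up steps:

```python
# Keep only documents whose host ends in one of the German-language TLDs.
from urllib.parse import urlparse

GERMAN_TLDS = (".at", ".bayern", ".berlin", ".ch", ".cologne", ".de", ".gmbh",
               ".hamburg", ".koeln", ".nrw", ".ruhr", ".saarland", ".swiss",
               ".tirol", ".wien", ".zuerich")

def keep_by_tld(url: str) -> bool:
    host = urlparse(url).hostname or ""
    return host.lower().endswith(GERMAN_TLDS)

docs = [("https://example.de/seite", "…"), ("https://example.org/page", "…")]  # placeholder records
german_candidates = [(url, text) for url, text in docs if keep_by_tld(url)]
# Language identification (German most likely) and 1:1 deduplication would follow here.
```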
The filtering and subsequent processing was carried out using CorpusExplorer (http://hdl.handle.net/11234/1-2634) and our own (supplementary) scripts, and the TreeTagger (http://hdl.handle.net/11372/LRT-323) was used for automatic annotation. The corpus was processed on the HELIX HPC cluster. The author would like to take this opportunity to thank the state of Baden-Württemberg and the German Research Foundation (DFG) for the possibility to use the bwHPC/HELIX HPC cluster - funding code HPC cluster: INST 35/1597-1 FUGG.
Data content:
- Tokens and record boundaries
- Automatic lemma and POS annotation (using TreeTagger)
- Metadata:
  - GUID - Unique identifier of the document
  - YEAR - Year of capture (please use this information for data slices)
  - Url - Full URL
  - Tld - Top-Level Domain
  - Domain - Domain without TLD (but with sub-domains if applicable)
  - DomainFull - Complete domain (incl. TLD)
  - Datum - (System Information): Date of the CorpusExplorer (date of capture by CommonCrawl - not date of creation/modification of the document)
  - Hash - (System Information): SHA1 hash of the CommonCrawl record
  - Pfad - (System Information): Path on the cluster (raw data) - supplied by the system
Please note that the files are saved as *.cec6.gz - binary files of CorpusExplorer (see above) that allow efficient archiving. You can use both CorpusExplorer and the ‘CEC6-Converter’ (available for Linux, macOS and Windows - see: https://lindat.mff.cuni.cz/repository/xmlui/handle/11372/LRT-5705) to convert the data into other export formats.
Please note that exporting increases the storage space requirements considerably. The ‘CorpusExplorerConsole’ (https://github.com/notesjor/CorpusExplorer.Terminal.Console - available for Linux, macOS and Windows) also offers a simple solution for editing and analysis. If you have any questions, please contact the author.
Legal information: The data was downloaded on 01.11.2024. Use, processing and distribution are subject to §60d UrhG (German copyright law), which authorises use for non-commercial purposes in research and teaching. LINDAT/CLARIN is responsible for long-term archiving in accordance with §69d para. 5 and ensures that only authorised persons can access the data. The data has been checked to the best of our knowledge and belief (on a random basis). Should you nevertheless find legal violations (e.g. right to be forgotten, personal rights, etc.), please write an e-mail to the author (amc_report@jan-oliver-ruediger.de) with the following information: 1) why this content is undesirable (please outline only briefly) and 2) how the content can be identified - e.g. file name, URL or domain. The author will endeavour to identify and remove the content and to re-upload the data in modified form within two weeks (as a new version).
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
The ‘Ancillary Monitor Corpus: Common Crawl - german web’ was designed to enable a broad-based linguistic analysis of the German-language (visible) internet over time and to achieve comparability with the DeReKo (the ‘German Reference Corpus’ of the Leibniz Institute for the German Language; DeReKo volume: 57 billion tokens; status: DeReKo Release 2024-I). The corpus is separated by year (here: 2023) and versioned (here: version 1). Version 1 comprises 97.45 billion tokens across all years (2013-2024).
The corpus is based on the data dumps from CommonCrawl (https://commoncrawl.org/). CommonCrawl is a non-profit organisation that provides copies of the visible Internet free of charge for research purposes.
The CommonCrawl WET raw data was first filtered by TLD (top-level domain). Only pages ending in the following TLDs were taken into account: ‘.at; .bayern; .berlin; .ch; .cologne; .de; .gmbh; .hamburg; .koeln; .nrw; .ruhr; .saarland; .swiss; .tirol; .wien; .zuerich’. These are the exclusive German-language TLDs according to ICANN (https://data.iana.org/TLD/tlds-alpha-by-domain.txt) as of 1 June 2024 - TLDs with a purely corporate reference (e.g. ‘.edeka; .bmw; .ford’) were excluded. The language of the individual documents (URLs) was then estimated with the help of NTextCat (https://github.com/ivanakcheurov/ntextcat) (via the CORE14 profile of NTextCat) - only those documents/URLs for which German was the most likely language were processed further (e.g. to exclude foreign-language material such as individual subpages). The third step involved filtering for manual selectors and filtering for 1:1 duplicates (within one year).
The filtering and subsequent processing was carried out using CorpusExplorer (http://hdl.handle.net/11234/1-2634) and our own (supplementary) scripts, and the TreeTagger (http://hdl.handle.net/11372/LRT-323) was used for automatic annotation. The corpus was processed on the HELIX HPC cluster. The author would like to take this opportunity to thank the state of Baden-Württemberg and the German Research Foundation (DFG) for the possibility to use the bwHPC/HELIX HPC cluster - funding code HPC cluster: INST 35/1597-1 FUGG.
Data content:
- Tokens and record boundaries
- Automatic lemma and POS annotation (using TreeTagger)
- Metadata:
  - GUID - Unique identifier of the document
  - YEAR - Year of capture (please use this information for data slices)
  - Url - Full URL
  - Tld - Top-Level Domain
  - Domain - Domain without TLD (but with sub-domains if applicable)
  - DomainFull - Complete domain (incl. TLD)
  - Datum - (System Information): Date of the CorpusExplorer (date of capture by CommonCrawl - not date of creation/modification of the document)
  - Hash - (System Information): SHA1 hash of the CommonCrawl record
  - Pfad - (System Information): Path on the cluster (raw data) - supplied by the system
Please note that the files are saved as *.cec6.gz - binary files of CorpusExplorer (see above) that allow efficient archiving. You can use both CorpusExplorer and the ‘CEC6-Converter’ (available for Linux, macOS and Windows - see: https://lindat.mff.cuni.cz/repository/xmlui/handle/11372/LRT-5705) to convert the data into other export formats.
Please note that exporting increases the storage space requirements considerably. The ‘CorpusExplorerConsole’ (https://github.com/notesjor/CorpusExplorer.Terminal.Console - available for Linux, macOS and Windows) also offers a simple solution for editing and analysis. If you have any questions, please contact the author.
Legal information: The data was downloaded on 01.11.2024. Use, processing and distribution are subject to §60d UrhG (German copyright law), which authorises use for non-commercial purposes in research and teaching. LINDAT/CLARIN is responsible for long-term archiving in accordance with §69d para. 5 and ensures that only authorised persons can access the data. The data has been checked to the best of our knowledge and belief (on a random basis). Should you nevertheless find legal violations (e.g. right to be forgotten, personal rights, etc.), please write an e-mail to the author (amc_report@jan-oliver-ruediger.de) with the following information: 1) why this content is undesirable (please outline only briefly) and 2) how the content can be identified - e.g. file name, URL or domain. The author will endeavour to identify and remove the content and to re-upload the data in modified form within two weeks (as a new version).
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
The ‘Ancillary Monitor Corpus: Common Crawl - german web’ was designed to enable a broad-based linguistic analysis of the German-language (visible) internet over time and to achieve comparability with the DeReKo (the ‘German Reference Corpus’ of the Leibniz Institute for the German Language; DeReKo volume: 57 billion tokens; status: DeReKo Release 2024-I). The corpus is separated by year (here: 2019) and versioned (here: version 1). Version 1 comprises 97.45 billion tokens across all years (2013-2024).
The corpus is based on the data dumps from CommonCrawl (https://commoncrawl.org/). CommonCrawl is a non-profit organisation that provides copies of the visible Internet free of charge for research purposes.
The CommonCrawl WET raw data was first filtered by TLD (top-level domain). Only pages ending in the following TLDs were taken into account: ‘.at; .bayern; .berlin; .ch; .cologne; .de; .gmbh; .hamburg; .koeln; .nrw; .ruhr; .saarland; .swiss; .tirol; .wien; .zuerich’. These are the exclusive German-language TLDs according to ICANN (https://data.iana.org/TLD/tlds-alpha-by-domain.txt) as of 1 June 2024 - TLDs with a purely corporate reference (e.g. ‘.edeka; .bmw; .ford’) were excluded. The language of the individual documents (URLs) was then estimated with the help of NTextCat (https://github.com/ivanakcheurov/ntextcat) (via the CORE14 profile of NTextCat) - only those documents/URLs for which German was the most likely language were processed further (e.g. to exclude foreign-language material such as individual subpages). The third step involved filtering for manual selectors and filtering for 1:1 duplicates (within one year).
The filtering and subsequent processing was carried out using CorpusExplorer (http://hdl.handle.net/11234/1-2634) and our own (supplementary) scripts, and the TreeTagger (http://hdl.handle.net/11372/LRT-323) was used for automatic annotation. The corpus was processed on the HELIX HPC cluster. The author would like to take this opportunity to thank the state of Baden-Württemberg and the German Research Foundation (DFG) for the possibility to use the bwHPC/HELIX HPC cluster - funding code HPC cluster: INST 35/1597-1 FUGG.
Data content:
- Tokens and record boundaries
- Automatic lemma and POS annotation (using TreeTagger)
- Metadata:
  - GUID - Unique identifier of the document
  - YEAR - Year of capture (please use this information for data slices)
  - Url - Full URL
  - Tld - Top-Level Domain
  - Domain - Domain without TLD (but with sub-domains if applicable)
  - DomainFull - Complete domain (incl. TLD)
  - Datum - (System Information): Date of the CorpusExplorer (date of capture by CommonCrawl - not date of creation/modification of the document)
  - Hash - (System Information): SHA1 hash of the CommonCrawl record
  - Pfad - (System Information): Path on the cluster (raw data) - supplied by the system
Please note that the files are saved as *.cec6.gz - binary files of CorpusExplorer (see above) that allow efficient archiving. You can use both CorpusExplorer and the ‘CEC6-Converter’ (available for Linux, macOS and Windows - see: https://lindat.mff.cuni.cz/repository/xmlui/handle/11372/LRT-5705) to convert the data into other export formats.
Please note that exporting increases the storage space requirements considerably. The ‘CorpusExplorerConsole’ (https://github.com/notesjor/CorpusExplorer.Terminal.Console - available for Linux, macOS and Windows) also offers a simple solution for editing and analysis. If you have any questions, please contact the author.
Legal information: The data was downloaded on 01.11.2024. Use, processing and distribution are subject to §60d UrhG (German copyright law), which authorises use for non-commercial purposes in research and teaching. LINDAT/CLARIN is responsible for long-term archiving in accordance with §69d para. 5 and ensures that only authorised persons can access the data. The data has been checked to the best of our knowledge and belief (on a random basis). Should you nevertheless find legal violations (e.g. right to be forgotten, personal rights, etc.), please write an e-mail to the author (amc_report@jan-oliver-ruediger.de) with the following information: 1) why this content is undesirable (please outline only briefly) and 2) how the content can be identified - e.g. file name, URL or domain. The author will endeavour to identify and remove the content and to re-upload the data in modified form within two weeks (as a new version).
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
The ‘Ancillary Monitor Corpus: Common Crawl - german web’ was designed to enable a broad-based linguistic analysis of the German-language (visible) internet over time and to achieve comparability with the DeReKo (the ‘German Reference Corpus’ of the Leibniz Institute for the German Language; DeReKo volume: 57 billion tokens; status: DeReKo Release 2024-I). The corpus is separated by year (here: 2014) and versioned (here: version 1). Version 1 comprises 97.45 billion tokens across all years (2013-2024).
The corpus is based on the data dumps from CommonCrawl (https://commoncrawl.org/). CommonCrawl is a non-profit organisation that provides copies of the visible Internet free of charge for research purposes.
The CommonCrawl WET raw data was first filtered by TLD (top-level domain). Only pages ending in the following TLDs were taken into account: ‘.at; .bayern; .berlin; .ch; .cologne; .de; .gmbh; .hamburg; .koeln; .nrw; .ruhr; .saarland; .swiss; .tirol; .wien; .zuerich’. These are the exclusive German-language TLDs according to ICANN (https://data.iana.org/TLD/tlds-alpha-by-domain.txt) as of 1 June 2024 - TLDs with a purely corporate reference (e.g. ‘.edeka; .bmw; .ford’) were excluded. The language of the individual documents (URLs) was then estimated with the help of NTextCat (https://github.com/ivanakcheurov/ntextcat) (via the CORE14 profile of NTextCat) - only those documents/URLs for which German was the most likely language were processed further (e.g. to exclude foreign-language material such as individual subpages). The third step involved filtering for manual selectors and filtering for 1:1 duplicates (within one year).
The filtering and subsequent processing was carried out using CorpusExplorer (http://hdl.handle.net/11234/1-2634) and our own (supplementary) scripts, and the TreeTagger (http://hdl.handle.net/11372/LRT-323) was used for automatic annotation. The corpus was processed on the HELIX HPC cluster. The author would like to take this opportunity to thank the state of Baden-Württemberg and the German Research Foundation (DFG) for the possibility to use the bwHPC/HELIX HPC cluster - funding code HPC cluster: INST 35/1597-1 FUGG.
Data content:
- Tokens and record boundaries
- Automatic lemma and POS annotation (using TreeTagger)
- Metadata:
  - GUID - Unique identifier of the document
  - YEAR - Year of capture (please use this information for data slices)
  - Url - Full URL
  - Tld - Top-Level Domain
  - Domain - Domain without TLD (but with sub-domains if applicable)
  - DomainFull - Complete domain (incl. TLD)
  - Datum - (System Information): Date of the CorpusExplorer (date of capture by CommonCrawl - not date of creation/modification of the document)
  - Hash - (System Information): SHA1 hash of the CommonCrawl record
  - Pfad - (System Information): Path on the cluster (raw data) - supplied by the system
Please note that the files are saved as *.cec6.gz - binary files of CorpusExplorer (see above) that allow efficient archiving. You can use both CorpusExplorer and the ‘CEC6-Converter’ (available for Linux, macOS and Windows - see: https://lindat.mff.cuni.cz/repository/xmlui/handle/11372/LRT-5705) to convert the data into other export formats.
Please note that exporting increases the storage space requirements considerably. The ‘CorpusExplorerConsole’ (https://github.com/notesjor/CorpusExplorer.Terminal.Console - available for Linux, macOS and Windows) also offers a simple solution for editing and analysis. If you have any questions, please contact the author.
Legal information: The data was downloaded on 01.11.2024. Use, processing and distribution are subject to §60d UrhG (German copyright law), which authorises use for non-commercial purposes in research and teaching. LINDAT/CLARIN is responsible for long-term archiving in accordance with §69d para. 5 and ensures that only authorised persons can access the data. The data has been checked to the best of our knowledge and belief (on a random basis). Should you nevertheless find legal violations (e.g. right to be forgotten, personal rights, etc.), please write an e-mail to the author (amc_report@jan-oliver-ruediger.de) with the following information: 1) why this content is undesirable (please outline only briefly) and 2) how the content can be identified - e.g. file name, URL or domain. The author will endeavour to identify and remove the content and to re-upload the data in modified form within two weeks (as a new version).
RealNews is a large corpus of news articles from Common Crawl. Data is scraped from Common Crawl, limited to the 5000 news domains indexed by Google News. The authors used the Newspaper Python library to extract the body and metadata from each article. News from Common Crawl dumps from December 2016 through March 2019 were used as training data; articles published in April 2019 from the April 2019 dump were used for evaluation. After deduplication, RealNews is 120 gigabytes without compression.
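A brief sketch of body/metadata extraction with the Newspaper library mentioned above; the URL is a placeholder, and RealNews itself was built from Common Crawl WARC data rather than live pages:

```python
# Extract title, authors, publish date, and body text from an article page.
from newspaper import Article

article = Article("https://www.example-news-site.com/some-story")  # placeholder URL
article.download()
article.parse()
print(article.title)
print(article.authors, article.publish_date)
print(article.text[:500])
```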
https://choosealicense.com/licenses/unknown/
Dataset Card for CC-News
Dataset Summary
CC-News dataset contains news articles from news sites all over the world. The data is available on AWS S3 in the Common Crawl bucket at /crawl-data/CC-NEWS/. This version of the dataset has been prepared using news-please - an integrated web crawler and information extractor for news. It contains 708,241 English-language news articles published between January 2017 and December 2019. It represents a small portion of the English… See the full description on the dataset page: https://huggingface.co/datasets/vblagoje/cc_news.
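This prepared version can be loaded directly with the datasets library; the column names are whatever the dataset card defines, so it is safest to inspect the first example rather than assume a schema:

```python
# Load the prepared CC-News split and inspect its fields.
from datasets import load_dataset

cc_news = load_dataset("vblagoje/cc_news", split="train")
print(cc_news)            # number of rows and column names
print(cc_news[0].keys())  # e.g. title, text, date, ... (check the dataset card)
```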
https://networkrepository.com/policy.php
Host-level Web Graph - This graph aggregates the page graph by subdomain/host: each node represents a specific subdomain/host, and an edge exists between a pair of hosts if at least one link was found between pages belonging to those hosts. The hyperlink graph was extracted from the Web corpus released by the Common Crawl Foundation in August 2012. The Web corpus was gathered using a web crawler employing a breadth-first-search selection strategy and embedding link discovery while crawling. The crawl was seeded with a large number of URLs from former crawls performed by the Common Crawl Foundation. Also, see web-cc12-firstlevel-subdomain and web-cc12-PayLevelDomain.
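A hedged sketch: if the host-level graph is distributed as a plain edge list (one "source target" pair per line - an assumption about the download format, and the file name is a placeholder), it can be loaded and summarised with networkx:

```python
# Load a directed host graph from an edge list and rank hosts by inbound links.
import networkx as nx

g = nx.read_edgelist("web-cc12-hostgraph.edges", create_using=nx.DiGraph())  # placeholder file name
print(g.number_of_nodes(), g.number_of_edges())
top_hosts = sorted(g.in_degree(), key=lambda kv: kv[1], reverse=True)[:10]
print(top_hosts)  # hosts receiving the most inbound links
```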
5,519 query-based summaries, each associated with an average of 6 input documents selected from an index of 355M documents from Common Crawl.
https://choosealicense.com/licenses/odc-by/
🍷 FineWeb
15 trillion tokens of the finest data the 🌐 web has to offer
What is it?
The 🍷 FineWeb dataset consists of more than 15T tokens of cleaned and deduplicated English web data from CommonCrawl. The data processing pipeline is optimized for LLM performance and ran on the 🏭 datatrove library, our large-scale data processing library. 🍷 FineWeb was originally meant to be a fully open replication of 🦅 RefinedWeb, with a release of the full dataset under… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb.
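Because the full corpus is enormous, streaming a small subset is the practical way to look at it; the "sample-10BT" config name below is an assumption taken from memory of the dataset card, so verify it against the card before use:

```python
# Stream a few FineWeb documents without downloading the full dataset.
from datasets import load_dataset

fw = load_dataset("HuggingFaceFW/fineweb", name="sample-10BT", split="train", streaming=True)
for i, row in enumerate(fw):
    print(row["text"][:200])
    if i == 2:
        break
```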
This dataset presents relationships between 293 Chinese cities, derived using a toponym co-occurrence method. By employing this toponym co-occurrence analysis method, the strength of an intercity relationship is determined by the frequency at which both city names appear on the same webpage. The data was sourced from the Common Crawl web archive's 2019 April Corpus, which contains approximately 2.5 billion web pages. The primary aim of this dataset is to provide a fresh perspective on intercity relationships, thereby facilitating studies on city network analysis. The dataset not only encourages further research into comparing this innovative city relationship with other established networks but is also a showcase that presents a straightforward methodology that can be applied to other archives within Common Crawl. As such, it paves the way for longitudinal studies that probe the evolution of city networks.
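An illustrative sketch of the toponym co-occurrence idea - counting how often two city names appear on the same page. A real run would operate over Common Crawl text and handle Chinese name matching properly; the city names and pages below are made-up stand-ins:

```python
# Count unordered city-name pairs that co-occur on the same page.
from collections import Counter
from itertools import combinations

cities = ["Beijing", "Shanghai", "Guangzhou"]          # stand-ins for the 293 cities
pages = ["Flights between Beijing and Shanghai ...",   # stand-ins for web pages
         "Guangzhou and Shanghai trade report ..."]

pair_counts = Counter()
for page in pages:
    present = sorted({city for city in cities if city in page})
    pair_counts.update(combinations(present, 2))

print(pair_counts.most_common())
```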
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data set contains frequency counts of target words in 16 million news and opinion articles from 10 popular news media outlets in the United Kingdom: The Guardian, The Times, The Independent, The Daily Mirror, BBC, Financial Times, Metro, The Telegraph and The Daily Mail, plus a few additional American-based outlets used for comparison reference. The target words are listed in the associated manuscript and are mostly words that denote some type of prejudice, social justice related terms, or counterreactions to them. A few additional words are also available since they are used in the manuscript for illustration purposes.
The textual content of news and opinion articles from the outlets listed in Figure 3 of the main manuscript is available in the outlet's online domains and/or public cache repositories such as Google cache (https://webcache.googleusercontent.com), The Internet Wayback Machine (https://archive.org/web/web.php), and Common Crawl (https://commoncrawl.org). We derived relative frequency counts from these sources. Textual content included in our analysis is circumscribed to articles headlines and main body of text of the articles and does not include other article elements such as figure captions.
Targeted textual content was located in the raw HTML data using outlet-specific XPath expressions. Tokens were lowercased prior to estimating frequency counts. To prevent outlets with sparse text content for a year from distorting aggregate frequency counts, we only include outlet frequency counts from years for which there is at least 1 million words of article content from an outlet.
Yearly frequency usage of a target word in an outlet in any given year was estimated by dividing the total number of occurrences of the target word in all articles of a given year by the number of all words in all articles of that year. This method of estimating frequency accounts for variable volume of total article output over time.
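A hedged sketch of the two steps just described: pulling article text with an outlet-specific XPath expression (the expression is passed in as a parameter and is not one of the study's actual expressions) and computing a target word's yearly relative frequency:

```python
# Relative frequency = occurrences of the target word in a year's articles
# divided by the total number of words in that year's articles.
from collections import Counter
from lxml import html

def article_text(raw_html: str, xpath_expr: str) -> str:
    tree = html.fromstring(raw_html)
    return " ".join(node.text_content() for node in tree.xpath(xpath_expr))

def yearly_relative_frequency(articles_by_year, target, xpath_expr):
    """articles_by_year: {year: [raw_html, ...]} -> {year: relative frequency of target}."""
    freqs = {}
    for year, pages in articles_by_year.items():
        tokens = []
        for page in pages:
            tokens.extend(article_text(page, xpath_expr).lower().split())
        counts = Counter(tokens)
        freqs[year] = counts[target] / max(len(tokens), 1)
    return freqs
```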
The compressed files in this data set are listed next:
- analysisScripts.rar contains the analysis scripts used in the main manuscript
- targetWordsInArticlesCounts.rar contains counts of target words in outlets' articles as well as total counts of words in articles
- targetWordsInArticlesCountsGuardianExampleWords contains counts of target words in outlets' articles as well as total counts of words in articles for illustrative Figure 1 in the main manuscript
Usage Notes
In a small percentage of articles, outlet-specific XPath expressions can fail to properly capture the content of the article due to the heterogeneity of HTML elements and CSS styling combinations with which articles' text content is arranged in outlets' online domains. As a result, the total and target word count metrics for a small subset of articles are not precise. In a random sample of articles and outlets, manual estimation of target word counts overlapped with the automatically derived counts for over 90% of the articles.
Most of the incorrect frequency counts were minor deviations from the actual counts, such as counting the word "Facebook" in an article footnote encouraging readers to follow the journalist's Facebook profile, which the XPath expression mistakenly included as part of the article's main text. To conclude, in a data analysis of 16 million articles we cannot manually check the correctness of frequency counts for every single article, and one hundred percent accuracy at capturing articles' content is elusive due to the small number of difficult-to-detect boundary cases such as incorrect HTML markup syntax in online domains. Overall, however, we are confident that our frequency metrics are representative of word prevalence in print news media content (see Figure 1 of the main manuscript for supporting evidence).
Golos is a Russian speech dataset suitable for speech research. The dataset mainly consists of recorded audio files manually annotated on the crowd-sourcing platform. The total duration of the audio is about 1240 hours.
Dataset structure

| Domain   | Train files | Train hours | Test files | Test hours |
|----------|-------------|-------------|------------|------------|
| Crowd    | 979 796     | 1 095       | 9 994      | 11.2       |
| Farfield | 124 003     | 132.4       | 1 916      | 1.4        |
| Total    | 1 103 799   | 1 227.4     | 11 910     | 12.6       |
Audio files in opus format

| Archive        | Size    | Link                |
|----------------|---------|---------------------|
| golos_opus.tar | 20.5 GB | https://sc.link/JpD |
Audio files in wav format

| Archives           | Size    | Links               |
|--------------------|---------|---------------------|
| train_farfield.tar | 15.4 GB | https://sc.link/1Z3 |
| train_crowd0.tar   | 11 GB   | https://sc.link/Lrg |
| train_crowd1.tar   | 14 GB   | https://sc.link/MvQ |
| train_crowd2.tar   | 13.2 GB | https://sc.link/NwL |
| train_crowd3.tar   | 11.6 GB | https://sc.link/Oxg |
| train_crowd4.tar   | 15.8 GB | https://sc.link/Pyz |
| train_crowd5.tar   | 13.1 GB | https://sc.link/Qz7 |
| train_crowd6.tar   | 15.7 GB | https://sc.link/RAL |
| train_crowd7.tar   | 12.7 GB | https://sc.link/VG5 |
| train_crowd8.tar   | 12.2 GB | https://sc.link/WJW |
| train_crowd9.tar   | 8.08 GB | https://sc.link/XKk |
| test.tar           | 1.3 GB  | https://sc.link/Kqr |
Evaluation - Word Error Rate (%) for different test sets
| Decoder \ Test set | Crowd test | Farfield test | MCV1 dev | MCV1 test |
|---|---|---|---|---|
| Greedy decoder | 4.389 % | 14.949 % | 9.314 % | 11.278 % |
| Beam Search with Common Crawl LM | 4.709 % | 12.503 % | 6.341 % | 7.976 % |
| Beam Search with Golos train set LM | 3.548 % | 12.384 % | - | - |
| Beam Search with Common Crawl and Golos LM | 3.318 % | 11.488 % | 6.4 % | 8.06 % |
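The table reports word error rate; for reference, WER on your own transcripts can be computed with the jiwer package (a common choice, not something the Golos authors prescribe), as in this small sketch with made-up Russian utterances:

```python
# WER = (substitutions + deletions + insertions) / number of reference words.
import jiwer

reference = ["добрый день как дела", "включи музыку"]
hypothesis = ["добрый день как дела", "включи музыка"]
print(f"WER: {jiwer.wer(reference, hypothesis):.3f}")
```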
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction
Chinese-C4 is a clean Chinese internet dataset based on Common Crawl. The dataset is 46.29 GB and has undergone multiple cleaning strategies, including Chinese filtering, heuristic cleaning based on punctuation, line-based hashing for deduplication, and repetition removal. The dataset is open source and free for commercial use; you are welcome to use the data and the cleaning strategies provided, and to contribute your own cleaning strategies. You can find the cleaning… See the full description on the dataset page: https://huggingface.co/datasets/shjwudp/chinese-c4.
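Purely as an illustration of one of the cleaning steps named above (line-based hashing for deduplication) - the project's own cleaning scripts define the actual strategy:

```python
# Drop lines whose normalized hash has already been seen.
import hashlib

def dedupe_lines(lines):
    seen = set()
    for line in lines:
        key = hashlib.md5(line.strip().encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            yield line

docs = ["你好，世界。", "你好，世界。", "这是另一行文本。"]
print(list(dedupe_lines(docs)))  # duplicate line removed
```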
This was converted from the PyTorch state_dict, and I'm not sure it will work because I got this warning. I don't think the cls parameters matter, but I'm wondering about the position_ids.
Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.predictions.bias', 'bert.embeddings.position_ids', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.decoder.bias', 'cls.predictions.transform.LayerNorm.bias']
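For context, a conversion along these lines typically produces that warning when transformers initialises TF weights from a PyTorch checkpoint (the local paths below are placeholders):

```python
# Build a TF model from a PyTorch checkpoint and save the TF weights.
from transformers import TFBertModel

tf_model = TFBertModel.from_pretrained("path/to/pytorch_checkpoint", from_pt=True)
tf_model.save_pretrained("path/to/tf_checkpoint")
# The unused cls.* pre-training heads are expected to be dropped for TFBertModel;
# position_ids are normally rebuilt from the config rather than loaded.
```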
MuRIL is a BERT model pre-trained on 17 Indian languages and their transliterated counterparts. We have released the pre-trained model (with the MLM layer intact, enabling masked word predictions) in this repository. We have also released the encoder on TFHub with an additional pre-processing module, that processes raw text into the expected input format for the encoder. You can find more details on MuRIL in this paper.
Apache 2.0 License
Link to model on Hugging Face Hub
This model uses a BERT base architecture [1] pretrained from scratch using the Wikipedia [2], Common Crawl [3], PMINDIA [4] and Dakshina [5] corpora for 17 [6] Indian languages.
We use a training paradigm similar to multilingual BERT, with a few modifications as listed:
We include translation and transliteration segment pairs in training as well. We keep an exponent value of 0.3 and not 0.7 for upsampling, shown to enhance low-resource performance. [7] See the Training section for more details.
The MuRIL model is pre-trained on monolingual segments as well as parallel segments, as detailed below.
We make use of publicly available corpora from Wikipedia and Common Crawl for 17 Indian languages.
We have two types of parallel data:
- Translated Data
We obtain translations of the above monolingual corpora using the Google NMT pipeline. We feed translated segment pairs as input. We also make use of the publicly available PMINDIA corpus.
- Transliterated Data
We obtain transliterations of Wikipedia using the IndicTrans [8] library. We feed transliterated segment pairs as input. We also make use of the publicly available Dakshina dataset.
We keep an exponent value of 0.3 to calculate duplication multiplier values for upsampling of lower-resourced languages and set dupe factors accordingly. Note that we limit transliterated pairs to Wikipedia only.
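As a small sketch of what exponent-based smoothing does (the multilingual-BERT-style sampling the card refers to), with made-up corpus sizes: raising each language's share to the power 0.3 and renormalising flattens the distribution, so smaller languages get proportionally larger duplication multipliers.

```python
# Exponent-smoothed sampling distribution and relative upsampling multipliers.
corpus_sizes = {"hi": 150_000_000, "ta": 40_000_000, "sa": 1_000_000}  # tokens, illustrative
alpha = 0.3

total = sum(corpus_sizes.values())
p = {lang: n / total for lang, n in corpus_sizes.items()}          # raw shares
q_unnorm = {lang: share ** alpha for lang, share in p.items()}
z = sum(q_unnorm.values())
q = {lang: val / z for lang, val in q_unnorm.items()}              # smoothed shares
dupe_factor = {lang: q[lang] / p[lang] for lang in corpus_sizes}   # relative upsampling
print(q, dupe_factor)
```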
The model was trained using a self-supervised masked language modeling task. We do whole word masking with a maximum of 80 predictions. The model was trained for 1000K steps, with a batch size of 4096, and a max sequence length of 512.
All parameters in the module are trainable, and fine-tuning all parameters is the recommended practice.
This model is intended to be used for a variety of downstream NLP tasks for Indian languages. This model is trained on transliterated data as well, a phenomenon commonly observed in the Indian context. This model is not expected to perform well on languages other than the ones used in pretraining, i.e. the 17 Indian languages.
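A hedged usage sketch via transformers; "google/muril-base-cased" is the Hub id I believe corresponds to this model, so verify it against the model link above:

```python
# Load MuRIL and run a forward pass to get contextual embeddings.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/muril-base-cased")
model = AutoModel.from_pretrained("google/muril-base-cased")

inputs = tokenizer("नमस्ते दुनिया", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)
```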
@misc{khanuja2021muril,
title={MuRIL: Multilingual Representations for Indian Languages},
author={Simran Khanuja and Diksha Bansal and Sarvesh Mehtani and Savya Khosla and Atreyee Dey and Balaji Gopalan and Dilip Kumar Margam and Pooja Aggarwal and Rajiv Teja Nagipogu and Shachi Dave and Shruti Gupta and Subhash Chandra Bose Gali and Vish Subramanian and Partha Talukdar},
year={2021},
eprint={2103.10730},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
https://choosealicense.com/licenses/unknown/
Dataset Card for "wmt19"
Dataset Summary
Warning: There are issues with the Common Crawl corpus data (training-parallel-commoncrawl.tgz):
Non-English files contain many English sentences.
Their "parallel" sentences in English are not aligned: they are uncorrelated with their counterpart.
We have contacted the WMT organizers, and in response, they have indicated that they do not have plans to update the Common Crawl corpus data. Their rationale pertains… See the full description on the dataset page: https://huggingface.co/datasets/wmt/wmt19.