49 datasets found
  1. statistics

    • huggingface.co
    Updated Nov 20, 2024
    Cite
    Common Crawl Foundation (2024). statistics [Dataset]. https://huggingface.co/datasets/commoncrawl/statistics
    Explore at:
    Dataset updated
    Nov 20, 2024
    Dataset provided by
    Common Crawl (http://commoncrawl.org/)
    Authors
    Common Crawl Foundation
    Description

    Common Crawl Statistics

    Number of pages, distribution of top-level domains, crawl overlaps, and other basic metrics about the Common Crawl Monthly Crawl Archives. For more detailed information and graphs, please visit our official statistics page. Here you can find the following statistics files:

      Charsets
    

    The character set or encoding is identified for HTML pages only, using Tika's AutoDetectReader. The table shows the percentage of HTML pages encoded with each character set… See the full description on the dataset page: https://huggingface.co/datasets/commoncrawl/statistics.
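
    A minimal sketch of pulling one of these statistics files from the Hugging Face Hub with pandas and huggingface_hub; the file name "charsets.csv" is an assumption, so check the file listing on the dataset page first:

    import pandas as pd
    from huggingface_hub import hf_hub_download

    # Download a single statistics file from the dataset repository.
    path = hf_hub_download(
        repo_id="commoncrawl/statistics",
        filename="charsets.csv",   # assumed file name; see the repo for the actual files
        repo_type="dataset",
    )
    charsets = pd.read_csv(path)
    print(charsets.head())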

  2. Random sample of Common Crawl domains from 2021

    • kaggle.com
    Updated Aug 19, 2021
    Cite
    HiHarshSinghal (2021). Random sample of Common Crawl domains from 2021 [Dataset]. https://www.kaggle.com/datasets/harshsinghal/random-sample-of-common-crawl-domains-from-2021/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 19, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    HiHarshSinghal
    Description

    Context

    The Common Crawl project has fascinated me ever since I learned about it. It provides data in a large number of formats and presents challenges across skill and interest areas. I am particularly interested in URL analysis for applications such as typosquatting, malicious URLs, and just about anything interesting that can be done with domain names.

    Content

    I have sampled 1% of the domains from the Common Crawl Index dataset that is available on AWS in Parquet format. You can read more about how I extracted this dataset at https://harshsinghal.dev/create-a-url-dataset-for-nlp/
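
    As a rough illustration of that kind of sampling, here is a DuckDB sketch that draws roughly 1% of registered domains from Common Crawl's columnar URL index on S3; the crawl label, S3 layout details and hash-based sampling rule are my assumptions, not the author's exact method:

    import duckdb

    con = duckdb.connect()
    con.execute("INSTALL httpfs; LOAD httpfs;")
    con.execute("SET s3_region='us-east-1';")  # the index bucket lives in us-east-1

    # Keep roughly 1% of registered domains via a deterministic hash bucket.
    domains = con.execute("""
        SELECT DISTINCT url_host_registered_domain AS domain
        FROM read_parquet('s3://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2021-31/subset=warc/*.parquet')
        WHERE hash(url_host_registered_domain) % 100 = 0
    """).df()
    print(len(domains), "domains sampled")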

    Acknowledgements

    Thanks a ton to the folks at https://commoncrawl.org/ for making this immensely valuable resource available to the world for free. Please find their Terms of Use at https://commoncrawl.org/terms-of-use.

    Inspiration

    My interests are in working with string similarity functions and I continue to find scalable ways of doing this. I wrote about using a Postgres extension to compute string distances and used Common Crawl URL domains as the input dataset (you can read more @ https://harshsinghal.dev/postgres-text-similarity-with-commoncrawl-domains/).

    I am also interested in identifying fraudulent domains and understanding malicious URL patterns.

  3. C4 Dataset

    • paperswithcode.com
    Updated Dec 13, 2023
    + more versions
    Cite
    Colin Raffel; Noam Shazeer; Adam Roberts; Katherine Lee; Sharan Narang; Michael Matena; Yanqi Zhou; Wei Li; Peter J. Liu (2023). C4 Dataset [Dataset]. https://paperswithcode.com/dataset/c4
    Explore at:
    Dataset updated
    Dec 13, 2023
    Authors
    Colin Raffel; Noam Shazeer; Adam Roberts; Katherine Lee; Sharan Narang; Michael Matena; Yanqi Zhou; Wei Li; Peter J. Liu
    Description

    C4 is a colossal, cleaned version of Common Crawl's web crawl corpus. It is based on the Common Crawl dataset (https://commoncrawl.org) and was used to train the T5 text-to-text Transformer models.

    The dataset can be downloaded in a pre-processed form from allennlp.
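
    A short, hedged example of streaming a few C4 documents from the copy on the Hugging Face Hub; the repository id "allenai/c4" and the "en" configuration are assumptions based on the commonly used hosted mirror:

    from datasets import load_dataset

    # Streaming avoids downloading the full (multi-terabyte) corpus.
    c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)
    for doc in c4.take(3):
        print(doc["url"], doc["text"][:80])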

  4. c4_wsrs

    • tensorflow.org
    Updated Dec 22, 2022
    Cite
    (2022). c4_wsrs [Dataset]. https://www.tensorflow.org/datasets/catalog/c4_wsrs
    Explore at:
    Dataset updated
    Dec 22, 2022
    Description

    A medical abbreviation expansion dataset which applies web-scale reverse substitution (wsrs) to the C4 dataset, which is a colossal, cleaned version of Common Crawl's web crawl corpus.

    The original source is the Common Crawl dataset: https://commoncrawl.org

    To use this dataset:

    import tensorflow_datasets as tfds

    ds = tfds.load('c4_wsrs', split='train')
    for ex in ds.take(4):
      print(ex)


    See the guide for more information on tensorflow_datasets.

  5. Common Crawl Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Oct 7, 2014
    + more versions
    Cite
    (2014). Common Crawl Dataset [Dataset]. https://paperswithcode.com/dataset/common-crawl
    Explore at:
    Dataset updated
    Oct 7, 2014
    Description

    The Common Crawl corpus contains petabytes of data collected over 12 years of web crawling. The corpus contains raw web page data, metadata extracts and text extracts. Common Crawl data is stored on Amazon Web Services’ Public Data Sets and on multiple academic cloud platforms across the world.
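
    As a small illustration of how the archives are laid out, the sketch below lists the WARC file paths of one monthly crawl via the public HTTPS endpoint; the crawl label "CC-MAIN-2023-50" is only an example and can be swapped for any crawl listed on commoncrawl.org:

    import gzip
    import urllib.request

    # Each monthly crawl publishes a gzipped list of its WARC file paths.
    url = "https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-50/warc.paths.gz"
    with urllib.request.urlopen(url) as resp:
        paths = gzip.decompress(resp.read()).decode("utf-8").splitlines()

    print(len(paths), "WARC files in this crawl")
    print(paths[:3])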

  6. citations

    • huggingface.co
    Updated Jul 30, 2024
    + more versions
    Cite
    Common Crawl Foundation (2024). citations [Dataset]. https://huggingface.co/datasets/commoncrawl/citations
    Explore at:
    Dataset updated
    Jul 30, 2024
    Dataset provided by
    Common Crawl (http://commoncrawl.org/)
    Authors
    Common Crawl Foundation
    Description

    Common Crawl Citations Overview

    This dataset contains citations referencing Common Crawl Foundation and its datasets, pulled from Google Scholar. Please note that these citations are not curated, so they will include some false positives. For an annotated subset of these citations with additional fields, please see citations-annotated.

  7. Common Crawl News 20200110212037-00310

    • kaggle.com
    Updated Jan 11, 2020
    Cite
    Gabriel Altay (2020). Common Crawl News 20200110212037-00310 [Dataset]. https://www.kaggle.com/gabrielaltay/common-crawl-news-2020011021203700310/discussion
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 11, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Gabriel Altay
    License

    CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)

    Description

    This dataset contains a single Web ARChive (WARC) file downloaded from the Common Crawl S3 bucket. It covers part of January 10, 2020.
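
    A minimal sketch of iterating over the HTML responses in such a WARC file with the warcio library (pip install warcio); the file name below is a placeholder for the file shipped with this dataset:

    from warcio.archiveiterator import ArchiveIterator

    with open("CC-NEWS-20200110212037-00310.warc.gz", "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type == "response":          # skip request/metadata records
                url = record.rec_headers.get_header("WARC-Target-URI")
                body = record.content_stream().read()
                print(url, len(body), "bytes")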

  8. Ancillary Monitor Corpus: Common Crawl - german web (YEAR 2024 – VERSION 1)

    • lindat.cz
    Updated Feb 1, 2025
    Cite
    Jan Oliver Rüdiger (2025). Ancillary Monitor Corpus: Common Crawl - german web (YEAR 2024 – VERSION 1) [Dataset]. https://lindat.cz/repository/xmlui/handle/11372/LRT-5811?show=full
    Explore at:
    Dataset updated
    Feb 1, 2025
    Authors
    Jan Oliver Rüdiger
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    *** german version see below ***

    The ‘Ancillary Monitor Corpus: Common Crawl - german web’ was designed to enable a broad-based linguistic analysis of the German-language (visible) internet over time, aiming for comparability with DeReKo (the ‘German Reference Corpus’ of the Leibniz Institute for the German Language; DeReKo volume: 57 billion tokens; status: DeReKo Release 2024-I). The corpus is separated by year (here: 2024) and versioned (here: version 1). Version 1 comprises 97.45 billion tokens across all years (2013-2024).

    The corpus is based on the data dumps from CommonCrawl (https://commoncrawl.org/). CommonCrawl is a non-profit organisation that provides copies of the visible Internet free of charge for research purposes.

    The CommonCrawl WET raw data was first filtered by TLD (top-level domain). Only pages ending in the following TLDs were taken into account: ‘.at; .bayern; .berlin; .ch; .cologne; .de; .gmbh; .hamburg; .koeln; .nrw; .ruhr; .saarland; .swiss; .tirol; .wien; .zuerich’. These are the exclusive German-language TLDs according to ICANN (https://data.iana.org/TLD/tlds-alpha-by-domain.txt) as of 1 June 2024 - TLDs with a purely corporate reference (e.g. ‘.edeka; .bmw; .ford’) were excluded. The language of the individual documents (URLs) was then estimated with the help of NTextCat (https://github.com/ivanakcheurov/ntextcat) (via the CORE14 profile of NTextCat) - only those documents/URLs for which German was the most likely language were processed further (e.g. to exclude foreign-language material such as individual subpages). The third step involved filtering for manual selectors and filtering for 1:1 duplicates (within one year).
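
    A rough Python analogue of the first two filtering steps (TLD allowlist, then a per-document language check); the authors used NTextCat, so the langdetect call here is only a stand-in to illustrate the idea:

    from urllib.parse import urlparse
    from langdetect import detect  # stand-in for NTextCat's CORE14 profile

    GERMAN_TLDS = {
        "at", "bayern", "berlin", "ch", "cologne", "de", "gmbh", "hamburg",
        "koeln", "nrw", "ruhr", "saarland", "swiss", "tirol", "wien", "zuerich",
    }

    def keep_document(url: str, text: str) -> bool:
        host = urlparse(url).hostname or ""
        if host.rsplit(".", 1)[-1].lower() not in GERMAN_TLDS:
            return False
        try:
            return detect(text) == "de"   # keep only documents most likely German
        except Exception:
            return False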

    The filtering and subsequent processing was carried out using CorpusExplorer (http://hdl.handle.net/11234/1-2634) and our own (supplementary) scripts, and the TreeTagger (http://hdl.handle.net/11372/LRT-323) was used for automatic annotation. The corpus was processed on the HELIX HPC cluster. The author would like to take this opportunity to thank the state of Baden-Württemberg and the German Research Foundation (DFG) for the possibility to use the bwHPC/HELIX HPC cluster - funding code HPC cluster: INST 35/1597-1 FUGG.

    Data content:
    • Tokens and record boundaries
    • Automatic lemma and POS annotation (using TreeTagger)
    • Metadata:
      • GUID - unique identifier of the document
      • YEAR - year of capture (please use this information for data slices)
      • Url - full URL
      • Tld - top-level domain
      • Domain - domain without TLD (but with sub-domains if applicable)
      • DomainFull - complete domain (incl. TLD)
      • Datum - (system information) date assigned by CorpusExplorer (date of capture by CommonCrawl, not date of creation/modification of the document)
      • Hash - (system information) SHA1 hash of the CommonCrawl record
      • Pfad - (system information) path on the cluster (raw data), supplied by the system

    Please note that the files are saved as *.cec6.gz. These are binary files of the CorpusExplorer (see above). These files ensure efficient archiving. You can use both CorpusExplorer and the ‘CEC6-Converter’ (available for Linux, MacOS and Windows - see: https://lindat.mff.cuni.cz/repository/xmlui/handle/11372/LRT-5705) to convert the data. The data can be exported in the following formats:

    • CATMA v6
    • CoNLL
    • CSV
    • CSV (only meta-data)
    • DTA TCF-XML
    • DWDS TEI-XML
    • HTML
    • IDS I5-XML
    • IDS KorAP XML
    • IMS Open Corpus Workbench
    • JSON
    • OPUS Corpus Collection XCES
    • Plaintext
    • SaltXML
    • SlashA XML
    • SketchEngine VERT
    • SPEEDy/CODEX (JSON)
    • TLV-XML
    • TreeTagger
    • TXM
    • WebLicht
    • XML

    Please note that exporting increases the storage space requirements considerably. The ‘CorpusExplorerConsole’ (https://github.com/notesjor/CorpusExplorer.Terminal.Console - available for Linux, MacOS and Windows) also offers a simple solution for editing and analysing the data. If you have any questions, please contact the author.

    Legal information: The data was downloaded on 01.11.2024. Use, processing and distribution are subject to §60d UrhG (German copyright law), which authorises use for non-commercial purposes in research and teaching. LINDAT/CLARIN is responsible for long-term archiving in accordance with §69d para. 5 and ensures that only authorised persons can access the data. The data has been checked to the best of our knowledge and belief (on a random basis). Should you nevertheless find legal violations (e.g. right to be forgotten, personal rights, etc.), please write an e-mail to the author (amc_report@jan-oliver-ruediger.de) with the following information: 1) why this content is undesirable (please outline only briefly) and 2) how the content can be identified, e.g. file name, URL or domain. The author will endeavour to identify and remove the content and re-upload the data (modified) within two weeks (new version). If

  9. Ancillary Monitor Corpus: Common Crawl - german web (YEAR 2022 – VERSION 1)

    • lindat.cz
    Updated Nov 21, 2024
    + more versions
    Cite
    Jan Oliver Rüdiger (2024). Ancillary Monitor Corpus: Common Crawl - german web (YEAR 2022 – VERSION 1) [Dataset]. https://lindat.cz/repository/xmlui/handle/11372/LRT-5794
    Explore at:
    Dataset updated
    Nov 21, 2024
    Authors
    Jan Oliver Rüdiger
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    *** german version see below ***

    The ‘Ancillary Monitor Corpus: Common Crawl - german web’ was designed to enable a broad-based linguistic analysis of the German-language (visible) internet over time, aiming for comparability with DeReKo (the ‘German Reference Corpus’ of the Leibniz Institute for the German Language; DeReKo volume: 57 billion tokens; status: DeReKo Release 2024-I). The corpus is separated by year (here: 2022) and versioned (here: version 1). Version 1 comprises 97.45 billion tokens across all years (2013-2024).

    The corpus is based on the data dumps from CommonCrawl (https://commoncrawl.org/). CommonCrawl is a non-profit organisation that provides copies of the visible Internet free of charge for research purposes.

    The CommonCrawl WET raw data was first filtered by TLD (top-level domain). Only pages ending in the following TLDs were taken into account: ‘.at; .bayern; .berlin; .ch; .cologne; .de; .gmbh; .hamburg; .koeln; .nrw; .ruhr; .saarland; .swiss; .tirol; .wien; .zuerich’. These are the exclusive German-language TLDs according to ICANN (https://data.iana.org/TLD/tlds-alpha-by-domain.txt) as of 1 June 2024 - TLDs with a purely corporate reference (e.g. ‘.edeka; .bmw; .ford’) were excluded. The language of the individual documents (URLs) was then estimated with the help of NTextCat (https://github.com/ivanakcheurov/ntextcat) (via the CORE14 profile of NTextCat) - only those documents/URLs for which German was the most likely language were processed further (e.g. to exclude foreign-language material such as individual subpages). The third step involved filtering for manual selectors and filtering for 1:1 duplicates (within one year).

    The filtering and subsequent processing was carried out using CorpusExplorer (http://hdl.handle.net/11234/1-2634) and our own (supplementary) scripts, and the TreeTagger (http://hdl.handle.net/11372/LRT-323) was used for automatic annotation. The corpus was processed on the HELIX HPC cluster. The author would like to take this opportunity to thank the state of Baden-Württemberg and the German Research Foundation (DFG) for the possibility to use the bwHPC/HELIX HPC cluster - funding code HPC cluster: INST 35/1597-1 FUGG.

    Data content:
    • Tokens and record boundaries
    • Automatic lemma and POS annotation (using TreeTagger)
    • Metadata:
      • GUID - unique identifier of the document
      • YEAR - year of capture (please use this information for data slices)
      • Url - full URL
      • Tld - top-level domain
      • Domain - domain without TLD (but with sub-domains if applicable)
      • DomainFull - complete domain (incl. TLD)
      • Datum - (system information) date assigned by CorpusExplorer (date of capture by CommonCrawl, not date of creation/modification of the document)
      • Hash - (system information) SHA1 hash of the CommonCrawl record
      • Pfad - (system information) path on the cluster (raw data), supplied by the system

    Please note that the files are saved as *.cec6.gz. These are binary files of the CorpusExplorer (see above). These files ensure efficient archiving. You can use both CorpusExplorer and the ‘CEC6-Converter’ (available for Linux, MacOS and Windows - see: https://lindat.mff.cuni.cz/repository/xmlui/handle/11372/LRT-5705) to convert the data. The data can be exported in the following formats:

    • CATMA v6
    • CoNLL
    • CSV
    • CSV (only meta-data)
    • DTA TCF-XML
    • DWDS TEI-XML
    • HTML
    • IDS I5-XML
    • IDS KorAP XML
    • IMS Open Corpus Workbench
    • JSON
    • OPUS Corpus Collection XCES
    • Plaintext
    • SaltXML
    • SlashA XML
    • SketchEngine VERT
    • SPEEDy/CODEX (JSON)
    • TLV-XML
    • TreeTagger
    • TXM
    • WebLicht
    • XML

    Please note that exporting increases the storage space requirements considerably. The ‘CorpusExplorerConsole’ (https://github.com/notesjor/CorpusExplorer.Terminal.Console - available for Linux, MacOS and Windows) also offers a simple solution for editing and analysing the data. If you have any questions, please contact the author.

    Legal information: The data was downloaded on 01.11.2024. Use, processing and distribution are subject to §60d UrhG (German copyright law), which authorises use for non-commercial purposes in research and teaching. LINDAT/CLARIN is responsible for long-term archiving in accordance with §69d para. 5 and ensures that only authorised persons can access the data. The data has been checked to the best of our knowledge and belief (on a random basis). Should you nevertheless find legal violations (e.g. right to be forgotten, personal rights, etc.), please write an e-mail to the author (amc_report@jan-oliver-ruediger.de) with the following information: 1) why this content is undesirable (please outline only briefly) and 2) how the content can be identified, e.g. file name, URL or domain. The author will endeavour to identify and remove the content and re-upload the data (modified) within two weeks (new version). If

  10. CC-MAIN-2018-22_urls

    • huggingface.co
    Updated Jan 15, 2025
    + more versions
    Cite
    Nick Hagar (2025). CC-MAIN-2018-22_urls [Dataset]. http://doi.org/10.57967/hf/4101
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 15, 2025
    Authors
    Nick Hagar
    Description

    This dataset contains domain names and counts of (non-deduplicated) URLs for every record in the CC-MAIN-2018-22 snapshot of the Common Crawl. It was collected from the AWS S3 version of Common Crawl via Amazon Athena. This dataset is derived from Common Crawl data and is subject to Common Crawl's Terms of Use: https://commoncrawl.org/terms-of-use.
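
    The query below sketches how such per-domain counts can be produced with Amazon Athena over Common Crawl's columnar index; the ccindex table and column names follow Common Crawl's published Athena setup, the staging bucket is a placeholder, and the author's exact query is not reproduced here:

    from pyathena import connect

    cursor = connect(
        s3_staging_dir="s3://your-athena-results-bucket/",  # placeholder bucket
        region_name="us-east-1",
    ).cursor()

    # Count (non-deduplicated) URLs per registered domain in one crawl snapshot.
    cursor.execute("""
        SELECT url_host_registered_domain AS domain, COUNT(*) AS url_count
        FROM ccindex.ccindex
        WHERE crawl = 'CC-MAIN-2018-22' AND subset = 'warc'
        GROUP BY url_host_registered_domain
        ORDER BY url_count DESC
    """)
    for domain, url_count in cursor.fetchall()[:10]:
        print(domain, url_count)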

  11. Comprehensive set of Sitemap and robots.txt links extracted from Common...

    • zenodo.org
    zip
    Updated Mar 8, 2024
    Cite
    Michael Dinzinger; Michael Dinzinger (2024). Comprehensive set of Sitemap and robots.txt links extracted from Common Crawl [Dataset]. http://doi.org/10.5281/zenodo.10511292
    Explore at:
    Available download formats: zip
    Dataset updated
    Mar 8, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Michael Dinzinger; Michael Dinzinger
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 14, 2024
    Description

    This is a comprehensive list of links to sitemaps and robots.txt files, extracted from the WARC archive dump 2023-50 of robots.txt files (the latest dump at the time). A short extraction sketch follows the tables below.

    Sitemaps:

    • 32,252,027 links (all ending with .xml or .xml.gz); 395.2 MB (compressed)
    • Website categories; 2.2 MB (compressed)
    • Number of sitemap links per top-level label of the Curlie.org directory:
      Arts: 20,110
      Business: 68,690
      Computers: 17,404
      Games: 3,068
      Health: 13,999
      Home: 4,130
      Kids_and_Teens: 2,240
      News: 5,855
      Recreation: 19,273
      Reference: 10,862
      Regional: 419
      Science: 10,729
      Shopping: 29,903
      Society: 35,019
      Sports: 12,597

    Robots.txt files:

    • 41,611,704 links; 440.9 MB (compressed)
    • Website categories; 2.7 MB (compressed)
    • Number of robots.txt links per top-level label of the Curlie.org directory:
      Arts: 25,281
      Business: 79,497
      Computers: 21,880
      Games: 5,037
      Health: 17,326
      Home: 5,401
      Kids_and_Teens: 3,753
      News: 3,424
      Recreation: 26,355
      Reference: 15,404
      Regional: 678
      Science: 16,500
      Shopping: 30,266
      Society: 45,397
      Sports: 18,029
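
    As referenced above, here is a tiny sketch of the kind of extraction this dataset performs at scale: pulling Sitemap directives out of a robots.txt file. The example URL is only an illustration and may yield an empty list:

    import re
    import urllib.request

    robots = urllib.request.urlopen("https://commoncrawl.org/robots.txt").read().decode("utf-8", "replace")
    sitemaps = re.findall(r"(?im)^sitemap:\s*(\S+)", robots)
    print(sitemaps)
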
  12. Ancillary Monitor Corpus: Common Crawl - german web (YEAR 2014 – VERSION 1)

    • lindat.cz
    Updated Nov 16, 2024
    Cite
    Jan Oliver Rüdiger (2024). Ancillary Monitor Corpus: Common Crawl - german web (YEAR 2014 – VERSION 1) [Dataset]. https://lindat.cz/repository/xmlui/handle/11372/LRT-5788?show=full
    Explore at:
    Dataset updated
    Nov 16, 2024
    Authors
    Jan Oliver Rüdiger
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    *** german version see below ***

    The ‘Ancillary Monitor Corpus: Common Crawl - german web’ was designed to enable a broad-based linguistic analysis of the German-language (visible) internet over time, aiming for comparability with DeReKo (the ‘German Reference Corpus’ of the Leibniz Institute for the German Language; DeReKo volume: 57 billion tokens; status: DeReKo Release 2024-I). The corpus is separated by year (here: 2014) and versioned (here: version 1). Version 1 comprises 97.45 billion tokens across all years (2013-2024).

    The corpus is based on the data dumps from CommonCrawl (https://commoncrawl.org/). CommonCrawl is a non-profit organisation that provides copies of the visible Internet free of charge for research purposes.

    The CommonCrawl WET raw data was first filtered by TLD (top-level domain). Only pages ending in the following TLDs were taken into account: ‘.at; .bayern; .berlin; .ch; .cologne; .de; .gmbh; .hamburg; .koeln; .nrw; .ruhr; .saarland; .swiss; .tirol; .wien; .zuerich’. These are the exclusive German-language TLDs according to ICANN (https://data.iana.org/TLD/tlds-alpha-by-domain.txt) as of 1 June 2024 - TLDs with a purely corporate reference (e.g. ‘.edeka; .bmw; .ford’) were excluded. The language of the individual documents (URLs) was then estimated with the help of NTextCat (https://github.com/ivanakcheurov/ntextcat) (via the CORE14 profile of NTextCat) - only those documents/URLs for which German was the most likely language were processed further (e.g. to exclude foreign-language material such as individual subpages). The third step involved filtering for manual selectors and filtering for 1:1 duplicates (within one year).

    The filtering and subsequent processing was carried out using CorpusExplorer (http://hdl.handle.net/11234/1-2634) and our own (supplementary) scripts, and the TreeTagger (http://hdl.handle.net/11372/LRT-323) was used for automatic annotation. The corpus was processed on the HELIX HPC cluster. The author would like to take this opportunity to thank the state of Baden-Württemberg and the German Research Foundation (DFG) for the possibility to use the bwHPC/HELIX HPC cluster - funding code HPC cluster: INST 35/1597-1 FUGG.

    Data content:
    • Tokens and record boundaries
    • Automatic lemma and POS annotation (using TreeTagger)
    • Metadata:
      • GUID - unique identifier of the document
      • YEAR - year of capture (please use this information for data slices)
      • Url - full URL
      • Tld - top-level domain
      • Domain - domain without TLD (but with sub-domains if applicable)
      • DomainFull - complete domain (incl. TLD)
      • Datum - (system information) date assigned by CorpusExplorer (date of capture by CommonCrawl, not date of creation/modification of the document)
      • Hash - (system information) SHA1 hash of the CommonCrawl record
      • Pfad - (system information) path on the cluster (raw data), supplied by the system

    Please note that the files are saved as *.cec6.gz. These are binary files of the CorpusExplorer (see above). These files ensure efficient archiving. You can use both CorpusExplorer and the ‘CEC6-Converter’ (available for Linux, MacOS and Windows - see: https://lindat.mff.cuni.cz/repository/xmlui/handle/11372/LRT-5705) to convert the data. The data can be exported in the following formats:

    • CATMA v6
    • CoNLL
    • CSV
    • CSV (only meta-data)
    • DTA TCF-XML
    • DWDS TEI-XML
    • HTML
    • IDS I5-XML
    • IDS KorAP XML
    • IMS Open Corpus Workbench
    • JSON
    • OPUS Corpus Collection XCES
    • Plaintext
    • SaltXML
    • SlashA XML
    • SketchEngine VERT
    • SPEEDy/CODEX (JSON)
    • TLV-XML
    • TreeTagger
    • TXM
    • WebLicht
    • XML

    Please note that exporting increases the storage space requirements considerably. The ‘CorpusExplorerConsole’ (https://github.com/notesjor/CorpusExplorer.Terminal.Console - available for Linux, MacOS and Windows) also offers a simple solution for editing and analysing the data. If you have any questions, please contact the author.

    Legal information: The data was downloaded on 01.11.2024. Use, processing and distribution are subject to §60d UrhG (German copyright law), which authorises use for non-commercial purposes in research and teaching. LINDAT/CLARIN is responsible for long-term archiving in accordance with §69d para. 5 and ensures that only authorised persons can access the data. The data has been checked to the best of our knowledge and belief (on a random basis). Should you nevertheless find legal violations (e.g. right to be forgotten, personal rights, etc.), please write an e-mail to the author (amc_report@jan-oliver-ruediger.de) with the following information: 1) why this content is undesirable (please outline only briefly) and 2) how the content can be identified, e.g. file name, URL or domain. The author will endeavour to identify and remove the content and re-upload the data (modified) within two weeks (new version). If

  13. clean-si-mc4

    • huggingface.co
    Updated Dec 21, 2021
    + more versions
    Cite
    Keshan Sodimana (2021). clean-si-mc4 [Dataset]. https://huggingface.co/datasets/keshan/clean-si-mc4
    Explore at:
    Dataset updated
    Dec 21, 2021
    Authors
    Keshan Sodimana
    Description

    A colossal, cleaned version of Common Crawl's web crawl corpus, based on the Common Crawl dataset (https://commoncrawl.org). This is the processed version of Google's mC4 dataset prepared by AllenAI.

  14. CEREAL I, el Corpus del Español REAL

    • zenodo.org
    bz2, txt
    Updated Jun 15, 2024
    Cite
    Cristina España-Bonet; Cristina España-Bonet; Alberto Barrón-Cedeño; Alberto Barrón-Cedeño (2024). CEREAL I, el Corpus del Español REAL [Dataset]. http://doi.org/10.5281/zenodo.11387864
    Explore at:
    Available download formats: bz2, txt
    Dataset updated
    Jun 15, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Cristina España-Bonet; Cristina España-Bonet; Alberto Barrón-Cedeño; Alberto Barrón-Cedeño
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Content:

    CEREAL (visit the project website) is a document-level corpus of documents in Spanish extracted from OSCAR. Each document in the corpus is classified according to its country of origin. CEREAL covers 24 countries where Spanish is spoken. Following OSCAR, we provide our annotations under a CC0 license, but we do not hold the copyright of the text content, which comes from OSCAR and therefore from Common Crawl.

    The process to build the corpus and its characteristics can be found in:

    Cristina España-Bonet and Alberto Barrón-Cedeño. "Elote, Choclo and Mazorca: on the Varieties of Spanish." In Proceedings of the 2024 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2024), Mexico City, Mexico, June 2024.

    The corpus used to train the classifier and the sentence-level version of CEREAL is available at
    https://zenodo.org/records/11390829

    Files Description:

    See the README.txt file

  15. CC-MAIN-2019-18_urls

    • huggingface.co
    Updated Jan 15, 2025
    + more versions
    Cite
    Nick Hagar (2025). CC-MAIN-2019-18_urls [Dataset]. http://doi.org/10.57967/hf/4091
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 15, 2025
    Authors
    Nick Hagar
    Description

    This dataset contains domain names and counts of (non-deduplicated) URLs for every record in the CC-MAIN-2019-18 snapshot of the Common Crawl. It was collected from the AWS S3 version of Common Crawl via Amazon Athena. This dataset is derived from Common Crawl data and is subject to Common Crawl's Terms of Use: https://commoncrawl.org/terms-of-use.

  16. Dataset for Report: "The Increasing Prominence of Prejudice and Social...

    • data.niaid.nih.gov
    Updated Jun 13, 2022
    Cite
    David Rozado (2022). Dataset for Report: "The Increasing Prominence of Prejudice and Social Justice Rhetoric in UK News Media" [Dataset]. https://data.niaid.nih.gov/resources?id=ZENODO_6482344
    Explore at:
    Dataset updated
    Jun 13, 2022
    Dataset authored and provided by
    David Rozado
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    United Kingdom
    Description

    This data set contains frequency counts of target words in 16 million news and opinion articles from 10 popular news media outlets in the United Kingdom. The target words are listed in the associated report and are mostly words that denote prejudice or are often associated with social justice discourse. A few additional words not denoting prejudice are also available since they are used in the report for illustration purposes of the method.

    The textual content of news and opinion articles from the outlets is available in the outlet's online domains and/or public cache repositories such as Google cache (https://webcache.googleusercontent.com), The Internet Wayback Machine (https://archive.org/web/web.php), and Common Crawl (https://commoncrawl.org). We used derived word frequency counts from these sources. Textual content included in our analysis is circumscribed to articles headlines and main body of text of the articles and does not include other article elements such as figure captions.

    Targeted textual content was located in HTML raw data using outlet specific xpath expressions. Tokens were lowercased prior to estimating frequency counts. To prevent outlets with sparse text content for a year from distorting aggregate frequency counts, we only include outlet frequency counts from years for which there is at least 1 million words of article content from an outlet. This threshold was chosen to maximize inclusion in our analysis of outlets with sparse amounts of articles text per year.

    Yearly frequency usage of a target word in an outlet in any given year was estimated by dividing the total number of occurrences of the target word in all articles of a given year by the number of all words in all articles of that year. This method of estimating frequency accounts for variable volume of total article output over time.
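
    A minimal numeric illustration of that calculation, using made-up counts rather than real data:

    # Occurrences of a target word per outlet-year, and total words per outlet-year.
    target_counts = {2019: 1_200, 2020: 2_100}
    total_counts = {2019: 45_000_000, 2020: 52_000_000}

    # Yearly relative frequency = target occurrences / all words in that year.
    frequency = {year: target_counts[year] / total_counts[year] for year in target_counts}
    print(frequency)  # roughly {2019: 2.7e-05, 2020: 4.0e-05}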

    In a small percentage of articles, outlet specific XPath expressions might fail to properly capture the content of the article due to the heterogeneity of HTML elements and CSS styling combinations with which articles text content is arranged in outlets online domains. As a result, the total and target word counts metrics for a small subset of articles are not precise. In a random sample of articles and outlets, manual estimation of target words counts overlapped with the automatically derived counts for over 90% of the articles.

    Most of the incorrect frequency counts are minor deviations from the actual counts, such as counting the word "Facebook" in an article footnote that encourages readers to follow the journalist’s Facebook profile and that the XPath expression mistakenly included as part of the article's main text. To conclude, in a data analysis of over 16 million articles, we cannot manually check the correctness of frequency counts for every single article, and one hundred percent accuracy at capturing articles’ content is elusive due to the small number of difficult-to-detect boundary cases such as incorrect HTML markup syntax in online domains. Overall, however, we are confident that our frequency metrics are representative of word prevalence in print news media content (see Figure 2 of the main manuscript for supporting evidence of the temporal precision of the method).

  17. Webis-MS-MARCO-Anchor-Texts-22

    • zenodo.org
    application/gzip
    Updated Jan 30, 2022
    + more versions
    Cite
    Maik Fröbe; Maik Fröbe; Sebastian Günther; Sebastian Günther; Maximilian Probst; Martin Potthast; Martin Potthast; Matthias Hagen; Matthias Hagen; Maximilian Probst (2022). Webis-MS-MARCO-Anchor-Texts-22 [Dataset]. http://doi.org/10.5281/zenodo.5883456
    Explore at:
    Available download formats: application/gzip
    Dataset updated
    Jan 30, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Maik Fröbe; Maik Fröbe; Sebastian Günther; Sebastian Günther; Maximilian Probst; Martin Potthast; Martin Potthast; Matthias Hagen; Matthias Hagen; Maximilian Probst
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Webis MS MARCO Anchor Text 2022 dataset enriches Versions 1 and 2 of the MS MARCO document collection with anchor text extracted from six Common Crawl snapshots. The six snapshots cover the years 2016 to 2021 (between 1.7 and 3.4 billion documents each). For documents with more than 1,000 anchor texts we sampled 1,000 at random; for documents with fewer than 1,000 anchor texts we kept all of them (with this sampling, all anchor text is included for 94% of the documents in Version 1 and 97% of the documents in Version 2). Overall, the MS MARCO Anchor Text 2022 dataset enriches 1,703,834 documents in Version 1 and 4,821,244 documents in Version 2 with anchor text.
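
    The per-document sampling rule translates into a few lines of Python; this is only a sketch of the rule as described, not the authors' actual code:

    import random

    def sample_anchor_texts(anchor_texts, cap=1000, seed=0):
        """Keep all anchor texts up to the cap, otherwise a random sample of `cap`."""
        if len(anchor_texts) <= cap:
            return list(anchor_texts)
        return random.Random(seed).sample(anchor_texts, cap)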

    Cleaned versions of the MS MARCO Anchor Text 2022 dataset are available in ir_datasets, Zenodo and Hugging Face. The raw dataset with additional information and all metadata for the extracted anchor texts (roughly 100GB) is available on Hugging Face and files.webis.de.

    The details of the construction of the Webis MS MARCO Anchor Text 2022 dataset are described in the associated paper. If you use this dataset, please cite
    @InProceedings{froebe:2022a,
    address = {Berlin Heidelberg New York},
    author = {Maik Fr{\"o}be and Sebastian G{\"u}nther and Maximilian Probst and Martin Potthast and Matthias Hagen},
    booktitle = {Advances in Information Retrieval. 44th European Conference on IR Research (ECIR 2022)},
    editor = {Matthias Hagen and Suzan Verberne and Craig Macdonald and Christin Seifert and Krisztian Balog and Kjetil N{\o}rv\r{a}g and Vinay Setty},
    month = apr,
    publisher = {Springer},
    series = {Lecture Notes in Computer Science},
    site = {Stavanger, Norway},
    title = {{The Power of Anchor Text in the Neural Retrieval Era}},
    year = 2022
    }

  18. CommonCrawl News Articles by Political Orientation

    • webis.de
    • anthology.aicmu.ac.cn
    Updated 2022
    Cite
    Maximilian Keiff; Henning Wachsmuth (2022). CommonCrawl News Articles by Political Orientation [Dataset]. http://doi.org/10.5281/zenodo.7476697
    Explore at:
    Dataset updated
    2022
    Dataset provided by
    The Web Technology & Information Systems Network
    Leibniz Universität Hannover
    Authors
    Maximilian Keiff; Henning Wachsmuth
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset includes news articles gathered from CommonCrawl for media outlets that were selected based on their political orientation. The news articles span publication dates from 2010 to 2021.

  19. Data from: Text collection for training the BERTić transformer model...

    • clarin.si
    • live.european-language-grid.eu
    Updated May 8, 2021
    Cite
    Nikola Ljubešić (2021). Text collection for training the BERTić transformer model BERTić-data [Dataset]. https://clarin.si/repository/xmlui/handle/11356/1426?show=full
    Explore at:
    Dataset updated
    May 8, 2021
    Authors
    Nikola Ljubešić
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    The BERTić-data text collection contains more than 8 billion tokens of mostly web-crawled text written in Bosnian, Croatian, Montenegrin or Serbian. The collection was used to train the BERTić transformer model (https://huggingface.co/classla/bcms-bertic). The data consists of web crawls before 2015, i.e. bsWaC (http://hdl.handle.net/11356/1062), hrWaC (http://hdl.handle.net/11356/1064), and srWaC (http://hdl.handle.net/11356/1063); previously unpublished 2019-2020 crawls, i.e. cnrWaC, CLASSLA-bs, CLASSLA-hr, and CLASSLA-sr; the cc100-hr and cc100-sr parts of CommonCrawl (https://commoncrawl.org/); and the Riznica corpus (http://hdl.handle.net/11356/1180). All texts were transliterated to the Latin script. The format of the text collection is one-sentence-per-line, empty-line-as-document-boundary. More details, especially on the applied near-deduplication procedure, can be found in the BERTić paper (https://arxiv.org/pdf/2104.09243.pdf).
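
    A small sketch of reading that one-sentence-per-line, empty-line-as-document-boundary format, yielding one list of sentences per document ("bertic_part.txt" is a placeholder file name):

    def read_documents(path):
        doc = []
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                line = line.rstrip("\n")
                if line:
                    doc.append(line)
                elif doc:              # a blank line closes the current document
                    yield doc
                    doc = []
        if doc:
            yield doc

    for sentences in read_documents("bertic_part.txt"):
        print(len(sentences), "sentences in the first document")
        break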

  20. Data for manuscript "The Prevalence of Terms Denoting Far-right and Far-left...

    • zenodo.org
    • data.niaid.nih.gov
    bin
    Updated Mar 22, 2022
    Cite
    David Rozado; David Rozado (2022). Data for manuscript "The Prevalence of Terms Denoting Far-right and Far-left Political Extremism in U.S. and U.K. News Media" [Dataset]. http://doi.org/10.5281/zenodo.5437016
    Explore at:
    Available download formats: bin
    Dataset updated
    Mar 22, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    David Rozado; David Rozado
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    United Kingdom, United States
    Description

    This data set belongs to an academic manuscript examining longitudinally (2000-2019) the prevalence of terms denoting far-right and far-left political extremism in a large corpus of more than 32 million written news and opinion articles from 54 news media outlets popular in the United States and the United Kingdom.

    The textual content of news and opinion articles from the 54 outlets listed in the main manuscript is available in the outlet's online domains and/or public cache repositories such as Google cache (https://webcache.googleusercontent.com), The Internet Wayback Machine (https://archive.org/web/web.php), and Common Crawl (https://commoncrawl.org). We used derived word frequency counts from these sources. Textual content included in our analysis is circumscribed to articles headlines and main body of text of the articles and does not include other article elements such as figure captions.

    Targeted textual content was located in HTML raw data using outlet specific xpath expressions. Tokens were lowercased prior to estimating frequency counts. To prevent outlets with sparse text content for a year from distorting aggregate frequency counts, we only include outlet frequency counts from years for which there is at least 1 million words of article content from an outlet. This threshold was chosen to maximize inclusion in our analysis of outlets with sparse amounts of articles text per year.

    Yearly frequency usage of a target word in an outlet in any given year was estimated by dividing the total number of occurrences of the target word in all articles of a given year by the number of all words in all articles of that year. This method of estimating frequency accounts for variable volume of total article output over time.

    The compressed files in this data set are listed next:

    -analysisScripts.rar contains the analysis scripts used in the main manuscript

    -articlesContainingTargetWords.rar contains counts of target words in outlets articles as well as total counts of words in articles

    Usage Notes

    In a small percentage of articles, outlet specific XPath expressions failed to properly capture the content of the article due to the heterogeneity of HTML elements and CSS styling combinations with which articles text content is arranged in outlets online domains. As a result, the total and target word counts metrics for a small subset of articles are not precise. In a random sample of articles and outlets, manual estimation of target words counts overlapped with the automatically derived counts for over 90% of the articles.

    Most of the incorrect frequency counts were minor deviations from the actual counts, such as counting the word "Facebook" in an article footnote that encourages readers to follow the journalist’s Facebook profile and that the XPath expression mistakenly included as part of the article's main text. Some additional outlet-specific inaccuracies that we could identify occurred in the "The Hill" and "Newsmax" news outlets, where XPath expressions had some shortfalls at precisely capturing articles’ content. For "The Hill", in the years 2007-2009, XPath expressions failed to capture the complete text of the article in about 40% of the articles. This does not necessarily result in incorrect frequency counts for that outlet, but in a sample of articles’ words that is about 40% smaller than the total population of article words for those three years. In the case of "Newsmax", the issue was that for some articles, XPath expressions captured the entire text of the article twice. Notice that this does not result in incorrect frequency counts. If a word appears x times in an article with a total of y words, the same frequency count will still be derived when our scripts count the word 2x times in the version of the article with a total of 2y words.

    To conclude, in a data analysis of 32 million articles, we cannot manually check the correctness of frequency counts for every single article, and one hundred percent accuracy at capturing articles’ content is elusive due to the small number of difficult-to-detect boundary cases such as incorrect HTML markup syntax in online domains. Overall, however, we are confident that our frequency metrics are representative of word prevalence in print news media content (see Figure 1 in the main manuscript for an illustration of the accuracy of the frequency counts).
