21 datasets found
  1. statistics

    • huggingface.co
    Updated Nov 20, 2024
    Cite
    Common Crawl Foundation (2024). statistics [Dataset]. https://huggingface.co/datasets/commoncrawl/statistics
    Explore at:
    Dataset updated
    Nov 20, 2024
    Dataset provided by
    Common Crawl (http://commoncrawl.org/)
    Authors
    Common Crawl Foundation
    Description

    Common Crawl Statistics

    Number of pages, distribution of top-level domains, crawl overlaps, and other basic metrics about the Common Crawl Monthly Crawl Archives. For more detailed information and graphs, please visit our official statistics page. Here you can find the following statistics files:

      Charsets
    

    The character set or encoding is identified for HTML pages only, using Tika's AutoDetectReader. The table shows the percentage of HTML pages encoded with each character set… See the full description on the dataset page: https://huggingface.co/datasets/commoncrawl/statistics.

  2. Common Crawl Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Oct 7, 2014
    + more versions
    Cite
    (2014). Common Crawl Dataset [Dataset]. https://paperswithcode.com/dataset/common-crawl
    Explore at:
    Dataset updated
    Oct 7, 2014
    Description

    The Common Crawl corpus contains petabytes of data collected over 12 years of web crawling. The corpus contains raw web page data, metadata extracts and text extracts. Common Crawl data is stored on Amazon Web Services’ Public Data Sets and on multiple academic cloud platforms across the world.

  3. Random sample of Common Crawl domains from 2021

    • kaggle.com
    Updated Aug 19, 2021
    Cite
    HiHarshSinghal (2021). Random sample of Common Crawl domains from 2021 [Dataset]. https://www.kaggle.com/datasets/harshsinghal/random-sample-of-common-crawl-domains-from-2021/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 19, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    HiHarshSinghal
    Description

    Context

    The Common Crawl project has fascinated me ever since I learned about it. It provides a large number of data formats and presents challenges across skill and interest areas. I am particularly interested in URL analysis for applications such as typosquatting, malicious URLs, and just about anything interesting that can be done with domain names.

    Content

    I have sampled 1% of the domains from the Common Crawl Index dataset that is available on AWS in Parquet format. You can read more about how I extracted this dataset @ https://harshsinghal.dev/create-a-url-dataset-for-nlp/
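
    A minimal sketch of this kind of sampling, assuming DuckDB with its httpfs extension and anonymous read access to Common Crawl's columnar URL index on S3; the crawl partition name and the 1% threshold are illustrative, not the exact query used to build this dataset:

    import duckdb  # pip install duckdb

    con = duckdb.connect()
    con.execute("INSTALL httpfs;")
    con.execute("LOAD httpfs;")
    con.execute("SET s3_region='us-east-1';")

    # Sample roughly 1% of the rows from one crawl's columnar index and keep
    # the distinct host names. The path follows the documented cc-index/table
    # layout; adjust the crawl label to the crawl you need.
    query = """
        SELECT DISTINCT url_host_name
        FROM read_parquet('s3://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2021-31/subset=warc/*.parquet')
        WHERE random() < 0.01
    """
    hosts = con.execute(query).fetchdf()
    print(len(hosts))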

    Acknowledgements

    Thanks a ton to the folks at https://commoncrawl.org/ for making this immensely valuable resource available to the world for free. Please find their Terms of Use here.

    Inspiration

    My interests are in working with string similarity functions and I continue to find scalable ways of doing this. I wrote about using a Postgres extension to compute string distances and used Common Crawl URL domains as the input dataset (you can read more @ https://harshsinghal.dev/postgres-text-similarity-with-commoncrawl-domains/).

    I am also interested in identifying fraudulent domains and understanding malicious URL patterns.

  4. common-crawl-sample

    • huggingface.co
    Cite
    Alan Tseng, common-crawl-sample [Dataset]. https://huggingface.co/datasets/agentlans/common-crawl-sample
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Authors
    Alan Tseng
    Description

    Common Crawl sample

    A small unofficial random subset of the famous Common Crawl dataset.

    Sixty randomly selected segment WET files were downloaded from Common Crawl on 2024-05-12. Only lines between 500 and 5,000 characters long (inclusive) were kept, and only unique texts were retained. No other filtering was applied.
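
    A minimal sketch of that filtering step, assuming the warcio package and a locally downloaded WET file (the file name is hypothetical):

    from warcio.archiveiterator import ArchiveIterator  # pip install warcio

    wet_path = "CC-MAIN-example.warc.wet.gz"  # hypothetical local WET segment

    seen = set()
    kept = []
    with open(wet_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "conversion":  # WET text records
                continue
            text = record.content_stream().read().decode("utf-8", errors="ignore")
            for line in text.splitlines():
                if 500 <= len(line) <= 5000 and line not in seen:
                    seen.add(line)
                    kept.append(line)
    print(len(kept), "unique lines kept")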

      Languages
    

    Each text was assigned to one of the language codes using the GCLD3 Python package. The Chinese texts were classified as either simplified, traditional, or Cantonese using the… See the full description on the dataset page: https://huggingface.co/datasets/agentlans/common-crawl-sample.
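
    A minimal sketch of the language-assignment step described above, assuming the gcld3 Python package; the sample sentence and byte limits are illustrative:

    import gcld3  # pip install gcld3

    detector = gcld3.NNetLanguageIdentifier(min_num_bytes=0, max_num_bytes=1000)
    result = detector.FindLanguage(text="Dies ist ein kurzer deutscher Beispielsatz.")
    print(result.language, round(result.probability, 3), result.is_reliable)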

  5. C4 Dataset

    • paperswithcode.com
    Updated Dec 13, 2023
    + more versions
    Cite
    Colin Raffel; Noam Shazeer; Adam Roberts; Katherine Lee; Sharan Narang; Michael Matena; Yanqi Zhou; Wei Li; Peter J. Liu (2023). C4 Dataset [Dataset]. https://paperswithcode.com/dataset/c4
    Explore at:
    Dataset updated
    Dec 13, 2023
    Authors
    Colin Raffel; Noam Shazeer; Adam Roberts; Katherine Lee; Sharan Narang; Michael Matena; Yanqi Zhou; Wei Li; Peter J. Liu
    Description

    C4 is a colossal, cleaned version of Common Crawl's web crawl corpus. It is based on the Common Crawl dataset (https://commoncrawl.org) and was used to train the T5 text-to-text Transformer models.

    The dataset can be downloaded in a pre-processed form from AllenNLP.
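
    A minimal sketch for sampling the pre-processed corpus, assuming the Hugging Face datasets library and the allenai/c4 mirror (repository name and configuration are assumptions based on the public release):

    from datasets import load_dataset  # pip install datasets

    # Stream the English split so nothing needs to be downloaded up front.
    c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)
    for i, example in enumerate(c4):
        print(example["url"], example["text"][:80])
        if i == 2:
            break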

  6. Ancillary Monitor Corpus: Common Crawl - german web (YEAR 2024 – VERSION 1)

    • lindat.cz
    Updated Feb 1, 2025
    Cite
    Jan Oliver Rüdiger (2025). Ancillary Monitor Corpus: Common Crawl - german web (YEAR 2024 – VERSION 1) [Dataset]. https://lindat.cz/repository/xmlui/handle/11372/LRT-5811?show=full
    Explore at:
    Dataset updated
    Feb 1, 2025
    Authors
    Jan Oliver Rüdiger
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    *** german version see below ***

    The ‘Ancillary Monitor Corpus: Common Crawl - german web’ was designed with the aim of enabling a broad-based linguistic analysis of the German-language (visible) internet over time - with the aim of achieving comparability with the DeReKo (‘German Reference Corpus’ of the Leibniz Institute for the German Language - DeReKo volume 57 billion tokens - status: DeReKo Release 2024-I). The corpus is separated by year (here year 2024) and versioned (here version 1). Version 1 comprises (all years 2013-2024) 97.45 billion tokens.

    The corpus is based on the data dumps from CommonCrawl (https://commoncrawl.org/). CommonCrawl is a non-profit organisation that provides copies of the visible Internet free of charge for research purposes.

    The CommonCrawl WET raw data was first filtered by TLD (top-level domain). Only pages ending in the following TLDs were taken into account: ‘.at; .bayern; .berlin; .ch; .cologne; .de; .gmbh; .hamburg; .koeln; .nrw; .ruhr; .saarland; .swiss; .tirol; .wien; .zuerich’. These are the exclusive German-language TLDs according to ICANN (https://data.iana.org/TLD/tlds-alpha-by-domain.txt) as of 1 June 2024 - TLDs with a purely corporate reference (e.g. ‘.edeka; .bmw; .ford’) were excluded. The language of the individual documents (URLs) was then estimated with the help of NTextCat (https://github.com/ivanakcheurov/ntextcat) (via the CORE14 profile of NTextCat) - only those documents/URLs for which German was the most likely language were processed further (e.g. to exclude foreign-language material such as individual subpages). The third step involved filtering for manual selectors and filtering for 1:1 duplicates (within one year).
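
    A minimal sketch of the TLD allowlist check described above; the allowlist is taken from the description, while the helper itself is illustrative rather than the corpus pipeline:

    from urllib.parse import urlsplit

    GERMAN_TLDS = {
        "at", "bayern", "berlin", "ch", "cologne", "de", "gmbh", "hamburg",
        "koeln", "nrw", "ruhr", "saarland", "swiss", "tirol", "wien", "zuerich",
    }

    def keep_url(url: str) -> bool:
        """Return True if the URL's top-level domain is on the allowlist."""
        host = urlsplit(url).hostname or ""
        tld = host.rsplit(".", 1)[-1].lower()
        return tld in GERMAN_TLDS

    print(keep_url("https://www.example.de/seite"))   # True
    print(keep_url("https://www.example.com/page"))   # False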

    The filtering and subsequent processing was carried out using CorpusExplorer (http://hdl.handle.net/11234/1-2634) and our own (supplementary) scripts, and the TreeTagger (http://hdl.handle.net/11372/LRT-323) was used for automatic annotation. The corpus was processed on the HELIX HPC cluster. The author would like to take this opportunity to thank the state of Baden-Württemberg and the German Research Foundation (DFG) for the possibility to use the bwHPC/HELIX HPC cluster - funding code HPC cluster: INST 35/1597-1 FUGG.

    Data content:
    • Tokens and record boundaries
    • Automatic lemma and POS annotation (using TreeTagger)
    • Metadata:
      • GUID - Unique identifier of the document
      • YEAR - Year of capture (please use this information for data slices)
      • Url - Full URL
      • Tld - Top-Level Domain
      • Domain - Domain without TLD (but with sub-domains if applicable)
      • DomainFull - Complete domain (incl. TLD)
      • Datum - (System information): Date of the CorpusExplorer (date of capture by CommonCrawl - not date of creation/modification of the document)
      • Hash - (System information): SHA1 hash of the CommonCrawl
      • Pfad - (System information): Path of the cluster (raw data) - supplied by the system

    Please note that the files are saved as *.cec6.gz. These are binary files of the CorpusExplorer (see above). These files ensure efficient archiving. You can use both CorpusExplorer and the ‘CEC6-Converter’ (available for Linux, MacOS and Windows - see: https://lindat.mff.cuni.cz/repository/xmlui/handle/11372/LRT-5705) to convert the data. The data can be exported in the following formats:

    • CATMA v6
    • CoNLL
    • CSV
    • CSV (only meta-data)
    • DTA TCF-XML
    • DWDS TEI-XML
    • HTML
    • IDS I5-XML
    • IDS KorAP XML
    • IMS Open Corpus Workbench
    • JSON
    • OPUS Corpus Collection XCES
    • Plaintext
    • SaltXML
    • SlashA XML
    • SketchEngine VERT
    • SPEEDy/CODEX (JSON)
    • TLV-XML
    • TreeTagger
    • TXM
    • WebLicht
    • XML

    Please note that exporting increases the storage space requirements considerably. The ‘CorpusExplorerConsole’ (https://github.com/notesjor/CorpusExplorer.Terminal.Console - available for Linux, MacOS and Windows) also offers a simple solution for editing and analysing. If you have any questions, please contact the author.

    Legal information: The data was downloaded on 01.11.2024. Its use, processing and distribution are subject to §60d UrhG (German copyright law), which authorises use for non-commercial purposes in research and teaching. LINDAT/CLARIN is responsible for long-term archiving in accordance with §69d para. 5 and ensures that only authorised persons can access the data. The data has been checked to the best of our knowledge and belief (on a random basis) - should you nevertheless find legal violations (e.g. right to be forgotten, personal rights, etc.), please write an e-mail to the author (amc_report@jan-oliver-ruediger.de) with the following information: 1) why this content is undesirable (please outline only briefly) and 2) how the content can be identified - e.g. file name, URL or domain, etc. The author will endeavour to identify and remove the content and re-upload the data (modified) within two weeks (new version).

  7. Ancillary Monitor Corpus: Common Crawl - german web (YEAR 2023 – VERSION 1)

    • lindat.mff.cuni.cz
    Updated May 21, 2025
    Cite
    Jan Oliver Rüdiger (2025). Ancillary Monitor Corpus: Common Crawl - german web (YEAR 2023 – VERSION 1) [Dataset]. https://lindat.mff.cuni.cz/repository/xmlui/handle/11372/LRT-5810?show=full
    Explore at:
    Dataset updated
    May 21, 2025
    Authors
    Jan Oliver Rüdiger
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    *** german version see below ***

    The ‘Ancillary Monitor Corpus: Common Crawl - german web’ was designed with the aim of enabling a broad-based linguistic analysis of the German-language (visible) internet over time - with the aim of achieving comparability with the DeReKo (‘German Reference Corpus’ of the Leibniz Institute for the German Language - DeReKo volume 57 billion tokens - status: DeReKo Release 2024-I). The corpus is separated by year (here year 2023) and versioned (here version 1). Version 1 comprises (all years 2013-2024) 97.45 billion tokens.

    The corpus is based on the data dumps from CommonCrawl (https://commoncrawl.org/). CommonCrawl is a non-profit organisation that provides copies of the visible Internet free of charge for research purposes.

    The CommonCrawl WET raw data was first filtered by TLD (top-level domain). Only pages ending in the following TLDs were taken into account: ‘.at; .bayern; .berlin; .ch; .cologne; .de; .gmbh; .hamburg; .koeln; .nrw; .ruhr; .saarland; .swiss; .tirol; .wien; .zuerich’. These are the exclusive German-language TLDs according to ICANN (https://data.iana.org/TLD/tlds-alpha-by-domain.txt) as of 1 June 2024 - TLDs with a purely corporate reference (e.g. ‘.edeka; .bmw; .ford’) were excluded. The language of the individual documents (URLs) was then estimated with the help of NTextCat (https://github.com/ivanakcheurov/ntextcat) (via the CORE14 profile of NTextCat) - only those documents/URLs for which German was the most likely language were processed further (e.g. to exclude foreign-language material such as individual subpages). The third step involved filtering for manual selectors and filtering for 1:1 duplicates (within one year).

    The filtering and subsequent processing was carried out using CorpusExplorer (http://hdl.handle.net/11234/1-2634) and our own (supplementary) scripts, and the TreeTagger (http://hdl.handle.net/11372/LRT-323) was used for automatic annotation. The corpus was processed on the HELIX HPC cluster. The author would like to take this opportunity to thank the state of Baden-Württemberg and the German Research Foundation (DFG) for the possibility to use the bwHPC/HELIX HPC cluster - funding code HPC cluster: INST 35/1597-1 FUGG.

    Data content:
    • Tokens and record boundaries
    • Automatic lemma and POS annotation (using TreeTagger)
    • Metadata:
      • GUID - Unique identifier of the document
      • YEAR - Year of capture (please use this information for data slices)
      • Url - Full URL
      • Tld - Top-Level Domain
      • Domain - Domain without TLD (but with sub-domains if applicable)
      • DomainFull - Complete domain (incl. TLD)
      • Datum - (System information): Date of the CorpusExplorer (date of capture by CommonCrawl - not date of creation/modification of the document)
      • Hash - (System information): SHA1 hash of the CommonCrawl
      • Pfad - (System information): Path of the cluster (raw data) - supplied by the system

    Please note that the files are saved as *.cec6.gz. These are binary files of the CorpusExplorer (see above). These files ensure efficient archiving. You can use both CorpusExplorer and the ‘CEC6-Converter’ (available for Linux, MacOS and Windows - see: https://lindat.mff.cuni.cz/repository/xmlui/handle/11372/LRT-5705) to convert the data. The data can be exported in the following formats:

    • CATMA v6
    • CoNLL
    • CSV
    • CSV (only meta-data)
    • DTA TCF-XML
    • DWDS TEI-XML
    • HTML
    • IDS I5-XML
    • IDS KorAP XML
    • IMS Open Corpus Workbench
    • JSON
    • OPUS Corpus Collection XCES
    • Plaintext
    • SaltXML
    • SlashA XML
    • SketchEngine VERT
    • SPEEDy/CODEX (JSON)
    • TLV-XML
    • TreeTagger
    • TXM
    • WebLicht
    • XML

    Please note that exporting increases the storage space requirements considerably. The ‘CorpusExplorerConsole’ (https://github.com/notesjor/CorpusExplorer.Terminal.Console - available for Linux, MacOS and Windows) also offers a simple solution for editing and analysing. If you have any questions, please contact the author.

    Legal information: The data was downloaded on 01.11.2024. Its use, processing and distribution are subject to §60d UrhG (German copyright law), which authorises use for non-commercial purposes in research and teaching. LINDAT/CLARIN is responsible for long-term archiving in accordance with §69d para. 5 and ensures that only authorised persons can access the data. The data has been checked to the best of our knowledge and belief (on a random basis) - should you nevertheless find legal violations (e.g. right to be forgotten, personal rights, etc.), please write an e-mail to the author (amc_report@jan-oliver-ruediger.de) with the following information: 1) why this content is undesirable (please outline only briefly) and 2) how the content can be identified - e.g. file name, URL or domain, etc. The author will endeavour to identify and remove the content and re-upload the data (modified) within two weeks (new version).

  8. Ancillary Monitor Corpus: Common Crawl - german web (YEAR 2019 – VERSION 1)

    • lindat.cz
    Updated Nov 26, 2024
    + more versions
    Cite
    Jan Oliver Rüdiger (2024). Ancillary Monitor Corpus: Common Crawl - german web (YEAR 2019 – VERSION 1) [Dataset]. https://lindat.cz/repository/xmlui/handle/11372/LRT-5806
    Explore at:
    Dataset updated
    Nov 26, 2024
    Authors
    Jan Oliver Rüdiger
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    *** german version see below ***

    The ‘Ancillary Monitor Corpus: Common Crawl - german web’ was designed with the aim of enabling a broad-based linguistic analysis of the German-language (visible) internet over time - with the aim of achieving comparability with the DeReKo (‘German Reference Corpus’ of the Leibniz Institute for the German Language - DeReKo volume 57 billion tokens - status: DeReKo Release 2024-I). The corpus is separated by year (here year 2019) and versioned (here version 1). Version 1 comprises (all years 2013-2024) 97.45 billion tokens.

    The corpus is based on the data dumps from CommonCrawl (https://commoncrawl.org/). CommonCrawl is a non-profit organisation that provides copies of the visible Internet free of charge for research purposes.

    The CommonCrawl WET raw data was first filtered by TLD (top-level domain). Only pages ending in the following TLDs were taken into account: ‘.at; .bayern; .berlin; .ch; .cologne; .de; .gmbh; .hamburg; .koeln; .nrw; .ruhr; .saarland; .swiss; .tirol; .wien; .zuerich’. These are the exclusive German-language TLDs according to ICANN (https://data.iana.org/TLD/tlds-alpha-by-domain.txt) as of 1 June 2024 - TLDs with a purely corporate reference (e.g. ‘.edeka; .bmw; .ford’) were excluded. The language of the individual documents (URLs) was then estimated with the help of NTextCat (https://github.com/ivanakcheurov/ntextcat) (via the CORE14 profile of NTextCat) - only those documents/URLs for which German was the most likely language were processed further (e.g. to exclude foreign-language material such as individual subpages). The third step involved filtering for manual selectors and filtering for 1:1 duplicates (within one year).

    The filtering and subsequent processing was carried out using CorpusExplorer (http://hdl.handle.net/11234/1-2634) and our own (supplementary) scripts, and the TreeTagger (http://hdl.handle.net/11372/LRT-323) was used for automatic annotation. The corpus was processed on the HELIX HPC cluster. The author would like to take this opportunity to thank the state of Baden-Württemberg and the German Research Foundation (DFG) for the possibility to use the bwHPC/HELIX HPC cluster - funding code HPC cluster: INST 35/1597-1 FUGG.

    Data content:
    • Tokens and record boundaries
    • Automatic lemma and POS annotation (using TreeTagger)
    • Metadata:
      • GUID - Unique identifier of the document
      • YEAR - Year of capture (please use this information for data slices)
      • Url - Full URL
      • Tld - Top-Level Domain
      • Domain - Domain without TLD (but with sub-domains if applicable)
      • DomainFull - Complete domain (incl. TLD)
      • Datum - (System information): Date of the CorpusExplorer (date of capture by CommonCrawl - not date of creation/modification of the document)
      • Hash - (System information): SHA1 hash of the CommonCrawl
      • Pfad - (System information): Path of the cluster (raw data) - supplied by the system

    Please note that the files are saved as *.cec6.gz. These are binary files of the CorpusExplorer (see above). These files ensure efficient archiving. You can use both CorpusExplorer and the ‘CEC6-Converter’ (available for Linux, MacOS and Windows - see: https://lindat.mff.cuni.cz/repository/xmlui/handle/11372/LRT-5705) to convert the data. The data can be exported in the following formats:

    • CATMA v6
    • CoNLL
    • CSV
    • CSV (only meta-data)
    • DTA TCF-XML
    • DWDS TEI-XML
    • HTML
    • IDS I5-XML
    • IDS KorAP XML
    • IMS Open Corpus Workbench
    • JSON
    • OPUS Corpus Collection XCES
    • Plaintext
    • SaltXML
    • SlashA XML
    • SketchEngine VERT
    • SPEEDy/CODEX (JSON)
    • TLV-XML
    • TreeTagger
    • TXM
    • WebLicht
    • XML

    Please note that exporting increases the storage space requirements considerably. The ‘CorpusExplorerConsole’ (https://github.com/notesjor/CorpusExplorer.Terminal.Console - available for Linux, MacOS and Windows) also offers a simple solution for editing and analysing. If you have any questions, please contact the author.

    Legal information: The data was downloaded on 01.11.2024. Its use, processing and distribution are subject to §60d UrhG (German copyright law), which authorises use for non-commercial purposes in research and teaching. LINDAT/CLARIN is responsible for long-term archiving in accordance with §69d para. 5 and ensures that only authorised persons can access the data. The data has been checked to the best of our knowledge and belief (on a random basis) - should you nevertheless find legal violations (e.g. right to be forgotten, personal rights, etc.), please write an e-mail to the author (amc_report@jan-oliver-ruediger.de) with the following information: 1) why this content is undesirable (please outline only briefly) and 2) how the content can be identified - e.g. file name, URL or domain, etc. The author will endeavour to identify and remove the content and re-upload the data (modified) within two weeks (new version).

  9. Ancillary Monitor Corpus: Common Crawl - german web (YEAR 2014 – VERSION 1)

    • lindat.cz
    Updated Nov 16, 2024
    Cite
    Jan Oliver Rüdiger (2024). Ancillary Monitor Corpus: Common Crawl - german web (YEAR 2014 – VERSION 1) [Dataset]. https://lindat.cz/repository/xmlui/handle/11372/LRT-5788?show=full
    Explore at:
    Dataset updated
    Nov 16, 2024
    Authors
    Jan Oliver Rüdiger
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    *** german version see below ***

    The ‘Ancillary Monitor Corpus: Common Crawl - german web’ was designed with the aim of enabling a broad-based linguistic analysis of the German-language (visible) internet over time - with the aim of achieving comparability with the DeReKo (‘German Reference Corpus’ of the Leibniz Institute for the German Language - DeReKo volume 57 billion tokens - status: DeReKo Release 2024-I). The corpus is separated by year (here year 2014) and versioned (here version 1). Version 1 comprises (all years 2013-2024) 97.45 billion tokens.

    The corpus is based on the data dumps from CommonCrawl (https://commoncrawl.org/). CommonCrawl is a non-profit organisation that provides copies of the visible Internet free of charge for research purposes.

    The CommonCrawl WET raw data was first filtered by TLD (top-level domain). Only pages ending in the following TLDs were taken into account: ‘.at; .bayern; .berlin; .ch; .cologne; .de; .gmbh; .hamburg; .koeln; .nrw; .ruhr; .saarland; .swiss; .tirol; .wien; .zuerich’. These are the exclusive German-language TLDs according to ICANN (https://data.iana.org/TLD/tlds-alpha-by-domain.txt) as of 1 June 2024 - TLDs with a purely corporate reference (e.g. ‘.edeka; .bmw; .ford’) were excluded. The language of the individual documents (URLs) was then estimated with the help of NTextCat (https://github.com/ivanakcheurov/ntextcat) (via the CORE14 profile of NTextCat) - only those documents/URLs for which German was the most likely language were processed further (e.g. to exclude foreign-language material such as individual subpages). The third step involved filtering for manual selectors and filtering for 1:1 duplicates (within one year).

    The filtering and subsequent processing was carried out using CorpusExplorer (http://hdl.handle.net/11234/1-2634) and our own (supplementary) scripts, and the TreeTagger (http://hdl.handle.net/11372/LRT-323) was used for automatic annotation. The corpus was processed on the HELIX HPC cluster. The author would like to take this opportunity to thank the state of Baden-Württemberg and the German Research Foundation (DFG) for the possibility to use the bwHPC/HELIX HPC cluster - funding code HPC cluster: INST 35/1597-1 FUGG.

    Data content:
    • Tokens and record boundaries
    • Automatic lemma and POS annotation (using TreeTagger)
    • Metadata:
      • GUID - Unique identifier of the document
      • YEAR - Year of capture (please use this information for data slices)
      • Url - Full URL
      • Tld - Top-Level Domain
      • Domain - Domain without TLD (but with sub-domains if applicable)
      • DomainFull - Complete domain (incl. TLD)
      • Datum - (System information): Date of the CorpusExplorer (date of capture by CommonCrawl - not date of creation/modification of the document)
      • Hash - (System information): SHA1 hash of the CommonCrawl
      • Pfad - (System information): Path of the cluster (raw data) - supplied by the system

    Please note that the files are saved as *.cec6.gz. These are binary files of the CorpusExplorer (see above). These files ensure efficient archiving. You can use both CorpusExplorer and the ‘CEC6-Converter’ (available for Linux, MacOS and Windows - see: https://lindat.mff.cuni.cz/repository/xmlui/handle/11372/LRT-5705) to convert the data. The data can be exported in the following formats:

    • CATMA v6
    • CoNLL
    • CSV
    • CSV (only meta-data)
    • DTA TCF-XML
    • DWDS TEI-XML
    • HTML
    • IDS I5-XML
    • IDS KorAP XML
    • IMS Open Corpus Workbench
    • JSON
    • OPUS Corpus Collection XCES
    • Plaintext
    • SaltXML
    • SlashA XML
    • SketchEngine VERT
    • SPEEDy/CODEX (JSON)
    • TLV-XML
    • TreeTagger
    • TXM
    • WebLicht
    • XML

    Please note that exporting increases the storage space requirements considerably. The ‘CorpusExplorerConsole’ (https://github.com/notesjor/CorpusExplorer.Terminal.Console - available for Linux, MacOS and Windows) also offers a simple solution for editing and analysing. If you have any questions, please contact the author.

    Legal information: The data was downloaded on 01.11.2024. Its use, processing and distribution are subject to §60d UrhG (German copyright law), which authorises use for non-commercial purposes in research and teaching. LINDAT/CLARIN is responsible for long-term archiving in accordance with §69d para. 5 and ensures that only authorised persons can access the data. The data has been checked to the best of our knowledge and belief (on a random basis) - should you nevertheless find legal violations (e.g. right to be forgotten, personal rights, etc.), please write an e-mail to the author (amc_report@jan-oliver-ruediger.de) with the following information: 1) why this content is undesirable (please outline only briefly) and 2) how the content can be identified - e.g. file name, URL or domain, etc. The author will endeavour to identify and remove the content and re-upload the data (modified) within two weeks (new version).

  10. RealNews Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Jan 30, 2023
    Cite
    Rowan Zellers; Ari Holtzman; Hannah Rashkin; Yonatan Bisk; Ali Farhadi; Franziska Roesner; Yejin Choi (2023). RealNews Dataset [Dataset]. https://paperswithcode.com/dataset/realnews
    Explore at:
    Dataset updated
    Jan 30, 2023
    Authors
    Rowan Zellers; Ari Holtzman; Hannah Rashkin; Yonatan Bisk; Ali Farhadi; Franziska Roesner; Yejin Choi
    Description

    RealNews is a large corpus of news articles from Common Crawl. Data is scraped from Common Crawl, limited to the 5000 news domains indexed by Google News. The authors used the Newspaper Python library to extract the body and metadata from each article. News from Common Crawl dumps from December 2016 through March 2019 were used as training data; articles published in April 2019 from the April 2019 dump were used for evaluation. After deduplication, RealNews is 120 gigabytes without compression.

  11. cc_news

    • huggingface.co
    Updated Jul 3, 2018
    Cite
    Vladimir Blagojevic (2018). cc_news [Dataset]. https://huggingface.co/datasets/vblagoje/cc_news
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 3, 2018
    Authors
    Vladimir Blagojevic
    License

    Unknown (https://choosealicense.com/licenses/unknown/)

    Description

    Dataset Card for CC-News

      Dataset Summary
    

    The CC-News dataset contains news articles from news sites all over the world. The data is available on AWS S3 in the Common Crawl bucket at /crawl-data/CC-NEWS/. This version of the dataset has been prepared using news-please, an integrated web crawler and information extractor for news. It contains 708,241 English-language news articles published between January 2017 and December 2019. It represents a small portion of the English… See the full description on the dataset page: https://huggingface.co/datasets/vblagoje/cc_news.

  12. web-cc12-hostgraph

    • networkrepository.com
    csv
    Updated Oct 4, 2018
    Cite
    Network Data Repository (2018). web-cc12-hostgraph [Dataset]. https://networkrepository.com/web-cc12-hostgraph.php
    Explore at:
    Available download formats: csv
    Dataset updated
    Oct 4, 2018
    Dataset authored and provided by
    Network Data Repository
    License

    https://networkrepository.com/policy.php

    Description

    Host-level Web Graph - This graph aggregates the page graph by subdomain/host: each node represents a specific subdomain/host, and an edge exists between a pair of hosts if at least one link was found between pages belonging to those hosts. The hyperlink graph was extracted from the Web corpus released by the Common Crawl Foundation in August 2012. The Web corpus was gathered using a web crawler employing a breadth-first-search selection strategy and embedding link discovery while crawling. The crawl was seeded with a large number of URLs from former crawls performed by the Common Crawl Foundation. Also see web-cc12-firstlevel-subdomain and web-cc12-PayLevelDomain.
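
    A minimal sketch of this page-to-host aggregation, using a tiny hypothetical page-level edge list:

    from urllib.parse import urlsplit

    # Hypothetical page-level hyperlink edges: (source page URL, target page URL).
    page_edges = [
        ("http://blog.example.org/post1", "http://news.example.com/a"),
        ("http://blog.example.org/post2", "http://news.example.com/b"),
    ]

    host_edges = set()
    for src, dst in page_edges:
        src_host = urlsplit(src).hostname
        dst_host = urlsplit(dst).hostname
        if src_host and dst_host and src_host != dst_host:
            host_edges.add((src_host, dst_host))

    print(host_edges)  # {('blog.example.org', 'news.example.com')}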

  13. aquamuse Dataset

    • paperswithcode.com
    • library.toponeai.link
    Updated Aug 7, 2024
    Cite
    Sayali Kulkarni; Sheide Chammas; Wan Zhu; Fei Sha; Eugene Ie (2024). aquamuse Dataset [Dataset]. https://paperswithcode.com/dataset/aquamuse
    Explore at:
    Dataset updated
    Aug 7, 2024
    Authors
    Sayali Kulkarni; Sheide Chammas; Wan Zhu; Fei Sha; Eugene Ie
    Description

    5,519 query-based summaries, each associated with an average of 6 input documents selected from an index of 355M documents from Common Crawl.

  14. fineweb

    • huggingface.co
    Cite
    FineData, fineweb [Dataset]. http://doi.org/10.57967/hf/2493
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset authored and provided by
    FineData
    License

    ODC-By (https://choosealicense.com/licenses/odc-by/)

    Description

    🍷 FineWeb

    15 trillion tokens of the finest data the 🌐 web has to offer

      What is it?
    

    The 🍷 FineWeb dataset consists of more than 15T tokens of cleaned and deduplicated English web data from CommonCrawl. The data processing pipeline is optimized for LLM performance and ran on the 🏭 datatrove library, our large-scale data processing library. 🍷 FineWeb was originally meant to be a fully open replication of 🦅 RefinedWeb, with a release of the full dataset under… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFW/fineweb.

  15. Replication Data for: Intercity relationships between 293 Chinese cities...

    • search.dataone.org
    Updated Dec 16, 2023
    Cite
    Wang Tongjing; Yin Zhao; Ziyu Bao; Evert Meijers (2023). Replication Data for: Intercity relationships between 293 Chinese cities quantified based on toponym co-occurrence [Dataset]. http://doi.org/10.7910/DVN/O0M59W
    Explore at:
    Dataset updated
    Dec 16, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Wang Tongjing; Yin Zhao; Ziyu Bao; Evert Meijers
    Time period covered
    Apr 18, 2019 - Apr 26, 2019
    Description

    This dataset presents relationships between 293 Chinese cities, derived using a toponym co-occurrence method. By employing this toponym co-occurrence analysis method, the strength of an intercity relationship is determined by the frequency at which both city names appear on the same webpage. The data was sourced from the Common Crawl web archive's 2019 April Corpus, which contains approximately 2.5 billion web pages. The primary aim of this dataset is to provide a fresh perspective on intercity relationships, thereby facilitating studies on city network analysis. The dataset not only encourages further research into comparing this innovative city relationship with other established networks but is also a showcase that presents a straightforward methodology that can be applied to other archives within Common Crawl. As such, it paves the way for longitudinal studies that probe the evolution of city networks.
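
    A minimal sketch of the toponym co-occurrence counting, using a tiny hypothetical set of pages and city names:

    from collections import Counter
    from itertools import combinations

    cities = ["Beijing", "Shanghai", "Guangzhou"]  # illustrative subset of the 293 cities

    pages = [  # hypothetical page texts; the study used roughly 2.5 billion pages
        "Flights between Beijing and Shanghai resumed this week ...",
        "The Guangzhou trade fair drew many visitors from Shanghai ...",
    ]

    cooccurrence = Counter()
    for text in pages:
        present = sorted(c for c in cities if c in text)
        for a, b in combinations(present, 2):
            cooccurrence[(a, b)] += 1

    print(cooccurrence)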

  16. Data for manuscript "Reciprocal Radicalization: The Rise of Culture War...

    • data.niaid.nih.gov
    • explore.openaire.eu
    • +1more
    Updated Dec 7, 2021
    Cite
    David Rozado (2021). Data for manuscript "Reciprocal Radicalization: The Rise of Culture War Terminology in British and American News Coverage" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5709759
    Explore at:
    Dataset updated
    Dec 7, 2021
    Dataset authored and provided by
    David Rozado
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    United Kingdom, United States
    Description

    This data set contains frequency counts of target words in 16 million news and opinion articles from 10 popular news media outlets in the United Kingdom: The Guardian, The Times, The Independent, The Daily Mirror, BBC, Financial Times, Metro, The Telegraph, and The Daily Mail, plus a few additional American-based outlets used for comparison reference. The target words are listed in the associated manuscript and are mostly words that denote some type of prejudice, social justice related terms, or counterreactions to them. A few additional words are also available since they are used in the manuscript for illustration purposes.

    The textual content of news and opinion articles from the outlets listed in Figure 3 of the main manuscript is available in the outlet's online domains and/or public cache repositories such as Google cache (https://webcache.googleusercontent.com), The Internet Wayback Machine (https://archive.org/web/web.php), and Common Crawl (https://commoncrawl.org). We derived relative frequency counts from these sources. Textual content included in our analysis is circumscribed to articles headlines and main body of text of the articles and does not include other article elements such as figure captions.

    Targeted textual content was located in the raw HTML data using outlet-specific XPath expressions. Tokens were lowercased prior to estimating frequency counts. To prevent outlets with sparse text content for a year from distorting aggregate frequency counts, we only include outlet frequency counts from years for which there is at least 1 million words of article content from an outlet.

    Yearly frequency usage of a target word in an outlet in any given year was estimated by dividing the total number of occurrences of the target word in all articles of a given year by the number of all words in all articles of that year. This method of estimating frequency accounts for variable volume of total article output over time.
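
    A minimal sketch of that estimate; the numbers are illustrative only:

    def yearly_relative_frequency(target_count: int, total_words: int) -> float:
        """Occurrences of a target word divided by all words published that year."""
        return target_count / total_words if total_words else 0.0

    print(yearly_relative_frequency(target_count=1_250, total_words=48_000_000))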

    The list of compressed files in this data set is listed next:

    -analysisScripts.rar contains the analysis scripts used in the main manuscript

    -targetWordsInArticlesCounts.rar contains counts of target words in outlets articles as well as total counts of words in articles

    -targetWordsInArticlesCountsGuardianExampleWords contains counts of target words in outlets articles as well as total counts of words in articles for illustrative Figure 1 in main manuscript

    Usage Notes

    In a small percentage of articles, outlet-specific XPath expressions can fail to properly capture the content of the article due to the heterogeneity of HTML elements and CSS styling combinations with which articles' text content is arranged in outlets' online domains. As a result, the total and target word count metrics for a small subset of articles are not precise. In a random sample of articles and outlets, manual estimation of target word counts overlapped with the automatically derived counts for over 90% of the articles.

    Most of the incorrect frequency counts were minor deviations from the actual counts, such as counting the word "Facebook" in an article footnote encouraging readers to follow the journalist's Facebook profile, which the XPath expression mistakenly included as part of the article's main text. To conclude, in a data analysis of 16 million articles we cannot manually check the correctness of frequency counts for every single article, and one hundred percent accuracy at capturing articles' content is elusive due to the small number of difficult-to-detect boundary cases such as incorrect HTML markup syntax in online domains. Overall, however, we are confident that our frequency metrics are representative of word prevalence in print news media content (see Figure 1 of the main manuscript for supporting evidence).

  17. GOLOS Dataset

    • paperswithcode.com
    Updated Jun 17, 2021
    Cite
    Nikolay Karpov; Alexander Denisenko; Fedor Minkin (2021). GOLOS Dataset [Dataset]. https://paperswithcode.com/dataset/golos
    Explore at:
    Dataset updated
    Jun 17, 2021
    Authors
    Nikolay Karpov; Alexander Denisenko; Fedor Minkin
    Description

    Golos is a Russian speech dataset suitable for speech research. The dataset mainly consists of recorded audio files manually annotated on the crowd-sourcing platform. The total duration of the audio is about 1240 hours.

    Dataset structure

    | Domain   | Train files | Train hours | Test files | Test hours |
    |----------|-------------|-------------|------------|------------|
    | Crowd    | 979 796     | 1 095       | 9 994      | 11.2       |
    | Farfield | 124 003     | 132.4       | 1 916      | 1.4        |
    | Total    | 1 103 799   | 1 227.4     | 11 910     | 12.6       |

    Audio files in opus format

    | Archive        | Size    | Link                |
    |----------------|---------|---------------------|
    | golos_opus.tar | 20.5 GB | https://sc.link/JpD |

    Audio files in wav format

    | Archives           | Size    | Links               |
    |--------------------|---------|---------------------|
    | train_farfield.tar | 15.4 GB | https://sc.link/1Z3 |
    | train_crowd0.tar   | 11 GB   | https://sc.link/Lrg |
    | train_crowd1.tar   | 14 GB   | https://sc.link/MvQ |
    | train_crowd2.tar   | 13.2 GB | https://sc.link/NwL |
    | train_crowd3.tar   | 11.6 GB | https://sc.link/Oxg |
    | train_crowd4.tar   | 15.8 GB | https://sc.link/Pyz |
    | train_crowd5.tar   | 13.1 GB | https://sc.link/Qz7 |
    | train_crowd6.tar   | 15.7 GB | https://sc.link/RAL |
    | train_crowd7.tar   | 12.7 GB | https://sc.link/VG5 |
    | train_crowd8.tar   | 12.2 GB | https://sc.link/WJW |
    | train_crowd9.tar   | 8.08 GB | https://sc.link/XKk |
    | test.tar           | 1.3 GB  | https://sc.link/Kqr |

    Evaluation

    Word Error Rate (%) for different test sets:

    | Decoder                                    | Crowd test | Farfield test | MCV1 dev | MCV1 test |
    |--------------------------------------------|------------|---------------|----------|-----------|
    | Greedy decoder                             | 4.389 %    | 14.949 %      | 9.314 %  | 11.278 %  |
    | Beam Search with Common Crawl LM           | 4.709 %    | 12.503 %      | 6.341 %  | 7.976 %   |
    | Beam Search with Golos train set LM        | 3.548 %    | 12.384 %      | -        | -         |
    | Beam Search with Common Crawl and Golos LM | 3.318 %    | 11.488 %      | 6.4 %    | 8.06 %    |

  18. chinese-c4

    • huggingface.co
    Cite
    Jianbin Chang, chinese-c4 [Dataset]. https://huggingface.co/datasets/shjwudp/chinese-c4
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Authors
    Jianbin Chang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    China
    Description

    Introduction

    Chinese-C4 is a clean Chinese internet dataset based on Common Crawl. The dataset is 46.29 GB and has undergone multiple cleaning strategies, including Chinese-language filtering, heuristic cleaning based on punctuation, line-based hashing for deduplication, and repetition removal. The dataset is open source and free for commercial use; you are welcome to use the data and the cleaning strategies provided, and to contribute your own cleaning strategies. You can find the cleaning… See the full description on the dataset page: https://huggingface.co/datasets/shjwudp/chinese-c4.
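
    A minimal sketch of line-based hash deduplication, one of the cleaning strategies mentioned above; the hashing scheme is an assumption, not the project's exact implementation:

    import hashlib

    def dedup_lines(lines):
        """Yield each line once, keyed by an MD5 hash of its stripped text."""
        seen = set()
        for line in lines:
            digest = hashlib.md5(line.strip().encode("utf-8")).hexdigest()
            if digest not in seen:
                seen.add(digest)
                yield line

    sample = ["你好，世界。", "你好，世界。", "这是另一行。"]
    print(list(dedup_lines(sample)))  # duplicate removed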

  19. MuRIL Large tf

    • kaggle.com
    Updated Oct 16, 2021
    Cite
    Nicholas Broad (2021). MuRIL Large tf [Dataset]. https://www.kaggle.com/nbroad/muril-large-tf/metadata
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 16, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Nicholas Broad
    Description

    This was converted from the PyTorch state_dict, and I'm not sure it will work because I got the warning below. I don't think the cls parameters matter, but I'm wondering about the position_ids.

    Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.predictions.bias', 'bert.embeddings.position_ids', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.decoder.bias', 'cls.predictions.transform.LayerNorm.bias']
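
    A minimal sketch of the conversion path that produces this warning, assuming the transformers library with both PyTorch and TensorFlow installed; the Hub model id is an assumption (the public MuRIL large checkpoint):

    from transformers import TFBertModel  # pip install transformers torch tensorflow

    # Load the PyTorch checkpoint and convert it to TF 2.0 weights. The warning
    # about unused cls.* heads is expected, since TFBertModel keeps only the
    # encoder; position_ids is a fixed buffer recreated at initialization, so it
    # does not need to be loaded from the checkpoint.
    model = TFBertModel.from_pretrained("google/muril-large-cased", from_pt=True)
    model.save_pretrained("muril-large-tf")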

    MuRIL: Multilingual Representations for Indian Languages

    MuRIL is a BERT model pre-trained on 17 Indian languages and their transliterated counterparts. We have released the pre-trained model (with the MLM layer intact, enabling masked word predictions) in this repository. We have also released the encoder on TFHub with an additional pre-processing module, that processes raw text into the expected input format for the encoder. You can find more details on MuRIL in this paper.

    Apache 2.0 License

    Link to model on Hugging Face Hub

    Overview

    This model uses a BERT base architecture [1] pretrained from scratch using the Wikipedia [2], Common Crawl [3], PMINDIA [4] and Dakshina [5] corpora for 17 [6] Indian languages.

    We use a training paradigm similar to multilingual BERT, with a few modifications as listed:

    We include translation and transliteration segment pairs in training as well. We keep an exponent value of 0.3 and not 0.7 for upsampling, shown to enhance low-resource performance. [7] See the Training section for more details.

    Training

    The MuRIL model is pre-trained on monolingual segments as well as parallel segments as detailed below

    Monolingual Data

    We make use of publicly available corpora from Wikipedia and Common Crawl for 17 Indian languages.

    Parallel Data

    We have two types of parallel data:

    Translated Data
    We obtain translations of the above monolingual corpora using the Google NMT pipeline. We feed translated segment pairs as input. We also make use of the publicly available PMINDIA corpus.

    Transliterated Data
    We obtain transliterations of Wikipedia using the IndicTrans [8] library. We feed transliterated segment pairs as input. We also make use of the publicly available Dakshina dataset.

    We keep an exponent value of 0.3 to calculate duplication multiplier values for upsampling of lower resourced languages and set dupe factors accordingly. Note, we limit transliterated pairs to Wikipedia only.

    The model was trained using a self-supervised masked language modeling task. We do whole word masking with a maximum of 80 predictions. The model was trained for 1000K steps, with a batch size of 4096, and a max sequence length of 512.

    Trainable parameters

    All parameters in the module are trainable, and fine-tuning all parameters is the recommended practice.

    Uses & Limitations

    This model is intended to be used for a variety of downstream NLP tasks for Indian languages. It is also trained on transliterated data, a phenomenon commonly observed in the Indian context. This model is not expected to perform well on languages other than the ones used in pretraining, i.e. the 17 Indian languages.

    Citation

    @misc{khanuja2021muril,
       title={MuRIL: Multilingual Representations for Indian Languages},
       author={Simran Khanuja and Diksha Bansal and Sarvesh Mehtani and Savya Khosla and Atreyee Dey and Balaji Gopalan and Dilip Kumar Margam and Pooja Aggarwal and Rajiv Teja Nagipogu and Shachi Dave and Shruti Gupta and Subhash Chandra Bose Gali and Vish Subramanian and Partha Talukdar},
       year={2021},
       eprint={2103.10730},
       archivePrefix={arXiv},
       primaryClass={cs.CL}
    }
    

    References

    [1]: Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805, 2018.

    [2]: Wikipedia

    [3]: Common Crawl

    [4]: PMINDIA

    [5]: Dakshina

    [6]: Assamese (as), Bengali (bn), English (en), Gujarati (gu), Hindi (hi), Kannada (kn), Kashmiri (ks), Malayalam (ml), Marathi (mr), Nepali (ne), Oriya (or), Punjabi (pa), Sanskrit (sa), Sindhi (sd), Tamil (ta), Telugu (te) and Urdu (ur).

    [7]: [Conneau, Alexis, et al. Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.0...

  20. wmt19

    • huggingface.co
    • opendatalab.com
    Updated Dec 25, 2021
    Cite
    WMT: Workshop on Statistical Machine Translation (2021). wmt19 [Dataset]. https://huggingface.co/datasets/wmt/wmt19
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 25, 2021
    Dataset authored and provided by
    WMT: Workshop on Statistical Machine Translation
    License

    Unknown (https://choosealicense.com/licenses/unknown/)

    Description

    Dataset Card for "wmt19"

      Dataset Summary
    

    Warning: There are issues with the Common Crawl corpus data (training-parallel-commoncrawl.tgz):

    Non-English files contain many English sentences.
    Their "parallel" sentences in English are not aligned: they are uncorrelated with their counterpart.
    

    We have contacted the WMT organizers, and in response, they have indicated that they do not have plans to update the Common Crawl corpus data. Their rationale pertains… See the full description on the dataset page: https://huggingface.co/datasets/wmt/wmt19.
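
    A minimal sketch for inspecting one language pair with the Hugging Face datasets library; the "de-en" configuration name is an assumption based on the dataset card:

    from datasets import load_dataset  # pip install datasets

    wmt = load_dataset("wmt/wmt19", "de-en", split="train", streaming=True)
    example = next(iter(wmt))
    print(example["translation"]["de"][:60])
    print(example["translation"]["en"][:60])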
