2 datasets found
  1. The CommonCrawl Corpus

    • marketplace.sshopencloud.eu
    Updated Apr 24, 2020
    Cite
    (2020). The CommonCrawl Corpus [Dataset]. https://marketplace.sshopencloud.eu/dataset/93FNrL
    Description

    The Common Crawl corpus contains petabytes of data collected over 8 years of web crawling. The corpus contains raw web page data, metadata extracts and text extracts. Common Crawl data is stored on Amazon Web Services’ Public Data Sets and on multiple academic cloud platforms across the world.

  2. Common Crawl

    • opendatalab.com
    Updated Jan 1, 2019
    Cite
    Institut national de recherche en informatique et en automatique (2019). Common Crawl [Dataset]. https://opendatalab.com/OpenDataLab/Common_Crawl
    Available download formats: zip
    Dataset provided by
    Sorbonne University
    Institut national de recherche en informatique et en automatique
    License

    https://commoncrawl.org/terms-of-use/

    Description

    The Common Crawl corpus contains petabytes of data collected over 12 years of web crawling. The corpus contains raw web page data, metadata extracts and text extracts. Common Crawl data is stored on Amazon Web Services’ Public Data Sets and on multiple academic cloud platforms across the world.



The CommonCrawl Corpus

133 scholarly articles cite this dataset.
