100+ datasets found
  1. The CommonCrawl Corpus

    • marketplace.sshopencloud.eu
    Updated Apr 24, 2020
    + more versions
    Cite
    (2020). The CommonCrawl Corpus [Dataset]. https://marketplace.sshopencloud.eu/dataset/93FNrL
    Explore at:
    133 scholarly articles cite this dataset (View in Google Scholar)
    Dataset updated
    Apr 24, 2020
    Description

    The Common Crawl corpus contains petabytes of data collected over 8 years of web crawling. The corpus contains raw web page data, metadata extracts and text extracts. Common Crawl data is stored on Amazon Web Services’ Public Data Sets and on multiple academic cloud platforms across the world.

  2. web-cc12-hostgraph

    • networkrepository.com
    csv
    Updated Oct 4, 2018
    Cite
    Network Data Repository (2018). web-cc12-hostgraph [Dataset]. https://networkrepository.com/web-cc12-hostgraph.php
    Explore at:
    Available download formats: csv
    Dataset updated
    Oct 4, 2018
    Dataset authored and provided by
    Network Data Repository
    License

    https://networkrepository.com/policy.php

    Description

    Host-level Web Graph - This graph aggregates the page graph by subdomain/host where each node represents a specific subdomain/host and an edge exists between a pair of hosts/subdomains if at least one link was found between pages that belong to a pair of subdomains/hosts. The hyperlink graph was extracted from the Web corpus released by the Common Crawl Foundation in August 2012. The Web corpus was gathered using a web crawler employing a breadth-first-search selection strategy and embedding link discovery while crawling. The crawl was seeded with a large number of URLs from former crawls performed by the Common Crawl Foundation. Also, see web-cc12-firstlevel-subdomain and web-cc12-PayLevelDomain.
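
    A minimal sketch for loading and inspecting the host graph with networkx; the file name and the two-column edge-list layout are assumptions, so check the downloaded file first:

      import networkx as nx

      # Assumed layout: one "source target" pair of host ids per line,
      # with "%" comment lines (typical for Network Repository files).
      G = nx.read_edgelist("web-cc12-hostgraph.edges", comments="%",
                           create_using=nx.DiGraph, nodetype=int)
      print(G.number_of_nodes(), "hosts,", G.number_of_edges(), "links")

      # Hosts with the most incoming links, a rough popularity proxy.
      top10 = sorted(G.in_degree, key=lambda kv: kv[1], reverse=True)[:10]
      print(top10)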

  3. CommonCrawl-CreativeCommons

    • huggingface.co
    Cite
    Bram Vanroy, CommonCrawl-CreativeCommons [Dataset]. http://doi.org/10.57967/hf/5340
    Explore at:
    Authors
    Bram Vanroy
    License

    https://choosealicense.com/licenses/cc/

    Description

    The Common Crawl Creative Commons Corpus (C5)

    Raw CommonCrawl crawls, annotated with Creative Commons license information

    C5 is an effort to collect Creative Commons-licensed web data in one place. The licensing information is extracted from web pages based on whether they link to Creative Commons licenses, either overtly in <a> tags (as in the footer of Wikipedia) or in metadata fields indicating deliberate Creative Commons publication. However, false positives may occur! See… See the full description on the dataset page: https://huggingface.co/datasets/BramVanroy/CommonCrawl-CreativeCommons.
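
    The detection idea can be approximated in a few lines. This is an illustrative sketch, not the actual C5 pipeline; it checks only <a> tags and the common rel="license" metadata pattern:

      from bs4 import BeautifulSoup

      def find_cc_license(html: str):
          """Return the first Creative Commons license URL found, else None."""
          soup = BeautifulSoup(html, "html.parser")
          # Overt links, e.g. in a page footer.
          for a in soup.find_all("a", href=True):
              if "creativecommons.org/licenses/" in a["href"]:
                  return a["href"]
          # Metadata indicating deliberate CC publication.
          link = soup.find("link", rel="license", href=True)
          if link and "creativecommons.org" in link["href"]:
              return link["href"]
          return None

      html = '<footer><a href="https://creativecommons.org/licenses/by/4.0/">CC BY</a></footer>'
      print(find_cc_license(html))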

  4. Ancillary Monitor Corpus: Common Crawl - german web (YEAR 2017 – VERSION 1)...

    • b2find.eudat.eu
    Updated Nov 16, 2024
    + more versions
    Cite
    (2024). Ancillary Monitor Corpus: Common Crawl - german web (YEAR 2017 – VERSION 1) - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/ca7d532c-3db1-5bdc-ab01-e16a9b1466e3
    Explore at:
    Dataset updated
    Nov 16, 2024
    Description

    The ‘Ancillary Monitor Corpus: Common Crawl - german web’ was designed to enable a broad-based linguistic analysis of the German-language (visible) internet over time, aiming for comparability with DeReKo (the ‘German Reference Corpus’ of the Leibniz Institute for the German Language; 57 billion tokens as of DeReKo Release 2024-I). The corpus is separated by year (here: 2017) and versioned (here: version 1). Across all years (2013-2024), version 1 comprises 97.45 billion tokens.

    The corpus is based on the data dumps of CommonCrawl (https://commoncrawl.org/), a non-profit organisation that provides copies of the visible internet free of charge for research purposes. The CommonCrawl WET raw data was first filtered by TLD (top-level domain). Only pages under the following TLDs were taken into account: .at; .bayern; .berlin; .ch; .cologne; .de; .gmbh; .hamburg; .koeln; .nrw; .ruhr; .saarland; .swiss; .tirol; .wien; .zuerich. These are the exclusively German-language TLDs according to ICANN (https://data.iana.org/TLD/tlds-alpha-by-domain.txt) as of 1 June 2024; TLDs with a purely corporate reference (e.g. .edeka, .bmw, .ford) were excluded. (A filtering sketch follows the metadata list below.) The language of each document (URL) was then estimated with NTextCat (https://github.com/ivanakcheurov/ntextcat) using its CORE14 profile; only documents for which German was the most likely language were processed further, e.g. to exclude foreign-language material such as individual subpages. The third step filtered on manual selectors and removed 1:1 duplicates (within one year). Filtering and subsequent processing were carried out with CorpusExplorer (http://hdl.handle.net/11234/1-2634) and our own supplementary scripts; TreeTagger (http://hdl.handle.net/11372/LRT-323) was used for automatic annotation. The corpus was processed on the HELIX HPC cluster; the author thanks the state of Baden-Württemberg and the German Research Foundation (DFG) for the opportunity to use the bwHPC/HELIX HPC cluster (funding code INST 35/1597-1 FUGG).

    Data content:
    • Tokens and record boundaries
    • Automatic lemma and POS annotation (using TreeTagger)
    • Metadata:
      GUID - unique identifier of the document
      YEAR - year of capture (use this field for data slices)
      Url - full URL
      Tld - top-level domain
      Domain - domain without TLD (but with sub-domains if applicable)
      DomainFull - complete domain (incl. TLD)
      Datum (system information) - date of capture by CommonCrawl, not the date of creation/modification of the document
      Hash (system information) - SHA1 hash of the CommonCrawl
      Pfad (system information) - path on the cluster (raw data), supplied by the system
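
    The TLD filtering step is straightforward to reproduce. A minimal sketch using the TLD list quoted above (the WET parsing itself is omitted):

      from urllib.parse import urlparse

      GERMAN_TLDS = {
          "at", "bayern", "berlin", "ch", "cologne", "de", "gmbh", "hamburg",
          "koeln", "nrw", "ruhr", "saarland", "swiss", "tirol", "wien", "zuerich",
      }

      def has_german_tld(url: str) -> bool:
          """True if the URL's host ends in one of the listed TLDs."""
          host = urlparse(url).hostname or ""
          return host.rsplit(".", 1)[-1].lower() in GERMAN_TLDS

      assert has_german_tld("https://example.de/seite")
      assert not has_german_tld("https://example.com/page")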

  5. statistics

    • huggingface.co
    Updated Nov 20, 2024
    Cite
    Common Crawl Foundation (2024). statistics [Dataset]. https://huggingface.co/datasets/commoncrawl/statistics
    Explore at:
    Dataset updated
    Nov 20, 2024
    Dataset provided by
    Common Crawl (http://commoncrawl.org/)
    Authors
    Common Crawl Foundation
    Description

    Common Crawl Statistics

    Number of pages, distribution of top-level domains, crawl overlaps, etc. - basic metrics about the Common Crawl Monthly Crawl Archives. For more detailed information and graphs, please visit our official statistics page. Here you can find the following statistics files:

      Charsets
    

    The character set or encoding of HTML pages is identified by Tika's AutoDetectReader. The table shows the percentage of HTML pages encoded in each character set… See the full description on the dataset page: https://huggingface.co/datasets/commoncrawl/statistics.
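
    The statistics files sit in an ordinary Hugging Face dataset repository, so individual files can be fetched directly. A sketch with huggingface_hub; the exact file names inside the repo are an assumption, so list the repository first:

      from huggingface_hub import hf_hub_download, list_repo_files

      files = list_repo_files("commoncrawl/statistics", repo_type="dataset")
      print(files[:5])  # inspect what is actually available

      # "tlds.csv" is a guess at a per-TLD statistics file; adjust as needed.
      path = hf_hub_download(repo_id="commoncrawl/statistics",
                             filename="tlds.csv", repo_type="dataset")
      with open(path) as f:
          print(f.readline())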

  6. taiga

    • huggingface.co
    Updated Nov 2, 2023
    + more versions
    Cite
    Astafurov Danil (2023). taiga [Dataset]. https://huggingface.co/datasets/danasone/taiga
    Explore at:
    Dataset updated
    Nov 2, 2023
    Authors
    Astafurov Danil
    License

    https://choosealicense.com/licenses/openrail/

    Description

    A colossal, cleaned version of Common Crawl's web crawl corpus. Based on Common Crawl dataset: "https://commoncrawl.org". This is the processed version of Google's mC4 dataset by AllenAI.

  7. Crawl attributes ranking by importance on websites in France 2020

    • statista.com
    Updated Jul 9, 2025
    Cite
    Statista (2025). Crawl attributes ranking by importance on websites in France 2020 [Dataset]. https://www.statista.com/statistics/1220602/crawl-attributes-ranking-by-importance-on-websites-france/
    Explore at:
    Dataset updated
    Jul 9, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    2020
    Area covered
    France
    Description

    Ensuring crawlability and mobile indexing makes it easier for crawlers and Internet users to visit a site and facilitates its discovery by search engines. According to the source, in 2020 more than ** percent of SEOs attached great importance to internal linking, that is, the presence of internal links pointing to the page to be highlighted. They considered all criteria in the crawl category important, with the exception of the priority indication in the sitemap, to which only ** percent of SEOs attached importance, rating it **** out of five.

  8. Web Data Commons Training and Test Sets for Large-Scale Product Matching -...

    • b2find.eudat.eu
    Updated Nov 27, 2020
    + more versions
    Cite
    (2020). Web Data Commons Training and Test Sets for Large-Scale Product Matching - Version 2.0 Product Matching Task derived from the WDC Product Data Corpus - Version 2.0 - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/720b440c-eda0-5182-af9f-f868ed999bd7
    Explore at:
    Dataset updated
    Nov 27, 2020
    Description

    Many e-shops have started to mark up product data within their HTML pages using the schema.org vocabulary. The Web Data Commons project regularly extracts such data from the Common Crawl, a large public web crawl. The Web Data Commons Training and Test Sets for Large-Scale Product Matching contain product offers from different e-shops in the form of binary product pairs (with corresponding label “match” or “no match”) for four product categories: computers, cameras, watches and shoes. In order to support the evaluation of machine learning-based matching methods, the data is split into training, validation and test sets. For each product category, we provide training sets in four different sizes (2,000-70,000 pairs). Furthermore, for each training set there are sets of ids for a possible validation split (stratified random draw). The test set for each product category consists of 1,100 product pairs. The labels of the test sets were manually checked, while those of the training sets were derived using shared product identifiers from the Web as weak supervision. The data stems from the WDC Product Data Corpus for Large-Scale Product Matching - Version 2.0, which consists of 26 million product offers originating from 79 thousand websites. For more information and download links for the corpus itself, please follow the links below.

  9. c4_wsrs

    • tensorflow.org
    Updated Dec 22, 2022
    Cite
    (2022). c4_wsrs [Dataset]. https://www.tensorflow.org/datasets/catalog/c4_wsrs
    Explore at:
    Dataset updated
    Dec 22, 2022
    Description

    A medical abbreviation expansion dataset which applies web-scale reverse substitution (wsrs) to the C4 dataset, which is a colossal, cleaned version of Common Crawl's web crawl corpus.

    The original source is the Common Crawl dataset: https://commoncrawl.org

    To use this dataset:

    import tensorflow_datasets as tfds

    ds = tfds.load('c4_wsrs', split='train')
    for ex in ds.take(4):
        print(ex)
    

    See the guide for more information on tensorflow_datasets.

  10. LANL domain crawl seed list

    • figshare.com
    txt
    Updated May 27, 2022
    Cite
    Martin Klein (2022). LANL domain crawl seed list [Dataset]. http://doi.org/10.6084/m9.figshare.19912459.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    May 27, 2022
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Martin Klein
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Seed list generated for domain crawl of lanl.gov

  11. takaraspider

    • huggingface.co
    Updated Jun 18, 2025
    Cite
    Jordan Legg (2025). takaraspider [Dataset]. https://huggingface.co/datasets/takarajordan/takaraspider
    Explore at:
    Dataset updated
    Jun 18, 2025
    Authors
    Jordan Legg
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    TakaraSpider Japanese Web Crawl Dataset

      Dataset Summary
    

    TakaraSpider is a large-scale web crawl dataset specifically designed to capture Japanese web content alongside international sources. The dataset contains 257,900 web pages collected through systematic crawling, with a primary focus on Japanese language content (78.5%) while maintaining substantial international representation (21.5%). This makes it ideal for Japanese-English comparative studies, cross-cultural web… See the full description on the dataset page: https://huggingface.co/datasets/takarajordan/takaraspider.

  12. Common Crawl URL index for August 2019 with Last-Modified timestamps -...

    • catalogue.eidf.ac.uk
    Updated Sep 1, 2019
    Cite
    (2019). Common Crawl URL index for August 2019 with Last-Modified timestamps - Dataset - CKAN [Dataset]. https://catalogue.eidf.ac.uk/dataset/eidf125-common-crawl-url-index-for-august-2019-with-last-modified-timestamps
    Explore at:
    Dataset updated
    Sep 1, 2019
    Description

    This dataset consists of a complete set of augmented index files for CC-MAIN-2019-35 [1]. This version of the index contains one additional field, lastmod, in about 18% of the entries, giving the value of the Last-Modified header from the HTTP response as a POSIX-format timestamp, enabling much finer-grained longitudinal study of the corresponding web resources. The filename, offset and length fields in the augmented index are unchanged, and so can be used for retrieval from the original WARC files. [1] https://commoncrawl.org/blog/august-2019-crawl-archive-now-available
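
    Since the filename, offset and length fields are unchanged, a record referenced by the augmented index can be fetched from the original WARC files with an HTTP range request. A sketch, assuming the usual Common Crawl hosting at data.commoncrawl.org and gzip-compressed records:

      import gzip
      import requests

      def fetch_warc_record(filename: str, offset: int, length: int) -> bytes:
          """Fetch a single WARC record by byte range and return it decompressed."""
          url = f"https://data.commoncrawl.org/{filename}"
          rng = f"bytes={offset}-{offset + length - 1}"
          resp = requests.get(url, headers={"Range": rng}, timeout=60)
          resp.raise_for_status()
          return gzip.decompress(resp.content)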

  13. Web Data Commons - RDFa, Microdata, and Microformat Data Sets

    • webdatacommons.org
    n-quads
    Updated Oct 15, 2016
    + more versions
    Cite
    Christian Bizer; Robert Meusel; Anna Primpeli (2016). Web Data Commons - RDFa, Microdata, and Microformat Data Sets [Dataset]. http://webdatacommons.org/structureddata/2016-10/stats/stats.html
    Explore at:
    Available download formats: n-quads
    Dataset updated
    Oct 15, 2016
    Authors
    Christian Bizer; Robert Meusel; Anna Primpeli
    Description

    Microformat, Microdata and RDFa data from the October 2016 Common Crawl web corpus. We found structured data within 1.24 billion HTML pages out of the 3.2 billion pages contained in the crawl (38%). These pages originate from 5.63 million different pay-level-domains out of the 34 million pay-level-domains covered by the crawl (16.5%). Altogether, the extracted data sets consist of 44.2 billion RDF quads.
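
    The extracted data is published as RDF quads (N-Quads), which can be parsed with rdflib. A small sketch on an inline example rather than the multi-billion-quad dump; the quad shown is illustrative:

      import rdflib

      # Subject, predicate, object, plus the source page as the graph name,
      # mirroring how WDC records where each triple was extracted.
      nq = ('<http://example.org/product> '
            '<http://schema.org/name> "Example" '
            '<http://example.org/page.html> .\n')

      ds = rdflib.Dataset()
      ds.parse(data=nq, format="nquads")
      for s, p, o, g in ds.quads((None, None, None, None)):
          print(s, p, o, g)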

  14. Comprehensive set of Sitemap and robots.txt links extracted from Common...

    • zenodo.org
    zip
    Updated Mar 8, 2024
    Cite
    Michael Dinzinger; Michael Dinzinger (2024). Comprehensive set of Sitemap and robots.txt links extracted from Common Crawl [Dataset]. http://doi.org/10.5281/zenodo.10511292
    Explore at:
    Available download formats: zip
    Dataset updated
    Mar 8, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Michael Dinzinger; Michael Dinzinger
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jan 14, 2024
    Description

    This is a comprehensive list of links to sitemaps and robots.txt files, extracted from the latest WARC archive dump (2023-50) of robots.txt files. A usage sketch follows the tables below.

    Sitemaps:

    • 32,252,027 links (all ending with .xml or .xml.gz); 395.2 MB (compressed)
    • Website categories; 2.2 MB (compressed)
    • Number of sitemap links per top-level label of the Curlie.org directory:

      Top-level label    Sitemap links
      Arts                      20,110
      Business                  68,690
      Computers                 17,404
      Games                      3,068
      Health                    13,999
      Home                       4,130
      Kids_and_Teens             2,240
      News                       5,855
      Recreation                19,273
      Reference                 10,862
      Regional                     419
      Science                   10,729
      Shopping                  29,903
      Society                   35,019
      Sports                    12,597

    Robots.txt files:

    • 41,611,704 links; 440.9 MB (compressed)
    • Website categories; 2.7 MB (compressed)
    • Number of robots.txt links per top-level label of the Curlie.org directory:

      Top-level label    Robots.txt links
      Arts                         25,281
      Business                     79,497
      Computers                    21,880
      Games                         5,037
      Health                       17,326
      Home                          5,401
      Kids_and_Teens                3,753
      News                          3,424
      Recreation                   26,355
      Reference                    15,404
      Regional                        678
      Science                      16,500
      Shopping                     30,266
      Society                      45,397
      Sports                       18,029
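
    For a single site, sitemap links can be pulled from its robots.txt with the standard library alone; RobotFileParser.site_maps() requires Python 3.8+:

      from urllib.robotparser import RobotFileParser

      rp = RobotFileParser("https://example.org/robots.txt")
      rp.read()
      # Returns the URLs from "Sitemap:" directives, or None if none are declared.
      print(rp.site_maps())
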
  15. Webis-Simple-Sentences-17 Corpus

    • data.niaid.nih.gov
    • live.european-language-grid.eu
    • +2more
    Updated Jan 24, 2020
    Cite
    Kiesel, Johannes (2020). Webis-Simple-Sentences-17 Corpus [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_205950
    Explore at:
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Stein, Benno
    Lucks, Stefan
    Kiesel, Johannes
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A corpus of 471,085,690 English sentences extracted from the ClueWeb12 web crawl. The sentences were sampled from a larger corpus to achieve a level of sentence complexity similar to that of sentences humans make up as a memory aid for remembering passwords. Sentence complexity was determined by syllables per word.
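
    "Syllables per word" can be approximated with a standard vowel-group heuristic. An illustrative sketch, not the exact measure used to build the corpus:

      def count_syllables(word: str) -> int:
          """Rough syllable count: number of vowel groups, at least 1."""
          vowels = "aeiouy"
          groups, prev = 0, False
          for ch in word.lower():
              is_vowel = ch in vowels
              if is_vowel and not prev:
                  groups += 1
              prev = is_vowel
          return max(groups, 1)

      def syllables_per_word(sentence: str) -> float:
          words = sentence.split()
          return sum(count_syllables(w) for w in words) / len(words)

      print(syllables_per_word("the cat sat on the mat"))        # simple
      print(syllables_per_word("incomprehensible terminology"))  # complex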

    The corpus is split into training and test sets as used in the associated publication. The test set is extracted from part 00 of ClueWeb12, while the training set is extracted from the other parts.

    More information on the corpus can be found on the corpus web page at our university (listed under documented by).

  16. Whole-of-Australian Government Web Crawl

    • data.gov.au
    html, warc
    Updated Jul 29, 2019
    + more versions
    Cite
    Digital Transformation Agency (2019). Whole-of-Australian Government Web Crawl [Dataset]. https://data.gov.au/data/dataset/groups/whole-of-australian-government-web-crawl
    Explore at:
    Available download formats: html, warc
    Dataset updated
    Jul 29, 2019
    Dataset provided by
    Digital Transformation Agency (http://dta.gov.au/)
    Area covered
    Australia
    Description

    Includes publicly-accessible, human-readable material from Australian government websites, using the Organisations Register (AGOR) as a seed list and domain categorisation source, obeying robots.txt and sitemap.xml directives, gathered over a 10-day period.

    Several non-*.gov.au domains are included in the AGOR - these have been crawled up to a limit of 10K URLs.

    Several binary file formats are included and converted to HTML: doc, docm, docx, dot, epub, keys, numbers, pages, pdf, ppt, pptm, pptx, rtf, xls, xlsm, xlsx.

    URLs returning responses larger than 10MB are not included in the dataset.

    Raw gathered data (including metadata) is published in the Web Archive (WARC) format, in both a single, multi-gigabyte WARC file and split series.

    Metadata extracted from pages after filtering is published in JSON format, with fields defined in a data dictionary.
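
    The WARC files can be read with any WARC library. A sketch with warcio; the file name is a placeholder:

      from warcio.archiveiterator import ArchiveIterator

      with open("crawl.warc.gz", "rb") as stream:
          for record in ArchiveIterator(stream):
              if record.rec_type == "response":
                  url = record.rec_headers.get_header("WARC-Target-URI")
                  body = record.content_stream().read()
                  print(url, len(body))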

    Licence

    Web content contained within these WARC files has originally been authored by the agency hosting the referenced material. Authoring agencies are responsible for the choice of licence attached to the original material.

    A consistent licence across the entirety of the WARC files' contents should not be assumed. Agencies may have indicated copyright and licence information for a given URL as metadata embedded in a WARC file entry, but this should not be assumed to be present for all WARC entries.

  17. RDFa, Microdata, and Microformat Data Set

    • data.wu.ac.at
    html
    Updated Aug 3, 2014
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Web Data Commons (2014). RDFa, Microdata, and Microformat Data Set [Dataset]. https://data.wu.ac.at/schema/datahub_io/MDhkYWU2ODMtNmFjYi00NDgxLWFjODMtMjFjOGUzYTVlNzFm
    Explore at:
    htmlAvailable download formats
    Dataset updated
    Aug 3, 2014
    Dataset provided by
    Web Data Commons
    Description

    More and more websites have started to embed structured data describing products, people, organizations, places, events into their HTML pages using markup standards such as RDFa, Microdata and Microformats. The Web Data Commons project extracts this data from several billion web pages. The project provides the extracted data for download and publishes statistics about the deployment of the different formats.

  18. Webis MS MARCO Anchor Text 2022

    • webis.de
    • anthology.aicmu.ac.cn
    Updated 2022
    + more versions
    Cite
    Maik Fröbe; Maximilian Probst; Martin Potthast; Matthias Hagen (2022). Webis MS MARCO Anchor Text 2022 [Dataset]. http://doi.org/10.5281/zenodo.5883456
    Explore at:
    Available download formats
    Dataset updated
    2022
    Dataset provided by
    The Web Technology & Information Systems Network
    Friedrich Schiller University Jena
    University of Kassel, hessian.AI, and ScaDS.AI
    Authors
    Maik Fröbe; Maximilian Probst; Martin Potthast; Matthias Hagen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Webis MS MARCO Anchor Text 2022 dataset enriches Versions 1 and 2 of the MS MARCO document collection with anchor text extracted from six Common Crawl snapshots covering the years 2016 to 2021 (between 1.7 and 3.4 billion documents each). Overall, the dataset enriches 1,703,834 documents for Version 1 and 4,821,244 documents for Version 2 with up to 1,000 anchor texts each.
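
    The core extraction step, collecting (target URL, anchor text) pairs from crawled pages, looks roughly as follows. A simplified sketch, not the exact pipeline behind the dataset:

      from urllib.parse import urljoin
      from bs4 import BeautifulSoup

      def extract_anchor_texts(page_url: str, html: str):
          """Yield (absolute target URL, anchor text) pairs from one page."""
          soup = BeautifulSoup(html, "html.parser")
          for a in soup.find_all("a", href=True):
              text = a.get_text(" ", strip=True)
              if text:
                  yield urljoin(page_url, a["href"]), text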

  19. esCorpius: A Massive Spanish Crawling Corpus - Dataset - B2FIND

    • b2find.eudat.eu
    Updated May 6, 2023
    + more versions
    Cite
    (2023). esCorpius: A Massive Spanish Crawling Corpus - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/a6d982c5-6a96-52ae-b0f3-ffb32a1b1380
    Explore at:
    Dataset updated
    May 6, 2023
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    In recent years, Transformer-based models have led to significant advances in language modelling for natural language processing. However, they require vast amounts of data for (pre-)training, and there is a lack of corpora in languages other than English. Recently, several initiatives have presented multilingual datasets obtained by automatic web crawling. However, the results for Spanish have important shortcomings: they are either too small in comparison with other languages, or of low quality owing to sub-optimal cleaning and deduplication. In this paper, we introduce esCorpius, a Spanish crawling corpus obtained from nearly 1 PB of Common Crawl data. It is the most extensive corpus in Spanish with this level of quality in the extraction, purification and deduplication of web textual content. Our data curation process involves a novel highly parallel cleaning pipeline and encompasses a series of deduplication mechanisms that together ensure the integrity of both document and paragraph boundaries. Additionally, we retain both the source web page URL and the WARC shard origin URL in order to comply with EU regulations. esCorpius has been released under a CC BY-NC-ND 4.0 license.

  20. ClueWeb12_Anchors (anchor text derived from CMU's ClueWeb12 web crawl)

    • academictorrents.com
    bittorrent
    Updated Dec 16, 2014
    Cite
    Djoerd Hiemstra (2014). ClueWeb12_Anchors (anchor text derived from CMU's ClueWeb12 web crawl) [Dataset]. https://academictorrents.com/details/8ecbbc8360a2d8b6438000ebf257ed06e2eaeb20
    Explore at:
    Available download formats: bittorrent (30349856980)
    Dataset updated
    Dec 16, 2014
    Dataset authored and provided by
    Djoerd Hiemstra
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A BitTorrent file to download data with the title 'ClueWeb12_Anchors (anchor text derived from CMU's ClueWeb12 web crawl)'.
