9 datasets found
  1. Data from: Webis-Web-Archive-17

    • webis.de
    • zenodo.org
    1002203
    Updated 2017
  2. Webis-Web-Archive-17 Content Error Annotations

    • zenodo.org
    csv, txt
    Updated Mar 22, 2019
  3. Webis-Web-Archive-Quality-22

    • webis.de
    6881334
    Updated 2022
  4. Webis-Web-Errors-19

    • webis.de
    • zenodo.org
    2549837
    Updated 2019
  5. Webis-Web-Archive-17 Content Error Annotations

    • zenodo.org
    csv
    Updated Jan 25, 2019
  6. Webis-WebSeg-20

    • webis.de
    • zenodo.org
    3354902
    Updated 2020
  7. Webis-Web-Segments-20

    • zenodo.org
    txt, zip
    Updated Jun 8, 2020
  8. geohist.ca website files/fichiers du site web geohist.ca

    • borealisdata.ca
    application/gzip
    Updated Jun 2, 2022
  9. Webis Clickbait Corpus 2017 (Webis-Clickbait-17)

    • zenodo.org
    zip
    Updated Jun 11, 2018

Cite
Johannes Kiesel; Florian Kneist; Matthias Hagen; Benno Stein (2017). Webis-Web-Archive-17 [Dataset]. http://doi.org/10.5281/zenodo.1002203

Data from: Webis-Web-Archive-17

Available download formats
1002203
Dataset updated
2017
Dataset provided by
Friedrich Schiller University Jena
The Web Technology & Information Systems Network
Bauhaus-Universität Weimar
Leipzig University
Authors
Johannes Kiesel; Florian Kneist; Matthias Hagen; Benno Stein
License

Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

The Webis-Web-Archive-17 comprises 10,000 web page archives from mid-2017, carefully sampled from the Common Crawl to include a mixture of high-ranking and low-ranking web pages. For each web page, the dataset contains the web archive files, the HTML DOM, and screenshots, as well as per-page annotations of visual web archive quality.
