9 datasets found
  1. Data from: Webis-Web-Archive-17

    • zenodo.org
    png, txt, zip
    Updated Oct 4, 2017
  2. Data from: Webis-Web-Archive-17

    • webis.de
    1002203
    Updated 2017
  3. Webis-Web-Archive-17 Content Error Annotations

    • zenodo.org
    • search.datacite.org
    csv
    Updated Jan 25, 2019
  4. Webis-Web-Archive-17 Content Error Annotations

    • zenodo.org
    csv, txt
    Updated Mar 22, 2019
  5. Webis-Web-Errors-19

    • explore.openaire.eu
    • zenodo.org
    Updated Apr 15, 2019
  6. Webis-WebSeg-20

    • webis.de
    • zenodo.org
    • +1 more
    Updated 2020
  7. Webis-Web-Segments-20

    • zenodo.org
    txt, zip
    Updated Jun 8, 2020
  8. Webis-Clickbait-17

    • webis.de
    Updated 2017
  9. Webis Clickbait Corpus 2017 (Webis-Clickbait-17)

    • explore.openaire.eu
    • live.european-language-grid.eu
    • +1 more
    Updated Jan 16, 2022

Cite
Kiesel, Johannes; Potthast, Martin; Hagen, Matthias; Kneist, Florian; Stein, Benno (2017). Webis-Web-Archive-17 [Dataset]. http://doi.org/10.5281/zenodo.1002204

Data from: Webis-Web-Archive-17

Available download formats
zip, txt, png
Dataset updated
Oct 4, 2017
Dataset provided by
Ulm University (http://www.uni-ulm.de/)
Martin-Luther-University Halle-Wittenberg (https://www.uni-halle.de/)
Bauhaus-Universität Weimar (http://www.uni-weimar.de/)
Leipzig University (http://www.uni-leipzig.de/)
Authors
Kiesel, Johannes; Potthast, Martin; Hagen, Matthias; Kneist, Florian; Stein, Benno
License

Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically

Description

The Webis-Web-Archive-17 comprises 10,000 web page archives from mid-2017, carefully sampled from the Common Crawl to include a mixture of high-ranking and low-ranking web pages. The dataset contains the web archive files, HTML DOM, and screenshots of each web page, as well as per-page annotations of visual web archive quality. See this overview for all datasets that build upon this one. If you use this dataset in your research, please cite it using this paper.
