8 datasets found

W
Data from: Webis-Web-Archive-17
webis.de
data.niaid.nih.gov
1002203
Updated 2017
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Johannes Kiesel; Martin Potthast; Matthias Hagen; Benno Stein; Florian Kneist (2017). Webis-Web-Archive-17 [Dataset]. http://doi.org/10.5281/zenodo.1002203
Explore at:
1002203Available download formats
Unique identifier
https://doi.org/10.5281/zenodo.1002203
Dataset updated
2017
Dataset provided by
Friedrich Schiller University Jena
GESIS - Leibniz Institute for the Social Sciences
Bauhaus-Universität Weimar
University of Kassel, hessian.AI, and ScaDS.AI
The Web Technology & Information Systems Network
Authors
Johannes Kiesel; Martin Potthast; Matthias Hagen; Benno Stein; Florian Kneist
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The Webis-Web-Archive-17 comprises a total of 10,000 web page archives from mid-2017 that were carefully sampled from the Common Crawl to involve a mixture of high-ranking and low-ranking web pages. The dataset contains the web archive files, HTML DOM, and screenshots of each web page, as well as per-page annotations of visual web archive quality.
W
Webis-Web-Archive-Quality-22
webis.de
anthology.aicmu.ac.cn
6881334
Updated 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Theresa Elstner; Johannes Kiesel; Sebastian Schmidt; Martin Potthast; Benno Stein (2022). Webis-Web-Archive-Quality-22 [Dataset]. http://doi.org/10.5281/zenodo.6881334
Explore at:
6881334Available download formats
Unique identifier
https://doi.org/10.5281/zenodo.6881334
Dataset updated
2022
Dataset provided by
GESIS - Leibniz Institute for the Social Sciences
Bauhaus-Universität Weimar
University of Kassel
Enginsight GmbH
University of Kassel, hessian.AI, and ScaDS.AI
The Web Technology & Information Systems Network
Authors
Theresa Elstner; Johannes Kiesel; Sebastian Schmidt; Martin Potthast; Benno Stein
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The Webis-Web-Archive-Quality-22 comprises a total of 6,500 pairs of screenshots from web pages as they were archived and as they were reproduced from that archive, along with archive quality annotations and information of DOM elements on the screenshot.
Webis-Web-Errors-19
zenodo.org
webis.de
+1more
csv, png, txt
Updated Jul 24, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Johannes Kiesel; Johannes Kiesel; Fabienne Hubricht; Benno Stein; Martin Potthast; Martin Potthast; Fabienne Hubricht; Benno Stein (2024). Webis-Web-Errors-19 [Dataset]. http://doi.org/10.5281/zenodo.2640364
Explore at:
csv, png, txtAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.2640364
Dataset updated
Jul 24, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Johannes Kiesel; Johannes Kiesel; Fabienne Hubricht; Benno Stein; Martin Potthast; Martin Potthast; Fabienne Hubricht; Benno Stein
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The Webis-Web-Errors-19 comprises various annotations for the 10,000 web page archives of the Webis-Web-Archive-17. The annotations are whether the page is (1) mostly advertisement, (2) cut off, (3) still loading, (4) pornographic; and whether it shows (not/a bit/ very) (5) pop-ups, (6) CAPTCHAs, or (7) error messages. If you use this dataset in your research, please cite it using this paper.
Webis-Web-Archive-17 Content Error Annotations
zenodo.org
csv
Updated Sep 21, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Johannes Kiesel; Johannes Kiesel; Fabienne Hubricht; Benno Stein; Martin Potthast; Martin Potthast; Fabienne Hubricht; Benno Stein (2020). Webis-Web-Archive-17 Content Error Annotations [Dataset]. http://doi.org/10.5281/zenodo.2549838
Explore at:
csvAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.2549838
Dataset updated
Sep 21, 2020
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Johannes Kiesel; Johannes Kiesel; Fabienne Hubricht; Benno Stein; Martin Potthast; Martin Potthast; Fabienne Hubricht; Benno Stein
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Annotations of content errors in the Webis-Web-Archive-17.

Described in more detail in an upcoming publication.
Z
Webis-Web-Segments-20
data.niaid.nih.gov
Updated Feb 16, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Kiesel, Johannes; Kneist, Florian; Meyer, Lars; Komlossy, Kristof; Stein, Benno; Potthast, Martin (2023). Webis-Web-Segments-20 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3354902
Explore at:
Dataset updated
Feb 16, 2023
Dataset provided by
Leipzig University
Bauhaus-Universität Weimar
Authors
Kiesel, Johannes; Kneist, Florian; Meyer, Lars; Komlossy, Kristof; Stein, Benno; Potthast, Martin
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Dataset of crowdsourced annotations for web page segmentations.

Web pages are taken from the webis-web-archive-17.
Webis-WebSeg-20
zenodo.org
webis.de
+1more
txt, zip
Updated Feb 16, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Johannes Kiesel; Johannes Kiesel; Florian Kneist; Florian Kneist; Lars Meyer; Lars Meyer; Kristof Komlossy; Benno Stein; Benno Stein; Martin Potthast; Martin Potthast; Kristof Komlossy (2023). Webis-WebSeg-20 [Dataset]. http://doi.org/10.5281/zenodo.3988124
Explore at:
zip, txtAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.3988124
Dataset updated
Feb 16, 2023
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Johannes Kiesel; Johannes Kiesel; Florian Kneist; Florian Kneist; Lars Meyer; Lars Meyer; Kristof Komlossy; Benno Stein; Benno Stein; Martin Potthast; Martin Potthast; Kristof Komlossy
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The Webis-WebSeg-20 dataset comprises 42,450 crowdsourced segmentations for 8,490 web pages from the Webis-Web-Archive-17. Segmentations were fused from the segmentations of five crowd workers each. If you use this dataset in your research, please cite it using this paper.
Z
Webis Clickbait Corpus 2017 (Webis-Clickbait-17)
data.niaid.nih.gov
live.european-language-grid.eu
Updated Jun 11, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Potthast, Martin; Gollub, Tim; Wiegmann, Matti; Stein, Benno; Hagen, Matthias; Komlossy, Kristof; Schuster, Sebstian; Fernandez, Erika P. Garces (2022). Webis Clickbait Corpus 2017 (Webis-Clickbait-17) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3346490
Explore at:
Dataset updated
Jun 11, 2022
Dataset provided by
Bauhaus-Universität Weimar
Authors
Potthast, Martin; Gollub, Tim; Wiegmann, Matti; Stein, Benno; Hagen, Matthias; Komlossy, Kristof; Schuster, Sebstian; Fernandez, Erika P. Garces
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The Webis Clickbait Corpus 2017 (Webis-Clickbait-17) comprises a total of 38,517 Twitter posts from 27 major US news publishers. In addition to the posts, information about the articles linked in the posts are included. The posts had been published between November 2016 and June 2017. To avoid publisher and topical biases, a maximum of ten posts per day and publisher were sampled. All posts were annotated on a 4-point scale [not click baiting (0.0), slightly click baiting (0.33), considerably click baiting (0.66), heavily click baiting (1.0)] by five annotators from Amazon Mechanical Turk. A total of 9,276 posts are considered clickbait by the majority of annotators. In terms of its size, this corpus outranges the Webis Clickbait Corpus 2016 by one order of magnitude. The corpus is divided into two logical parts, a training and a test dataset. The training dataset has been released in the course of the Clickbait Challenge and a download link is provided below. To allow for an objective evaulatuion of clickbait detection systems, the test dataset is available only through the Evaluation-as-a-Service platform TIRA at the moment. On TIRA, developers can deploy clickbait detection systems and execute them against the test dataset. The performance of the submitted systems can be viewed on the TIRA page of the Clickbait Challenge.

To make working with the Webis Clickbait Corpus 2017 convenient, and to allow for its validation and replication, we are developing and sharing a number of software tags:

Corpus Viewer. Our Django web service for exploring corpora. For importing the Webis Clickbait Corpus 2017 into the corpus viewer, we provide an appropriate configuration file.

MTurk Manager. Our Django web service for conducting sophisticated crowd sourcing tasks on Amazon Mechanical Turk. The service allows to manage projects, upload batches of HITS, apply custom reviewing interfaces, and more. To make the clickbait crowd-sourcing task replicable, we share the worker template that we used to instruct the workers and to display the tweets. Also shared is a reviewing template that can be used to accept/reject assignments and to assess the quality of the received annotations quickly.

Web Archiver. Software for archiving web pages as WARC files and reproducing them later on. This software can be used to open the WARC archives provided above.

In addition to the corpus "clickbait17-train-170630.zip", we provide the original WARC archives of the articles that are linked in the posts. They are split in 5 archives that can be extracted separately.
B
geohist.ca website files/fichiers du site web geohist.ca
borealisdata.ca
Updated Jun 2, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Marcel Fortin (2022). geohist.ca website files/fichiers du site web geohist.ca [Dataset]. http://doi.org/10.5683/SP2/OWEBOJ
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.5683/SP2/OWEBOJ
Dataset updated
Jun 2, 2022
Dataset provided by
Borealis
Authors
Marcel Fortin
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Archive of the Geohistory/Géohistoire website and related files. Captured July 17, 2020.
Not seeing a result you expected?
Learn how you can add new datasets to our index.

Facebook

Twitter

Click to copy link

Link copied

Cite

Johannes Kiesel; Martin Potthast; Matthias Hagen; Benno Stein; Florian Kneist (2017). Webis-Web-Archive-17 [Dataset]. http://doi.org/10.5281/zenodo.1002203

Data from: Webis-Web-Archive-17

Explore at:

1002203Available download formats

Unique identifier

https://doi.org/10.5281/zenodo.1002203

Dataset updated

2017

Dataset provided by

Friedrich Schiller University Jena
GESIS - Leibniz Institute for the Social Sciences
Bauhaus-Universität Weimar
University of Kassel, hessian.AI, and ScaDS.AI
The Web Technology & Information Systems Network

Authors

Johannes Kiesel; Martin Potthast; Matthias Hagen; Benno Stein; Florian Kneist

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

The Webis-Web-Archive-17 comprises a total of 10,000 web page archives from mid-2017 that were carefully sampled from the Common Crawl to involve a mixture of high-ranking and low-ranking web pages. The dataset contains the web archive files, HTML DOM, and screenshots of each web page, as well as per-page annotations of visual web archive quality.

Clear search

Close search

Google apps

Main menu

Data from: Webis-Web-Archive-17

Webis-Web-Archive-Quality-22

Webis-Web-Errors-19

Webis-Web-Archive-17 Content Error Annotations

Webis-Web-Segments-20

Webis-WebSeg-20

Webis Clickbait Corpus 2017 (Webis-Clickbait-17)

geohist.ca website files/fichiers du site web geohist.ca

Data from: Webis-Web-Archive-17See More Versions

Data from: Webis-Web-Archive-17