9 datasets found
  1. Data from: Webis-Web-Archive-17

    • zenodo.org
    • anthology.aicmu.ac.cn
    • +3 more
    png, txt, zip
    Updated Jul 19, 2024
    + more versions
    Johannes Kiesel; Martin Potthast; Matthias Hagen; Florian Kneist; Benno Stein (2024). Webis-Web-Archive-17 [Dataset]. http://doi.org/10.5281/zenodo.4040710
    Available download formats
    zip, png, txt
    Dataset updated
    Jul 19, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Johannes Kiesel; Martin Potthast; Matthias Hagen; Florian Kneist; Benno Stein
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    The Webis-Web-Archive-17 comprises a total of 10,000 web page archives from mid-2017 that were carefully sampled from the Common Crawl to include a mixture of high-ranking and low-ranking web pages. The dataset contains the web archive files, HTML DOM, and screenshots of each web page, as well as per-page annotations of visual web archive quality. See this overview for all datasets that build upon this one. If you use this dataset in your research, please cite it using this paper.

  2. Webis-Web-Archive-Quality-22

    • anthology.aicmu.ac.cn
    • webis.de
    Updated 2022
    Martin Potthast; Johannes Kiesel; Benno Stein (2022). Webis-Web-Archive-Quality-22 [Dataset]. http://doi.org/10.5281/zenodo.6881334
    Dataset updated
    2022
    Dataset provided by
    The Web Technology & Information Systems Network
    Bauhaus-Universität Weimar
    Leipzig University
    Authors
    Martin Potthast; Johannes Kiesel; Benno Stein
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Webis-Web-Archive-Quality-22 comprises a total of 6,500 pairs of screenshots of web pages, one taken as the page was archived and one as it was reproduced from that archive, along with archive quality annotations and information on the DOM elements shown in the screenshots.

  3. Webis-Web-Errors-19

    • zenodo.org
    • webis.de
    • +2 more
    csv, png, txt
    Updated Jul 24, 2024
    Johannes Kiesel; Fabienne Hubricht; Benno Stein; Martin Potthast (2024). Webis-Web-Errors-19 [Dataset]. http://doi.org/10.5281/zenodo.2640364
    Available download formats
    csv, png, txt
    Dataset updated
    Jul 24, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Johannes Kiesel; Fabienne Hubricht; Benno Stein; Martin Potthast
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Webis-Web-Errors-19 comprises various annotations for the 10,000 web page archives of the Webis-Web-Archive-17. The annotations record whether the page is (1) mostly advertisement, (2) cut off, (3) still loading, or (4) pornographic, and to what degree (not / a bit / very) it shows (5) pop-ups, (6) CAPTCHAs, or (7) error messages. If you use this dataset in your research, please cite it using this paper.

  4. Webis-Web-Archive-17 Content Error Annotations

    • zenodo.org
    csv, txt
    Updated Sep 21, 2020
    Johannes Kiesel; Fabienne Hubricht; Benno Stein; Martin Potthast (2020). Webis-Web-Archive-17 Content Error Annotations [Dataset]. http://doi.org/10.5281/zenodo.2602699
    Available download formats
    txt, csv
    Dataset updated
    Sep 21, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Johannes Kiesel; Fabienne Hubricht; Benno Stein; Martin Potthast
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Annotations of content errors in the Webis-Web-Archive-17.

    Described in more detail in an upcoming publication.

    In addition to the annotations, web pages were manually tagged with various labels (especially the kind of error implied by the error messages).

  5. Webis-Web-Archive-17 Content Error Annotations

    • zenodo.org
    csv
    Updated Sep 21, 2020
    Johannes Kiesel; Fabienne Hubricht; Benno Stein; Martin Potthast (2020). Webis-Web-Archive-17 Content Error Annotations [Dataset]. http://doi.org/10.5281/zenodo.2549838
    Available download formats
    csv
    Dataset updated
    Sep 21, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Johannes Kiesel; Fabienne Hubricht; Benno Stein; Martin Potthast
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Annotations of content errors in the Webis-Web-Archive-17.

    Described in more detail in an upcoming publication.

  6. Webis-WebSeg-20

    • webis.de
    • anthology.aicmu.ac.cn
    • +1 more
    Updated 2020
    + more versions
    Johannes Kiesel; Lars Meyer; Benno Stein; Martin Potthast (2020). Webis-WebSeg-20 [Dataset]. http://doi.org/10.5281/zenodo.3354902
    Dataset updated
    2020
    Dataset provided by
    University of Kassel, hessian.AI, and ScaDS.AI
    Enginsight GmbH
    GESIS - Leibniz Institute for the Social Sciences
    The Web Technology & Information Systems Network
    Bauhaus-Universität Weimar
    Authors
    Johannes Kiesel; Lars Meyer; Benno Stein; Martin Potthast
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Webis-WebSeg-20 dataset comprises 42,450 crowdsourced segmentations for 8,490 web pages from the Webis-Web-Archive-17. Each page was segmented by five crowd workers, and the fused segmentations were derived from those five segmentations.

  7. Webis-Web-Segments-20

    • data.niaid.nih.gov
    • explore.openaire.eu
    • +1 more
    Updated Feb 16, 2023
    Meyer, Lars (2023). Webis-Web-Segments-20 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3354902
    Dataset updated
    Feb 16, 2023
    Dataset provided by
    Kneist, Florian
    Komlossy, Kristof
    Meyer, Lars
    Potthast, Martin
    Kiesel, Johannes
    Stein, Benno
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset of crowdsourced annotations for web page segmentations.

    Web pages are taken from the Webis-Web-Archive-17.

  8. geohist.ca website files/fichiers du site web geohist.ca

    • borealisdata.ca
    Updated Jun 2, 2022
    Marcel Fortin (2022). geohist.ca website files/fichiers du site web geohist.ca [Dataset]. http://doi.org/10.5683/SP2/OWEBOJ
    Dataset updated
    Jun 2, 2022
    Dataset provided by
    Borealis
    Authors
    Marcel Fortin
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Archive of the Geohistory/Géohistoire website and related files. Captured July 17, 2020.

  9. Data from Transnational Mod Languages (09-2018)/05 TML Website/TML Website...

    • data.wu.ac.at
    docx, jpeg, pdf, png
    Updated Oct 2, 2018
    Arts (2018). Data from Transnational Mod Languages (09-2018)/05 TML Website/TML Website Storage/5 TML NEWS/TML is involved in the new archive exhibition, Edinburgh December 2015 – January 2016) [Dataset]. https://data.wu.ac.at/schema/data_bris_ac_uk_data_/YWRmMzUzMWYtZjM1OS00OWRkLWE2YjUtYjRmOWQ5YWZjNzM3
    Available download formats
    jpeg (31534.0), pdf (829537.0), jpeg (77833.0), docx (79806.0), png (666062.0)
    Dataset updated
    Oct 2, 2018
    Dataset provided by
    Arts
    License

    http://www.nationalarchives.gov.uk/doc/non-commercial-government-licence/non-commercial-government-licence.htm

    Description

    Data from Transnationalizing Modern Languages (09-2018)

    Transnationalizing Modern Languages: Mobility, Identity and Translation in Modern Italian Cultures (TML) (funded by the AHRC under the ‘Translating Cultures’ theme, 2014-17)

    PI Charles Burdett, University of Bristol. CIs Jenny Burns (Warwick), Loredana Polezzi (Warwick/Cardiff), Derek Duncan (St Andrews), Margaret Hills de Zarate (QMU)

    RAs: Barbara Spadaro (Bristol), Carlo Pirozzi (St Andrews), Marco Santello (Warwick), Naomi Wells (Warwick), Luisa Percopo (Cardiff)

    PhD students: Iacopo Colombini (St Andrews), Georgia Wall (Warwick)

    Below is a short description of the project. Within the repository, there is a longer description of TML and each folder is accompanied by an explanatory text.

    The project investigates practices of linguistic and cultural interchange within communities and individuals and explores the ways in which cultural translation intersects with linguistic translation in the everyday lives of people. The project has used as its primary object of enquiry the 150-year history of Italy as a nation state and its patterns of emigration and immigration. TML has concentrated on a series of exemplary cases, representative of the geographic, historical and linguistic map of Italian mobility. Focussing on the cultural associations that each community has formed, it examines the wealth of publications and materials that are associated with these organizations.

    Working closely with researchers from across Modern Languages, the project has sought to demonstrate the principle that language is most productively apprehended in the frame of translation and the national in the frame of the transnational. TML is contributing to the development of a new framework for the disciplinary field of MLs, one which puts the interaction of languages and cultures at its core.

    The principles of co-production and co-research lie at the core of the project and TML has worked closely with a very extensive range of partners. It has worked closely with Castlebrae and Drummond Community High Schools and with cultural associations across the world. The project exhibition, featuring the research of the project and including the work of photographer Mario Badagliacca, was curated by Viviana Gravano and Giulia Grechi of Routes Agency. Project events in the UK have drawn on the expertise of Rita Wilson (Monash), the writer Shirin Ramzanali Fazel and all members of the Advisory Board. The project, in close collaboration with the University of Namibia (UNAM) and the Phoenix Project (Cardiff), has been followed by ‘TML: Global Challenges’.
