Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The Webis-Web-Archive-17 comprises a total of 10,000 web page archives from mid-2017 that were carefully sampled from the Common Crawl to include a mixture of high-ranking and low-ranking web pages. The dataset contains the web archive files, HTML DOM, and screenshots of each web page, as well as per-page annotations of visual web archive quality. See this overview for all datasets that build upon this one. If you use this dataset in your research, please cite it using this paper.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Webis-Web-Archive-Quality-22 comprises a total of 6,500 pairs of screenshots of web pages, each pair showing a page as it was archived and as it was reproduced from that archive, along with archive quality annotations and information on the DOM elements shown in the screenshots.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Webis-Web-Errors-19 comprises various annotations for the 10,000 web page archives of the Webis-Web-Archive-17. The annotations indicate whether the page is (1) mostly advertisement, (2) cut off, (3) still loading, or (4) pornographic, and to what degree (not / a bit / very) it shows (5) pop-ups, (6) CAPTCHAs, or (7) error messages. If you use this dataset in your research, please cite it using this paper.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Annotations of content errors in the Webis-Web-Archive-17.
Described in more detail in an upcoming publication.
In addition to the annotations, web pages were manually tagged with various labels (especially the kind of error implied by the error messages).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Annotations of content errors in the Webis-Web-Archive-17.
Described in more detail in an upcoming publication.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Webis-WebSeg-20 dataset comprises 42,450 crowdsourced segmentations for 8,490 web pages from the Webis-Web-Archive-17. For each web page, the segmentations of five crowd workers were fused into a single segmentation.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset of crowdsourced annotations for web page segmentations.
Web pages are taken from the Webis-Web-Archive-17.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Archive of the Geohistory/Géohistoire website and related files. Captured July 17, 2020.
http://www.nationalarchives.gov.uk/doc/non-commercial-government-licence/non-commercial-government-licence.htm
Data from Transnationalizing Modern Languages (09-2018)
Transnationalizing Modern Languages: Mobility, Identity and Translation in Modern Italian Cultures (TML) (funded by the AHRC under the ‘Translating Cultures’ theme, 2014-17)
PI Charles Burdett, University of Bristol. CIs Jenny Burns (Warwick), Loredana Polezzi (Warwick/Cardiff), Derek Duncan (St Andrews), Margaret Hills de Zarate (QMU)
RAs: Barbara Spadaro (Bristol), Carlo Pirozzi (St Andrews), Marco Santello (Warwick), Naomi Wells (Warwick), Luisa Percopo (Cardiff)
PhD students: Iacopo Colombini (St Andrews), Georgia Wall (Warwick)
Below is a short description of the project. Within the repository, there is a longer description of TML and each folder is accompanied by an explanatory text.
The project investigates practices of linguistic and cultural interchange within communities and individuals and explores the ways in which cultural translation intersects with linguistic translation in the everyday lives of people. The project has used as its primary object of enquiry the 150-year history of Italy as a nation state and its patterns of emigration and immigration. TML has concentrated on a series of exemplary cases, representative of the geographic, historical and linguistic map of Italian mobility. Focussing on the cultural associations that each community has formed, it examines the wealth of publications and materials that are associated with these organizations.
Working closely with researchers from across Modern Languages, the project has sought to demonstrate the principle that language is most productively apprehended in the frame of translation and the national in the frame of the transnational. TML is contributing to the development of a new framework for the disciplinary field of MLs, one which puts the interaction of languages and cultures at its core.
The principles of co-production and co-research lie at the core of the project, and TML has worked with an extensive range of partners, including Castlebrae and Drummond Community High Schools and cultural associations across the world. The project exhibition, featuring the research of the project and including the work of photographer Mario Badagliacca, was curated by Viviana Gravano and Giulia Grechi of Routes Agency. Project events in the UK have drawn on the expertise of Rita Wilson (Monash), the writer Shirin Ramzanali Fazel and all members of the Advisory Board. The project, in close collaboration with the University of Namibia (UNAM) and the Phoenix Project (Cardiff), has been followed by ‘TML: Global Challenges’.