Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Webis-Web-Errors-19 comprises various annotations for the 10,000 web page archives of the Webis-Web-Archive-17. The annotations record whether the page is (1) mostly advertisement, (2) cut off, (3) still loading, or (4) pornographic; and to what degree (not/a bit/very) it shows (5) pop-ups, (6) CAPTCHAs, or (7) error messages.
https://elrc-share.eu/terms/publicDomain.html
Bilingual (EN, ES) COVID-19-related corpus acquired from the website of the U.S. Food and Drug Administration (https://www.fda.gov/), an official website of the United States government (25th April 2020). It contains 3640 TUs in total.
https://elrc-share.eu/terms/openUnderPSI.html
This parallel corpus of bilingual texts, crawled from multilingual websites, contains 2,840 TUs. Period of crawling: 15/11/2016 - 23/01/2017. A strict validation process has been followed, which resulted in discarding:
- TUs from crawled websites that do not comply with the PSI directive,
- TUs with more than 99% of misspelled tokens,
- TUs identified during the manual validation process, and all the TUs from websites whose error rate in the sample extracted for manual validation is strictly above the following thresholds: 50% of TUs with language identification errors, 50% of TUs with alignment errors, 50% of TUs with tokenization errors, 20% of TUs identified as machine translated content, 50% of TUs with translation errors.
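The threshold-based discarding described above can be sketched roughly as follows. This is only an illustration, not the validation code actually used for the corpus: the function names and dictionary keys are hypothetical, and the sketch assumes that exceeding any single threshold is enough to discard a website, which the description does not state explicitly.

```python
# Per-website error-rate thresholds from the description (as fractions).
# The key names are hypothetical labels chosen for this sketch.
THRESHOLDS = {
    "language_identification": 0.50,
    "alignment": 0.50,
    "tokenization": 0.50,
    "machine_translated": 0.20,
    "translation": 0.50,
}

# TUs with more than 99% of misspelled tokens are discarded.
MAX_MISSPELLED_RATIO = 0.99


def site_is_discarded(error_rates):
    """Return True if any sampled error rate is strictly above its threshold.

    Assumption: one exceeded threshold suffices to drop the whole website.
    Error types missing from `error_rates` are treated as 0.0.
    """
    return any(
        error_rates.get(name, 0.0) > limit
        for name, limit in THRESHOLDS.items()
    )


def tu_passes_spelling_check(misspelled_ratio):
    """Return True if the TU has at most 99% misspelled tokens."""
    return misspelled_ratio <= MAX_MISSPELLED_RATIO


if __name__ == "__main__":
    sample = {"alignment": 0.10, "machine_translated": 0.25}
    print(site_is_discarded(sample))
```

Because the comparison is strict, a website whose sampled error rate sits exactly on a threshold (for example, 50% alignment errors) is kept under this reading.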
https://elrc-share.eu/terms/openUnderPSI.html
This parallel corpus of bilingual texts, crawled from multilingual websites, contains 3,319 TUs. Date of crawling: 23/01/2017. A strict validation process has been followed, which resulted in discarding:
- TUs from crawled websites that do not comply with the PSI directive,
- TUs with more than 99% of misspelled tokens,
- TUs identified during the manual validation process, and all the TUs from websites whose error rate in the sample extracted for manual validation is strictly above the following thresholds: 50% of TUs with language identification errors, 50% of TUs with alignment errors, 50% of TUs with tokenization errors, 20% of TUs identified as machine translated content, 50% of TUs with translation errors.
https://elrc-share.eu/terms/openUnderPSI.html
This parallel corpus of bilingual texts, crawled from multilingual websites, contains 26,622 TUs. Date of crawling: 16/12/2016. A strict validation process has been followed, which resulted in discarding:
- TUs from crawled websites that do not comply with the PSI directive,
- TUs with more than 99% of misspelled tokens,
- TUs identified during the manual validation process, and all the TUs from websites whose error rate in the sample extracted for manual validation is strictly above the following thresholds: 50% of TUs with language identification errors, 50% of TUs with alignment errors, 50% of TUs with tokenization errors, 20% of TUs identified as machine translated content, 50% of TUs with translation errors.
https://elrc-share.eu/terms/publicDomain.html
Multilingual (EN, ES, VI, KO, TL) COVID-19-related corpus acquired from the website of the U.S. Food and Drug Administration (https://www.fda.gov/), an official website of the United States government (25th August 2020). It contains 5417 TUs in total.
https://elrc-share.eu/terms/openUnderPSI.html
Polish-English parallel corpus from the website of the National Digital Archives (https://www.nac.gov.pl)