7 datasets found
  1. Webis-Web-Errors-19

    • webis.de
    2549837
    Updated 2019
    Cite
    Johannes Kiesel; Martin Potthast; Matthias Hagen; Benno Stein; Florian Kneist (2019). Webis-Web-Errors-19 [Dataset]. http://doi.org/10.5281/zenodo.2549837
    Dataset provided by
    Friedrich Schiller University Jena
    University of Kassel, hessian.AI, and ScaDS.AI
    The Web Technology & Information Systems Network
    Bauhaus-Universität Weimar
    Authors
    Johannes Kiesel; Martin Potthast; Matthias Hagen; Benno Stein; Florian Kneist
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Webis-Web-Errors-19 comprises annotations for the 10,000 web page archives of Webis-Web-Archive-17. The annotations record whether the page is (1) mostly advertisement, (2) cut off, (3) still loading, or (4) pornographic, and to what degree (not / a bit / very) it shows (5) pop-ups, (6) CAPTCHAs, or (7) error messages.

  2. COVID-19 FDA dataset v1. Bilingual (EN, ES)

    • live.european-language-grid.eu
    tmx
    Updated Apr 30, 2020
    Cite
    (2020). COVID-19 FDA dataset v1. Bilingual (EN, ES) [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/21065
    License

    https://elrc-share.eu/terms/publicDomain.html

    Description

    Bilingual (EN, ES) COVID-19-related corpus acquired from the website of the U.S. Food and Drug Administration (https://www.fda.gov/), an official website of the United States government (25 April 2020). It contains 3,640 translation units (TUs) in total.

  3. Spanish-German website parallel corpus

    • live.european-language-grid.eu
    • data.europa.eu
    tmx
    Cite
    Spanish-German website parallel corpus [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/2907
    License

    https://elrc-share.eu/terms/openUnderPSI.html

    Description

    This is a parallel corpus of bilingual texts crawled from multilingual websites; it contains 2,840 TUs. Period of crawling: 15/11/2016 - 23/01/2017. A strict validation process was followed, which resulted in discarding:
    - TUs from crawled websites that do not comply with the PSI directive,
    - TUs with more than 99% misspelled tokens,
    - TUs identified during the manual validation process, and all TUs from websites whose error rate in the sample extracted for manual validation is strictly above the following thresholds: 50% of TUs with language identification errors, 50% of TUs with alignment errors, 50% of TUs with tokenization errors, 20% of TUs identified as machine-translated content, 50% of TUs with translation errors.

  4. Spanish-Italian website parallel corpus

    • live.european-language-grid.eu
    • data.europa.eu
    tmx
    Updated Jan 23, 2017
    Cite
    (2017). Spanish-Italian website parallel corpus [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/2903
    License

    https://elrc-share.eu/terms/openUnderPSI.html

    Description

    This is a parallel corpus of bilingual texts crawled from multilingual websites; it contains 3,319 TUs. Date of crawling: 23/01/2017. A strict validation process was followed, which resulted in discarding:
    - TUs from crawled websites that do not comply with the PSI directive,
    - TUs with more than 99% misspelled tokens,
    - TUs identified during the manual validation process, and all TUs from websites whose error rate in the sample extracted for manual validation is strictly above the following thresholds: 50% of TUs with language identification errors, 50% of TUs with alignment errors, 50% of TUs with tokenization errors, 20% of TUs identified as machine-translated content, 50% of TUs with translation errors.

  5. Maltese-English website parallel corpus

    • live.european-language-grid.eu
    tmx
    Updated Dec 16, 2016
    Cite
    (2016). Maltese-English website parallel corpus [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/2904
    License

    https://elrc-share.eu/terms/openUnderPSI.html

    Description

    This is a parallel corpus of bilingual texts crawled from multilingual websites; it contains 26,622 TUs. Date of crawling: 16/12/2016. A strict validation process was followed, which resulted in discarding:
    - TUs from crawled websites that do not comply with the PSI directive,
    - TUs with more than 99% misspelled tokens,
    - TUs identified during the manual validation process, and all TUs from websites whose error rate in the sample extracted for manual validation is strictly above the following thresholds: 50% of TUs with language identification errors, 50% of TUs with alignment errors, 50% of TUs with tokenization errors, 20% of TUs identified as machine-translated content, 50% of TUs with translation errors.

  6. COVID-19 FDA dataset v2. Multilingual (EN, ES, KO, VI, TL)

    • live.european-language-grid.eu
    tmx
    Updated Aug 24, 2020
    Cite
    (2020). COVID-19 FDA dataset v2. Multilingual (EN, ES, KO, VI, TL) [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/21355
    License

    https://elrc-share.eu/terms/publicDomain.html

    Description

    Multilingual (EN, ES, VI, KO, TL) COVID-19-related corpus acquired from the website of the U.S. Food and Drug Administration (https://www.fda.gov/), an official website of the United States government (25 August 2020). It contains 5,417 TUs in total.

  7. Polish-English parallel corpus from the website of the National Digital...

    • live.european-language-grid.eu
    • catalog.elra.info
    tmx
    Updated Nov 14, 2018
    Cite
    (2018). Polish-English parallel corpus from the website of the National Digital Archives (Processed) [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/3174
    License

    https://elrc-share.eu/terms/openUnderPSI.html

    Description

    Polish-English parallel corpus from the website of the National Digital Archives (https://www.nac.gov.pl).
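
Most of the corpora listed above are offered as tmx downloads and report their size in TUs. As a rough illustration only, and assuming a file follows the usual TMX layout (tu elements that hold one tuv per language, each wrapping a seg), the TUs could be read and counted with Python's standard library along these lines. The sample document and the function name read_translation_units are made up for this sketch; they are not taken from any of the listed datasets.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Tiny hand-written sample (hypothetical content, not from a listed corpus).
SAMPLE_TMX = """<tmx version="1.4">
  <header creationtool="example" creationtoolversion="1.0"
          segtype="sentence" o-tmf="none" adminlang="en"
          srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Wear a mask.</seg></tuv>
      <tuv xml:lang="es"><seg>Use una mascarilla.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Stay home.</seg></tuv>
      <tuv xml:lang="es"><seg>Quedese en casa.</seg></tuv>
    </tu>
  </body>
</tmx>
"""


def read_translation_units(tmx_text):
    """Return one dict per TU, mapping language code to segment text."""
    root = ET.fromstring(tmx_text)
    units = []
    for tu in root.iter("tu"):
        unit = {}
        for tuv in tu.findall("tuv"):
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text nested in inline markup.
                unit[tuv.get(XML_LANG)] = "".join(seg.itertext()).strip()
        units.append(unit)
    return units


units = read_translation_units(SAMPLE_TMX)
print(len(units), "TUs")
for unit in units:
    print(unit)
```

For a downloaded file, the same loop would apply after reading it with ET.parse(path).getroot() instead of ET.fromstring; real corpora may carry extra elements (notes, properties) that this sketch simply ignores.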
