    Webis-Wikipedia-IPC-23

    • webis.de
    7621320
    Updated 2023
    Cite
    Marcel Gohsen; Matthias Hagen; Martin Potthast; Benno Stein (2023). Webis-Wikipedia-IPC-23 [Dataset]. http://doi.org/10.5281/zenodo.7621320
    Explore at:
    7621320
    Dataset updated
    2023
    Dataset provided by
    The Web Technology & Information Systems Network
    Bauhaus-Universität Weimar
    Friedrich Schiller University Jena
    University of Kassel, hessian.AI, and ScaDS.AI
    Authors
    Marcel Gohsen; Matthias Hagen; Martin Potthast; Benno Stein
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    When an image is reused on the Web, it is often assigned an original caption. We hypothesize that different captions for the same image naturally form a set of mutual paraphrases. To demonstrate the suitability of this idea, we analyzed captions in the English Wikipedia, where editors frequently relabel the same image for different articles. The result is the Wikipedia-IPC (Image Caption Paraphrase Corpus) dataset, which consists of pairs of captions for the same image that are paraphrases of each other. It contains 30,237 gold, 229,877 silver, and 656,560 bronze quality paraphrase pairs.

