Webis-Wikipedia-IPC-23
anthology.aicmu.ac.cn
7621320
Updated 2023
Martin Potthast; Matthias Hagen; Benno Stein (2023). Webis-Wikipedia-IPC-23 [Dataset]. http://doi.org/10.5281/zenodo.7621320
Unique identifier
https://doi.org/10.5281/zenodo.7621320
Dataset updated
2023
Dataset provided by
The Web Technology & Information Systems Network
Leipzig University
Bauhaus-Universität Weimar
Friedrich Schiller University Jena
Authors
Martin Potthast; Matthias Hagen; Benno Stein
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
When an image is reused on the Web, it is often given an original caption. We hypothesize that different captions for the same image naturally form a set of mutual paraphrases. To demonstrate the suitability of this idea, we analyzed captions in the English Wikipedia, where editors frequently relabel the same image for different articles. The result is the Wikipedia-IPC (Image caption Paraphrase Corpus) dataset, which includes pairs of captions for the same image that represent paraphrases. It contains 30,237 gold, 229,877 silver, and 656,560 bronze quality paraphrase pairs.
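The pairing idea in the description can be illustrated with a minimal sketch: group captions by image, then take every unordered pair of distinct captions for the same image as a candidate paraphrase pair. This is not the authors' pipeline. The function name `caption_pairs`, the `(image_id, caption)` record layout, and the sample captions are all hypothetical, and the sketch does not attempt the gold, silver, and bronze quality grading.

```python
from collections import defaultdict
from itertools import combinations


def caption_pairs(records):
    """Return (image_id, caption_a, caption_b) for every unordered pair
    of distinct captions attached to the same image.

    `records` is an iterable of (image_id, caption) tuples; this layout
    is an assumption made for illustration only.
    """
    by_image = defaultdict(set)
    for image_id, caption in records:
        # A set drops exact duplicate captions of the same image.
        by_image[image_id].add(caption.strip())

    pairs = []
    for image_id in sorted(by_image):
        captions = sorted(by_image[image_id])
        # Images with a single caption yield no pairs.
        for a, b in combinations(captions, 2):
            pairs.append((image_id, a, b))
    return pairs


# Hypothetical sample data, not taken from the dataset.
records = [
    ("img_001.jpg", "The Eiffel Tower at night"),
    ("img_001.jpg", "Night view of the Eiffel Tower"),
    ("img_001.jpg", "The Eiffel Tower at night"),  # exact duplicate
    ("img_002.jpg", "A red fox in snow"),
]
pairs = caption_pairs(records)
```

An image with n distinct captions contributes n(n-1)/2 candidate pairs, so frequently reused images dominate the candidate set unless some filtering is applied afterwards.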