Cite
Milad Alshomary; Michael Völske; Henning Wachsmuth; Benno Stein; Matthias Hagen; Martin Potthast (2018). Webis Wikipedia Text Reuse Corpus 2018 (Webis-Wikipedia-Text-Reuse-18) [Dataset]. http://doi.org/10.5281/zenodo.3372485

Webis Wikipedia Text Reuse Corpus 2018 (Webis-Wikipedia-Text-Reuse-18)

Available download formats: gz, zip
Dataset updated
Jul 5, 2018
Dataset provided by
Bauhaus-Universität Weimar (http://www.uni-weimar.de/)
Leipzig University (http://www.uni-leipzig.de/)
Martin-Luther-University Halle-Wittenberg (http://www.uni-halle.de/)
Paderborn University (http://www.uni-paderborn.de/)
Authors
Milad Alshomary; Michael Völske; Henning Wachsmuth; Benno Stein; Matthias Hagen; Martin Potthast
License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

The Wikipedia Text Reuse Corpus 2018 (Webis-Wikipedia-Text-Reuse-18) contains text reuse cases extracted from within Wikipedia and between Wikipedia and a sample of the Common Crawl.

The corpus has the following structure:

  • wikipedia.tar.gz: Each line represents a Wikipedia article and contains a JSON array of article_id, article_title, and article_body
  • within-wikipedia-tr-01.gz: Each line represents a text reuse case and contains a JSON array of s_id (source article id), t_id (target article id), s_text (source text), and t_text (target text)
  • within-wikipedia-tr-02.gz: Each line represents a text reuse case and contains a JSON array of s_id (source article id), t_id (target article id), s_text (source text), and t_text (target text)
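
The line-per-record layout described above could be read with a short Python sketch like the following. It rests on assumptions the description does not confirm: that each line is a positional JSON array in the order s_id, t_id, s_text, t_text, that the files are UTF-8 encoded, and that the .gz files are plain gzip streams. The names ReuseCase, parse_reuse_line, and iter_reuse_cases are hypothetical and not part of the corpus.

```python
import gzip
import json
from typing import Iterator, NamedTuple


class ReuseCase(NamedTuple):
    """One text reuse case (hypothetical container, not part of the corpus)."""
    s_id: str
    t_id: str
    s_text: str
    t_text: str


def parse_reuse_line(line: str) -> ReuseCase:
    # Assumption: the line is a JSON array ordered as
    # [s_id, t_id, s_text, t_text], as the description suggests.
    s_id, t_id, s_text, t_text = json.loads(line)
    # IDs are converted to strings so numeric and string ids compare alike.
    return ReuseCase(str(s_id), str(t_id), s_text, t_text)


def iter_reuse_cases(path: str) -> Iterator[ReuseCase]:
    # Stream the gzip file line by line instead of loading it into memory.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():  # skip blank lines
                yield parse_reuse_line(line)
```

A call such as `for case in iter_reuse_cases("within-wikipedia-tr-01.gz"): ...` would then yield one record per line. If the lines turn out to be JSON objects keyed by field name rather than arrays, the unpacking in parse_reuse_line would need to change accordingly.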

The datasets were extracted in the work by Alshomary et al. (2018), which aimed to study text reuse phenomena related to Wikipedia at scale. A pipeline for large-scale text reuse extraction was developed and applied to Wikipedia and the Common Crawl.
