1 dataset found
  1. Webis Wikipedia Text Reuse Corpus 2018 (Webis-Wikipedia-Text-Reuse-18)

    • explore.openaire.eu
    • live.european-language-grid.eu
    • +1 more
    Updated Jul 5, 2018
    Cite
    Milad Alshomary; Michael Völske; Henning Wachsmuth; Benno Stein; Matthias Hagen; Martin Potthast (2018). Webis Wikipedia Text Reuse Corpus 2018 (Webis-Wikipedia-Text-Reuse-18) [Dataset]. http://doi.org/10.5281/zenodo.3372484
    Authors: Milad Alshomary; Michael Völske; Henning Wachsmuth; Benno Stein; Matthias Hagen; Martin Potthast
    Description

    The Wikipedia Text Reuse Corpus 2018 (Webis-Wikipedia-Text-Reuse-18) contains text reuse cases extracted from within Wikipedia and between Wikipedia and a sample of the Common Crawl. The corpus has the following structure:

    • wikipedia.jsonl.bz2: Each line, representing a Wikipedia article, contains a JSON array of article_id, article_title, and article_body.
    • within-wikipedia-tr-01.jsonl.bz2: Each line, representing a text reuse case, contains a JSON array of s_id (source article id), t_id (target article id), s_text (source text), and t_text (target text).
    • within-wikipedia-tr-02.jsonl.bz2: Each line, representing a text reuse case, contains a JSON array of s_id (source article id), t_id (target article id), s_text (source text), and t_text (target text).
    • preprocessed-web-sample.jsonl.xz: Each line, representing a web page, contains a JSON object with d_id, d_url, and content.
    • without-wikipedia-tr.jsonl.bz2: Each line, representing a text reuse case, contains a JSON array of s_id (Wikipedia article id), d_id (web page id), s_text (article text), and d_content (web page content).

    The datasets were extracted in the work by Alshomary et al. 2018, which aimed to study text reuse phenomena related to Wikipedia at scale. A pipeline for large-scale text reuse extraction was developed and applied to Wikipedia and the Common Crawl.

    References: Milad Alshomary, Michael Völske, Tristan Licht, Henning Wachsmuth, Benno Stein, Matthias Hagen, and Martin Potthast. Wikipedia Text Reuse: Within and Without. In Leif Azzopardi et al., editors, Advances in Information Retrieval. 41st European Conference on IR Research (ECIR 2019), volume 11437 of Lecture Notes in Computer Science, pages 747-754, Berlin Heidelberg New York, April 2019. Springer.
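    The file layouts above can be sketched as a small reader in Python. The file names and the array layout (s_id, t_id, s_text, t_text) come from the dataset description; the sample record below is made up purely for illustration.

    ```python
    # Minimal sketch of reading the corpus's bz2-compressed JSONL files.
    # File names follow the dataset description; the sample record is hypothetical.
    import bz2
    import json


    def read_jsonl_bz2(path):
        """Yield one parsed JSON value per line of a bz2-compressed JSONL file,
        e.g. within-wikipedia-tr-01.jsonl.bz2."""
        with bz2.open(path, mode="rt", encoding="utf-8") as f:
            for line in f:
                yield json.loads(line)


    def reuse_case(record):
        """Unpack a within-Wikipedia text reuse record (a 4-element JSON array)
        into a dict keyed by the field names from the description."""
        s_id, t_id, s_text, t_text = record
        return {"s_id": s_id, "t_id": t_id, "s_text": s_text, "t_text": t_text}


    # Hypothetical record in the documented array layout:
    sample_line = '[12, 34, "source sentence", "target sentence"]'
    case = reuse_case(json.loads(sample_line))
    print(case["s_id"], case["t_id"])  # → 12 34
    ```

    The same pattern applies to the other files; only preprocessed-web-sample.jsonl.xz differs, using lzma.open instead of bz2.open and JSON objects instead of arrays.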


22 scholarly articles cite this dataset (View in Google Scholar)