2 datasets found
  1. Webis Wikipedia Text Reuse Corpus 2018 (Webis-Wikipedia-Text-Reuse-18)

    • zenodo.org
    application/gzip, zip
    Updated Aug 29, 2022
    Cite
    Milad Alshomary; Michael Völske; Henning Wachsmuth; Benno Stein; Matthias Hagen; Martin Potthast (2022). Webis Wikipedia Text Reuse Corpus 2018 (Webis-Wikipedia-Text-Reuse-18) [Dataset]. http://doi.org/10.5281/zenodo.3372485
    Explore at: http://doi.org/10.5281/zenodo.3372485
    Available download formats: application/gzip, zip
    Dataset updated
    Aug 29, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Milad Alshomary; Michael Völske; Henning Wachsmuth; Benno Stein; Matthias Hagen; Martin Potthast
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Wikipedia Text Reuse Corpus 2018 (Webis-Wikipedia-Text-Reuse-18) contains text reuse cases extracted from within Wikipedia and between Wikipedia and a sample of the Common Crawl.

    The corpus has the following structure (see the reading sketch after this list):

    • wikipedia.tar.gz: Each line, representing a Wikipedia article, contains a json array of article_id, article_title, and article_body
    • within-wikipedia-tr-01.gz: Each line, representing a text reuse case, contains a json array of s_id (source article id), t_id (target article id), s_text (source text), t_text (target text)
    • within-wikipedia-tr-02.gz: Each line, representing a text reuse case, contains a json array of s_id (source article id), t_id (target article id), s_text (source text), t_text (target text)
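
    For orientation, here is a minimal Python sketch for iterating over one of the compressed files. The file name and the per-line layout (a four-element json array) are taken only from the description above, so treat this as an untested assumption rather than official tooling for the corpus.

        import gzip
        import json

        # Sketch: each line of within-wikipedia-tr-01.gz is assumed to hold a
        # json array [s_id, t_id, s_text, t_text], per the corpus description.
        def iter_reuse_cases(path):
            with gzip.open(path, mode="rt", encoding="utf-8") as f:
                for line in f:
                    s_id, t_id, s_text, t_text = json.loads(line)
                    yield s_id, t_id, s_text, t_text

        # Example: print the first reuse case.
        for s_id, t_id, s_text, t_text in iter_reuse_cases("within-wikipedia-tr-01.gz"):
            print(s_id, t_id, s_text[:80])
            break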

    The datasets were extracted in the work of Alshomary et al. (2018), which aimed to study text reuse phenomena related to Wikipedia at scale. A pipeline for large-scale text reuse extraction was developed and applied to Wikipedia and the Common Crawl.

  2. Webis Wikipedia Text Reuse Corpus 2018 (Webis-Wikipedia-Text-Reuse-18)

    • live.european-language-grid.eu
    • data.niaid.nih.gov
    json
    Updated Apr 11, 2024
    Cite
    (2024). Webis Wikipedia Text Reuse Corpus 2018 (Webis-Wikipedia-Text-Reuse-18) [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7748
    Explore at: https://live.european-language-grid.eu/catalogue/corpus/7748
    Available download formats: json
    Dataset updated
    Apr 11, 2024
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Wikipedia Text Reuse Corpus 2018 (Webis-Wikipedia-Text-Reuse-18) contains text reuse cases extracted from within Wikipedia and between Wikipedia and a sample of the Common Crawl.

    The corpus has the following structure (see the reading sketch after this list):

    • wikipedia.jsonl.bz2: Each line, representing a Wikipedia article, contains a json array of article_id, article_title, and article_body
    • within-wikipedia-tr-01.jsonl.bz2: Each line, representing a text reuse case, contains a json array of s_id (source article id), t_id (target article id), s_text (source text), t_text (target text)
    • within-wikipedia-tr-02.jsonl.bz2: Each line, representing a text reuse case, contains a json array of s_id (source article id), t_id (target article id), s_text (source text), t_text (target text)
    • preprocessed-web-sample.jsonl.xz: Each line, representing a web page, contains a json object of d_id, d_url, and content
    • without-wikipedia-tr.jsonl.bz2: Each line, representing a text reuse case, contains a json array of s_id (Wikipedia article id), d_id (web page id), s_text (article text), d_content (web page content)
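
    As with the first listing, a small hedged Python sketch may help. It assumes each line of the .jsonl.bz2 files is a json array in the documented field order, and each line of preprocessed-web-sample.jsonl.xz is a json object with the keys d_id, d_url, and content; none of this has been verified against the actual files.

        import bz2
        import json
        import lzma

        # Sketch: one json value per line; .xz files use LZMA, .bz2 files bzip2.
        def iter_jsonl(path):
            opener = lzma.open if path.endswith(".xz") else bz2.open
            with opener(path, mode="rt", encoding="utf-8") as f:
                for line in f:
                    yield json.loads(line)

        # Example: inspect the first Wikipedia-to-web reuse case.
        for s_id, d_id, s_text, d_content in iter_jsonl("without-wikipedia-tr.jsonl.bz2"):
            print(s_id, d_id, d_content[:80])
            break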

    The datasets were extracted in the work of Alshomary et al. (2018), which aimed to study text reuse phenomena related to Wikipedia at scale. A pipeline for large-scale text reuse extraction was developed and applied to Wikipedia and the Common Crawl.

