2 datasets found
  1. Webis Wikipedia Text Reuse Corpus 2018 (Webis-Wikipedia-Text-Reuse-18)

    • zenodo.org
    application/gzip, zip
    Updated Aug 29, 2022
    Cite
    Milad Alshomary; Michael Völske; Henning Wachsmuth; Benno Stein; Matthias Hagen; Martin Potthast (2022). Webis Wikipedia Text Reuse Corpus 2018 (Webis-Wikipedia-Text-Reuse-18) [Dataset]. http://doi.org/10.5281/zenodo.3372485
    Explore at: http://doi.org/10.5281/zenodo.3372485
    Available download formats: application/gzip, zip
    Dataset updated
    Aug 29, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Milad Alshomary; Michael Völske; Henning Wachsmuth; Benno Stein; Matthias Hagen; Martin Potthast
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Wikipedia Text Reuse Corpus 2018 (Webis-Wikipedia-Text-Reuse-18) contains text reuse cases extracted from within Wikipedia and between Wikipedia and a sample of the Common Crawl.

    The corpus has the following structure (see the reading sketch after this list):

    • wikipedia.tar.gz: Each line, representing a Wikipedia article, contains a json array of article_id, article_title, and article_body
    • within-wikipedia-tr-01.gz: Each line, representing a text reuse case, contains a json array of s_id (source article id), t_id (target article id), s_text (source text), t_text (target text)
    • within-wikipedia-tr-02.gz: Each line, representing a text reuse case, contains a json array of s_id (source article id), t_id (target article id), s_text (source text), t_text (target text)
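
    For orientation, here is a minimal Python sketch for iterating over one of the compressed files. The file name and the per-line layout (a four-element json array) are taken only from the description above, so treat this as an untested assumption rather than official tooling for the corpus.

        import gzip
        import json

        # Sketch: each line of within-wikipedia-tr-01.gz is assumed to hold a
        # json array [s_id, t_id, s_text, t_text], per the corpus description.
        def iter_reuse_cases(path):
            with gzip.open(path, mode="rt", encoding="utf-8") as f:
                for line in f:
                    s_id, t_id, s_text, t_text = json.loads(line)
                    yield s_id, t_id, s_text, t_text

        # Example: print the first reuse case.
        for s_id, t_id, s_text, t_text in iter_reuse_cases("within-wikipedia-tr-01.gz"):
            print(s_id, t_id, s_text[:80])
            break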

    The datasets were extracted in the work of Alshomary et al. (2018), which aimed to study text reuse phenomena related to Wikipedia at scale. A pipeline for large-scale text reuse extraction was developed and applied to Wikipedia and the Common Crawl.

  2. Webis Wikipedia Text Reuse Corpus 2018 (Webis-Wikipedia-Text-Reuse-18)

    • live.european-language-grid.eu
    • data.niaid.nih.gov
    json
    Updated Apr 11, 2024
    Cite
    (2024). Webis Wikipedia Text Reuse Corpus 2018 (Webis-Wikipedia-Text-Reuse-18) [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7748
    Explore at: https://live.european-language-grid.eu/catalogue/corpus/7748
    Available download formats: json
    Dataset updated
    Apr 11, 2024
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Wikipedia Text Reuse Corpus 2018 (Webis-Wikipedia-Text-Reuse-18) contains text reuse cases extracted from within Wikipedia and between Wikipedia and a sample of the Common Crawl.

    The corpus has the following structure (see the reading sketch after this list):

    • wikipedia.jsonl.bz2: Each line, representing a Wikipedia article, contains a json array of article_id, article_title, and article_body
    • within-wikipedia-tr-01.jsonl.bz2: Each line, representing a text reuse case, contains a json array of s_id (source article id), t_id (target article id), s_text (source text), t_text (target text)
    • within-wikipedia-tr-02.jsonl.bz2: Each line, representing a text reuse case, contains a json array of s_id (source article id), t_id (target article id), s_text (source text), t_text (target text)
    • preprocessed-web-sample.jsonl.xz: Each line, representing a web page, contains a json object of d_id, d_url, and content
    • without-wikipedia-tr.jsonl.bz2: Each line, representing a text reuse case, contains a json array of s_id (Wikipedia article id), d_id (web page id), s_text (article text), d_content (web page content)
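
    As with the first listing, a small hedged Python sketch may help. It assumes each line of the .jsonl.bz2 files is a json array in the documented field order, and each line of preprocessed-web-sample.jsonl.xz is a json object with the keys d_id, d_url, and content; none of this has been verified against the actual files.

        import bz2
        import json
        import lzma

        # Sketch: one json value per line; .xz files use LZMA, .bz2 files bzip2.
        def iter_jsonl(path):
            opener = lzma.open if path.endswith(".xz") else bz2.open
            with opener(path, mode="rt", encoding="utf-8") as f:
                for line in f:
                    yield json.loads(line)

        # Example: inspect the first Wikipedia-to-web reuse case.
        for s_id, d_id, s_text, d_content in iter_jsonl("without-wikipedia-tr.jsonl.bz2"):
            print(s_id, d_id, d_content[:80])
            break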

    The datasets were extracted in the work of Alshomary et al. (2018), which aimed to study text reuse phenomena related to Wikipedia at scale. A pipeline for large-scale text reuse extraction was developed and applied to Wikipedia and the Common Crawl.

