1 dataset found

E
Webis Wikipedia Text Reuse Corpus 2018 (Webis-Wikipedia-Text-Reuse-18)
live.european-language-grid.eu
data.niaid.nih.gov
json
Updated Apr 11, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2024). Webis Wikipedia Text Reuse Corpus 2018 (Webis-Wikipedia-Text-Reuse-18) [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7748
Explore at:
jsonAvailable download formats
Dataset updated
Apr 11, 2024
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The Wikipedia Text Reuse Corpus 2018 (Webis-Wikipedia-Text-Reuse-18) containing text reuse cases extracted from within Wikipedia and in between Wikipedia and a sample of the Common Crawl.The corpus has following structure:
wikipedia.jsonl.bz2: Each line, representing a Wikipedia article, contains a json array of article_id, article_title, and article_body
within-wikipedia-tr-01.jsonl.bz2: Each line, representing a text reuse case, contains a json array of s_id (source article id), t_id (target article id), s_text (source text), t_text (target text)
within-wikipedia-tr-02.jsonl.bz2: Each line, representing a text reuse case, contains a json array of s_id (source article id), t_id (target article id), s_text (source text), t_text (target text)
preprocessed-web-sample.jsonl.xz: Each line, representing a web page, contains a json object of d_id, d_url, and content
without-wikipedia-tr.jsonl.bz2: Each line, representing a text reuse case, contains a json array of s_id (Wikipedia article id), d_id (web page id), s_text (article text), d_content (web page content)
The datasets were extracted in the work by Alshomary et al. 2018 that aimed to study the text reuse phenomena related to Wikipedia at scale. A pipeline for large scale text reuse extraction was developed and used on Wikipedia and the CommonCrawl.
Not seeing a result you expected?
Learn how you can add new datasets to our index.

Facebook

Twitter

Click to copy link

Link copied

Cite

(2024). Webis Wikipedia Text Reuse Corpus 2018 (Webis-Wikipedia-Text-Reuse-18) [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7748

Webis Wikipedia Text Reuse Corpus 2018 (Webis-Wikipedia-Text-Reuse-18)

Explore at:

jsonAvailable download formats

Dataset updated

Apr 11, 2024

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

The Wikipedia Text Reuse Corpus 2018 (Webis-Wikipedia-Text-Reuse-18) containing text reuse cases extracted from within Wikipedia and in between Wikipedia and a sample of the Common Crawl.The corpus has following structure:

wikipedia.jsonl.bz2: Each line, representing a Wikipedia article, contains a json array of article_id, article_title, and article_body

within-wikipedia-tr-01.jsonl.bz2: Each line, representing a text reuse case, contains a json array of s_id (source article id), t_id (target article id), s_text (source text), t_text (target text)

within-wikipedia-tr-02.jsonl.bz2: Each line, representing a text reuse case, contains a json array of s_id (source article id), t_id (target article id), s_text (source text), t_text (target text)

preprocessed-web-sample.jsonl.xz: Each line, representing a web page, contains a json object of d_id, d_url, and content

without-wikipedia-tr.jsonl.bz2: Each line, representing a text reuse case, contains a json array of s_id (Wikipedia article id), d_id (web page id), s_text (article text), d_content (web page content)

The datasets were extracted in the work by Alshomary et al. 2018 that aimed to study the text reuse phenomena related to Wikipedia at scale. A pipeline for large scale text reuse extraction was developed and used on Wikipedia and the CommonCrawl.

Clear search

Close search

Google apps

Main menu