Cite
Khalid Al-Khatib; Henning Wachsmuth; Matthias Hagen; Benno Stein; Kevin Lang; Jakob Herpel (2022). Webis-WikiDiscussions-18 [Dataset]. http://doi.org/10.5281/zenodo.3339152

Webis-WikiDiscussions-18

2 scholarly articles cite this dataset (View in Google Scholar)
Available download formats
application/gzip
Dataset updated
Aug 29, 2022
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Khalid Al-Khatib; Henning Wachsmuth; Matthias Hagen; Benno Stein; Kevin Lang; Jakob Herpel
License

Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

The Webis-WikiDiscussions-18 corpus is the output of parsing the entire set of Wikipedia talk pages. It contains about six million discussions comprising about 20 million turns. The turns include around 74,000 distinct tags with about 100,000 instances in total, around 7,000 distinct shortcuts with about 400,000 instances, and around 51,000 distinct inline templates with about 3.3 million instances.

The database has the following structure:

  • PAGES: PAGE-ID, URL, TITLE
  • DISCUSSIONS: DISCUSSION-ID, PAGE-ID, TITLE
  • COMMENTS: COMMENT-ID, DISCUSSION-ID, PARENT-ID, TEXT-RAW, TEXT-CLEAN, USER
  • TAGS: TAG-ID, COMMENT-ID, TAG-TEXT, TAG-CLASS
  • TEMPLATES: TEMPLATE-ID, DISCUSSION-ID, TEMPLATE-TEXT
  • SHORTCUTS: SHORTCUT-ID, COMMENT-ID, SHORTCUT-TEXT, SHORTCUT-CLASS
  • LINKS: LINK-ID, COMMENT-ID, LINK-TEXT
  • INLINE-TEMPLATES: IL-TEMPLATE-ID, COMMENT-ID, IL-TEMPLATE-TEXT, TYPE, DESCRIPTION
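
The structure above suggests that a discussion thread can be rebuilt by following PARENT-ID links between comments. The following is a minimal sketch of that idea, not the corpus's actual loading procedure. It assumes an SQLite rendering of three of the listed tables, with hyphens in names replaced by underscores, guessed column types, and a NULL PARENT-ID for the first turn of a discussion. The helper names `build_toy_db` and `thread` are hypothetical, and all rows are made-up toy data rather than corpus content.

```python
import sqlite3

# Hypothetical SQLite rendering of three of the listed tables. Hyphens in the
# original names are replaced with underscores; column types are assumptions.
SCHEMA = """
CREATE TABLE pages (page_id INTEGER PRIMARY KEY, url TEXT, title TEXT);
CREATE TABLE discussions (
    discussion_id INTEGER PRIMARY KEY,
    page_id INTEGER REFERENCES pages(page_id),
    title TEXT
);
CREATE TABLE comments (
    comment_id INTEGER PRIMARY KEY,
    discussion_id INTEGER REFERENCES discussions(discussion_id),
    parent_id INTEGER,  -- assumed NULL for the first turn of a discussion
    text_raw TEXT,
    text_clean TEXT,
    user TEXT
);
"""


def build_toy_db():
    """Create an in-memory database with made-up rows (not corpus data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute(
        "INSERT INTO pages VALUES (1, 'https://example.org/Talk:Foo', 'Talk:Foo')"
    )
    conn.execute("INSERT INTO discussions VALUES (10, 1, 'Sources')")
    conn.executemany(
        "INSERT INTO comments VALUES (?, ?, ?, ?, ?, ?)",
        [
            (100, 10, None, "raw a", "Is this source reliable?", "alice"),
            (101, 10, 100, "raw b", "I think so.", "bob"),
            (102, 10, 101, "raw c", "Agreed.", "alice"),
            (103, 10, 100, "raw d", "Not sure.", "carol"),
        ],
    )
    return conn


def thread(conn, discussion_id):
    """Return (depth, user, text_clean) tuples in reply order for one discussion."""
    rows = conn.execute(
        "SELECT comment_id, parent_id, user, text_clean FROM comments "
        "WHERE discussion_id = ? ORDER BY comment_id",
        (discussion_id,),
    ).fetchall()

    # Group comments by the comment they reply to.
    children = {}
    for comment_id, parent_id, user, text in rows:
        children.setdefault(parent_id, []).append((comment_id, user, text))

    out = []

    def walk(parent_id, depth):
        # Depth-first traversal: each reply follows its parent, one level deeper.
        for comment_id, user, text in children.get(parent_id, []):
            out.append((depth, user, text))
            walk(comment_id, depth + 1)

    walk(None, 0)
    return out


if __name__ == "__main__":
    for depth, user, text in thread(build_toy_db(), 10):
        print("  " * depth + f"{user}: {text}")
```

Building the parent-to-children map in one pass keeps the traversal linear in the number of comments. For very deep threads, an explicit stack may be preferable to recursion.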