Cite
Syed, Shahbaz; Voelske, Michael; Potthast, Martin; Stein, Benno (2017). Webis-TLDR-17 Corpus [Dataset]. http://doi.org/10.5281/zenodo.1043504

Webis-TLDR-17 Corpus

9 scholarly articles cite this dataset
Available download formats: zip
Dataset updated Nov 7, 2017
Dataset provided by
Bauhaus-Universität Weimar (http://www.uni-weimar.de/)
Authors
Syed, Shahbaz; Voelske, Michael; Potthast, Martin; Stein, Benno
License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

This corpus contains preprocessed posts from the Reddit dataset, suitable for abstractive summarization using deep learning. The corpus is a JSON file in which each line is a JSON object representing one post. The schema of each post is shown below:

  • author: string (nullable = true)
  • body: string (nullable = true)
  • normalizedBody: string (nullable = true)
  • content: string (nullable = true)
  • content_len: long (nullable = true)
  • summary: string (nullable = true)
  • summary_len: long (nullable = true)
  • id: string (nullable = true)
  • subreddit: string (nullable = true)
  • subreddit_id: string (nullable = true)
  • title: string (nullable = true)

Specifically, the content and summary fields can be used directly as inputs to a deep learning model (e.g., a sequence-to-sequence model). The dataset consists of 3,848,330 posts, with an average length of 270 words for the content and 28 words for the summary. It combines both Submissions and Comments, merged on a common schema. As a result, most comments that do not belong to any submission have null as their title.
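As a minimal sketch of reading the corpus, assuming the unpacked file is line-delimited JSON with the fields listed above, the following Python streams (content, summary) pairs. The function name `iter_pairs` and the two sample records are hypothetical and only illustrate the format; they are not taken from the corpus.

```python
import io
import json
from typing import IO, Iterator, Tuple


def iter_pairs(lines: IO[str]) -> Iterator[Tuple[str, str]]:
    """Yield (content, summary) pairs from a line-delimited JSON stream."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        post = json.loads(line)
        content = post.get("content")
        summary = post.get("summary")
        # Every field is nullable in the schema, so drop incomplete posts.
        if content and summary:
            yield content, summary


# Hypothetical records in the documented schema (not real corpus data).
sample = io.StringIO(
    '{"id": "a1", "content": "long post text", "summary": "short", "title": null}\n'
    '{"id": "b2", "content": null, "summary": "orphan", "title": "t"}\n'
)
pairs = list(iter_pairs(sample))
print(pairs)
```

With the real corpus, the same function could be given a file object opened with `encoding="utf-8"` on whatever file the zip archive unpacks to. Streaming line by line avoids loading all 3,848,330 posts into memory at once.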

Note: This corpus does not contain a separate test set, so it is up to users to divide it into appropriate training, validation, and test sets.
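One possible way to make such a division reproducible is to hash each post's id into a fixed bucket, so that a post always lands in the same split regardless of file order. This is only a sketch under stated assumptions: the function `assign_split`, the 90/5/5 ratio, and the sample ids are hypothetical and not prescribed by the corpus authors, and it assumes every post has a unique, non-null id.

```python
import hashlib


def assign_split(post_id: str, val_pct: int = 5, test_pct: int = 5) -> str:
    """Map a post id to 'train', 'validation', or 'test' deterministically."""
    digest = hashlib.sha256(post_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # bucket in 0..99
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "validation"
    return "train"


# Hypothetical ids; real ones would come from the "id" field of each post.
ids = [f"post{i}" for i in range(1000)]
counts = {"train": 0, "validation": 0, "test": 0}
for post_id in ids:
    counts[assign_split(post_id)] += 1
print(counts)
```

Because the assignment depends only on the id, the split stays stable across runs and machines, which a shuffled random split would not unless the seed and input order were also fixed.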
