1 dataset found
  1. Webis-TLDR-17 Corpus Dataset

    • paperswithcode.com
    • zenodo.org
    Updated Aug 31, 2017
    Cite
    Michael Völske; Martin Potthast; Shahbaz Syed; Benno Stein (2017). Webis-TLDR-17 Corpus Dataset [Dataset]. https://paperswithcode.com/dataset/webis-tldr-17-corpus
    Dataset updated: Aug 31, 2017
    Authors: Michael Völske; Martin Potthast; Shahbaz Syed; Benno Stein
    Description

    This corpus contains preprocessed posts from the Reddit dataset, suitable for abstractive summarization using deep learning. The format is a JSON Lines file in which each line is a JSON object representing a post. The schema of each post is shown below:

    - author: string (nullable = true)
    - body: string (nullable = true)
    - normalizedBody: string (nullable = true)
    - content: string (nullable = true)
    - content_len: long (nullable = true)
    - summary: string (nullable = true)
    - summary_len: long (nullable = true)
    - id: string (nullable = true)
    - subreddit: string (nullable = true)
    - subreddit_id: string (nullable = true)
    - title: string (nullable = true)
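    As a quick sanity check, the file can be read line by line with any JSON parser. The Python sketch below parses the first record and prints each schema field; the file name corpus-webis-tldr-17.json is an assumption and should be adjusted to your local copy.

        import json

        # Minimal sketch of reading one record from the JSON Lines file. The file
        # name "corpus-webis-tldr-17.json" is an assumption; adjust it to your
        # local copy of the corpus.
        with open("corpus-webis-tldr-17.json", encoding="utf-8") as f:
            first_post = json.loads(next(f))

        # Every field in the schema above appears as a key of the parsed object.
        for field in ("author", "body", "normalizedBody", "content", "content_len",
                      "summary", "summary_len", "id", "subreddit", "subreddit_id",
                      "title"):
            print(field, "->", first_post.get(field))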

    Specifically, the content and summary fields can be used directly as the input and target of a deep learning model (e.g., a sequence-to-sequence model). The dataset consists of 3,848,330 posts with an average content length of 270 words and an average summary length of 28 words. It is a combination of the Submissions and Comments, merged on the common schema; as a result, most of the comments, which do not belong to any submission, have null as their title.
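    For a summarization pipeline, one straightforward approach is to stream (content, summary) pairs from the file, as in the sketch below. Skipping records where either field is null is a choice made here for illustration, not something the corpus prescribes, and the file name is again an assumption.

        import json

        # A sketch of streaming (content, summary) pairs for a sequence-to-sequence
        # pipeline. Skipping records with a missing content or summary is an
        # assumption, since every field in the schema is nullable; the file name
        # is an assumption as well.
        def iter_pairs(path="corpus-webis-tldr-17.json"):
            with open(path, encoding="utf-8") as f:
                for line in f:
                    post = json.loads(line)
                    if post.get("content") and post.get("summary"):
                        yield post["content"], post["summary"]

        # Example: rough word-count statistics over the first 1,000 pairs.
        sample = [(len(c.split()), len(s.split()))
                  for (c, s), _ in zip(iter_pairs(), range(1000))]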

    Note: This corpus does not contain a separate test set; it is up to users to divide the corpus into appropriate training, validation, and test sets.
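    A minimal sketch of one possible split is shown below; the 98/1/1 ratio, the fixed seed, and the output file names are arbitrary choices for illustration, not part of the corpus.

        import random

        # One possible way to carve out train/validation/test splits, since the
        # corpus ships without an official split. The 98/1/1 ratio, the fixed seed,
        # and the file names are assumptions; adjust them to your setup.
        random.seed(42)
        outputs = {name: open(f"tldr-{name}.jsonl", "w", encoding="utf-8")
                   for name in ("train", "valid", "test")}

        with open("corpus-webis-tldr-17.json", encoding="utf-8") as f:
            for line in f:
                r = random.random()
                split = "train" if r < 0.98 else ("valid" if r < 0.99 else "test")
                outputs[split].write(line)

        for handle in outputs.values():
            handle.close()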
