Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This corpus contains preprocessed posts from the Reddit dataset. The dataset consists of 3,848,330 posts with an average length of 270 words for content, and 28 words for the summary.
Features includes strings: author, body, normalizedBody, content, summary, subreddit, subreddit_id. Content is used as document and summary is used as summary.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Webis TLDR Corpus (2017) consists of approximately 4 Million content-summary pairs extracted for Abstractive Summarization, from the Reddit dataset for the years 2006-2016. This corpus is first of its kind from the social media domain in English and has been created to compensate the lack of variety in the datasets used for abstractive summarization research using deep learning models.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset contains 3 Million pairs of content and self-written summaries mined from Reddit. It is one of the first large-scale summarization dataset from the social media domain.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Summarize TL;DR Filtered Dataset
This is the version of the dataset used in https://arxiv.org/abs/2009.01325. If starting a new project we would recommend using https://huggingface.co/datasets/openai/summarize_from_feedback. For more information see https://github.com/openai/summarize-from-feedback and for the original TL;DR dataset see https://huggingface.co/datasets/webis/tldr-17.
Not seeing a result you expected?
Learn how you can add new datasets to our index.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This corpus contains preprocessed posts from the Reddit dataset. The dataset consists of 3,848,330 posts with an average length of 270 words for content, and 28 words for the summary.
Features includes strings: author, body, normalizedBody, content, summary, subreddit, subreddit_id. Content is used as document and summary is used as summary.