4 datasets found
  1. Webis-TLDR-17

    • webis.de
    • anthology.aicmu.ac.cn
    Updated 2017
    Cite
    Shahbaz Syed; Michael Völske; Martin Potthast; Benno Stein (2017). Webis-TLDR-17 [Dataset]. http://doi.org/10.5281/zenodo.1043504
    48 scholarly articles cite this dataset (View in Google Scholar)
    Dataset updated
    2017
    Dataset provided by
    The Web Technology & Information Systems Network
    Bauhaus-Universität Weimar
    NEC Laboratories Europe
    Artefact Germany, Bauhaus-Universität Weimar
    University of Kassel, hessian.AI, and ScaDS.AI
    Authors
    Shahbaz Syed; Michael Völske; Martin Potthast; Benno Stein
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Webis TLDR Corpus (2017) consists of approximately 4 million content-summary pairs extracted from Reddit posts spanning the years 2006-2016 for abstractive summarization. It is the first corpus of its kind from the social media domain in English and was created to compensate for the lack of variety in the datasets used for abstractive summarization research with deep learning models.

  2. Webis-TLDR-17 Corpus

    • zenodo.org
    • paperswithcode.com
    Updated Jan 24, 2020
    Cite
    Shahbaz Syed; Michael Voelske; Martin Potthast; Benno Stein (2020). Webis-TLDR-17 Corpus [Dataset]. http://doi.org/10.5281/zenodo.1043504
    Available download formats
    zip
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Shahbaz Syed; Michael Voelske; Martin Potthast; Benno Stein
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This corpus contains preprocessed posts from the Reddit dataset, suitable for abstractive summarization using deep learning. It is distributed as a JSON Lines file in which each line is a JSON object representing one post. The schema of each post is shown below:

    • author: string (nullable = true)
    • body: string (nullable = true)
    • normalizedBody: string (nullable = true)
    • content: string (nullable = true)
    • content_len: long (nullable = true)
    • summary: string (nullable = true)
    • summary_len: long (nullable = true)
    • id: string (nullable = true)
    • subreddit: string (nullable = true)
    • subreddit_id: string (nullable = true)
    • title: string (nullable = true)

    Specifically, the content and summary fields can be used directly as inputs to a deep learning model (e.g., a sequence-to-sequence model). The dataset consists of 3,848,330 posts with an average length of 270 words for the content and 28 words for the summary. It combines both Submissions and Comments merged on a common schema; as a result, most of the comments, which do not belong to any submission, have null as their title.
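
    As a rough sketch of how these fields might be consumed (the file name below is an assumption; use whatever the extracted archive actually contains), the following Python snippet streams (content, summary) pairs one post at a time:

    import json

    def iter_pairs(path):
        # Stream (content, summary) pairs from the corpus, one JSON object per line,
        # following the schema listed above.
        with open(path, encoding="utf-8") as corpus:
            for line in corpus:
                post = json.loads(line)
                yield post["content"], post["summary"]

    # Example usage: inspect the first pair without loading the whole corpus into memory.
    # "corpus-webis-tldr-17.json" is an assumed file name, not part of the dataset description.
    content, summary = next(iter_pairs("corpus-webis-tldr-17.json"))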

    Note: This corpus does not contain a separate test set, so it is up to users to divide the corpus into appropriate training, validation, and test sets.
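
    Since no official split ships with the corpus, one purely illustrative way to carve out training, validation, and test portions is a seeded random split over post indices (the fractions below are placeholders, not a recommendation from the corpus authors):

    import random

    def split_indices(n_posts, val_frac=0.05, test_frac=0.05, seed=0):
        # Shuffle post indices reproducibly and slice them into three disjoint sets.
        indices = list(range(n_posts))
        random.Random(seed).shuffle(indices)
        n_val = int(n_posts * val_frac)
        n_test = int(n_posts * test_frac)
        return (indices[n_val + n_test:],       # training
                indices[:n_val],                # validation
                indices[n_val:n_val + n_test])  # test

    train_idx, val_idx, test_idx = split_indices(3_848_330)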

  3. openai-summarize-tldr

    • huggingface.co
    Updated Sep 13, 2024
    + more versions
    Cite
    Martim Santos (2024). openai-summarize-tldr [Dataset]. https://huggingface.co/datasets/martimfasantos/openai-summarize-tldr
    Available download formats
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 13, 2024
    Authors
    Martim Santos
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Summarize TL;DR Filtered Dataset

    This is the version of the dataset used in https://arxiv.org/abs/2009.01325. If you are starting a new project, we recommend using https://huggingface.co/datasets/openai/summarize_from_feedback instead. For more information, see https://github.com/openai/summarize-from-feedback; for the original TL;DR dataset, see https://huggingface.co/datasets/webis/tldr-17.
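
    A minimal loading sketch, assuming the Hugging Face datasets library is installed (the repository id comes from the citation above; split and column names should be verified on the dataset page):

    from datasets import load_dataset

    # Generic `datasets` usage, not something prescribed by the dataset card itself.
    ds = load_dataset("martimfasantos/openai-summarize-tldr")
    print(ds)  # shows the available splits and their sizes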

  4. Reddit Relationships

    • kaggle.com
    Updated Jul 7, 2023
    Cite
    Jan de Boer (2023). Reddit Relationships [Dataset]. https://www.kaggle.com/datasets/janldeboer/reddit-relationships
    Available download formats
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 7, 2023
    Dataset provided by
    Kaggle
    Authors
    Jan de Boer
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is derived from this Reddit dataset.

    We keep only the id and content of posts in the subreddit "relationships" whose content is longer than 50 characters. Our intent is to use it for fine-tuning text-generation models.
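
    A sketch of the filtering described above, assuming the source is the Webis-TLDR-17 JSON Lines corpus from entry 2 (field names follow the schema listed there; the source file name is hypothetical):

    import json

    def relationships_posts(path, min_chars=50):
        # Keep only the id and content of posts from the "relationships" subreddit
        # whose content is longer than min_chars characters.
        with open(path, encoding="utf-8") as corpus:
            for line in corpus:
                post = json.loads(line)
                content = post.get("content") or ""
                if post.get("subreddit") == "relationships" and len(content) > min_chars:
                    yield {"id": post["id"], "content": content}

    filtered = list(relationships_posts("corpus-webis-tldr-17.json"))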
