94 datasets found
  1. arxiv-summarization

    • huggingface.co
    Updated Dec 19, 2021
    Cite
    ccdv (2021). arxiv-summarization [Dataset]. https://huggingface.co/datasets/ccdv/arxiv-summarization
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Dec 19, 2021
    Authors
    ccdv
    Description

    Arxiv dataset for summarization

    Dataset for summarization of long documents. Adapted from this repo. Note that the original data are pre-tokenized, so this dataset returns " ".join(text) and adds " " for paragraphs. This dataset is compatible with the run_summarization.py script from Transformers if you add this line to the summarization_name_mapping variable: "ccdv/arxiv-summarization": ("article", "abstract")

    Data Fields

    id: paper id
    article: a string containing the body of… See the full description on the dataset page: https://huggingface.co/datasets/ccdv/arxiv-summarization.
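As a rough illustration of the column mapping mentioned in the description, the sketch below mimics how a script could resolve the text and summary columns for a dataset name, and how pre-tokenized text can be joined with spaces. Only the `summarization_name_mapping` name and the `("article", "abstract")` pair come from the description; the helper functions and sample tokens are hypothetical, and the actual script may work differently.

```python
# Minimal sketch: the mapping name and the ("article", "abstract") pair come
# from the dataset description; the helpers below are hypothetical.
summarization_name_mapping = {
    "ccdv/arxiv-summarization": ("article", "abstract"),
}


def resolve_columns(dataset_name, mapping=summarization_name_mapping):
    """Return (text_column, summary_column), or None if the dataset is unmapped."""
    return mapping.get(dataset_name)


def join_tokens(tokens):
    """Join pre-tokenized text with single spaces, as in " ".join(text)."""
    return " ".join(tokens)


text_column, summary_column = resolve_columns("ccdv/arxiv-summarization")
print(text_column, summary_column)
print(join_tokens(["we", "study", "long", "documents", "."]))
```

An unmapped dataset name returns `None` here, so a caller can fall back to explicitly supplied column names.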

  2. pubmed-summarization

    • huggingface.co
    • opendatalab.com
    Updated Dec 1, 2021
    Cite
    ccdv (2021). pubmed-summarization [Dataset]. https://huggingface.co/datasets/ccdv/pubmed-summarization
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Dec 1, 2021
    Authors
    ccdv
    Description

    PubMed dataset for summarization

    Dataset for summarization of long documents. Adapted from this repo. Note that the original data are pre-tokenized, so this dataset returns " ".join(text) and adds " " for paragraphs. This dataset is compatible with the run_summarization.py script from Transformers if you add this line to the summarization_name_mapping variable: "ccdv/pubmed-summarization": ("article", "abstract")

    Data Fields

    id: paper id
    article: a string containing the body… See the full description on the dataset page: https://huggingface.co/datasets/ccdv/pubmed-summarization.

  3. ro-text-summarization

    • huggingface.co
    Updated Jun 15, 2020
    Cite
    ReaderBench (2020). ro-text-summarization [Dataset]. https://huggingface.co/datasets/readerbench/ro-text-summarization
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jun 15, 2020
    Dataset authored and provided by
    ReaderBench
    Description

    readerbench/ro-text-summarization dataset hosted on Hugging Face and contributed by the HF Datasets community

  4. mlsum

    • huggingface.co
    Updated Jan 1, 2001
    Cite
    reciTAL (2001). mlsum [Dataset]. https://huggingface.co/datasets/reciTAL/mlsum
    Dataset updated
    Jan 1, 2001
    Dataset authored and provided by
    reciTAL
    License

    https://choosealicense.com/licenses/other/

    Description

    We present MLSUM, the first large-scale MultiLingual SUMmarization dataset. Obtained from online newspapers, it contains 1.5M+ article/summary pairs in five different languages -- namely, French, German, Spanish, Russian, Turkish. Together with English newspapers from the popular CNN/Daily mail dataset, the collected data form a large scale multilingual dataset which can enable new research directions for the text summarization community. We report cross-lingual comparative analyses based on state-of-the-art systems. These highlight existing biases which motivate the use of a multi-lingual dataset.

  5. dialogsum

    • huggingface.co
    Updated Jun 29, 2022
    + more versions
    Cite
    Karthick Kaliannan Neelamohan (2022). dialogsum [Dataset]. https://huggingface.co/datasets/knkarthick/dialogsum
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jun 29, 2022
    Authors
    Karthick Kaliannan Neelamohan
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Dataset Card for DIALOGSum Corpus

    Dataset Description

    Links

    Homepage: https://aclanthology.org/2021.findings-acl.449
    Repository: https://github.com/cylnlp/dialogsum
    Paper: https://aclanthology.org/2021.findings-acl.449
    Point of Contact: https://huggingface.co/knkarthick

    Dataset Summary

    DialogSum is a large-scale dialogue summarization dataset, consisting of 13,460 (plus 100 holdout data for topic generation) dialogues with corresponding… See the full description on the dataset page: https://huggingface.co/datasets/knkarthick/dialogsum.

  6. bart_cnndm

    • huggingface.co
    Updated Jul 8, 2023
    + more versions
    Cite
    Yu Yang (2023). bart_cnndm [Dataset]. https://huggingface.co/datasets/yuyang/bart_cnndm
    Dataset updated
    Jul 8, 2023
    Authors
    Yu Yang
    Description

    CNN/DailyMail non-anonymized summarization dataset. There are two features:
    - article: text of the news article, used as the document to be summarized
    - highlights: joined text of highlights with <s> and </s> around each highlight, which is the target summary

  7. multi_document_summarization

    • huggingface.co
    Updated Feb 8, 2024
    Cite
    Arka Das (2024). multi_document_summarization [Dataset]. https://huggingface.co/datasets/arka0821/multi_document_summarization
    Dataset updated
    Feb 8, 2024
    Authors
    Arka Das
    License

    https://choosealicense.com/licenses/unknown/

    Description

    Multi-Document, a large-scale multi-document summarization dataset created from scientific articles. Multi-Document introduces a challenging multi-document summarization task: writing the related-work section of a paper based on its abstract and the articles it references.

  8. cnn_dailymail

    • huggingface.co
    Updated Aug 28, 2023
    + more versions
    Cite
    Abigail See (2023). cnn_dailymail [Dataset]. https://huggingface.co/datasets/abisee/cnn_dailymail
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Aug 28, 2023
    Authors
    Abigail See
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Card for CNN Dailymail Dataset

    Dataset Summary

    The CNN / DailyMail Dataset is an English-language dataset containing just over 300k unique news articles as written by journalists at CNN and the Daily Mail. The current version supports both extractive and abstractive summarization, though the original version was created for machine reading and comprehension and abstractive question answering.

    Supported Tasks and Leaderboards

    'summarization': Versions… See the full description on the dataset page: https://huggingface.co/datasets/abisee/cnn_dailymail.

  9. id_liputan6

    • huggingface.co
    Updated May 24, 2024
    Cite
    Fajri Koto (2024). id_liputan6 [Dataset]. https://huggingface.co/datasets/fajrikoto/id_liputan6
    Dataset updated
    May 24, 2024
    Authors
    Fajri Koto
    License

    https://choosealicense.com/licenses/unknown/

    Description

    In this paper, we introduce a large-scale Indonesian summarization dataset. We harvest articles from Liputan6.com, an online news portal, and obtain 215,827 document-summary pairs. We leverage pre-trained language models to develop benchmark extractive and abstractive summarization methods over the dataset with multilingual and monolingual BERT-based models. We include a thorough error analysis by examining machine-generated summaries that have low ROUGE scores, and expose both issues with ROUGE itself, as well as with extractive and abstractive summarization models.

  10. bbc-news-summary

    • huggingface.co
    Updated Mar 1, 2023
    Cite
    Gopal Kalpande (2023). bbc-news-summary [Dataset]. https://huggingface.co/datasets/gopalkalpande/bbc-news-summary
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Mar 1, 2023
    Authors
    Gopal Kalpande
    License

    https://choosealicense.com/licenses/cc0-1.0/

    Description

    About Dataset

    Context

    Text summarization condenses a large amount of information into a concise form by selecting important information and discarding unimportant and redundant information. With the amount of textual information present on the World Wide Web, text summarization is becoming very important. Extractive summarization is the approach in which exact sentences from the document are used as the summary. The extractive… See the full description on the dataset page: https://huggingface.co/datasets/gopalkalpande/bbc-news-summary.

  11. pn_summary

    • huggingface.co
    Updated Jan 4, 2021
    + more versions
    Cite
    Hooshvare Research Lab (2021). pn_summary [Dataset]. https://huggingface.co/datasets/HooshvareLab/pn_summary
    Dataset updated
    Jan 4, 2021
    Dataset authored and provided by
    Hooshvare Research Lab
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    A well-structured summarization dataset for the Persian language, consisting of 93,207 records. It is prepared for abstractive/extractive tasks (like cnn_dailymail for English). It can also be used for other purposes such as Text Generation, Title Generation, and News Category Classification. Note that the newlines were replaced with the [n] symbol. Please convert them back into normal newlines (e.g., t.replace("[n]", " ")) before using the text.
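The marker replacement described above can be sketched as a small helper. This is only an illustration: the function name and sample string are hypothetical, and because the listing shows a space as the replacement while the text asks for normal newlines, the replacement character is left as a parameter (defaulting to a newline as an assumption).

```python
def restore_newlines(text, replacement="\n", marker="[n]"):
    """Replace every [n] marker with the chosen replacement string.

    The default of "\n" is an assumption; pass replacement=" " to follow
    the example exactly as it appears in the listing.
    """
    return text.replace(marker, replacement)


# Hypothetical sample record text.
sample = "First paragraph.[n]Second paragraph."
print(restore_newlines(sample))
print(restore_newlines(sample, replacement=" "))
```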

  12. FactualConsistencyScoresTextSummarization

    • huggingface.co
    Updated Jun 15, 2024
    Cite
    Alex Chandler (2024). FactualConsistencyScoresTextSummarization [Dataset]. https://huggingface.co/datasets/achandlr/FactualConsistencyScoresTextSummarization
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jun 15, 2024
    Authors
    Alex Chandler
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    HuggingFace Dataset: FactualConsistencyScoresTextSummarization

    Description:

    This dataset aggregates model scores assessing factual consistency across multiple summarization datasets. It is designed to highlight the thresholding issue with current SOTA factual consistency models in evaluating the factuality of text summarizations.

    What is the "Thresholding Issue" with SOTA Factual Consistency Models?

    Existing models for detecting factual errors in summaries… See the full description on the dataset page: https://huggingface.co/datasets/achandlr/FactualConsistencyScoresTextSummarization.

  13. newsroom

    • huggingface.co
    Updated May 28, 2024
    + more versions
    Cite
    Cornell LIL Lab (2024). newsroom [Dataset]. https://huggingface.co/datasets/lil-lab/newsroom
    Dataset updated
    May 28, 2024
    Dataset authored and provided by
    Cornell LIL Lab
    License

    https://choosealicense.com/licenses/other/

    Description

    NEWSROOM is a large dataset for training and evaluating summarization systems. It contains 1.3 million articles and summaries written by authors and editors in the newsrooms of 38 major publications.

    Dataset features include:
    - text: Input news text.
    - summary: Summary for the news.

    Additional features:
    - title: news title.
    - url: URL of the news.
    - date: date of the article.
    - density: extractive density.
    - coverage: extractive coverage.
    - compression: compression ratio.
    - density_bin: low, medium, high.
    - coverage_bin: extractive, abstractive.
    - compression_bin: low, medium, high.

    This dataset can be downloaded upon request. Unzip all the contents ("train.jsonl", "dev.jsonl", "test.jsonl") to the tfds folder.
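Once the split files are unzipped, each one is a JSON Lines file with one record per line. The sketch below shows one way such a file could be read with the Python standard library. The `read_jsonl` helper, the temporary file, and the sample record are hypothetical stand-ins, not part of the dataset's own tooling.

```python
import json
import tempfile
from pathlib import Path


def read_jsonl(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


# Hypothetical stand-in for an unzipped split such as train.jsonl.
with tempfile.TemporaryDirectory() as folder:
    sample = Path(folder) / "train.jsonl"
    sample.write_text(
        json.dumps({"text": "Input news text.", "summary": "Short summary."}) + "\n",
        encoding="utf-8",
    )
    rows = read_jsonl(sample)
    print(len(rows), rows[0]["summary"])
```

Reading line by line keeps the parsing simple; for the full 1.3 million records a generator that yields one record at a time may be preferable to building a list.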

  14. billsum

    • huggingface.co
    • tensorflow.org
    • +2 more
    Updated Jun 3, 2024
    Cite
    FiscalNote (2024). billsum [Dataset]. https://huggingface.co/datasets/FiscalNote/billsum
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jun 3, 2024
    Dataset authored and provided by
    FiscalNote (http://fiscalnote.com/)
    License

    https://choosealicense.com/licenses/cc0-1.0/

    Description

    Dataset Card for "billsum"

    Dataset Summary

    BillSum, summarization of US Congressional and California state bills. There are several features:

    - text: bill text.
    - summary: summary of the bills.
    - title: title of the bills. Feature for US bills; CA bills do not have it.
    - text_len: number of chars in text.
    - sum_len: number of chars in summary.
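The two length features are described as character counts of the text and summary fields. A minimal sketch of how they could be derived from a record follows; the helper name and the sample record are hypothetical, and the dataset itself may compute these values differently.

```python
def add_length_features(record):
    """Return a copy of the record with text_len and sum_len character counts."""
    enriched = dict(record)
    enriched["text_len"] = len(record["text"])
    enriched["sum_len"] = len(record["summary"])
    return enriched


# Hypothetical record using the field names from the listing.
example = {
    "text": "A bill to amend the code.",
    "summary": "Amends the code.",
    "title": "Example Act",
}
print(add_length_features(example))
```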

    Supported Tasks and Leaderboards

    More Information Needed

    Languages

    More Information Needed

    Dataset Structure… See the full description on the dataset page: https://huggingface.co/datasets/FiscalNote/billsum.
    
  15. xsum

    • huggingface.co
    Updated Oct 23, 2015
    + more versions
    Cite
    EdinburghNLP - Natural Language Processing Group at the University of Edinburgh (2015). xsum [Dataset]. https://huggingface.co/datasets/EdinburghNLP/xsum
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Oct 23, 2015
    Dataset authored and provided by
    EdinburghNLP - Natural Language Processing Group at the University of Edinburgh
    License

    https://choosealicense.com/licenses/unknown/

    Description

    Extreme Summarization (XSum) Dataset.

    There are three features:
    - document: Input news article.
    - summary: One sentence summary of the article.
    - id: BBC ID of the article.

  16. hindi-article-summarization

    • huggingface.co
    Updated Oct 15, 2023
    Cite
    Ganesh Jagadeesan (2023). hindi-article-summarization [Dataset]. https://huggingface.co/datasets/ganeshjcs/hindi-article-summarization
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Oct 15, 2023
    Authors
    Ganesh Jagadeesan
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Summary

    hindi-article-summarization is an open source dataset of instruct-style records generated from the Hindi Text Short and Large Summarization dataset. It was created as part of the Aya Open Science Initiative from Cohere For AI. This dataset can be used for any purpose, whether academic or commercial, under the terms of the CC BY-SA 4.0 License.

    Supported Tasks:
    - Training LLMs
    - Synthetic Data Generation
    - Data Augmentation

    Languages: Hindi
    Version: 1.0

    Dataset Overview… See the full description on the dataset page: https://huggingface.co/datasets/ganeshjcs/hindi-article-summarization.
    
  17. booksum

    • huggingface.co
    Updated Dec 24, 2021
    Cite
    Karim Foda (2021). booksum [Dataset]. https://huggingface.co/datasets/kmfoda/booksum
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Dec 24, 2021
    Authors
    Karim Foda
    License

    https://choosealicense.com/licenses/bsd-3-clause/

    Description

    BOOKSUM: A Collection of Datasets for Long-form Narrative Summarization

    Authors: Wojciech Kryściński, Nazneen Rajani, Divyansh Agarwal, Caiming Xiong, Dragomir Radev

    Introduction

    The majority of available text summarization datasets include short-form source documents that lack long-range causal and temporal dependencies, and often contain strong layout and stylistic biases. While relevant, such datasets will offer limited challenges for future generations of text… See the full description on the dataset page: https://huggingface.co/datasets/kmfoda/booksum.

  18. tldr-17

    • huggingface.co
    Updated Jun 5, 2023
    Cite
    Webis Group (2023). tldr-17 [Dataset]. https://huggingface.co/datasets/webis/tldr-17
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jun 5, 2023
    Dataset authored and provided by
    Webis Group
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This corpus contains preprocessed posts from the Reddit dataset. The dataset consists of 3,848,330 posts with an average length of 270 words for content, and 28 words for the summary.

    Features include the strings author, body, normalizedBody, content, summary, subreddit, and subreddit_id. Content is used as the document and summary is used as the summary.

  19. multi_xscience

    • huggingface.co
    Updated Feb 14, 2023
    Cite
    BigScience Biomedical Datasets (2023). multi_xscience [Dataset]. https://huggingface.co/datasets/bigbio/multi_xscience
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Feb 14, 2023
    Dataset authored and provided by
    BigScience Biomedical Datasets
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Multi-document summarization is a challenging task for which there exist few large-scale datasets. We propose Multi-XScience, a large-scale multi-document summarization dataset created from scientific articles. Multi-XScience introduces a challenging multi-document summarization task: writing the related-work section of a paper based on its abstract and the articles it references. Our work is inspired by extreme summarization, a dataset construction protocol that favours abstractive modeling approaches. Descriptive statistics and empirical results, using several state-of-the-art models trained on the Multi-XScience dataset, reveal that Multi-XScience is well suited for abstractive models.

  20. Autochart

    • huggingface.co
    Updated Oct 28, 2024
    + more versions
    Cite
    Saad Obaid ul Islam (2024). Autochart [Dataset]. https://huggingface.co/datasets/saadob12/Autochart
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Oct 28, 2024
    Authors
    Saad Obaid ul Islam
    Description

    Tackling Hallucinations in Neural Chart Summarization

    Introduction

    The trained model for investigations and state-of-the-art (SOTA) improvements are detailed in the paper: Tackling Hallucinations in Neural Chart Summarization. This repo contains optimized input prompts and summaries after NLI-filtering.

    Abstract

    Hallucinations in text generation occur when the system produces text that is not grounded in the input. In this work, we address the… See the full description on the dataset page: https://huggingface.co/datasets/saadob12/Autochart.

ccdv (2021). arxiv-summarization [Dataset]. https://huggingface.co/datasets/ccdv/arxiv-summarization

arxiv-summarization

ccdv/arxiv-summarization

51 scholarly articles cite this dataset (View in Google Scholar)