Arxiv dataset for summarization
Dataset for summarization of long documents. Adapted from this repo. Note that the original data are pre-tokenized, so this dataset returns " ".join(text) and adds "\n" for paragraphs. This dataset is compatible with the run_summarization.py script from Transformers if you add this line to the summarization_name_mapping variable: "ccdv/arxiv-summarization": ("article", "abstract")
Data Fields
id: paper id
article: a string containing the body of… See the full description on the dataset page: https://huggingface.co/datasets/ccdv/arxiv-summarization.
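For orientation, here is a minimal loading sketch using the Hugging Face datasets library, together with the summarization_name_mapping entry quoted above (the same mapping pattern applies to the PubMed variant below). The split name and exact loading behavior depend on your installed datasets version, so treat this as a sketch rather than a verified recipe.

```python
from datasets import load_dataset

# Field names follow the card: "article" is the paper body, "abstract" the summary.
ds = load_dataset("ccdv/arxiv-summarization", split="validation")
example = ds[0]
print(example["article"][:300])
print(example["abstract"][:300])

# In run_summarization.py, the entry from the card would be added to the
# existing dictionary, e.g.:
# summarization_name_mapping = {
#     ...
#     "ccdv/arxiv-summarization": ("article", "abstract"),
#     "ccdv/pubmed-summarization": ("article", "abstract"),
# }
```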
PubMed dataset for summarization
Dataset for summarization of long documents. Adapted from this repo. Note that the original data are pre-tokenized, so this dataset returns " ".join(text) and adds "\n" for paragraphs. This dataset is compatible with the run_summarization.py script from Transformers if you add this line to the summarization_name_mapping variable: "ccdv/pubmed-summarization": ("article", "abstract")
Data Fields
id: paper id
article: a string containing the body… See the full description on the dataset page: https://huggingface.co/datasets/ccdv/pubmed-summarization.
readerbench/ro-text-summarization dataset hosted on Hugging Face and contributed by the HF Datasets community
https://choosealicense.com/licenses/other/
We present MLSUM, the first large-scale MultiLingual SUMmarization dataset. Obtained from online newspapers, it contains 1.5M+ article/summary pairs in five different languages -- namely, French, German, Spanish, Russian, Turkish. Together with English newspapers from the popular CNN/Daily mail dataset, the collected data form a large scale multilingual dataset which can enable new research directions for the text summarization community. We report cross-lingual comparative analyses based on state-of-the-art systems. These highlight existing biases which motivate the use of a multi-lingual dataset.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Dataset Card for DIALOGSum Corpus
Dataset Description
Links
Homepage: https://aclanthology.org/2021.findings-acl.449 Repository: https://github.com/cylnlp/dialogsum Paper: https://aclanthology.org/2021.findings-acl.449 Point of Contact: https://huggingface.co/knkarthick
Dataset Summary
DialogSum is a large-scale dialogue summarization dataset, consisting of 13,460 (Plus 100 holdout data for topic generation) dialogues with corresponding… See the full description on the dataset page: https://huggingface.co/datasets/knkarthick/dialogsum.
CNN/DailyMail non-anonymized summarization dataset. There are two features:
- article: text of the news article, used as the document to be summarized
- highlights: joined text of highlights with <s> and </s> around each highlight, which is the target summary
https://choosealicense.com/licenses/unknown/
Multi-XScience, a large-scale multi-document summarization dataset created from scientific articles. Multi-XScience introduces a challenging multi-document summarization task: writing the related-work section of a paper based on its abstract and the articles it references.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for CNN Dailymail Dataset
Dataset Summary
The CNN / DailyMail Dataset is an English-language dataset containing just over 300k unique news articles as written by journalists at CNN and the Daily Mail. The current version supports both extractive and abstractive summarization, though the original version was created for machine reading and comprehension and abstractive question answering.
Supported Tasks and Leaderboards
'summarization': Versions… See the full description on the dataset page: https://huggingface.co/datasets/abisee/cnn_dailymail.
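A minimal loading sketch for this card, assuming the Hugging Face datasets library; "3.0.0" is the non-anonymized configuration commonly used for summarization, but check the dataset page for the configurations currently offered.

```python
from datasets import load_dataset

# Non-anonymized configuration; fields are "article", "highlights", and "id".
cnn_dm = load_dataset("abisee/cnn_dailymail", "3.0.0", split="train")
record = cnn_dm[0]
print(record["article"][:300])   # source news article
print(record["highlights"])      # target summary (joined highlights)
```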
https://choosealicense.com/licenses/unknown/
In this paper, we introduce a large-scale Indonesian summarization dataset. We harvest articles from this http URL, an online news portal, and obtain 215,827 document-summary pairs. We leverage pre-trained language models to develop benchmark extractive and abstractive summarization methods over the dataset with multilingual and monolingual BERT-based models. We include a thorough error analysis by examining machine-generated summaries that have low ROUGE scores, and expose issues both with ROUGE itself and with extractive and abstractive summarization models.
https://choosealicense.com/licenses/cc0-1.0/
About Dataset
Context
Text summarization condenses a large amount of information into a concise form by selecting important information and discarding what is unimportant or redundant. With the amount of textual information on the World Wide Web, text summarization is becoming increasingly important. Extractive summarization uses sentences taken verbatim from the document as the summary. The extractive… See the full description on the dataset page: https://huggingface.co/datasets/gopalkalpande/bbc-news-summary.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
A well-structured summarization dataset for the Persian language consists of 93,207 records. It is prepared for Abstractive/Extractive tasks (like cnn_dailymail for English). It can also be used in other scopes like Text Generation, Title Generation, and News Category Classification.
Note that newlines were replaced with the [n] symbol; convert them back to normal newlines (for example, t.replace("[n]", "\n")) before use.
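A small sketch of that conversion applied to a loaded split. The dataset path and the column names used here are placeholders, since the card above does not name them; substitute the actual repository ID and fields.

```python
from datasets import load_dataset

def restore_newlines(example):
    # Per the card, "\n" was stored as the literal token "[n]".
    for key in ("article", "summary"):  # placeholder column names
        example[key] = example[key].replace("[n]", "\n")
    return example

# "user/persian-summarization" is a placeholder; use the real dataset path.
ds = load_dataset("user/persian-summarization", split="train")
ds = ds.map(restore_newlines)
```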
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
HuggingFace Dataset: FactualConsistencyScoresTextSummarization
Description:
This dataset aggregates model scores assessing factual consistency across multiple summarization datasets. It is designed to highlight the thresholding issue with current SOTA factual consistency models in evaluating the factuality of text summaries.
What is the "Thresholding Issue" with SOTA Factual Consistency Models?
Existing models for detecting factual errors in summaries… See the full description on the dataset page: https://huggingface.co/datasets/achandlr/FactualConsistencyScoresTextSummarization.
https://choosealicense.com/licenses/other/
NEWSROOM is a large dataset for training and evaluating summarization systems. It contains 1.3 million articles and summaries written by authors and editors in the newsrooms of 38 major publications.
Dataset features include:
- text: input news text.
- summary: summary for the news.
Additional features:
- title: news title.
- url: URL of the news.
- date: date of the article.
- density: extractive density.
- coverage: extractive coverage.
- compression: compression ratio.
- density_bin: low, medium, high.
- coverage_bin: extractive, abstractive.
- compression_bin: low, medium, high.
This dataset can be downloaded upon request. Unzip the contents (train.jsonl, dev.jsonl, test.jsonl) into the tfds folder.
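Since the files are plain JSON Lines, one way to inspect them after unpacking is to read them directly. The path assumes the tfds folder mentioned above, and the field names follow the feature list, so adjust both if your local copy differs.

```python
import json

# Read a few records from the manually downloaded training file.
with open("tfds/train.jsonl", encoding="utf-8") as f:
    for i, line in enumerate(f):
        record = json.loads(line)
        print(record["title"])
        print(record["summary"])
        print(record["density_bin"], record["coverage_bin"])
        if i == 2:  # stop after the first three records
            break
```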
https://choosealicense.com/licenses/cc0-1.0/
Dataset Card for "billsum"
Dataset Summary
BillSum, summarization of US Congressional and California state bills. There are several features:
text: bill text.
summary: summary of the bill.
title: title of the bill.
The following features exist for US bills only (CA bills do not have them):
text_len: number of characters in the text.
sum_len: number of characters in the summary.
Supported Tasks and Leaderboards
More Information Needed
Languages
More Information Needed
Dataset Structure… See the full description on the dataset page: https://huggingface.co/datasets/FiscalNote/billsum.
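A minimal loading sketch for BillSum, assuming the Hugging Face datasets library; the split names train, test, and ca_test are the ones commonly exposed for this dataset, but verify them on the dataset page.

```python
from datasets import load_dataset

# US bills for train/test plus a separate California test split.
billsum = load_dataset("FiscalNote/billsum")
print(billsum)  # expected splits: train, test, ca_test

bill = billsum["train"][0]
print(bill["title"])
print(bill["summary"][:300])
```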
https://choosealicense.com/licenses/unknown/
Extreme Summarization (XSum) Dataset.
There are three features:
- document: input news article.
- summary: one-sentence summary of the article.
- id: BBC ID of the article.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Summary
hindi-article-summarization is an open-source dataset of instruct-style records generated from the Hindi Text Short and Large Summarization dataset. It was created as part of the Aya Open Science Initiative from Cohere For AI. This dataset can be used for any purpose, whether academic or commercial, under the terms of the CC BY-SA 4.0 License.
Supported Tasks:
Training LLMs Synthetic Data Generation Data Augmentation
Languages: Hindi
Version: 1.0
Dataset Overview… See the full description on the dataset page: https://huggingface.co/datasets/ganeshjcs/hindi-article-summarization.
https://choosealicense.com/licenses/bsd-3-clause/
BOOKSUM: A Collection of Datasets for Long-form Narrative Summarization
Authors: Wojciech Kryściński, Nazneen Rajani, Divyansh Agarwal, Caiming Xiong, Dragomir Radev
Introduction
The majority of available text summarization datasets include short-form source documents that lack long-range causal and temporal dependencies, and often contain strong layout and stylistic biases. While relevant, such datasets will offer limited challenges for future generations of text… See the full description on the dataset page: https://huggingface.co/datasets/kmfoda/booksum.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This corpus contains preprocessed posts from the Reddit dataset. The dataset consists of 3,848,330 posts with an average length of 270 words for content, and 28 words for the summary.
Features include the strings: author, body, normalizedBody, content, summary, subreddit, subreddit_id. The content field is used as the document and the summary field as the summary.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Multi-document summarization is a challenging task for which there exist few large-scale datasets. We propose Multi-XScience, a large-scale multi-document summarization dataset created from scientific articles. Multi-XScience introduces a challenging multi-document summarization task: writing the related-work section of a paper based on its abstract and the articles it references. Our work is inspired by extreme summarization, a dataset construction protocol that favours abstractive modeling approaches. Descriptive statistics and empirical results, using several state-of-the-art models trained on the Multi-XScience dataset, reveal that Multi-XScience is well suited for abstractive models.
Tackling Hallucinations in Neural Chart Summarization
Introduction
The trained model used for the investigations and the state-of-the-art (SOTA) improvements are detailed in the paper: Tackling Hallucinations in Neural Chart Summarization. This repo contains optimized input prompts and summaries after NLI filtering.
Abstract
Hallucinations in text generation occur when the system produces text that is not grounded in the input. In this work, we address the… See the full description on the dataset page: https://huggingface.co/datasets/saadob12/Autochart.