Arxiv dataset for summarization
Dataset for summarization of long documents. Adapted from this repo. Note that the original data are pre-tokenized, so this dataset returns " ".join(text) and adds "\n" for paragraphs. This dataset is compatible with the run_summarization.py script from Transformers if you add this line to the summarization_name_mapping variable: "ccdv/arxiv-summarization": ("article", "abstract")
Data Fields
id: paper id
article: a string containing the body of… See the full description on the dataset page: https://huggingface.co/datasets/ccdv/arxiv-summarization.
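For orientation, here is a minimal loading sketch using the Hugging Face datasets library, together with the summarization_name_mapping entry quoted above (the same mapping pattern applies to the PubMed variant below). The split name and exact loading behavior depend on your installed datasets version, so treat this as a sketch rather than a verified recipe.

```python
from datasets import load_dataset

# Field names follow the card: "article" is the paper body, "abstract" the summary.
ds = load_dataset("ccdv/arxiv-summarization", split="validation")
example = ds[0]
print(example["article"][:300])
print(example["abstract"][:300])

# In run_summarization.py, the entry from the card would be added to the
# existing dictionary, e.g.:
# summarization_name_mapping = {
#     ...
#     "ccdv/arxiv-summarization": ("article", "abstract"),
#     "ccdv/pubmed-summarization": ("article", "abstract"),
# }
```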
PubMed dataset for summarization
Dataset for summarization of long documents. Adapted from this repo. Note that the original data are pre-tokenized, so this dataset returns " ".join(text) and adds "\n" for paragraphs. This dataset is compatible with the run_summarization.py script from Transformers if you add this line to the summarization_name_mapping variable: "ccdv/pubmed-summarization": ("article", "abstract")
Data Fields
id: paper id
article: a string containing the body… See the full description on the dataset page: https://huggingface.co/datasets/ccdv/pubmed-summarization.
readerbench/ro-text-summarization dataset hosted on Hugging Face and contributed by the HF Datasets community
https://choosealicense.com/licenses/other/
We present MLSUM, the first large-scale MultiLingual SUMmarization dataset. Obtained from online newspapers, it contains 1.5M+ article/summary pairs in five different languages -- namely, French, German, Spanish, Russian, Turkish. Together with English newspapers from the popular CNN/Daily mail dataset, the collected data form a large scale multilingual dataset which can enable new research directions for the text summarization community. We report cross-lingual comparative analyses based on state-of-the-art systems. These highlight existing biases which motivate the use of a multi-lingual dataset.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Dataset Card for DIALOGSum Corpus
Dataset Description
Links
Homepage: https://aclanthology.org/2021.findings-acl.449 Repository: https://github.com/cylnlp/dialogsum Paper: https://aclanthology.org/2021.findings-acl.449 Point of Contact: https://huggingface.co/knkarthick
Dataset Summary
DialogSum is a large-scale dialogue summarization dataset, consisting of 13,460 (Plus 100 holdout data for topic generation) dialogues with corresponding… See the full description on the dataset page: https://huggingface.co/datasets/knkarthick/dialogsum.
CNN/DailyMail non-anonymized summarization dataset. There are two features:
- article: text of the news article, used as the document to be summarized
- highlights: joined text of highlights with <s> and </s> around each highlight, which is the target summary
https://choosealicense.com/licenses/unknown/
Multi-XScience, a large-scale multi-document summarization dataset created from scientific articles. Multi-XScience introduces a challenging multi-document summarization task: writing the related-work section of a paper based on its abstract and the articles it references.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for CNN Dailymail Dataset
Dataset Summary
The CNN / DailyMail Dataset is an English-language dataset containing just over 300k unique news articles as written by journalists at CNN and the Daily Mail. The current version supports both extractive and abstractive summarization, though the original version was created for machine reading and comprehension and abstractive question answering.
Supported Tasks and Leaderboards
'summarization': Versions… See the full description on the dataset page: https://huggingface.co/datasets/abisee/cnn_dailymail.
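A minimal loading sketch for this card, assuming the Hugging Face datasets library; "3.0.0" is the non-anonymized configuration commonly used for summarization, but check the dataset page for the configurations currently offered.

```python
from datasets import load_dataset

# Non-anonymized configuration; fields are "article", "highlights", and "id".
cnn_dm = load_dataset("abisee/cnn_dailymail", "3.0.0", split="train")
record = cnn_dm[0]
print(record["article"][:300])   # source news article
print(record["highlights"])      # target summary (joined highlights)
```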
https://choosealicense.com/licenses/unknown/
In this paper, we introduce a large-scale Indonesian summarization dataset. We harvest articles from this http URL, an online news portal, and obtain 215,827 document-summary pairs. We leverage pre-trained language models to develop benchmark extractive and abstractive summarization methods over the dataset with multilingual and monolingual BERT-based models. We include a thorough error analysis by examining machine-generated summaries that have low ROUGE scores, and expose issues both with ROUGE itself and with extractive and abstractive summarization models.
https://choosealicense.com/licenses/cc0-1.0/
About Dataset
Context
Text summarization condenses a large amount of information into a concise form by selecting important information and discarding what is unimportant or redundant. With the amount of textual information on the World Wide Web, text summarization is becoming increasingly important. Extractive summarization uses sentences taken verbatim from the document as the summary. The extractive… See the full description on the dataset page: https://huggingface.co/datasets/gopalkalpande/bbc-news-summary.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
A well-structured summarization dataset for the Persian language consists of 93,207 records. It is prepared for Abstractive/Extractive tasks (like cnn_dailymail for English). It can also be used in other scopes like Text Generation, Title Generation, and News Category Classification.
Note that newlines were replaced with the [n] symbol; convert them back to normal newlines (for example, t.replace("[n]", "\n")) before use.
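A small sketch of that conversion applied to a loaded split. The dataset path and the column names used here are placeholders, since the card above does not name them; substitute the actual repository ID and fields.

```python
from datasets import load_dataset

def restore_newlines(example):
    # Per the card, "\n" was stored as the literal token "[n]".
    for key in ("article", "summary"):  # placeholder column names
        example[key] = example[key].replace("[n]", "\n")
    return example

# "user/persian-summarization" is a placeholder; use the real dataset path.
ds = load_dataset("user/persian-summarization", split="train")
ds = ds.map(restore_newlines)
```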
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
HuggingFace Dataset: FactualConsistencyScoresTextSummarization
Description:
This dataset aggregates model scores assessing factual consistency across multiple summarization datasets. It is designed to highlight the thresholding issue with current SOTA factual consistency models in evaluating the factuality of text summaries.
What is the "Thresholding Issue" with SOTA Factual Consistency Models?
Existing models for detecting factual errors in summaries… See the full description on the dataset page: https://huggingface.co/datasets/achandlr/FactualConsistencyScoresTextSummarization.
https://choosealicense.com/licenses/other/
NEWSROOM is a large dataset for training and evaluating summarization systems. It contains 1.3 million articles and summaries written by authors and editors in the newsrooms of 38 major publications.
Dataset features include:
- text: input news text.
- summary: summary for the news.
Additional features:
- title: news title.
- url: URL of the news.
- date: date of the article.
- density: extractive density.
- coverage: extractive coverage.
- compression: compression ratio.
- density_bin: low, medium, high.
- coverage_bin: extractive, abstractive.
- compression_bin: low, medium, high.
This dataset can be downloaded upon request. Unzip the contents (train.jsonl, dev.jsonl, test.jsonl) into the tfds folder.
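Since the files are plain JSON Lines, one way to inspect them after unpacking is to read them directly. The path assumes the tfds folder mentioned above, and the field names follow the feature list, so adjust both if your local copy differs.

```python
import json

# Read a few records from the manually downloaded training file.
with open("tfds/train.jsonl", encoding="utf-8") as f:
    for i, line in enumerate(f):
        record = json.loads(line)
        print(record["title"])
        print(record["summary"])
        print(record["density_bin"], record["coverage_bin"])
        if i == 2:  # stop after the first three records
            break
```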
https://choosealicense.com/licenses/cc0-1.0/
Dataset Card for "billsum"
Dataset Summary
BillSum, summarization of US Congressional and California state bills. There are several features:
text: bill text.
summary: summary of the bill.
title: title of the bill.
The following features exist for US bills only (CA bills do not have them):
text_len: number of characters in the text.
sum_len: number of characters in the summary.
Supported Tasks and Leaderboards
More Information Needed
Languages
More Information Needed
Dataset Structure… See the full description on the dataset page: https://huggingface.co/datasets/FiscalNote/billsum.
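A minimal loading sketch for BillSum, assuming the Hugging Face datasets library; the split names train, test, and ca_test are the ones commonly exposed for this dataset, but verify them on the dataset page.

```python
from datasets import load_dataset

# US bills for train/test plus a separate California test split.
billsum = load_dataset("FiscalNote/billsum")
print(billsum)  # expected splits: train, test, ca_test

bill = billsum["train"][0]
print(bill["title"])
print(bill["summary"][:300])
```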
https://choosealicense.com/licenses/unknown/
Extreme Summarization (XSum) Dataset.
There are three features:
- document: input news article.
- summary: one-sentence summary of the article.
- id: BBC ID of the article.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Summary
hindi-article-summarization is an open-source dataset of instruct-style records generated from the Hindi Text Short and Large Summarization dataset. It was created as part of the Aya Open Science Initiative from Cohere For AI. This dataset can be used for any purpose, whether academic or commercial, under the terms of the CC BY-SA 4.0 License.
Supported Tasks:
Training LLMs Synthetic Data Generation Data Augmentation
Languages: Hindi
Version: 1.0
Dataset Overview… See the full description on the dataset page: https://huggingface.co/datasets/ganeshjcs/hindi-article-summarization.
https://choosealicense.com/licenses/bsd-3-clause/
BOOKSUM: A Collection of Datasets for Long-form Narrative Summarization
Authors: Wojciech Kryściński, Nazneen Rajani, Divyansh Agarwal, Caiming Xiong, Dragomir Radev
Introduction
The majority of available text summarization datasets include short-form source documents that lack long-range causal and temporal dependencies, and often contain strong layout and stylistic biases. While relevant, such datasets will offer limited challenges for future generations of text… See the full description on the dataset page: https://huggingface.co/datasets/kmfoda/booksum.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This corpus contains preprocessed posts from the Reddit dataset. The dataset consists of 3,848,330 posts with an average length of 270 words for content, and 28 words for the summary.
Features include the strings: author, body, normalizedBody, content, summary, subreddit, subreddit_id. The content field is used as the document and the summary field as the summary.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Multi-document summarization is a challenging task for which there exist few large-scale datasets. We propose Multi-XScience, a large-scale multi-document summarization dataset created from scientific articles. Multi-XScience introduces a challenging multi-document summarization task: writing the related-work section of a paper based on its abstract and the articles it references. Our work is inspired by extreme summarization, a dataset construction protocol that favours abstractive modeling approaches. Descriptive statistics and empirical results, using several state-of-the-art models trained on the Multi-XScience dataset, reveal that Multi-XScience is well suited for abstractive models.
Tackling Hallucinations in Neural Chart Summarization
Introduction
The trained model used for the investigations and the state-of-the-art (SOTA) improvements are detailed in the paper: Tackling Hallucinations in Neural Chart Summarization. This repo contains optimized input prompts and summaries after NLI filtering.
Abstract
Hallucinations in text generation occur when the system produces text that is not grounded in the input. In this work, we address the… See the full description on the dataset page: https://huggingface.co/datasets/saadob12/Autochart.