MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
@inproceedings{cohan-etal-2018-discourse,
  title = "A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents",
  author = "Cohan, Arman and Dernoncourt, Franck and Kim, Doo Soon and Bui, Trung and Kim, Seokhwan and Chang, Walter and Goharian, Nazli",
  booktitle = "Proceedings of the 2018 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)",
  month = jun,
  year = "2018",
  address = "New Orleans, Louisiana",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/N18-2097",
  doi = "10.18653/v1/N18-2097",
  pages = "615--621",
  abstract = "Neural abstractive summarization models have led to promising results in summarizing relatively short documents. We propose the first model for abstractive summarization of single, longer-form documents (e.g., research papers). Our approach consists of a new hierarchical encoder that models the discourse structure of a document, and an attentive discourse-aware decoder to generate the summary. Empirical results on two large-scale datasets of scientific papers show that our model significantly outperforms state-of-the-art models.",
}
Adapted from: https://github.com/armancohan/long-summarization and https://huggingface.co/datasets/ccdv/arxiv-summarization
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains the following 3 datasets for legal document summarization:
- IN-Abs: Indian Supreme Court case documents and their abstractive summaries, obtained from http://www.liiofindia.org/in/cases/cen/INSC/
- IN-Ext: Indian Supreme Court case documents and their extractive summaries, written by two law experts (A1, A2).
- UK-Abs: United Kingdom (U.K.) Supreme Court case documents and their abstractive summaries, obtained from https://www.supremecourt.uk/decided-cases/
Please refer to the paper and the README file for more details.
PubMed dataset for summarization
Dataset for summarization of long documents. Adapted from this repo. Note that the original data are pre-tokenized, so this dataset returns " ".join(text) and adds " " for paragraphs. This dataset is compatible with the run_summarization.py script from Transformers if you add this line to the summarization_name_mapping variable: "ccdv/pubmed-summarization": ("article", "abstract")
Data Fields
id: paper id
article: a string containing the body… See the full description on the dataset page: https://huggingface.co/datasets/ccdv/pubmed-summarization.
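As a rough illustration of the compatibility note above, the sketch below loads the dataset and echoes the column mapping the card specifies; the load_dataset call, split name, and any extra arguments are standard Hugging Face usage rather than anything prescribed by this card.

```python
from datasets import load_dataset

# Illustrative sketch (not from the card): load the dataset from the Hub.
# Depending on your datasets version you may need extra arguments
# (e.g. a config name or trust_remote_code=True).
dataset = load_dataset("ccdv/pubmed-summarization", split="train")

sample = dataset[0]
print(sample["article"][:200])   # body of the paper
print(sample["abstract"][:200])  # reference summary

# The card's instruction for the Transformers run_summarization.py script is to add
# this entry to its summarization_name_mapping variable:
# "ccdv/pubmed-summarization": ("article", "abstract")
```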
CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
By ccdv (From Huggingface) [source]
The validation.csv file contains a set of articles along with their respective abstracts that can be used for validating the performance of summarization models. This subset allows researchers to fine-tune their models and measure how well they can summarize scientific texts.
The train.csv file serves as the primary training data for building summarization models. It consists of numerous articles extracted from the Arxiv database, paired with their corresponding abstracts. By utilizing this file, researchers can develop and train various machine learning algorithms to generate accurate summaries of scientific papers.
Lastly, the test.csv file provides a separate set of articles with accompanying abstracts specifically intended for evaluating the performance and effectiveness of summarization models developed using this dataset. Researchers can utilize this test set to conduct rigorous evaluations and benchmark different approaches in automatic document summarization.
With columns labeled article and abstract, containing the full text of each paper and its corresponding summary, this dataset provides a solid basis for developing robust models for summarizing complex scientific documents.
Introduction:
File Description:
validation.csv: This file contains articles and their respective abstracts that can be used for validation purposes.
train.csv: The purpose of this file is to provide training data for summarizing scientific articles.
test.csv: This file includes a set of articles and their corresponding abstracts that can be used to evaluate the performance of summarization models.
Dataset Structure: Each split consists of two columns: article (the full text of the paper) and abstract (its summary).
Usage Examples: This dataset can be utilized in various ways:
a) Training Models: You can use the train.csv file to train your own model for summarizing scientific articles from the Arxiv database. The article column provides the full text of each scientific paper, while the abstract column contains its summary.
b) Validation: The validation.csv file allows you to validate your trained models by comparing their generated summaries with the provided reference summaries in order to assess their performance.
c) Evaluation: Utilize the test.csv file as a benchmark for evaluating different summarization models. Generate summaries using your selected model and compare them with reference summaries.
- Evaluating Performance: To measure how well your summarization model performs on this dataset, you can use metrics like ROUGE (Recall-Oriented Understudy for Gisting Evaluation). ROUGE measures overlap between generated summaries and reference summaries based on n-gram co-occurrence statistics.
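A minimal sketch of the evaluation workflow described above, assuming the test.csv file with its article and abstract columns sits in the working directory and using the rouge-score package as one possible ROUGE implementation; the summarize() function is a placeholder for whatever model you are evaluating.

```python
import pandas as pd
from rouge_score import rouge_scorer  # pip install rouge-score

test_df = pd.read_csv("test.csv")  # columns: article, abstract

def summarize(article: str) -> str:
    # Placeholder for your own model; here we naively take the first 3 sentences.
    return " ".join(article.split(". ")[:3])

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = [
    scorer.score(row["abstract"], summarize(row["article"]))
    for _, row in test_df.head(100).iterrows()  # small sample for illustration
]
avg_rouge1 = sum(s["rouge1"].fmeasure for s in scores) / len(scores)
print(f"Average ROUGE-1 F1 on the sample: {avg_rouge1:.3f}")
```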
- Summarizing scientific articles: This dataset can be used to train and evaluate summarization models for the task of generating concise summaries of scientific articles from the Arxiv database. Researchers can utilize this dataset to develop novel techniques and approaches for automatic summarization in the scientific domain.
- Information retrieval: The dataset can be used to enhance search engines or information retrieval systems by providing concise summaries along with the full text of scientific articles. This would enable users to quickly grasp key information without having to read the entire article, improving accessibility and efficiency.
- Text generation research: Researchers interested in natural language processing and text generation can use this dataset as a benchmark for developing new models and algorithms that generate coherent, informative, and concise summaries from lengthy scientific texts. The dataset provides a diverse range of articles across various domains, allowing researchers to explore different challenges in summary generation
If you use this dataset in your research, please credit the original authors.
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication
antash420/long-context-text-summarization-alpaca-format dataset hosted on Hugging Face and contributed by the HF Datasets community
This work was supported by the Ho Chi Minh City Department of Science and Technology, Grant Number 15/2016/HÐ-SKHCN
Data construction process: In this work, we aim to have 300 clusters of documents extracted from news. To this end, we made use of the Vietnamese language version of Google News. Due to the copyright issue, we did not collect articles from every source listed on Google News, but limited to some sources that are open for research purposes. The collected articles belong to five genres: world news, domestic news, business, entertainment, and sports. Every cluster contains from four to ten news articles. Each article is represented by the following information: the title, the plain text content, the news source, the date of publication, the author(s), the tag(s) and the headline summary.
After that, two summaries are created for each cluster (produced in the first subtask above) by two different annotators using the MDSWriter system (Meyer, Christian M., et al. "MDSWriter: Annotation tool for creating high-quality multi-document summarization corpora." Proceedings of ACL-2016 System Demonstrations). These annotators are Vietnamese native speakers and undergraduate or graduate students, most of whom are familiar with natural language processing. The full annotation process consists of seven steps that must be done sequentially from the first to the seventh.
Data information: Original folder: Contains 300 subdirectories, one per news cluster. Articles (documents) in each cluster belong to a similar topic, and each cluster contains from four to ten of them. The total number of articles is 1,945.
Summary folder: Contains 300 subdirectories holding 600 final summaries in total: every input cluster has two manual abstractive summaries from two different annotators. ViMs can be used for both implementing and evaluating supervised machine learning-based systems for Vietnamese abstractive multi-document summarization.
S3_summary folder: Contains 300 subdirectories holding 600 "best sentence selection" summaries, the result of step 3 (the best-sentence-selection step). Sentences in a group are separated from others by a blank line. The most important sentence is labeled 1, while the others are labeled 0.
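A rough sketch of how such an S3_summary file might be parsed, based only on the description above; the exact on-disk layout is not specified here, so the assumption that each line starts with its 0/1 label followed by the sentence, and the example filename, are hypothetical.

```python
from pathlib import Path

def parse_s3_summary(path: str):
    """Split a file into sentence groups separated by blank lines.

    Assumed (hypothetical) line format: "<label> <sentence>", where label is 1
    for the most important sentence of the group and 0 for the others.
    """
    groups = []
    for block in Path(path).read_text(encoding="utf-8").split("\n\n"):
        group = []
        for line in block.strip().splitlines():
            label, _, sentence = line.partition(" ")
            group.append((int(label), sentence))
        if group:
            groups.append(group)
    return groups

# Example usage (hypothetical filename): pick the most important sentence of each group.
# best = [next(s for lab, s in g if lab == 1) for g in parse_s3_summary("cluster_001.txt")]
```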
@article{tran2020vims,
  title = {ViMs: a high-quality Vietnamese dataset for abstractive multi-document summarization},
  author = {Tran, Nhi-Thao and Nghiem, Minh-Quoc and Nguyen, Nhung TH and Nguyen, Ngan Luu-Thuy and Van Chi, Nam and Dinh, Dien},
  journal = {Language Resources and Evaluation},
  volume = {54},
  number = {4},
  pages = {893--920},
  year = {2020},
  publisher = {Springer}
}
Authors: Tran Mai Vu, Vu Trong Hoa, Phi Van Thuy, Le Duc Trong, Ha Quang Thuy
Affiliation: Knowledge Technology Laboratory, University of Technology, VNU Hanoi
Research Topic: Design and Implementation of a Multi-document Summarization Program for the Vietnamese Language, funded by the Ministry of Education (Project Code: B2012-01-24)
Data construction process: The data construction process is entirely manual. It consists of two steps:
Data information: Data Volume: 200 clusters
Each cluster corresponds to a folder, and it typically contains 2-5 documents (often 3). The folder's name represents the cluster.
Within each folder:
All files within the same folder represent documents (online articles) belonging to the cluster:
License: unknown (https://choosealicense.com/licenses/unknown/)
Multi-Document is a large-scale multi-document summarization dataset created from scientific articles. It introduces a challenging multi-document summarization task: writing the related-work section of a paper based on its abstract and the articles it references.
=====================================================================================
BUSUM-BNLP DATASET
=====================================================================================
A Public Dataset for the Multi-Document Update Summarization Task: Improving AI-Centered Information Retrieval
=====================================================================================
This is the README file for the BUSUM-BNLP dataset. Good performance in NLP projects depends on high-quality datasets. For the multi-document update summarization task, we surveyed existing NLP datasets and also built a new one in Bangla. In the literature we found many English datasets, such as DUC2002, DUC2007, Daily Mail, and the TAC dataset, and some Bangla datasets, such as the Bangla Summarization Dataset (Prothom Alo), the Bangla-Extra Sum Dataset, and the BNLPC Dataset. In several papers, DUC2002 and DUC2007 were used to generate update summaries, while Daily Mail was used for extractive and abstractive summaries. However, relatively little work has been done on Bengali summarization so far.
=====================================================================================
For our dataset, we have collected older and newer articles covering the same news stories from the online websites of Prothom Alo, Kaler Kontho, BBC News, Jugantor, etc. Both the old and the new articles can contain similar information. We sorted the old and new articles and developed our dataset for the multi-document update summarization task.
=====================================================================================
Our multi-document Bangla dataset can be used for keyword detection in multiple files related to a specific topic. You can develop various summarization models, including updated summarizers and generic summarizers, using machine learning and deep learning techniques. This dataset can serve as a valuable resource for inspiring and generating new datasets by scraping news articles and other sources. It can also aid in developing domain-specific title detection models and extracting relevant features to improve accuracy.
=====================================================================================
One can follow these methods for pre-processing the data:
Part-of-speech (POS) tagging: groups or organises text phrases according to word classes such as nouns, verbs, adverbs, adjectives, etc.
Stop-word removal: eliminates common words that carry no useful information, such as they, there, this, were, etc.
Discarding words containing digits: terms such as wordscapes59 or game5ts7 are hard to handle, so it is best to remove them or replace them with an empty string using regular expressions.
Removing extra white space: the regular expression library works well for removing unnecessary extra spaces.
Removing punctuation: there are 32 major punctuation marks to consider; the string module and a regular expression can be used to replace any punctuation in the text with an empty string.
Converting to the same case: if the text is in the same case throughout, a machine can process the words more easily, since machines treat lowercase and uppercase letters as different characters.
Named entity recognition: keywords in the text are identified and labelled by entity type (e.g., person, place, organisation).
Lemmatization or stemming: words are reduced to their base form (e.g., run is the base form of runs, running, ran), either by lemmatization (which matches words against a linguistic dictionary) or stemming (which removes suffixes and prefixes).
Expanding contractions: words like don't (meaning "do not") and aren't (meaning "are not") are contractions; expanding them makes sentence-processing tasks easier.
Tokenization: breaks text streams into tokens, which can be words, phrases, symbols, or other meaningful units.
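As a rough illustration of a few of these steps (stop-word removal, discarding digit-bearing tokens, whitespace and punctuation cleanup, lowercasing, and tokenization), the sketch below uses only the Python standard library; the tiny stop-word list is a placeholder, and real Bangla text would need a proper Bangla stop-word list and tokenizer.

```python
import re
import string

# Placeholder stop-word list for illustration; a real pipeline would use a Bangla list.
STOP_WORDS = {"they", "there", "this", "were", "is", "the", "a", "an"}

def preprocess(text: str) -> list[str]:
    text = text.lower()                                                # same case throughout
    text = text.translate(str.maketrans("", "", string.punctuation))  # cut punctuation
    text = re.sub(r"\s+", " ", text).strip()                          # erase extra white space
    tokens = text.split()                                             # naive tokenization
    tokens = [t for t in tokens if not re.search(r"\d", t)]           # discard digit-bearing terms
    tokens = [t for t in tokens if t not in STOP_WORDS]               # remove stop words
    return tokens

print(preprocess("They were playing Wordscapes59 there,   this is   GAME5ts7!"))
# -> ['playing']
```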
====================================================================================
During our project, we encountered a few limitations that affected our data collection and modeling efforts. Firstly, to collect news on a particular topic, we had to put in a considerable amount of effort, as newspapers do not publish news in a serial manner every day. This meant that we had to read through entire texts to select relevant news articles for our dataset. Additionally, generating human-generated summaries was not an easy task, as we had to read lengthy documents and condense them into shorter summaries. However, due to time constraints and difficulties in r...
Facebook
Twitterhttps://clarin.si/repository/xmlui/page/licence-aca-id-by-nc-inf-nored-1.0https://clarin.si/repository/xmlui/page/licence-aca-id-by-nc-inf-nored-1.0
Summarization datasets were created from the text bodies in the KAS 2.0 corpus (http://hdl.handle.net/11356/1448) and the abstracts from the KAS-Abs 2.0 corpus (http://hdl.handle.net/11356/1449). The monolingual slo2slo dataset contains 69,730 Slovene abstracts and Slovene body texts. The cross-lingual slo2eng dataset contains 52,351 Slovene body texts and English abstracts and is suitable for building cross-lingual summarization models. The total number of words represents the sum of the words in the bodies, the Slovene abstracts, and the English abstracts.
The files are stored in the same manner as the complete KAS corpus, i.e. in 1,000 directories with the same filename prefix as in KAS. They are in JSON format and contain chapter-segmented text. In addition to a unique chapter ID, each JSON file contains a key titled "abstract" whose value is a list with the abstract text as its first element. The file with the metadata for the corpus texts is also included.
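A minimal sketch of reading one such JSON file, based only on the layout described above; the directory path, filename, and the "chapters" key used for the body text are assumptions, since only the "abstract" key and the chapter segmentation are stated explicitly.

```python
import json
from pathlib import Path

# Hypothetical path within the 1,000-directory layout described above.
path = Path("kas-summarization/000/kas-000001.json")

with path.open(encoding="utf-8") as f:
    doc = json.load(f)

abstract = doc["abstract"][0]             # abstract text is the first element of a list
body_chapters = doc.get("chapters", [])   # assumed key for the chapter-segmented body
print(abstract[:200])
print(f"{len(body_chapters)} chapters")
```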
The datasets are suitable for training monolingual Slovene summarization models and cross-lingual Slovene-English summarization models on long texts.
References: Žagar, A., Kavaš, M., & Robnik Šikonja, M. (2021). Corpus KAS 2.0: cleaner and with new datasets. In Information Society - IS 2021: Proceedings of the 24th International Multiconference. https://doi.org/10.5281/zenodo.5562228
GovReport dataset for summarization
Dataset for summarization of long documents. Adapted from this repo and this paper. This dataset is compatible with the run_summarization.py script from Transformers if you add this line to the summarization_name_mapping variable: "ccdv/govreport-summarization": ("report", "summary")
Data Fields
id: paper id
report: a string containing the body of the report
summary: a string containing the summary of the report
Data Splits… See the full description on the dataset page: https://huggingface.co/datasets/ccdv/govreport-summarization.
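To make the mapping instruction above concrete, here is a hedged, standalone sketch of the dict entry as it might be added inside the Transformers run_summarization.py example script; the script's other entries are omitted.

```python
# Sketch only: in the Transformers example script run_summarization.py, the
# summarization_name_mapping dict maps a dataset name to its (text, summary) columns.
# Per the card above, the GovReport entry would look like this:
summarization_name_mapping = {
    "ccdv/govreport-summarization": ("report", "summary"),
    # ... the script's existing entries remain here ...
}
```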
https://dataintelo.com/privacy-and-policy
According to our latest research, the global Document Summarization AI market size reached USD 1.54 billion in 2024, reflecting robust adoption across industries. The market is projected to expand at a CAGR of 23.7% from 2025 to 2033, driven by increasing demand for automated content processing and smarter information retrieval. By 2033, the Document Summarization AI market is forecasted to achieve a value of USD 12.38 billion, underlining its transformative impact on enterprise operations and knowledge management. This rapid growth is primarily fueled by the proliferation of unstructured data, the need for efficient decision-making, and advancements in natural language processing (NLP) technologies.
One of the primary growth factors for the Document Summarization AI market is the exponential surge in digital content generation across industries. Enterprises, government agencies, and academic institutions are inundated with vast volumes of unstructured data, including emails, reports, legal documents, and research papers. Manual processing of such data is time-consuming, error-prone, and often leads to information overload. The integration of AI-driven summarization tools enables organizations to extract key insights, reduce redundancy, and accelerate workflow automation. As a result, businesses are able to enhance productivity, improve compliance, and make data-driven decisions more efficiently. This growing need for automation and intelligent data curation is a critical driver propelling the adoption of Document Summarization AI solutions worldwide.
Another significant factor contributing to market growth is the advancement and democratization of natural language processing (NLP) and machine learning algorithms. Leading AI vendors are investing heavily in research and development to enhance the accuracy, context-awareness, and linguistic versatility of summarization models. With the evolution of transformer-based architectures and large language models, Document Summarization AI tools are now capable of handling complex, domain-specific content with greater precision. Moreover, the availability of cloud-based AI services has lowered the entry barriers for small and medium-sized enterprises (SMEs), enabling them to leverage sophisticated summarization capabilities without significant upfront investment. This technological progress, coupled with growing awareness about the benefits of AI-powered document management, is expected to sustain high growth momentum over the forecast period.
The Document Summarization AI market is also witnessing strong traction due to regulatory and compliance requirements, particularly in sectors like BFSI, healthcare, and legal. Stringent data governance frameworks and the need for timely, accurate reporting are prompting organizations to automate document review and summarization processes. Additionally, the rise of remote work and digital collaboration has intensified the demand for solutions that can streamline knowledge sharing and information dissemination across distributed teams. As organizations continue to embrace digital transformation, the strategic value of Document Summarization AI in enhancing operational agility, reducing manual workload, and mitigating compliance risks is becoming increasingly evident. This confluence of regulatory, technological, and business drivers is expected to shape the market landscape in the coming years.
From a regional perspective, North America currently dominates the Document Summarization AI market, accounting for the largest share in 2024. The region's leadership is attributed to the early adoption of AI technologies, a mature digital infrastructure, and a strong presence of key market players. However, Asia Pacific is emerging as the fastest-growing region, driven by rapid digitalization, expanding enterprise IT budgets, and government initiatives to foster AI innovation. Europe is also witnessing substantial growth, supported by robust regulatory frameworks and increasing investments in AI research. Meanwhile, Latin America and the Middle East & Africa are gradually catching up, with rising awareness and adoption among enterprises and public sector organizations. The global market is thus characterized by dynamic regional trends, with each geography presenting unique opportunities and challenges for stakeholders.
The Document Summarization AI market is segmented by component into
GovReport is a dataset for long-document summarization, with significantly longer documents and summaries. It consists of reports written by government research agencies, including the Congressional Research Service and the U.S. Government Accountability Office. Compared with other long-document summarization datasets, GovReport has longer summaries and documents and requires reading more context to cover the salient content to be summarized.
https://www.shibatadb.com/license/data/proprietary/v1.0/license.txt
Yearly citation counts for the publication titled "A study of extractive summarization of long documents incorporating local topic and hierarchical information".
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
A text summarisation task aims to convert a longer text into a shorter text while preserving the essential information of the source text. In general, there are two approaches to text summarisation. The extractive approach reuses the most important sentences or parts of the source text verbatim, whereas the abstractive approach produces summaries more similar to human-written ones. We release 5 models that cover extractive, abstractive, and hybrid types:
Metamodel: a neural model based on the Doc2Vec document representation that suggests the best summariser.
Graph-based model: an unsupervised graph-based extractive approach that returns the N most relevant sentences.
Headline model: a supervised abstractive approach (T5 architecture) that returns headline-like abstracts.
Article model: a supervised abstractive approach (T5 architecture) that returns short summaries.
Hybrid-long model: an unsupervised hybrid (graph-based and transformer-based) approach that returns short summaries of long texts.
Details and instructions to run and train the models are available at https://github.com/clarinsi/SloSummarizer.
The web service with a demo is available at https://slovenscina.eu/povzemanje.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
🇺🇸 English:
This synthetic dataset is created for learning and testing abstractive text summarization models. Each row contains a news-style article and a short summary. The dataset is ideal for experimenting with HuggingFace models such as t5-base, facebook/bart-large-cnn, or google/pegasus-xsum.
🇹🇷 Turkish:
This synthetic dataset is designed for those who want to generate summaries from news texts. Each row contains a long English news article and a corresponding short summary. It is compatible with T5, BART, and Pegasus models.
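As a quick, hedged illustration of experimenting with one of the models named above, the snippet below runs the standard Transformers summarization pipeline on a single article string; the dataset's column names and loading details are not specified in this card, so a plain string stands in for one row's article text.

```python
from transformers import pipeline

# One of the models suggested by the card; t5-base or google/pegasus-xsum work similarly.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

article = (
    "A long news-style article goes here. In practice this would be the "
    "article text from one row of the dataset."
)
summary = summarizer(article, max_length=60, min_length=10, do_sample=False)
print(summary[0]["summary_text"])
```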
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset consists of 100 news articles covering the natural disaster and events domains in Malaysia. The news articles are formatted in XML following the DUC 2002 text summarization dataset preparation. A total of 300 human summaries from 3 domain experts is included for summary evaluation. Our MyTextSum model, which applies a Pattern-Growth Sentence Compression technique, achieved a promising F-measure of 0.5752 when evaluated against the human summaries and performs better than the baseline (uncompressed) model.
Citation Reference:
Alias, S., Sainin, M. S., & Mohammad, S. K. (2020). Model Peringkasan Teks Ekstraktif Dwibahasa menggunakan Fitur Kekangan Corak Tekstual (Bilingual Extractive Text Summarization Model using Textual Pattern Constraints). GEMA Online® Journal of Language Studies, 20(3).
Alias S., Mohammad S.K., Gan K.H., Ping T.T. (2018) MYTextSum: A Malay Text Summarizer Model Using a Constrained Pattern-Growth Sentence Compression Technique. In: Alfred R., Iida H., Ag. Ibrahim A., Lim Y. (eds) Computational Science and Technology. ICCST 2017. Lecture Notes in Electrical Engineering, vol 488. Springer, Singapore. https://doi.org/10.1007/978-981-10-8276-4_14
VideoXum is a large-scale video summarization dataset that contains 14,001 long videos with corresponding human-annotated video and text summaries.
According to our latest research, the Document Summarization AI market size reached USD 1.42 billion globally in 2024, driven by the accelerating adoption of artificial intelligence across industries and the exponential growth in unstructured data. The market is projected to expand at a robust CAGR of 22.6% from 2025 to 2033, reaching approximately USD 10.86 billion by 2033. This remarkable growth is largely attributed to advancements in natural language processing, increasing demand for automation in document management, and the need for efficient information retrieval from massive data repositories.
The rapid digital transformation across sectors such as legal, healthcare, BFSI, and government has been a significant growth factor for the Document Summarization AI market. Organizations are dealing with an unprecedented volume of digital documents, emails, reports, and legal papers, making manual summarization inefficient and error-prone. AI-powered document summarization solutions are being rapidly adopted to automate this process, enabling faster decision-making, enhanced productivity, and substantial cost savings. The integration of advanced NLP techniques, including transformer-based models and deep learning, has further improved the accuracy and relevance of AI-generated summaries, making them invaluable for knowledge workers and executives who need to process large amounts of information quickly.
Another key driver for the Document Summarization AI market is the growing emphasis on compliance and risk management, especially in highly regulated industries like finance and healthcare. Automated summarization tools help organizations extract critical information from lengthy compliance documents, contracts, and medical records, ensuring that essential details are not overlooked. This capability is crucial for meeting regulatory requirements, avoiding legal pitfalls, and maintaining robust audit trails. Furthermore, the increasing use of AI-based summarization in customer service chatbots and virtual assistants is enhancing user experiences by providing concise, contextually relevant responses, thereby improving customer satisfaction and loyalty.
The proliferation of cloud computing and the availability of scalable AI platforms have also contributed significantly to market expansion. Cloud-based document summarization AI solutions offer businesses the flexibility to deploy and scale services according to their needs, reducing infrastructure costs and facilitating seamless integration with existing enterprise workflows. Additionally, the democratization of AI through APIs and low-code/no-code platforms is enabling small and medium enterprises (SMEs) to leverage advanced document summarization capabilities without the need for extensive technical expertise. This trend is expected to further boost market penetration across diverse industry verticals in the coming years.
From a regional perspective, North America currently dominates the Document Summarization AI market, accounting for the largest revenue share in 2024. The region's leadership can be attributed to the strong presence of leading AI technology providers, high digital adoption rates, and significant investments in research and development. Europe follows closely, driven by stringent data privacy regulations and increasing demand for automation in public and private sectors. Meanwhile, the Asia Pacific region is witnessing the fastest growth, propelled by rapid digitalization, expanding enterprise IT infrastructures, and rising awareness of AI-driven document management solutions among businesses and government agencies.
In the realm of education, Lecture Summarization Tools are becoming increasingly vital as they offer a streamlined way for students and educators to process and retain vast amounts of information. These tools utilize advanced AI algorithms to distill lecture content into concise summaries, making it easier for students to review and comprehend complex subjects. By integrating lecture summarization capabilities, educational institutions can enhance learning outcomes, provide personalized study materials, and support diverse learning styles. As the demand for digital learning solutions grows, the role of Lecture Summarization Tools in education is set to expand, offering significant be
BSD 3-Clause License: https://choosealicense.com/licenses/bsd-3-clause/
BOOKSUM: A Collection of Datasets for Long-form Narrative Summarization
Authors: Wojciech Kryściński, Nazneen Rajani, Divyansh Agarwal, Caiming Xiong, Dragomir Radev
Introduction
The majority of available text summarization datasets include short-form source documents that lack long-range causal and temporal dependencies, and often contain strong layout and stylistic biases. While relevant, such datasets will offer limited challenges for future generations of text… See the full description on the dataset page: https://huggingface.co/datasets/kmfoda/booksum.
Open Data Commons Attribution License (ODC-BY): https://choosealicense.com/licenses/odc-by/
High Quality Long Text Summarization Dataset
Input texts from the agentlans/high-quality-text-long dataset (sample_k10000 config)
Summaries generated by google/gemma-3-12b-it
Summaries rewritten by agentlans/granite-3.3-2b-reviser