MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
@inproceedings{cohan-etal-2018-discourse,
  title = "A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents",
  author = "Cohan, Arman and Dernoncourt, Franck and Kim, Doo Soon and Bui, Trung and Kim, Seokhwan and Chang, Walter and Goharian, Nazli",
  booktitle = "Proceedings of the 2018 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)",
  month = jun,
  year = "2018",
  address = "New Orleans, Louisiana",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/N18-2097",
  doi = "10.18653/v1/N18-2097",
  pages = "615--621",
  abstract = "Neural abstractive summarization models have led to promising results in summarizing relatively short documents. We propose the first model for abstractive summarization of single, longer-form documents (e.g., research papers). Our approach consists of a new hierarchical encoder that models the discourse structure of a document, and an attentive discourse-aware decoder to generate the summary. Empirical results on two large-scale datasets of scientific papers show that our model significantly outperforms state-of-the-art models.",
}
Adapted from: https://github.com/armancohan/long-summarization and https://huggingface.co/datasets/ccdv/arxiv-summarization
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains the following 3 datasets for legal document summarization:
- IN-Abs: Indian Supreme Court case documents and their abstractive summaries, obtained from http://www.liiofindia.org/in/cases/cen/INSC/
- IN-Ext: Indian Supreme Court case documents and their extractive summaries, written by two law experts (A1, A2).
- UK-Abs: United Kingdom (U.K.) Supreme Court case documents and their abstractive summaries, obtained from https://www.supremecourt.uk/decided-cases/
Please refer to the paper and the README file for more details.
PubMed dataset for summarization
Dataset for summarization of long documents. Adapted from this repo. Note that the original data are pre-tokenized, so this dataset returns " ".join(text) and adds " " for paragraphs. This dataset is compatible with the run_summarization.py script from Transformers if you add this line to the summarization_name_mapping variable: "ccdv/pubmed-summarization": ("article", "abstract")
Data Fields
id: paper id
article: a string containing the body… See the full description on the dataset page: https://huggingface.co/datasets/ccdv/pubmed-summarization.
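As a rough illustration of the compatibility note above, the sketch below loads the dataset and echoes the column mapping the card specifies; the load_dataset call, split name, and any extra arguments are standard Hugging Face usage rather than anything prescribed by this card.

```python
from datasets import load_dataset

# Illustrative sketch (not from the card): load the dataset from the Hub.
# Depending on your datasets version you may need extra arguments
# (e.g. a config name or trust_remote_code=True).
dataset = load_dataset("ccdv/pubmed-summarization", split="train")

sample = dataset[0]
print(sample["article"][:200])   # body of the paper
print(sample["abstract"][:200])  # reference summary

# The card's instruction for the Transformers run_summarization.py script is to add
# this entry to its summarization_name_mapping variable:
# "ccdv/pubmed-summarization": ("article", "abstract")
```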
CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
By ccdv (From Huggingface) [source]
The validation.csv file contains a set of articles along with their respective abstracts that can be used for validating the performance of summarization models. This subset allows researchers to fine-tune their models and measure how well they can summarize scientific texts.
The train.csv file serves as the primary training data for building summarization models. It consists of numerous articles extracted from the Arxiv database, paired with their corresponding abstracts. By utilizing this file, researchers can develop and train various machine learning algorithms to generate accurate summaries of scientific papers.
Lastly, the test.csv file provides a separate set of articles with accompanying abstracts specifically intended for evaluating the performance and effectiveness of summarization models developed using this dataset. Researchers can utilize this test set to conduct rigorous evaluations and benchmark different approaches in automatic document summarization.
With columns labeled article and abstract, containing the full text of each paper and its corresponding summary, this dataset provides a solid basis for developing robust models for summarizing complex scientific documents.
Introduction:
File Description:
validation.csv: This file contains articles and their respective abstracts that can be used for validation purposes.
train.csv: The purpose of this file is to provide training data for summarizing scientific articles.
test.csv: This file includes a set of articles and their corresponding abstracts that can be used to evaluate the performance of summarization models.
Dataset Structure: Each split consists of two columns: article (the full text of the paper) and abstract (its summary).
Usage Examples: This dataset can be utilized in various ways:
a) Training Models: You can use the train.csv file to train your own model for summarizing scientific articles from the Arxiv database. The article column provides the full text of each scientific paper, while the abstract column contains its summary.
b) Validation: The validation.csv file allows you to validate your trained models by comparing their generated summaries with the provided reference summaries in order to assess their performance.
c) Evaluation: Utilize the test.csv file as a benchmark for evaluating different summarization models. Generate summaries using your selected model and compare them with reference summaries.
- Evaluating Performance: To measure how well your summarization model performs on this dataset, you can use metrics like ROUGE (Recall-Oriented Understudy for Gisting Evaluation). ROUGE measures overlap between generated summaries and reference summaries based on n-gram co-occurrence statistics.
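A minimal sketch of the evaluation workflow described above, assuming the test.csv file with its article and abstract columns sits in the working directory and using the rouge-score package as one possible ROUGE implementation; the summarize() function is a placeholder for whatever model you are evaluating.

```python
import pandas as pd
from rouge_score import rouge_scorer  # pip install rouge-score

test_df = pd.read_csv("test.csv")  # columns: article, abstract

def summarize(article: str) -> str:
    # Placeholder for your own model; here we naively take the first 3 sentences.
    return " ".join(article.split(". ")[:3])

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = [
    scorer.score(row["abstract"], summarize(row["article"]))
    for _, row in test_df.head(100).iterrows()  # small sample for illustration
]
avg_rouge1 = sum(s["rouge1"].fmeasure for s in scores) / len(scores)
print(f"Average ROUGE-1 F1 on the sample: {avg_rouge1:.3f}")
```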
- Summarizing scientific articles: This dataset can be used to train and evaluate summarization models for the task of generating concise summaries of scientific articles from the Arxiv database. Researchers can utilize this dataset to develop novel techniques and approaches for automatic summarization in the scientific domain.
- Information retrieval: The dataset can be used to enhance search engines or information retrieval systems by providing concise summaries along with the full text of scientific articles. This would enable users to quickly grasp key information without having to read the entire article, improving accessibility and efficiency.
- Text generation research: Researchers interested in natural language processing and text generation can use this dataset as a benchmark for developing new models and algorithms that generate coherent, informative, and concise summaries from lengthy scientific texts. The dataset provides a diverse range of articles across various domains, allowing researchers to explore different challenges in summary generation
If you use this dataset in your research, please credit the original authors.
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication
antash420/long-context-text-summarization-alpaca-format dataset hosted on Hugging Face and contributed by the HF Datasets community
This work was supported by the Ho Chi Minh City Department of Science and Technology, Grant Number 15/2016/HÐ-SKHCN
Data construction process: In this work, we aim to have 300 clusters of documents extracted from news. To this end, we made use of the Vietnamese language version of Google News. Due to the copyright issue, we did not collect articles from every source listed on Google News, but limited to some sources that are open for research purposes. The collected articles belong to five genres: world news, domestic news, business, entertainment, and sports. Every cluster contains from four to ten news articles. Each article is represented by the following information: the title, the plain text content, the news source, the date of publication, the author(s), the tag(s) and the headline summary.
After that, two summaries are created for each cluster (produced in the first subtask above) by two different annotators using the MDSWriter system (Meyer, Christian M., et al. "MDSWriter: Annotation tool for creating high-quality multi-document summarization corpora." Proceedings of ACL-2016 System Demonstrations). These annotators are Vietnamese native speakers and undergraduate or graduate students, most of whom are familiar with natural language processing. The full annotation process consists of seven steps that must be done sequentially from the first to the seventh.
Data information: Original folder: Contains 300 subdirectories, one per news cluster. Articles (documents) in each cluster belong to a similar topic, and each cluster contains from four to ten of them. The total number of articles is 1,945.
Summary folder: Contains 300 subdirectories holding 600 final summaries in total: every input cluster has two manual abstractive summaries from two different annotators. ViMs can be used for both implementing and evaluating supervised machine learning-based systems for Vietnamese abstractive multi-document summarization.
S3_summary folder: Contains 300 subdirectories holding 600 "best sentence selection" summaries, the result of step 3 (the best-sentence-selection step). Sentences in a group are separated from others by a blank line. The most important sentence is labeled 1, while the others are labeled 0.
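A rough sketch of how such an S3_summary file might be parsed, based only on the description above; the exact on-disk layout is not specified here, so the assumption that each line starts with its 0/1 label followed by the sentence, and the example filename, are hypothetical.

```python
from pathlib import Path

def parse_s3_summary(path: str):
    """Split a file into sentence groups separated by blank lines.

    Assumed (hypothetical) line format: "<label> <sentence>", where label is 1
    for the most important sentence of the group and 0 for the others.
    """
    groups = []
    for block in Path(path).read_text(encoding="utf-8").split("\n\n"):
        group = []
        for line in block.strip().splitlines():
            label, _, sentence = line.partition(" ")
            group.append((int(label), sentence))
        if group:
            groups.append(group)
    return groups

# Example usage (hypothetical filename): pick the most important sentence of each group.
# best = [next(s for lab, s in g if lab == 1) for g in parse_s3_summary("cluster_001.txt")]
```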
@article{tran2020vims,
  title = {ViMs: a high-quality Vietnamese dataset for abstractive multi-document summarization},
  author = {Tran, Nhi-Thao and Nghiem, Minh-Quoc and Nguyen, Nhung TH and Nguyen, Ngan Luu-Thuy and Van Chi, Nam and Dinh, Dien},
  journal = {Language Resources and Evaluation},
  volume = {54},
  number = {4},
  pages = {893--920},
  year = {2020},
  publisher = {Springer}
}
Authors: Tran Mai Vu, Vu Trong Hoa, Phi Van Thuy, Le Duc Trong, Ha Quang Thuy
Affiliation: Knowledge Technology Laboratory, University of Technology, VNU Hanoi
Research Topic: Design and Implementation of a Multi-document Summarization Program for the Vietnamese Language, funded by the Ministry of Education (Project Code: B2012-01-24)
Data construction process: The data construction process is entirely manual. It consists of two steps:
Data information: Data Volume: 200 clusters
Each cluster corresponds to a folder, and it typically contains 2-5 documents (often 3). The folder's name represents the cluster.
Within each folder:
All files within the same folder represent documents (online articles) belonging to the cluster:
License: unknown (https://choosealicense.com/licenses/unknown/)
Multi-Document is a large-scale multi-document summarization dataset created from scientific articles. It introduces a challenging multi-document summarization task: writing the related-work section of a paper based on its abstract and the articles it references.
=====================================================================================
BUSUM-BNLP DATASET
=====================================================================================
A Public Dataset for the Multi-Document Update Summarization Task: Improving AI-Centered Information Retrieval
=====================================================================================
This is the README file for the BUSUM-BNLP dataset. Good performance in NLP projects depends on high-quality datasets. For the multi-document update summarization task, we surveyed existing NLP datasets and also built a new one in Bangla. In the literature we found many English datasets, such as DUC2002, DUC2007, Daily Mail, and the TAC dataset, and some Bangla datasets, such as the Bangla Summarization Dataset (Prothom Alo), the Bangla-Extra Sum Dataset, and the BNLPC Dataset. In several papers, DUC2002 and DUC2007 were used to generate update summaries, while Daily Mail was used for extractive and abstractive summaries. However, relatively little work has been done on Bengali summarization so far.
=====================================================================================
For our dataset, we have collected older and newer articles covering the same news stories from the online websites of Prothom Alo, Kaler Kontho, BBC News, Jugantor, etc. Both the old and the new articles can contain similar information. We sorted the old and new articles and developed our dataset for the multi-document update summarization task.
=====================================================================================
Our multi-document Bangla dataset can be used for keyword detection in multiple files related to a specific topic. You can develop various summarization models, including updated summarizers and generic summarizers, using machine learning and deep learning techniques. This dataset can serve as a valuable resource for inspiring and generating new datasets by scraping news articles and other sources. It can also aid in developing domain-specific title detection models and extracting relevant features to improve accuracy.
=====================================================================================
One can follow these methods for pre-processing the data:
Part-of-speech (POS) tagging: groups or organises text phrases according to word classes such as nouns, verbs, adverbs, adjectives, etc.
Stop-word removal: eliminates common words that carry no useful information, such as they, there, this, were, etc.
Discarding words containing digits: terms such as wordscapes59 or game5ts7 are hard to handle, so it is best to remove them or replace them with an empty string using regular expressions.
Removing extra white space: the regular expression library works well for removing unnecessary extra spaces.
Removing punctuation: there are 32 major punctuation marks to consider; the string module and a regular expression can be used to replace any punctuation in the text with an empty string.
Converting to the same case: if the text is in the same case throughout, a machine can process the words more easily, since machines treat lowercase and uppercase letters as different characters.
Named entity recognition: keywords in the text are identified and labelled by entity type (e.g., person, place, organisation).
Lemmatization or stemming: words are reduced to their base form (e.g., run is the base form of runs, running, ran), either by lemmatization (which matches words against a linguistic dictionary) or stemming (which removes suffixes and prefixes).
Expanding contractions: words like don't (meaning "do not") and aren't (meaning "are not") are contractions; expanding them makes sentence-processing tasks easier.
Tokenization: breaks text streams into tokens, which can be words, phrases, symbols, or other meaningful units.
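As a rough illustration of a few of these steps (stop-word removal, discarding digit-bearing tokens, whitespace and punctuation cleanup, lowercasing, and tokenization), the sketch below uses only the Python standard library; the tiny stop-word list is a placeholder, and real Bangla text would need a proper Bangla stop-word list and tokenizer.

```python
import re
import string

# Placeholder stop-word list for illustration; a real pipeline would use a Bangla list.
STOP_WORDS = {"they", "there", "this", "were", "is", "the", "a", "an"}

def preprocess(text: str) -> list[str]:
    text = text.lower()                                                # same case throughout
    text = text.translate(str.maketrans("", "", string.punctuation))  # cut punctuation
    text = re.sub(r"\s+", " ", text).strip()                          # erase extra white space
    tokens = text.split()                                             # naive tokenization
    tokens = [t for t in tokens if not re.search(r"\d", t)]           # discard digit-bearing terms
    tokens = [t for t in tokens if t not in STOP_WORDS]               # remove stop words
    return tokens

print(preprocess("They were playing Wordscapes59 there,   this is   GAME5ts7!"))
# -> ['playing']
```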
====================================================================================
During our project, we encountered a few limitations that affected our data collection and modeling efforts. Firstly, to collect news on a particular topic, we had to put in a considerable amount of effort, as newspapers do not publish news in a serial manner every day. This meant that we had to read through entire texts to select relevant news articles for our dataset. Additionally, generating human-generated summaries was not an easy task, as we had to read lengthy documents and condense them into shorter summaries. However, due to time constraints and difficulties in r...
Facebook
Twitterhttps://clarin.si/repository/xmlui/page/licence-aca-id-by-nc-inf-nored-1.0https://clarin.si/repository/xmlui/page/licence-aca-id-by-nc-inf-nored-1.0
Summarization datasets were created from the text bodies in the KAS 2.0 corpus (http://hdl.handle.net/11356/1448) and the abstracts from the KAS-Abs 2.0 corpus (http://hdl.handle.net/11356/1449). The monolingual slo2slo dataset contains 69,730 Slovene abstracts and Slovene body texts. The cross-lingual slo2eng dataset contains 52,351 Slovene body texts and English abstracts and is suitable for building cross-lingual summarization models. The total number of words represents the sum of the words in the bodies, the Slovene abstracts, and the English abstracts.
The files are stored in the same manner as the complete KAS corpus, i.e. in 1,000 directories with the same filename prefix as in KAS. They are in JSON format and contain chapter-segmented text. In addition to a unique chapter ID, each JSON file contains a key titled "abstract" whose value is a list with the abstract text as its first element. The file with the metadata for the corpus texts is also included.
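A minimal sketch of reading one such JSON file, based only on the layout described above; the directory path, filename, and the "chapters" key used for the body text are assumptions, since only the "abstract" key and the chapter segmentation are stated explicitly.

```python
import json
from pathlib import Path

# Hypothetical path within the 1,000-directory layout described above.
path = Path("kas-summarization/000/kas-000001.json")

with path.open(encoding="utf-8") as f:
    doc = json.load(f)

abstract = doc["abstract"][0]             # abstract text is the first element of a list
body_chapters = doc.get("chapters", [])   # assumed key for the chapter-segmented body
print(abstract[:200])
print(f"{len(body_chapters)} chapters")
```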
The datasets are suitable for training monolingual Slovene summarization models and cross-lingual Slovene-English summarization models on long texts.
References: Žagar, A., Kavaš, M., & Robnik Šikonja, M. (2021). Corpus KAS 2.0: cleaner and with new datasets. In Information Society - IS 2021: Proceedings of the 24th International Multiconference. https://doi.org/10.5281/zenodo.5562228
GovReport dataset for summarization
Dataset for summarization of long documents. Adapted from this repo and this paper. This dataset is compatible with the run_summarization.py script from Transformers if you add this line to the summarization_name_mapping variable: "ccdv/govreport-summarization": ("report", "summary")
Data Fields
id: paper id
report: a string containing the body of the report
summary: a string containing the summary of the report
Data Splits… See the full description on the dataset page: https://huggingface.co/datasets/ccdv/govreport-summarization.
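To make the mapping instruction above concrete, here is a hedged, standalone sketch of the dict entry as it might be added inside the Transformers run_summarization.py example script; the script's other entries are omitted.

```python
# Sketch only: in the Transformers example script run_summarization.py, the
# summarization_name_mapping dict maps a dataset name to its (text, summary) columns.
# Per the card above, the GovReport entry would look like this:
summarization_name_mapping = {
    "ccdv/govreport-summarization": ("report", "summary"),
    # ... the script's existing entries remain here ...
}
```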
https://dataintelo.com/privacy-and-policy
According to our latest research, the global Document Summarization AI market size reached USD 1.54 billion in 2024, reflecting robust adoption across industries. The market is projected to expand at a CAGR of 23.7% from 2025 to 2033, driven by increasing demand for automated content processing and smarter information retrieval. By 2033, the Document Summarization AI market is forecasted to achieve a value of USD 12.38 billion, underlining its transformative impact on enterprise operations and knowledge management. This rapid growth is primarily fueled by the proliferation of unstructured data, the need for efficient decision-making, and advancements in natural language processing (NLP) technologies.
One of the primary growth factors for the Document Summarization AI market is the exponential surge in digital content generation across industries. Enterprises, government agencies, and academic institutions are inundated with vast volumes of unstructured data, including emails, reports, legal documents, and research papers. Manual processing of such data is time-consuming, error-prone, and often leads to information overload. The integration of AI-driven summarization tools enables organizations to extract key insights, reduce redundancy, and accelerate workflow automation. As a result, businesses are able to enhance productivity, improve compliance, and make data-driven decisions more efficiently. This growing need for automation and intelligent data curation is a critical driver propelling the adoption of Document Summarization AI solutions worldwide.
Another significant factor contributing to market growth is the advancement and democratization of natural language processing (NLP) and machine learning algorithms. Leading AI vendors are investing heavily in research and development to enhance the accuracy, context-awareness, and linguistic versatility of summarization models. With the evolution of transformer-based architectures and large language models, Document Summarization AI tools are now capable of handling complex, domain-specific content with greater precision. Moreover, the availability of cloud-based AI services has lowered the entry barriers for small and medium-sized enterprises (SMEs), enabling them to leverage sophisticated summarization capabilities without significant upfront investment. This technological progress, coupled with growing awareness about the benefits of AI-powered document management, is expected to sustain high growth momentum over the forecast period.
The Document Summarization AI market is also witnessing strong traction due to regulatory and compliance requirements, particularly in sectors like BFSI, healthcare, and legal. Stringent data governance frameworks and the need for timely, accurate reporting are prompting organizations to automate document review and summarization processes. Additionally, the rise of remote work and digital collaboration has intensified the demand for solutions that can streamline knowledge sharing and information dissemination across distributed teams. As organizations continue to embrace digital transformation, the strategic value of Document Summarization AI in enhancing operational agility, reducing manual workload, and mitigating compliance risks is becoming increasingly evident. This confluence of regulatory, technological, and business drivers is expected to shape the market landscape in the coming years.
From a regional perspective, North America currently dominates the Document Summarization AI market, accounting for the largest share in 2024. The region's leadership is attributed to the early adoption of AI technologies, a mature digital infrastructure, and a strong presence of key market players. However, Asia Pacific is emerging as the fastest-growing region, driven by rapid digitalization, expanding enterprise IT budgets, and government initiatives to foster AI innovation. Europe is also witnessing substantial growth, supported by robust regulatory frameworks and increasing investments in AI research. Meanwhile, Latin America and the Middle East & Africa are gradually catching up, with rising awareness and adoption among enterprises and public sector organizations. The global market is thus characterized by dynamic regional trends, with each geography presenting unique opportunities and challenges for stakeholders.
The Document Summarization AI market is segmented by component into
GovReport is a dataset for long-document summarization, with significantly longer documents and summaries. It consists of reports written by government research agencies, including the Congressional Research Service and the U.S. Government Accountability Office. Compared with other long-document summarization datasets, GovReport has longer summaries and documents and requires reading more context to cover the salient content to be summarized.
https://www.shibatadb.com/license/data/proprietary/v1.0/license.txt
Yearly citation counts for the publication titled "A study of extractive summarization of long documents incorporating local topic and hierarchical information".
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
A text summarisation task aims to convert a longer text into a shorter text while preserving the essential information of the source text. In general, there are two approaches to text summarisation. The extractive approach reuses the most important sentences or parts of the source text verbatim, whereas the abstractive approach produces summaries more similar to human-written ones. We release 5 models that cover extractive, abstractive, and hybrid types:
Metamodel: a neural model based on the Doc2Vec document representation that suggests the best summariser.
Graph-based model: an unsupervised graph-based extractive approach that returns the N most relevant sentences.
Headline model: a supervised abstractive approach (T5 architecture) that returns headline-like abstracts.
Article model: a supervised abstractive approach (T5 architecture) that returns short summaries.
Hybrid-long model: an unsupervised hybrid (graph-based and transformer-based) approach that returns short summaries of long texts.
Details and instructions to run and train the models are available at https://github.com/clarinsi/SloSummarizer.
The web service with a demo is available at https://slovenscina.eu/povzemanje.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
🇺🇸 English:
This synthetic dataset is created for learning and testing abstractive text summarization models. Each row contains a news-style article and a short summary. The dataset is ideal for experimenting with HuggingFace models such as t5-base, facebook/bart-large-cnn, or google/pegasus-xsum.
🇹🇷 Turkish:
This synthetic dataset is designed for those who want to generate summaries from news texts. Each row contains a long English news article and a corresponding short summary. It is compatible with T5, BART, and Pegasus models.
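As a quick, hedged illustration of experimenting with one of the models named above, the snippet below runs the standard Transformers summarization pipeline on a single article string; the dataset's column names and loading details are not specified in this card, so a plain string stands in for one row's article text.

```python
from transformers import pipeline

# One of the models suggested by the card; t5-base or google/pegasus-xsum work similarly.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

article = (
    "A long news-style article goes here. In practice this would be the "
    "article text from one row of the dataset."
)
summary = summarizer(article, max_length=60, min_length=10, do_sample=False)
print(summary[0]["summary_text"])
```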
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset consists of 100 news articles covering the natural disaster and events domains in Malaysia. The news articles are formatted in XML following the DUC 2002 text summarization dataset preparation. A total of 300 human summaries from 3 domain experts is included for summary evaluation. Our MyTextSum model, which applies a Pattern-Growth Sentence Compression technique, achieved a promising F-measure of 0.5752 when evaluated against the human summaries and performs better than the baseline (uncompressed) model.
Citation Reference:
Alias, S., Sainin, M. S., & Mohammad, S. K. (2020). Model Peringkasan Teks Ekstraktif Dwibahasa menggunakan Fitur Kekangan Corak Tekstual (Bilingual Extractive Text Summarization Model using Textual Pattern Constraints). GEMA Online® Journal of Language Studies, 20(3).
Alias S., Mohammad S.K., Gan K.H., Ping T.T. (2018) MYTextSum: A Malay Text Summarizer Model Using a Constrained Pattern-Growth Sentence Compression Technique. In: Alfred R., Iida H., Ag. Ibrahim A., Lim Y. (eds) Computational Science and Technology. ICCST 2017. Lecture Notes in Electrical Engineering, vol 488. Springer, Singapore. https://doi.org/10.1007/978-981-10-8276-4_14
VideoXum is a large-scale video summarization dataset that contains 14,001 long videos with corresponding human-annotated video and text summaries.
According to our latest research, the Document Summarization AI market size reached USD 1.42 billion globally in 2024, driven by the accelerating adoption of artificial intelligence across industries and the exponential growth in unstructured data. The market is projected to expand at a robust CAGR of 22.6% from 2025 to 2033, reaching approximately USD 10.86 billion by 2033. This remarkable growth is largely attributed to advancements in natural language processing, increasing demand for automation in document management, and the need for efficient information retrieval from massive data repositories.
The rapid digital transformation across sectors such as legal, healthcare, BFSI, and government has been a significant growth factor for the Document Summarization AI market. Organizations are dealing with an unprecedented volume of digital documents, emails, reports, and legal papers, making manual summarization inefficient and error-prone. AI-powered document summarization solutions are being rapidly adopted to automate this process, enabling faster decision-making, enhanced productivity, and substantial cost savings. The integration of advanced NLP techniques, including transformer-based models and deep learning, has further improved the accuracy and relevance of AI-generated summaries, making them invaluable for knowledge workers and executives who need to process large amounts of information quickly.
Another key driver for the Document Summarization AI market is the growing emphasis on compliance and risk management, especially in highly regulated industries like finance and healthcare. Automated summarization tools help organizations extract critical information from lengthy compliance documents, contracts, and medical records, ensuring that essential details are not overlooked. This capability is crucial for meeting regulatory requirements, avoiding legal pitfalls, and maintaining robust audit trails. Furthermore, the increasing use of AI-based summarization in customer service chatbots and virtual assistants is enhancing user experiences by providing concise, contextually relevant responses, thereby improving customer satisfaction and loyalty.
The proliferation of cloud computing and the availability of scalable AI platforms have also contributed significantly to market expansion. Cloud-based document summarization AI solutions offer businesses the flexibility to deploy and scale services according to their needs, reducing infrastructure costs and facilitating seamless integration with existing enterprise workflows. Additionally, the democratization of AI through APIs and low-code/no-code platforms is enabling small and medium enterprises (SMEs) to leverage advanced document summarization capabilities without the need for extensive technical expertise. This trend is expected to further boost market penetration across diverse industry verticals in the coming years.
From a regional perspective, North America currently dominates the Document Summarization AI market, accounting for the largest revenue share in 2024. The region's leadership can be attributed to the strong presence of leading AI technology providers, high digital adoption rates, and significant investments in research and development. Europe follows closely, driven by stringent data privacy regulations and increasing demand for automation in public and private sectors. Meanwhile, the Asia Pacific region is witnessing the fastest growth, propelled by rapid digitalization, expanding enterprise IT infrastructures, and rising awareness of AI-driven document management solutions among businesses and government agencies.
In the realm of education, Lecture Summarization Tools are becoming increasingly vital as they offer a streamlined way for students and educators to process and retain vast amounts of information. These tools utilize advanced AI algorithms to distill lecture content into concise summaries, making it easier for students to review and comprehend complex subjects. By integrating lecture summarization capabilities, educational institutions can enhance learning outcomes, provide personalized study materials, and support diverse learning styles. As the demand for digital learning solutions grows, the role of Lecture Summarization Tools in education is set to expand, offering significant be
BSD 3-Clause License: https://choosealicense.com/licenses/bsd-3-clause/
BOOKSUM: A Collection of Datasets for Long-form Narrative Summarization
Authors: Wojciech Kryściński, Nazneen Rajani, Divyansh Agarwal, Caiming Xiong, Dragomir Radev
Introduction
The majority of available text summarization datasets include short-form source documents that lack long-range causal and temporal dependencies, and often contain strong layout and stylistic biases. While relevant, such datasets will offer limited challenges for future generations of text… See the full description on the dataset page: https://huggingface.co/datasets/kmfoda/booksum.
Open Data Commons Attribution License (ODC-BY): https://choosealicense.com/licenses/odc-by/
High Quality Long Text Summarization Dataset
Input texts from the agentlans/high-quality-text-long dataset (sample_k10000 config)
Summaries generated by google/gemma-3-12b-it
Summaries rewritten by agentlans/granite-3.3-2b-reviser