100+ datasets found
  1. Arxiv Summary Dataset

    • kaggle.com
    zip
    Updated Nov 26, 2023
    Cite
    Syndrigasti (2023). Arxiv Summary Dataset [Dataset]. https://www.kaggle.com/datasets/syndri224/arxiv-summary-dataset
    Explore at:
    Available download formats: zip (2242895176 bytes)
    Dataset updated
    Nov 26, 2023
    Authors
    Syndrigasti
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    @inproceedings{cohan-etal-2018-discourse,
      title = "A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents",
      author = "Cohan, Arman and Dernoncourt, Franck and Kim, Doo Soon and Bui, Trung and Kim, Seokhwan and Chang, Walter and Goharian, Nazli",
      booktitle = "Proceedings of the 2018 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)",
      month = jun,
      year = "2018",
      address = "New Orleans, Louisiana",
      publisher = "Association for Computational Linguistics",
      url = "https://aclanthology.org/N18-2097",
      doi = "10.18653/v1/N18-2097",
      pages = "615--621",
      abstract = "Neural abstractive summarization models have led to promising results in summarizing relatively short documents. We propose the first model for abstractive summarization of single, longer-form documents (e.g., research papers). Our approach consists of a new hierarchical encoder that models the discourse structure of a document, and an attentive discourse-aware decoder to generate the summary. Empirical results on two large-scale datasets of scientific papers show that our model significantly outperforms state-of-the-art models.",
    }

    Adapted from: https://github.com/armancohan/long-summarization and https://huggingface.co/datasets/ccdv/arxiv-summarization

  2. Data from: Legal Case Document Summarization: Extractive and Abstractive...

    • zenodo.org
    • data.niaid.nih.gov
    • +1more
    bin, zip
    Updated Nov 23, 2022
    Cite
    Abhay Shukla; Paheli Bhattacharya; Soham Poddar; Rajdeep Mukherjee; Kripabandhu Ghosh; Pawan Goyal; Saptarshi Ghosh (2022). Legal Case Document Summarization: Extractive and Abstractive Methods and their Evaluation [Dataset]. http://doi.org/10.5281/zenodo.7151679
    Explore at:
    Available download formats: bin, zip
    Dataset updated
    Nov 23, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Abhay Shukla; Paheli Bhattacharya; Soham Poddar; Rajdeep Mukherjee; Kripabandhu Ghosh; Pawan Goyal; Saptarshi Ghosh
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This repository contains the following three datasets for legal document summarization:

    - IN-Abs: Indian Supreme Court case documents and their abstractive summaries, obtained from http://www.liiofindia.org/in/cases/cen/INSC/
    - IN-Ext: Indian Supreme Court case documents and their extractive summaries, written by two law experts (A1, A2).
    - UK-Abs: United Kingdom (U.K.) Supreme Court case documents and their abstractive summaries, obtained from https://www.supremecourt.uk/decided-cases/

    Please refer to the paper and the README file for more details.

  3. pubmed-summarization

    • huggingface.co
    • opendatalab.com
    Updated Dec 1, 2021
    Cite
    ccdv (2021). pubmed-summarization [Dataset]. https://huggingface.co/datasets/ccdv/pubmed-summarization
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Dec 1, 2021
    Authors
    ccdv
    Description

    PubMed dataset for summarization

    PubMed dataset for summarization of long documents. Adapted from this repo. Note that the original data are pre-tokenized, so this dataset returns " ".join(text) and adds " " for paragraphs. This dataset is compatible with the run_summarization.py script from Transformers if you add this line to the summarization_name_mapping variable: "ccdv/pubmed-summarization": ("article", "abstract")

    Data Fields

    id: paper id
    article: a string containing the body… See the full description on the dataset page: https://huggingface.co/datasets/ccdv/pubmed-summarization.
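    The column mapping suggested on the card can be sketched as follows. This is a local stand-in, not the Transformers script itself: run_summarization.py keeps a dict like this to pick the text and summary columns for a given dataset name.

    ```python
    # Local stand-in for the summarization_name_mapping entry from the card;
    # run_summarization.py looks up (text_column, summary_column) by dataset name.
    summarization_name_mapping = {
        "ccdv/pubmed-summarization": ("article", "abstract"),
    }

    text_column, summary_column = summarization_name_mapping["ccdv/pubmed-summarization"]
    print(text_column, summary_column)  # article abstract
    ```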

  4. CCDV Arxiv Summarization Dataset

    • kaggle.com
    zip
    Updated Dec 5, 2023
    Cite
    The Devastator (2023). CCDV Arxiv Summarization Dataset [Dataset]. https://www.kaggle.com/datasets/thedevastator/ccdv-arxiv-summarization-dataset
    Explore at:
    Available download formats: zip (2219742528 bytes)
    Dataset updated
    Dec 5, 2023
    Authors
    The Devastator
    License

    CC0 1.0 Universal (Public Domain): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    CCDV Arxiv Summarization Dataset

    Arxiv Summarization Dataset for CCDV

    By ccdv (From Huggingface) [source]

    About this dataset

    The validation.csv file contains a set of articles along with their respective abstracts that can be used for validating the performance of summarization models. This subset allows researchers to fine-tune their models and measure how well they can summarize scientific texts.

    The train.csv file serves as the primary training data for building summarization models. It consists of numerous articles extracted from the Arxiv database, paired with their corresponding abstracts. By utilizing this file, researchers can develop and train various machine learning algorithms to generate accurate summaries of scientific papers.

    Lastly, the test.csv file provides a separate set of articles with accompanying abstracts specifically intended for evaluating the performance and effectiveness of summarization models developed using this dataset. Researchers can utilize this test set to conduct rigorous evaluations and benchmark different approaches in automatic document summarization.

    With columns labeled article and abstract, this dataset provides significant flexibility for developing robust models that summarize complex scientific documents, and its structure supports detailed analysis or multiple proposed summaries per article if required.

    How to use the dataset

    • File Description:

    • validation.csv: This file contains articles and their respective abstracts that can be used for validation purposes.

    • train.csv: The purpose of this file is to provide training data for summarizing scientific articles.

    • test.csv: This file includes a set of articles and their corresponding abstracts that can be used to evaluate the performance of summarization models.

    • Dataset Structure: The dataset consists of two columns, article and abstract.

    • Usage Examples: This dataset can be utilized in various ways:

    a) Training Models: You can use the train.csv file to train your own model for summarizing scientific articles from the Arxiv database. The article column provides the full text of each scientific paper, while the abstract column contains its summary.

    b) Validation: The validation.csv file allows you to validate your trained models by comparing their generated summaries with the provided reference summaries in order to assess their performance.

    c) Evaluation: Utilize the test.csv file as a benchmark for evaluating different summarization models. Generate summaries using your selected model and compare them with reference summaries.

    • Evaluating Performance: To measure how well your summarization model performs on this dataset, you can use metrics like ROUGE (Recall-Oriented Understudy for Gisting Evaluation). ROUGE measures overlap between generated summaries and reference summaries based on n-gram co-occurrence statistics.

    Note: Please do not include any dates in your guide or refer to specific versions/examples within this dataset, as it may require regular updates/revisions.
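    The ROUGE-based evaluation described above can be sketched with a minimal ROUGE-1 recall computed from unigram overlap. This is an illustration only; real evaluations should use a maintained package such as rouge-score.

    ```python
    from collections import Counter

    def rouge1_recall(generated: str, reference: str) -> float:
        """ROUGE-1 recall: fraction of reference unigrams covered by the generated summary."""
        gen = Counter(generated.lower().split())
        ref = Counter(reference.lower().split())
        if not ref:
            return 0.0
        # Clipped unigram overlap between generated and reference summaries.
        overlap = sum(min(count, gen[word]) for word, count in ref.items())
        return overlap / sum(ref.values())

    print(rouge1_recall("the model summarizes papers",
                        "the model summarizes scientific papers"))  # 0.8
    ```

    ROUGE-1 is the simplest member of the ROUGE family; ROUGE-2 and ROUGE-L extend the same idea to bigrams and longest common subsequences.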

    Research Ideas

    • Summarizing scientific articles: This dataset can be used to train and evaluate summarization models for the task of generating concise summaries of scientific articles from the Arxiv database. Researchers can utilize this dataset to develop novel techniques and approaches for automatic summarization in the scientific domain.
    • Information retrieval: The dataset can be used to enhance search engines or information retrieval systems by providing concise summaries along with the full text of scientific articles. This would enable users to quickly grasp key information without having to read the entire article, improving accessibility and efficiency.
    • Text generation research: Researchers interested in natural language processing and text generation can use this dataset as a benchmark for developing new models and algorithms that generate coherent, informative, and concise summaries from lengthy scientific texts. The dataset provides a diverse range of articles across various domains, allowing researchers to explore different challenges in summary generation

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain...

  5. long-context-text-summarization-alpaca-format

    • huggingface.co
    Updated Nov 2, 2024
    Cite
    Antash Mishra (2024). long-context-text-summarization-alpaca-format [Dataset]. https://huggingface.co/datasets/antash420/long-context-text-summarization-alpaca-format
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Nov 2, 2024
    Authors
    Antash Mishra
    Description

    antash420/long-context-text-summarization-alpaca-format dataset hosted on Hugging Face and contributed by the HF Datasets community

  6. Vietnamese Multi Document Summarization Dataset

    • kaggle.com
    zip
    Updated Oct 1, 2023
    Cite
    vũ trần anh (2023). Vietnamese Multi Document Summarization Dataset [Dataset]. https://www.kaggle.com/datasets/vtrnanh/sust-feature-data-new
    Explore at:
    Available download formats: zip (8072886 bytes)
    Dataset updated
    Oct 1, 2023
    Authors
    vũ trần anh
    Description

    Vietnamese Multiple Document Summarization Dataset

    ViMs Dataset (ViMs folder)

    This work was supported by the Ho Chi Minh City Department of Science and Technology, Grant Number 15/2016/HÐ-SKHCN.

    Data construction process: In this work, we aim to have 300 clusters of documents extracted from news. To this end, we made use of the Vietnamese language version of Google News. Due to the copyright issue, we did not collect articles from every source listed on Google News, but limited to some sources that are open for research purposes. The collected articles belong to five genres: world news, domestic news, business, entertainment, and sports. Every cluster contains from four to ten news articles. Each article is represented by the following information: the title, the plain text content, the news source, the date of publication, the author(s), the tag(s) and the headline summary.

    After that, two summaries are created for each cluster (produced in the first subtask above) by two different annotators using the MDSWriter system (Meyer, Christian M., et al. "MDSWriter: Annotation tool for creating high-quality multi-document summarization corpora." Proceedings of ACL-2016 System Demonstrations). These annotators are Vietnamese native speakers, undergraduate or graduate students, and most are familiar with natural language processing. The full annotation process consists of seven steps that must be completed sequentially.

    Data information: Original folder: Contains 300 subdirectories, one per news cluster. Articles (documents) in each cluster belong to a similar topic, with four to ten per cluster. The total number of articles is 1,945.

    Summary folder: Contains 300 subdirectories holding 600 final summaries; every input cluster has two manual abstractive summaries from two different annotators. ViMs can be used for both implementing and evaluating supervised machine learning-based systems for Vietnamese abstractive multi-document summarization.

    S3_summary folder: Contains 300 subdirectories including 600 "best sentence selection" summaries, the result of step 3 (best sentence selection). Sentences in a group are separated from others by a blank line. The most important sentence is labeled 1, while 0 is the label for the others.
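    Under the layout just described, the top-ranked sentence of each group could be pulled out with a small parser like the one below. This is a sketch; the tab separator between label and sentence is an assumption, since the description does not specify the exact delimiter.

    ```python
    def best_sentences(raw: str) -> list[str]:
        """Collect the sentence labeled 1 from each blank-line-separated group."""
        picks = []
        for group in raw.strip().split("\n\n"):      # groups separated by a blank line
            for line in group.splitlines():
                label, _, sentence = line.partition("\t")  # assumed "label<TAB>sentence"
                if label.strip() == "1":
                    picks.append(sentence.strip())
        return picks

    sample = "0\tFirst sentence.\n1\tKey sentence A.\n\n0\tAnother.\n1\tKey sentence B."
    print(best_sentences(sample))  # ['Key sentence A.', 'Key sentence B.']
    ```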

    @article{tran2020vims,
      title = {ViMs: a high-quality Vietnamese dataset for abstractive multi-document summarization},
      author = {Tran, Nhi-Thao and Nghiem, Minh-Quoc and Nguyen, Nhung TH and Nguyen, Ngan Luu-Thuy and Van Chi, Nam and Dinh, Dien},
      journal = {Language Resources and Evaluation},
      volume = {54},
      number = {4},
      pages = {893--920},
      year = {2020},
      publisher = {Springer}
    }

    Vietnamese MDS (clusters folder)

    Author: Tran Mai Vu, Vu Trong Hoa, Phi Van Thuy, Le Duc Trong, Ha Quang Thuy
    Affiliation: Knowledge Technology Laboratory, University of Technology, VNU Hanoi
    Research Topic: Design and Implementation of a Multi-document Summarization Program for the Vietnamese Language, funded by the Ministry of Education (Project Code: B2012-01-24)

    Data construction process: The data construction process is entirely manual. It consists of two steps:

    • Step 1: Data Collection and Clustering - Data is collected from the Baomoi website and organized into clusters, where each cluster contains documents related to a specific topic. The data is collected from various subjects on Baomoi, typically encompassing 8-10 main categories, such as World, Society, Culture, Economics, Science-Technology, Sports, Entertainment, Law, Education, Health, Automobiles, and Real Estate.
    • Step 2: Summarization - Two individuals participate in creating reference summaries. The summarization process includes two stages: extracting important sentences and rewriting them into a coherent paragraph.

    Data information: Data Volume: 200 clusters

    Each cluster corresponds to a folder, and it typically contains 2-5 documents (often 3). The folder's name represents the cluster.

    Within each folder:

    • .info: Contains cluster ID and cluster label (labels are assigned by the cluster creator based on content).
    • .ref1.txt: Reference summary created by summarizer 1.
    • .ref1.tok.txt: Tokenized version of reference summary 1, with sentences and words separated.
    • .ref2.txt: Reference summary created by summarizer 2.
    • .ref2.tok.txt: Tokenized version of reference summary 2, with sentences and words separated.
    • .sum.txt: Machine-generated summary.
    • .sum.tok.txt: Tokenized version of the machine-generated summary, with sentences and words separated.

    All files within the same folder represent documents (online articles) belonging to the cluster:

    • .body.txt: Contains the main content of the document.
    • .body.tok.txt: Contains the document's content with sentences and words separated.
    • .info.txt: Contains other in...
  7. multi_document_summarization

    • huggingface.co
    Updated Feb 8, 2024
    Cite
    Arka Das (2024). multi_document_summarization [Dataset]. https://huggingface.co/datasets/arka0821/multi_document_summarization
    Explore at:
    Dataset updated
    Feb 8, 2024
    Authors
    Arka Das
    License

    Unknown (https://choosealicense.com/licenses/unknown/)

    Description

    Multi-Document, a large-scale multi-document summarization dataset created from scientific articles. Multi-Document introduces a challenging multi-document summarization task: writing the related-work section of a paper based on its abstract and the articles it references.

  8. BUSUM-BNLP Dataset (Multi-Document Bangla Summary)

    • kaggle.com
    zip
    Updated Oct 11, 2023
    Cite
    Marwa_Nurtaj (2023). BUSUM-BNLP Dataset (Multi-Document Bangla Summary) [Dataset]. https://www.kaggle.com/datasets/marwanurtaj/busum-bnlp-dataset-multi-document-bangla-summary
    Explore at:
    Available download formats: zip (1048000 bytes)
    Dataset updated
    Oct 11, 2023
    Authors
    Marwa_Nurtaj
    Area covered
    Büsum
    Description

    =====================================================================================

    BUSUM-BNLP DATASET

    =====================================================================================

    A Public Dataset for Multi-Document Update Summarization Task: To Improve the artificial intelligence-centered Information Retrieval Mission

    =====================================================================================

    This is the README file for the BUSUM-BNLP dataset. Good performance in NLP projects depends on high-quality datasets. For the multi-document update summarization task, we researched existing NLP datasets and also generated a new one in Bangla. Surveying the literature, we found many English datasets, such as DUC2002, DUC2007, Daily Mail, and the TAC dataset, and some Bangla datasets, such as the Bangla Summarization Dataset (Prothom Alo), the Bangla-Extra Sum Dataset, and the BNLPC Dataset. In several papers, DUC2002 and DUC2007 were used to generate update summaries, while Daily Mail was tested for extractive and abstractive summaries. However, relatively little work has been done on Bengali summarization so far.

    =====================================================================================

    For our dataset, we collected older and newer articles covering the same news stories from the online websites of Prothom Alo, Kaler Kontho, BBC News, Jugantor, etc. Both old and new articles can contain similar information. We sorted the old and new news and built our dataset for the multi-document update summarization task.

    =====================================================================================

    Our multi-document Bangla dataset can be used for keyword detection in multiple files related to a specific topic. You can develop various summarization models, including updated summarizers and generic summarizers, using machine learning and deep learning techniques. This dataset can serve as a valuable resource for inspiring and generating new datasets by scraping news articles and other sources. It can also aid in developing domain-specific title detection models and extracting relevant features to improve accuracy.

    =====================================================================================

    One can follow these methods for pre-processing the data:

    • Part-of-Speech (POS) Tagging: groups or organises text phrases by language type, such as nouns, verbs, adverbs, adjectives, etc.
    • Cleaning Stop Words: eliminates common words that carry no useful information, such as they, there, this, were, etc.
    • Discarding words or numerals containing digits: terms such as wordscapes59 or game5ts7 are hard to handle, so it is best to remove them or replace them with an empty string using regular expressions.
    • Erasing extra white space: the regular expression library works well for removing unnecessary extra spaces.
    • Cutting off punctuation: there are 32 major punctuation marks to consider; the string module and a regular expression can replace any punctuation in the text with an empty string.
    • Converting to the same case: if the text is in the same case throughout, a computer can process the words more easily, since machines treat lowercase and uppercase letters differently.
    • Named Entity Recognition: keywords in the text should be identified with entity labels (i.e., person, place, company name, etc.).
    • Lemmatization or stemming: reduces words to their base form (e.g., run is the base of runs, running, ran) via lemmatization (which matches words against a linguistic dictionary) or stemming (which removes suffixes and prefixes).
    • Expanding contractions: words like don't ("do not") and aren't ("are not") are contractions; expanding them makes sentence-processing tasks easier.
    • Tokenization: breaks the text stream into tokens, which can be words, phrases, symbols, or other significant pieces of information.
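    Several of these cleaning steps can be combined into a short standard-library pipeline, sketched below. The stop-word list is a tiny illustrative placeholder, and steps such as POS tagging, NER, and lemmatization would need an NLP library (e.g., NLTK or spaCy) rather than the standard library.

    ```python
    import re
    import string

    STOP_WORDS = {"they", "there", "this", "were"}  # placeholder list, not exhaustive

    def preprocess(text: str) -> list[str]:
        text = text.lower()                                    # same case throughout
        text = re.sub(r"\w*\d\w*", " ", text)                  # discard digit-bearing terms
        text = text.translate(str.maketrans("", "", string.punctuation))  # cut punctuation
        text = re.sub(r"\s+", " ", text).strip()               # erase extra white space
        return [tok for tok in text.split() if tok not in STOP_WORDS]  # tokenize + stop words

    print(preprocess("They were playing wordscapes59,  this game!"))  # ['playing', 'game']
    ```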

    ====================================================================================

    During our project, we encountered a few limitations that affected our data collection and modeling efforts. Firstly, to collect news on a particular topic, we had to put in a considerable amount of effort, as newspapers do not publish news in a serial manner every day. This meant that we had to read through entire texts to select relevant news articles for our dataset. Additionally, generating human-generated summaries was not an easy task, as we had to read lengthy documents and condense them into shorter summaries. However, due to time constraints and difficulties in r...

  9. Data from: Summarization datasets from the KAS corpus KAS-Sum 1.0

    • live.european-language-grid.eu
    binary format
    Updated Feb 3, 2022
    Cite
    (2022). Summarization datasets from the KAS corpus KAS-Sum 1.0 [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/20154
    Explore at:
    Available download formats: binary format
    Dataset updated
    Feb 3, 2022
    License

    https://clarin.si/repository/xmlui/page/licence-aca-id-by-nc-inf-nored-1.0

    Description

    Summarization datasets were created from the text bodies in the KAS 2.0 corpus (http://hdl.handle.net/11356/1448) and the abstracts from the KAS-Abs 2.0 corpus (http://hdl.handle.net/11356/1449). The monolingual slo2slo dataset contains 69,730 Slovene abstracts and Slovene body texts. The cross-lingual slo2eng dataset contains 52,351 Slovene body texts and English abstracts and is suitable for building cross-lingual summarization models. The total number of words represents the sum of words in the bodies, Slovene abstracts, and English abstracts.

    The files are stored in the same manner as the complete KAS corpus, i.e. in 1,000 directories with the same filename prefix as in KAS. They are in the JSON format that contains chapter segmented text. In addition to a unique chapter ID, each JSON file contains a key titled “abstract” that contains a list with abstract text as its first element. The file with the metadata for the corpus texts is also included.
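    Reading one of these JSON files could look like the sketch below. The chapter layout shown is an assumption; the description only guarantees chapter-segmented text, a chapter ID, and an "abstract" key whose list holds the abstract text as its first element.

    ```python
    import json

    # Stand-in for one KAS-Sum file; real files follow the KAS naming scheme,
    # and the "chapters" structure here is an assumed layout for illustration.
    raw = json.dumps({
        "abstract": ["Povzetek naloge ..."],           # first element is the abstract text
        "chapters": [{"id": "1", "text": "Uvod ..."}],
    })

    doc = json.loads(raw)
    abstract = doc["abstract"][0]
    body = " ".join(ch["text"] for ch in doc["chapters"])
    print(abstract)  # Povzetek naloge ...
    print(body)      # Uvod ...
    ```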

    The datasets are suitable for training monolingual Slovene summarization models and cross-lingual Slovene-English summarization models on long texts.

    References: Žagar, A., Kavaš, M., & Robnik Šikonja, M. (2021). Corpus KAS 2.0: cleaner and with new datasets. In Information Society - IS 2021: Proceedings of the 24th International Multiconference. https://doi.org/10.5281/zenodo.5562228

  10. govreport-summarization

    • huggingface.co
    • opendatalab.com
    Updated Aug 25, 2022
    Cite
    ccdv (2022). govreport-summarization [Dataset]. https://huggingface.co/datasets/ccdv/govreport-summarization
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Aug 25, 2022
    Authors
    ccdv
    Description

    GovReport dataset for summarization

    GovReport dataset for summarization of long documents. Adapted from this repo and this paper. This dataset is compatible with the run_summarization.py script from Transformers if you add this line to the summarization_name_mapping variable: "ccdv/govreport-summarization": ("report", "summary")

    Data Fields

    id: paper id
    report: a string containing the body of the report
    summary: a string containing the summary of the report

    Data Splits… See the full description on the dataset page: https://huggingface.co/datasets/ccdv/govreport-summarization.
  11. Document Summarization AI Market Research Report 2033

    • dataintelo.com
    csv, pdf, pptx
    Updated Sep 30, 2025
    Cite
    Dataintelo (2025). Document Summarization AI Market Research Report 2033 [Dataset]. https://dataintelo.com/report/document-summarization-ai-market
    Explore at:
    Available download formats: pdf, csv, pptx
    Dataset updated
    Sep 30, 2025
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Document Summarization AI Market Outlook



    According to our latest research, the global Document Summarization AI market size reached USD 1.54 billion in 2024, reflecting robust adoption across industries. The market is projected to expand at a CAGR of 23.7% from 2025 to 2033, driven by increasing demand for automated content processing and smarter information retrieval. By 2033, the Document Summarization AI market is forecasted to achieve a value of USD 12.38 billion, underlining its transformative impact on enterprise operations and knowledge management. This rapid growth is primarily fueled by the proliferation of unstructured data, the need for efficient decision-making, and advancements in natural language processing (NLP) technologies.




    One of the primary growth factors for the Document Summarization AI market is the exponential surge in digital content generation across industries. Enterprises, government agencies, and academic institutions are inundated with vast volumes of unstructured data, including emails, reports, legal documents, and research papers. Manual processing of such data is time-consuming, error-prone, and often leads to information overload. The integration of AI-driven summarization tools enables organizations to extract key insights, reduce redundancy, and accelerate workflow automation. As a result, businesses are able to enhance productivity, improve compliance, and make data-driven decisions more efficiently. This growing need for automation and intelligent data curation is a critical driver propelling the adoption of Document Summarization AI solutions worldwide.




    Another significant factor contributing to market growth is the advancement and democratization of natural language processing (NLP) and machine learning algorithms. Leading AI vendors are investing heavily in research and development to enhance the accuracy, context-awareness, and linguistic versatility of summarization models. With the evolution of transformer-based architectures and large language models, Document Summarization AI tools are now capable of handling complex, domain-specific content with greater precision. Moreover, the availability of cloud-based AI services has lowered the entry barriers for small and medium-sized enterprises (SMEs), enabling them to leverage sophisticated summarization capabilities without significant upfront investment. This technological progress, coupled with growing awareness about the benefits of AI-powered document management, is expected to sustain high growth momentum over the forecast period.




    The Document Summarization AI market is also witnessing strong traction due to regulatory and compliance requirements, particularly in sectors like BFSI, healthcare, and legal. Stringent data governance frameworks and the need for timely, accurate reporting are prompting organizations to automate document review and summarization processes. Additionally, the rise of remote work and digital collaboration has intensified the demand for solutions that can streamline knowledge sharing and information dissemination across distributed teams. As organizations continue to embrace digital transformation, the strategic value of Document Summarization AI in enhancing operational agility, reducing manual workload, and mitigating compliance risks is becoming increasingly evident. This confluence of regulatory, technological, and business drivers is expected to shape the market landscape in the coming years.




    From a regional perspective, North America currently dominates the Document Summarization AI market, accounting for the largest share in 2024. The region's leadership is attributed to the early adoption of AI technologies, a mature digital infrastructure, and a strong presence of key market players. However, Asia Pacific is emerging as the fastest-growing region, driven by rapid digitalization, expanding enterprise IT budgets, and government initiatives to foster AI innovation. Europe is also witnessing substantial growth, supported by robust regulatory frameworks and increasing investments in AI research. Meanwhile, Latin America and the Middle East & Africa are gradually catching up, with rising awareness and adoption among enterprises and public sector organizations. The global market is thus characterized by dynamic regional trends, with each geography presenting unique opportunities and challenges for stakeholders.



    Component Analysis



    The Document Summarization AI market is segmented by component into

  12. GovReport

    • opendatalab.com
    • tensorflow.org
    zip
    Updated Sep 22, 2022
    Cite
    University of Illinois Urbana-Champaign (2022). GovReport [Dataset]. https://opendatalab.com/OpenDataLab/GovReport
    Explore at:
    Available download formats: zip (1192078548 bytes)
    Dataset updated
    Sep 22, 2022
    Dataset provided by
    University of Illinois Urbana-Champaign
    University of Michigan
    Description

    GovReport is a dataset for long document summarization, with significantly longer documents and summaries. It consists of reports written by U.S. government research agencies, including the Congressional Research Service and the U.S. Government Accountability Office. Compared with other long document summarization datasets, the GovReport dataset has longer summaries and documents and requires reading more context to cover the salient content to be summarized.

  13. Citation Trends for "A study of extractive summarization of long documents...

    • shibatadb.com
    Updated May 2, 2024
    Cite
    Yubetsu (2024). Citation Trends for "A study of extractive summarization of long documents incorporating local topic and hierarchical information" [Dataset]. https://www.shibatadb.com/article/BhTiuUDo
    Explore at:
    Dataset updated
    May 2, 2024
    Dataset authored and provided by
    Yubetsu
    License

    https://www.shibatadb.com/license/data/proprietary/v1.0/license.txt

    Time period covered
    2024 - 2025
    Variables measured
    New Citations per Year
    Description

    Yearly citation counts for the publication titled "A study of extractive summarization of long documents incorporating local topic and hierarchical information".

  14. Data from: Slovenian text summarization models

    • live.european-language-grid.eu
    Updated Dec 20, 2022
    Cite
    (2022). Slovenian text summarization models [Dataset]. https://live.european-language-grid.eu/catalogue/tool-service/20871
    Explore at:
    Dataset updated
    Dec 20, 2022
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    A text summarisation task aims to convert a longer text into a shorter text while preserving the essential information of the source text. In general, there are two approaches to text summarization. The extractive approach selects and copies the most important sentences or parts of the text, whereas the abstractive approach generates new text and is more similar to human-made summaries. We release 5 models that cover extractive, abstractive, and hybrid types:

    • Metamodel: a neural model based on the Doc2Vec document representation that suggests the best summariser.
    • Graph-based model: an unsupervised graph-based extractive approach that returns the N most relevant sentences.
    • Headline model: a supervised abstractive approach (T5 architecture) that returns headline-like abstracts.
    • Article model: a supervised abstractive approach (T5 architecture) that returns short summaries.
    • Hybrid-long model: an unsupervised hybrid (graph-based and transformer-based) approach that returns short summaries of long texts.

    Details and instructions to run and train the models are available at https://github.com/clarinsi/SloSummarizer.

    The web service with a demo is available at https://slovenscina.eu/povzemanje.
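The graph-based model described above is an unsupervised extractive approach that returns the N most relevant sentences. As a rough sketch of that select-and-copy idea (a simple frequency-scoring stand-in, not the actual graph-based SloSummarizer code):

```python
import re
from collections import Counter

def extract_top_sentences(text: str, n: int = 2) -> list[str]:
    """Score each sentence by average corpus word frequency;
    return the top n sentences in original document order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"\w+", text.lower()))

    def score(sentence: str) -> float:
        tokens = re.findall(r"\w+", sentence.lower())
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    # Rank by score, then restore document order among the winners.
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:n])]
```

A real graph-based extractor (e.g. TextRank) would instead score sentences by centrality in a sentence-similarity graph, but the overall structure of scoring and copying sentences is the same.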

  15. News Article Summarization Dataset

    • kaggle.com
    zip
    Updated Apr 3, 2025
    Cite
    🇹🇷 Şahide Şeker, MSc (2025). News Article Summarization Dataset [Dataset]. https://www.kaggle.com/datasets/sahideseker/news-article-summarization-dataset/data
    Explore at:
    zip(758 bytes)Available download formats
    Dataset updated
    Apr 3, 2025
    Authors
    🇹🇷 Şahide Şeker, MSc
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    🇺🇸 English:

    This synthetic dataset is created for learning and testing abstractive text summarization models. Each row contains a news-style article and a short summary. The dataset is ideal for experimenting with HuggingFace models such as t5-base, facebook/bart-large-cnn, or google/pegasus-xsum.

    🇹🇷 Turkish (translated):

    This synthetic dataset is designed for those who want to generate summaries from news texts. Each row contains a long English news article and a corresponding short summary. It is compatible with T5, BART, and Pegasus models.
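The HuggingFace models named above (t5-base, facebook/bart-large-cnn, google/pegasus-xsum) cap input length at roughly 512 to 1024 tokens, so articles longer than that are typically split into chunks that are summarized separately. A minimal sketch, assuming the `transformers` library is installed; `chunk_text` and `summarize_long` are illustrative helper names, not part of the dataset:

```python
def chunk_text(text: str, max_words: int = 400) -> list[str]:
    """Split text into consecutive chunks of at most max_words words each."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def summarize_long(article: str, max_words: int = 400) -> str:
    """Summarize each chunk with a pretrained model and join the partial summaries.
    Downloads facebook/bart-large-cnn on first use (pip install transformers)."""
    from transformers import pipeline  # imported lazily: heavy dependency
    summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
    parts = [summarizer(chunk, max_length=60, min_length=10)[0]["summary_text"]
             for chunk in chunk_text(article, max_words)]
    return " ".join(parts)
```

Joining chunk summaries is a crude form of hierarchical summarization; for higher quality, the joined text can itself be summarized in a second pass.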

  16. MyTextSum : Malay Text Summarization Dataset

    • data.mendeley.com
    Updated Sep 14, 2021
    Cite
    Suraya Alias (2021). MyTextSum : Malay Text Summarization Dataset [Dataset]. http://doi.org/10.17632/r54zh37mc7.1
    Explore at:
    Dataset updated
    Sep 14, 2021
    Authors
    Suraya Alias
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset consists of 100 news articles covering the Natural Disaster and Events domains in Malaysia. The news articles are formatted in XML following the DUC 2002 text summarization dataset preparation. A total of 300 human summaries from 3 domain experts are included for summary evaluation. Our MyTextSum model, applying the Pattern-Growth Sentence Compression technique, achieved a promising F-measure agreement score of 0.5752 when evaluated against human summaries and performed better than the baseline (uncompressed) model.

    Citation Reference:

    Alias, S., Sainin, M. S., & Mohammad, S. K. (2020). Model Peringkasan Teks Ekstraktif Dwibahasa menggunakan Fitur Kekangan Corak Tekstual (Bilingual Extractive Text Summarization Model using Textual Pattern Constraints). GEMA Online® Journal of Language Studies, 20(3).

    Alias S., Mohammad S.K., Gan K.H., Ping T.T. (2018) MYTextSum: A Malay Text Summarizer Model Using a Constrained Pattern-Growth Sentence Compression Technique. In: Alfred R., Iida H., Ag. Ibrahim A., Lim Y. (eds) Computational Science and Technology. ICCST 2017. Lecture Notes in Electrical Engineering, vol 488. Springer, Singapore. https://doi.org/10.1007/978-981-10-8276-4_14
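The F-measure agreement of 0.5752 reported above is computed against human summaries. As an illustration of the general idea (a plain unigram-overlap F-measure in the style of ROUGE-1, not necessarily the exact metric used in the cited papers):

```python
from collections import Counter

def unigram_f1(candidate: str, reference: str) -> float:
    """ROUGE-1-style F-measure: harmonic mean of unigram precision and recall
    between a system summary and a human reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

With multiple reference summaries, as in this dataset's 3 expert summaries per article, scores are usually averaged or the maximum over references is taken.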

  17. VideoXum

    • resodate.org
    • service.tib.eu
    Updated Dec 16, 2024
    Cite
    Jingyang Lin; Hang Hua; Ming Chen; Yikang Li; Jenhao Hsiao; Chiuman Ho; Jiebo Luo (2024). VideoXum [Dataset]. https://resodate.org/resources/aHR0cHM6Ly9zZXJ2aWNlLnRpYi5ldS9sZG1zZXJ2aWNlL2RhdGFzZXQvdmlkZW94dW0=
    Explore at:
    Dataset updated
    Dec 16, 2024
    Dataset provided by
    Leibniz Data Manager
    Authors
    Jingyang Lin; Hang Hua; Ming Chen; Yikang Li; Jenhao Hsiao; Chiuman Ho; Jiebo Luo
    Description

    VideoXum is a large-scale video summarization dataset that contains 14,001 long videos with corresponding human-annotated video and text summaries.

  18. Document Summarization AI Market Research Report 2033

    • growthmarketreports.com
    csv, pdf, pptx
    Updated Sep 1, 2025
    Cite
    Growth Market Reports (2025). Document Summarization AI Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/document-summarization-ai-market
    Explore at:
    csv, pptx, pdfAvailable download formats
    Dataset updated
    Sep 1, 2025
    Dataset authored and provided by
    Growth Market Reports
    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Document Summarization AI Market Outlook



    According to our latest research, the Document Summarization AI market size reached USD 1.42 billion globally in 2024, driven by the accelerating adoption of artificial intelligence across industries and the exponential growth in unstructured data. The market is projected to expand at a robust CAGR of 22.6% from 2025 to 2033, reaching approximately USD 10.86 billion by 2033. This remarkable growth is largely attributed to advancements in natural language processing, increasing demand for automation in document management, and the need for efficient information retrieval from massive data repositories.




    The rapid digital transformation across sectors such as legal, healthcare, BFSI, and government has been a significant growth factor for the Document Summarization AI market. Organizations are dealing with an unprecedented volume of digital documents, emails, reports, and legal papers, making manual summarization inefficient and error-prone. AI-powered document summarization solutions are being rapidly adopted to automate this process, enabling faster decision-making, enhanced productivity, and substantial cost savings. The integration of advanced NLP techniques, including transformer-based models and deep learning, has further improved the accuracy and relevance of AI-generated summaries, making them invaluable for knowledge workers and executives who need to process large amounts of information quickly.




    Another key driver for the Document Summarization AI market is the growing emphasis on compliance and risk management, especially in highly regulated industries like finance and healthcare. Automated summarization tools help organizations extract critical information from lengthy compliance documents, contracts, and medical records, ensuring that essential details are not overlooked. This capability is crucial for meeting regulatory requirements, avoiding legal pitfalls, and maintaining robust audit trails. Furthermore, the increasing use of AI-based summarization in customer service chatbots and virtual assistants is enhancing user experiences by providing concise, contextually relevant responses, thereby improving customer satisfaction and loyalty.




    The proliferation of cloud computing and the availability of scalable AI platforms have also contributed significantly to market expansion. Cloud-based document summarization AI solutions offer businesses the flexibility to deploy and scale services according to their needs, reducing infrastructure costs and facilitating seamless integration with existing enterprise workflows. Additionally, the democratization of AI through APIs and low-code/no-code platforms is enabling small and medium enterprises (SMEs) to leverage advanced document summarization capabilities without the need for extensive technical expertise. This trend is expected to further boost market penetration across diverse industry verticals in the coming years.




    From a regional perspective, North America currently dominates the Document Summarization AI market, accounting for the largest revenue share in 2024. The region's leadership can be attributed to the strong presence of leading AI technology providers, high digital adoption rates, and significant investments in research and development. Europe follows closely, driven by stringent data privacy regulations and increasing demand for automation in public and private sectors. Meanwhile, the Asia Pacific region is witnessing the fastest growth, propelled by rapid digitalization, expanding enterprise IT infrastructures, and rising awareness of AI-driven document management solutions among businesses and government agencies.



    In the realm of education, Lecture Summarization Tools are becoming increasingly vital as they offer a streamlined way for students and educators to process and retain vast amounts of information. These tools use advanced AI algorithms to distill lecture content into concise summaries, making it easier for students to review and comprehend complex subjects. By integrating lecture summarization capabilities, educational institutions can enhance learning outcomes, provide personalized study materials, and support diverse learning styles. As the demand for digital learning solutions grows, the role of Lecture Summarization Tools in education is set to expand, offering significant benefits.

  19. booksum

    • huggingface.co
    Updated Dec 24, 2021
    Cite
    Karim Foda (2021). booksum [Dataset]. https://huggingface.co/datasets/kmfoda/booksum
    Explore at:
    Dataset updated
    Dec 24, 2021
    Authors
    Karim Foda
    License

    https://choosealicense.com/licenses/bsd-3-clause/

    Description

    BOOKSUM: A Collection of Datasets for Long-form Narrative Summarization

    Authors: Wojciech Kryściński, Nazneen Rajani, Divyansh Agarwal, Caiming Xiong, Dragomir Radev

      Introduction
    

    The majority of available text summarization datasets include short-form source documents that lack long-range causal and temporal dependencies, and often contain strong layout and stylistic biases. While relevant, such datasets will offer limited challenges for future generations of text… See the full description on the dataset page: https://huggingface.co/datasets/kmfoda/booksum.

  20. high-quality-summary-v2

    • huggingface.co
    Updated May 15, 1999
    Cite
    Alan Tseng (1999). high-quality-summary-v2 [Dataset]. https://huggingface.co/datasets/agentlans/high-quality-summary-v2
    Explore at:
    Dataset updated
    May 15, 1999
    Authors
    Alan Tseng
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    High Quality Long Text Summarization Dataset

    • Input texts from agentlans/high-quality-text-long (sample_k10000 config)
    • Summaries generated by google/gemma-3-12b-it
    • Summaries rewritten by agentlans/granite-3.3-2b-reviser
