This dataset was created by Đan Trường phan đình.
AutoTrain Dataset for project: summarization
Dataset Description
This dataset has been automatically processed by AutoTrain for project summarization.
Languages
The BCP-47 code for the dataset's language is en.
Dataset Structure
Data Instances
A sample from this dataset looks as follows: [ { "feat_id": "train_0", "text": "#Person1#: Hi, Mr. Smith. I'm Doctor Hawkins. Why are you here today?
License: http://opendatacommons.org/licenses/dbcl/1.0/
This dataset comprises tables in JSON format, each accompanied by a concise summary of approximately 10 lines. Originally created to train large language models (LLMs) for efficient tabular data summarization, this dataset serves as a valuable resource for developing advanced summarization algorithms. The tables contain a mix of textual and numerical data, reflecting a wide array of information types. Each summary captures the essential insights and key patterns from the corresponding table, providing a clear and succinct overview. The dataset's diverse entries, ranging from descriptive text to precise numerical values, present a robust challenge for LLMs, facilitating the enhancement of their summarization capabilities. By leveraging this dataset, researchers and developers can improve the accuracy and efficiency of their models in interpreting and summarizing complex tabular data.
AutoTrain Dataset for project: summarization-xlsum
Dataset Description
This dataset has been automatically processed by AutoTrain for project summarization-xlsum.
Languages
The BCP-47 code for the dataset's language is unk.
Dataset Structure
Data Instances
A sample from this dataset looks as follows: [ { "text": "\u092e\u0928 \u0915\u0940 \u0917\u0939\u0930\u093e\u0907\u092f\u094b\u0902 \u092e\u0947\u0902… See the full description on the dataset page: https://huggingface.co/datasets/viditsorg/autotrain-data-summarization-xlsum.
Values in parentheses are for the highest resolution shell.
License: https://creativecommons.org/publicdomain/zero/1.0/
By ccdv (From Huggingface) [source]
The validation.csv file contains a set of articles along with their respective abstracts that can be used for validating the performance of summarization models. This subset allows researchers to fine-tune their models and measure how well they can summarize scientific texts.
The train.csv file serves as the primary training data for building summarization models. It consists of numerous articles extracted from the Arxiv database, paired with their corresponding abstracts. By utilizing this file, researchers can develop and train various machine learning algorithms to generate accurate summaries of scientific papers.
Lastly, the test.csv file provides a separate set of articles with accompanying abstracts specifically intended for evaluating the performance and effectiveness of summarization models developed using this dataset. Researchers can utilize this test set to conduct rigorous evaluations and benchmark different approaches in automatic document summarization.
With columns labeled article and abstract (repeated across entries to allow detailed analysis or multiple variations if required by users, e.g. different proposed summaries), this dataset provides significant flexibility in developing robust models for summarizing complex scientific documents.
Introduction:
File Description:
validation.csv: This file contains articles and their respective abstracts that can be used for validation purposes.
train.csv: The purpose of this file is to provide training data for summarizing scientific articles.
test.csv: This file includes a set of articles and their corresponding abstracts that can be used to evaluate the performance of summarization models.
Dataset Structure: The dataset consists of two columns, article and abstract, repeated across the train, validation, and test files.
Usage Examples: This dataset can be utilized in various ways:
a) Training Models: You can use the train.csv file to train your own model for summarizing scientific articles from the Arxiv database. The article column provides the full text of each scientific paper, while the abstract column contains its summary.
b) Validation: The validation.csv file allows you to validate your trained models by comparing their generated summaries with the provided reference summaries in order to assess their performance.
c) Evaluation: Utilize the test.csv file as a benchmark for evaluating different summarization models. Generate summaries using your selected model and compare them with reference summaries.
- Evaluating Performance: To measure how well your summarization model performs on this dataset, you can use metrics like ROUGE (Recall-Oriented Understudy for Gisting Evaluation). ROUGE measures overlap between generated summaries and reference summaries based on n-gram co-occurrence statistics.
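As a minimal illustration of the n-gram overlap idea behind ROUGE, a recall-oriented ROUGE-N score can be sketched in a few lines of Python. This is a simplified version for intuition only; the official ROUGE toolkit and packaged implementations add stemming, stopword handling, and F-measure variants.

```python
from collections import Counter

def rouge_n_recall(reference: str, candidate: str, n: int = 1) -> float:
    """Recall-oriented n-gram overlap: the fraction of reference n-grams
    that also appear in the candidate summary (with clipped counts)."""
    def ngrams(text: str, n: int) -> Counter:
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    ref, cand = ngrams(reference, n), ngrams(candidate, n)
    if not ref:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / sum(ref.values())

# 5 of the 6 reference unigrams appear in the candidate.
print(rouge_n_recall("the cat sat on the mat", "the cat lay on the mat"))
```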
- Summarizing scientific articles: This dataset can be used to train and evaluate summarization models for the task of generating concise summaries of scientific articles from the Arxiv database. Researchers can utilize this dataset to develop novel techniques and approaches for automatic summarization in the scientific domain.
- Information retrieval: The dataset can be used to enhance search engines or information retrieval systems by providing concise summaries along with the full text of scientific articles. This would enable users to quickly grasp key information without having to read the entire article, improving accessibility and efficiency.
- Text generation research: Researchers interested in natural language processing and text generation can use this dataset as a benchmark for developing new models and algorithms that generate coherent, informative, and concise summaries from lengthy scientific texts. The dataset provides a diverse range of articles across various domains, allowing researchers to explore different challenges in summary generation
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain...
This page provides data for the 3rd Grade Reading Level Proficiency performance measure. The dataset includes student performance results on the English/Language Arts section of the AzMERIT from Fall 2017 and Spring 2018. Data is representative of third-grade students in public elementary schools in Tempe, including schools from both the Tempe Elementary and Kyrene districts. Results are by school and provide the total number of students tested, the total percentage passing, and the percentage of students scoring at each of the four levels of proficiency. The performance measure dashboard is available at 3.07 3rd Grade Reading Level Proficiency.
Additional Information
Source: Arizona Department of Education
Contact: Ann Lynn DiDomenico
Contact E-Mail: Ann_DiDomenico@tempe.gov
Data Source Type: Excel/CSV
Preparation Method: Filters on original dataset: within "Schools" tab, School District [select Tempe School District and Kyrene School District]; School Name [deselect Kyrene SD schools not in Tempe city limits]; Content Area [select English Language Arts]; Test Level [select Grade 3]; Subgroup/Ethnicity [select All Students]. Remove irrelevant fields; add Fiscal Year.
Publish Frequency: Annually as data becomes available
Publish Method: Manual
Data Dictionary
License: https://dataintelo.com/privacy-and-policy
According to our latest research, the global Document Summarization AI market size reached USD 1.54 billion in 2024, reflecting robust adoption across industries. The market is projected to expand at a CAGR of 23.7% from 2025 to 2033, driven by increasing demand for automated content processing and smarter information retrieval. By 2033, the Document Summarization AI market is forecasted to achieve a value of USD 12.38 billion, underlining its transformative impact on enterprise operations and knowledge management. This rapid growth is primarily fueled by the proliferation of unstructured data, the need for efficient decision-making, and advancements in natural language processing (NLP) technologies.
One of the primary growth factors for the Document Summarization AI market is the exponential surge in digital content generation across industries. Enterprises, government agencies, and academic institutions are inundated with vast volumes of unstructured data, including emails, reports, legal documents, and research papers. Manual processing of such data is time-consuming, error-prone, and often leads to information overload. The integration of AI-driven summarization tools enables organizations to extract key insights, reduce redundancy, and accelerate workflow automation. As a result, businesses are able to enhance productivity, improve compliance, and make data-driven decisions more efficiently. This growing need for automation and intelligent data curation is a critical driver propelling the adoption of Document Summarization AI solutions worldwide.
Another significant factor contributing to market growth is the advancement and democratization of natural language processing (NLP) and machine learning algorithms. Leading AI vendors are investing heavily in research and development to enhance the accuracy, context-awareness, and linguistic versatility of summarization models. With the evolution of transformer-based architectures and large language models, Document Summarization AI tools are now capable of handling complex, domain-specific content with greater precision. Moreover, the availability of cloud-based AI services has lowered the entry barriers for small and medium-sized enterprises (SMEs), enabling them to leverage sophisticated summarization capabilities without significant upfront investment. This technological progress, coupled with growing awareness about the benefits of AI-powered document management, is expected to sustain high growth momentum over the forecast period.
The Document Summarization AI market is also witnessing strong traction due to regulatory and compliance requirements, particularly in sectors like BFSI, healthcare, and legal. Stringent data governance frameworks and the need for timely, accurate reporting are prompting organizations to automate document review and summarization processes. Additionally, the rise of remote work and digital collaboration has intensified the demand for solutions that can streamline knowledge sharing and information dissemination across distributed teams. As organizations continue to embrace digital transformation, the strategic value of Document Summarization AI in enhancing operational agility, reducing manual workload, and mitigating compliance risks is becoming increasingly evident. This confluence of regulatory, technological, and business drivers is expected to shape the market landscape in the coming years.
From a regional perspective, North America currently dominates the Document Summarization AI market, accounting for the largest share in 2024. The region's leadership is attributed to the early adoption of AI technologies, a mature digital infrastructure, and a strong presence of key market players. However, Asia Pacific is emerging as the fastest-growing region, driven by rapid digitalization, expanding enterprise IT budgets, and government initiatives to foster AI innovation. Europe is also witnessing substantial growth, supported by robust regulatory frameworks and increasing investments in AI research. Meanwhile, Latin America and the Middle East & Africa are gradually catching up, with rising awareness and adoption among enterprises and public sector organizations. The global market is thus characterized by dynamic regional trends, with each geography presenting unique opportunities and challenges for stakeholders.
The Document Summarization AI market is segmented by component into
This work was supported by the Ho Chi Minh City Department of Science and Technology, Grant Number 15/2016/HÐ-SKHCN.
Data construction process: In this work, we aim to have 300 clusters of documents extracted from news. To this end, we made use of the Vietnamese language version of Google News. Due to the copyright issue, we did not collect articles from every source listed on Google News, but limited to some sources that are open for research purposes. The collected articles belong to five genres: world news, domestic news, business, entertainment, and sports. Every cluster contains from four to ten news articles. Each article is represented by the following information: the title, the plain text content, the news source, the date of publication, the author(s), the tag(s) and the headline summary.
After that, two summaries are created for each cluster (produced in the first subtask above) by two distinct annotators using the MDSWriter system (Meyer, Christian M., et al. "MDSWriter: Annotation tool for creating high-quality multi-document summarization corpora." Proceedings of ACL-2016 System Demonstrations). These annotators are Vietnamese native speakers and are undergraduate or graduate students; most are familiar with natural language processing. The full annotation process consists of seven steps that must be completed sequentially from the first to the seventh.
Data information: Original folder: Contains 300 subdirectories, which are the 300 news clusters. Articles (documents) in each cluster belong to a similar topic, and each cluster contains from four to ten of them. The total number of articles is 1,945.
Summary folder: Contains 300 subdirectories holding the 600 final summaries. Every input cluster has two manual abstractive summaries from two distinct annotators. ViMs can be used for both implementing and evaluating supervised machine learning-based systems for Vietnamese abstractive multi-document summarization.
S3_summary folder: Contains 300 subdirectories including 600 "best sentence selection" summaries, the result of step 3 (the best sentence selection step). Sentences in a group are separated from others by a blank line. The most important sentence is labeled 1, while 0 is the label for the others.
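Based on the layout described above (blank lines separating groups, a 0/1 label marking each sentence), a parser might look like the following sketch. The exact field separator in the released files is an assumption here; the sample sentences are illustrative, not taken from the corpus.

```python
def parse_s3_groups(raw: str):
    """Parse best-sentence-selection groups: blank lines separate groups;
    each line is assumed to start with a 0/1 label followed by the sentence
    (the exact layout of the released files may differ)."""
    groups = []
    for block in raw.strip().split("\n\n"):
        group = []
        for line in block.splitlines():
            label, _, sentence = line.partition(" ")
            group.append((int(label), sentence))
        groups.append(group)
    return groups

# Illustrative two-group sample in the assumed format.
sample = "1 Câu quan trọng nhất.\n0 Một câu khác.\n\n0 Câu phụ.\n1 Câu chính."
for group in parse_s3_groups(sample):
    best = next(s for label, s in group if label == 1)
    print(best)
```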
@article{tran2020vims, title={ViMs: a high-quality Vietnamese dataset for abstractive multi-document summarization}, author={Tran, Nhi-Thao and Nghiem, Minh-Quoc and Nguyen, Nhung TH and Nguyen, Ngan Luu-Thuy and Van Chi, Nam and Dinh, Dien}, journal={Language Resources and Evaluation}, volume={54}, number={4}, pages={893--920}, year={2020}, publisher={Springer} }
Author: Tran Mai Vu, Vu Trong Hoa, Phi Van Thuy, Le Duc Trong, Ha Quang Thuy Affiliation: Knowledge Technology Laboratory, University of Technology, VNU Hanoi Research Topic: Design and Implementation of a Multi-document Summarization Program for the Vietnamese Language, Funded by the Ministry of Education (Project Code: B2012-01-24)
Data construction process: The data construction process is entirely manual. It consists of two steps:
Data information: Data Volume: 200 clusters
Each cluster corresponds to a folder, and it typically contains 2-5 documents (often 3). The folder's name represents the cluster.
Within each folder, all files represent documents (online articles) belonging to the cluster.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains the following 3 datasets for legal document summarization :
- IN-Abs : Indian Supreme Court case documents & their `abstractive' summaries, obtained from http://www.liiofindia.org/in/cases/cen/INSC/
- IN-Ext : Indian Supreme Court case documents & their `extractive' summaries, written by two law experts (A1, A2).
- UK-Abs : United Kingdom (U.K.) Supreme Court case documents & their `abstractive' summaries, obtained from https://www.supremecourt.uk/decided-cases/
Please refer to the paper and the README file for more details.
License: https://clarin.si/repository/xmlui/page/licence-aca-id-by-nc-inf-nored-1.0
Summarization datasets were created from the text bodies in the KAS 2.0 corpus (http://hdl.handle.net/11356/1448) and the abstracts from the KAS-Abs 2.0 corpus (http://hdl.handle.net/11356/1449). The monolingual slo2slo dataset contains 69,730 Slovene abstracts and Slovene body texts. The cross-lingual slo2eng dataset contains 52,351 Slovene body texts and English abstracts; it is suitable for building cross-lingual summarization models. The total number of words represents the sum of words in bodies, Slovene abstracts, and English abstracts.
The files are stored in the same manner as the complete KAS corpus, i.e. in 1,000 directories with the same filename prefix as in KAS. They are in the JSON format that contains chapter segmented text. In addition to a unique chapter ID, each JSON file contains a key titled “abstract” that contains a list with abstract text as its first element. The file with the metadata for the corpus texts is also included.
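A minimal sketch of reading one of these chapter-segmented JSON records and separating the summarization input from the target. The field names other than "abstract" (and the record content) are assumptions based on the description above, not the exact released schema.

```python
import json

# Hypothetical KAS-style record: chapter-segmented body text plus an
# "abstract" key whose first list element is the abstract text.
raw = json.dumps({
    "id": "kas-000001",
    "chapters": {"ch1": "Uvod in motivacija ...", "ch2": "Metode ..."},
    "abstract": ["Slovene abstract text ..."],
}, ensure_ascii=False)

doc = json.loads(raw)
abstract = doc["abstract"][0]               # summarization target
body = "\n".join(doc["chapters"].values())  # summarization input
print(abstract)
```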
The datasets are suitable for training monolingual Slovene summarization models and cross-lingual Slovene-English summarization models on long texts.
References: Žagar, A., Kavaš, M., & Robnik Šikonja, M. (2021). Corpus KAS 2.0: cleaner and with new datasets. In Information Society - IS 2021: Proceedings of the 24th International Multiconference. https://doi.org/10.5281/zenodo.5562228
PubMed dataset for summarization
Dataset for summarization of long documents. Adapted from this repo. Note that the original data are pre-tokenized, so this dataset returns " ".join(text) and adds " " for paragraphs. This dataset is compatible with the run_summarization.py script from Transformers if you add this line to the summarization_name_mapping variable: "ccdv/pubmed-summarization": ("article", "abstract")
Data Fields
id: paper id article: a string containing the body… See the full description on the dataset page: https://huggingface.co/datasets/ccdv/pubmed-summarization.
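Because the source data are pre-tokenized, reconstructing plain text amounts to joining tokens with spaces and paragraphs with a separator, as the card describes. A minimal sketch (the token lists are illustrative, not real PubMed content, and the newline separator is an assumption):

```python
# Pre-tokenized paragraphs as they might appear in the source data
# (illustrative tokens, not real PubMed content).
paragraphs = [
    ["the", "study", "enrolled", "120", "patients", "."],
    ["results", "were", "significant", "."],
]

# Join tokens with spaces and paragraphs with a separator, mirroring
# the " ".join(text) behaviour described on the card.
article = "\n".join(" ".join(tokens) for tokens in paragraphs)
print(article)
```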
License: U.S. Government Works, https://www.usa.gov/government-works
License information was derived automatically
NASA's exploration and scientific missions will produce terabytes of information. As NASA enters a new phase of space exploration, managing large amounts of scientific and operational data will become even more challenging. Robots conducting planetary exploration will produce data for selection and preparation of exploration sites. Robots and space probes will collect scientific data to improve understanding of the solar system. Satellites in low Earth orbit will collect data for monitoring changes in the Earth's atmosphere and surface environment. Key challenges for all these missions are understanding and summarizing what data have been collected and using this knowledge to improve data access. TRACLabs and CMU propose to develop context aware image manipulation software for managing data collected remotely during NASA missions. This software will filter and search large image archives using the temporal and spatial characteristics of images, and the robotic, instrument, and environmental conditions when images were taken. It also will implement techniques for finding which images show a terrain feature specified by the user. In Phase II we will implement this software and evaluate its effectiveness for NASA missions. At the end of Phase II, context aware image manipulation software at TRL 5-6 will be delivered to NASA.
Under the new quarterly data summary (QDS) framework, departments' spending data is published every quarter to show the taxpayer how the government is spending their money.
The QDS grew out of commitments made in the 2011 Budget and the written ministerial statement on business plans. For the financial year 2012 to 2013 the QDS has been revised and improved in line with action 9 of the Civil Service Reform Plan to provide a common set of data that will enable comparisons of operational performance across government so that departments and individuals can be held to account.
The QDS breaks down the total spend of the department in 3 ways:
The QDS template is the same for all departments, though the individual detail of grants and policy will differ from department to department. In using this data:
Please note that the quarter 1 2012 to 2013 return for the Department of Transport (DfT) is for the core department only.
Quarterly data summaries for April 2012 are as follows:
Quarterly data summaries for January 2012 are as follows:
Quarterly data summaries for October 2011 are as follows:
The Big Data Interagency Working Group (BD IWG) held a workshop, Measuring the Impact of Digital Repositories, on February 28 - March 1, 2017 in Arlington, VA. The aim of the workshop was to identify current assessment metrics, tools, and methodologies that are effective in measuring the impact of digital data repositories, and to identify the assessment issues, obstacles, and tools that require additional research and development (R&D). This workshop brought together leaders from academic, journal, government, and international data repository funders, users, and developers to discuss these issues...
License: Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
A text summarisation task aims to convert a longer text into a shorter text while preserving the essential information of the source text. In general, there are two approaches to text summarization. The extractive approach simply rewrites the most important sentences or parts of the text, whereas the abstractive approach is more similar to human-made summaries. We release 5 models that cover extractive, abstractive, and hybrid types:
- Metamodel: a neural model based on the Doc2Vec document representation that suggests the best summariser.
- Graph-based model: an unsupervised graph-based extractive approach that returns the N most relevant sentences.
- Headline model: a supervised abstractive approach (T5 architecture) that returns headline-like abstracts.
- Article model: a supervised abstractive approach (T5 architecture) that returns short summaries.
- Hybrid-long model: an unsupervised hybrid (graph-based and transformer model-based) approach that returns short summaries of long texts.
Details and instructions to run and train the models are available at https://github.com/clarinsi/SloSummarizer.
The web service with a demo is available at https://slovenscina.eu/povzemanje.
The NewsRoom dataset consists of 60 input source texts and 7 output summaries for each sample.
License: https://creativecommons.org/publicdomain/zero/1.0/
By ccdv (From Huggingface) [source]
The dataset consists of multiple files, including validation.csv, train.csv, and test.csv. Each file contains a combination of articles and their respective abstracts. The articles are sourced directly from PubMed, ensuring they represent a wide range of topics across various scientific disciplines.
In order to provide reliable datasets for different purposes, the files have been carefully curated to serve specific functions. validation.csv contains a subset of articles with their corresponding abstracts that can be used for validating the performance of summarization models during development. train.csv features a larger set of article-abstract pairs specifically intended for training such models.
Finally, test.csv serves as an independent evaluation set that allows developers to measure the effectiveness and generalizability of their summarization models against unseen data points. By using this test set, researchers can assess how well their algorithms perform in generating concise summaries that accurately capture the main findings and conclusions within scientific articles.
Researchers in natural language processing (NLP), machine learning (ML), or any related field can utilize this dataset to advance automatic text summarization techniques focused on scientific literature. Whether it's building extractive or abstractive methods or exploring novel approaches like neural networks or transformer-based architectures, this rich dataset provides ample opportunities for experimentation and progress in the field.
Introduction:
Dataset Structure:
- article: The full text of a scientific article from the PubMed database (Text).
- abstract: A summary of the main findings and conclusions of the article (Text).
Using the Dataset: To maximize the utility of this dataset, it is important to understand its purpose and how it can be utilized:
Training Models: The train.csv file contains articles and their corresponding abstracts that can be used for training summarization models or developing algorithms that generate concise summaries automatically.
Validation Purposes: The validation.csv file serves as a test set for fine-tuning your models or comparing different approaches during development.
Evaluating Model Performance: The test.csv file offers a separate set of articles along with their corresponding abstracts specifically designed for evaluating the performance of various summarization models.
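A minimal sketch of iterating article/abstract pairs from one of these CSV files with Python's standard library. An in-memory sample stands in for the real file here, and its contents are illustrative, not taken from the dataset.

```python
import csv
import io

# Stand-in for train.csv: two columns, article and abstract
# (illustrative rows, not real dataset content).
sample_csv = io.StringIO(
    "article,abstract\n"
    '"Full text of a biomedical paper ...","Short summary of findings."\n'
)

pairs = [(row["article"], row["abstract"]) for row in csv.DictReader(sample_csv)]
for article, abstract in pairs:
    print(len(article.split()), "->", abstract)
```

For the real files, replace the StringIO object with `open("train.csv", encoding="utf-8")`.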
Tips for Utilizing the Dataset Effectively:
Preprocessing: Before using this dataset, consider preprocessing steps such as removing irrelevant sections (e.g., acknowledgments, references), cleaning up invalid characters or formatting issues if any exist.
Feature Engineering: Explore additional features like article length, sentence structure complexity, or domain-specific details that may assist in improving summarization model performance.
Model Selection & Evaluation: Experiment with different summarization algorithms, ranging from traditional extractive approaches to more advanced abstractive methods. Evaluate model performance using established metrics such as ROUGE (Recall-Oriented Understudy for Gisting Evaluation).
Data Augmentation: Depending on the size of your dataset, you may consider augmenting it further by applying techniques like data synthesis or employing external resources (e.g., pre-trained language models) to enhance model performance.
Conclusion:
- Textual analysis and information retrieval: Researchers can use this dataset to analyze patterns in scientific literature or conduct information retrieval tasks. By examining the relationship between article content and its abstract, researchers can gain insights into how different sections of a scientific paper contribute to its overall summary.
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: validation.csv | Column name | Description ...
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Update Frequency: Annual
Updated for 2022. End of year assessed property values for the City of Milwaukee for the years 1992-present. These values include real estate property and personal property in Milwaukee, Washington, and Waukesha Counties.
One data row per year.
To download XML and JSON files, click the CSV option below and click the down arrow next to the Download button in the upper right on its page.
License: Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Russian summarization data mix
Total number of items in the train set: 197,561. Total number of items in the golden test set: 258 (manually verified semi-synthetic data for evaluation purposes).
We use this datasets for train mix:
XLSum, Gazeta, WikiLingua, MLSUM, Reviews (ru), Curation-corpus (ru), Matreshka, DialogSum (ru), SAMSum (ru)
Cite us
@misc{akhmetgareeva2024summary, title={Towards Russian Summarization: can architecture solve data limitations problems?}… See the full description on the dataset page: https://huggingface.co/datasets/RussianNLP/Mixed-Summarization-Dataset.