100+ datasets found
  1. Data from: data summarization

    • kaggle.com
    zip
    Updated Dec 8, 2021
    + more versions
    Cite
    Đan Trường phan đình (2021). data summarization [Dataset]. https://www.kaggle.com/datasets/antrngphannh/data-summarization
    Explore at:
    Available download formats: zip (56,154,866 bytes)
    Dataset updated
    Dec 8, 2021
    Authors
    Đan Trường phan đình
    Description

    Dataset

    This dataset was created by Đan Trường phan đình

    Contents

  2. autotrain-data-summarization

    • huggingface.co
    Updated Aug 24, 2023
    + more versions
    Cite
    neil (2023). autotrain-data-summarization [Dataset]. https://huggingface.co/datasets/neil-code/autotrain-data-summarization
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 24, 2023
    Authors
    neil
    Description

    AutoTrain Dataset for project: summarization

      Dataset Description
    

    This dataset has been automatically processed by AutoTrain for project summarization.

      Languages
    

    The BCP-47 code for the dataset's language is en.

      Dataset Structure
    
    
    
    
    
      Data Instances
    

    A sample from this dataset looks as follows: [ { "feat_id": "train_0", "text": "#Person1#: Hi, Mr. Smith. I'm Doctor Hawkins. Why are you here today?

    #Person2#: I found it would be a good… See the full description on the dataset page: https://huggingface.co/datasets/neil-code/autotrain-data-summarization.

  3. Large Table Summarization Dataset

    • kaggle.com
    zip
    Updated Jul 3, 2024
    Cite
    AmruthWarrier (2024). Large Table Summarization Dataset [Dataset]. https://www.kaggle.com/datasets/amruthwarrier/large-table-summarization-dataset/discussion
    Explore at:
    Available download formats: zip (551,099 bytes)
    Dataset updated
    Jul 3, 2024
    Authors
    AmruthWarrier
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    This dataset comprises tables in JSON format, each accompanied by a concise summary of approximately 10 lines. Originally created to train large language models (LLMs) for efficient tabular data summarization, this dataset serves as a valuable resource for developing advanced summarization algorithms. The tables contain a mix of textual and numerical data, reflecting a wide array of information types. Each summary captures the essential insights and key patterns from the corresponding table, providing a clear and succinct overview. The dataset's diverse entries, ranging from descriptive text to precise numerical values, present a robust challenge for LLMs, facilitating the enhancement of their summarization capabilities. By leveraging this dataset, researchers and developers can improve the accuracy and efficiency of their models in interpreting and summarizing complex tabular data.

  4. autotrain-data-summarization-xlsum

    • huggingface.co
    Updated Oct 24, 2023
    Cite
    Vidit Organization (2023). autotrain-data-summarization-xlsum [Dataset]. https://huggingface.co/datasets/viditsorg/autotrain-data-summarization-xlsum
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 24, 2023
    Dataset authored and provided by
    Vidit Organization
    Description

    AutoTrain Dataset for project: summarization-xlsum

      Dataset Description
    

    This dataset has been automatically processed by AutoTrain for project summarization-xlsum.

      Languages
    

    The BCP-47 code for the dataset's language is unk.

      Dataset Structure
    
    
    
    
    
      Data Instances
    

    A sample from this dataset looks as follows: [ { "text": "मन की गहराइयों में… (Hindi for "in the depths of the mind…") See the full description on the dataset page: https://huggingface.co/datasets/viditsorg/autotrain-data-summarization-xlsum.

  5. Summary of data collection and refinement statistics.

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Mar 25, 2022
    + more versions
    Cite
    Dusane, Abhishek; Bellini, Valeria; Hakimi, Mohamed-Ali; Sharma, Amit; Babbar, Palak; Bougdour, Alexandre; Laleu, Benoît; Manickam, Yogavel; Mishra, Siddhartha; Malhotra, Nipun (2022). Summary of data collection and refinement statistics. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000269387
    Explore at:
    Dataset updated
    Mar 25, 2022
    Authors
    Dusane, Abhishek; Bellini, Valeria; Hakimi, Mohamed-Ali; Sharma, Amit; Babbar, Palak; Bougdour, Alexandre; Laleu, Benoît; Manickam, Yogavel; Mishra, Siddhartha; Malhotra, Nipun
    Description

    Values in parentheses are for the highest resolution shell.

  6. CCDV Arxiv Summarization Dataset

    • kaggle.com
    zip
    Updated Dec 5, 2023
    Cite
    The Devastator (2023). CCDV Arxiv Summarization Dataset [Dataset]. https://www.kaggle.com/datasets/thedevastator/ccdv-arxiv-summarization-dataset
    Explore at:
    Available download formats: zip (2,219,742,528 bytes)
    Dataset updated
    Dec 5, 2023
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    CCDV Arxiv Summarization Dataset

    Arxiv Summarization Dataset for CCDV

    By ccdv (from Hugging Face) [source]

    About this dataset

    The validation.csv file contains a set of articles along with their respective abstracts that can be used for validating the performance of summarization models. This subset allows researchers to fine-tune their models and measure how well they can summarize scientific texts.

    The train.csv file serves as the primary training data for building summarization models. It consists of numerous articles extracted from the Arxiv database, paired with their corresponding abstracts. By utilizing this file, researchers can develop and train various machine learning algorithms to generate accurate summaries of scientific papers.

    Lastly, the test.csv file provides a separate set of articles with accompanying abstracts specifically intended for evaluating the performance and effectiveness of summarization models developed using this dataset. Researchers can utilize this test set to conduct rigorous evaluations and benchmark different approaches in automatic document summarization.

    Each split exposes article and abstract columns, giving users the flexibility to run detailed analyses or work with multiple variations (e.g., different proposed summaries) when developing robust models for summarizing complex scientific documents.

    How to use the dataset

    • File Description:

    • validation.csv: This file contains articles and their respective abstracts that can be used for validation purposes.

    • train.csv: The purpose of this file is to provide training data for summarizing scientific articles.

    • test.csv: This file includes a set of articles and their corresponding abstracts that can be used to evaluate the performance of summarization models.

    • Dataset Structure: Each split contains the same two columns: article (the full paper text) and abstract (its summary).

    • Usage Examples: This dataset can be utilized in various ways:

    a) Training Models: You can use the train.csv file to train your own model for summarizing scientific articles from the Arxiv database. The article column provides the full text of each scientific paper, while the abstract column contains its summary.

    b) Validation: The validation.csv file allows you to validate your trained models by comparing their generated summaries with the provided reference summaries in order to assess their performance.

    c) Evaluation: Utilize the test.csv file as a benchmark for evaluating different summarization models. Generate summaries using your selected model and compare them with reference summaries.
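    The three CSV files described above can be read with any CSV reader. A stdlib-only sketch follows; the two-column article/abstract layout comes from the description, while the sample rows are invented stand-ins for the real train.csv:

```python
import csv
import io

# Invented two-row stand-in for train.csv, following the described
# article/abstract column layout.
sample_csv = io.StringIO(
    "article,abstract\n"
    '"full text of paper one","summary one"\n'
    '"full text of paper two","summary two"\n'
)

# Each row pairs a full paper with its reference summary.
rows = list(csv.DictReader(sample_csv))
pairs = [(r["article"], r["abstract"]) for r in rows]
```

    The same loop applies unchanged to validation.csv and test.csv, since all three splits share the same columns.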

    • Evaluating Performance: To measure how well your summarization model performs on this dataset, you can use metrics like ROUGE (Recall-Oriented Understudy for Gisting Evaluation). ROUGE measures overlap between generated summaries and reference summaries based on n-gram co-occurrence statistics.
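    The ROUGE-1 idea can be sketched in a few lines of Python; this toy implementation (whitespace tokens, lowercasing, no stemming) is for illustration only and is not a substitute for an official ROUGE package:

```python
from collections import Counter

def rouge1(candidate: str, reference: str) -> dict:
    """Compute unigram ROUGE-1 precision, recall, and F1.

    Simplified sketch: whitespace tokenization, no stemming,
    clipped counts for the unigram overlap.
    """
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Overlap is the clipped count of shared unigrams.
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}

scores = rouge1("the model summarizes papers",
                "the system summarizes scientific papers")
```

    Here three of the candidate's four unigrams appear in the five-word reference, giving precision 0.75 and recall 0.6.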


    Research Ideas

    • Summarizing scientific articles: This dataset can be used to train and evaluate summarization models for the task of generating concise summaries of scientific articles from the Arxiv database. Researchers can utilize this dataset to develop novel techniques and approaches for automatic summarization in the scientific domain.
    • Information retrieval: The dataset can be used to enhance search engines or information retrieval systems by providing concise summaries along with the full text of scientific articles. This would enable users to quickly grasp key information without having to read the entire article, improving accessibility and efficiency.
    • Text generation research: Researchers interested in natural language processing and text generation can use this dataset as a benchmark for developing new models and algorithms that generate coherent, informative, and concise summaries from lengthy scientific texts. The dataset provides a diverse range of articles across various domains, allowing researchers to explore different challenges in summary generation.

    Acknowledgements

    If you use this dataset in your research, please credit the original authors.

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain...

  7. 3.07 AZ Merit Data (summary)

    • catalog.data.gov
    • data-academy.tempe.gov
    • +13more
    Updated Jan 17, 2025
    + more versions
    Cite
    City of Tempe (2025). 3.07 AZ Merit Data (summary) [Dataset]. https://catalog.data.gov/dataset/3-07-az-merit-data-summary-55307
    Explore at:
    Dataset updated
    Jan 17, 2025
    Dataset provided by
    City of Tempe
    Description

    This page provides data for the 3rd Grade Reading Level Proficiency performance measure. The dataset includes student performance results on the English/Language Arts section of the AzMERIT from Fall 2017 and Spring 2018. Data is representative of third-grade students in public elementary schools in Tempe, including schools from both the Tempe Elementary and Kyrene districts. Results are by school and provide the total number of students tested, the total percentage passing, and the percentage of students scoring at each of the four levels of proficiency. The performance measure dashboard is available at 3.07 3rd Grade Reading Level Proficiency.

    Additional Information
    Source: Arizona Department of Education
    Contact: Ann Lynn DiDomenico
    Contact E-Mail: Ann_DiDomenico@tempe.gov
    Data Source Type: Excel/CSV
    Preparation Method: Filters on the original dataset: within the "Schools" tab, School District [select Tempe School District and Kyrene School District]; School Name [deselect Kyrene SD schools not in Tempe city limits]; Content Area [select English Language Arts]; Test Level [select Grade 3]; Subgroup/Ethnicity [select All Students]. Remove irrelevant fields; add Fiscal Year.
    Publish Frequency: Annually as data becomes available
    Publish Method: Manual
    Data Dictionary

  8. Document Summarization AI Market Research Report 2033

    • dataintelo.com
    csv, pdf, pptx
    Updated Sep 30, 2025
    Cite
    Dataintelo (2025). Document Summarization AI Market Research Report 2033 [Dataset]. https://dataintelo.com/report/document-summarization-ai-market
    Explore at:
    Available download formats: pdf, csv, pptx
    Dataset updated
    Sep 30, 2025
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Document Summarization AI Market Outlook



    According to our latest research, the global Document Summarization AI market size reached USD 1.54 billion in 2024, reflecting robust adoption across industries. The market is projected to expand at a CAGR of 23.7% from 2025 to 2033, driven by increasing demand for automated content processing and smarter information retrieval. By 2033, the Document Summarization AI market is forecasted to achieve a value of USD 12.38 billion, underlining its transformative impact on enterprise operations and knowledge management. This rapid growth is primarily fueled by the proliferation of unstructured data, the need for efficient decision-making, and advancements in natural language processing (NLP) technologies.




    One of the primary growth factors for the Document Summarization AI market is the exponential surge in digital content generation across industries. Enterprises, government agencies, and academic institutions are inundated with vast volumes of unstructured data, including emails, reports, legal documents, and research papers. Manual processing of such data is time-consuming, error-prone, and often leads to information overload. The integration of AI-driven summarization tools enables organizations to extract key insights, reduce redundancy, and accelerate workflow automation. As a result, businesses are able to enhance productivity, improve compliance, and make data-driven decisions more efficiently. This growing need for automation and intelligent data curation is a critical driver propelling the adoption of Document Summarization AI solutions worldwide.




    Another significant factor contributing to market growth is the advancement and democratization of natural language processing (NLP) and machine learning algorithms. Leading AI vendors are investing heavily in research and development to enhance the accuracy, context-awareness, and linguistic versatility of summarization models. With the evolution of transformer-based architectures and large language models, Document Summarization AI tools are now capable of handling complex, domain-specific content with greater precision. Moreover, the availability of cloud-based AI services has lowered the entry barriers for small and medium-sized enterprises (SMEs), enabling them to leverage sophisticated summarization capabilities without significant upfront investment. This technological progress, coupled with growing awareness about the benefits of AI-powered document management, is expected to sustain high growth momentum over the forecast period.




    The Document Summarization AI market is also witnessing strong traction due to regulatory and compliance requirements, particularly in sectors like BFSI, healthcare, and legal. Stringent data governance frameworks and the need for timely, accurate reporting are prompting organizations to automate document review and summarization processes. Additionally, the rise of remote work and digital collaboration has intensified the demand for solutions that can streamline knowledge sharing and information dissemination across distributed teams. As organizations continue to embrace digital transformation, the strategic value of Document Summarization AI in enhancing operational agility, reducing manual workload, and mitigating compliance risks is becoming increasingly evident. This confluence of regulatory, technological, and business drivers is expected to shape the market landscape in the coming years.




    From a regional perspective, North America currently dominates the Document Summarization AI market, accounting for the largest share in 2024. The region's leadership is attributed to the early adoption of AI technologies, a mature digital infrastructure, and a strong presence of key market players. However, Asia Pacific is emerging as the fastest-growing region, driven by rapid digitalization, expanding enterprise IT budgets, and government initiatives to foster AI innovation. Europe is also witnessing substantial growth, supported by robust regulatory frameworks and increasing investments in AI research. Meanwhile, Latin America and the Middle East & Africa are gradually catching up, with rising awareness and adoption among enterprises and public sector organizations. The global market is thus characterized by dynamic regional trends, with each geography presenting unique opportunities and challenges for stakeholders.



    Component Analysis



    The Document Summarization AI market is segmented by component into

  9. Vietnamese Multi Document Summarization Dataset

    • kaggle.com
    zip
    Updated Oct 1, 2023
    Cite
    vũ trần anh (2023). Vietnamese Multi Document Summarization Dataset [Dataset]. https://www.kaggle.com/datasets/vtrnanh/sust-feature-data-new
    Explore at:
    Available download formats: zip (8,072,886 bytes)
    Dataset updated
    Oct 1, 2023
    Authors
    vũ trần anh
    Description

    Vietnamese Multiple Document Summarization Dataset

    ViMs Dataset (ViMs folder)

    This work was supported by the Ho Chi Minh City Department of Science and Technology, Grant Number 15/2016/HĐ-SKHCN.

    Data construction process: In this work, we aimed to collect 300 clusters of documents extracted from news. To this end, we made use of the Vietnamese-language version of Google News. Due to copyright issues, we did not collect articles from every source listed on Google News, but limited collection to sources that are open for research purposes. The collected articles belong to five genres: world news, domestic news, business, entertainment, and sports. Every cluster contains from four to ten news articles. Each article is represented by the following information: the title, the plain-text content, the news source, the date of publication, the author(s), the tag(s), and the headline summary.

    After that, two summaries are created for each cluster (produced in the first subtask above) by two different annotators using the MDSWriter system (Meyer, Christian M., et al. "MDSWriter: Annotation tool for creating high-quality multi-document summarization corpora." Proceedings of ACL-2016 System Demonstrations). The annotators are Vietnamese native speakers, undergraduate or graduate students, and most are familiar with natural language processing. The full annotation process consists of seven steps that must be completed sequentially from the first to the seventh.

    Data information: Original folder: Contains 300 subdirectories, one per news cluster. Articles (documents) in each cluster belong to a similar topic, with four to ten articles per cluster and 1,945 articles in total.

    Summary folder: Contains 300 subdirectories holding 600 final summaries. Every input cluster has two manual abstract summaries from two different annotators. ViMs can be used for both implementing and evaluating supervised machine-learning systems for Vietnamese abstractive multi-document summarization.

    S3_summary folder: Contains 300 subdirectories holding 600 "best sentence selection" summaries, the output of step 3 (the best-sentence-selection step). Sentences in a group are separated from other groups by a blank line. The most important sentence in a group is labeled 1, while all others are labeled 0.
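    Based on that layout (groups separated by a blank line, each sentence carrying a 0/1 label), an S3_summary file could be parsed with a sketch like the following; the single-space separator between label and sentence is an assumption:

```python
def parse_s3_groups(raw: str):
    """Split an S3_summary file into groups of (label, sentence) pairs.

    Assumes each line starts with a 0/1 label followed by the sentence,
    and that groups are separated by blank lines.
    """
    groups = []
    for block in raw.strip().split("\n\n"):
        group = []
        for line in block.splitlines():
            label, _, sentence = line.partition(" ")
            group.append((int(label), sentence))
        groups.append(group)
    return groups

sample = ("1 Best sentence of group one.\n"
          "0 Another sentence.\n"
          "\n"
          "0 Filler.\n"
          "1 Best sentence of group two.")
groups = parse_s3_groups(sample)
```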

    @article{tran2020vims, title={ViMs: a high-quality Vietnamese dataset for abstractive multi-document summarization}, author={Tran, Nhi-Thao and Nghiem, Minh-Quoc and Nguyen, Nhung TH and Nguyen, Ngan Luu-Thuy and Van Chi, Nam and Dinh, Dien}, journal={Language Resources and Evaluation}, volume={54}, number={4}, pages={893--920}, year={2020}, publisher={Springer} }

    Vietnamese MDS (clusters folder)

    Author: Tran Mai Vu, Vu Trong Hoa, Phi Van Thuy, Le Duc Trong, Ha Quang Thuy Affiliation: Knowledge Technology Laboratory, University of Technology, VNU Hanoi Research Topic: Design and Implementation of a Multi-document Summarization Program for the Vietnamese Language, Funded by the Ministry of Education (Project Code: B2012-01-24)

    Data construction process: The data construction process is entirely manual. It consists of two steps:

    • Step 1: Data Collection and Clustering - Data is collected from the Baomoi website and organized into clusters, where each cluster contains documents related to a specific topic. The data is collected from various subjects on Baomoi, typically encompassing 8-10 main categories, such as World, Society, Culture, Economics, Science-Technology, Sports, Entertainment, Law, Education, Health, Automobiles, and Real Estate.
    • Step 2: Summarization - Two individuals participate in creating reference summaries. The summarization process includes two stages: extracting important sentences and rewriting them into a coherent paragraph.

    Data information: Data Volume: 200 clusters

    Each cluster corresponds to a folder, and it typically contains 2-5 documents (often 3). The folder's name represents the cluster.

    Within each folder:

    • .info: Contains cluster ID and cluster label (labels are assigned by the cluster creator based on content).
    • .ref1.txt: Reference summary created by summarizer 1.
    • .ref1.tok.txt: Tokenized version of reference summary 1, with sentences and words separated.
    • .ref2.txt: Reference summary created by summarizer 2.
    • .ref2.tok.txt: Tokenized version of reference summary 2, with sentences and words separated.
    • .sum.txt: Machine-generated summary.
    • .sum.tok.txt: Tokenized version of the machine-generated summary, with sentences and words separated.

    All files within the same folder represent documents (online articles) belonging to the cluster:

    • .body.txt: Contains the main content of the document.
    • .body.tok.txt: Contains the document's content with sentences and words separated.
    • .info.txt: Contains other in...
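    Given the per-cluster file naming above, a cluster folder might be loaded like this sketch; the file-name prefixes, the tiny demo cluster, and the UTF-8 encoding are assumptions:

```python
import tempfile
from pathlib import Path

def load_cluster(cluster_dir: str) -> dict:
    """Collect the files of one Vietnamese MDS cluster by suffix.

    Sketch based on the naming scheme above (.ref1.txt / .ref2.txt,
    .sum.txt, .body.txt); the tokenized variants are skipped because
    their extra ".tok" segment keeps them out of these glob patterns.
    """
    d = Path(cluster_dir)
    read = lambda p: p.read_text(encoding="utf-8")
    return {
        "refs": [read(p) for p in sorted(d.glob("*.ref[12].txt"))],
        "machine_summary": [read(p) for p in d.glob("*.sum.txt")],
        "documents": [read(p) for p in sorted(d.glob("*.body.txt"))],
    }

# Tiny invented demo cluster written to a temp directory.
demo = Path(tempfile.mkdtemp())
(demo / "a.body.txt").write_text("doc one", encoding="utf-8")
(demo / "b.body.txt").write_text("doc two", encoding="utf-8")
(demo / "c.ref1.txt").write_text("ref one", encoding="utf-8")
(demo / "c.ref2.txt").write_text("ref two", encoding="utf-8")
cluster = load_cluster(str(demo))
```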
  10. Data from: Legal Case Document Summarization: Extractive and Abstractive Methods and their Evaluation

    • zenodo.org
    • data.niaid.nih.gov
    • +1more
    bin, zip
    Updated Nov 23, 2022
    Cite
    Abhay Shukla; Paheli Bhattacharya; Soham Poddar; Rajdeep Mukherjee; Kripabandhu Ghosh; Pawan Goyal; Saptarshi Ghosh (2022). Legal Case Document Summarization: Extractive and Abstractive Methods and their Evaluation [Dataset]. http://doi.org/10.5281/zenodo.7151679
    Explore at:
    bin, zipAvailable download formats
    Dataset updated
    Nov 23, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Abhay Shukla; Paheli Bhattacharya; Soham Poddar; Rajdeep Mukherjee; Kripabandhu Ghosh; Pawan Goyal; Saptarshi Ghosh
    License

    Attribution 4.0 International (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository contains the following 3 datasets for legal document summarization:

    - IN-Abs: Indian Supreme Court case documents & their "abstractive" summaries, obtained from http://www.liiofindia.org/in/cases/cen/INSC/
    - IN-Ext: Indian Supreme Court case documents & their "extractive" summaries, written by two law experts (A1, A2).
    - UK-Abs: United Kingdom (U.K.) Supreme Court case documents & their "abstractive" summaries, obtained from https://www.supremecourt.uk/decided-cases/

    Please refer to the paper and the README file for more details.

  11. Data from: Summarization datasets from the KAS corpus KAS-Sum 1.0

    • live.european-language-grid.eu
    binary format
    Updated Feb 3, 2022
    Cite
    (2022). Summarization datasets from the KAS corpus KAS-Sum 1.0 [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/20154
    Explore at:
    Available download formats: binary format
    Dataset updated
    Feb 3, 2022
    License

    https://clarin.si/repository/xmlui/page/licence-aca-id-by-nc-inf-nored-1.0

    Description

    Summarization datasets were created from the text bodies in the KAS 2.0 corpus (http://hdl.handle.net/11356/1448) and the abstracts from the KAS-Abs 2.0 corpus (http://hdl.handle.net/11356/1449). The monolingual slo2slo dataset contains 69,730 Slovene abstracts and Slovene body texts. The cross-lingual slo2eng dataset contains 52,351 Slovene body texts and English abstracts, and is suitable for building cross-lingual summarization models. The total number of words represents the sum of words in bodies, Slovene abstracts, and English abstracts.

    The files are stored in the same manner as the complete KAS corpus, i.e. in 1,000 directories with the same filename prefix as in KAS. They are in the JSON format that contains chapter segmented text. In addition to a unique chapter ID, each JSON file contains a key titled “abstract” that contains a list with abstract text as its first element. The file with the metadata for the corpus texts is also included.
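    Following that description (chapter-segmented JSON with an "abstract" key holding a list whose first element is the abstract text), one abstract/body training pair could be assembled roughly as follows; the chapter key names in the sample are invented for illustration:

```python
import json

def to_pair(json_text: str) -> tuple:
    """Turn one KAS-Sum JSON file into an (abstract, body) training pair.

    Assumes the structure sketched in the description: an "abstract" key
    whose value is a list with the abstract text first, plus chapter-
    segmented body text under the remaining keys (names assumed here).
    """
    doc = json.loads(json_text)
    abstract = doc["abstract"][0]
    body = " ".join(
        text for key, text in doc.items()
        if key != "abstract" and isinstance(text, str)
    )
    return abstract, body

sample = json.dumps({
    "abstract": ["A short Slovene abstract."],
    "chapter_1": "Uvod ...",
    "chapter_2": "Metode ...",
})
abstract, body = to_pair(sample)
```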

    The datasets are suitable for training monolingual Slovene summarization models and cross-lingual Slovene-English summarization models on long texts.

    References: Žagar, A., Kavaš, M., & Robnik Šikonja, M. (2021). Corpus KAS 2.0: cleaner and with new datasets. In Information Society - IS 2021: Proceedings of the 24th International Multiconference. https://doi.org/10.5281/zenodo.5562228

  12. pubmed-summarization

    • huggingface.co
    • opendatalab.com
    Updated Dec 1, 2021
    Cite
    ccdv (2021). pubmed-summarization [Dataset]. https://huggingface.co/datasets/ccdv/pubmed-summarization
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 1, 2021
    Authors
    ccdv
    Description

    PubMed dataset for summarization

    Dataset for summarization of long documents. Adapted from this repo. Note that the original data are pre-tokenized, so this dataset returns " ".join(text) and adds " " for paragraphs. This dataset is compatible with the run_summarization.py script from Transformers if you add this line to the summarization_name_mapping variable: "ccdv/pubmed-summarization": ("article", "abstract")

      Data Fields
    

    id: paper id; article: a string containing the body… See the full description on the dataset page: https://huggingface.co/datasets/ccdv/pubmed-summarization.
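    The column mapping mentioned in the description amounts to telling the training script which columns hold the document body and the target summary. A minimal sketch of that lookup follows; the variable name comes from the description, while the helper function is hypothetical:

```python
# Maps dataset name -> (text column, summary column), mirroring the
# summarization_name_mapping variable in run_summarization.py.
summarization_name_mapping = {
    "ccdv/pubmed-summarization": ("article", "abstract"),
}

def get_columns(dataset_name: str, default=("text", "summary")):
    """Look up which columns hold the document body and the summary."""
    return summarization_name_mapping.get(dataset_name, default)

text_col, summary_col = get_columns("ccdv/pubmed-summarization")
```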

  13. Semantic Summarization for Context Aware Manipulation of Data, Phase II

    • data.nasa.gov
    application/rdfxml +5
    Updated Jun 26, 2018
    Cite
    (2018). Semantic Summarization for Context Aware Manipulation of Data, Phase II [Dataset]. https://data.nasa.gov/d/vcqh-dx6v
    Explore at:
    Available download formats: csv, xml, tsv, application/rssxml, json, application/rdfxml
    Dataset updated
    Jun 26, 2018
    License

    U.S. Government Works (https://www.usa.gov/government-works)
    License information was derived automatically

    Description

    NASA's exploration and scientific missions will produce terabytes of information. As NASA enters a new phase of space exploration, managing large amounts of scientific and operational data will become even more challenging. Robots conducting planetary exploration will produce data for selection and preparation of exploration sites. Robots and space probes will collect scientific data to improve understanding of the solar system. Satellites in low Earth orbit will collect data for monitoring changes in the Earth's atmosphere and surface environment. Key challenges for all these missions are understanding and summarizing what data have been collected and using this knowledge to improve data access. TRACLabs and CMU propose to develop context aware image manipulation software for managing data collected remotely during NASA missions. This software will filter and search large image archives using the temporal and spatial characteristics of images, and the robotic, instrument, and environmental conditions when images were taken. It also will implement techniques for finding which images show a terrain feature specified by the user. In Phase II we will implement this software and evaluate its effectiveness for NASA missions. At the end of Phase II, context aware image manipulation software at TRL 5-6 will be delivered to NASA.

  14. DfT: business plan quarterly data summary (QDS)

    • gov.uk
    Updated Nov 29, 2012
    + more versions
    Cite
    Department for Transport (2012). DfT: business plan quarterly data summary (QDS) [Dataset]. https://www.gov.uk/government/publications/business-plan-quarterly-data-summary-qds
    Explore at:
    Dataset updated
    Nov 29, 2012
    Dataset provided by
    GOV.UK (http://gov.uk/)
    Authors
    Department for Transport
    Description

    Under the new quarterly data summary (QDS) framework, departments' spending data is published every quarter to show the taxpayer how the government is spending their money.

    The QDS grew out of commitments made in the 2011 Budget and the written ministerial statement on business plans. For the financial year 2012 to 2013 the QDS has been revised and improved in line with action 9 of the Civil Service Reform Plan to provide a common set of data that will enable comparisons of operational performance across government so that departments and individuals can be held to account.

    The QDS breaks down the total spend of the department in 3 ways:

    • by budget
    • by internal operation
    • by transaction.

    The QDS template is the same for all departments, though the individual detail of grants and policy will differ from department to department. In using this data:

    • people should ensure they take full note of the caveats noted in each department’s return
    • as improvement of the QDS is an ongoing process, data quality and completeness will develop over time, so appropriate caution should be applied to any comparative analysis undertaken

    Please note that the quarter 1 2012 to 2013 return for the Department for Transport (DfT) is for the core department only.

    April 2012

    Quarterly data summaries for April 2012 are as follows:

    January 2012

    Quarterly data summaries for January 2012 are as follows:

    October 2011

    Quarterly data summaries for October 2011 are as follows:

  15. Measuring the Impact of Digital Repositories: Summary of Big Data Workshop

    • catalog.data.gov
    • s.cnmilf.com
    • +1more
    Updated May 14, 2025
    + more versions
    Cite
    NCO NITRD (2025). Measuring the Impact of Digital Repositories: Summary of Big Data Workshop [Dataset]. https://catalog.data.gov/dataset/measuring-the-impact-of-digital-repositories-summary-of-big-data-workshop
    Explore at:
    Dataset updated
    May 14, 2025
    Dataset provided by
    NCO NITRD
    Description

    The Big Data Interagency Working Group (BD IWG) held a workshop, Measuring the Impact of Digital Repositories, on February 28 - March 1, 2017 in Arlington, VA. The aim of the workshop was to identify current assessment metrics, tools, and methodologies that are effective in measuring the impact of digital data repositories, and to identify the assessment issues, obstacles, and tools that require additional research and development (R&D). This workshop brought together leaders from academic, journal, government, and international data repository funders, users, and developers to discuss these issues...

  16. Data from: Slovenian text summarization models

    • live.european-language-grid.eu
    Updated Dec 20, 2022
    Cite
    (2022). Slovenian text summarization models [Dataset]. https://live.european-language-grid.eu/catalogue/tool-service/20871
    Explore at:
    Dataset updated
    Dec 20, 2022
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    A text summarisation task aims to convert a longer text into a shorter text while preserving the essential information of the source text. In general, there are two approaches to text summarization. The extractive approach selects and reuses the most important sentences or parts of the source text, whereas the abstractive approach generates new text and is therefore closer to human-made summaries. We release 5 models that cover extractive, abstractive, and hybrid types:

    • Metamodel: a neural model based on the Doc2Vec document representation that suggests the best summariser.
    • Graph-based model: an unsupervised graph-based extractive approach that returns the N most relevant sentences.
    • Headline model: a supervised abstractive approach (T5 architecture) that returns headline-like abstracts.
    • Article model: a supervised abstractive approach (T5 architecture) that returns short summaries.
    • Hybrid-long model: an unsupervised hybrid (graph-based and transformer model-based) approach that returns short summaries of long texts.

    Details and instructions to run and train the models are available at https://github.com/clarinsi/SloSummarizer.

    The web service with a demo is available at https://slovenscina.eu/povzemanje.
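    The graph-based extractive approach described above can be illustrated in a few lines. This is a generic TextRank-style sketch, not the SloSummarizer code; the sentence splitting and the word-overlap similarity are deliberate simplifications:

    ```python
    import re

    def similarity(a: str, b: str) -> float:
        # Word-overlap similarity between two sentences, normalized by combined size.
        wa, wb = set(re.findall(r"\w+", a.lower())), set(re.findall(r"\w+", b.lower()))
        if not wa or not wb:
            return 0.0
        return len(wa & wb) / (len(wa) + len(wb))

    def extract_summary(text: str, n: int = 2) -> list[str]:
        # Naive sentence split on terminal punctuation.
        sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]
        # Score each sentence by its total similarity to all others
        # (degree centrality in the sentence-similarity graph).
        scored = [
            (sum(similarity(s, t) for j, t in enumerate(sentences) if i != j), i)
            for i, s in enumerate(sentences)
        ]
        # Keep the n highest-scoring sentences, restored to document order.
        top = sorted(sorted(scored, reverse=True)[:n], key=lambda pair: pair[1])
        return [sentences[i] for _, i in top]
    ```

    A full implementation would typically rank sentences with PageRank over the similarity graph rather than raw degree centrality, but the shape of the computation is the same: build a graph of sentences, score them, and return the top N in their original order.
    
    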

  17. Data from: Towards a unified multi-dimensional evaluator for text generation...

    • resodate.org
    • service.tib.eu
    Updated Dec 16, 2024
    Cite
    Ming Zhong; Yang Liu; Da Yin; Yuning Mao; Yizhu Jiao; Pengfei Liu; Chenguang Zhu; Heng Ji; Jiawei Han (2024). Towards a unified multi-dimensional evaluator for text generation [Dataset]. https://resodate.org/resources/aHR0cHM6Ly9zZXJ2aWNlLnRpYi5ldS9sZG1zZXJ2aWNlL2RhdGFzZXQvdG93YXJkcy1hLXVuaWZpZWQtbXVsdGktZGltZW5zaW9uYWwtZXZhbHVhdG9yLWZvci10ZXh0LWdlbmVyYXRpb24=
    Explore at:
    Dataset updated
    Dec 16, 2024
    Dataset provided by
    Leibniz Data Manager
    Authors
    Ming Zhong; Yang Liu; Da Yin; Yuning Mao; Yizhu Jiao; Pengfei Liu; Chenguang Zhu; Heng Ji; Jiawei Han
    Description

    The NewsRoom dataset consists of 60 input source texts and 7 output summaries for each sample.

  18. PubMed Article Summarization Dataset

    • kaggle.com
    zip
    Updated Dec 5, 2023
    Cite
    The Devastator (2023). PubMed Article Summarization Dataset [Dataset]. https://www.kaggle.com/datasets/thedevastator/pubmed-article-summarization-dataset
    Explore at:
    zip(686033678 bytes)Available download formats
    Dataset updated
    Dec 5, 2023
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    PubMed Article Summarization Dataset

    PubMed Summarization Dataset

    By ccdv (From Huggingface) [source]

    About this dataset

    The dataset consists of multiple files, including validation.csv, train.csv, and test.csv. Each file contains a combination of articles and their respective abstracts. The articles are sourced directly from PubMed, ensuring they represent a wide range of topics across various scientific disciplines.

    In order to provide reliable datasets for different purposes, the files have been carefully curated to serve specific functions. validation.csv contains a subset of articles with their corresponding abstracts that can be used for validating the performance of summarization models during development. train.csv features a larger set of article-abstract pairs specifically intended for training such models.

    Finally, test.csv serves as an independent evaluation set that allows developers to measure the effectiveness and generalizability of their summarization models against unseen data points. By using this test set, researchers can assess how well their algorithms perform in generating concise summaries that accurately capture the main findings and conclusions within scientific articles.

    Researchers in natural language processing (NLP), machine learning (ML), or any related field can utilize this dataset to advance automatic text summarization techniques focused on scientific literature. Whether it's building extractive or abstractive methods or exploring novel approaches like neural networks or transformer-based architectures, this rich dataset provides ample opportunities for experimentation and progress in the field.

    How to use the dataset

    Introduction:

    Dataset Structure:

    • article: The full text of a scientific article from the PubMed database (Text).
    • abstract: A summary of the main findings and conclusions of the article (Text).

    Using the Dataset: To maximize the utility of this dataset, it is important to understand its purpose and how it can be utilized:

    • Training Models: The train.csv file contains articles and their corresponding abstracts that can be used for training summarization models or developing algorithms that generate concise summaries automatically.

    • Validation Purposes: The validation.csv file serves as a validation set for fine-tuning your models or comparing different approaches during development.

    • Evaluating Model Performance: The test.csv file offers a separate set of articles along with their corresponding abstracts specifically designed for evaluating the performance of various summarization models.
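    The splits can be read with nothing more than Python's csv module. The column names (`article`, `abstract`) follow the Dataset Structure section above; the in-memory sample row below is invented so the sketch runs without the real file:

    ```python
    import csv
    import io

    # Stand-in for open("train.csv", newline="") -- one invented article/abstract pair.
    sample_csv = (
        "article,abstract\n"
        '"The trial enrolled 120 patients over two years ...",'
        '"120 patients were enrolled; outcomes improved ..."\n'
    )

    with io.StringIO(sample_csv) as f:
        rows = list(csv.DictReader(f))

    # Each row is one article/abstract pair, ready for a summarization pipeline.
    pairs = [(row["article"], row["abstract"]) for row in rows]
    print(f"{len(pairs)} pair(s) loaded")
    ```

    Swapping the `io.StringIO` stand-in for a real `open("train.csv", newline="")` call is the only change needed to process the actual split.
    
    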

    Tips for Utilizing the Dataset Effectively:

    • Preprocessing: Before using this dataset, consider preprocessing steps such as removing irrelevant sections (e.g., acknowledgments, references), cleaning up invalid characters or formatting issues if any exist.

    • Feature Engineering: Explore additional features like article length, sentence structure complexity, or domain-specific details that may assist in improving summarization model performance.

    • Model Selection & Evaluation: Experiment with different summarization algorithms, ranging from traditional extractive approaches to more advanced abstractive methods. Evaluate model performance using established metrics such as ROUGE (Recall-Oriented Understudy for Gisting Evaluation).

    • Data Augmentation: Depending on the size of your dataset, you may consider augmenting it further by applying techniques like data synthesis or employing external resources (e.g., pre-trained language models) to enhance model performance.
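    As a concrete instance of the ROUGE metric mentioned in the tips above, ROUGE-1 recall (the fraction of reference unigrams that also appear in the candidate summary, with clipped counts) fits in a few lines. Real evaluations normally use a library such as `rouge-score`, and this sketch uses naive whitespace tokenization:

    ```python
    from collections import Counter

    def rouge1_recall(reference: str, candidate: str) -> float:
        """Fraction of reference unigrams also present in the candidate (clipped counts)."""
        ref = Counter(reference.lower().split())
        cand = Counter(candidate.lower().split())
        overlap = sum(min(count, cand[word]) for word, count in ref.items())
        total = sum(ref.values())
        return overlap / total if total else 0.0
    ```

    Applied to this dataset, `reference` would be the `abstract` column and `candidate` a model-generated summary; ROUGE-2 and ROUGE-L extend the same idea to bigrams and longest common subsequences.
    
    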

    Conclusion:

    Research Ideas

    • Textual analysis and information retrieval: Researchers can use this dataset to analyze patterns in scientific literature or conduct information retrieval tasks. By examining the relationship between article content and its abstract, researchers can gain insights into how different sections of a scientific paper contribute to its overall summary.

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: validation.csv | Column name | Description ...

  19. Assessment Data Summary

    • data.milwaukee.gov
    csv, pdf
    Updated Feb 14, 2025
    Cite
    Assessor's Office (2025). Assessment Data Summary [Dataset]. https://data.milwaukee.gov/dataset/assessment-data-summary
    Explore at:
    pdf(141178), csv(9732), pdf(683105)Available download formats
    Dataset updated
    Feb 14, 2025
    Dataset authored and provided by
    Assessor's Office
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Update Frequency: Annual

    Updated for 2022. End of year assessed property values for the City of Milwaukee for the years 1992-present. These values include real estate property and personal property in Milwaukee, Washington, and Waukesha Counties.

    One data row per year.

    To download XML and JSON files, click the CSV option below and click the down arrow next to the Download button in the upper right on its page.

  20. Mixed-Summarization-Dataset

    • huggingface.co
    Updated May 27, 2025
    Cite
    Natural Language Processing in Russian (2025). Mixed-Summarization-Dataset [Dataset]. https://huggingface.co/datasets/RussianNLP/Mixed-Summarization-Dataset
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    May 27, 2025
    Dataset authored and provided by
    Natural Language Processing in Russian
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Russian summarization data mix

    Total number of items in Train: 197561. Total number of items in Golden Test set: 258 (manually verified semi-synthetic data for evaluation purposes).

      We use these datasets for the train mix:
    

    • XLSum
    • Gazeta
    • WikiLingua
    • MLSUM
    • Reviews (ru)
    • Curation-corpus (ru)
    • Matreshka
    • DialogSum (ru)
    • SAMSum (ru)

      Cite us
    

    @misc{akhmetgareeva2024summary, title={Towards Russian Summarization: can architecture solve data limitations problems?}… See the full description on the dataset page: https://huggingface.co/datasets/RussianNLP/Mixed-Summarization-Dataset.
