License: https://creativecommons.org/publicdomain/zero/1.0/
By ccdv (From Huggingface) [source]
The dataset consists of multiple files, including validation.csv, train.csv, and test.csv. Each file contains a combination of articles and their respective abstracts. The articles are sourced directly from PubMed, ensuring they represent a wide range of topics across various scientific disciplines.
In order to provide reliable datasets for different purposes, the files have been carefully curated to serve specific functions. validation.csv contains a subset of articles with their corresponding abstracts that can be used for validating the performance of summarization models during development. train.csv features a larger set of article-abstract pairs specifically intended for training such models.
Finally, test.csv serves as an independent evaluation set that allows developers to measure the effectiveness and generalizability of their summarization models against unseen data points. By using this test set, researchers can assess how well their algorithms perform in generating concise summaries that accurately capture the main findings and conclusions within scientific articles.
Researchers in natural language processing (NLP), machine learning (ML), or any related field can utilize this dataset to advance automatic text summarization techniques focused on scientific literature. Whether it's building extractive or abstractive methods or exploring novel approaches like neural networks or transformer-based architectures, this rich dataset provides ample opportunities for experimentation and progress in the field.
Introduction:
Dataset Structure:
- article: The full text of a scientific article from the PubMed database (Text).
- abstract: A summary of the main findings and conclusions of the article (Text).
Using the Dataset: To maximize the utility of this dataset, it is important to understand its purpose and how it can be utilized:
Training Models: The train.csv file contains articles and their corresponding abstracts that can be used for training summarization models or developing algorithms that generate concise summaries automatically.
Validation Purposes: The validation.csv file serves as a test set for fine-tuning your models or comparing different approaches during development.
Evaluating Model Performance: The test.csv file offers a separate set of articles along with their corresponding abstracts specifically designed for evaluating the performance of various summarization models.
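As a minimal sketch of this workflow (assuming the three CSV files sit in the working directory and use the article/abstract columns described above), the splits can be loaded with pandas:

    import pandas as pd

    # Load the three splits described above; the paths are assumptions.
    train = pd.read_csv("train.csv")        # article-abstract pairs for training
    valid = pd.read_csv("validation.csv")   # held-out pairs for model selection
    test = pd.read_csv("test.csv")          # unseen pairs for final evaluation

    print(train.columns.tolist())           # expected: ['article', 'abstract']
    print(train.loc[0, "abstract"][:200])   # peek at one reference summary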
Tips for Utilizing the Dataset Effectively:
Preprocessing: Before using this dataset, consider preprocessing steps such as removing irrelevant sections (e.g., acknowledgments, references), cleaning up invalid characters or formatting issues if any exist.
Feature Engineering: Explore additional features like article length, sentence structure complexity, or domain-specific details that may assist in improving summarization model performance.
Model Selection & Evaluation: Experiment with different summarization algorithms, ranging from traditional extractive approaches to more advanced abstractive methods. Evaluate model performance using established metrics such as ROUGE (Recall-Oriented Understudy for Gisting Evaluation).
Data Augmentation: Depending on the size of your dataset, you may consider augmenting it further by applying techniques like data synthesis or employing external resources (e.g., pre-trained language models) to enhance model performance.
Conclusion:
- Textual analysis and information retrieval: Researchers can use this dataset to analyze patterns in scientific literature or conduct information retrieval tasks. By examining the relationship between article content and its abstract, researchers can gain insights into how different sections of a scientific paper contribute to its overall summary.
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) Public Domain Dedication - No copyright. You can copy, modify, distribute, and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: validation.csv | Column name | Description ...
License: https://creativecommons.org/publicdomain/zero/1.0/
By ccdv (From Huggingface) [source]
The validation.csv file contains a set of articles along with their respective abstracts that can be used for validating the performance of summarization models. This subset allows researchers to fine-tune their models and measure how well they can summarize scientific texts.
The train.csv file serves as the primary training data for building summarization models. It consists of numerous articles extracted from the Arxiv database, paired with their corresponding abstracts. By utilizing this file, researchers can develop and train various machine learning algorithms to generate accurate summaries of scientific papers.
Lastly, the test.csv file provides a separate set of articles with accompanying abstracts specifically intended for evaluating the performance and effectiveness of summarization models developed using this dataset. Researchers can utilize this test set to conduct rigorous evaluations and benchmark different approaches in automatic document summarization.
With its article and abstract columns, this dataset provides significant flexibility for developing robust models for summarizing complex scientific documents.
Introduction:
File Description:
validation.csv: This file contains articles and their respective abstracts that can be used for validation purposes.
train.csv: The purpose of this file is to provide training data for summarizing scientific articles.
test.csv: This file includes a set of articles and their corresponding abstracts that can be used to evaluate the performance of summarization models.
Dataset Structure: The dataset consists of two columns, article and abstract.
Usage Examples: This dataset can be utilized in various ways:
a) Training Models: You can use the train.csv file to train your own model for summarizing scientific articles from the Arxiv database. The article column provides the full text of each scientific paper, while the abstract column contains its summary.
b) Validation: The validation.csv file allows you to validate your trained models by comparing their generated summaries with the provided reference summaries in order to assess their performance.
c) Evaluation: Utilize the test.csv file as a benchmark for evaluating different summarization models. Generate summaries using your selected model and compare them with reference summaries.
- Evaluating Performance: To measure how well your summarization model performs on this dataset, you can use metrics like ROUGE (Recall-Oriented Understudy for Gisting Evaluation). ROUGE measures overlap between generated summaries and reference summaries based on n-gram co-occurrence statistics.
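As a minimal sketch of such an evaluation (using the open-source rouge-score package, which is one common choice rather than anything mandated by this dataset; the two strings are illustrative placeholders):

    from rouge_score import rouge_scorer

    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    reference = "the study reports reduced error rates across all benchmarks"  # placeholder reference summary
    candidate = "error rates were reduced across all benchmarks in the study"  # placeholder model output
    scores = scorer.score(reference, candidate)
    for name, s in scores.items():
        # Each entry holds n-gram precision, recall, and F1 (fmeasure).
        print(f"{name}: P={s.precision:.3f} R={s.recall:.3f} F1={s.fmeasure:.3f}")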
- Summarizing scientific articles: This dataset can be used to train and evaluate summarization models for the task of generating concise summaries of scientific articles from the Arxiv database. Researchers can utilize this dataset to develop novel techniques and approaches for automatic summarization in the scientific domain.
- Information retrieval: The dataset can be used to enhance search engines or information retrieval systems by providing concise summaries along with the full text of scientific articles. This would enable users to quickly grasp key information without having to read the entire article, improving accessibility and efficiency.
- Text generation research: Researchers interested in natural language processing and text generation can use this dataset as a benchmark for developing new models and algorithms that generate coherent, informative, and concise summaries from lengthy scientific texts. The dataset provides a diverse range of articles across various domains, allowing researchers to explore different challenges in summary generation.
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain...
PubMed dataset for summarization
Dataset for summarization of long documents. Adapted from this repo. Note that the original data are pre-tokenized, so this dataset returns " ".join(text) and adds " " for paragraphs. This dataset is compatible with the run_summarization.py script from Transformers if you add this line to the summarization_name_mapping variable: "ccdv/pubmed-summarization": ("article", "abstract")
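A minimal loading sketch with the Hugging Face datasets library (the repository id comes from the dataset page below; a configuration name may additionally be required depending on the dataset version):

    from datasets import load_dataset

    # Repo id as given on the dataset page; check the card if a config name is required.
    ds = load_dataset("ccdv/pubmed-summarization")
    print(ds)                                  # expected splits: train/validation/test
    print(ds["train"][0]["abstract"][:200])    # one reference abstract

    # For Transformers' run_summarization.py, extend the mapping as described above:
    # summarization_name_mapping["ccdv/pubmed-summarization"] = ("article", "abstract")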
Data Fields
- id: paper id
- article: a string containing the body…
See the full description on the dataset page: https://huggingface.co/datasets/ccdv/pubmed-summarization.
This work was supported by the Ho Chi Minh City Department of Science and Technology, Grant Number 15/2016/HÐ-SKHCN.
Data construction process: In this work, we aim to have 300 clusters of documents extracted from news. To this end, we made use of the Vietnamese language version of Google News. Due to the copyright issue, we did not collect articles from every source listed on Google News, but limited to some sources that are open for research purposes. The collected articles belong to five genres: world news, domestic news, business, entertainment, and sports. Every cluster contains from four to ten news articles. Each article is represented by the following information: the title, the plain text content, the news source, the date of publication, the author(s), the tag(s) and the headline summary.
After that, two summaries are created for each cluster (produced in the first subtask above) by two different annotators using the MDSWriter system (Meyer, Christian M., et al. "MDSWriter: Annotation tool for creating high-quality multi-document summarization corpora." Proceedings of ACL-2016 System Demonstrations). The annotators are Vietnamese native speakers who are undergraduate or graduate students, most of them familiar with natural language processing. The full annotation process consists of seven steps that must be completed sequentially, from the first to the seventh.
Data information: Original folder: Containing 300 subdirectories which are 300 news clusters. Articles (documents) in each cluster belong to a similar topic and there are from four to ten of them. The number of articles is 1,945.
Summary folder: Contains 300 subdirectories holding 600 final summaries in total. Every input cluster has two manual abstractive summaries from two different annotators. ViMs can be used for both implementing and evaluating supervised machine learning-based systems for Vietnamese abstractive multi-document summarization.
S3_summary folder: Contains 300 subdirectories holding 600 "best sentence selection" summaries, the result of step 3, the best sentence selection step. Sentences in a group are separated from other groups by a blank line. The most important sentence is labeled 1, while 0 is the label for the others.
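A hypothetical parsing sketch for such a file, assuming each non-blank line pairs a label and a sentence separated by a tab (the exact line layout is not documented here, so inspect a real file before relying on this):

    def parse_s3_summary(path):
        """Parse groups of (label, sentence) pairs; blank lines separate groups."""
        groups, current = [], []
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line:                 # blank line closes the current group
                    if current:
                        groups.append(current)
                        current = []
                    continue
                label, sentence = line.split("\t", 1)   # assumed "<label>\t<sentence>" layout
                current.append((int(label), sentence))  # label 1 marks the most important sentence
        if current:
            groups.append(current)
        return groups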
@article{tran2020vims,
  title={ViMs: a high-quality Vietnamese dataset for abstractive multi-document summarization},
  author={Tran, Nhi-Thao and Nghiem, Minh-Quoc and Nguyen, Nhung TH and Nguyen, Ngan Luu-Thuy and Van Chi, Nam and Dinh, Dien},
  journal={Language Resources and Evaluation},
  volume={54},
  number={4},
  pages={893--920},
  year={2020},
  publisher={Springer}
}
Author: Tran Mai Vu, Vu Trong Hoa, Phi Van Thuy, Le Duc Trong, Ha Quang Thuy Affiliation: Knowledge Technology Laboratory, University of Technology, VNU Hanoi Research Topic: Design and Implementation of a Multi-document Summarization Program for the Vietnamese Language, Funded by the Ministry of Education (Project Code: B2012-01-24)
Data construction process: The data construction process is entirely manual. It consists of two steps:
Data information: Data Volume: 200 clusters
Each cluster corresponds to a folder, and it typically contains 2-5 documents (often 3). The folder's name represents the cluster.
Within each folder:
All files within the same folder represent documents (online articles) belonging to the cluster:
AutoTrain Dataset for project: summarization
Dataset Description
This dataset has been automatically processed by AutoTrain for project summarization.
Languages
The BCP-47 code for the dataset's language is en.
Dataset Structure
Data Instances
A sample from this dataset looks as follows:

    [ { "feat_id": "train_0", "text": "#Person1#: Hi, Mr. Smith. I'm Doctor Hawkins. Why are you here today?
Values in parentheses are for the highest resolution shell.
License: https://clarin.si/repository/xmlui/page/licence-aca-id-by-nc-inf-nored-1.0
Summarization datasets were created from the text bodies in the KAS 2.0 corpus (http://hdl.handle.net/11356/1448) and the abstracts from the KAS-Abs 2.0 corpus (http://hdl.handle.net/11356/1449). The monolingual slo2slo dataset contains 69,730 Slovene abstracts and Slovene body texts. The cross-lingual slo2eng dataset contains 52,351 Slovene body texts and English abstracts, making it suitable for building cross-lingual summarization models. The total number of words represents the sum of words in bodies, Slovene abstracts, and English abstracts.
The files are stored in the same manner as the complete KAS corpus, i.e. in 1,000 directories with the same filename prefix as in KAS. They are in the JSON format that contains chapter segmented text. In addition to a unique chapter ID, each JSON file contains a key titled “abstract” that contains a list with abstract text as its first element. The file with the metadata for the corpus texts is also included.
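As a small sketch of reading one such file (the filename is hypothetical; only the "abstract" key and its list layout are taken from the description above):

    import json

    # Hypothetical path into one of the 1,000 KAS-style directories.
    with open("kas-001/kas-001234.json", encoding="utf-8") as f:
        doc = json.load(f)

    # Per the description, "abstract" holds a list whose first element is the abstract text.
    abstract = doc["abstract"][0]
    print(abstract[:300])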
The datasets are suitable for training monolingual Slovene summarization models and cross-lingual Slovene-English summarization models on long texts.
References: Žagar, A., Kavaš, M., & Robnik Šikonja, M. (2021). Corpus KAS 2.0: cleaner and with new datasets. In Information Society - IS 2021: Proceedings of the 24th International Multiconference. https://doi.org/10.5281/zenodo.5562228
License: Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains the following 3 datasets for legal document summarization :
- IN-Abs : Indian Supreme Court case documents & their `abstractive' summaries, obtained from http://www.liiofindia.org/in/cases/cen/INSC/
- IN-Ext : Indian Supreme Court case documents & their `extractive' summaries, written by two law experts (A1, A2).
- UK-Abs : United Kingdom (U.K.) Supreme Court case documents & their `abstractive' summaries, obtained from https://www.supremecourt.uk/decided-cases/
Please refer to the paper and the README file for more details.
License: Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Russian summarization data mix
Total number of items in the train set: 197,561. Total number of items in the golden test set: 258 (manually verified semi-synthetic data for evaluation purposes).
We use these datasets for the training mix:
- XLSum
- Gazeta
- WikiLingua
- MLSUM
- Reviews (ru)
- Curation-corpus (ru)
- Matreshka
- DialogSum (ru)
- SAMSum (ru)
Cite us
@misc{akhmetgareeva2024summary, title={Towards Russian Summarization: can architecture solve data limitations problems?}… See the full description on the dataset page: https://huggingface.co/datasets/RussianNLP/Mixed-Summarization-Dataset.
License: Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
INTRODUCTION
As part of its responsibilities, the BC Ministry of Environment monitors water quality in the province's streams, rivers, and lakes. Often, it is necessary to compile statistics involving concentrations of contaminants or other compounds. Quite often the instruments used cannot measure concentrations below certain values. These observations are called non-detects or less-thans. However, non-detects pose a difficulty when it is necessary to compute statistical measurements such as the mean, the median, and the standard deviation for a data set. The way non-detects are handled can affect the quality of any statistics generated.
Non-detects, or censored data, are found in many fields such as medicine, engineering, biology, and environmetrics. In such fields, it is often the case that the measurements of interest are below some threshold. Dealing with non-detects is a significant issue, and statistical tools using survival or reliability methods have been developed. Basically, there are three approaches for treating data containing censored values:
1. substitution, which gives poor results and therefore is not recommended in the literature;
2. maximum likelihood estimation, which requires an assumption of some distributional form; and
3. nonparametric methods, which assess the shape of the data based on observed percentiles rather than a strict distributional form.
This document provides guidance on how to record censored data, and on when and how to use certain analysis methods when the percentage of censored observations is less than 50%. The methods presented in this document are:
1. substitution;
2. Kaplan-Meier, as part of nonparametric methods;
3. the lognormal model based on maximum likelihood estimation; and
4. robust regression on order statistics, which is a semiparametric method.
Statistical software suitable for survival or reliability analysis is available for dealing with censored data. This software has been widely used in medical and engineering environments. In this document, methods are illustrated with both the R and JMP software packages, where possible. JMP often requires some intermediate steps to obtain summary statistics with most of the methods described in this document. R, with the NADA package, is usually straightforward. The NADA package was developed specifically for computing statistics with non-detects in environmental data, based on Helsel (2005b). The data used to illustrate the methods described for computing summary statistics for non-detects are either simulated or based on information acquired from the B.C. Ministry of Environment. This document is strongly based on the book Nondetects And Data Analysis written by Dennis R. Helsel in 2005 (Helsel, 2005b).
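The document illustrates these methods with R (NADA) and JMP; as a rough Python analogue of the lognormal maximum likelihood approach, here is a sketch using SciPy's censored-data support (SciPy >= 1.11; the values and detection limits are made up):

    import numpy as np
    from scipy import stats

    # Measured concentrations; non-detects are recorded at their detection limit.
    obs = np.array([0.5, 0.5, 0.8, 1.2, 2.0, 3.5, 0.5])
    nd = np.array([True, True, False, False, False, False, True])  # True = non-detect (left-censored)

    data = stats.CensoredData.left_censored(obs, censored=nd)
    shape, loc, scale = stats.lognorm.fit(data, floc=0)  # MLE under a lognormal model

    # Lognormal mean from the fitted parameters: exp(mu + sigma^2/2), with scale = exp(mu).
    mean = scale * np.exp(shape**2 / 2)
    print(f"estimated mean concentration: {mean:.3f}")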
The Big Data Interagency Working Group (BD IWG) held a workshop, Measuring the Impact of Digital Repositories, on February 28 - March 1, 2017 in Arlington, VA. The aim of the workshop was to identify current assessment metrics, tools, and methodologies that are effective in measuring the impact of digital data repositories, and to identify the assessment issues, obstacles, and tools that require additional research and development (R&D). This workshop brought together leaders from academic, journal, government, and international data repository funders, users, and developers to discuss these issues...
This dataset is used in an experimental preference fine-tuning of the Qwen2-1.5B model for the summarization task. The goal is to re-implement Apple's work on training task-specific LoRAs on top of a small LM, for example for summarization. More info on the project: https://github.com/thepowerfuldeez/qwen2_1_5b_summarize
Method
The dataset was generated using samples from the RedPajamaV2 dataset, specifically Arxiv, Wikipedia, and StackExchange documents. I have downloaded 1% of the data and… See the full description on the dataset page: https://huggingface.co/datasets/thepowerfuldeez/Qwen-summarize-dataset-train.
Under the new quarterly data summary (QDS) framework, departments' spending data is published every quarter to show the taxpayer how the government is spending their money.
The QDS grew out of commitments made in the 2011 Budget and the written ministerial statement on business plans. For the financial year 2012 to 2013 the QDS has been revised and improved in line with action 9 of the Civil Service Reform Plan to provide a common set of data that will enable comparisons of operational performance across government so that departments and individuals can be held to account.
The QDS breaks down the total spend of the department in 3 ways:
The QDS template is the same for all departments, though the individual detail of grants and policy will differ from department to department. In using this data:
Please note that the quarter 1 2012 to 2013 return for the Department of Transport (DfT) is for the core department only.
Quarterly data summaries for April 2012 are as follows:
Quarterly data summaries for January 2012 are as follows:
Quarterly data summaries for October 2011 are as follows:
License: U.S. Government Works https://www.usa.gov/government-works
License information was derived automatically
NASA's exploration and scientific missions will produce terabytes of information. As NASA enters a new phase of space exploration, managing large amounts of scientific and operational data will become even more challenging. Robots conducting planetary exploration will produce data for selection and preparation of exploration sites. Robots and space probes will collect scientific data to improve understanding of the solar system. Satellites in low Earth orbit will collect data for monitoring changes in the Earth's atmosphere and surface environment. Key challenges for all these missions are understanding and summarizing what data have been collected and using this knowledge to improve data access. TRACLabs and CMU propose to develop context aware image manipulation software for managing data collected remotely during NASA missions. This software will filter and search large image archives using the temporal and spatial characteristics of images, and the robotic, instrument, and environmental conditions when images were taken. It also will implement techniques for finding which images show a terrain feature specified by the user. In Phase II we will implement this software and evaluate its effectiveness for NASA missions. At the end of Phase II, context aware image manipulation software at TRL 5-6 will be delivered to NASA.
License: Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
A text summarisation task aims to convert a longer text into a shorter text while preserving the essential information of the source text. In general, there are two approaches to text summarization. The extractive approach simply rewrites the most important sentences or parts of the text, whereas the abstractive approach is more similar to human-made summaries. We release 5 models that cover extractive, abstractive, and hybrid types:
- Metamodel: a neural model based on the Doc2Vec document representation that suggests the best summariser.
- Graph-based model: an unsupervised graph-based extractive approach that returns the N most relevant sentences.
- Headline model: a supervised abstractive approach (T5 architecture) that returns headline-like abstracts.
- Article model: a supervised abstractive approach (T5 architecture) that returns short summaries.
- Hybrid-long model: an unsupervised hybrid (graph-based and transformer-based) approach that returns short summaries of long texts.
Details and instructions to run and train the models are available at https://github.com/clarinsi/SloSummarizer.
The web service with a demo is available at https://slovenscina.eu/povzemanje.
qPCR results for Vitellogenin. This dataset is associated with the following publication: Armstrong, B., J. Lazorchak, K. Jensen, H. Haring, M.E. Smith, R. Flick, D. Bencic, and A. Biales. Reproductive effects in fathead minnows (Pimephales promelas) following a 21 d exposure to 17α-ethinylestradiol. CHEMOSPHERE. Elsevier Science Ltd, New York, NY, USA, 144(1): 366-373, (2015).
License: http://opendatacommons.org/licenses/dbcl/1.0/
This dataset comprises tables in JSON format, each accompanied by a concise summary of approximately 10 lines. Originally created to train large language models (LLMs) for efficient tabular data summarization, this dataset serves as a valuable resource for developing advanced summarization algorithms. The tables contain a mix of textual and numerical data, reflecting a wide array of information types. Each summary captures the essential insights and key patterns from the corresponding table, providing a clear and succinct overview. The dataset's diverse entries, ranging from descriptive text to precise numerical values, present a robust challenge for LLMs, facilitating the enhancement of their summarization capabilities. By leveraging this dataset, researchers and developers can improve the accuracy and efficiency of their models in interpreting and summarizing complex tabular data.
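As an illustrative sketch of how such a record might be turned into a training pair (the field names "table" and "summary" are assumptions, not documented here):

    import json

    # A made-up record in the assumed shape of one dataset entry.
    record = {
        "table": [{"city": "Oslo", "population": 709037}],
        "summary": "The table lists Oslo with a population of roughly 709,000.",
    }

    prompt = "Summarize the following table:\n" + json.dumps(record["table"], indent=2)
    target = record["summary"]
    print(prompt)
    print("Target:", target)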
License: https://choosealicense.com/licenses/unknown/
varunr14/summarize dataset hosted on Hugging Face and contributed by the HF Datasets community
License: Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The collection of datasets in Table 1 is extended, and their more meaningful, and thus recommended, descriptions based on multiplicative means and multiplicative standard errors or standard deviations are given. Some comparisons appear to be of interest. Necessarily, arithmetic means exceed multiplicative ones, starting from some 15% for small s* around 1.7 up to more than sevenfold for s* > 7. The lower limits of the 95% ranges, relative to the means, turn increasingly negative as s* grows for the classical version, but remain positive and get smaller for the multiplicative description. Turning to upper limits, the multiplicative limit exceeds the additive one by some 17% for s* = 1.7. With s* = 2.5, the difference is about 25%. For s* = 4.2, there is no difference, and for s* = 7, the additive mean is only half the multiplicative one.
This collection consists of data that has been quality controlled through both automated and manual QC processes. The dataset consists of summary-of-the-day data from approximately 500 First Order stations (primarily major airports) in the U.S. and its possessions. Included variables are: daily summaries of hourly data for temperature, precipitation, pressure, observed weather, and wind speed and direction. The primary source of this information is the Automated Surface Observation System (ASOS); however, other collections are also included, such as SRRS, keyed data, and others -- see the documentation for a complete list. Data were incorporated into the GHCN-Daily dataset as of 2013.