This dataset was created by Đan Trường phan đình.
AutoTrain Dataset for project: summarization
Dataset Description
This dataset has been automatically processed by AutoTrain for project summarization.
Languages
The BCP-47 code for the dataset's language is en.
Dataset Structure
Data Instances
A sample from this dataset looks as follows: [ { "feat_id": "train_0", "text": "#Person1#: Hi, Mr. Smith. I'm Doctor Hawkins. Why are you here today?
License: http://opendatacommons.org/licenses/dbcl/1.0/
This dataset comprises tables in JSON format, each accompanied by a concise summary of approximately 10 lines. Originally created to train large language models (LLMs) for efficient tabular data summarization, this dataset serves as a valuable resource for developing advanced summarization algorithms. The tables contain a mix of textual and numerical data, reflecting a wide array of information types. Each summary captures the essential insights and key patterns from the corresponding table, providing a clear and succinct overview. The dataset's diverse entries, ranging from descriptive text to precise numerical values, present a robust challenge for LLMs, facilitating the enhancement of their summarization capabilities. By leveraging this dataset, researchers and developers can improve the accuracy and efficiency of their models in interpreting and summarizing complex tabular data.
AutoTrain Dataset for project: summarization-xlsum
Dataset Description
This dataset has been automatically processed by AutoTrain for project summarization-xlsum.
Languages
The BCP-47 code for the dataset's language is unk.
Dataset Structure
Data Instances
A sample from this dataset looks as follows: [ { "text": "\u092e\u0928 \u0915\u0940 \u0917\u0939\u0930\u093e\u0907\u092f\u094b\u0902 \u092e\u0947\u0902… See the full description on the dataset page: https://huggingface.co/datasets/viditsorg/autotrain-data-summarization-xlsum.
Values in parentheses are for the highest resolution shell.
License: https://creativecommons.org/publicdomain/zero/1.0/
By ccdv (From Huggingface) [source]
The validation.csv file contains a set of articles along with their respective abstracts that can be used for validating the performance of summarization models. This subset allows researchers to fine-tune their models and measure how well they can summarize scientific texts.
The train.csv file serves as the primary training data for building summarization models. It consists of numerous articles extracted from the Arxiv database, paired with their corresponding abstracts. By utilizing this file, researchers can develop and train various machine learning algorithms to generate accurate summaries of scientific papers.
Lastly, the test.csv file provides a separate set of articles with accompanying abstracts specifically intended for evaluating the performance and effectiveness of summarization models developed using this dataset. Researchers can utilize this test set to conduct rigorous evaluations and benchmark different approaches in automatic document summarization.
With columns labeled article and abstract (repeated across entries to allow detailed analysis or multiple variations if required by users, e.g. different proposed summaries), this dataset provides significant flexibility in developing robust models for summarizing complex scientific documents.
Introduction:
File Description:
validation.csv: This file contains articles and their respective abstracts that can be used for validation purposes.
train.csv: The purpose of this file is to provide training data for summarizing scientific articles.
test.csv: This file includes a set of articles and their corresponding abstracts that can be used to evaluate the performance of summarization models.
Dataset Structure: The dataset consists of two columns, article and abstract, repeated across the train, validation, and test files.
Usage Examples: This dataset can be utilized in various ways:
a) Training Models: You can use the train.csv file to train your own model for summarizing scientific articles from the Arxiv database. The article column provides the full text of each scientific paper, while the abstract column contains its summary.
b) Validation: The validation.csv file allows you to validate your trained models by comparing their generated summaries with the provided reference summaries in order to assess their performance.
c) Evaluation: Utilize the test.csv file as a benchmark for evaluating different summarization models. Generate summaries using your selected model and compare them with reference summaries.
- Evaluating Performance: To measure how well your summarization model performs on this dataset, you can use metrics like ROUGE (Recall-Oriented Understudy for Gisting Evaluation). ROUGE measures overlap between generated summaries and reference summaries based on n-gram co-occurrence statistics.
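As a minimal illustration of the n-gram overlap idea behind ROUGE, a recall-oriented ROUGE-N score can be sketched in a few lines of Python. This is a simplified version for intuition only; the official ROUGE toolkit and packaged implementations add stemming, stopword handling, and F-measure variants.

```python
from collections import Counter

def rouge_n_recall(reference: str, candidate: str, n: int = 1) -> float:
    """Recall-oriented n-gram overlap: the fraction of reference n-grams
    that also appear in the candidate summary (with clipped counts)."""
    def ngrams(text: str, n: int) -> Counter:
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    ref, cand = ngrams(reference, n), ngrams(candidate, n)
    if not ref:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / sum(ref.values())

# 5 of the 6 reference unigrams appear in the candidate.
print(rouge_n_recall("the cat sat on the mat", "the cat lay on the mat"))
```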
- Summarizing scientific articles: This dataset can be used to train and evaluate summarization models for the task of generating concise summaries of scientific articles from the Arxiv database. Researchers can utilize this dataset to develop novel techniques and approaches for automatic summarization in the scientific domain.
- Information retrieval: The dataset can be used to enhance search engines or information retrieval systems by providing concise summaries along with the full text of scientific articles. This would enable users to quickly grasp key information without having to read the entire article, improving accessibility and efficiency.
- Text generation research: Researchers interested in natural language processing and text generation can use this dataset as a benchmark for developing new models and algorithms that generate coherent, informative, and concise summaries from lengthy scientific texts. The dataset provides a diverse range of articles across various domains, allowing researchers to explore different challenges in summary generation
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain...
This page provides data for the 3rd Grade Reading Level Proficiency performance measure. The dataset includes student performance results on the English/Language Arts section of the AzMERIT from Fall 2017 and Spring 2018. Data is representative of third-grade students in public elementary schools in Tempe, including schools from both the Tempe Elementary and Kyrene districts. Results are by school and provide the total number of students tested, the total percentage passing, and the percentage of students scoring at each of the four levels of proficiency. The performance measure dashboard is available at 3.07 3rd Grade Reading Level Proficiency.
Additional Information
Source: Arizona Department of Education
Contact: Ann Lynn DiDomenico
Contact E-Mail: Ann_DiDomenico@tempe.gov
Data Source Type: Excel/CSV
Preparation Method: Filters on original dataset: within "Schools" tab, School District [select Tempe School District and Kyrene School District]; School Name [deselect Kyrene SD schools not in Tempe city limits]; Content Area [select English Language Arts]; Test Level [select Grade 3]; Subgroup/Ethnicity [select All Students]. Remove irrelevant fields; add Fiscal Year.
Publish Frequency: Annually as data becomes available
Publish Method: Manual
Data Dictionary
License: https://dataintelo.com/privacy-and-policy
According to our latest research, the global Document Summarization AI market size reached USD 1.54 billion in 2024, reflecting robust adoption across industries. The market is projected to expand at a CAGR of 23.7% from 2025 to 2033, driven by increasing demand for automated content processing and smarter information retrieval. By 2033, the Document Summarization AI market is forecasted to achieve a value of USD 12.38 billion, underlining its transformative impact on enterprise operations and knowledge management. This rapid growth is primarily fueled by the proliferation of unstructured data, the need for efficient decision-making, and advancements in natural language processing (NLP) technologies.
One of the primary growth factors for the Document Summarization AI market is the exponential surge in digital content generation across industries. Enterprises, government agencies, and academic institutions are inundated with vast volumes of unstructured data, including emails, reports, legal documents, and research papers. Manual processing of such data is time-consuming, error-prone, and often leads to information overload. The integration of AI-driven summarization tools enables organizations to extract key insights, reduce redundancy, and accelerate workflow automation. As a result, businesses are able to enhance productivity, improve compliance, and make data-driven decisions more efficiently. This growing need for automation and intelligent data curation is a critical driver propelling the adoption of Document Summarization AI solutions worldwide.
Another significant factor contributing to market growth is the advancement and democratization of natural language processing (NLP) and machine learning algorithms. Leading AI vendors are investing heavily in research and development to enhance the accuracy, context-awareness, and linguistic versatility of summarization models. With the evolution of transformer-based architectures and large language models, Document Summarization AI tools are now capable of handling complex, domain-specific content with greater precision. Moreover, the availability of cloud-based AI services has lowered the entry barriers for small and medium-sized enterprises (SMEs), enabling them to leverage sophisticated summarization capabilities without significant upfront investment. This technological progress, coupled with growing awareness about the benefits of AI-powered document management, is expected to sustain high growth momentum over the forecast period.
The Document Summarization AI market is also witnessing strong traction due to regulatory and compliance requirements, particularly in sectors like BFSI, healthcare, and legal. Stringent data governance frameworks and the need for timely, accurate reporting are prompting organizations to automate document review and summarization processes. Additionally, the rise of remote work and digital collaboration has intensified the demand for solutions that can streamline knowledge sharing and information dissemination across distributed teams. As organizations continue to embrace digital transformation, the strategic value of Document Summarization AI in enhancing operational agility, reducing manual workload, and mitigating compliance risks is becoming increasingly evident. This confluence of regulatory, technological, and business drivers is expected to shape the market landscape in the coming years.
From a regional perspective, North America currently dominates the Document Summarization AI market, accounting for the largest share in 2024. The region's leadership is attributed to the early adoption of AI technologies, a mature digital infrastructure, and a strong presence of key market players. However, Asia Pacific is emerging as the fastest-growing region, driven by rapid digitalization, expanding enterprise IT budgets, and government initiatives to foster AI innovation. Europe is also witnessing substantial growth, supported by robust regulatory frameworks and increasing investments in AI research. Meanwhile, Latin America and the Middle East & Africa are gradually catching up, with rising awareness and adoption among enterprises and public sector organizations. The global market is thus characterized by dynamic regional trends, with each geography presenting unique opportunities and challenges for stakeholders.
The Document Summarization AI market is segmented by component into
This work was supported by the Ho Chi Minh City Department of Science and Technology, Grant Number 15/2016/HÐ-SKHCN.
Data construction process: In this work, we aim to have 300 clusters of documents extracted from news. To this end, we made use of the Vietnamese language version of Google News. Due to the copyright issue, we did not collect articles from every source listed on Google News, but limited to some sources that are open for research purposes. The collected articles belong to five genres: world news, domestic news, business, entertainment, and sports. Every cluster contains from four to ten news articles. Each article is represented by the following information: the title, the plain text content, the news source, the date of publication, the author(s), the tag(s) and the headline summary.
After that, two summaries are created for each cluster (produced in the first subtask above) by two distinct annotators using the MDSWriter system (Meyer, Christian M., et al. "MDSWriter: Annotation tool for creating high-quality multi-document summarization corpora." Proceedings of ACL-2016 System Demonstrations). These annotators are Vietnamese native speakers and are undergraduate or graduate students; most are familiar with natural language processing. The full annotation process consists of seven steps that must be completed sequentially from the first to the seventh.
Data information: Original folder: Contains 300 subdirectories, which are the 300 news clusters. Articles (documents) in each cluster belong to a similar topic, and each cluster contains from four to ten of them. The total number of articles is 1,945.
Summary folder: Contains 300 subdirectories holding the 600 final summaries. Every input cluster has two manual abstractive summaries from two distinct annotators. ViMs can be used for both implementing and evaluating supervised machine learning-based systems for Vietnamese abstractive multi-document summarization.
S3_summary folder: Contains 300 subdirectories including 600 "best sentence selection" summaries, the result of step 3 (the best sentence selection step). Sentences in a group are separated from others by a blank line. The most important sentence is labeled 1, while 0 is the label for the others.
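Based on the layout described above (blank lines separating groups, a 0/1 label marking each sentence), a parser might look like the following sketch. The exact field separator in the released files is an assumption here; the sample sentences are illustrative, not taken from the corpus.

```python
def parse_s3_groups(raw: str):
    """Parse best-sentence-selection groups: blank lines separate groups;
    each line is assumed to start with a 0/1 label followed by the sentence
    (the exact layout of the released files may differ)."""
    groups = []
    for block in raw.strip().split("\n\n"):
        group = []
        for line in block.splitlines():
            label, _, sentence = line.partition(" ")
            group.append((int(label), sentence))
        groups.append(group)
    return groups

# Illustrative two-group sample in the assumed format.
sample = "1 Câu quan trọng nhất.\n0 Một câu khác.\n\n0 Câu phụ.\n1 Câu chính."
for group in parse_s3_groups(sample):
    best = next(s for label, s in group if label == 1)
    print(best)
```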
@article{tran2020vims, title={ViMs: a high-quality Vietnamese dataset for abstractive multi-document summarization}, author={Tran, Nhi-Thao and Nghiem, Minh-Quoc and Nguyen, Nhung TH and Nguyen, Ngan Luu-Thuy and Van Chi, Nam and Dinh, Dien}, journal={Language Resources and Evaluation}, volume={54}, number={4}, pages={893--920}, year={2020}, publisher={Springer} }
Author: Tran Mai Vu, Vu Trong Hoa, Phi Van Thuy, Le Duc Trong, Ha Quang Thuy Affiliation: Knowledge Technology Laboratory, University of Technology, VNU Hanoi Research Topic: Design and Implementation of a Multi-document Summarization Program for the Vietnamese Language, Funded by the Ministry of Education (Project Code: B2012-01-24)
Data construction process: The data construction process is entirely manual. It consists of two steps:
Data information: Data Volume: 200 clusters
Each cluster corresponds to a folder, and it typically contains 2-5 documents (often 3). The folder's name represents the cluster.
Within each folder, all files represent documents (online articles) belonging to the cluster.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains the following 3 datasets for legal document summarization :
- IN-Abs : Indian Supreme Court case documents & their `abstractive' summaries, obtained from http://www.liiofindia.org/in/cases/cen/INSC/
- IN-Ext : Indian Supreme Court case documents & their `extractive' summaries, written by two law experts (A1, A2).
- UK-Abs : United Kingdom (U.K.) Supreme Court case documents & their `abstractive' summaries, obtained from https://www.supremecourt.uk/decided-cases/
Please refer to the paper and the README file for more details.
License: https://clarin.si/repository/xmlui/page/licence-aca-id-by-nc-inf-nored-1.0
Summarization datasets were created from the text bodies in the KAS 2.0 corpus (http://hdl.handle.net/11356/1448) and the abstracts from the KAS-Abs 2.0 corpus (http://hdl.handle.net/11356/1449). The monolingual slo2slo dataset contains 69,730 Slovene abstracts and Slovene body texts. The cross-lingual slo2eng dataset contains 52,351 Slovene body texts and English abstracts; it is suitable for building cross-lingual summarization models. The total number of words represents the sum of words in bodies, Slovene abstracts, and English abstracts.
The files are stored in the same manner as the complete KAS corpus, i.e. in 1,000 directories with the same filename prefix as in KAS. They are in the JSON format that contains chapter segmented text. In addition to a unique chapter ID, each JSON file contains a key titled “abstract” that contains a list with abstract text as its first element. The file with the metadata for the corpus texts is also included.
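A minimal sketch of reading one of these chapter-segmented JSON records and separating the summarization input from the target. The field names other than "abstract" (and the record content) are assumptions based on the description above, not the exact released schema.

```python
import json

# Hypothetical KAS-style record: chapter-segmented body text plus an
# "abstract" key whose first list element is the abstract text.
raw = json.dumps({
    "id": "kas-000001",
    "chapters": {"ch1": "Uvod in motivacija ...", "ch2": "Metode ..."},
    "abstract": ["Slovene abstract text ..."],
}, ensure_ascii=False)

doc = json.loads(raw)
abstract = doc["abstract"][0]               # summarization target
body = "\n".join(doc["chapters"].values())  # summarization input
print(abstract)
```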
The datasets are suitable for training monolingual Slovene summarization models and cross-lingual Slovene-English summarization models on long texts.
References: Žagar, A., Kavaš, M., & Robnik Šikonja, M. (2021). Corpus KAS 2.0: cleaner and with new datasets. In Information Society - IS 2021: Proceedings of the 24th International Multiconference. https://doi.org/10.5281/zenodo.5562228
PubMed dataset for summarization
Dataset for summarization of long documents. Adapted from this repo. Note that the original data are pre-tokenized, so this dataset returns " ".join(text) and adds " " for paragraphs. This dataset is compatible with the run_summarization.py script from Transformers if you add this line to the summarization_name_mapping variable: "ccdv/pubmed-summarization": ("article", "abstract")
Data Fields
id: paper id article: a string containing the body… See the full description on the dataset page: https://huggingface.co/datasets/ccdv/pubmed-summarization.
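Because the source data are pre-tokenized, reconstructing plain text amounts to joining tokens with spaces and paragraphs with a separator, as the card describes. A minimal sketch (the token lists are illustrative, not real PubMed content, and the newline separator is an assumption):

```python
# Pre-tokenized paragraphs as they might appear in the source data
# (illustrative tokens, not real PubMed content).
paragraphs = [
    ["the", "study", "enrolled", "120", "patients", "."],
    ["results", "were", "significant", "."],
]

# Join tokens with spaces and paragraphs with a separator, mirroring
# the " ".join(text) behaviour described on the card.
article = "\n".join(" ".join(tokens) for tokens in paragraphs)
print(article)
```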
License: U.S. Government Works, https://www.usa.gov/government-works
License information was derived automatically
NASA's exploration and scientific missions will produce terabytes of information. As NASA enters a new phase of space exploration, managing large amounts of scientific and operational data will become even more challenging. Robots conducting planetary exploration will produce data for selection and preparation of exploration sites. Robots and space probes will collect scientific data to improve understanding of the solar system. Satellites in low Earth orbit will collect data for monitoring changes in the Earth's atmosphere and surface environment. Key challenges for all these missions are understanding and summarizing what data have been collected and using this knowledge to improve data access. TRACLabs and CMU propose to develop context aware image manipulation software for managing data collected remotely during NASA missions. This software will filter and search large image archives using the temporal and spatial characteristics of images, and the robotic, instrument, and environmental conditions when images were taken. It also will implement techniques for finding which images show a terrain feature specified by the user. In Phase II we will implement this software and evaluate its effectiveness for NASA missions. At the end of Phase II, context aware image manipulation software at TRL 5-6 will be delivered to NASA.
Under the new quarterly data summary (QDS) framework, departments' spending data is published every quarter to show the taxpayer how the government is spending their money.
The QDS grew out of commitments made in the 2011 Budget and the written ministerial statement on business plans. For the financial year 2012 to 2013 the QDS has been revised and improved in line with action 9 of the Civil Service Reform Plan to provide a common set of data that will enable comparisons of operational performance across government so that departments and individuals can be held to account.
The QDS breaks down the total spend of the department in 3 ways:
The QDS template is the same for all departments, though the individual detail of grants and policy will differ from department to department. In using this data:
Please note that the quarter 1 2012 to 2013 return for the Department of Transport (DfT) is for the core department only.
Quarterly data summaries for April 2012 are as follows:
Quarterly data summaries for January 2012 are as follows:
Quarterly data summaries for October 2011 are as follows:
The Big Data Interagency Working Group (BD IWG) held a workshop, Measuring the Impact of Digital Repositories, on February 28 - March 1, 2017 in Arlington, VA. The aim of the workshop was to identify current assessment metrics, tools, and methodologies that are effective in measuring the impact of digital data repositories, and to identify the assessment issues, obstacles, and tools that require additional research and development (R&D). This workshop brought together leaders from academic, journal, government, and international data repository funders, users, and developers to discuss these issues...
License: Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
A text summarisation task aims to convert a longer text into a shorter text while preserving the essential information of the source text. In general, there are two approaches to text summarization. The extractive approach simply rewrites the most important sentences or parts of the text, whereas the abstractive approach is more similar to human-made summaries. We release 5 models that cover extractive, abstractive, and hybrid types:
- Metamodel: a neural model based on the Doc2Vec document representation that suggests the best summariser.
- Graph-based model: an unsupervised graph-based extractive approach that returns the N most relevant sentences.
- Headline model: a supervised abstractive approach (T5 architecture) that returns headline-like abstracts.
- Article model: a supervised abstractive approach (T5 architecture) that returns short summaries.
- Hybrid-long model: an unsupervised hybrid (graph-based and transformer model-based) approach that returns short summaries of long texts.
Details and instructions to run and train the models are available at https://github.com/clarinsi/SloSummarizer.
The web service with a demo is available at https://slovenscina.eu/povzemanje.
The NewsRoom dataset consists of 60 input source texts and 7 output summaries for each sample.
License: https://creativecommons.org/publicdomain/zero/1.0/
By ccdv (From Huggingface) [source]
The dataset consists of multiple files, including validation.csv, train.csv, and test.csv. Each file contains a combination of articles and their respective abstracts. The articles are sourced directly from PubMed, ensuring they represent a wide range of topics across various scientific disciplines.
In order to provide reliable datasets for different purposes, the files have been carefully curated to serve specific functions. validation.csv contains a subset of articles with their corresponding abstracts that can be used for validating the performance of summarization models during development. train.csv features a larger set of article-abstract pairs specifically intended for training such models.
Finally, test.csv serves as an independent evaluation set that allows developers to measure the effectiveness and generalizability of their summarization models against unseen data points. By using this test set, researchers can assess how well their algorithms perform in generating concise summaries that accurately capture the main findings and conclusions within scientific articles.
Researchers in natural language processing (NLP), machine learning (ML), or any related field can utilize this dataset to advance automatic text summarization techniques focused on scientific literature. Whether it's building extractive or abstractive methods or exploring novel approaches like neural networks or transformer-based architectures, this rich dataset provides ample opportunities for experimentation and progress in the field.
Introduction:
Dataset Structure:
- article: The full text of a scientific article from the PubMed database (Text).
- abstract: A summary of the main findings and conclusions of the article (Text).
Using the Dataset: To maximize the utility of this dataset, it is important to understand its purpose and how it can be utilized:
Training Models: The train.csv file contains articles and their corresponding abstracts that can be used for training summarization models or developing algorithms that generate concise summaries automatically.
Validation Purposes: The validation.csv file serves as a test set for fine-tuning your models or comparing different approaches during development.
Evaluating Model Performance: The test.csv file offers a separate set of articles along with their corresponding abstracts specifically designed for evaluating the performance of various summarization models.
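A minimal sketch of iterating article/abstract pairs from one of these CSV files with Python's standard library. An in-memory sample stands in for the real file here, and its contents are illustrative, not taken from the dataset.

```python
import csv
import io

# Stand-in for train.csv: two columns, article and abstract
# (illustrative rows, not real dataset content).
sample_csv = io.StringIO(
    "article,abstract\n"
    '"Full text of a biomedical paper ...","Short summary of findings."\n'
)

pairs = [(row["article"], row["abstract"]) for row in csv.DictReader(sample_csv)]
for article, abstract in pairs:
    print(len(article.split()), "->", abstract)
```

For the real files, replace the StringIO object with `open("train.csv", encoding="utf-8")`.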
Tips for Utilizing the Dataset Effectively:
Preprocessing: Before using this dataset, consider preprocessing steps such as removing irrelevant sections (e.g., acknowledgments, references), cleaning up invalid characters or formatting issues if any exist.
Feature Engineering: Explore additional features like article length, sentence structure complexity, or domain-specific details that may assist in improving summarization model performance.
Model Selection & Evaluation: Experiment with different summarization algorithms, ranging from traditional extractive approaches to more advanced abstractive methods. Evaluate model performance using established metrics such as ROUGE (Recall-Oriented Understudy for Gisting Evaluation).
Data Augmentation: Depending on the size of your dataset, you may consider augmenting it further by applying techniques like data synthesis or employing external resources (e.g., pre-trained language models) to enhance model performance.
Conclusion:
- Textual analysis and information retrieval: Researchers can use this dataset to analyze patterns in scientific literature or conduct information retrieval tasks. By examining the relationship between article content and its abstract, researchers can gain insights into how different sections of a scientific paper contribute to its overall summary.
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: validation.csv | Column name | Description ...
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Update Frequency: Annual
Updated for 2022. End of year assessed property values for the City of Milwaukee for the years 1992-present. These values include real estate property and personal property in Milwaukee, Washington, and Waukesha Counties.
One data row per year.
To download XML and JSON files, click the CSV option below and click the down arrow next to the Download button in the upper right on its page.
License: Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Russian summarization data mix
Total number of items in the train set: 197,561. Total number of items in the golden test set: 258 (manually verified semi-synthetic data for evaluation purposes).
We use this datasets for train mix:
XLSum, Gazeta, WikiLingua, MLSUM, Reviews (ru), Curation-corpus (ru), Matreshka, DialogSum (ru), SAMSum (ru)
Cite us
@misc{akhmetgareeva2024summary, title={Towards Russian Summarization: can architecture solve data limitations problems?}… See the full description on the dataset page: https://huggingface.co/datasets/RussianNLP/Mixed-Summarization-Dataset.