100+ datasets found
  1. arxiv-summarization

    • huggingface.co
    Updated Dec 19, 2021
    Cite
    ccdv (2021). arxiv-summarization [Dataset]. https://huggingface.co/datasets/ccdv/arxiv-summarization
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets; learn more at mlcommons.org/croissant.
    Dataset updated
    Dec 19, 2021
    Authors
    ccdv
    Description

    Arxiv dataset for summarization

    Dataset for summarization of long documents. Adapted from this repo. Note that the original data are pre-tokenized, so this dataset returns " ".join(text) and adds " " for paragraphs. This dataset is compatible with the run_summarization.py script from Transformers if you add this line to the summarization_name_mapping variable: "ccdv/arxiv-summarization": ("article", "abstract")

      Data Fields
    

    id: paper id
    article: a string containing the body of… See the full description on the dataset page: https://huggingface.co/datasets/ccdv/arxiv-summarization.
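
    For orientation, a minimal sketch of loading this dataset with the Hugging Face datasets library; the ("article", "abstract") field names follow the mapping quoted above, and the "document" config name is taken from the dataset card's variants.

    ```python
    # Minimal sketch: load ccdv/arxiv-summarization and inspect one example.
    from datasets import load_dataset

    # trust_remote_code is needed for script-based datasets in recent `datasets` releases.
    dataset = load_dataset("ccdv/arxiv-summarization", "document",
                           split="train", trust_remote_code=True)

    example = dataset[0]
    print(example["article"][:300])   # body of the paper
    print(example["abstract"][:300])  # target summary
    ```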

  2. OpenAI Summarization Corpus

    • opendatabay.com
    Updated Jun 12, 2025
    Cite
    Datasimple (2025). OpenAI Summarization Corpus [Dataset]. https://www.opendatabay.com/data/ai-ml/f95cfdab-cfe3-46a3-91a4-e5d8f15dcf15
    Explore at:
    Dataset updated
    Jun 12, 2025
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedication
    https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Data Science and Analytics
    Description

    This dataset provides a comprehensive corpus for natural language processing tasks, specifically text summarization and the validation of OpenAI's reward models. It contains summaries of text from the TL;DR, CNN, and Daily Mail datasets, along with the choices workers made when summarizing the text, batch information that differentiates summaries created by different workers, and dataset split attributes. This data allows users to train natural language processing systems on real-world data to produce reliable, concise summaries of long-form text, and to benchmark model output directly against human-generated results.


    How to use the dataset

    This dataset provides a comprehensive corpus of human-generated summaries for text from the TL;DR, CNN, and Daily Mail datasets, for training and evaluating natural language processing models. The dataset contains training and validation splits to support machine learning tasks.

    To use this dataset for summarization tasks:

    1. Gather information about the text you would like to summarize by looking at the info column entries in the two .csv files (train and validation).
    2. Choose which summary you want from the choice column of either .csv file, based on your preference for worker or batch type summarization.
    3. Review entries in the selected summary's corresponding summaries column for alternatives with similar content but a wording or style you prefer over the original choice.
    4. Look through the split, worker, and batch information for each candidate before selecting the summary whose accuracy and clarity best fit your needs (see the sketch below).

    Research Ideas

    • Training a natural language processing model to automatically generate summaries of text, using the summary and choice data from this dataset.
    • Evaluating OpenAI's reward model on the validation data in order to improve accuracy and performance.
    • Analyzing the worker and batch information to assess trends among workers or batches that could indicate bias or other issues affecting summarization accuracy.
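
    A rough sketch of these steps in pandas, assuming the column names given in the description above (info, summaries, choice, worker, batch); the file names are the train and validation CSVs mentioned on the page.

    ```python
    # Hedged sketch: browse candidate summaries and worker/batch provenance.
    import pandas as pd

    train = pd.read_csv("train.csv")

    row = train.iloc[0]
    print(row["info"])               # the source text to be summarized
    print(row["summaries"])          # candidate summaries for that text
    print(row["choice"])             # which candidate the worker preferred
    print(row[["worker", "batch"]])  # annotation provenance for bias analysis
    ```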

    Original Data Source: OpenAI Summarization Corpus

  3. Data from: WikiSum: Coherent Summarization Dataset for Efficient...

    • registry.opendata.aws
    Updated May 20, 2021
    Cite
    Amazon (2021). WikiSum: Coherent Summarization Dataset for Efficient Human-Evaluation [Dataset]. https://registry.opendata.aws/wikisum/
    Explore at:
    Dataset updated
    May 20, 2021
    Dataset provided by
    Amazon.com
    http://amazon.com/
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0)
    https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This dataset provides how-to articles from wikihow.com and their summaries, written as a coherent paragraph. The dataset itself is available at wikisum.zip and contains the article, the summary, the wikihow url, and an official fold (train, val, or test). In addition, human evaluation results are available at wikisum-human-eval.zip. It consists of human evaluations of the summaries of the Pegasus system, annotators' responses regarding the difficulty of the task, and words they marked as unknown.

  4. Log summary dataset

    • ieee-dataport.org
    Updated Jul 17, 2022
    Cite
    Yuzhe Zhang (2022). Log summary dataset. [Dataset]. https://ieee-dataport.org/documents/log-summary-dataset
    Explore at:
    Dataset updated
    Jul 17, 2022
    Authors
    Yuzhe Zhang
    License

    Attribution 4.0 (CC BY 4.0)
    https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    HPC

  5. Department for Transport business plan quarterly data summary (QDS)

    • data.wu.ac.at
    • gimi9.com
    • +1 more
    xls
    Updated May 16, 2014
    + more versions
    Cite
    Department for Transport (2014). Department for Transport business plan quarterly data summary (QDS) [Dataset]. https://data.wu.ac.at/odso/data_gov_uk/NjZiMTFmOGEtYTBmMC00ZWYzLWFlYWMtMTc3OTMyZGZmOGJh
    Explore at:
    Available download formats: xls
    Dataset updated
    May 16, 2014
    Dataset provided by
    Department for Transport
    https://gov.uk/dft
    License

    Open Government Licence 3.0
    http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
    License information was derived automatically

    Description

    Under the new quarterly data summary (QDS) framework, departments' spending data is published every quarter to show the taxpayer how the government is spending their money.

    The QDS grew out of commitments made in the 2011 Budget and the written ministerial statement on business plans. For the financial year 2012 to 2013 the QDS has been revised and improved in line with action 9 of the Civil Service Reform Plan to provide a common set of data that will enable comparisons of operational performance across government so that departments and individuals can be held to account.

    The QDS breaks down the total spend of the department in 3 ways: by budget, by internal operation and by transaction.

    The QDS template is the same for all departments, though the individual detail of grants and policy will differ from department to department. In using this data:

    1. people should ensure they take full note of the caveats noted in each department’s return
    2. as the improvement of the QDS is an ongoing process, data quality and completeness will develop over time, so appropriate caution should be applied to any comparative analysis undertaken.

    Please note that the quarter 1 2012 to 2013 return for Department of Transport is for the core department only.

    Information on GOV.UK about the business plan quarterly data summary at Department for Transport.

  6. Summarize from Feedback Dataset

    • paperswithcode.com
    Updated Jun 11, 2024
    + more versions
    Cite
    Nisan Stiennon; Long Ouyang; Jeff Wu; Daniel M. Ziegler; Ryan Lowe; Chelsea Voss; Alec Radford; Dario Amodei; Paul Christiano (2023). Summarize from Feedback Dataset [Dataset]. https://paperswithcode.com/dataset/summarize-from-feedback
    Explore at:
    Dataset updated
    Jun 11, 2024
    Authors
    Nisan Stiennon; Long Ouyang; Jeff Wu; Daniel M. Ziegler; Ryan Lowe; Chelsea Voss; Alec Radford; Dario Amodei; Paul Christiano
    Description

    In the Learning to Summarize from Human Feedback paper, a reward model was trained from human feedback. The reward model was then used to train a summarization model to align with human preferences. This is the dataset of human feedback that was released for reward modelling. There are two parts of this dataset: comparisons and axis. In the comparisons part, human annotators were asked to choose the best out of two summaries. In the axis part, human annotators gave scores on a likert scale for the quality of a summary. The comparisons part only has a train and validation split, and the axis part only has a test and validation split.

    Li et al. propose a variant in which a subset of workers annotate the data (details in Appendix C.1).

    Reddit TL;DR (Seen) uses the top 10 workers from the original dataset. Reddit TL;DR (Unseen) uses unseen workers in the validation set.
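
    A hedged sketch of loading the feedback data, assuming the openai/summarize_from_feedback mirror on the Hugging Face Hub; "comparisons" and "axis" are the two parts described above.

    ```python
    # Sketch: load the comparisons part and inspect one human preference judgment.
    from datasets import load_dataset

    comparisons = load_dataset("openai/summarize_from_feedback", "comparisons", split="train")

    ex = comparisons[0]
    print(ex["info"])       # the post being summarized
    print(ex["summaries"])  # the two candidate summaries
    print(ex["choice"])     # index of the summary the annotator preferred
    ```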

  7. Petroleum Data: Summary Application Programming Interface (API)

    • catalog.data.gov
    • data.wu.ac.at
    Updated Jul 6, 2021
    + more versions
    Cite
    U.S. Energy Information Administration (2021). Petroleum Data: Summary Application Programming Interface (API) [Dataset]. https://catalog.data.gov/dataset/petroleum-data-summary-application-programming-interface-api
    Explore at:
    Dataset updated
    Jul 6, 2021
    Dataset provided by
    Energy Information Administration
    http://www.eia.gov/
    Description

    Data on petroleum production, imports, inputs, stocks, exports, and prices. Weekly, monthly, and annual data available. Users of the EIA API are required to obtain an API Key via this registration form: http://www.eia.gov/beta/api/register.cfm
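
    A hedged sketch of querying the API once a key is registered; the v2 route below is illustrative only, so consult the EIA documentation for real series paths.

    ```python
    # Sketch: fetch petroleum summary data from the EIA API (route is hypothetical).
    import requests

    API_KEY = "YOUR_EIA_API_KEY"  # obtained via the registration form above

    resp = requests.get(
        "https://api.eia.gov/v2/petroleum/sum/snd/data/",  # illustrative route
        params={"api_key": API_KEY, "frequency": "monthly", "data[0]": "value"},
        timeout=30,
    )
    resp.raise_for_status()
    print(resp.json()["response"]["data"][:3])  # assumes the v2 response envelope
    ```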

  8. Semantic Summarization for Context Aware Manipulation of Data, Phase II

    • data.nasa.gov
    application/rdfxml +5
    Updated Jun 26, 2018
    Cite
    (2018). Semantic Summarization for Context Aware Manipulation of Data, Phase II [Dataset]. https://data.nasa.gov/d/vcqh-dx6v
    Explore at:
    Available download formats: csv, xml, tsv, application/rssxml, json, application/rdfxml
    Dataset updated
    Jun 26, 2018
    License

    U.S. Government Works
    https://www.usa.gov/government-works
    License information was derived automatically

    Description

    NASA's exploration and scientific missions will produce terabytes of information. As NASA enters a new phase of space exploration, managing large amounts of scientific and operational data will become even more challenging. Robots conducting planetary exploration will produce data for selection and preparation of exploration sites. Robots and space probes will collect scientific data to improve understanding of the solar system. Satellites in low Earth orbit will collect data for monitoring changes in the Earth's atmosphere and surface environment. Key challenges for all these missions are understanding and summarizing what data have been collected and using this knowledge to improve data access. TRACLabs and CMU propose to develop context aware image manipulation software for managing data collected remotely during NASA missions. This software will filter and search large image archives using the temporal and spatial characteristics of images, and the robotic, instrument, and environmental conditions when images were taken. It also will implement techniques for finding which images show a terrain feature specified by the user. In Phase II we will implement this software and evaluate its effectiveness for NASA missions. At the end of Phase II, context aware image manipulation software at TRL 5-6 will be delivered to NASA.

  9. GLO climate data stats summary

    • data.gov.au
    • researchdata.edu.au
    • +1 more
    zip
    Updated Apr 13, 2022
    Cite
    Bioregional Assessment Program (2022). GLO climate data stats summary [Dataset]. https://data.gov.au/data/dataset/afed85e0-7819-493d-a847-ec00a318e657
    Explore at:
    Available download formats: zip (8810)
    Dataset updated
    Apr 13, 2022
    Dataset authored and provided by
    Bioregional Assessment Program
    License

    Attribution 4.0 (CC BY 4.0)
    https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Abstract

    The dataset was derived by the Bioregional Assessment Programme from multiple source datasets. The source datasets are identified in the Lineage field in this metadata statement. The processes undertaken to produce this derived dataset are described in the History field in this metadata statement.

    A summary of various climate variables for all 15 subregions, based on Bureau of Meteorology Australian Water Availability Project (BAWAP) climate grids, including:

    1. Time series mean annual BAWAP rainfall from 1900 - 2012.

    2. Long term average BAWAP rainfall and Penman Potential Evapotranspiration (PET) from Jan 1981 - Dec 2012 for each month.

    3. Values calculated over the years 1981 - 2012 (inclusive), for 17 time periods (i.e., annual, 4 seasons and 12 months) for the following 8 meteorological variables: (i) BAWAP_P (precipitation); (ii) Penman ETp; (iii) Tavg (average temperature); (iv) Tmax (maximum temperature); (v) Tmin (minimum temperature); (vi) VPD (Vapour Pressure Deficit); (vii) Rn (net radiation); and (viii) Wind speed. For each of the 17 time periods and each of the 8 meteorological variables, the following were calculated: (a) average; (b) maximum; (c) minimum; (d) average plus standard deviation (stddev); (e) average minus stddev; (f) stddev; and (g) trend.

    4. Correlation coefficients (-1 to 1) between rainfall and 4 remote rainfall drivers between 1957-2006 for the four seasons. The data and methodology are described in Risbey et al. (2009).

    As described in the Risbey et al. (2009) paper, the rainfall was from 0.05 degree gridded data described in Jeffrey et al. (2001 - known as the SILO datasets); sea surface temperature was from the Hadley Centre Sea Ice and Sea Surface Temperature dataset (HadISST) on a 1 degree grid. BLK=Blocking; DMI=Dipole Mode Index; SAM=Southern Annular Mode; SOI=Southern Oscillation Index; DJF=December, January, February; MAM=March, April, May; JJA=June, July, August; SON=September, October, November. The analysis is a summary of Fig. 15 of Risbey et al. (2009).

    There are 4 csv files here:

    BAWAP_P_annual_BA_SYB_GLO.csv

    Desc: Time series mean annual BAWAP rainfall from 1900 - 2012.

    Source data: annual BILO rainfall

    P_PET_monthly_BA_SYB_GLO.csv

    Desc: Long term average BAWAP rainfall and Penman PET from 198101 - 201212 for each month.

    Climatology_Trend_BA_SYB_GLO.csv

    Desc: Values calculated over the years 1981 - 2012 (inclusive), for 17 time periods (i.e., annual, 4 seasons and 12 months) for the following 8 meteorological variables: (i) BAWAP_P; (ii) Penman ETp; (iii) Tavg; (iv) Tmax; (v) Tmin; (vi) VPD; (vii) Rn; and (viii) Wind speed. For each of the 17 time periods and each of the 8 meteorological variables, the following were calculated: (a) average; (b) maximum; (c) minimum; (d) average plus standard deviation (stddev); (e) average minus stddev; (f) stddev; and (g) trend.

    Risbey_Remote_Rainfall_Drivers_Corr_Coeffs_BA_NSB_GLO.csv

    Correlation coefficients (-1 to 1) between rainfall and 4 remote rainfall drivers between 1957-2006 for the four seasons. The data and methodology are described in Risbey et al. (2009). As described in the Risbey et al. (2009) paper, the rainfall was from 0.05 degree gridded data described in Jeffrey et al. (2001 - known as the SILO datasets); sea surface temperature was from the Hadley Centre Sea Ice and Sea Surface Temperature dataset (HadISST) on a 1 degree grid. BLK=Blocking; DMI=Dipole Mode Index; SAM=Southern Annular Mode; SOI=Southern Oscillation Index; DJF=December, January, February; MAM=March, April, May; JJA=June, July, August; SON=September, October, November. The analysis is a summary of Fig. 15 of Risbey et al. (2009).
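
    A minimal sketch of loading one of these files with pandas; the column layout is not documented here, so inspect it first.

    ```python
    # Sketch: inspect the annual BAWAP rainfall series named above.
    import pandas as pd

    rain = pd.read_csv("BAWAP_P_annual_BA_SYB_GLO.csv")
    print(rain.columns.tolist())  # discover the actual column names
    print(rain.head())
    ```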

    Dataset History

    The dataset was created from various BAWAP source data (monthly BAWAP rainfall, Tmax, Tmin, VPD, etc.) and other source data, including monthly Penman PET and correlation coefficient data. Data were extracted from national datasets for the GLO subregion. The resulting files are the four csv files described above.


    Dataset Citation

    Bioregional Assessment Programme (2014) GLO climate data stats summary. Bioregional Assessment Derived Dataset. Viewed 18 July 2018, http://data.bioregionalassessments.gov.au/dataset/afed85e0-7819-493d-a847-ec00a318e657.

    Dataset Ancestors

  10. Kurdish News Summarization Dataset

    • data.mendeley.com
    Updated Jun 12, 2023
    Cite
    Soran Badawi (2023). Kurdish News Summarization Dataset [Dataset]. http://doi.org/10.17632/jbvrd44m5g.1
    Explore at:
    Dataset updated
    Jun 12, 2023
    Authors
    Soran Badawi
    License

    Attribution 4.0 (CC BY 4.0)
    https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Kurdish News Summarization Dataset (KNSD) is a newly constructed and comprehensive dataset specifically curated for the task of news summarization in the Kurdish language. The dataset includes a collection of 130,000 news articles and their corresponding headlines sourced from popular Kurdish news websites such as Ktv, NRT, RojNews, K24, KNN, Kurdsat, and more. The KNSD has been compiled to cover a diverse range of topics across domains such as politics, economics, culture, sports, and regional affairs, ensuring a comprehensive representation of the news landscape in the Kurdish language.

    Key features:

    • Size and Variety: the dataset comprises a substantial collection of 130,000 news articles, offering a wide range of textual content for training and evaluating news summarization models in the Kurdish language. The articles are sourced from reputable and popular Kurdish news websites, ensuring credibility and authenticity.
    • Article-Headline Pairs: each news article is associated with its corresponding headline, allowing researchers and developers to explore the task of generating concise and informative summaries for Kurdish news content.
    • Data Quality: the articles and headlines have undergone careful curation and preprocessing to remove duplicates, ensure linguistic consistency, and filter out irrelevant or spam-like content, making the dataset suitable for training robust and accurate news summarization models.
    • Language and Cultural Context: the KNSD is tailored to the unique linguistic characteristics and cultural context of the Kurdish-speaking population, allowing researchers to develop models attuned to the nuances of Kurdish news content.

    Applications include, but are not limited to:

    • News Summarization: a resource for developing and evaluating news summarization models for the Kurdish language, using extractive or abstractive techniques to generate concise and coherent summaries of Kurdish news articles.
    • Machine Learning and Natural Language Processing (NLP): training and evaluating machine learning models, deep learning architectures, and NLP algorithms for news summarization, text generation, and semantic understanding in the Kurdish language.

  11. Data from: Summarization datasets from the KAS corpus KAS-Sum 1.0

    • live.european-language-grid.eu
    binary format
    Updated Feb 3, 2022
    Cite
    (2022). Summarization datasets from the KAS corpus KAS-Sum 1.0 [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/20154
    Explore at:
    Available download formats: binary format
    Dataset updated
    Feb 3, 2022
    License

    https://clarin.si/repository/xmlui/page/licence-aca-id-by-nc-inf-nored-1.0

    Description

    Summarization datasets were created from the text bodies in the KAS 2.0 corpus (http://hdl.handle.net/11356/1448) and the abstracts from the KAS-Abs 2.0 corpus (http://hdl.handle.net/11356/1449). The monolingual slo2slo dataset contains 69,730 Slovene abstracts paired with Slovene body texts. The cross-lingual slo2eng dataset contains 52,351 Slovene body texts paired with English abstracts, and is suitable for building cross-lingual summarization models. The total number of words is the sum of the words in the bodies, the Slovene abstracts, and the English abstracts.

    The files are stored in the same manner as the complete KAS corpus, i.e. in 1,000 directories with the same filename prefix as in KAS. They are in the JSON format that contains chapter segmented text. In addition to a unique chapter ID, each JSON file contains a key titled “abstract” that contains a list with abstract text as its first element. The file with the metadata for the corpus texts is also included.
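
    A sketch of reading one document under the layout described above; the path below is hypothetical.

    ```python
    # Sketch: pull the abstract out of one KAS-Sum JSON file.
    import json

    with open("kas/042/kas-042001.json", encoding="utf-8") as f:  # hypothetical path
        doc = json.load(f)

    # Per the description, "abstract" holds a list whose first element is the text.
    abstract = doc["abstract"][0]
    print(abstract[:300])
    ```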

    The datasets are suitable for training monolingual Slovene summarization models and cross-lingual Slovene-English summarization models on long texts.

    References: Žagar, A., Kavaš, M., & Robnik Šikonja, M. (2021). Corpus KAS 2.0: cleaner and with new datasets. In Information Society - IS 2021: Proceedings of the 24th International Multiconference. https://doi.org/10.5281/zenodo.5562228

  12. govreport-summarization-8192

    • huggingface.co
    Updated Jun 15, 1997
    Cite
    govreport-summarization-8192 [Dataset]. https://huggingface.co/datasets/pszemraj/govreport-summarization-8192
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets; learn more at mlcommons.org/croissant.
    Dataset updated
    Jun 15, 1997
    Authors
    Peter Szemraj
    License

    Apache License, v2.0
    https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    GovReport Summarization - 8192 tokens

    ccdv/govreport-summarization, with the following changes:

    • data cleaned with the clean-text Python package
    • total tokens for each column computed and added in new columns, according to the long-t5 tokenizer (done after cleaning)

      train info
    

    RangeIndex: 8200 entries, 0 to 8199
    Data columns (total 4 columns):
    #  Column  Non-Null Count  Dtype
    0  report  8200 non-null… See the full description on the dataset page: https://huggingface.co/datasets/pszemraj/govreport-summarization-8192.
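
    A sketch of the preprocessing described above (clean first, then count long-t5 tokens); the exact long-t5 checkpoint is an assumption, since the card does not name one.

    ```python
    # Sketch: reproduce the clean-then-count-tokens step from the dataset card.
    from cleantext import clean  # pip install clean-text
    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("google/long-t5-tglobal-base")  # assumed variant

    def token_count(text: str) -> int:
        cleaned = clean(text, lower=False)  # cleaning happens before counting
        return len(tok(cleaned).input_ids)

    print(token_count("The committee's report was issued in draft form."))
    ```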

  13. DialogSum Dataset

    • paperswithcode.com
    Updated Dec 18, 2024
    Cite
    Yulong Chen; Yang Liu; Liang Chen; Yue Zhang (2024). DialogSum Dataset [Dataset]. https://paperswithcode.com/dataset/dialogsum
    Explore at:
    Dataset updated
    Dec 18, 2024
    Authors
    Yulong Chen; Yang Liu; Liang Chen; Yue Zhang
    Description

    DialogSum is a large-scale dialogue summarization dataset, consisting of 13,460 dialogues with corresponding manually labeled summaries and topics.

    This work is accepted by ACL findings 2021. You may find the paper here: https://arxiv.org/pdf/2105.06762.pdf.

    If you want to use our dataset, please cite our paper.

    Dialogue Data

    We collect dialogue data for DialogSum from three public dialogue corpora, namely Dailydialog (Li et al., 2017), DREAM (Sun et al., 2019) and MuTual (Cui et al., 2019), as well as an English speaking practice website. These datasets contain face-to-face spoken dialogues that cover a wide range of daily-life topics, including schooling, work, medication, shopping, leisure, and travel. Most conversations take place between friends or colleagues, or between service providers and customers.

    Compared with previous datasets, dialogues from DialogSum have distinct characteristics:

    • they arise under rich real-life scenarios, including more diverse task-oriented scenarios;
    • they have clear communication patterns and intents, which makes them valuable summarization sources;
    • they have a reasonable length, which suits the purpose of automatic summarization.

    Summaries

    We ask annotators to summarize each dialogue based on the following criteria:

    • convey the most salient information;
    • be brief;
    • preserve important named entities within the conversation;
    • be written from an observer perspective;
    • be written in formal language.

    Topics

    In addition to summaries, we also ask annotators to write a short topic for each dialogue, which can be potentially useful for future work, e.g. generating summaries by leveraging topic information.

  14. PubMed Article Summarization Dataset

    • kaggle.com
    Updated Dec 5, 2023
    Cite
    The Devastator (2023). PubMed Article Summarization Dataset [Dataset]. https://www.kaggle.com/datasets/thedevastator/pubmed-article-summarization-dataset/suggestions
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets; learn more at mlcommons.org/croissant.
    Dataset updated
    Dec 5, 2023
    Dataset provided by
    Kaggle
    http://kaggle.com/
    Authors
    The Devastator
    License

    CC0 1.0 Universal Public Domain Dedication
    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    PubMed Article Summarization Dataset

    By ccdv (from Hugging Face)

    About this dataset

    The dataset consists of multiple files, including validation.csv, train.csv, and test.csv. Each file contains a combination of articles and their respective abstracts. The articles are sourced directly from PubMed, ensuring they represent a wide range of topics across various scientific disciplines.

    In order to provide reliable datasets for different purposes, the files have been carefully curated to serve specific functions. validation.csv contains a subset of articles with their corresponding abstracts that can be used for validating the performance of summarization models during development. train.csv features a larger set of article-abstract pairs specifically intended for training such models.

    Finally, test.csv serves as an independent evaluation set that allows developers to measure the effectiveness and generalizability of their summarization models against unseen data points. By using this test set, researchers can assess how well their algorithms perform in generating concise summaries that accurately capture the main findings and conclusions within scientific articles.

    Researchers in natural language processing (NLP), machine learning (ML), or any related field can utilize this dataset to advance automatic text summarization techniques focused on scientific literature. Whether it's building extractive or abstractive methods or exploring novel approaches like neural networks or transformer-based architectures, this rich dataset provides ample opportunities for experimentation and progress in the field.

    How to use the dataset

    Dataset Structure:

    • article: The full text of a scientific article from the PubMed database (Text).
    • abstract: A summary of the main findings and conclusions of the article (Text).

    Using the Dataset: To maximize the utility of this dataset, it is important to understand its purpose and how it can be utilized:

    • Training Models: The train.csv file contains articles and their corresponding abstracts that can be used for training summarization models or developing algorithms that generate concise summaries automatically (see the loading sketch after this list).

    • Validation Purposes: The validation.csv file serves as a test set for fine-tuning your models or comparing different approaches during development.

    • Evaluating Model Performance: The test.csv file offers a separate set of articles along with their corresponding abstracts specifically designed for evaluating the performance of various summarization models.
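
    A minimal sketch of loading the splits with pandas, using the two columns listed above ("article", "abstract").

    ```python
    # Sketch: read the three CSV splits and peek at one article/abstract pair.
    import pandas as pd

    train = pd.read_csv("train.csv")
    validation = pd.read_csv("validation.csv")
    test = pd.read_csv("test.csv")

    print(train.loc[0, "article"][:300])   # full text of one article
    print(train.loc[0, "abstract"][:300])  # its reference summary
    ```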

    Tips for Utilizing the Dataset Effectively:

    • Preprocessing: Before using this dataset, consider preprocessing steps such as removing irrelevant sections (e.g., acknowledgments, references), cleaning up invalid characters or formatting issues if any exist.

    • Feature Engineering: Explore additional features like article length, sentence structure complexity, or domain-specific details that may assist in improving summarization model performance.

    • Model Selection & Evaluation: Experiment with different summarization algorithms, ranging from traditional extractive approaches to more advanced abstractive methods. Evaluate model performance using established metrics such as ROUGE (Recall-Oriented Understudy for Gisting Evaluation); see the sketch after this list.

    • Data Augmentation: Depending on the size of your dataset, you may consider augmenting it further by applying techniques like data synthesis or employing external resources (e.g., pre-trained language models) to enhance model performance.
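
    A sketch of ROUGE scoring as suggested above, using the Hugging Face evaluate package (pip install evaluate rouge_score).

    ```python
    # Sketch: score a generated summary against its reference abstract with ROUGE.
    import evaluate

    rouge = evaluate.load("rouge")
    scores = rouge.compute(
        predictions=["the study found the treatment reduced symptoms"],
        references=["the treatment significantly reduced symptoms in the study cohort"],
    )
    print(scores)  # rouge1 / rouge2 / rougeL / rougeLsum F-measures
    ```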

    Research Ideas

    • Textual analysis and information retrieval: Researchers can use this dataset to analyze patterns in scientific literature or conduct information retrieval tasks. By examining the relationship between article content and its abstract, researchers can gain insights into how different sections of a scientific paper contribute to its overall summary.

    Acknowledgements

    If you use this dataset in your research, please credit the original authors.

    License

    License: CC0 1.0 Universal (CC0 1.0) Public Domain Dedication. No copyright: you can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission.

    Columns

    File: validation.csv | Column name | Description ...

  15. How Do I Login McAfee Antivirus Account?: A Complete Guide Dataset

    • paperswithcode.com
    + more versions
    Cite
    How Do I Login McAfee Antivirus Account?: A Complete Guide Dataset [Dataset]. https://paperswithcode.com/dataset/news-articles-dataset-with-summary
    Explore at:
    Description

    In today's digital landscape, cybersecurity is not just a luxury; it's a necessity. Whether you're protecting personal devices or business systems, antivirus software plays a vital role. McAfee is one of the most trusted names in this industry, offering a robust range of protection solutions. To make the most of McAfee's features, users need to know how to access their McAfee antivirus login account efficiently. This article walks you through everything you need to know, from logging in to troubleshooting common login issues.

    Why You Need a McAfee Antivirus Login Account

    Before diving into the login steps, let's understand why having a McAfee antivirus login account is crucial. This account acts as a control center for managing your McAfee services. Whether you want to install McAfee on a new device, renew your subscription, check for software updates, or manage licenses, it all begins with logging into your account.

    Benefits include:

    Centralized management of all protected devices

    Real-time updates and threat reports

    Subscription and billing management

    Quick downloads and installations

    24/7 customer support access

    Your McAfee antivirus login account ensures you stay informed and protected.

    How to Create a McAfee Antivirus Login Account

    If you're new to McAfee, setting up your account is your first step. Here's how:

    Purchase McAfee Antivirus: Whether it's from an official vendor or pre-installed on your device, you'll need a product key.

    Visit the McAfee Website: Navigate to the official McAfee homepage.

    Select "Sign Up" or "Register": Enter your personal details such as your name, email address, and create a secure password.

    Enter Product Key: Input the 25-digit product key received during purchase to activate your subscription.

    Verify Your Email: McAfee will send a verification link. Click it to confirm your registration.

    Now your McAfee antivirus login account is active and ready to use.

    How to Login to Your McAfee Antivirus Account

    Logging into your McAfee account is simple and only takes a few steps:

    Go to the McAfee Homepage: Start by opening your browser and visiting the official site.

    Click on “My Account”: Usually located in the upper-right corner.

    Enter Credentials: Input your registered email address and password.

    Click “Login” or “Sign In”: You will now be redirected to your dashboard.

    From here, you can manage subscriptions, download software, and update protection settings. Always ensure you’re logging in from a secure device and network.

    Troubleshooting McAfee Antivirus Login Account Issues

    Sometimes users face difficulties accessing their McAfee antivirus login account. Here are common problems and solutions:

    Forgot Password: click the "Forgot Password" link on the login page.

    Enter your registered email address.

    Follow the instructions sent to your inbox to reset your password.

    Incorrect Email: double-check for typos, or use a different email if you have multiple.

    Ensure it's the same one used during registration.

    Two-Factor Authentication Problems: if enabled, make sure your secondary device is accessible.

    Check time synchronization between devices to avoid verification code mismatches.

    Account Locked: multiple failed attempts may lock your account. Wait 15–30 minutes or contact customer support for help.

    Staying calm and following these steps can quickly resolve most login issues.

    Keeping Your McAfee Antivirus Login Account Secure

    Security doesn't stop after installing antivirus software; your login account itself should be safeguarded. Here are some best practices:

    Use a Strong Password: Include upper and lowercase letters, numbers, and special characters.

    Enable Two-Factor Authentication: Adds an extra layer of security.

    Don’t Share Your Credentials: Keep your login details private and secure.

    Regularly Update Your Password: Change your password every 3–6 months for added safety.

    Log Out After Use: Especially important if you're using a public or shared device.

    By following these tips, you ensure that your McAfee antivirus login account remains protected against unauthorized access.

    Managing Devices from Your McAfee Account

    Once logged in, you can view and manage all devices connected to your McAfee subscription:

    Add a Device: Install McAfee on another PC, Mac, smartphone, or tablet directly from your dashboard.

    Remove a Device: Stop protection for any device no longer in use.

    Transfer Protection: Reassign your license if you're switching to a new device.

    This level of control helps users maximize the value of their subscription while staying secure across all platforms.

    Final Thoughts

    Your McAfee antivirus login account is more than just a gateway; it's a comprehensive tool for managing your digital security. From checking protection status to adding new devices, everything is just a few clicks away. For users looking to stay ahead of cyber threats, knowing how to access and use this account is essential.

  16. Data from: Mobile Application Review Summarization using Chain of Density...

    • figshare.com
    pdf
    Updated Sep 10, 2024
    Cite
    SHRISTI SHRESTHA (2024). Mobile Application Review Summarization using Chain of Density Prompting [Dataset]. http://doi.org/10.6084/m9.figshare.26980834.v1
    Explore at:
    Available download formats: pdf
    Dataset updated
    Sep 10, 2024
    Dataset provided by
    figshare
    Authors
    SHRISTI SHRESTHA
    License

    Attribution 4.0 (CC BY 4.0)
    https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Supplemental dataset for Chain of Density Summarization of mobile app reviews.

  17. Data from: MLASK: Multimodal Summarization of Video-based News Articles

    • live.european-language-grid.eu
    • lindat.cz
    • +1 more
    binary format
    Updated Dec 31, 2022
    Cite
    (2022). MLASK: Multimodal Summarization of Video-based News Articles [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/23018
    Explore at:
    Available download formats: binary format
    Dataset updated
    Dec 31, 2022
    License

    https://lindat.mff.cuni.cz/repository/xmlui/page/szn-dataset-licence

    Description

    The MLASK corpus consists of 41,243 multi-modal documents – video-based news articles in the Czech language – collected from Novinky.cz (https://www.novinky.cz/) and Seznam Zprávy (https://www.seznamzpravy.cz/). It was introduced in "MLASK: Multimodal Summarization of Video-based News Articles" (Krubiński & Pecina, EACL 2023). The articles' publication dates range from September 2016 to February 2022. The intended use case of the dataset is to model the task of multimodal summarization with multimodal output: based on a pair of a textual article and a short video, a textual summary is generated, and a single frame from the video is chosen as a pictorial summary.

    Each document consists of the following:

    • a .mp4 video
    • a single image (cover picture)
    • the article's text
    • the article's summary
    • the article's title
    • the article's publication date

    All of the videos are re-sampled to 25 fps and resized to the same resolution of 1280x720p. The maximum length of the video is 5 minutes, and the shortest one is 7 seconds. The average video duration is 86 seconds. The quantitative statistics of the lengths of titles, abstracts, and full texts (measured in the number of tokens) are below. Q1 and Q3 denote the first and third quartiles, respectively.

                 mean               Q1     Median   Q3
    Title        11.16 ± 2.78       9      11       13
    Abstract     33.40 ± 13.86      22     32       43
    Article      276.96 ± 191.74    154    231      343

    The proposed training/dev/test split follows the chronological ordering based on publication date. We use the articles published in the first half (Jan-Jun) of 2021 for validation (2,482 instances) and the ones published in the second half (Jul-Dec) of 2021 and the beginning (Jan-Feb) of 2022 for testing (2,652 instances). The remaining data is used for training (36,109 instances).

    The textual data is shared as a single .tsv file. The visual data (video+image) is shared as a single archive for validation and test splits, and the one from the training split is partitioned based on the publication date.
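
    A hedged sketch of inspecting the corpus; the file and column names below are illustrative, since they are not documented here, and the frame grab relies on the 25 fps / 1280x720 properties stated above.

    ```python
    # Sketch: read the .tsv metadata, then grab one video frame as a candidate
    # pictorial summary.
    import csv

    import cv2  # pip install opencv-python

    with open("mlask.tsv", encoding="utf-8", newline="") as f:  # illustrative name
        rows = list(csv.DictReader(f, delimiter="\t"))
    print(rows[0].keys())  # discover the actual column names

    cap = cv2.VideoCapture("videos/example.mp4")  # illustrative path
    cap.set(cv2.CAP_PROP_POS_FRAMES, 100)         # frame 100 = 4 s at 25 fps
    ok, frame = cap.read()
    if ok:
        cv2.imwrite("candidate_frame.png", frame)
    cap.release()
    ```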

  18. Data from: Figure Associated Text Summarization and Evaluation

    • figshare.com
    message/news
    Updated Jan 18, 2016
    + more versions
    Cite
    Balaji Polepalli Ramesh (2016). Figure Associated Text Summarization and Evaluation [Dataset]. http://doi.org/10.6084/m9.figshare.852976.v3
    Explore at:
    Available download formats: message/news
    Dataset updated
    Jan 18, 2016
    Dataset provided by
    figshare
    Authors
    Balaji Polepalli Ramesh
    License

    Attribution 4.0 (CC BY 4.0)
    https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The gold standard of summaries used to build and evaluate a figure summarization system, consisting of 94 figures from 19 articles.

  19. Assessment Data Summary

    • data.milwaukee.gov
    csv, pdf
    Updated Feb 14, 2025
    Cite
    Assessor's Office (2025). Assessment Data Summary [Dataset]. https://data.milwaukee.gov/dataset/assessment-data-summary
    Explore at:
    Available download formats: pdf (683105), pdf (141178), csv (9732)
    Dataset updated
    Feb 14, 2025
    Dataset authored and provided by
    Assessor's Office
    License

    Attribution 4.0 (CC BY 4.0)
    https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Update Frequency: Annual

    Updated for 2022. End-of-year assessed property values for the City of Milwaukee for the years 1992 to present. These values include real estate property and personal property in Milwaukee, Washington, and Waukesha Counties.

    One data row per year.

    To download XML and JSON files, click the CSV option below and click the down arrow next to the Download button in the upper right on its page.

  20. E-Commerce Summarization Demo

    • figshare.com
    csv
    Updated Jun 26, 2025
    Cite
    Deepchecks Data (2025). E-Commerce Summarization Demo [Dataset]. http://doi.org/10.6084/m9.figshare.28218227.v4
    Explore at:
    Available download formats: csv
    Dataset updated
    Jun 26, 2025
    Dataset provided by
    figshare
    Authors
    Deepchecks Data
    License

    Attribution 4.0 (CC BY 4.0)
    https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description