PubMed dataset for summarization
Dataset for summarization of long documents. Adapted from this repo. Note that the original data are pre-tokenized, so this dataset returns " ".join(text) and adds "\n" for paragraphs. This dataset is compatible with the run_summarization.py script from Transformers if you add this line to the summarization_name_mapping variable: "ccdv/pubmed-summarization": ("article", "abstract")
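A minimal loading sketch, assuming the Hugging Face `datasets` library; exact loading arguments (configuration name, `trust_remote_code`) can vary across library and dataset versions. The `summarization_name_mapping` entry quoted above goes inside Transformers' `run_summarization.py` script itself.

```python
from datasets import load_dataset

# Depending on the dataset version, a configuration name and/or
# trust_remote_code=True may be required.
pubmed = load_dataset("ccdv/pubmed-summarization")

sample = pubmed["train"][0]
print(sample["article"][:300])   # paper body
print(sample["abstract"][:300])  # reference summary

# Inside run_summarization.py, the mapping line from the description:
# summarization_name_mapping = {
#     ...,
#     "ccdv/pubmed-summarization": ("article", "abstract"),
# }
```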
Data Fields
id: paper id
article: a string containing the body… See the full description on the dataset page: https://huggingface.co/datasets/ccdv/pubmed-summarization.
readerbench/ro-text-summarization dataset hosted on Hugging Face and contributed by the HF Datasets community
https://creativecommons.org/publicdomain/zero/1.0/
By ccdv (From Huggingface) [source]
The validation.csv file contains a set of articles along with their respective abstracts that can be used for validating the performance of summarization models. This subset allows researchers to fine-tune their models and measure how well they can summarize scientific texts.
The train.csv file serves as the primary training data for building summarization models. It consists of numerous articles extracted from the Arxiv database, paired with their corresponding abstracts. By utilizing this file, researchers can develop and train various machine learning algorithms to generate accurate summaries of scientific papers.
Lastly, the test.csv file provides a separate set of articles with accompanying abstracts specifically intended for evaluating the performance and effectiveness of summarization models developed using this dataset. Researchers can utilize this test set to conduct rigorous evaluations and benchmark different approaches in automatic document summarization.
With its article and abstract columns, this dataset provides significant flexibility for developing robust models that summarize complex scientific documents, and it supports detailed analysis or multiple summary variations (e.g., different proposed summaries) where users require them.
Introduction:
File Description:
validation.csv: This file contains articles and their respective abstracts that can be used for validation purposes.
train.csv: The purpose of this file is to provide training data for summarizing scientific articles.
test.csv: This file includes a set of articles and their corresponding abstracts that can be used to evaluate the performance of summarization models.
Dataset Structure: The dataset consists of two columns: article (the full text of the scientific paper) and abstract (its reference summary).
Usage Examples: This dataset can be utilized in various ways:
a) Training Models: You can use the train.csv file to train your own model for summarizing scientific articles from the Arxiv database. The article column provides the full text of each scientific paper, while the abstract column contains its summary.
b) Validation: The validation.csv file allows you to validate your trained models by comparing their generated summaries with the provided reference summaries in order to assess their performance.
c) Evaluation: Utilize the test.csv file as a benchmark for evaluating different summarization models. Generate summaries using your selected model and compare them with reference summaries.
- Evaluating Performance: To measure how well your summarization model performs on this dataset, you can use metrics like ROUGE (Recall-Oriented Understudy for Gisting Evaluation). ROUGE measures overlap between generated summaries and reference summaries based on n-gram co-occurrence statistics.
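As a concrete illustration of the ROUGE evaluation described above, here is a minimal sketch using the `rouge-score` package (one of several ROUGE implementations); the reference and generated summaries are placeholder strings:

```python
from rouge_score import rouge_scorer

reference = "we propose a discourse-aware model for abstractive summarization of long documents ."
generated = "a discourse-aware model is proposed for summarizing long scientific documents ."

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, generated)  # reference first, prediction second

for name, score in scores.items():
    print(f"{name}: precision={score.precision:.3f} recall={score.recall:.3f} f1={score.fmeasure:.3f}")
```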
- Summarizing scientific articles: This dataset can be used to train and evaluate summarization models for the task of generating concise summaries of scientific articles from the Arxiv database. Researchers can utilize this dataset to develop novel techniques and approaches for automatic summarization in the scientific domain.
- Information retrieval: The dataset can be used to enhance search engines or information retrieval systems by providing concise summaries along with the full text of scientific articles. This would enable users to quickly grasp key information without having to read the entire article, improving accessibility and efficiency.
- Text generation research: Researchers interested in natural language processing and text generation can use this dataset as a benchmark for developing new models and algorithms that generate coherent, informative, and concise summaries from lengthy scientific texts. The dataset provides a diverse range of articles across various domains, allowing researchers to explore different challenges in summary generation
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain...
https://creativecommons.org/publicdomain/zero/1.0/
By ccdv (From Huggingface) [source]
The dataset consists of multiple files, including validation.csv, train.csv, and test.csv. Each file contains a combination of articles and their respective abstracts. The articles are sourced directly from PubMed, ensuring they represent a wide range of topics across various scientific disciplines.
In order to provide reliable datasets for different purposes, the files have been carefully curated to serve specific functions. validation.csv contains a subset of articles with their corresponding abstracts that can be used for validating the performance of summarization models during development. train.csv features a larger set of article-abstract pairs specifically intended for training such models.
Finally, test.csv serves as an independent evaluation set that allows developers to measure the effectiveness and generalizability of their summarization models against unseen data points. By using this test set, researchers can assess how well their algorithms perform in generating concise summaries that accurately capture the main findings and conclusions within scientific articles.
Researchers in natural language processing (NLP), machine learning (ML), or any related field can utilize this dataset to advance automatic text summarization techniques focused on scientific literature. Whether it's building extractive or abstractive methods or exploring novel approaches like neural networks or transformer-based architectures, this rich dataset provides ample opportunities for experimentation and progress in the field.
Introduction:
Dataset Structure:
- article: The full text of a scientific article from the PubMed database (Text).
- abstract: A summary of the main findings and conclusions of the article (Text).
Using the Dataset: To maximize the utility of this dataset, it is important to understand its purpose and how it can be utilized:
Training Models: The train.csv file contains articles and their corresponding abstracts that can be used for training summarization models or developing algorithms that generate concise summaries automatically.
Validation Purposes: The validation.csv file serves as a test set for fine-tuning your models or comparing different approaches during development.
Evaluating Model Performance: The test.csv file offers a separate set of articles along with their corresponding abstracts specifically designed for evaluating the performance of various summarization models.
Tips for Utilizing the Dataset Effectively:
Preprocessing: Before using this dataset, consider preprocessing steps such as removing irrelevant sections (e.g., acknowledgments, references) and cleaning up invalid characters or formatting issues, if any exist (a rough sketch follows these tips).
Feature Engineering: Explore additional features like article length, sentence structure complexity, or domain-specific details that may assist in improving summarization model performance.
Model Selection & Evaluation: Experiment with different summarization algorithms, ranging from traditional extractive approaches to more advanced abstractive methods. Evaluate model performance using established metrics such as ROUGE (Recall-Oriented Understudy for Gisting Evaluation).
Data Augmentation: Depending on the size of your dataset, you may consider augmenting it further by applying techniques like data synthesis or employing external resources (e.g., pre-trained language models) to enhance model performance.
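To make the preprocessing tip above concrete, here is a rough sketch that strips trailing reference/acknowledgment sections and collapses leftover whitespace. The heading patterns are illustrative assumptions; real PubMed articles use many heading variants, so adapt them to your data:

```python
import re

# Illustrative cut-off headings; extend as needed for your corpus.
SECTION_CUTOFF = re.compile(
    r"\n\s*(references|bibliography|acknowledge?ments)\s*\n",
    flags=re.IGNORECASE,
)

def strip_trailing_sections(article: str) -> str:
    """Drop everything from the first cut-off heading onward."""
    match = SECTION_CUTOFF.search(article)
    return article[: match.start()] if match else article

def clean_whitespace(text: str) -> str:
    """Collapse repeated whitespace left over from the pre-tokenized source text."""
    return re.sub(r"\s+", " ", text).strip()

# `raw_article` stands in for one value of the article column from train.csv.
raw_article = "introduction\nsome findings ...\nreferences\n1. cited work"
print(clean_whitespace(strip_trailing_sections(raw_article)))
```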
Conclusion:
- Textual analysis and information retrieval: Researchers can use this dataset to analyze patterns in scientific literature or conduct information retrieval tasks. By examining the relationship between article content and its abstract, researchers can gain insights into how different sections of a scientific paper contribute to its overall summary.
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: validation.csv | Column name | Description ...
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
@inproceedings{cohan-etal-2018-discourse,
  title = "A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents",
  author = "Cohan, Arman and Dernoncourt, Franck and Kim, Doo Soon and Bui, Trung and Kim, Seokhwan and Chang, Walter and Goharian, Nazli",
  booktitle = "Proceedings of the 2018 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)",
  month = jun,
  year = "2018",
  address = "New Orleans, Louisiana",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/N18-2097",
  doi = "10.18653/v1/N18-2097",
  pages = "615--621",
  abstract = "Neural abstractive summarization models have led to promising results in summarizing relatively short documents. We propose the first model for abstractive summarization of single, longer-form documents (e.g., research papers). Our approach consists of a new hierarchical encoder that models the discourse structure of a document, and an attentive discourse-aware decoder to generate the summary. Empirical results on two large-scale datasets of scientific papers show that our model significantly outperforms state-of-the-art models.",
}
Adapted from: https://github.com/armancohan/long-summarization and https://huggingface.co/datasets/ccdv/arxiv-summarization
https://choosealicense.com/licenses/unknown/
Extreme Summarization (XSum) Dataset.
There are three features:
- document: Input news article.
- summary: One sentence summary of the article.
- id: BBC ID of the article.
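A minimal loading sketch for these features, assuming the Hugging Face `datasets` library (newer library versions may require extra loading arguments such as `trust_remote_code`):

```python
from datasets import load_dataset

xsum = load_dataset("xsum")       # splits: train / validation / test
example = xsum["train"][0]
print(example["document"][:200])  # input news article
print(example["summary"])         # one-sentence summary
print(example["id"])              # BBC ID of the article
```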
https://choosealicense.com/licenses/llama3.1/
SaiCharanChetpelly/legal-summarization dataset hosted on Hugging Face and contributed by the HF Datasets community
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
The "Daily Mail Articles and Highlights" dataset comprises a meticulously curated collection of 8,176 articles, along with their corresponding highlights, sourced directly from the Daily Mail website. This extensive dataset is designed to facilitate the development and training of sophisticated text summarization models that can generate concise and accurate summaries for long-form articles.
The primary goal of this dataset is to train a text summarization model capable of producing brief, yet informative, summaries of given articles. This endeavor is particularly beneficial for readers who seek to grasp the essential points of lengthy articles quickly, thereby enhancing their reading efficiency and comprehension.
The dataset was compiled through an automated web scraping process, ensuring the inclusion of a diverse range of articles spanning various topics and categories. Each article in the dataset is paired with its highlight, which serves as a reference summary. The highlights are succinct extracts that encapsulate the core message of the articles, providing a foundation for training summarization models.
To achieve the goal of creating an efficient summarization system, we employ a combination of cutting-edge technologies and libraries, including:
The summarization model is trained using the collected dataset, following a structured workflow:
The resulting summarization system is designed to automatically produce concise and informative summaries, which can be used in various applications, including:
The "Daily Mail Articles and Highlights" dataset is a valuable resource for advancing the field of text summarization. By leveraging state-of-the-art techniques and libraries, this project aims to develop a robust summarization model that can significantly improve the way we consume and process information. This dataset not only supports the creation of efficient summarization systems but also contributes to the broader goal of making information more accessible and digestible for all.
This is Arabic news data with 9 categories in csv format.
Original data link: https://www.kaggle.com/datasets/muhammedfathi/arabic-news-texts-corpus
Data preparation and summary link: https://www.kaggle.com/code/abdalrahmanshahrour/arabic-text-summarization
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
mayankchugh-learning/text-summarization-logs dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
🇺🇸 English:
This synthetic dataset is created for learning and testing abstractive text summarization models. Each row contains a news-style article and a short summary. The dataset is ideal for experimenting with HuggingFace models such as t5-base, facebook/bart-large-cnn, or google/pegasus-xsum.
🇹🇷 Turkish (translated): This synthetic dataset is designed for those who want to generate summaries from news texts. Each row contains a long English news article and a corresponding short summary. It is compatible with T5, BART, or Pegasus models.
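For instance, one of the checkpoints named above can be tried on a row of this dataset with the `transformers` summarization pipeline; the model choice and generation lengths below are illustrative, not prescribed by the dataset:

```python
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

article = "..."  # placeholder: one news-style article from the dataset
result = summarizer(article, max_length=60, min_length=15, do_sample=False)
print(result[0]["summary_text"])
```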
This dataset was created by Nilesh Malode
https://creativecommons.org/publicdomain/zero/1.0/
The CNN / DailyMail Dataset is an English-language dataset containing just over 300k unique news articles as written by journalists at CNN and the Daily Mail. The current version supports both extractive and abstractive summarization, though the original version was created for machine reading and comprehension and abstractive question answering.
The BCP-47 code for English as generally spoken in the United States is en-US and the BCP-47 code for English as generally spoken in the United Kingdom is en-GB. It is unknown if other varieties of English are represented in the data.
For each instance, there is a string for the article, a string for the highlights, and a string for the id. See the CNN / Daily Mail dataset viewer to explore more examples.
{'id': '0054d6d30dbcad772e20b22771153a2a9cbeaf62',
 'article': '(CNN) -- An American woman died aboard a cruise ship that docked at Rio de Janeiro on Tuesday, the same ship on which 86 passengers previously fell ill, according to the state-run Brazilian news agency, Agencia Brasil. The American tourist died aboard the MS Veendam, owned by cruise operator Holland America. Federal Police told Agencia Brasil that forensic doctors were investigating her death. The ship's doctors told police that the woman was elderly and suffered from diabetes and hypertension, according the agency. The other passengers came down with diarrhea prior to her death during an earlier part of the trip, the ship's doctors said. The Veendam left New York 36 days ago for a South America tour.',
 'highlights': 'The elderly woman suffered from diabetes and hypertension, ship's doctors say .
Previously, 86 passengers had fallen ill on the ship, Agencia Brasil says .'}
The average token count for the articles and the highlights are provided below:
| Feature | Mean Token Count |
|---|---|
| Article | 781 |
| Highlights | 56 |
id: a string containing the hexadecimal-formatted SHA1 hash of the URL where the story was retrieved from
article: a string containing the body of the news article
highlights: a string containing the highlight of the article as written by the article author
The CNN/DailyMail dataset has 3 splits: train, validation, and test. Below are the statistics for Version 3.0.0 of the dataset.
| Dataset Split | Number of Instances in Split |
|---|---|
| Train | 287,113 |
| Validation | 13,368 |
| Test | 11,490 |
Version 1.0.0 aimed to support supervised neural methodologies for machine reading and question answering with a large amount of real natural language training data and released about 313k unique articles and nearly 1M Cloze style questions to go with the articles. Versions 2.0.0 and 3.0.0 changed the structure of the dataset to support summarization rather than question answering. Version 3.0.0 provided a non-anonymized version of the data, whereas both the previous versions were preprocessed to replace named entities with unique identifier labels.
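A brief loading sketch for Version 3.0.0, assuming the Hugging Face `datasets` library (the configuration name mirrors the version string above; loading arguments can differ across library versions):

```python
from datasets import load_dataset

cnn_dm = load_dataset("cnn_dailymail", "3.0.0")
print({split: len(cnn_dm[split]) for split in cnn_dm})  # train / validation / test sizes

sample = cnn_dm["train"][0]
print(sample["id"])             # SHA1 hash of the source URL
print(sample["article"][:300])  # news article body
print(sample["highlights"])     # author-written highlights (target summary)
```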
The data consists of news articles and...
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
CNN/DailyMail non-anonymized summarization dataset.
There are two features:
- article: text of news article, used as the document to be summarized
- highlights: joined text of highlights with "<s>" and "</s>" around each highlight, which is the target summary
Dataset Card for Dataset Name
This dataset card aims to be a base template for new datasets. It has been generated using this raw template.
Dataset Details
Dataset Description
- Curated by: [More Information Needed]
- Funded by [optional]: [More Information Needed]
- Shared by [optional]: [More Information Needed]
- Language(s) (NLP): [More Information Needed]
- License: [More Information Needed]
Dataset Sources [optional]
Repository: [More… See the full description on the dataset page: https://huggingface.co/datasets/danielmekuriaw/Amharic-Text-Summarization-Benchmark-Dataset.
https://creativecommons.org/publicdomain/zero/1.0/
By allegro (From Huggingface) [source]
The Source-Target Pair Dataset for Allegro Articles Summarization is a comprehensive and valuable dataset specifically tailored for training and evaluating the performance of an advanced text summarization model. The dataset comprises three distinct files: validation.csv, train.csv, and test.csv, each containing a rich collection of source-target pairs.
In this dataset, the source column represents the original source text or article from which summarizations are to be derived. This is followed by the target column, which consists of the target summary or desired output summarization corresponding to each respective source text.
The validation.csv file serves as a reliable resource for assessing the model's performance and effectiveness in generating accurate summaries. It contains numerous annotated examples of source-target pairings that serve as benchmarks during evaluation.
On the other hand, train.csv encompasses meticulously curated examples of both sources and their respective target summaries. This valuable resource forms the foundation for training an automated Allegro Articles Summarization model that can effectively condense lengthy articles into concise and coherent summaries.
Lastly, test.csv ensures rigorous testing of the trained model's generalizability by providing additional unseen instances of source-target pairs representing various types of articles across different domains. This allows for robust evaluation of how well the model can perform on real-world scenarios beyond its training data.
The purpose behind this carefully crafted Source-Target Pair Dataset is to facilitate research and development in text summarization techniques with a specific focus on Allegro Articles Summarization tasks. By leveraging this comprehensive dataset, researchers can design more accurate and sophisticated models that significantly enhance our ability to automatically summarize long-form texts efficiently across diverse domains such as news articles, blog posts, academic papers, among others.
In summary, through its meticulous curation and diversification across validation (validation.csv), training (train.csv), and testing (test.csv) splits, this Source-Target Pair Dataset offers an invaluable resource for advancing state-of-the-art techniques in automatic Allegro Articles Summarization.
How to use this dataset for Allegro Articles Summarization
Dataset Overview
The dataset consists of three separate files: validation.csv, train.csv, and test.csv. These files contain source-target pairs that are used for training, validating, and testing the performance of the Allegro Articles Summarization model.
Each file contains multiple columns:
- source: The source text or article from which the summarization is to be generated.
- target: The desired output summarization or target summary of the source text.
Training Your Model
To train your model using this dataset, you can use the train.csv file. This file contains a large number of source-target pairs that can be used for training your summarization model. You can load this data into your preferred machine learning framework or language like Python with libraries such as Pandas or NumPy.
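A minimal loading sketch with pandas, assuming train.csv is in the working directory and uses the source/target columns described above:

```python
import pandas as pd

train_df = pd.read_csv("train.csv")       # columns: source, target
print(train_df.shape)
print(train_df.loc[0, "source"][:200])    # original article text
print(train_df.loc[0, "target"])          # desired summary
```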
Here are some steps to follow while training your model:
Preprocessing:
- Clean the data by removing dates or other unwanted metadata if required.
- Perform any necessary data cleaning steps such as removing special characters, lowercasing text, etc.
Defining a Model Architecture:
- Choose a suitable algorithm/model architecture for article summarization. Some popular options include sequence-to-sequence models (e.g., LSTM), transformer models (e.g., BERT), or pointer-generator networks.
Training Process:
- Split your data into training and validation sets.
- Feed in the source text as input and compare it with target summaries during each epoch to optimize loss/error rate using gradient descent algorithms.
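A compressed sketch of this training step, assuming the Hugging Face `transformers` and `datasets` libraries; the t5-small checkpoint, hyperparameters, truncation lengths, and output directory are illustrative placeholders rather than values prescribed by the dataset:

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

checkpoint = "t5-small"  # illustrative choice of sequence-to-sequence model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# train.csv / validation.csv with "source" and "target" columns, as described above.
data = load_dataset("csv", data_files={"train": "train.csv", "validation": "validation.csv"})

def preprocess(batch):
    inputs = tokenizer(batch["source"], max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["target"], max_length=128, truncation=True)
    inputs["labels"] = labels["input_ids"]
    return inputs

tokenized = data.map(preprocess, batched=True, remove_columns=data["train"].column_names)

args = Seq2SeqTrainingArguments(
    output_dir="allegro-summarizer",   # hypothetical output directory
    per_device_train_batch_size=8,
    learning_rate=3e-4,
    num_train_epochs=3,
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```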
Hyperparameter Tuning:
- Experiment with different hyperparameters such as learning rate, batch size, model depth, etc., to improve performance.
- Use techniques like grid search or random search to find the optimal combination of hyperparameters.
Model Evaluation:
- Evaluate your model on a separate test dataset (e.g., test.csv) that you have set aside for final evaluation.
- Calculate metrics like ROU...
antash420/long-context-text-summarization-alpaca-format dataset hosted on Hugging Face and contributed by the HF Datasets community
https://creativecommons.org/publicdomain/zero/1.0/
By mlsum (From Huggingface) [source]
The MLSUM dataset, also known as the Multilingual Summarization Dataset, is a comprehensive and extensive collection of data specifically tailored for multilingual summarization tasks. With over 1.5 million meticulously curated pairs of articles and summaries, this dataset serves as an invaluable resource for researchers in the field of multilingual summarization.
This dataset is sourced from a wide range of reputable online newspapers and encompasses articles written in five distinct languages: French, German, Spanish, Russian, and Turkish. By incorporating diverse linguistic sources, the MLSUM dataset allows for the exploration of various language-specific nuances and challenges that arise in the process of generating accurate and informative summaries.
Each article-summary pair within this highly curated dataset has been carefully selected to ensure relevance and accuracy. The articles span across a broad spectrum of topics and domains to encompass a diverse range of subject matter. With such comprehensiveness in content coverage across multiple languages, researchers can explore various topics while keeping cultural context intact.
The MLSUM dataset goes beyond mere translation by providing high-quality summaries that capture key information from each article concisely yet effectively. These summaries are designed to encapsulate the essence of each article while maintaining coherence and readability.
As an unprecedentedly large-scale collection with its vast number of articles spanning multiple languages, it enables researchers to develop novel approaches towards improving multilingual summarization models by allowing them to explore cross-lingual transfer learning techniques.
Overall, this extensive MLSUM dataset facilitates significant advancements in research pertaining to multilingual summarization tasks by offering rich resources across different languages while maintaining contextual relevance between articles and their corresponding summaries
Guide: How to Use the MLSUM Dataset for Multilingual Summarization Tasks
The MLSUM dataset is a valuable resource for researchers working on multilingual summarization tasks. With over 1.5 million pairs of articles and summaries in five different languages, it offers a wide range of possibilities for training and evaluating summarization models.
Here's a step-by-step guide on how to make the most out of this dataset:
Familiarize Yourself with the Dataset Structure:
- The dataset is organized into separate files based on language and purpose (e.g., test, validation).
- Each file contains columns such as text, summary, topic, URL, and title.
- The text column contains the main body of the article, while the summary column contains a concise summary of the article.
- The topic column provides information about the category or topic of each article.
Choose Your Target Language:
- Decide which language you want to focus on for your multilingual summarization task.
- Remember that MLSUM covers five languages: French, German, Spanish, Russian, and Turkish.
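A minimal loading sketch for a single language, assuming the Hugging Face `datasets` library; the per-language configuration name ("de" for German below) is an assumption to adapt to your chosen language, and loading arguments may vary across library versions:

```python
from datasets import load_dataset

mlsum_de = load_dataset("mlsum", "de")  # "de" assumed as the German configuration name
example = mlsum_de["train"][0]
print(example["title"])
print(example["topic"])
print(example["summary"])
print(example["text"][:300])
```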
Determine Your Task:
- Define your specific summarization task. For example:
- Single-document summarization: Generate a succinct summary for each individual article.
- Multidocument summarization: Generate a summary by considering multiple related articles as input.
Preprocess the Data:
- Clean and preprocess the text data according to your specific needs (e.g., lowercasing letters, removing punctuation).
Splitting Data Into Training/Validation/Test Sets: Ensure proper separation between training data (to train your model), validation data (to tune hyperparameters), and evaluation/test data (to evaluate model performance).
Build or Adapt Your Summarization Model: Depending on your chosen task and programming abilities, decide whether you will adapt an existing model or build a new one from scratch. You may use existing state-of-the-art models such as BART, T5, GPT, or other Transformer-based architectures.
- Multilingual Summarization Research: The MLSUM dataset provides a rich resource for researchers to study and develop multilingual summarization models. With over 1.5 million article/summary pairs in five different languages, the dataset can be used to train and evaluate the performance of multilingual summarization algorithms.
- Comparative Analysis of Summariz...
https://choosealicense.com/licenses/unknown/
Multi-Document is a large-scale multi-document summarization dataset created from scientific articles. It introduces a challenging multi-document summarization task: writing the related-work section of a paper based on its abstract and the articles it references.
https://github.com/Alex-Fabbri/Multi-News
Multi-News consists of news articles and human-written summaries of these articles from the site newser.com. Each summary is professionally written by editors and includes links to the original articles cited. There are two features:
- document: text of news articles separated by the special token "|||||".
- summary: news summary.