PubMed dataset for summarization
Dataset for summarization of long documents. Adapted from this repo. Note that the original data are pre-tokenized, so this dataset returns " ".join(text) and adds "\n" for paragraphs. This dataset is compatible with the run_summarization.py script from Transformers if you add this line to the summarization_name_mapping variable: "ccdv/pubmed-summarization": ("article", "abstract")
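A minimal loading sketch, assuming the Hugging Face `datasets` library; exact loading arguments (configuration name, `trust_remote_code`) can vary across library and dataset versions. The `summarization_name_mapping` entry quoted above goes inside Transformers' `run_summarization.py` script itself.

```python
from datasets import load_dataset

# Depending on the dataset version, a configuration name and/or
# trust_remote_code=True may be required.
pubmed = load_dataset("ccdv/pubmed-summarization")

sample = pubmed["train"][0]
print(sample["article"][:300])   # paper body
print(sample["abstract"][:300])  # reference summary

# Inside run_summarization.py, the mapping line from the description:
# summarization_name_mapping = {
#     ...,
#     "ccdv/pubmed-summarization": ("article", "abstract"),
# }
```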
Data Fields
id: paper id
article: a string containing the body… See the full description on the dataset page: https://huggingface.co/datasets/ccdv/pubmed-summarization.
readerbench/ro-text-summarization dataset hosted on Hugging Face and contributed by the HF Datasets community
https://creativecommons.org/publicdomain/zero/1.0/
By ccdv (From Huggingface) [source]
The validation.csv file contains a set of articles along with their respective abstracts that can be used for validating the performance of summarization models. This subset allows researchers to fine-tune their models and measure how well they can summarize scientific texts.
The train.csv file serves as the primary training data for building summarization models. It consists of numerous articles extracted from the Arxiv database, paired with their corresponding abstracts. By utilizing this file, researchers can develop and train various machine learning algorithms to generate accurate summaries of scientific papers.
Lastly, the test.csv file provides a separate set of articles with accompanying abstracts specifically intended for evaluating the performance and effectiveness of summarization models developed using this dataset. Researchers can utilize this test set to conduct rigorous evaluations and benchmark different approaches in automatic document summarization.
With its article and abstract columns, this dataset provides significant flexibility for developing robust models that summarize complex scientific documents, and it supports detailed analysis or multiple summary variations (e.g., different proposed summaries) where users require them.
Introduction:
File Description:
validation.csv: This file contains articles and their respective abstracts that can be used for validation purposes.
train.csv: The purpose of this file is to provide training data for summarizing scientific articles.
test.csv: This file includes a set of articles and their corresponding abstracts that can be used to evaluate the performance of summarization models.
Dataset Structure: The dataset consists of two columns: article (the full text of the scientific paper) and abstract (its reference summary).
Usage Examples: This dataset can be utilized in various ways:
a) Training Models: You can use the train.csv file to train your own model for summarizing scientific articles from the Arxiv database. The article column provides the full text of each scientific paper, while the abstract column contains its summary.
b) Validation: The validation.csv file allows you to validate your trained models by comparing their generated summaries with the provided reference summaries in order to assess their performance.
c) Evaluation: Utilize the test.csv file as a benchmark for evaluating different summarization models. Generate summaries using your selected model and compare them with reference summaries.
- Evaluating Performance: To measure how well your summarization model performs on this dataset, you can use metrics like ROUGE (Recall-Oriented Understudy for Gisting Evaluation). ROUGE measures overlap between generated summaries and reference summaries based on n-gram co-occurrence statistics.
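As a concrete illustration of the ROUGE evaluation described above, here is a minimal sketch using the `rouge-score` package (one of several ROUGE implementations); the reference and generated summaries are placeholder strings:

```python
from rouge_score import rouge_scorer

reference = "we propose a discourse-aware model for abstractive summarization of long documents ."
generated = "a discourse-aware model is proposed for summarizing long scientific documents ."

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, generated)  # reference first, prediction second

for name, score in scores.items():
    print(f"{name}: precision={score.precision:.3f} recall={score.recall:.3f} f1={score.fmeasure:.3f}")
```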
- Summarizing scientific articles: This dataset can be used to train and evaluate summarization models for the task of generating concise summaries of scientific articles from the Arxiv database. Researchers can utilize this dataset to develop novel techniques and approaches for automatic summarization in the scientific domain.
- Information retrieval: The dataset can be used to enhance search engines or information retrieval systems by providing concise summaries along with the full text of scientific articles. This would enable users to quickly grasp key information without having to read the entire article, improving accessibility and efficiency.
- Text generation research: Researchers interested in natural language processing and text generation can use this dataset as a benchmark for developing new models and algorithms that generate coherent, informative, and concise summaries from lengthy scientific texts. The dataset provides a diverse range of articles across various domains, allowing researchers to explore different challenges in summary generation
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain...
https://creativecommons.org/publicdomain/zero/1.0/
By ccdv (From Huggingface) [source]
The dataset consists of multiple files, including validation.csv, train.csv, and test.csv. Each file contains a combination of articles and their respective abstracts. The articles are sourced directly from PubMed, ensuring they represent a wide range of topics across various scientific disciplines.
In order to provide reliable datasets for different purposes, the files have been carefully curated to serve specific functions. validation.csv contains a subset of articles with their corresponding abstracts that can be used for validating the performance of summarization models during development. train.csv features a larger set of article-abstract pairs specifically intended for training such models.
Finally, test.csv serves as an independent evaluation set that allows developers to measure the effectiveness and generalizability of their summarization models against unseen data points. By using this test set, researchers can assess how well their algorithms perform in generating concise summaries that accurately capture the main findings and conclusions within scientific articles.
Researchers in natural language processing (NLP), machine learning (ML), or any related field can utilize this dataset to advance automatic text summarization techniques focused on scientific literature. Whether it's building extractive or abstractive methods or exploring novel approaches like neural networks or transformer-based architectures, this rich dataset provides ample opportunities for experimentation and progress in the field.
Introduction:
Dataset Structure:
- article: The full text of a scientific article from the PubMed database (Text).
- abstract: A summary of the main findings and conclusions of the article (Text).
Using the Dataset: To maximize the utility of this dataset, it is important to understand its purpose and how it can be utilized:
Training Models: The train.csv file contains articles and their corresponding abstracts that can be used for training summarization models or developing algorithms that generate concise summaries automatically.
Validation Purposes: The validation.csv file serves as a test set for fine-tuning your models or comparing different approaches during development.
Evaluating Model Performance: The test.csv file offers a separate set of articles along with their corresponding abstracts specifically designed for evaluating the performance of various summarization models.
Tips for Utilizing the Dataset Effectively:
Preprocessing: Before using this dataset, consider preprocessing steps such as removing irrelevant sections (e.g., acknowledgments, references) and cleaning up invalid characters or formatting issues, if any exist (a rough sketch follows these tips).
Feature Engineering: Explore additional features like article length, sentence structure complexity, or domain-specific details that may assist in improving summarization model performance.
Model Selection & Evaluation: Experiment with different summarization algorithms, ranging from traditional extractive approaches to more advanced abstractive methods. Evaluate model performance using established metrics such as ROUGE (Recall-Oriented Understudy for Gisting Evaluation).
Data Augmentation: Depending on the size of your dataset, you may consider augmenting it further by applying techniques like data synthesis or employing external resources (e.g., pre-trained language models) to enhance model performance.
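To make the preprocessing tip above concrete, here is a rough sketch that strips trailing reference/acknowledgment sections and collapses leftover whitespace. The heading patterns are illustrative assumptions; real PubMed articles use many heading variants, so adapt them to your data:

```python
import re

# Illustrative cut-off headings; extend as needed for your corpus.
SECTION_CUTOFF = re.compile(
    r"\n\s*(references|bibliography|acknowledge?ments)\s*\n",
    flags=re.IGNORECASE,
)

def strip_trailing_sections(article: str) -> str:
    """Drop everything from the first cut-off heading onward."""
    match = SECTION_CUTOFF.search(article)
    return article[: match.start()] if match else article

def clean_whitespace(text: str) -> str:
    """Collapse repeated whitespace left over from the pre-tokenized source text."""
    return re.sub(r"\s+", " ", text).strip()

# `raw_article` stands in for one value of the article column from train.csv.
raw_article = "introduction\nsome findings ...\nreferences\n1. cited work"
print(clean_whitespace(strip_trailing_sections(raw_article)))
```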
Conclusion:
- Textual analysis and information retrieval: Researchers can use this dataset to analyze patterns in scientific literature or conduct information retrieval tasks. By examining the relationship between article content and its abstract, researchers can gain insights into how different sections of a scientific paper contribute to its overall summary.
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: validation.csv | Column name | Description ...
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
@inproceedings{cohan-etal-2018-discourse,
  title = "A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents",
  author = "Cohan, Arman and Dernoncourt, Franck and Kim, Doo Soon and Bui, Trung and Kim, Seokhwan and Chang, Walter and Goharian, Nazli",
  booktitle = "Proceedings of the 2018 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)",
  month = jun,
  year = "2018",
  address = "New Orleans, Louisiana",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/N18-2097",
  doi = "10.18653/v1/N18-2097",
  pages = "615--621",
  abstract = "Neural abstractive summarization models have led to promising results in summarizing relatively short documents. We propose the first model for abstractive summarization of single, longer-form documents (e.g., research papers). Our approach consists of a new hierarchical encoder that models the discourse structure of a document, and an attentive discourse-aware decoder to generate the summary. Empirical results on two large-scale datasets of scientific papers show that our model significantly outperforms state-of-the-art models.",
}
Adapted from: https://github.com/armancohan/long-summarization and https://huggingface.co/datasets/ccdv/arxiv-summarization
https://choosealicense.com/licenses/unknown/
Extreme Summarization (XSum) Dataset.
There are three features:
- document: Input news article.
- summary: One sentence summary of the article.
- id: BBC ID of the article.
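A minimal loading sketch for these features, assuming the Hugging Face `datasets` library (newer library versions may require extra loading arguments such as `trust_remote_code`):

```python
from datasets import load_dataset

xsum = load_dataset("xsum")       # splits: train / validation / test
example = xsum["train"][0]
print(example["document"][:200])  # input news article
print(example["summary"])         # one-sentence summary
print(example["id"])              # BBC ID of the article
```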
https://choosealicense.com/licenses/llama3.1/
SaiCharanChetpelly/legal-summarization dataset hosted on Hugging Face and contributed by the HF Datasets community
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
The "Daily Mail Articles and Highlights" dataset comprises a meticulously curated collection of 8,176 articles, along with their corresponding highlights, sourced directly from the Daily Mail website. This extensive dataset is designed to facilitate the development and training of sophisticated text summarization models that can generate concise and accurate summaries for long-form articles.
The primary goal of this dataset is to train a text summarization model capable of producing brief, yet informative, summaries of given articles. This endeavor is particularly beneficial for readers who seek to grasp the essential points of lengthy articles quickly, thereby enhancing their reading efficiency and comprehension.
The dataset was compiled through an automated web scraping process, ensuring the inclusion of a diverse range of articles spanning various topics and categories. Each article in the dataset is paired with its highlight, which serves as a reference summary. The highlights are succinct extracts that encapsulate the core message of the articles, providing a foundation for training summarization models.
To achieve the goal of creating an efficient summarization system, we employ a combination of cutting-edge technologies and libraries, including:
The summarization model is trained using the collected dataset, following a structured workflow:
The resulting summarization system is designed to automatically produce concise and informative summaries, which can be used in various applications, including:
The "Daily Mail Articles and Highlights" dataset is a valuable resource for advancing the field of text summarization. By leveraging state-of-the-art techniques and libraries, this project aims to develop a robust summarization model that can significantly improve the way we consume and process information. This dataset not only supports the creation of efficient summarization systems but also contributes to the broader goal of making information more accessible and digestible for all.
This is Arabic news data with 9 categories in csv format.
Original data link: https://www.kaggle.com/datasets/muhammedfathi/arabic-news-texts-corpus
Data preparation and summary link: https://www.kaggle.com/code/abdalrahmanshahrour/arabic-text-summarization
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
mayankchugh-learning/text-summarization-logs dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
🇺🇸 English:
This synthetic dataset is created for learning and testing abstractive text summarization models. Each row contains a news-style article and a short summary. The dataset is ideal for experimenting with HuggingFace models such as t5-base, facebook/bart-large-cnn, or google/pegasus-xsum.
🇹🇷 Turkish (translated): This synthetic dataset is designed for those who want to generate summaries from news texts. Each row contains a long English news article and a corresponding short summary. It is compatible with T5, BART, or Pegasus models.
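For instance, one of the checkpoints named above can be tried on a row of this dataset with the `transformers` summarization pipeline; the model choice and generation lengths below are illustrative, not prescribed by the dataset:

```python
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

article = "..."  # placeholder: one news-style article from the dataset
result = summarizer(article, max_length=60, min_length=15, do_sample=False)
print(result[0]["summary_text"])
```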
This dataset was created by Nilesh Malode
https://creativecommons.org/publicdomain/zero/1.0/
The CNN / DailyMail Dataset is an English-language dataset containing just over 300k unique news articles as written by journalists at CNN and the Daily Mail. The current version supports both extractive and abstractive summarization, though the original version was created for machine reading and comprehension and abstractive question answering.
The BCP-47 code for English as generally spoken in the United States is en-US and the BCP-47 code for English as generally spoken in the United Kingdom is en-GB. It is unknown if other varieties of English are represented in the data.
For each instance, there is a string for the article, a string for the highlights, and a string for the id. See the CNN / Daily Mail dataset viewer to explore more examples.
{'id': '0054d6d30dbcad772e20b22771153a2a9cbeaf62',
 'article': '(CNN) -- An American woman died aboard a cruise ship that docked at Rio de Janeiro on Tuesday, the same ship on which 86 passengers previously fell ill, according to the state-run Brazilian news agency, Agencia Brasil. The American tourist died aboard the MS Veendam, owned by cruise operator Holland America. Federal Police told Agencia Brasil that forensic doctors were investigating her death. The ship's doctors told police that the woman was elderly and suffered from diabetes and hypertension, according the agency. The other passengers came down with diarrhea prior to her death during an earlier part of the trip, the ship's doctors said. The Veendam left New York 36 days ago for a South America tour.',
 'highlights': 'The elderly woman suffered from diabetes and hypertension, ship's doctors say .
Previously, 86 passengers had fallen ill on the ship, Agencia Brasil says .'}
The average token count for the articles and the highlights are provided below:
| Feature | Mean Token Count |
|---|---|
| Article | 781 |
| Highlights | 56 |
id: a string containing the hexadecimal-formatted SHA1 hash of the URL where the story was retrieved from
article: a string containing the body of the news article
highlights: a string containing the highlight of the article as written by the article author
The CNN/DailyMail dataset has 3 splits: train, validation, and test. Below are the statistics for Version 3.0.0 of the dataset.
| Dataset Split | Number of Instances in Split |
|---|---|
| Train | 287,113 |
| Validation | 13,368 |
| Test | 11,490 |
Version 1.0.0 aimed to support supervised neural methodologies for machine reading and question answering with a large amount of real natural language training data and released about 313k unique articles and nearly 1M Cloze style questions to go with the articles. Versions 2.0.0 and 3.0.0 changed the structure of the dataset to support summarization rather than question answering. Version 3.0.0 provided a non-anonymized version of the data, whereas both the previous versions were preprocessed to replace named entities with unique identifier labels.
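A brief loading sketch for Version 3.0.0, assuming the Hugging Face `datasets` library (the configuration name mirrors the version string above; loading arguments can differ across library versions):

```python
from datasets import load_dataset

cnn_dm = load_dataset("cnn_dailymail", "3.0.0")
print({split: len(cnn_dm[split]) for split in cnn_dm})  # train / validation / test sizes

sample = cnn_dm["train"][0]
print(sample["id"])             # SHA1 hash of the source URL
print(sample["article"][:300])  # news article body
print(sample["highlights"])     # author-written highlights (target summary)
```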
The data consists of news articles and...
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
CNN/DailyMail non-anonymized summarization dataset.
There are two features:
- article: text of news article, used as the document to be summarized
- highlights: joined text of highlights with "<s>" and "</s>" around each highlight, which is the target summary
Dataset Card for Dataset Name
This dataset card aims to be a base template for new datasets. It has been generated using this raw template.
Dataset Details
Dataset Description
- Curated by: [More Information Needed]
- Funded by [optional]: [More Information Needed]
- Shared by [optional]: [More Information Needed]
- Language(s) (NLP): [More Information Needed]
- License: [More Information Needed]
Dataset Sources [optional]
Repository: [More… See the full description on the dataset page: https://huggingface.co/datasets/danielmekuriaw/Amharic-Text-Summarization-Benchmark-Dataset.
https://creativecommons.org/publicdomain/zero/1.0/
By allegro (From Huggingface) [source]
The Source-Target Pair Dataset for Allegro Articles Summarization is a comprehensive and valuable dataset specifically tailored for training and evaluating the performance of an advanced text summarization model. The dataset comprises three distinct files: validation.csv, train.csv, and test.csv, each containing a rich collection of source-target pairs.
In this dataset, the source column represents the original source text or article from which summarizations are to be derived. This is followed by the target column, which consists of the target summary or desired output summarization corresponding to each respective source text.
The validation.csv file serves as a reliable resource for assessing the model's performance and effectiveness in generating accurate summaries. It contains numerous annotated examples of source-target pairings that serve as benchmarks during evaluation.
On the other hand, train.csv encompasses meticulously curated examples of both sources and their respective target summaries. This valuable resource forms the foundation for training an automated Allegro Articles Summarization model that can effectively condense lengthy articles into concise and coherent summaries.
Lastly, test.csv ensures rigorous testing of the trained model's generalizability by providing additional unseen instances of source-target pairs representing various types of articles across different domains. This allows for robust evaluation of how well the model can perform on real-world scenarios beyond its training data.
The purpose behind this carefully crafted Source-Target Pair Dataset is to facilitate research and development in text summarization techniques with a specific focus on Allegro Articles Summarization tasks. By leveraging this comprehensive dataset, researchers can design more accurate and sophisticated models that significantly enhance our ability to automatically summarize long-form texts efficiently across diverse domains such as news articles, blog posts, academic papers, among others.
In summary, through its meticulous curation and diversification across validation (validation.csv), training (train.csv), and testing (test.csv) splits, this Source-Target Pair Dataset offers an invaluable resource for advancing state-of-the-art techniques in automatic Allegro Articles Summarization.
How to use this dataset for Allegro Articles Summarization
Dataset Overview
The dataset consists of three separate files: validation.csv, train.csv, and test.csv. These files contain source-target pairs that are used for training, validating, and testing the performance of the Allegro Articles Summarization model.
Each file contains multiple columns:
- source: The source text or article from which the summarization is to be generated.
- target: The desired output summarization or target summary of the source text.
Training Your Model
To train your model using this dataset, you can use the train.csv file. This file contains a large number of source-target pairs that can be used for training your summarization model. You can load this data into your preferred machine learning framework or language like Python with libraries such as Pandas or NumPy.
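A minimal loading sketch with pandas, assuming train.csv is in the working directory and uses the source/target columns described above:

```python
import pandas as pd

train_df = pd.read_csv("train.csv")       # columns: source, target
print(train_df.shape)
print(train_df.loc[0, "source"][:200])    # original article text
print(train_df.loc[0, "target"])          # desired summary
```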
Here are some steps to follow while training your model:
Preprocessing:
- Clean the data by removing dates or other unwanted metadata if required.
- Perform any necessary data cleaning steps such as removing special characters, lowercasing text, etc.
Defining a Model Architecture:
- Choose a suitable algorithm/model architecture for article summarization. Some popular options include sequence-to-sequence models (e.g., LSTM), transformer models (e.g., BERT), or pointer-generator networks.
Training Process:
- Split your data into training and validation sets.
- Feed in the source text as input and compare it with target summaries during each epoch to optimize loss/error rate using gradient descent algorithms.
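A compressed sketch of this training step, assuming the Hugging Face `transformers` and `datasets` libraries; the t5-small checkpoint, hyperparameters, truncation lengths, and output directory are illustrative placeholders rather than values prescribed by the dataset:

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

checkpoint = "t5-small"  # illustrative choice of sequence-to-sequence model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# train.csv / validation.csv with "source" and "target" columns, as described above.
data = load_dataset("csv", data_files={"train": "train.csv", "validation": "validation.csv"})

def preprocess(batch):
    inputs = tokenizer(batch["source"], max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["target"], max_length=128, truncation=True)
    inputs["labels"] = labels["input_ids"]
    return inputs

tokenized = data.map(preprocess, batched=True, remove_columns=data["train"].column_names)

args = Seq2SeqTrainingArguments(
    output_dir="allegro-summarizer",   # hypothetical output directory
    per_device_train_batch_size=8,
    learning_rate=3e-4,
    num_train_epochs=3,
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```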
Hyperparameter Tuning:
- Experiment with different hyperparameters such as learning rate, batch size, model depth, etc., to improve performance.
- Use techniques like grid search or random search to find the optimal combination of hyperparameters.
Model Evaluation:
- Evaluate your model on a separate test dataset (e.g., test.csv) that you have set aside for final evaluation.
- Calculate metrics like ROU...
antash420/long-context-text-summarization-alpaca-format dataset hosted on Hugging Face and contributed by the HF Datasets community
https://creativecommons.org/publicdomain/zero/1.0/
By mlsum (From Huggingface) [source]
The MLSUM dataset, also known as the Multilingual Summarization Dataset, is a comprehensive and extensive collection of data specifically tailored for multilingual summarization tasks. With over 1.5 million meticulously curated pairs of articles and summaries, this dataset serves as an invaluable resource for researchers in the field of multilingual summarization.
This dataset is sourced from a wide range of reputable online newspapers and encompasses articles written in five distinct languages: French, German, Spanish, Russian, and Turkish. By incorporating diverse linguistic sources, the MLSUM dataset allows for the exploration of various language-specific nuances and challenges that arise in the process of generating accurate and informative summaries.
Each article-summary pair within this highly curated dataset has been carefully selected to ensure relevance and accuracy. The articles span across a broad spectrum of topics and domains to encompass a diverse range of subject matter. With such comprehensiveness in content coverage across multiple languages, researchers can explore various topics while keeping cultural context intact.
The MLSUM dataset goes beyond mere translation by providing high-quality summaries that capture key information from each article concisely yet effectively. These summaries are designed to encapsulate the essence of each article while maintaining coherence and readability.
As an unprecedentedly large-scale collection with its vast number of articles spanning multiple languages, it enables researchers to develop novel approaches towards improving multilingual summarization models by allowing them to explore cross-lingual transfer learning techniques.
Overall, this extensive MLSUM dataset facilitates significant advancements in research pertaining to multilingual summarization tasks by offering rich resources across different languages while maintaining contextual relevance between articles and their corresponding summaries
Guide: How to Use the MLSUM Dataset for Multilingual Summarization Tasks
The MLSUM dataset is a valuable resource for researchers working on multilingual summarization tasks. With over 1.5 million pairs of articles and summaries in five different languages, it offers a wide range of possibilities for training and evaluating summarization models.
Here's a step-by-step guide on how to make the most out of this dataset:
Familiarize Yourself with the Dataset Structure:
- The dataset is organized into separate files based on language and purpose (e.g., test, validation).
- Each file contains columns such as text, summary, topic, URL, and title.
- The text column contains the main body of the article, while the summary column contains a concise summary of the article.
- The topic column provides information about the category or topic of each article.
Choose Your Target Language:
- Decide which language you want to focus on for your multilingual summarization task.
- Remember that MLSUM covers five languages: French, German, Spanish, Russian, and Turkish.
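A minimal loading sketch for a single language, assuming the Hugging Face `datasets` library; the per-language configuration name ("de" for German below) is an assumption to adapt to your chosen language, and loading arguments may vary across library versions:

```python
from datasets import load_dataset

mlsum_de = load_dataset("mlsum", "de")  # "de" assumed as the German configuration name
example = mlsum_de["train"][0]
print(example["title"])
print(example["topic"])
print(example["summary"])
print(example["text"][:300])
```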
Determine Your Task:
- Define your specific summarization task. For example:
- Single-document summarization: Generate a succinct summary for each individual article.
- Multidocument summarization: Generate a summary by considering multiple related articles as input.
Preprocess the Data:
- Clean and preprocess the text data according to your specific needs (e.g., lowercasing letters, removing punctuation).
Splitting Data Into Training/Validation/Test Sets: Ensure proper separation between training data (to train your model), validation data (to tune hyperparameters), and evaluation/test data (to evaluate model performance).
Build or Adapt Your Summarization Model: Depending on your chosen task and programming abilities, decide whether you will adapt an existing model or build a new one from scratch. You may use existing state-of-the-art models such as BART, T5, GPT, or other Transformer-based architectures.
- Multilingual Summarization Research: The MLSUM dataset provides a rich resource for researchers to study and develop multilingual summarization models. With over 1.5 million article/summary pairs in five different languages, the dataset can be used to train and evaluate the performance of multilingual summarization algorithms.
- Comparative Analysis of Summariz...
https://choosealicense.com/licenses/unknown/
Multi-Document is a large-scale multi-document summarization dataset created from scientific articles. It introduces a challenging multi-document summarization task: writing the related-work section of a paper based on its abstract and the articles it references.
https://github.com/Alex-Fabbri/Multi-News
Multi-News consists of news articles and human-written summaries of these articles from the site newser.com. Each summary is professionally written by editors and includes links to the original articles cited. There are two features:
- document: text of news articles separated by the special token "|||||".
- summary: news summary.