100+ datasets found
  1. pubmed-summarization

    • huggingface.co
    • opendatalab.com
    Updated Dec 1, 2021
    Cite
    ccdv (2021). pubmed-summarization [Dataset]. https://huggingface.co/datasets/ccdv/pubmed-summarization
    Available formats: Croissant (a machine-learning dataset format; see mlcommons.org/croissant)
    107 scholarly articles cite this dataset (view in Google Scholar)
    Authors
    ccdv
    Description

    PubMed dataset for summarization

    Dataset for summarization of long documents. Adapted from this repo. Note that the original data are pre-tokenized, so this dataset returns " ".join(text) and adds "\n" between paragraphs. This dataset is compatible with the run_summarization.py script from Transformers if you add this line to the summarization_name_mapping variable: "ccdv/pubmed-summarization": ("article", "abstract")

      Data Fields

    • id: paper id
    • article: a string containing the body… See the full description on the dataset page: https://huggingface.co/datasets/ccdv/pubmed-summarization.
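
    A minimal loading sketch using the Hugging Face `datasets` library; the config name ("document" vs. "section") follows the dataset card, and recent `datasets` versions may require trust_remote_code=True for this script-based loader:

        from datasets import load_dataset

        # "article" holds the paper body, "abstract" the target summary.
        ds = load_dataset("ccdv/pubmed-summarization", "document", split="train")
        print(ds[0]["article"][:200])
        print(ds[0]["abstract"][:200])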

  2. ro-text-summarization

    • huggingface.co
    Updated Jun 15, 2020
    Cite
    ReaderBench (2020). ro-text-summarization [Dataset]. https://huggingface.co/datasets/readerbench/ro-text-summarization
    Dataset authored and provided by
    ReaderBench
    Description

    readerbench/ro-text-summarization dataset hosted on Hugging Face and contributed by the HF Datasets community

  3. CCDV Arxiv Summarization Dataset

    • kaggle.com
    zip
    Updated Dec 5, 2023
    Cite
    The Devastator (2023). CCDV Arxiv Summarization Dataset [Dataset]. https://www.kaggle.com/datasets/thedevastator/ccdv-arxiv-summarization-dataset
    Available download formats: zip (2219742528 bytes)
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    CCDV Arxiv Summarization Dataset

    Arxiv Summarization Dataset for CCDV

    By ccdv (From Huggingface) [source]

    About this dataset

    The validation.csv file contains a set of articles along with their respective abstracts that can be used for validating the performance of summarization models. This subset allows researchers to fine-tune their models and measure how well they can summarize scientific texts.

    The train.csv file serves as the primary training data for building summarization models. It consists of numerous articles extracted from the Arxiv database, paired with their corresponding abstracts. By utilizing this file, researchers can develop and train various machine learning algorithms to generate accurate summaries of scientific papers.

    Lastly, the test.csv file provides a separate set of articles with accompanying abstracts specifically intended for evaluating the performance and effectiveness of summarization models developed using this dataset. Researchers can utilize this test set to conduct rigorous evaluations and benchmark different approaches in automatic document summarization.

    With columns labeled article and abstract, this dataset provides significant flexibility for developing robust models that summarize complex scientific documents.

    How to use the dataset

    • File Description:

    • validation.csv: This file contains articles and their respective abstracts that can be used for validation purposes.

    • train.csv: The purpose of this file is to provide training data for summarizing scientific articles.

    • test.csv: This file includes a set of articles and their corresponding abstracts that can be used to evaluate the performance of summarization models.

    • Dataset Structure: Each file consists of two columns: article and abstract.

    • Usage Examples: This dataset can be utilized in various ways:

    a) Training Models: You can use the train.csv file to train your own model for summarizing scientific articles from the Arxiv database. The article column provides the full text of each scientific paper, while the abstract column contains its summary.

    b) Validation: The validation.csv file allows you to validate your trained models by comparing their generated summaries with the provided reference summaries in order to assess their performance.

    c) Evaluation: Utilize the test.csv file as a benchmark for evaluating different summarization models. Generate summaries using your selected model and compare them with reference summaries.

    • Evaluating Performance: To measure how well your summarization model performs on this dataset, you can use metrics like ROUGE (Recall-Oriented Understudy for Gisting Evaluation). ROUGE measures overlap between generated summaries and reference summaries based on n-gram co-occurrence statistics.
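
    A minimal scoring sketch for the ROUGE evaluation described above, assuming `pip install rouge-score pandas` and the extracted test.csv in the working directory; the candidate summary is a placeholder for your model's output:

        import pandas as pd
        from rouge_score import rouge_scorer

        test = pd.read_csv("test.csv")  # columns: article, abstract
        scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

        reference = test.loc[0, "abstract"]
        candidate = "..."  # replace with your model's generated summary
        print(scorer.score(reference, candidate))  # n-gram overlap scores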

    Note: this dataset may be updated or revised over time, so avoid relying on specific dates or version-specific examples.

    Research Ideas

    • Summarizing scientific articles: This dataset can be used to train and evaluate summarization models for the task of generating concise summaries of scientific articles from the Arxiv database. Researchers can utilize this dataset to develop novel techniques and approaches for automatic summarization in the scientific domain.
    • Information retrieval: The dataset can be used to enhance search engines or information retrieval systems by providing concise summaries along with the full text of scientific articles. This would enable users to quickly grasp key information without having to read the entire article, improving accessibility and efficiency.
    • Text generation research: Researchers interested in natural language processing and text generation can use this dataset as a benchmark for developing new models and algorithms that generate coherent, informative, and concise summaries from lengthy scientific texts. The dataset provides a diverse range of articles across various domains, allowing researchers to explore different challenges in summary generation

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain...

  4. PubMed Article Summarization Dataset

    • kaggle.com
    zip
    Updated Dec 5, 2023
    Cite
    The Devastator (2023). PubMed Article Summarization Dataset [Dataset]. https://www.kaggle.com/datasets/thedevastator/pubmed-article-summarization-dataset
    Available download formats: zip (686033678 bytes)
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    PubMed Article Summarization Dataset

    PubMed Summarization Dataset

    By ccdv (From Huggingface) [source]

    About this dataset

    The dataset consists of multiple files, including validation.csv, train.csv, and test.csv. Each file contains a combination of articles and their respective abstracts. The articles are sourced directly from PubMed, ensuring they represent a wide range of topics across various scientific disciplines.

    In order to provide reliable datasets for different purposes, the files have been carefully curated to serve specific functions. validation.csv contains a subset of articles with their corresponding abstracts that can be used for validating the performance of summarization models during development. train.csv features a larger set of article-abstract pairs specifically intended for training such models.

    Finally, test.csv serves as an independent evaluation set that allows developers to measure the effectiveness and generalizability of their summarization models against unseen data points. By using this test set, researchers can assess how well their algorithms perform in generating concise summaries that accurately capture the main findings and conclusions within scientific articles.

    Researchers in natural language processing (NLP), machine learning (ML), or any related field can utilize this dataset to advance automatic text summarization techniques focused on scientific literature. Whether it's building extractive or abstractive methods or exploring novel approaches like neural networks or transformer-based architectures, this rich dataset provides ample opportunities for experimentation and progress in the field.

    How to use the dataset

    Dataset Structure:

    • article: The full text of a scientific article from the PubMed database (Text).
    • abstract: A summary of the main findings and conclusions of the article (Text).

    Using the Dataset: To maximize the utility of this dataset, it is important to understand its purpose and how it can be utilized:

    • Training Models: The train.csv file contains articles and their corresponding abstracts that can be used for training summarization models or developing algorithms that generate concise summaries automatically.

    • Validation Purposes: The validation.csv file serves as a test set for fine-tuning your models or comparing different approaches during development.

    • Evaluating Model Performance: The test.csv file offers a separate set of articles along with their corresponding abstracts specifically designed for evaluating the performance of various summarization models.

    Tips for Utilizing the Dataset Effectively:

    • Preprocessing: Before using this dataset, consider preprocessing steps such as removing irrelevant sections (e.g., acknowledgments, references) and cleaning up invalid characters or formatting issues, if any exist (a minimal sketch follows this list).

    • Feature Engineering: Explore additional features like article length, sentence structure complexity, or domain-specific details that may assist in improving summarization model performance.

    • Model Selection & Evaluation: Experiment with different summarization algorithms, ranging from traditional extractive approaches to more advanced abstractive methods. Evaluate model performance using established metrics such as ROUGE (Recall-Oriented Understudy for Gisting Evaluation).

    • Data Augmentation: Depending on the size of your dataset, you may consider augmenting it further by applying techniques like data synthesis or employing external resources (e.g., pre-trained language models) to enhance model performance.
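
    A minimal sketch of the preprocessing tip above, assuming pandas and the locally extracted CSVs; the cleanup heuristics and file paths are illustrative, not part of the dataset:

        import re
        import pandas as pd

        train = pd.read_csv("train.csv")  # columns: article, abstract

        def clean(text: str) -> str:
            text = re.sub(r"\[[0-9,\s]+\]", "", text)  # drop bracketed citation markers
            text = re.sub(r"\s+", " ", text)           # collapse runs of whitespace
            return text.strip()

        train["article"] = train["article"].map(clean)
        print(train.loc[0, "article"][:200])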


    Research Ideas

    • Textual analysis and information retrieval: Researchers can use this dataset to analyze patterns in scientific literature or conduct information retrieval tasks. By examining the relationship between article content and its abstract, researchers can gain insights into how different sections of a scientific paper contribute to its overall summary.

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: validation.csv | Column name | Description ...

  5. Arxiv Summary Dataset

    • kaggle.com
    zip
    Updated Nov 26, 2023
    Cite
    Syndrigasti (2023). Arxiv Summary Dataset [Dataset]. https://www.kaggle.com/datasets/syndri224/arxiv-summary-dataset
    Available download formats: zip (2242895176 bytes)
    Authors
    Syndrigasti
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    @inproceedings{cohan-etal-2018-discourse,
      title = "A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents",
      author = "Cohan, Arman and Dernoncourt, Franck and Kim, Doo Soon and Bui, Trung and Kim, Seokhwan and Chang, Walter and Goharian, Nazli",
      booktitle = "Proceedings of the 2018 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)",
      month = jun,
      year = "2018",
      address = "New Orleans, Louisiana",
      publisher = "Association for Computational Linguistics",
      url = "https://aclanthology.org/N18-2097",
      doi = "10.18653/v1/N18-2097",
      pages = "615--621",
      abstract = "Neural abstractive summarization models have led to promising results in summarizing relatively short documents. We propose the first model for abstractive summarization of single, longer-form documents (e.g., research papers). Our approach consists of a new hierarchical encoder that models the discourse structure of a document, and an attentive discourse-aware decoder to generate the summary. Empirical results on two large-scale datasets of scientific papers show that our model significantly outperforms state-of-the-art models.",
    }

    Adapted from: https://github.com/armancohan/long-summarization and https://huggingface.co/datasets/ccdv/arxiv-summarization

  6. xsum

    • huggingface.co
    Updated Oct 23, 2015
    + more versions
    Cite
    EdinburghNLP - Natural Language Processing Group at the University of Edinburgh (2015). xsum [Dataset]. https://huggingface.co/datasets/EdinburghNLP/xsum
    Dataset authored and provided by
    EdinburghNLP - Natural Language Processing Group at the University of Edinburgh
    License

    https://choosealicense.com/licenses/unknown/

    Description

    Extreme Summarization (XSum) Dataset.

    There are three features:

    • document: Input news article.
    • summary: One sentence summary of the article.
    • id: BBC ID of the article.
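
    A minimal loading sketch via the `datasets` library; the EdinburghNLP/xsum loader is script-based, so recent `datasets` versions may need trust_remote_code=True:

        from datasets import load_dataset

        xsum = load_dataset("EdinburghNLP/xsum", split="validation")
        row = xsum[0]
        print(row["id"])              # BBC article ID
        print(row["summary"])         # one-sentence summary
        print(row["document"][:200])  # input news article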

  7. legal-summarization

    • huggingface.co
    Updated Sep 13, 2024
    Cite
    Sai charan chetpelly (2024). legal-summarization [Dataset]. https://huggingface.co/datasets/SaiCharanChetpelly/legal-summarization
    Authors
    Sai charan chetpelly
    License

    https://choosealicense.com/licenses/llama3.1/

    Description

    SaiCharanChetpelly/legal-summarization dataset hosted on Hugging Face and contributed by the HF Datasets community

  8. Daily Mail Summarization Dataset

    • kaggle.com
    zip
    Updated Aug 6, 2024
    Cite
    Evil Spirit05 (2024). Daily Mail Summarization Dataset [Dataset]. https://www.kaggle.com/datasets/evilspirit05/daily-mail-summarization-dataset
    Available download formats: zip (52096 bytes)
    Authors
    Evil Spirit05
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description
    The "Daily Mail Articles and Highlights" dataset comprises a meticulously curated collection of 8,176 articles, along with their corresponding highlights, sourced directly from the Daily Mail website. This extensive dataset is designed to facilitate the development and training of sophisticated text summarization models that can generate concise and accurate summaries for long-form articles.
    

    Objective

    The primary goal of this dataset is to train a text summarization model capable of producing brief, yet informative, summaries of given articles. This endeavor is particularly beneficial for readers who seek to grasp the essential points of lengthy articles quickly, thereby enhancing their reading efficiency and comprehension.
    

    Data Collection Process

    The dataset was compiled through an automated web scraping process, ensuring the inclusion of a diverse range of articles spanning various topics and categories. Each article in the dataset is paired with its highlight, which serves as a reference summary. The highlights are succinct extracts that encapsulate the core message of the articles, providing a foundation for training summarization models.
    

    https://www.dailymail.co.uk/home/index.html

    Technical Framework

    To achieve the goal of creating an efficient summarization system, we employ a combination of cutting-edge technologies and libraries, including:
    
    • Hugging Face's Transformers: A powerful library that provides pre-trained models and tools for natural language processing tasks. For this project, we leverage the DistilBERT model, known for its efficiency and performance in text summarization tasks.
    • Blurr: A library that bridges the gap between Hugging Face’s Transformers and Fastai, enabling seamless integration and enhanced model training capabilities.
    • Fastai: An accessible deep learning library that simplifies the process of building and training models. Fastai's user-friendly interface and robust functionalities are instrumental in developing and fine-tuning the summarization model.

    Implementation Strategy

    The summarization model is trained using the collected dataset, following a structured workflow (a minimal code sketch follows this list):
    
    • Preprocessing: The articles and highlights are cleaned and preprocessed to ensure consistency and quality. This step includes tokenization, normalization, and handling of special characters.
    • Model Training: Utilizing the DistilBERT model from Hugging Face's Transformers, the training process involves fine-tuning the model on the preprocessed dataset. The integration of Blurr and Fastai facilitates efficient training and model optimization.
    • Evaluation and Tuning: The model's performance is evaluated using various metrics, such as ROUGE scores, to assess the quality of the generated summaries. Continuous tuning and iteration are performed to enhance the model’s accuracy and reliability.
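
    A minimal fine-tuning sketch of the workflow above, written against plain Hugging Face Transformers (the description pairs Transformers with Blurr and Fastai; those layers are omitted here). The checkpoint, CSV name, and column names are assumptions; a distilled seq2seq checkpoint stands in because summarization is a text-generation task:

        import pandas as pd
        from datasets import Dataset
        from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                                  DataCollatorForSeq2Seq, Seq2SeqTrainer,
                                  Seq2SeqTrainingArguments)

        checkpoint = "sshleifer/distilbart-cnn-12-6"  # assumed stand-in checkpoint
        tokenizer = AutoTokenizer.from_pretrained(checkpoint)
        model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

        df = pd.read_csv("daily_mail_articles.csv")  # hypothetical file: article, highlight
        ds = Dataset.from_pandas(df)

        def tokenize(batch):
            # Encode articles as inputs and highlights as generation targets.
            enc = tokenizer(batch["article"], truncation=True, max_length=1024)
            enc["labels"] = tokenizer(text_target=batch["highlight"],
                                      truncation=True, max_length=128)["input_ids"]
            return enc

        ds = ds.map(tokenize, batched=True, remove_columns=ds.column_names)
        trainer = Seq2SeqTrainer(
            model=model,
            args=Seq2SeqTrainingArguments(output_dir="dm-summarizer",
                                          per_device_train_batch_size=2,
                                          num_train_epochs=1),
            train_dataset=ds,
            data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
        )
        trainer.train()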

    Applications

    The resulting summarization system is designed to automatically produce concise and informative summaries, which can be used in various applications, including:
    
    • News Aggregation Platforms: Providing readers with quick summaries of news articles, enhancing their ability to stay informed with minimal time investment.
    • Educational Tools: Assisting students and researchers by summarizing lengthy academic articles and papers.
    • Content Management Systems: Enabling efficient content curation and management by generating summaries for large volumes of articles.

    Conclusion

    The "Daily Mail Articles and Highlights" dataset is a valuable resource for advancing the field of text summarization. By leveraging state-of-the-art techniques and libraries, this project aims to develop a robust summarization model that can significantly improve the way we consume and process information. This dataset not only supports the creation of efficient summarization systems but also contributes to the broader goal of making information more accessible and digestible for all.
    
  9. Data from: arabic text summarization

    • kaggle.com
    • huggingface.co
    zip
    Updated Dec 8, 2022
    Cite
    Abdalrahman Shahrour (2022). arabic text summarization [Dataset]. https://www.kaggle.com/datasets/abdalrahmanshahrour/arabicsummarization/code
    Available download formats: zip (6602909 bytes)
    Authors
    Abdalrahman Shahrour
    Description

    This is Arabic news data with 9 categories in CSV format.

    Original data link: https://www.kaggle.com/datasets/muhammedfathi/arabic-news-texts-corpus
    Data preparation and summary link: https://www.kaggle.com/code/abdalrahmanshahrour/arabic-text-summarization

  10. text-summarization-logs

    • huggingface.co
    Updated Aug 3, 2024
    Cite
    Mayank Chugh (2024). text-summarization-logs [Dataset]. https://huggingface.co/datasets/mayankchugh-learning/text-summarization-logs
    Authors
    Mayank Chugh
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    mayankchugh-learning/text-summarization-logs dataset hosted on Hugging Face and contributed by the HF Datasets community

  11. News Article Summarization Dataset

    • kaggle.com
    zip
    Updated Apr 3, 2025
    Cite
    🇹🇷 Şahide Şeker, MSc (2025). News Article Summarization Dataset [Dataset]. https://www.kaggle.com/datasets/sahideseker/news-article-summarization-dataset/data
    Available download formats: zip (758 bytes)
    Authors
    🇹🇷 Şahide Şeker, MSc
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    🇺🇸 English:

    This synthetic dataset is created for learning and testing abstractive text summarization models. Each row contains a news-style article and a short summary. The dataset is ideal for experimenting with HuggingFace models such as t5-base, facebook/bart-large-cnn, or google/pegasus-xsum.

    🇹🇷 Türkçe (translated):

    This synthetic dataset is designed for those who want to generate summaries from news texts. Each row contains a long English news article and a corresponding short summary. It is compatible with T5, BART, and Pegasus models.
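
    A minimal sketch comparing the three checkpoints named above on one row; the CSV file and column names are assumptions about the download layout:

        import pandas as pd
        from transformers import pipeline

        df = pd.read_csv("news_article_summarization.csv")  # hypothetical filename
        text = df.loc[0, "article"]                         # hypothetical column name
        for name in ["t5-base", "facebook/bart-large-cnn", "google/pegasus-xsum"]:
            summarizer = pipeline("summarization", model=name)
            print(name, "->", summarizer(text, max_length=60)[0]["summary_text"])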

  12. Samsum Dataset Text summarization

    • kaggle.com
    zip
    Updated Jun 23, 2023
    Cite
    Nilesh Malode (2023). Samsum Dataset Text summarization [Dataset]. https://www.kaggle.com/datasets/nileshmalode1/samsum-dataset-text-summarization
    Available download formats: zip (8377572 bytes)
    Authors
    Nilesh Malode
    Description

    Dataset

    This dataset was created by Nilesh Malode


  13. CNN-DailyMail News Text Summarization

    • kaggle.com
    zip
    Updated Oct 23, 2021
    + more versions
    Cite
    Gowri Shankar Penugonda (2021). CNN-DailyMail News Text Summarization [Dataset]. https://www.kaggle.com/datasets/gowrishankarp/newspaper-text-summarization-cnn-dailymail/code
    Available download formats: zip (527738644 bytes)
    Authors
    Gowri Shankar Penugonda
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset Card for CNN/DailyMail Dataset

    Dataset Summary

    The CNN / DailyMail Dataset is an English-language dataset containing just over 300k unique news articles as written by journalists at CNN and the Daily Mail. The current version supports both extractive and abstractive summarization, though the original version was created for machine reading and comprehension and abstractive question answering.

    Supported Tasks and Leaderboards

    Languages

    The BCP-47 code for English as generally spoken in the United States is en-US and the BCP-47 code for English as generally spoken in the United Kingdom is en-GB. It is unknown if other varieties of English are represented in the data.

    Dataset Structure

    Data Instances

    For each instance, there is a string for the article, a string for the highlights, and a string for the id. See the CNN / Daily Mail dataset viewer to explore more examples.

    {'id': '0054d6d30dbcad772e20b22771153a2a9cbeaf62',
     'article': "(CNN) -- An American woman died aboard a cruise ship that docked at Rio de Janeiro on Tuesday, the same ship on which 86 passengers previously fell ill, according to the state-run Brazilian news agency, Agencia Brasil. The American tourist died aboard the MS Veendam, owned by cruise operator Holland America. Federal Police told Agencia Brasil that forensic doctors were investigating her death. The ship's doctors told police that the woman was elderly and suffered from diabetes and hypertension, according the agency. The other passengers came down with diarrhea prior to her death during an earlier part of the trip, the ship's doctors said. The Veendam left New York 36 days ago for a South America tour.",
     'highlights': "The elderly woman suffered from diabetes and hypertension, ship's doctors say .\nPreviously, 86 passengers had fallen ill on the ship, Agencia Brasil says ."}

    The average token counts for the articles and the highlights are provided below:

    Feature       Mean Token Count
    Article       781
    Highlights    56

    Data Fields

    • id: a string containing the hexadecimal-formatted SHA1 hash of the URL where the story was retrieved from
    • article: a string containing the body of the news article
    • highlights: a string containing the highlight of the article as written by the article author

    Data Splits

    The CNN/DailyMail dataset has 3 splits: train, validation, and test. Below are the statistics for Version 3.0.0 of the dataset.

    Dataset Split    Number of Instances in Split
    Train            287,113
    Validation       13,368
    Test             11,490

    Dataset Creation

    Curation Rationale

    Version 1.0.0 aimed to support supervised neural methodologies for machine reading and question answering with a large amount of real natural language training data and released about 313k unique articles and nearly 1M Cloze style questions to go with the articles. Versions 2.0.0 and 3.0.0 changed the structure of the dataset to support summarization rather than question answering. Version 3.0.0 provided a non-anonymized version of the data, whereas both the previous versions were preprocessed to replace named entities with unique identifier labels.

    Source Data

    Initial Data Collection and Normalization

    The data consists of news articles and...

  14. cnn_dailymail

    • huggingface.co
    • tensorflow.org
    • +1 more
    + more versions
    Cite
    ccdv, cnn_dailymail [Dataset]. https://huggingface.co/datasets/ccdv/cnn_dailymail
    Authors
    ccdv
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    CNN/DailyMail non-anonymized summarization dataset.

    There are two features:

    • article: text of news article, used as the document to be summarized
    • highlights: joined text of highlights with <s> and </s> around each highlight, which is the target summary
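
    A minimal loading sketch; the widely mirrored abisee/cnn_dailymail repository with config "3.0.0" is assumed here, since this ccdv mirror's exact config names may differ:

        from datasets import load_dataset

        cnn = load_dataset("abisee/cnn_dailymail", "3.0.0", split="validation")
        row = cnn[0]
        print(row["article"][:200])  # document to be summarized
        print(row["highlights"])     # target summary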

  15. Amharic-Text-Summarization-Benchmark-Dataset

    • huggingface.co
    Updated Jan 24, 2024
    Cite
    Daniel Mekuriaw (2024). Amharic-Text-Summarization-Benchmark-Dataset [Dataset]. https://huggingface.co/datasets/danielmekuriaw/Amharic-Text-Summarization-Benchmark-Dataset
    Authors
    Daniel Mekuriaw
    Description

    Dataset Card for Dataset Name

    This dataset card aims to be a base template for new datasets. It has been generated using this raw template.

      Dataset Details

      Dataset Description

    • Curated by: [More Information Needed]
    • Funded by [optional]: [More Information Needed]
    • Shared by [optional]: [More Information Needed]
    • Language(s) (NLP): [More Information Needed]
    • License: [More Information Needed]

      Dataset Sources [optional]

    • Repository: [More… See the full description on the dataset page: https://huggingface.co/datasets/danielmekuriaw/Amharic-Text-Summarization-Benchmark-Dataset.

  16. Allegro Articles Summarization Dataset

    • kaggle.com
    zip
    Updated Dec 5, 2023
    Cite
    The Devastator (2023). Allegro Articles Summarization Dataset [Dataset]. https://www.kaggle.com/datasets/thedevastator/allegro-articles-summarization-dataset
    Available download formats: zip (122439590 bytes)
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Allegro Articles Summarization Dataset

    Allegro Articles Summarization Source-Target Dataset

    By allegro (From Huggingface) [source]

    About this dataset

    The Source-Target Pair Dataset for Allegro Articles Summarization is a comprehensive and valuable dataset specifically tailored for training and evaluating the performance of an advanced text summarization model. The dataset comprises three distinct files: validation.csv, train.csv, and test.csv, each containing a rich collection of source-target pairs.

    In this dataset, the source column represents the original source text or article from which summarizations are to be derived. This is followed by the target column, which consists of the target summary or desired output summarization corresponding to each respective source text.

    The validation.csv file serves as a reliable resource for assessing the model's performance and effectiveness in generating accurate summaries. It contains numerous annotated examples of source-target pairings that serve as benchmarks during evaluation.

    On the other hand, train.csv encompasses meticulously curated examples of both sources and their respective target summaries. This valuable resource forms the foundation for training an automated Allegro Articles Summarization model that can effectively condense lengthy articles into concise and coherent summaries.

    Lastly, test.csv ensures rigorous testing of the trained model's generalizability by providing additional unseen instances of source-target pairs representing various types of articles across different domains. This allows for robust evaluation of how well the model can perform on real-world scenarios beyond its training data.

    The purpose behind this carefully crafted Source-Target Pair Dataset is to facilitate research and development in text summarization techniques with a specific focus on Allegro Articles Summarization tasks. By leveraging this comprehensive dataset, researchers can design more accurate and sophisticated models that significantly enhance our ability to automatically summarize long-form texts efficiently across diverse domains such as news articles, blog posts, academic papers, among others.

    In summary, through its meticulous curation and diversification across validation (validation.csv), training (train.csv), and testing (test.csv) files, this Source-Target Pair Dataset offers an invaluable resource for advancing state-of-the-art techniques in automatic Allegro Articles Summarization.

    How to use the dataset

    How to use this dataset for Allegro Articles Summarization

    Dataset Overview

    The dataset consists of three separate files: validation.csv, train.csv, and test.csv. These files contain source-target pairs that are used for training, validating, and testing the performance of the Allegro Articles Summarization model.

    Each file contains two columns:
    • source: The source text or article from which the summarization is to be generated.
    • target: The desired output summarization or target summary of the source text.

    Training Your Model

    To train your model using this dataset, you can use the train.csv file. This file contains a large number of source-target pairs that can be used for training your summarization model. You can load this data into your preferred machine learning framework or language like Python with libraries such as Pandas or NumPy.

    Here are some steps to follow while training your model:

    • Preprocessing:

      • Clean the data by removing dates if required.
      • Perform any necessary data cleaning steps such as removing special characters, lowercasing text, etc.
    • Defining a Model Architecture:

      • Choose a suitable algorithm/model architecture for article summarization. Some popular options include sequence-to-sequence models (e.g., LSTM), transformer models (e.g., BERT), or pointer-generator networks.
    • Training Process:

      • Split your data into training and validation sets.
      • Feed in the source text as input and compare it with target summaries during each epoch to optimize loss/error rate using gradient descent algorithms.
    • Hyperparameter Tuning:

      • Experiment with different hyperparameters such as learning rate, batch size, model depth, etc., to improve performance.
      • Use techniques like grid search or random search to find the optimal combination of hyperparameters.
    • Model Evaluation:

      • Evaluate your model on a separate test dataset (e.g., test.csv) that you have set aside for final evaluation.
      • Calculate metrics like ROU...
  17. long-context-text-summarization-alpaca-format

    • huggingface.co
    Updated Nov 2, 2024
    Cite
    Antash Mishra (2024). long-context-text-summarization-alpaca-format [Dataset]. https://huggingface.co/datasets/antash420/long-context-text-summarization-alpaca-format
    Authors
    Antash Mishra
    Description

    antash420/long-context-text-summarization-alpaca-format dataset hosted on Hugging Face and contributed by the HF Datasets community

  18. MLSUM - Multilingual Summarization

    • kaggle.com
    zip
    Updated Dec 2, 2023
    Cite
    The Devastator (2023). MLSUM - Multilingual Summarization [Dataset]. https://www.kaggle.com/datasets/thedevastator/mlsam-multilingual-summarization-dataset
    Available download formats: zip (1780841718 bytes)
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    MLSUM - Multilingual Summarization

    A vast multilingual dataset for summarization research

    By mlsum (From Huggingface) [source]

    About this dataset

    The MLSUM dataset, also known as the Multilingual Summarization Dataset, is a comprehensive and extensive collection of data specifically tailored for multilingual summarization tasks. With over 1.5 million meticulously curated pairs of articles and summaries, this dataset serves as an invaluable resource for researchers in the field of multilingual summarization.

    This dataset is sourced from a wide range of reputable online newspapers and encompasses articles written in five distinct languages: French, German, Spanish, Russian, and Turkish. By incorporating diverse linguistic sources, the MLSUM dataset allows for the exploration of various language-specific nuances and challenges that arise in the process of generating accurate and informative summaries.

    Each article-summary pair within this highly curated dataset has been carefully selected to ensure relevance and accuracy. The articles span across a broad spectrum of topics and domains to encompass a diverse range of subject matter. With such comprehensiveness in content coverage across multiple languages, researchers can explore various topics while keeping cultural context intact.

    The MLSUM dataset goes beyond mere translation by providing high-quality summaries that capture key information from each article concisely yet effectively. These summaries are designed to encapsulate the essence of each article while maintaining coherence and readability.

    As an unprecedentedly large-scale collection with its vast number of articles spanning multiple languages, it enables researchers to develop novel approaches towards improving multilingual summarization models by allowing them to explore cross-lingual transfer learning techniques.

    Overall, this extensive MLSUM dataset facilitates significant advancements in research pertaining to multilingual summarization tasks by offering rich resources across different languages while maintaining contextual relevance between articles and their corresponding summaries

    How to use the dataset

    Guide: How to Use the MLSUM Dataset for Multilingual Summarization Tasks

    The MLSUM dataset is a valuable resource for researchers working on multilingual summarization tasks. With over 1.5 million pairs of articles and summaries in five different languages, it offers a wide range of possibilities for training and evaluating summarization models.

    Here's a step-by-step guide on how to make the most out of this dataset:

    • Familiarize Yourself with the Dataset Structure:

      • The dataset is organized into separate files based on language and purpose (e.g., test, validation).
      • Each file contains columns such as text, summary, topic, URL, and title.
      • The text column contains the main body of the article, while the summary column contains a concise summary of the article.
      • The topic column provides information about the category or topic of each article.
    • Choose Your Target Language:

      • Decide which language you want to focus on for your multilingual summarization task.
      • Remember that MLSUM covers five languages: French, German, Spanish, Russian, and Turkish.
    • Determine Your Task:

      • Define your specific summarization task. For example:
        • Single-document summarization: Generate a succinct summary for each individual article.
        • Multidocument summarization: Generate a summary by considering multiple related articles as input.
    • Preprocess the Data:

      • Clean and preprocess the text data according to your specific needs (e.g., lowercasing letters, removing punctuation).
    • Splitting Data Into Training/Validation/Test Sets: Ensure proper separation between training data (to train your model), validation data (to tune hyperparameters), and test data (to evaluate model performance).

    • Build or Adapt Your Summarization Model: Depending on your chosen task and programming abilities, decide whether you will adapt an existing model or build a new one from scratch. You may use existing state-of-the-art models such as BART, T5, or GPT. (A minimal loading sketch follows this list.)
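
    A minimal sketch for loading one MLSUM language with the `datasets` library; the config codes ("fr", "de", "es", "ru", "tu") follow the Hugging Face hub card, and the script-based loader may need trust_remote_code=True on recent versions:

        from datasets import load_dataset

        mlsum_de = load_dataset("mlsum", "de", split="validation")
        row = mlsum_de[0]
        print(row["topic"], "|", row["title"])
        print(row["summary"])
        print(row["text"][:200])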

    Research Ideas

    • Multilingual Summarization Research: The MLSUM dataset provides a rich resource for researchers to study and develop multilingual summarization models. With over 1.5 million article/summary pairs in five different languages, the dataset can be used to train and evaluate the performance of multilingual summarization algorithms.
    • Comparative Analysis of Summariz...
  19. multi_document_summarization

    • huggingface.co
    Updated Feb 8, 2024
    Cite
    Arka Das (2024). multi_document_summarization [Dataset]. https://huggingface.co/datasets/arka0821/multi_document_summarization
    Authors
    Arka Das
    License

    https://choosealicense.com/licenses/unknown/

    Description

    Multi-Document is a large-scale multi-document summarization dataset created from scientific articles. It introduces a challenging task: writing the related-work section of a paper based on its abstract and the articles it references.

  20. Multi-News

    • opendatalab.com
    • tensorflow.org
    • +1 more
    zip
    Updated Sep 21, 2022
    Cite
    Yale University (2022). Multi-News [Dataset]. https://opendatalab.com/OpenDataLab/Multi-News
    Available download formats: zip (5650548772 bytes)
    Dataset provided by
    Yale University
    License

    https://github.com/Alex-Fabbri/Multi-News

    Description

    Multi-News consists of news articles and human-written summaries of these articles from the site newser.com. Each summary is professionally written by editors and includes links to the original articles cited. There are two features:

    • document: text of the news articles, separated by the special token "|||||"
    • summary: news summary
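
    A minimal sketch splitting one Multi-News document into its constituent articles on the "|||||" separator described above; the input string is a stand-in for one row's document field:

        document = "First article text. ||||| Second article text. ||||| Third article text."
        articles = [part.strip() for part in document.split("|||||")]
        for i, article in enumerate(articles, 1):
            print(i, article)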
