License: https://creativecommons.org/publicdomain/zero/1.0/
By ccdv (From Huggingface) [source]
The dataset consists of multiple files, including validation.csv, train.csv, and test.csv. Each file contains a combination of articles and their respective abstracts. The articles are sourced directly from PubMed, ensuring they represent a wide range of topics across various scientific disciplines.
In order to provide reliable datasets for different purposes, the files have been carefully curated to serve specific functions. validation.csv contains a subset of articles with their corresponding abstracts that can be used for validating the performance of summarization models during development. train.csv features a larger set of article-abstract pairs specifically intended for training such models.
Finally, test.csv serves as an independent evaluation set that allows developers to measure the effectiveness and generalizability of their summarization models against unseen data points. By using this test set, researchers can assess how well their algorithms perform in generating concise summaries that accurately capture the main findings and conclusions within scientific articles.
Researchers in natural language processing (NLP), machine learning (ML), or any related field can utilize this dataset to advance automatic text summarization techniques focused on scientific literature. Whether it's building extractive or abstractive methods or exploring novel approaches like neural networks or transformer-based architectures, this rich dataset provides ample opportunities for experimentation and progress in the field.
Introduction:
Dataset Structure:
- article: The full text of a scientific article from the PubMed database (Text).
- abstract: A summary of the main findings and conclusions of the article (Text).
Using the Dataset: To maximize the utility of this dataset, it is important to understand its purpose and how it can be utilized:
Training Models: The train.csv file contains articles and their corresponding abstracts that can be used for training summarization models or developing algorithms that generate concise summaries automatically.
Validation Purposes: The validation.csv file serves as a test set for fine-tuning your models or comparing different approaches during development.
Evaluating Model Performance: The test.csv file offers a separate set of articles along with their corresponding abstracts specifically designed for evaluating the performance of various summarization models.
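As a minimal sketch of this workflow (assuming the three CSV files sit in the working directory and use the article/abstract columns described above), the splits can be loaded with pandas:

    import pandas as pd

    # Load the three splits described above; the paths are assumptions.
    train = pd.read_csv("train.csv")        # article-abstract pairs for training
    valid = pd.read_csv("validation.csv")   # held-out pairs for model selection
    test = pd.read_csv("test.csv")          # unseen pairs for final evaluation

    print(train.columns.tolist())           # expected: ['article', 'abstract']
    print(train.loc[0, "abstract"][:200])   # peek at one reference summary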
Tips for Utilizing the Dataset Effectively:
Preprocessing: Before using this dataset, consider preprocessing steps such as removing irrelevant sections (e.g., acknowledgments, references), cleaning up invalid characters or formatting issues if any exist.
Feature Engineering: Explore additional features like article length, sentence structure complexity, or domain-specific details that may assist in improving summarization model performance.
Model Selection & Evaluation: Experiment with different summarization algorithms, ranging from traditional extractive approaches to more advanced abstractive methods. Evaluate model performance using established metrics such as ROUGE (Recall-Oriented Understudy for Gisting Evaluation).
Data Augmentation: Depending on the size of your dataset, you may consider augmenting it further by applying techniques like data synthesis or employing external resources (e.g., pre-trained language models) to enhance model performance.
Conclusion:
- Textual analysis and information retrieval: Researchers can use this dataset to analyze patterns in scientific literature or conduct information retrieval tasks. By examining the relationship between article content and its abstract, researchers can gain insights into how different sections of a scientific paper contribute to its overall summary.
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) Public Domain Dedication - No copyright. You can copy, modify, distribute, and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: validation.csv | Column name | Description ...
License: https://creativecommons.org/publicdomain/zero/1.0/
By ccdv (From Huggingface) [source]
The validation.csv file contains a set of articles along with their respective abstracts that can be used for validating the performance of summarization models. This subset allows researchers to fine-tune their models and measure how well they can summarize scientific texts.
The train.csv file serves as the primary training data for building summarization models. It consists of numerous articles extracted from the Arxiv database, paired with their corresponding abstracts. By utilizing this file, researchers can develop and train various machine learning algorithms to generate accurate summaries of scientific papers.
Lastly, the test.csv file provides a separate set of articles with accompanying abstracts specifically intended for evaluating the performance and effectiveness of summarization models developed using this dataset. Researchers can utilize this test set to conduct rigorous evaluations and benchmark different approaches in automatic document summarization.
With its article and abstract columns, this dataset provides significant flexibility for developing robust models for summarizing complex scientific documents.
Introduction:
File Description:
validation.csv: This file contains articles and their respective abstracts that can be used for validation purposes.
train.csv: The purpose of this file is to provide training data for summarizing scientific articles.
test.csv: This file includes a set of articles and their corresponding abstracts that can be used to evaluate the performance of summarization models.
Dataset Structure: The dataset consists of two columns, article and abstract.
Usage Examples: This dataset can be utilized in various ways:
a) Training Models: You can use the train.csv file to train your own model for summarizing scientific articles from the Arxiv database. The article column provides the full text of each scientific paper, while the abstract column contains its summary.
b) Validation: The validation.csv file allows you to validate your trained models by comparing their generated summaries with the provided reference summaries in order to assess their performance.
c) Evaluation: Utilize the test.csv file as a benchmark for evaluating different summarization models. Generate summaries using your selected model and compare them with reference summaries.
- Evaluating Performance: To measure how well your summarization model performs on this dataset, you can use metrics like ROUGE (Recall-Oriented Understudy for Gisting Evaluation). ROUGE measures overlap between generated summaries and reference summaries based on n-gram co-occurrence statistics.
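As a minimal sketch of such an evaluation (using the open-source rouge-score package, which is one common choice rather than anything mandated by this dataset; the two strings are illustrative placeholders):

    from rouge_score import rouge_scorer

    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    reference = "the study reports reduced error rates across all benchmarks"  # placeholder reference summary
    candidate = "error rates were reduced across all benchmarks in the study"  # placeholder model output
    scores = scorer.score(reference, candidate)
    for name, s in scores.items():
        # Each entry holds n-gram precision, recall, and F1 (fmeasure).
        print(f"{name}: P={s.precision:.3f} R={s.recall:.3f} F1={s.fmeasure:.3f}")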
- Summarizing scientific articles: This dataset can be used to train and evaluate summarization models for the task of generating concise summaries of scientific articles from the Arxiv database. Researchers can utilize this dataset to develop novel techniques and approaches for automatic summarization in the scientific domain.
- Information retrieval: The dataset can be used to enhance search engines or information retrieval systems by providing concise summaries along with the full text of scientific articles. This would enable users to quickly grasp key information without having to read the entire article, improving accessibility and efficiency.
- Text generation research: Researchers interested in natural language processing and text generation can use this dataset as a benchmark for developing new models and algorithms that generate coherent, informative, and concise summaries from lengthy scientific texts. The dataset provides a diverse range of articles across various domains, allowing researchers to explore different challenges in summary generation.
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain...
PubMed dataset for summarization
Dataset for summarization of long documents. Adapted from this repo. Note that the original data are pre-tokenized, so this dataset returns " ".join(text) and adds " " for paragraphs. This dataset is compatible with the run_summarization.py script from Transformers if you add this line to the summarization_name_mapping variable: "ccdv/pubmed-summarization": ("article", "abstract")
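A minimal loading sketch with the Hugging Face datasets library (the repository id comes from the dataset page below; a configuration name may additionally be required depending on the dataset version):

    from datasets import load_dataset

    # Repo id as given on the dataset page; check the card if a config name is required.
    ds = load_dataset("ccdv/pubmed-summarization")
    print(ds)                                  # expected splits: train/validation/test
    print(ds["train"][0]["abstract"][:200])    # one reference abstract

    # For Transformers' run_summarization.py, extend the mapping as described above:
    # summarization_name_mapping["ccdv/pubmed-summarization"] = ("article", "abstract")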
Data Fields
- id: paper id
- article: a string containing the body…
See the full description on the dataset page: https://huggingface.co/datasets/ccdv/pubmed-summarization.
This work was supported by the Ho Chi Minh City Department of Science and Technology, Grant Number 15/2016/HÐ-SKHCN.
Data construction process: In this work, we aim to have 300 clusters of documents extracted from news. To this end, we made use of the Vietnamese language version of Google News. Due to the copyright issue, we did not collect articles from every source listed on Google News, but limited to some sources that are open for research purposes. The collected articles belong to five genres: world news, domestic news, business, entertainment, and sports. Every cluster contains from four to ten news articles. Each article is represented by the following information: the title, the plain text content, the news source, the date of publication, the author(s), the tag(s) and the headline summary.
After that, two summaries are created for each cluster (produced in the first subtask above) by two different annotators using the MDSWriter system (Meyer, Christian M., et al. "MDSWriter: Annotation tool for creating high-quality multi-document summarization corpora." Proceedings of ACL-2016 System Demonstrations). The annotators are Vietnamese native speakers who are undergraduate or graduate students, most of them familiar with natural language processing. The full annotation process consists of seven steps that must be completed sequentially, from the first to the seventh.
Data information: Original folder: Containing 300 subdirectories which are 300 news clusters. Articles (documents) in each cluster belong to a similar topic and there are from four to ten of them. The number of articles is 1,945.
Summary folder: Contains 300 subdirectories holding 600 final summaries in total. Every input cluster has two manual abstractive summaries from two different annotators. ViMs can be used for both implementing and evaluating supervised machine learning-based systems for Vietnamese abstractive multi-document summarization.
S3_summary folder: Contains 300 subdirectories holding 600 "best sentence selection" summaries, the result of step 3, the best sentence selection step. Sentences in a group are separated from other groups by a blank line. The most important sentence is labeled 1, while 0 is the label for the others.
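A hypothetical parsing sketch for such a file, assuming each non-blank line pairs a label and a sentence separated by a tab (the exact line layout is not documented here, so inspect a real file before relying on this):

    def parse_s3_summary(path):
        """Parse groups of (label, sentence) pairs; blank lines separate groups."""
        groups, current = [], []
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line:                 # blank line closes the current group
                    if current:
                        groups.append(current)
                        current = []
                    continue
                label, sentence = line.split("\t", 1)   # assumed "<label>\t<sentence>" layout
                current.append((int(label), sentence))  # label 1 marks the most important sentence
        if current:
            groups.append(current)
        return groups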
@article{tran2020vims,
  title={ViMs: a high-quality Vietnamese dataset for abstractive multi-document summarization},
  author={Tran, Nhi-Thao and Nghiem, Minh-Quoc and Nguyen, Nhung TH and Nguyen, Ngan Luu-Thuy and Van Chi, Nam and Dinh, Dien},
  journal={Language Resources and Evaluation},
  volume={54},
  number={4},
  pages={893--920},
  year={2020},
  publisher={Springer}
}
Author: Tran Mai Vu, Vu Trong Hoa, Phi Van Thuy, Le Duc Trong, Ha Quang Thuy Affiliation: Knowledge Technology Laboratory, University of Technology, VNU Hanoi Research Topic: Design and Implementation of a Multi-document Summarization Program for the Vietnamese Language, Funded by the Ministry of Education (Project Code: B2012-01-24)
Data construction process: The data construction process is entirely manual. It consists of two steps:
Data information: Data Volume: 200 clusters
Each cluster corresponds to a folder, and it typically contains 2-5 documents (often 3). The folder's name represents the cluster.
Within each folder:
All files within the same folder represent documents (online articles) belonging to the cluster:
AutoTrain Dataset for project: summarization
Dataset Description
This dataset has been automatically processed by AutoTrain for project summarization.
Languages
The BCP-47 code for the dataset's language is en.
Dataset Structure
Data Instances
A sample from this dataset looks as follows:

    [ { "feat_id": "train_0", "text": "#Person1#: Hi, Mr. Smith. I'm Doctor Hawkins. Why are you here today?
Values in parentheses are for the highest resolution shell.
License: https://clarin.si/repository/xmlui/page/licence-aca-id-by-nc-inf-nored-1.0
Summarization datasets were created from the text bodies in the KAS 2.0 corpus (http://hdl.handle.net/11356/1448) and the abstracts from the KAS-Abs 2.0 corpus (http://hdl.handle.net/11356/1449). The monolingual slo2slo dataset contains 69,730 Slovene abstracts and Slovene body texts. The cross-lingual slo2eng dataset contains 52,351 Slovene body texts and English abstracts, making it suitable for building cross-lingual summarization models. The total number of words represents the sum of words in bodies, Slovene abstracts, and English abstracts.
The files are stored in the same manner as the complete KAS corpus, i.e. in 1,000 directories with the same filename prefix as in KAS. They are in the JSON format that contains chapter segmented text. In addition to a unique chapter ID, each JSON file contains a key titled “abstract” that contains a list with abstract text as its first element. The file with the metadata for the corpus texts is also included.
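As a small sketch of reading one such file (the filename is hypothetical; only the "abstract" key and its list layout are taken from the description above):

    import json

    # Hypothetical path into one of the 1,000 KAS-style directories.
    with open("kas-001/kas-001234.json", encoding="utf-8") as f:
        doc = json.load(f)

    # Per the description, "abstract" holds a list whose first element is the abstract text.
    abstract = doc["abstract"][0]
    print(abstract[:300])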
The datasets are suitable for training monolingual Slovene summarization models and cross-lingual Slovene-English summarization models on long texts.
References: Žagar, A., Kavaš, M., & Robnik Šikonja, M. (2021). Corpus KAS 2.0: cleaner and with new datasets. In Information Society - IS 2021: Proceedings of the 24th International Multiconference. https://doi.org/10.5281/zenodo.5562228
License: Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains the following 3 datasets for legal document summarization :
- IN-Abs : Indian Supreme Court case documents & their `abstractive' summaries, obtained from http://www.liiofindia.org/in/cases/cen/INSC/
- IN-Ext : Indian Supreme Court case documents & their `extractive' summaries, written by two law experts (A1, A2).
- UK-Abs : United Kingdom (U.K.) Supreme Court case documents & their `abstractive' summaries, obtained from https://www.supremecourt.uk/decided-cases/
Please refer to the paper and the README file for more details.
License: Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Russian summarization data mix
Total number of items in the train set: 197,561. Total number of items in the golden test set: 258 (manually verified semi-synthetic data for evaluation purposes).
We use these datasets for the training mix:
- XLSum
- Gazeta
- WikiLingua
- MLSUM
- Reviews (ru)
- Curation-corpus (ru)
- Matreshka
- DialogSum (ru)
- SAMSum (ru)
Cite us
@misc{akhmetgareeva2024summary, title={Towards Russian Summarization: can architecture solve data limitations problems?}… See the full description on the dataset page: https://huggingface.co/datasets/RussianNLP/Mixed-Summarization-Dataset.
License: Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
INTRODUCTION
As part of its responsibilities, the BC Ministry of Environment monitors water quality in the province's streams, rivers, and lakes. Often, it is necessary to compile statistics involving concentrations of contaminants or other compounds. Quite often the instruments used cannot measure concentrations below certain values. These observations are called non-detects or less-thans. However, non-detects pose a difficulty when it is necessary to compute statistical measurements such as the mean, the median, and the standard deviation for a data set. The way non-detects are handled can affect the quality of any statistics generated.
Non-detects, or censored data, are found in many fields such as medicine, engineering, biology, and environmetrics. In such fields, it is often the case that the measurements of interest are below some threshold. Dealing with non-detects is a significant issue, and statistical tools using survival or reliability methods have been developed. Basically, there are three approaches for treating data containing censored values:
1. substitution, which gives poor results and therefore is not recommended in the literature;
2. maximum likelihood estimation, which requires an assumption of some distributional form; and
3. nonparametric methods, which assess the shape of the data based on observed percentiles rather than a strict distributional form.
This document provides guidance on how to record censored data, and on when and how to use certain analysis methods when the percentage of censored observations is less than 50%. The methods presented in this document are:
1. substitution;
2. Kaplan-Meier, as part of nonparametric methods;
3. the lognormal model based on maximum likelihood estimation; and
4. robust regression on order statistics, which is a semiparametric method.
Statistical software suitable for survival or reliability analysis is available for dealing with censored data. This software has been widely used in medical and engineering environments. In this document, methods are illustrated with both the R and JMP software packages, where possible. JMP often requires some intermediate steps to obtain summary statistics with most of the methods described in this document. R, with the NADA package, is usually straightforward. The NADA package was developed specifically for computing statistics with non-detects in environmental data, based on Helsel (2005b). The data used to illustrate the methods described for computing summary statistics for non-detects are either simulated or based on information acquired from the B.C. Ministry of Environment. This document is strongly based on the book Nondetects And Data Analysis written by Dennis R. Helsel in 2005 (Helsel, 2005b).
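The document illustrates these methods with R (NADA) and JMP; as a rough Python analogue of the lognormal maximum likelihood approach, here is a sketch using SciPy's censored-data support (SciPy >= 1.11; the values and detection limits are made up):

    import numpy as np
    from scipy import stats

    # Measured concentrations; non-detects are recorded at their detection limit.
    obs = np.array([0.5, 0.5, 0.8, 1.2, 2.0, 3.5, 0.5])
    nd = np.array([True, True, False, False, False, False, True])  # True = non-detect (left-censored)

    data = stats.CensoredData.left_censored(obs, censored=nd)
    shape, loc, scale = stats.lognorm.fit(data, floc=0)  # MLE under a lognormal model

    # Lognormal mean from the fitted parameters: exp(mu + sigma^2/2), with scale = exp(mu).
    mean = scale * np.exp(shape**2 / 2)
    print(f"estimated mean concentration: {mean:.3f}")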
The Big Data Interagency Working Group (BD IWG) held a workshop, Measuring the Impact of Digital Repositories, on February 28 - March 1, 2017 in Arlington, VA. The aim of the workshop was to identify current assessment metrics, tools, and methodologies that are effective in measuring the impact of digital data repositories, and to identify the assessment issues, obstacles, and tools that require additional research and development (R&D). This workshop brought together leaders from academic, journal, government, and international data repository funders, users, and developers to discuss these issues...
This dataset is used in an experimental preference fine-tuning of the Qwen2-1.5B model for the summarization task. The goal is to re-implement Apple's work on training task-specific LoRAs on top of a small LM, for example for summarization. More info on the project: https://github.com/thepowerfuldeez/qwen2_1_5b_summarize
Method
The dataset was generated using samples from the RedPajamaV2 dataset, specifically Arxiv, Wikipedia, and StackExchange documents. I have downloaded 1% of the data and… See the full description on the dataset page: https://huggingface.co/datasets/thepowerfuldeez/Qwen-summarize-dataset-train.
Under the new quarterly data summary (QDS) framework, departments' spending data is published every quarter to show the taxpayer how the government is spending their money.
The QDS grew out of commitments made in the 2011 Budget and the written ministerial statement on business plans. For the financial year 2012 to 2013 the QDS has been revised and improved in line with action 9 of the Civil Service Reform Plan to provide a common set of data that will enable comparisons of operational performance across government so that departments and individuals can be held to account.
The QDS breaks down the total spend of the department in 3 ways:
The QDS template is the same for all departments, though the individual detail of grants and policy will differ from department to department. In using this data:
Please note that the quarter 1 2012 to 2013 return for the Department of Transport (DfT) is for the core department only.
Quarterly data summaries for April 2012 are as follows:
Quarterly data summaries for January 2012 are as follows:
Quarterly data summaries for October 2011 are as follows:
License: U.S. Government Works https://www.usa.gov/government-works
License information was derived automatically
NASA's exploration and scientific missions will produce terabytes of information. As NASA enters a new phase of space exploration, managing large amounts of scientific and operational data will become even more challenging. Robots conducting planetary exploration will produce data for selection and preparation of exploration sites. Robots and space probes will collect scientific data to improve understanding of the solar system. Satellites in low Earth orbit will collect data for monitoring changes in the Earth's atmosphere and surface environment. Key challenges for all these missions are understanding and summarizing what data have been collected and using this knowledge to improve data access. TRACLabs and CMU propose to develop context aware image manipulation software for managing data collected remotely during NASA missions. This software will filter and search large image archives using the temporal and spatial characteristics of images, and the robotic, instrument, and environmental conditions when images were taken. It also will implement techniques for finding which images show a terrain feature specified by the user. In Phase II we will implement this software and evaluate its effectiveness for NASA missions. At the end of Phase II, context aware image manipulation software at TRL 5-6 will be delivered to NASA.
License: Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
A text summarisation task aims to convert a longer text into a shorter text while preserving the essential information of the source text. In general, there are two approaches to text summarization. The extractive approach simply rewrites the most important sentences or parts of the text, whereas the abstractive approach is more similar to human-made summaries. We release 5 models that cover extractive, abstractive, and hybrid types:
- Metamodel: a neural model based on the Doc2Vec document representation that suggests the best summariser.
- Graph-based model: an unsupervised graph-based extractive approach that returns the N most relevant sentences.
- Headline model: a supervised abstractive approach (T5 architecture) that returns headline-like abstracts.
- Article model: a supervised abstractive approach (T5 architecture) that returns short summaries.
- Hybrid-long model: an unsupervised hybrid (graph-based and transformer-based) approach that returns short summaries of long texts.
Details and instructions to run and train the models are available at https://github.com/clarinsi/SloSummarizer.
The web service with a demo is available at https://slovenscina.eu/povzemanje.
qPCR results for Vitellogenin. This dataset is associated with the following publication: Armstrong, B., J. Lazorchak, K. Jensen, H. Haring, M.E. Smith, R. Flick, D. Bencic, and A. Biales. Reproductive effects in fathead minnows (Pimephales promelas) following a 21 d exposure to 17α-ethinylestradiol. CHEMOSPHERE. Elsevier Science Ltd, New York, NY, USA, 144(1): 366-373, (2015).
License: http://opendatacommons.org/licenses/dbcl/1.0/
This dataset comprises tables in JSON format, each accompanied by a concise summary of approximately 10 lines. Originally created to train large language models (LLMs) for efficient tabular data summarization, this dataset serves as a valuable resource for developing advanced summarization algorithms. The tables contain a mix of textual and numerical data, reflecting a wide array of information types. Each summary captures the essential insights and key patterns from the corresponding table, providing a clear and succinct overview. The dataset's diverse entries, ranging from descriptive text to precise numerical values, present a robust challenge for LLMs, facilitating the enhancement of their summarization capabilities. By leveraging this dataset, researchers and developers can improve the accuracy and efficiency of their models in interpreting and summarizing complex tabular data.
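As an illustrative sketch of how such a record might be turned into a training pair (the field names "table" and "summary" are assumptions, not documented here):

    import json

    # A made-up record in the assumed shape of one dataset entry.
    record = {
        "table": [{"city": "Oslo", "population": 709037}],
        "summary": "The table lists Oslo with a population of roughly 709,000.",
    }

    prompt = "Summarize the following table:\n" + json.dumps(record["table"], indent=2)
    target = record["summary"]
    print(prompt)
    print("Target:", target)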
License: https://choosealicense.com/licenses/unknown/
varunr14/summarize dataset hosted on Hugging Face and contributed by the HF Datasets community
License: Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The collection of datasets in Table 1 is extended, and their more meaningful, and thus recommended, descriptions based on multiplicative means and multiplicative standard errors or standard deviations are given. Some comparisons appear to be of interest. Necessarily, arithmetic means exceed multiplicative ones, starting from some 15% for small s* around 1.7 up to more than sevenfold for s* > 7. The lower limits of the 95% ranges, relative to the means, turn increasingly negative as s* grows for the classical version, but remain positive and get smaller for the multiplicative description. Turning to upper limits, the multiplicative limit exceeds the additive one by some 17% for s* = 1.7. With s* = 2.5, the difference is about 25%. For s* = 4.2, there is no difference, and for s* = 7, the additive mean is only half the multiplicative one.
This collection consists of data that has been quality controlled through both automated and manual QC processes. The dataset consists of summary-of-the-day data from approximately 500 First Order stations (primarily major airports) in the U.S. and its possessions. Included variables are: daily summaries of hourly data for temperature, precipitation, pressure, observed weather, and wind speed and direction. The primary source of this information is the Automated Surface Observation System (ASOS); however, other collections are also included, such as SRRS, keyed data, and others -- see the documentation for a complete list. Data were incorporated into the GHCN-Daily dataset as of 2013.