https://choosealicense.com/licenses/bsd-3-clause/
BOOKSUM: A Collection of Datasets for Long-form Narrative Summarization
Authors: Wojciech Kryściński, Nazneen Rajani, Divyansh Agarwal, Caiming Xiong, Dragomir Radev
Introduction
The majority of available text summarization datasets include short-form source documents that lack long-range causal and temporal dependencies, and often contain strong layout and stylistic biases. While relevant, such datasets will offer limited challenges for future generations of text… See the full description on the dataset page: https://huggingface.co/datasets/kmfoda/booksum.

BookSum: A Collection of Datasets for Long-form Narrative Summarization
This implementation currently only supports book and chapter summaries.
GitHub: https://github.com/salesforce/booksum
To use this dataset:
import tensorflow_datasets as tfds
# Load the training split of the BookSum dataset.
ds = tfds.load('booksum', split='train')
# Print the first four examples.
for ex in ds.take(4):
    print(ex)
See the guide for more information on tensorflow_datasets.

MIT License: https://opensource.org/licenses/MIT
License information was derived automatically

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Video Storytelling is a dataset for generating textual stories or summaries for videos of social events. It consists of 105 videos from four categories: birthday, camping, Christmas, and wedding. For each video, we provide at least 5 human-written stories.
Videos are contained in the .tar file with their corresponding category name.
Text stories are contained in Text.tar.
In each txt file, the first line is the video id. The start and end times (in seconds) of each sentence are also given.
test_id.txt provides the ids of the videos in the test set.
Please cite the following paper if you use the Video Storytelling dataset in your work (papers, articles, reports, books, software, etc):
Video Storytelling: Textual Summaries for Events. J. Li, Y. Wong, Q. Zhao, M. Kankanhalli. IEEE Transactions on Multimedia.

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Purpose: Summarizing expository passages is a critical academic skill that is understudied in language research. The purpose of this study was to compare the quality of verbal summaries produced by adolescents for 3 different discourse types and to determine whether a composite measure of cognitive skill or a test of expressive syntax predicted their performance.
Method: Fifty adolescents listened to, and then verbally summarized, 1 narrative and 2 expository lectures (compare–contrast and cause–effect). They also participated in testing that targeted expressive syntax and 5 cognitive subdomains.
Results: Summary quality scores were significantly different across discourse types, with a medium effect size. Analyses revealed significantly higher summary quality scores for cause–effect than compare–contrast summaries. Although the composite cognitive measure contributed significantly to the prediction of quality scores for both types of expository summaries, the expressive syntax score only contributed significantly to the quality scores for narrative summaries.
Conclusions: These results support previous research indicating that type of expository discourse may impact student performance. These results also show, for the first time, that cognition may play a predictive role in determining summary quality for expository but not narrative passages in this population. In addition, despite the more complex syntax commonly associated with exposition versus narratives, an expressive syntax score was only predictive of performance on narrative summaries. These findings provide new information, questions, and directions for future research for those who study academic discourse and for professionals who must identify and manage the problems of students struggling with different types of academic discourse.
Supplemental Material S1. Descriptive block-level U.S. Census values for participants and rotated structure matrix for principal component analysis with Varimax rotation of socioeconomic status (SES) variables.
Supplemental Material S2. Descriptions of compare–contrast, cause–effect, and narrative lectures.
Supplemental Material S3. Tests used from the National Institutes of Health Toolbox Cognition Battery.
Supplemental Material S4. Pearson correlations for Expressive Syntax score, MLCU, and SI for compare–contrast, cause–effect, and narrative summaries (N = 50).
Supplemental Material S5. Pearson correlations for total summarization quality scores for compare–contrast, cause–effect, and narrative lectures, age, socioeconomic status (SES) factors, cognitive composite score, and expressive syntax score (N = 48).
Lundine, J. P., Harnish, S. M., McCauley, R. J., Blackett, D. S., Zezinka, A., Chen, W., & Fox, R. A. (2018). Adolescent summaries of narrative and expository discourse: Differences and predictors. Language, Speech, and Hearing Services in Schools, 49, 551–568. https://doi.org/10.1044/2018_LSHSS-17-0105

Digital health technologies used in primary care, referred to as virtual primary care, allow patients to interact with primary healthcare professionals remotely, though the current iteration of virtual primary care may also come with several unintended consequences, such as accessibility barriers and cream skimming. The World Health Organization (WHO) has a well-established framework to understand the functional components of health systems. However, the existing building blocks framework does not sufficiently account for the disruptive and multi-modal impact of digital transformations. In this review, we aimed to develop the first iteration of this updated framework by reviewing the deployment of virtual primary care systems in five leading countries: Canada, Finland, Germany, Sweden, and the United Kingdom (England). We found that all five countries have taken different approaches to the deployment of virtual primary care, yet seven common themes were highlighted across countries: (1) stated policy objectives, (2) regulation and governance, (3) financing and reimbursement, (4) delivery and integration, (5) workforce training and support, (6) IT systems and data sharing, and (7) the extent of patient involvement in the virtual primary care system. The conceptual framework that was derived from these findings offers a set of guiding principles that can facilitate the assessment of virtual primary care in health system settings.

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Text summarization condenses extensive content into concise summaries; however, current approaches often rely on large language models (LLMs), which can lack interpretability and are susceptible to generating hallucinated content. To address these issues, we propose Docusage, an interpretable framework that replicates human summaries through a hierarchical clustering approach combined with extractive summarization, augmented by selective, LLM-based abstraction. Docusage minimizes the risk of hallucinations, ensures contextual relevance, and mitigates the computational costs inherent in leveraging an LLM. Our results show that Docusage aligns closely with journalist-generated summaries, outperforming foundational and specialized models. Additionally, Docusage offers an interpretable framework that is not constrained by context size, ensures transparency regarding the role of extracted sentences within the narrative, and adapts to the style of the training data.

https://india-data.org/terms-conditions
This is a dataset of two popular crime thriller TV shows, "24" and "Prison Break", crafted for the story-summarization task. The inputs consist of per-episode frame embeddings generated from the CLIP vision encoder, MViT, and DenseNet, as well as utterance embeddings generated from a fine-tuned RoBERTa encoder. For the outputs, recap signals are used to form story-summary labels, cached as per-shot and per-utterance scores for each episode. The dataset contains 205 episodes in total.

The dataset is a .csv file consisting of 10 columns:
"subject" is the number assigned to each of the adolescents in the study (1, ..., 55).
"lecture_type" is either cc (compare-contrast), ce (cause-effect), or n (narrative); each of the 55 subjects has a row for each lecture type.
"development_type" is collected at the subject level and is either TD (typically developing) or TBI (traumatic brain injury).
"sex" (Male/Female), "age" (13-19), and "ses" (a summary of socioeconomic status, given as a standardized z-value) are also collected at the subject level.
"U" (>=1) is the total number of utterances in the discourse.
"C" (>=U) is the total number of clauses in the discourse.
"W" (>=C) is the total number of words in the discourse.
"D" (<=W) is the total number of distinct words in the discourse.
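The column constraints described above (U >= 1, C >= U, W >= C, D <= W, and the allowed category codes) can be checked mechanically. The following is a minimal sketch using only the Python standard library; the sample rows are made up for illustration, and the `check_row` and `validate` helpers are hypothetical names, not part of the dataset.

```python
import csv
import io

VALID_LECTURE_TYPES = {"cc", "ce", "n"}
VALID_DEVELOPMENT_TYPES = {"TD", "TBI"}


def check_row(row):
    """Return a list of constraint violations for one CSV row (empty if none)."""
    problems = []
    if row["lecture_type"] not in VALID_LECTURE_TYPES:
        problems.append("unknown lecture_type")
    if row["development_type"] not in VALID_DEVELOPMENT_TYPES:
        problems.append("unknown development_type")
    u, c, w, d = (int(row[key]) for key in ("U", "C", "W", "D"))
    if u < 1:
        problems.append("U < 1")
    if c < u:
        problems.append("C < U")
    if w < c:
        problems.append("W < C")
    if d > w:
        problems.append("D > W")
    return problems


def validate(csv_text):
    """Map each 1-based data row number to its list of violations."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {i: check_row(row) for i, row in enumerate(reader, start=1)}


# Made-up sample rows (not real data); the second row breaks C >= U on purpose.
SAMPLE = """subject,lecture_type,development_type,sex,age,ses,U,C,W,D
1,cc,TD,Female,15,0.21,12,20,150,80
1,ce,TD,Female,15,0.21,10,8,120,70
"""

report = validate(SAMPLE)
print(report)
```

To check the real file, the same `validate` function could be given the file's text, for example the result of reading it with `open(path, newline="").read()`.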

The CrisisFACTS track focuses on temporal summarization for first responders in emergency situations. These summaries differ from traditional summarization in that they order information by time and produce a series of short updates instead of a longer narrative.

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Main points narrative for creative dissemination development

REQUIRED: A brief narrative summary of the data set.

MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset contains 1,500 profiles of fictional characters, each described across 15 diverse and creative columns. The dataset offers a rich variety of character attributes and narrative elements, designed to support a wide range of Natural Language Processing (NLP), generative AI, and storytelling applications.
Character Name
Media Type (e.g., Novel, Movie, Webcomic, TV Show, Video Game)
Media Source (fictional title/source)
Genre (e.g., Fantasy, Sci-Fi, Mystery, Romance, Horror, Thriller)
Role (e.g., Protagonist, Antagonist, Sidekick, Mentor, Villain, Hero)
Personality Traits (comma-separated adjectives)
Backstory (short narrative)
Skills/Abilities (comma-separated)
Appearance Description (physical summary)
Alignment (Hero, Villain, Neutral)
Interests/Hobbies
Relationships (summary of key connections)
Significance/Impact (their importance in the story)
Description (detailed narrative)
Scenario/Dialogue Example (sample interaction or scenario)
| Column Name | Description |
|---|---|
| Character Name | Full name of the fictional character |
| Media Type | Origin medium (Novel, Movie, etc.) |
| Media Source | Fictional title or work |
| Genre | Genre classification |
| Role | Narrative role (Protagonist, Antagonist, etc.) |
| Personality Traits | Key adjectives describing personality |
| Backstory | Brief background story |
| Skills/Abilities | Notable skills or powers |
| Appearance Description | Physical or visual description |
| Alignment | Moral alignment |
| Interests/Hobbies | Activities or interests |
| Relationships | Key relationships or connections |
| Significance/Impact | Importance or influence in their story |
| Description | Detailed narrative description |
| Scenario/Dialogue Example | Example scenario or dialogue for context |
Fictional character datasets are valuable for advancing research in text generation, character modeling, and creative AI. This dataset is ideal for anyone looking to experiment with synthetic narrative data, prototype new storytelling tools, or benchmark NLP models on character-driven content.
Data generated using Python Faker and randomization.
No real persons or copyrighted works are included.
CC0: Public Domain. This dataset is fully free to use for any purpose.
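The entry states that the data were generated with Python Faker and randomization. The sketch below illustrates only the randomization idea, using the standard library `random` module in place of Faker. The value pools, sentence templates, and the `make_profile` helper are invented for illustration; only the 15 column names come from the column table in this entry.

```python
import random

COLUMNS = [
    "Character Name", "Media Type", "Media Source", "Genre", "Role",
    "Personality Traits", "Backstory", "Skills/Abilities",
    "Appearance Description", "Alignment", "Interests/Hobbies",
    "Relationships", "Significance/Impact", "Description",
    "Scenario/Dialogue Example",
]

# Hypothetical value pools; the real generator's vocabularies are not documented.
FIRST_NAMES = ["Aria", "Dax", "Mira", "Tobin"]
LAST_NAMES = ["Vale", "Korr", "Ashdown", "Quill"]
MEDIA_TYPES = ["Novel", "Movie", "Webcomic", "TV Show", "Video Game"]
GENRES = ["Fantasy", "Sci-Fi", "Mystery", "Romance", "Horror", "Thriller"]
ROLES = ["Protagonist", "Antagonist", "Sidekick", "Mentor", "Villain", "Hero"]
ALIGNMENTS = ["Hero", "Villain", "Neutral"]
TRAITS = ["brave", "cunning", "loyal", "aloof", "curious"]
SKILLS = ["swordplay", "hacking", "healing", "stealth"]
HOBBIES = ["chess", "gardening", "stargazing", "cooking"]


def make_profile(rng):
    """Build one fictional character profile as a dict keyed by column name."""
    name = f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}"
    role = rng.choice(ROLES)
    genre = rng.choice(GENRES)
    traits = rng.sample(TRAITS, 3)
    values = [
        name,
        rng.choice(MEDIA_TYPES),
        f"The {rng.choice(LAST_NAMES)} Chronicles",
        genre,
        role,
        ", ".join(traits),
        f"{name} grew up far from home.",
        ", ".join(rng.sample(SKILLS, 2)),
        "Tall, with a weathered coat.",
        rng.choice(ALIGNMENTS),
        rng.choice(HOBBIES),
        f"Close ally of {rng.choice(FIRST_NAMES)}.",
        f"Drives the main conflict as the {role.lower()}.",
        f"{name} is a {traits[0]} {role.lower()} in a {genre.lower()} story.",
        f"{name}: 'I will handle this.'",
    ]
    return dict(zip(COLUMNS, values))


rng = random.Random(0)  # fixed seed so repeated runs give the same profiles
profiles = [make_profile(rng) for _ in range(5)]
print(profiles[0]["Character Name"])
```

Passing a seeded `random.Random` instance into the helper keeps the output reproducible, which is convenient when a synthetic dataset needs to be regenerated.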

This annual narrative report for Big Muddy National Fish and Wildlife Refuge summarizes refuge activities during the fiscal years 1994-1997. The report begins with an introduction to the refuge and a summary of the year's highlights and climatic conditions. Information about monitoring and studies, including fishery surveys, amphibian monitoring, and seasonal flooding, is provided next. Habitat and wildlife management were not discussed because the refuge was in its early stages of development. Coordination activities, such as private land activities and cooperative organizations, are outlined. The resource protection section provides information about law enforcement, water rights, and land acquisition. Information about public education and recreation is given, including visitor services and refuge visitation. Finally, refuge planning and administration are discussed.

https://choosealicense.com/licenses/other/
Fine-tune Data BookSum: BookSum is a dataset for long context summarization. It includes a vast collection of books from various genres, and the task is to generate a coherent and concise summary given a long context from the book. This dataset is designed to test and train models on their ability to understand and summarize long, complex narratives. to convert to binidx format.

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Humans are entertained and emotionally captivated by a good story. Artworks, such as operas, theatre plays, movies, TV series, cartoons, etc., contain implicit stories, which are conveyed visually (e.g., through scenes) and audially (e.g., via music and speech). Story theorists have explored the structure of various artworks and identified forms and paradigms that are common to most well-written stories. Further, typical story structures have been formalized in different ways and used by professional screenwriters as guidelines. Currently, computers cannot yet identify such a latent narrative structure of a movie story. Therefore, in this work, we raise the novel challenge of understanding and formulating the movie story structure and introduce the first ever story-based labeled dataset, the Flintstones Scene Dataset (FSD). The dataset consists of 1,569 scenes taken from a manual annotation of 60 episodes of a famous cartoon series, The Flintstones, by 105 distinct annotators. The various labels assigned to each scene by different annotators are summarized by a probability vector over 10 possible story elements representing the function of each scene in the advancement of the story, such as the Climax of Act One or the Midpoint. These elements are learned from guidelines for professional script-writing. The annotated dataset is used to investigate the effectiveness of various story-related features and multi-label classification algorithms for the task of predicting the probability distribution of scene labels. We use cosine similarity and KL divergence to measure the quality of predicted distributions. The best approaches demonstrated 0.81 average similarity and 0.67 KL divergence between the predicted label vectors and the ground truth vectors based on the manual annotations. These results demonstrate the ability of machine learning approaches to detect the narrative structure in movies, which could lead to the development of story-related video analytics tools, such as automatic video summarization and recommendation systems.
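The two measures named above can be computed for a pair of 10-element label distributions as sketched below. This is an illustration only, not the authors' evaluation code: the direction of the divergence (ground truth relative to prediction), the natural logarithm, the `eps` floor, and the example vectors are all assumptions.

```python
import math


def cosine_similarity(p, q):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; q is floored at eps to avoid division by zero."""
    return sum(a * math.log(a / max(b, eps)) for a, b in zip(p, q) if a > 0)


# Made-up distributions over 10 story elements; each sums to 1.
truth = [0.4, 0.2, 0.1, 0.1, 0.05, 0.05, 0.04, 0.03, 0.02, 0.01]
predicted = [0.3, 0.25, 0.1, 0.1, 0.05, 0.05, 0.05, 0.04, 0.03, 0.03]

print(cosine_similarity(truth, predicted))
print(kl_divergence(truth, predicted))
```

Higher cosine similarity and lower KL divergence both indicate a predicted distribution closer to the annotators' distribution; identical vectors give a similarity of 1 and a divergence of 0.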

Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Synthetic Call Center Summaries Dataset
Overview
This dataset contains synthetic summaries of call center conversations generated by different prompt configurations. Each record (in JSON Lines format) includes:
The original dialogue metadata.
A generated summary tailored to provide quick insights for call center service agents.
Evaluation metrics.
Prompts for summarization
Narrative: A narrative summary of the conversation. Bullet Points: A summary of the… See the full description on the dataset page: https://huggingface.co/datasets/marccgrau/filtered_convos_research_llm_summaries.