https://creativecommons.org/publicdomain/zero/1.0/
I was thinking about movie sentiment and wanted to see whether there is any strong pattern linking how sentiment fluctuates over the course of a movie to how that movie is received or how it performs.
To track sentiment across the run time, the easy way is to take the movie's subtitles and identify the sentiment of each subtitle text. The advantage of this approach is that subtitles are easy to get, parse, and process, and NLP frameworks can readily help with the task. The approach also scales: regardless of a movie's original language, English subtitles are available for almost all movies, albeit with translation errors.
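As a rough illustration of the subtitle-based approach described above, the sketch below parses SRT-style cues and averages a sentiment score per time bucket. The word lists, the `bucket_seconds` parameter, and the function names are hypothetical placeholders rather than anything from the dataset; a real analysis would swap the toy lexicon for a proper sentiment model.

```python
import re
from collections import defaultdict

# Hypothetical toy lexicon, for illustration only.
POSITIVE = {"love", "great", "happy", "good", "wonderful"}
NEGATIVE = {"hate", "terrible", "sad", "bad", "awful"}

# Matches the start time of an SRT timing line such as "00:06:10,500 --> ...".
TIMESTAMP = re.compile(r"(\d+):(\d+):(\d+)[,.](\d+)\s*-->")


def parse_srt(text):
    """Yield (start_seconds, cue_text) for each cue in an SRT string."""
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            match = TIMESTAMP.match(line.strip())
            if match:
                h, m, s, ms = (int(g) for g in match.groups())
                start = h * 3600 + m * 60 + s + ms / 1000
                yield start, " ".join(lines[i + 1:])
                break


def score(text):
    """Count positive words minus negative words in one cue."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def sentiment_timeline(srt_text, bucket_seconds=300):
    """Return {bucket_index: mean cue score} over fixed-length time buckets."""
    buckets = defaultdict(list)
    for start, cue in parse_srt(srt_text):
        buckets[int(start // bucket_seconds)].append(score(cue))
    return {b: sum(v) / len(v) for b, v in sorted(buckets.items())}


SAMPLE = """1
00:00:05,000 --> 00:00:07,000
I love this great city.

2
00:06:10,500 --> 00:06:12,000
This is terrible. I hate it.
"""

print(sentiment_timeline(SAMPLE))
```

Bucketing by start time keeps the timeline comparable across movies of different lengths; normalizing bucket indices by total runtime would be a natural next step.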
https://choosealicense.com/licenses/unknown/
This is a new collection of translated movie subtitles from http://www.opensubtitles.org/.
IMPORTANT: If you use the OpenSubtitle corpus: Please, add a link to http://www.opensubtitles.org/ to your website and to your reports and publications produced with the data!
This is a slightly cleaner version of the subtitle collection using improved sentence alignment and better language checking.
- 62 languages, 1,782 bitexts
- total number of files: 3,735,070
- total number of tokens: 22.10G
- total number of sentence fragments: 3.35G
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset was created by Sandun De Silva
Released under Apache 2.0
Subscene is a vast collection of multilingual subtitles, encompassing 65 different languages and consisting of more than 30 billion tokens with a total size of 410.70 GB. This dataset includes subtitles for movies, series, and animations gathered from the Subscene dump. It provides a rich resource for studying language variations and building multilingual NLP models. We have carefully applied a fastText classifier to remove any non-language content from incorrect subsets. Additionally, we performed basic cleaning and filtration. However, there is still room for further cleaning and refinement.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset contains 25 aligned movie subtitle segments in English, Russian, and Italian, extracted from the ParTree corpus. Each row provides a short, context-rich movie line with its translations in all three languages, making it ideal for research and development in machine translation, multilingual NLP, and cross-lingual transfer learning.
Key features:
- Parallel triplets: English, Russian, Italian
- Sourced from authentic movie subtitles for natural, conversational language
- Suitable for training, validation, and benchmarking of translation and multilingual models
Data originally from the ParTree corpus, available via Swiss-AL
https://choosealicense.com/licenses/unknown/
German OPUS OpenSubtitles
Dataset Description
This dataset contains German movie and TV subtitles from the OPUS OpenSubtitles corpus. It provides a large collection of natural, conversational German text extracted from movie and TV show subtitles.
Key Features
- 141,565,623 lines of German dialogue
- 4.2 GB of clean text data
- 92.5% unique lines (low duplication rate)
- Natural conversational German across diverse genres
- Minimal contamination (0.2% English, 0.8% ALL…

See the full description on the dataset page: https://huggingface.co/datasets/arnomatic/german-opus-subtitles.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Subtitles is a dataset for object detection tasks - it contains Letters annotations for 500 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
A multilingual corpus of movie subtitles aligned on the sentence-level. Contains data on more than 50 languages with a focus on the Indo-European language family. Morphosyntactic annotation (part-of-speech, features, dependencies) in Universal Dependency-style is available for 47 languages.
Attribution-NonCommercial-NoDerivs 3.0 (CC BY-NC-ND 3.0): https://creativecommons.org/licenses/by-nc-nd/3.0/
License information was derived automatically
techiaith/YouTube-Subtitles dataset hosted on Hugging Face and contributed by the HF Datasets community
The dataset is used to test the proposed methodologies for mining parallel data from comparable corpora.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Exploring language usage through frequency analysis in large corpora is a defining feature in most recent work in corpus and computational linguistics. From a psycholinguistic perspective, however, the corpora used in these contributions are often not representative of language usage: they are either domain-specific, limited in size, or extracted from unreliable sources. In an effort to address this limitation, we introduce SubIMDB, a corpus of everyday spoken language which we created and which contains over 225 million words. The corpus was extracted from 38,102 subtitles of family, comedy and children's movies and series, and is the first sizeable structured corpus of subtitles made available. Our experiments show that word frequency norms extracted from this corpus are more effective than those from well-known norms such as Kucera-Francis, HAL and SUBTLEXus in predicting various psycholinguistic properties of words, such as lexical decision times, familiarity, age of acquisition and simplicity. We also provide evidence that contradicts the long-standing assumption that the ideal size for a corpus can be determined solely based on how well its word frequencies correlate with lexical decision times.
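To make the idea of word frequency norms concrete, here is a minimal sketch that counts words in subtitle lines and scales the counts to occurrences per million words. The per-million scaling is a common convention assumed here for illustration; the description above does not state how SubIMDB normalizes its counts, and the tokenizer and function name are hypothetical.

```python
import re
from collections import Counter


def frequency_per_million(lines):
    """Return {word: occurrences per million tokens} for an iterable of text lines."""
    counts = Counter()
    for line in lines:
        # Simple lowercase tokenizer (an assumption; real norms use more careful tokenization).
        counts.update(re.findall(r"[a-z']+", line.lower()))
    total = sum(counts.values())
    return {word: count * 1_000_000 / total for word, count in counts.items()}


sample = ["You are late.", "I know you are."]
norms = frequency_per_million(sample)
print(sorted(norms.items(), key=lambda item: -item[1])[:3])
```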
https://creativecommons.org/publicdomain/zero/1.0/
The YouTube Insights dataset offers valuable data for researchers, data scientists, and YouTube enthusiasts to explore video performance and engagement. This dataset focuses on key elements such as video titles, view counts, analytics, and subtitles.
With a wide range of YouTube videos, spanning various genres and upload dates, this dataset provides insights into video popularity and audience engagement. Researchers can analyze video titles to understand effective strategies for capturing viewer attention. View counts offer quantitative measures of video popularity, while analytics data provides metrics like likes, dislikes, comments, and shares.
The inclusion of subtitles enhances the dataset, enabling language pattern analysis, sentiment analysis, and keyword extraction. Researchers can uncover correlations between subtitles and video content to gain a deeper understanding of audience preferences and behavior.
The YouTube Insights dataset empowers users to discover valuable insights into YouTube's ecosystem, optimizing content creation and engagement strategies. It serves as a foundation for research, analysis, and innovation in the realm of online video platforms.
daliselmi/french-conversations-from-movie-subtitles dataset hosted on Hugging Face and contributed by the HF Datasets community
https://creativecommons.org/publicdomain/zero/1.0/
This dataset includes statistics about durations between two consecutive subtitles in 5,000 top-ranked IMDB movies. The dataset can be used to understand how dialogue is used in films and to develop tools to improve the watching experience. This notebook contains the code and data that were used to create this dataset.
Dataset statistics:
Dataset use cases:
Data Analysis:
The next histogram shows the distribution of movie runtimes in minutes. The mean runtime is 99.903 minutes, the maximum runtime is 877 minutes, and the median runtime is 98.5 minutes.
Figure 1: Histogram of the runtime in minutes
The next histogram shows the distribution of the percentage of gaps (duration between two consecutive subtitles) out of all the movie runtime. The mean percentage of gaps is 0.187, the maximum percentage of gaps is 0.033, and the median percentage of gaps is 327.586.
Figure 2: Histogram of the percentage of gaps (duration between two consecutive subtitles) out of all the movie runtime
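The gap measure discussed above can be sketched as follows, assuming each subtitle cue is a `(start, end)` pair in seconds. How the dataset itself treats overlapping cues, or the silence before the first and after the last subtitle, is not stated, so this sketch clamps overlaps to zero and counts only the gaps between consecutive cues; the function name is hypothetical.

```python
def gap_stats(cues, runtime_seconds):
    """Return (gaps, total_gap_seconds, gap_share_of_runtime).

    cues: list of (start, end) tuples in seconds, sorted by start time.
    """
    gaps = [
        max(0.0, next_start - prev_end)  # overlapping cues count as a zero-length gap
        for (_, prev_end), (next_start, _) in zip(cues, cues[1:])
    ]
    total_gap = sum(gaps)
    return gaps, total_gap, total_gap / runtime_seconds


cues = [(5.0, 7.0), (10.0, 12.5), (12.0, 14.0), (60.0, 62.0)]
gaps, total, share = gap_stats(cues, runtime_seconds=100.0)
print(gaps, total, share)
```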
The next histogram shows the distribution of the total movie's subtitle duration (seconds) between two consecutive subtitles. The mean subtitle duration is 4,837.089 seconds and the median subtitle duration is 2,906.435 seconds.
Figure 3: Histogram of the total movie's subtitle duration (seconds) between two consecutive subtitles
Example use case:
The Dynamic Adjustment of Playback Speed (DAPS), a VLC extension, can be used to save time while watching movies by increasing the playback speed between dialogues. However, it is essential to choose the appropriate settings for the extension, as increasing the playback speed can affect the overall tone and impact of the film.
The dataset of 5,000 top-ranked movie subtitle durations can be used to help users choose the appropriate settings for the DAPS extension. For example, users who are watching a fast-paced action movie may want to set a higher minimum duration between subtitles before speeding up, while users who are watching a slow-paced drama movie may want to set a lower minimum duration.
Additionally, users can use the dataset to understand how the different settings of the DAPS extension impact the overall viewing experience. For example, users can experiment with different settings to see how they affect the pacing of the movie and the overall impact of the dialogue scenes.
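One way to compare settings is to estimate how much time a given minimum gap duration would save. The sketch below assumes, purely for illustration, that every gap at least `min_gap` seconds long is played in full at a constant `speed` multiplier; the actual behavior and option names of the DAPS extension are not described here, so both parameters are hypothetical.

```python
def time_saved(gaps, min_gap, speed):
    """Seconds saved if every gap of at least min_gap seconds plays at `speed`x.

    A gap of g seconds takes g / speed seconds at the faster rate,
    so the saving per gap is g * (1 - 1 / speed).
    """
    if speed <= 0:
        raise ValueError("speed must be positive")
    return sum(g * (1 - 1 / speed) for g in gaps if g >= min_gap)


gaps = [3.0, 0.0, 46.0, 12.0]
for min_gap in (2, 10, 30):
    print(min_gap, round(time_saved(gaps, min_gap, speed=2.0), 1))
```

A higher `min_gap` leaves short pauses inside conversations untouched, at the cost of a smaller total saving.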
Conclusion
This dataset is a valuable resource for researchers and developers who are interested in understanding and improving the use of dialogue in movies or in tools for watching movies.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Survivor Subtitles Dataset
Dataset Description
A collection of subtitles from the American reality television show "Survivor", spanning seasons 1 through 47. The dataset contains subtitle text extracted from episode broadcasts.
Source
The subtitles were obtained from OpenSubtitles.com.
Dataset Details
Coverage:
- Seasons: 1-47
- Episodes per season: ~13-14
- Total episodes: ~600
Format:
- Text files containing timestamped subtitle data
- Character…

See the full description on the dataset page: https://huggingface.co/datasets/hipml/survivor-subtitles.
This dataset was created by Rahul Kaushik
Released under Other (specified in description)
https://www.datainsightsmarket.com/privacy-policy
Discover the booming subtitles editor market! This in-depth analysis reveals key trends, growth drivers, and leading companies shaping the future of video accessibility and localization. Explore market size, CAGR, and regional insights for 2025-2033.
https://www.datainsightsmarket.com/privacy-policy
Discover the booming subtitling and captioning market! This in-depth analysis reveals key trends, growth drivers, leading companies (SDI Media, IYUNO, Deluxe Media, ZOO Digital), and regional market shares from 2019-2033. Learn about the impact of AI and increasing demand for multilingual content.
PJMixers-Dev/Subtitles dataset hosted on Hugging Face and contributed by the HF Datasets community
According to a survey of viewers who watch foreign content, as of November 2021, subtitling was preferred over dubbing in the United States and the United Kingdom, with ** percent and ** percent of respondents, respectively, reporting that they preferred subtitles. By comparison, ** percent of video viewers in Italy reported preferring dubbing, while in Germany, this number rose to ********* respondents.