15 datasets found

Movie Subtitle Durations
kaggle.com
Updated Oct 9, 2023
Cite
Nevo Itzhak (2023). Movie Subtitle Durations [Dataset]. https://www.kaggle.com/datasets/nevoit/movie-subtitle-durations
Dataset updated
Oct 9, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Nevo Itzhak
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
This dataset includes statistics about durations between two consecutive subtitles in 5,000 top-ranked IMDB movies. The dataset can be used to understand how dialogue is used in films and to develop tools to improve the watching experience. This notebook contains the code and data that were used to create this dataset.

Dataset statistics:

Average duration between subtitles

Average duration between subtitles with a duration greater than 10, 30, 60, 120, and 300 seconds

Maximum duration between subtitles

Total duration between subtitles as a percentage of the runtime
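
As a rough illustration of how statistics like these could be derived, the sketch below computes them from a list of subtitle cue times. It is not the author's notebook code: the function name, the (start, end) cue format, and the choice to ignore the time before the first and after the last subtitle are all assumptions made for this example.

```python
from statistics import mean

def gap_statistics(cues, runtime_seconds, thresholds=(10, 30, 60, 120, 300)):
    """Summarize gaps between consecutive subtitles.

    cues: list of (start, end) times in seconds, sorted by start time.
    runtime_seconds: total movie runtime in seconds.
    """
    # A gap runs from the end of one subtitle to the start of the next.
    # Overlapping cues are treated as a gap of zero (an assumption).
    gaps = [max(0.0, next_start - prev_end)
            for (_, prev_end), (next_start, _) in zip(cues, cues[1:])]
    stats = {
        "mean_gap": mean(gaps) if gaps else 0.0,
        "max_gap": max(gaps, default=0.0),
        "gap_fraction_of_runtime": sum(gaps) / runtime_seconds,
    }
    # Average gap restricted to gaps longer than each threshold.
    for t in thresholds:
        long_gaps = [g for g in gaps if g > t]
        stats[f"mean_gap_over_{t}s"] = mean(long_gaps) if long_gaps else None
    return stats

# Hypothetical cue times, in seconds, for a two-minute clip.
cues = [(0.0, 2.0), (5.0, 7.0), (27.0, 30.0), (95.0, 100.0)]
stats = gap_statistics(cues, runtime_seconds=120.0)
print(stats)
```

A threshold with no qualifying gaps yields None rather than zero, so that "no long gaps" is not confused with "long gaps of zero length".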

Dataset use cases:

Understanding how dialogue is used in movies, such as the average duration of a dialogue scene and how the duration of dialogue varies between different genres

Developing tools to improve the watching experience by adjusting the playback speed of dialogue scenes

Evaluating the effectiveness of tools like the VLC extension mentioned below

Data Analysis:

The next histogram shows the distribution of movie runtimes in minutes. The mean runtime is 99.903 minutes, the maximum runtime is 877 minutes, and the median runtime is 98.5 minutes.

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F3228936%2F5c78e4866f203dfe5f7a7f55e41f69d0%2Ffig%201.png?generation=1696861842737260&alt=media

Figure 1: Histogram of the runtime in minutes

The next histogram shows the distribution of the percentage of gaps (the duration between two consecutive subtitles) out of the total movie runtime. The mean percentage of gaps is 0.187, the median percentage of gaps is 0.033, and the maximum percentage of gaps is 327.586.

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F3228936%2F235453706269472da11082f080b1f41d%2Ffig%202.png?generation=1696862163125288&alt=media

Figure 2: Histogram of the percentage of gaps (duration between two consecutive subtitles) out of all the movie runtime

The next histogram shows the distribution of the total duration (in seconds) between consecutive subtitles per movie. The mean total duration is 4,837.089 seconds and the median is 2,906.435 seconds.

https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F3228936%2F234d31e3abaf6c4d174f494bf5cb86fa%2Ffig%203.png?generation=1696862309880510&alt=media

Figure 3: Histogram of the total movie's subtitle duration (seconds) between two consecutive subtitles

Example use case:

The Dynamic Adjustment of Playback Speed (DAPS), a VLC extension, can be used to save time while watching movies by increasing the playback speed between dialogues. However, it is essential to choose appropriate settings for the extension, as increasing the playback speed can affect the overall tone and impact of the film.

The dataset of 5,000 top-ranked movie subtitle durations can be used to help users choose the appropriate settings for the DAPS extension. For example, users who are watching a fast-paced action movie may want to set a higher minimum duration between subtitles before speeding up, while users who are watching a slow-paced drama movie may want to set a lower minimum duration.

Additionally, users can use the dataset to understand how the different settings of the DAPS extension impact the overall viewing experience. For example, users can experiment with different settings to see how they affect the pacing of the movie and the overall impact of the dialogue scenes.
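
To make the trade-off between settings concrete, the sketch below estimates how much viewing time a given minimum gap duration and speed-up factor would save. It is a simplified model, not the DAPS implementation: the function and parameter names are hypothetical, and it assumes the whole of every qualifying gap is played at the faster speed.

```python
def time_saved(gaps, min_gap, speed):
    """Seconds saved if every gap longer than min_gap plays at `speed`x.

    gaps: durations in seconds between consecutive subtitles.
    min_gap: only gaps strictly longer than this are sped up.
    speed: playback rate during those gaps (2.0 means twice as fast).
    """
    if speed <= 0:
        raise ValueError("speed must be positive")
    sped_up = sum(g for g in gaps if g > min_gap)
    # Playing `sped_up` seconds of film at `speed`x takes sped_up / speed.
    return sped_up - sped_up / speed

# Hypothetical gaps for one movie, in seconds.
gaps = [3.0, 20.0, 65.0]
for min_gap in (10, 30, 60):
    print(min_gap, time_saved(gaps, min_gap, speed=2.0))
```

Raising the minimum duration leaves more short pauses at normal speed, which preserves pacing at the cost of less time saved.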

Conclusion

This dataset is a valuable resource for researchers and developers who are interested in understanding and improving the use of dialogue in movies or in tools for watching movies.
open_subtitles
huggingface.co
marketplace.sshopencloud.eu
Updated Dec 10, 2020
Cite
Language Technology Research Group at the University of Helsinki (2020). open_subtitles [Dataset]. https://huggingface.co/datasets/Helsinki-NLP/open_subtitles
Dataset updated
Dec 10, 2020
Dataset authored and provided by
Language Technology Research Group at the University of Helsinki
License
https://choosealicense.com/licenses/unknown/
Description
This is a new collection of translated movie subtitles from http://www.opensubtitles.org/.

IMPORTANT: If you use the OpenSubtitle corpus: Please, add a link to http://www.opensubtitles.org/ to your website and to your reports and publications produced with the data!

This is a slightly cleaner version of the subtitle collection using improved sentence alignment and better language checking.

62 languages, 1,782 bitexts
total number of files: 3,735,070
total number of tokens: 22.10G
total number of sentence fragments: 3.35G
OpenSubtitles Dataset
paperswithcode.com
Updated Jul 10, 2022
Cite
Pierre Lison; Jörg Tiedemann (2022). OpenSubtitles Dataset [Dataset]. https://paperswithcode.com/dataset/opensubtitles
Dataset updated
Jul 10, 2022
Authors
Pierre Lison; Jörg Tiedemann
Description
OpenSubtitles is a collection of multilingual parallel corpora. The dataset is compiled from a large database of movie and TV subtitles and includes a total of 1689 bitexts spanning 2.6 billion sentences across 60 languages.
French Conversations (from movie subtitles)
kaggle.com
Updated Aug 3, 2023
Cite
Dali Selmi (2023). French Conversations (from movie subtitles) [Dataset]. https://www.kaggle.com/datasets/daliselmi/french-conversational-dataset
Dataset updated
Aug 3, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Dali Selmi
License
https://creativecommons.org/publicdomain/zero/1.0/
Area covered
French
Description
French Movie Subtitle Conversations Dataset

Description

Dive into the world of French dialogue with the French Movie Subtitle Conversations dataset – a comprehensive collection of over 127,000 movie subtitle conversations. This dataset offers a deep exploration of authentic and diverse conversational contexts spanning various genres, eras, and scenarios. It is thoughtfully organized into three distinct sets: training, testing, and validation.

Content Overview

Each conversation in this dataset is structured as a JSON object, featuring three key attributes:

Context: Get a holistic view of the conversation's flow with the preceding 9 lines of dialogue. This context provides invaluable insights into the conversation's dynamics and contextual cues.

Knowledge: Immerse yourself in a wide range of thematic knowledge. This dataset covers an array of topics, ensuring that your models receive exposure to diverse information sources for generating well-informed responses.

Response: Explore how characters react and respond across various scenarios. From casual conversations to intense emotional exchanges, this dataset encapsulates the authenticity of genuine human interaction.

Data Sample

Here's a snippet from the dataset to give you an idea of its structure:

[
  {
    "context": [
      "Tu as attendu longtemps?",
      "Oui en effet.",
      "Je pense que c' est grossier pour un premier rencard.",
      // ... (6 more lines of context)
    ],
    "knowledge": "",
    "response": "On n' avait pas dit 9h?"
  },
  // ... (more data samples)
]
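
One way a record of this shape might be turned into an input/target pair for a dialogue model is sketched below. The helper name and the joining scheme are assumptions for illustration, and the embedded record is shortened to the three context lines visible in the sample, since the remaining lines are elided there.

```python
import json

# A single record shaped like the sample above (context shortened).
record_json = '''
{"context": ["Tu as attendu longtemps?", "Oui en effet.",
             "Je pense que c' est grossier pour un premier rencard."],
 "knowledge": "",
 "response": "On n' avait pas dit 9h?"}
'''

def to_training_pair(record, max_context=9):
    """Return (source, target) text for one conversation record."""
    # Keep at most the last `max_context` lines of preceding dialogue.
    turns = record["context"][-max_context:]
    source = "\n".join(turns)
    # Prepend the knowledge field only when it is non-empty.
    if record.get("knowledge"):
        source = record["knowledge"] + "\n" + source
    return source, record["response"]

record = json.loads(record_json)
source, target = to_training_pair(record)
print(source)
print("->", target)
```

Note that the sample as displayed contains `//` comments, which standard JSON parsers reject, so those placeholder lines would need to be absent from the real files for `json.loads` to work.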

Use Cases

The French Movie Subtitle Conversations dataset serves as a valuable resource for several applications:

Conversational AI: Train advanced chatbots and dialogue systems in French that can engage users in fluid, contextually aware conversations.

Language Modeling: Enhance your language models by leveraging diverse dialogue patterns, colloquialisms, and contextual dependencies present in real-world conversations.

Sentiment Analysis: Investigate the emotional tones of conversations across different movie genres and periods, contributing to a better understanding of sentiment variation.

Why This Dataset

Size and Diversity: With a vast collection of over 127,000 conversations spanning diverse genres and tones, this dataset offers an unparalleled breadth and depth in French dialogue data.

Contextual Richness: The inclusion of context empowers researchers and practitioners to explore the dynamics of conversation flow, leading to more accurate and contextually relevant responses.

Real-world Relevance: Originating from movie subtitles, this dataset mirrors real-world interactions, making it a valuable asset for training models that understand and generate human-like dialogue.

Acknowledgments

We extend our gratitude to the movie subtitle community for their contributions, which have enabled the creation of this diverse and comprehensive French dialogue dataset.

Unlock the potential of authentic French conversations today with the French Movie Subtitle Conversations dataset. Engage in state-of-the-art research, enhance language models, and create applications that resonate with the nuances of real dialogue.
Data on regional, ethnicity, and minorities representation in movies
data.mendeley.com
Updated Feb 20, 2025
Cite
FERNANDO TAMBERLINI ALVES (2025). Data on regional, ethnicity, and minorities representation in movies [Dataset]. http://doi.org/10.17632/kzv2m4hsvw.1
Unique identifier
https://doi.org/10.17632/kzv2m4hsvw.1
Dataset updated
Feb 20, 2025
Authors
FERNANDO TAMBERLINI ALVES
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The data sources are primarily three public databases: MovieLens, IMDb, and the Brazilian National Cinema Agency. We also collected movie data and subtitle files using web scraping and public APIs from six public websites: imdb.com, letterboxd.com, metacritic.com, rottentomatoes.com, subdl.com, and subscene.co.in. In addition, we used an LLM tool (Claude.ai by Anthropic) to collect region and ethnicity information for each movie's director, screenwriter, and main character.
Movie Dynamics
kaggle.com
zip
Updated Apr 1, 2021
Cite
Michael Fire (2021). Movie Dynamics [Dataset]. https://www.kaggle.com/michaelfire/movie-dynamics-over-15000-movie-social-networks
Available download formats: zip (30901632 bytes)
Dataset updated
Apr 1, 2021
Authors
Michael Fire
Description
The dataset is from our recent study titled "Using data science to understand the film industry’s gender gap". To construct this dataset, we fused data from the online movie database IMDb with a dataset of movie dialogue subtitles to create the largest available corpus of movie social networks (15,540 networks).

More details on our research can be found at the following links:
* Kagan, Dima, Thomas Chesney, and Michael Fire. "Using data science to understand the film industry's gender gap." Nature Humanities and Social Sciences Communications, 6.1 (2020): 1-16
* "What do movie characters’ relationships reveal about gender, and how has this changed over time?", On Society Blog Post
* Project's GitHub page
* Our lab's website
Correlation between variables.
plos.figshare.com
xls
Updated Jun 1, 2023
Cite
Lydia T. S. Yee (2023). Correlation between variables. [Dataset]. http://doi.org/10.1371/journal.pone.0174569.t001
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0174569.t001
Dataset updated
Jun 1, 2023
Dataset provided by
PLOS ONE
Authors
Lydia T. S. Yee
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Correlation between variables.
Data from: ACTIV-ES: a comparable Spanish corpus comprised of film dialogue...
live.european-language-grid.eu
data.niaid.nih.gov
Updated Aug 18, 2021
Cite
(2021). ACTIV-ES: a comparable Spanish corpus comprised of film dialogue from Argentine, Mexican and Spanish productions [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7467
Dataset updated
Aug 18, 2021
License
https://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.html
Area covered
Argentina, Mexico
Description
ACTIV-ES is a comparable Spanish corpus comprised of film dialogue from Argentine, Mexican and Spanish productions. Titles for each of these three countries were seeded from the Internet Movie Database; subtitle data for the hearing impaired was provided by Opensubtitles.org, post-processed to correct or remove subtitle, OCR and diacritic artifacts, and annotated for part-of-speech.

The data is available in two main formats: 1) running text for each document and 2) 1:5 gram aggregate files. Each format includes a plain text and part-of-speech annotated version. Document names reflect the language code, country, year, title, type, genre (first genre listed in the IMDb), and IMDb ID.

For more information about the development and evaluation of these resources and to cite this work refer to: Francom, J., Hulden, M. and Ussishkin, A. (2014) ACTIV-ES: a comparable, cross-dialect corpus of 'everyday' Spanish from Argentina, Mexico, and Spain. In Proceedings of the Ninth Annual Language Resources and Evaluation Conference, Reykjavik, Iceland. European Language Resources Association (ELRA).

In version .02, a tagged running-format corpus that includes the EAGLES tagset has been added in the /eagles directory. This tagset is much more fleshed out than the simplified tagset in the /tagged directory. For information on the tagset refer here: http://nlp.lsi.upc.edu/freeling/doc/tagsets/tagset-es.html.
SUBTLEX-CH: Chinese Word and Character Frequencies Based on Film Subtitles
plos.figshare.com
doc
Updated May 30, 2023
Cite
Qing Cai; Marc Brysbaert (2023). SUBTLEX-CH: Chinese Word and Character Frequencies Based on Film Subtitles [Dataset]. http://doi.org/10.1371/journal.pone.0010729
Available download formats: doc
Unique identifier
https://doi.org/10.1371/journal.pone.0010729
Dataset updated
May 30, 2023
Dataset provided by
PLOS ONE
Authors
Qing Cai; Marc Brysbaert
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Background: Word frequency is the most important variable in language research. However, despite the growing interest in the Chinese language, there are only a few sources of word frequency measures available to researchers, and the quality is less than what researchers in other languages are used to.

Methodology: Following recent work by New, Brysbaert, and colleagues in English, French and Dutch, we assembled a database of word and character frequencies based on a corpus of film and television subtitles (46.8 million characters, 33.5 million words). In line with what has been found in the other languages, the new word and character frequencies explain significantly more of the variance in Chinese word naming and lexical decision performance than measures based on written texts.

Conclusions: Our results confirm that word frequencies based on subtitles are a good estimate of daily language exposure and capture much of the variance in word processing efficiency. In addition, our database is the first to include information about the contextual diversity of the words and to provide good frequency estimates for multi-character words and the different syntactic roles in which the words are used. The word frequencies are freely available for research purposes.
Data from: ChatSubs: A dataset of movie dialogues in Spanish, Catalan,...
zenodo.org
produccioncientifica.ugr.es
application/gzip, txt
Updated Aug 7, 2023
Cite
Ksenia Kharitonova; Zoraida Callejas; David Pérez-Fernández; Asier Gutiérrez-Fandiño; David Griol (2023). ChatSubs: A dataset of movie dialogues in Spanish, Catalan, Basque and Galician [Dataset]. http://doi.org/10.5281/zenodo.8192331
Available download formats: txt, application/gzip
Unique identifier
https://doi.org/10.5281/zenodo.8192331
Dataset updated
Aug 7, 2023
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Ksenia Kharitonova; Zoraida Callejas; David Pérez-Fernández; Asier Gutiérrez-Fandiño; David Griol
Description
Description: The ChatSubs dataset contains dialogues in Spanish and three co-official languages of Spain (Catalan, Basque, and Galician). It was obtained from OpenSubtitles and processed to generate clearly segmented dialogues and turns. The dataset consists of 206,706 JSON files, with over 20 million dialogues and 96 million turns, making it one of the largest dialogue corpora available. It serves as an excellent resource for research teams interested in training dialogue models in Spanish, Catalan, Basque, and Galician.

License: CC BY-NC 4.0.
MovieQA Dataset
paperswithcode.com
opendatalab.com
Updated Jun 25, 2023
Cite
Makarand Tapaswi; Yukun Zhu; Rainer Stiefelhagen; Antonio Torralba; Raquel Urtasun; Sanja Fidler (2023). MovieQA Dataset [Dataset]. https://paperswithcode.com/dataset/movieqa
Dataset updated
Jun 25, 2023
Authors
Makarand Tapaswi; Yukun Zhu; Rainer Stiefelhagen; Antonio Torralba; Raquel Urtasun; Sanja Fidler
Description
The MovieQA dataset is a dataset for movie question answering, designed to evaluate automatic story comprehension from both video and text. The dataset consists of almost 15,000 multiple-choice question-answer pairs obtained from over 400 movies and features high semantic diversity. Each question comes with a set of five highly plausible answers, only one of which is correct. The questions can be answered using multiple sources of information: movie clips, plots, subtitles, and, for a subset, scripts and DVS.
replique-a
huggingface.co
Updated Jul 5, 2024
Cite
opsci (2024). replique-a [Dataset]. https://huggingface.co/datasets/opsci/replique-a
Dataset updated
Jul 5, 2024
Dataset authored and provided by
opsci
License
https://choosealicense.com/licenses/cc0-1.0/
Description
Description

The JSONL file generated by the script below contains detailed information about a corpus of public domain films, including their subtitles in multiple languages. Here is a detailed description of its structure:

JSONL file structure

IMDB: Unique identifier for the movie in the IMDb database.
primary_title: Primary title of the movie.
original_title: Original title of the movie.
french:
  filepath: Relative path to the French subtitles file.
  subtitles: List of…

See the full description on the dataset page: https://huggingface.co/datasets/opsci/replique-a.
Opusparcus Dataset
paperswithcode.com
opendatalab.com
Updated Nov 12, 2021
Cite
Mathias Creutz (2021). Opusparcus Dataset [Dataset]. https://paperswithcode.com/dataset/opusparcus
Dataset updated
Nov 12, 2021
Authors
Mathias Creutz
Description
Opusparcus is a paraphrase corpus for six European languages: German, English, Finnish, French, Russian, and Swedish. The paraphrases are extracted from the OpenSubtitles2016 corpus, which contains subtitles from movies and TV shows.

For each target language, the Opusparcus data have been partitioned into three types of data sets: training, development and test sets. The training sets are large, consisting of millions of sentence pairs, and have been compiled automatically, with the help of probabilistic ranking functions. The development and test sets consist of sentence pairs that have been annotated manually; each set contains approximately 1000 sentence pairs that have been verified to be acceptable paraphrases by two annotators.
Global Film Translation Market Demand and Supply Dynamics 2025-2032
statsndata.org
excel, pdf
Updated May 2025
Cite
Stats N Data (2025). Global Film Translation Market Demand and Supply Dynamics 2025-2032 [Dataset]. https://www.statsndata.org/report/film-translation-market-138167
Available download formats: excel, pdf
Dataset updated
May 2025
Dataset authored and provided by
Stats N Data
License
https://www.statsndata.org/how-to-order
Area covered
Global
Description
The film translation market, a vital component of the global entertainment industry, specializes in the adaptation of film dialogue and subtitles to cater to diverse linguistic audiences. As globalization increases the consumption of foreign films, the demand for high-quality translation services has surged, ensurin
Corpus of Contemporary American English (COCA)
abacus.library.ubc.ca
dataverse.ucla.edu
bin, pdf, tar, txt
Updated Sep 2, 2022
Cite
Abacus Data Network (2022). Corpus of Contemporary American English (COCA) [Dataset]. https://abacus.library.ubc.ca/dataset.xhtml?persistentId=hdl:11272.1/AB2/3AKAN0
Available download formats: txt (1737), bin (14953046), pdf (5234129), tar (2096243712)
Dataset updated
Sep 2, 2022
Dataset provided by
Abacus Data Network
Time period covered
1990 - 2020
Description
The Corpus of Contemporary American English (COCA) contains about 1 billion words in nearly 500,000 texts from 1990 to 2019 -- which are nearly evenly divided between spoken, fiction, magazines, newspapers, academic journals, blogs, other web pages, and TV/Movie subtitles (120-130 million words in each genre). In addition, there are 20 million words each year from 1990-2019 (with the same genre balance each year). From the COCA website: "The Corpus of Contemporary American English (COCA) is the only large and 'representative' corpus of American English. COCA is probably the most widely-used corpus of English, and it is related to many other corpora of English that we have created. These corpora were formerly known as the 'BYU Corpora', and they offer unparalleled insight into variation in English." (https://www.english-corpora.org/coca/)