15 datasets found
  1. Movie Subtitle Durations

    • kaggle.com
    Updated Oct 9, 2023
    Cite
    Nevo Itzhak (2023). Movie Subtitle Durations [Dataset]. https://www.kaggle.com/datasets/nevoit/movie-subtitle-durations
    Dataset updated
    Oct 9, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Nevo Itzhak
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset includes statistics about durations between two consecutive subtitles in 5,000 top-ranked IMDB movies. The dataset can be used to understand how dialogue is used in films and to develop tools to improve the watching experience. This notebook contains the code and data that were used to create this dataset.

    Dataset statistics:

    • Average duration between subtitles
    • Average duration between subtitles with a duration greater than 10, 30, 60, 120, and 300 seconds
    • Maximum duration between subtitles
    • Percentage of duration between subtitles from the runtime
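The statistics listed above can be illustrated with a small sketch. This is not the author's code (that lives in the notebook mentioned in the description); the function name `gap_stats`, the `(start, end)` tuple layout in seconds, and the output keys are all assumptions made for illustration.

```python
from statistics import mean


def gap_stats(subtitles, runtime_seconds, thresholds=(10, 30, 60, 120, 300)):
    """Summarize the gaps between consecutive subtitles.

    subtitles: list of (start, end) times in seconds, sorted by start
               (hypothetical layout, not the dataset's actual schema).
    runtime_seconds: total movie runtime in seconds.
    """
    if runtime_seconds <= 0:
        raise ValueError("runtime_seconds must be positive")

    # A gap is the silence between one subtitle's end and the next one's start.
    # Overlapping subtitles are clamped to a gap of zero.
    gaps = [
        max(0.0, next_start - prev_end)
        for (_, prev_end), (next_start, _) in zip(subtitles, subtitles[1:])
    ]

    stats = {
        "mean_gap": mean(gaps) if gaps else 0.0,
        "max_gap": max(gaps, default=0.0),
        "gap_share_of_runtime": sum(gaps) / runtime_seconds,
    }
    # Mean gap restricted to gaps longer than each threshold (None if no such gap).
    for t in thresholds:
        long_gaps = [g for g in gaps if g > t]
        stats[f"mean_gap_over_{t}s"] = mean(long_gaps) if long_gaps else None
    return stats


example = gap_stats([(0, 2), (5, 7), (20, 22), (100, 101)], runtime_seconds=200)
```

Whether the dataset reports the gap share as a fraction or as a percentage is not stated here, so the sketch returns a plain fraction.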

    Dataset use cases:

    • Understanding how dialogue is used in movies, such as the average duration of a dialogue scene and how the duration of dialogue varies between different genres
    • Developing tools to improve the watching experience by adjusting the playback speed of dialogue scenes
    • Evaluating the effectiveness of tools like the VLC extension mentioned below

    Data Analysis:

    The next histogram shows the distribution of movie runtimes in minutes. The mean runtime is 99.903 minutes, the maximum runtime is 877 minutes, and the median runtime is 98.5 minutes.

    https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F3228936%2F5c78e4866f203dfe5f7a7f55e41f69d0%2Ffig%201.png?generation=1696861842737260&alt=media

    Figure 1: Histogram of the runtime in minutes

    The next histogram shows the distribution of the percentage of gaps (duration between two consecutive subtitles) out of the total movie runtime. The mean percentage of gaps is 0.187, the median percentage of gaps is 0.033, and the maximum percentage of gaps is 327.586.

    https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F3228936%2F235453706269472da11082f080b1f41d%2Ffig%202.png?generation=1696862163125288&alt=media

    Figure 2: Histogram of the percentage of gaps (duration between two consecutive subtitles) out of all the movie runtime

    The next histogram shows the distribution of each movie's total subtitle duration (in seconds) between consecutive subtitles. The mean subtitle duration is 4,837.089 seconds and the median subtitle duration is 2,906.435 seconds.

    https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F3228936%2F234d31e3abaf6c4d174f494bf5cb86fa%2Ffig%203.png?generation=1696862309880510&alt=media

    Figure 3: Histogram of the total movie's subtitle duration (seconds) between two consecutive subtitles

    Example use case:

    The Dynamic Adjustment of Playback Speed (DAPS), a VLC extension, can be used to save time while watching movies by increasing the playback speed between dialogues. However, it is essential to choose appropriate settings for the extension, because increasing the playback speed can affect the film's overall tone and impact.

    The dataset of 5,000 top-ranked movie subtitle durations can be used to help users choose the appropriate settings for the DAPS extension. For example, users who are watching a fast-paced action movie may want to set a higher minimum duration between subtitles before speeding up, while users who are watching a slow-paced drama movie may want to set a lower minimum duration.

    Additionally, users can use the dataset to understand how the different settings of the DAPS extension impact the overall viewing experience. For example, users can experiment with different settings to see how they affect the pacing of the movie and the overall impact of the dialogue scenes.
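To make the trade-off between settings concrete, the sketch below estimates how much viewing time a given configuration might save. It is a simplified model, not the DAPS extension's actual algorithm: the function `estimated_time_saved` and the parameters `min_gap` and `speed` are hypothetical names, and the model assumes every gap at or above the minimum duration is played entirely at the faster speed.

```python
def estimated_time_saved(gaps, min_gap, speed):
    """Estimate seconds saved by speeding up long gaps between subtitles.

    gaps: gap durations in seconds (time between consecutive subtitles).
    min_gap: only gaps at least this long are sped up (hypothetical setting).
    speed: playback speed factor applied during those gaps, e.g. 2.0.
    """
    if speed <= 0:
        raise ValueError("speed must be positive")
    # A gap of g seconds played at `speed` takes g / speed seconds,
    # so the saving for that gap is g * (1 - 1 / speed).
    return sum(g * (1 - 1 / speed) for g in gaps if g >= min_gap)


gaps = [3, 13, 78]
saved_low_threshold = estimated_time_saved(gaps, min_gap=10, speed=2.0)
saved_high_threshold = estimated_time_saved(gaps, min_gap=60, speed=2.0)
```

With these example gaps, raising the minimum duration from 10 to 60 seconds leaves the 13-second gap at normal speed, which lowers the estimated saving.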

    Conclusion

    This dataset is a valuable resource for researchers and developers who are interested in understanding and improving the use of dialogue in movies or in tools for watching movies.

  2. open_subtitles

    • huggingface.co
    • marketplace.sshopencloud.eu
    Updated Dec 10, 2020
    Cite
    Language Technology Research Group at the University of Helsinki (2020). open_subtitles [Dataset]. https://huggingface.co/datasets/Helsinki-NLP/open_subtitles
    Dataset updated
    Dec 10, 2020
    Dataset authored and provided by
    Language Technology Research Group at the University of Helsinki
    License

    https://choosealicense.com/licenses/unknown/

    Description

    This is a new collection of translated movie subtitles from http://www.opensubtitles.org/.

    IMPORTANT: If you use the OpenSubtitle corpus: Please, add a link to http://www.opensubtitles.org/ to your website and to your reports and publications produced with the data!

    This is a slightly cleaner version of the subtitle collection using improved sentence alignment and better language checking.

    • 62 languages, 1,782 bitexts
    • Total number of files: 3,735,070
    • Total number of tokens: 22.10G
    • Total number of sentence fragments: 3.35G

  3. OpenSubtitles Dataset

    • paperswithcode.com
    Updated Jul 10, 2022
    Cite
    Pierre Lison; Jörg Tiedemann (2022). OpenSubtitles Dataset [Dataset]. https://paperswithcode.com/dataset/opensubtitles
    Dataset updated
    Jul 10, 2022
    Authors
    Pierre Lison; Jörg Tiedemann
    Description

    OpenSubtitles is a collection of multilingual parallel corpora. The dataset is compiled from a large database of movie and TV subtitles and includes a total of 1689 bitexts spanning 2.6 billion sentences across 60 languages.

  4. French Conversations (from movie subtitles)

    • kaggle.com
    Updated Aug 3, 2023
    Cite
    Dali Selmi (2023). French Conversations (from movie subtitles) [Dataset]. https://www.kaggle.com/datasets/daliselmi/french-conversational-dataset
    Dataset updated
    Aug 3, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Dali Selmi
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    French
    Description

    French Movie Subtitle Conversations Dataset

    Description

    Dive into the world of French dialogue with the French Movie Subtitle Conversations dataset – a comprehensive collection of over 127,000 movie subtitle conversations. This dataset offers a deep exploration of authentic and diverse conversational contexts spanning various genres, eras, and scenarios. It is thoughtfully organized into three distinct sets: training, testing, and validation.

    Content Overview

    Each conversation in this dataset is structured as a JSON object, featuring three key attributes:

    1. Context: Get a holistic view of the conversation's flow with the preceding 9 lines of dialogue. This context provides invaluable insights into the conversation's dynamics and contextual cues.
    2. Knowledge: Immerse yourself in a wide range of thematic knowledge. This dataset covers an array of topics, ensuring that your models receive exposure to diverse information sources for generating well-informed responses.
    3. Response: Explore how characters react and respond across various scenarios. From casual conversations to intense emotional exchanges, this dataset encapsulates the authenticity of genuine human interaction.

    Data Sample

    Here's a snippet from the dataset to give you an idea of its structure:

    [
     {
      "context": [
       "Tu as attendu longtemps?",
       "Oui en effet.",
       "Je pense que c' est grossier pour un premier rencard.",
       // ... (6 more lines of context)
      ],
      "knowledge": "",
      "response": "On n' avait pas dit 9h?"
     },
     // ... (more data samples)
    ]
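As a rough illustration of how a record with this structure might be consumed, the sketch below turns one conversation into a (prompt, response) pair. The snippet above is not valid JSON as printed because of its `//` placeholder comments, so the example uses a trimmed, valid version of the same record. The helper name `to_training_pair` and the newline separator are assumptions, not part of the dataset.

```python
import json


def to_training_pair(record, sep="\n"):
    """Join the context lines into one prompt and pair it with the response.

    record: a dict with "context" (list of str), "knowledge" (str),
            and "response" (str), as in the sample above.
    """
    prompt = sep.join(record["context"])
    return prompt, record["response"]


# Trimmed copy of the sample record, with the placeholder comments removed.
raw = """
[
  {
    "context": [
      "Tu as attendu longtemps?",
      "Oui en effet.",
      "Je pense que c' est grossier pour un premier rencard."
    ],
    "knowledge": "",
    "response": "On n' avait pas dit 9h?"
  }
]
"""

records = json.loads(raw)
prompt, response = to_training_pair(records[0])
```

The `knowledge` field is empty in the sample, so the sketch ignores it; a real pipeline would need to decide how to use it when it is populated.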
    

    Use Cases

    The French Movie Subtitle Conversations dataset serves as a valuable resource for several applications:

    • Conversational AI: Train advanced chatbots and dialogue systems in French that can engage users in fluid, contextually aware conversations.
    • Language Modeling: Enhance your language models by leveraging diverse dialogue patterns, colloquialisms, and contextual dependencies present in real-world conversations.
    • Sentiment Analysis: Investigate the emotional tones of conversations across different movie genres and periods, contributing to a better understanding of sentiment variation.

    Why This Dataset

    • Size and Diversity: With a vast collection of over 127,000 conversations spanning diverse genres and tones, this dataset offers an unparalleled breadth and depth in French dialogue data.
    • Contextual Richness: The inclusion of context empowers researchers and practitioners to explore the dynamics of conversation flow, leading to more accurate and contextually relevant responses.
    • Real-world Relevance: Originating from movie subtitles, this dataset mirrors real-world interactions, making it a valuable asset for training models that understand and generate human-like dialogue.

    Acknowledgments

    We extend our gratitude to the movie subtitle community for their contributions, which have enabled the creation of this diverse and comprehensive French dialogue dataset.

    Unlock the potential of authentic French conversations today with the French Movie Subtitle Conversations dataset. Engage in state-of-the-art research, enhance language models, and create applications that resonate with the nuances of real dialogue.

  5. Data on regional, ethnicity, and minorities representation in movies

    • data.mendeley.com
    Updated Feb 20, 2025
    Cite
    FERNANDO TAMBERLINI ALVES (2025). Data on regional, ethnicity, and minorities representation in movies [Dataset]. http://doi.org/10.17632/kzv2m4hsvw.1
    Dataset updated
    Feb 20, 2025
    Authors
    FERNANDO TAMBERLINI ALVES
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The data come primarily from three public databases: MovieLens, IMDb, and the Brazilian National Cinema Agency. We also collected movie data and subtitle files using web scraping and public APIs from six public websites: imdb.com, letterboxd.com, metacritic.com, rottentomatoes.com, subdl.com, and subscene.co.in. In addition, we used an LLM tool (Claude.ai by Anthropic) to collect the region and ethnicity of each movie's director, screenwriter, and main character.

  6. Movie Dynamics

    • kaggle.com
    zip
    Updated Apr 1, 2021
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Michael Fire (2021). Movie Dynamics [Dataset]. https://www.kaggle.com/michaelfire/movie-dynamics-over-15000-movie-social-networks
    Available download formats: zip (30901632 bytes)
    Dataset updated
    Apr 1, 2021
    Authors
    Michael Fire
    Description

    The dataset is from our recent study titled "Using data science to understand the film industry’s gender gap". To construct this dataset, we fused data from the online movie database IMDb with a dataset of movie dialogue subtitles to create the largest available corpus of movie social networks (15,540 networks).

    More details on our research can be found at the following links:

    • Kagan, Dima, Thomas Chesney, and Michael Fire. "Using data science to understand the film industry's gender gap." Nature Humanities and Social Sciences Communications, 6.1 (2020): 1-16
    • "What do movie characters’ relationships reveal about gender, and how has this changed over time?", On Society Blog Post
    • Project's GitHub page
    • Our lab's website

  7. Correlation between variables.

    • plos.figshare.com
    xls
    Updated Jun 1, 2023
    Cite
    Lydia T. S. Yee (2023). Correlation between variables. [Dataset]. http://doi.org/10.1371/journal.pone.0174569.t001
    Available download formats: xls
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Lydia T. S. Yee
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Correlation between variables.

  8. Data from: ACTIV-ES: a comparable Spanish corpus comprised of film dialogue...

    • live.european-language-grid.eu
    • data.niaid.nih.gov
    Updated Aug 18, 2021
    Cite
    (2021). ACTIV-ES: a comparable Spanish corpus comprised of film dialogue from Argentine, Mexican and Spanish productions [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7467
    Dataset updated
    Aug 18, 2021
    License

    https://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.html

    Area covered
    Argentina, Mexico
    Description

    DESCRIPTION: ACTIV-ES is a comparable Spanish corpus comprised of film dialogue from Argentine, Mexican and Spanish productions. Titles for each of these three countries were seeded from the Internet Movie Database. Subtitle data for the hearing impaired was provided by Opensubtitles.org, post-processed to correct or remove subtitle, OCR and diacritic artifacts, and annotated for part-of-speech.

    The data is available in two main formats: 1) running text for each document and 2) 1:5 gram aggregate files. Each format includes a plain text and a part-of-speech annotated version. Document names reflect the language code, country, year, title, type, genre (first genre listed in the IMDb), and IMDb ID.

    For more information about the development and evaluation of these resources, and to cite this work, refer to: Francom, J., Hulden, M. and Ussishkin, A. (2014) ACTIV-ES: a comparable, cross-dialect corpus of 'everyday' Spanish from Argentina, Mexico, and Spain. In Proceedings of the Ninth Annual Language Resources and Evaluation Conference, Reykjavik, Iceland. European Language Resources Association (ELRA).

    In version .02, a tagged running-format corpus that uses the EAGLES tagset has been added in the /eagles directory. This tagset is much more fleshed out than the simplified tagset in the /tagged directory. For information on the tagset, see http://nlp.lsi.upc.edu/freeling/doc/tagsets/tagset-es.html.

  9. SUBTLEX-CH: Chinese Word and Character Frequencies Based on Film Subtitles

    • plos.figshare.com
    doc
    Updated May 30, 2023
    Cite
    Qing Cai; Marc Brysbaert (2023). SUBTLEX-CH: Chinese Word and Character Frequencies Based on Film Subtitles [Dataset]. http://doi.org/10.1371/journal.pone.0010729
    Available download formats: doc
    Dataset updated
    May 30, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Qing Cai; Marc Brysbaert
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background

    Word frequency is the most important variable in language research. However, despite the growing interest in the Chinese language, there are only a few sources of word frequency measures available to researchers, and the quality is less than what researchers in other languages are used to.

    Methodology

    Following recent work by New, Brysbaert, and colleagues in English, French and Dutch, we assembled a database of word and character frequencies based on a corpus of film and television subtitles (46.8 million characters, 33.5 million words). In line with what has been found in the other languages, the new word and character frequencies explain significantly more of the variance in Chinese word naming and lexical decision performance than measures based on written texts.

    Conclusions

    Our results confirm that word frequencies based on subtitles are a good estimate of daily language exposure and capture much of the variance in word processing efficiency. In addition, our database is the first to include information about the contextual diversity of the words and to provide good frequency estimates for multi-character words and the different syntactic roles in which the words are used. The word frequencies are freely available for research purposes.

  10. Data from: ChatSubs: A dataset of movie dialogues in Spanish, Catalan,...

    • zenodo.org
    • produccioncientifica.ugr.es
    application/gzip, txt
    Updated Aug 7, 2023
    Cite
    Ksenia Kharitonova; Zoraida Callejas; David Pérez-Fernández; Asier Gutiérrez-Fandiño; David Griol (2023). ChatSubs: A dataset of movie dialogues in Spanish, Catalan, Basque and Galician [Dataset]. http://doi.org/10.5281/zenodo.8192331
    Available download formats: txt, application/gzip
    Dataset updated
    Aug 7, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Ksenia Kharitonova; Zoraida Callejas; David Pérez-Fernández; Asier Gutiérrez-Fandiño; David Griol
    Description

    Description: The ChatSubs dataset contains dialogues in Spanish and three co-official languages of Spain (Catalan, Basque, and Galician). It was obtained from OpenSubtitles and processed to generate clearly segmented dialogues and turns. The dataset consists of 206,706 JSON files, with over 20 million dialogues and 96 million turns, making it one of the largest dialogue corpora available. It serves as an excellent resource for research teams interested in training dialogue models in Spanish, Catalan, Basque, and Galician.

    License: CC BY-NC 4.0.

  11. MovieQA Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Jun 25, 2023
    Cite
    Makarand Tapaswi; Yukun Zhu; Rainer Stiefelhagen; Antonio Torralba; Raquel Urtasun; Sanja Fidler (2023). MovieQA Dataset [Dataset]. https://paperswithcode.com/dataset/movieqa
    Dataset updated
    Jun 25, 2023
    Authors
    Makarand Tapaswi; Yukun Zhu; Rainer Stiefelhagen; Antonio Torralba; Raquel Urtasun; Sanja Fidler
    Description

    The MovieQA dataset is a dataset for movie question answering, designed to evaluate automatic story comprehension from both video and text. The dataset consists of almost 15,000 multiple-choice questions obtained from over 400 movies and features high semantic diversity. Each question comes with a set of five highly plausible answers, only one of which is correct. The questions can be answered using multiple sources of information: movie clips, plots, subtitles, and, for a subset, scripts and DVS.

  12. replique-a

    • huggingface.co
    Updated Jul 5, 2024
    Cite
    opsci (2024). replique-a [Dataset]. https://huggingface.co/datasets/opsci/replique-a
    Dataset updated
    Jul 5, 2024
    Dataset authored and provided by
    opsci
    License

    https://choosealicense.com/licenses/cc0-1.0/

    Description

    The JSONL file generated by the script below contains detailed information about a corpus of public domain films, including their subtitles in multiple languages. Here is a detailed description of its structure:

      JSONL file structure
    

    • IMDB: Unique identifier for the movie in the IMDb database.
    • primary_title: Primary title of the movie.
    • original_title: Original title of the movie.
    • french:
      • filepath: Relative path to the French subtitles file.
      • subtitles: List of…

    See the full description on the dataset page: https://huggingface.co/datasets/opsci/replique-a.
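A JSONL file stores one JSON object per line, so a corpus like this can be read line by line. The sketch below shows that pattern on an in-memory example. The field names `IMDB`, `primary_title`, `original_title`, `filepath`, and `subtitles` come from the description above, while the nesting under `french`, the example values, and the helper name `iter_films` are assumptions (the description is truncated, so the real layout may differ).

```python
import io
import json


def iter_films(lines):
    """Yield one parsed record per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Made-up record for illustration only; not taken from the dataset.
demo_file = io.StringIO(
    '{"IMDB": "tt0000001", "primary_title": "Example", '
    '"original_title": "Exemple", '
    '"french": {"filepath": "fr/example.srt", "subtitles": []}}\n'
    "\n"
)

films = list(iter_films(demo_file))
first = films[0]
```

For a real file, the same function can be given an open file handle, for example `open(path, encoding="utf-8")`, in place of the `StringIO` object.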

  13. Opusparcus Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Nov 12, 2021
    Cite
    Mathias Creutz (2021). Opusparcus Dataset [Dataset]. https://paperswithcode.com/dataset/opusparcus
    Dataset updated
    Nov 12, 2021
    Authors
    Mathias Creutz
    Description

    Opusparcus is a paraphrase corpus for six European languages: German, English, Finnish, French, Russian, and Swedish. The paraphrases are extracted from the OpenSubtitles2016 corpus, which contains subtitles from movies and TV shows.

    For each target language, the Opusparcus data have been partitioned into three types of data sets: training, development and test sets. The training sets are large, consisting of millions of sentence pairs, and have been compiled automatically, with the help of probabilistic ranking functions. The development and test sets consist of sentence pairs that have been annotated manually; each set contains approximately 1000 sentence pairs that have been verified to be acceptable paraphrases by two annotators.

  14. Global Film Translation Market Demand and Supply Dynamics 2025-2032

    • statsndata.org
    excel, pdf
    Updated May 2025
    Cite
    Stats N Data (2025). Global Film Translation Market Demand and Supply Dynamics 2025-2032 [Dataset]. https://www.statsndata.org/report/film-translation-market-138167
    Available download formats: excel, pdf
    Dataset updated
    May 2025
    Dataset authored and provided by
    Stats N Data
    License

    https://www.statsndata.org/how-to-order

    Area covered
    Global
    Description

    The film translation market, a vital component of the global entertainment industry, specializes in the adaptation of film dialogue and subtitles to cater to diverse linguistic audiences. As globalization increases the consumption of foreign films, the demand for high-quality translation services has surged, ensurin

  15. Corpus of Contemporary American English (COCA)

    • abacus.library.ubc.ca
    • dataverse.ucla.edu
    bin, pdf, tar, txt
    Updated Sep 2, 2022
    Cite
    Abacus Data Network (2022). Corpus of Contemporary American English (COCA) [Dataset]. https://abacus.library.ubc.ca/dataset.xhtml?persistentId=hdl:11272.1/AB2/3AKAN0
    Available download formats: txt (1737), bin (14953046), pdf (5234129), tar (2096243712)
    Dataset updated
    Sep 2, 2022
    Dataset provided by
    Abacus Data Network
    Time period covered
    1990 - 2020
    Description

    The Corpus of Contemporary American English (COCA) contains about 1 billion words in nearly 500,000 texts from 1990 to 2019, which are nearly evenly divided between spoken, fiction, magazines, newspapers, academic journals, blogs, other web pages, and TV/Movie subtitles (120-130 million words in each genre). In addition, there are 20 million words each year from 1990-2019 (with the same genre balance each year). From the COCA website: "The Corpus of Contemporary American English (COCA) is the only large and 'representative' corpus of American English. COCA is probably the most widely-used corpus of English, and it is related to many other corpora of English that we have created. These corpora were formerly known as the 'BYU Corpora', and they offer unparalleled insight into variation in English." (https://www.english-corpora.org/coca/)
