100+ datasets found
  1. Movie Subtitle Dataset

    • kaggle.com
    zip
    Updated Aug 8, 2021
    Adiamaan (2021). Movie Subtitle Dataset [Dataset]. https://www.kaggle.com/datasets/adiamaan/movie-subtitle-dataset
    Available download formats
    zip (254871718 bytes)
    Dataset updated
    Aug 8, 2021
    Authors
    Adiamaan
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    💡 Motive

    I was thinking about movie sentiments and wanted to see whether there is a strong pattern linking how sentiment fluctuates across a movie to how that movie is received or performs.

    🍎 Lowest hanging fruit

    To track movie sentiment across the run time, the easy way is to get the movie subtitles and identify the sentiment of each text in the subtitle. The advantage of this approach is that movie subtitles are easy to get, parse, and process, and NLP frameworks can readily help with the task. The approach is also scalable: irrespective of the original language, English subtitles are available for almost all movies, albeit with translation errors.
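
    The approach described above can be sketched roughly as follows. This is a minimal illustration, not the dataset author's code: it assumes the subtitles are in SRT format, and the tiny word lists and the function names (`parse_srt`, `score`, `sentiment_by_minute`) are hypothetical stand-ins for a real NLP sentiment model.

```python
import re
from collections import defaultdict

# Hypothetical toy lexicon; a real analysis would use an NLP sentiment model.
POSITIVE = {"love", "great", "happy", "good"}
NEGATIVE = {"hate", "bad", "sad", "terrible"}

# Assumes SRT-style cues: "HH:MM:SS,mmm --> HH:MM:SS,mmm" followed by text.
CUE_RE = re.compile(
    r"(\d{2}):(\d{2}):(\d{2}),(\d{3})\s*-->\s*"
    r"(\d{2}):(\d{2}):(\d{2}),(\d{3})\s*\n(.*?)(?:\n\s*\n|\Z)",
    re.S,
)


def parse_srt(text):
    """Yield (start_seconds, cue_text) for each cue in SRT-formatted text."""
    for m in CUE_RE.finditer(text):
        h, mi, s, ms = (int(x) for x in m.groups()[:4])
        start = h * 3600 + mi * 60 + s + ms / 1000
        yield start, " ".join(m.group(9).split())


def score(line):
    """Crude sentiment: count of positive words minus count of negative words."""
    words = re.findall(r"[a-z']+", line.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def sentiment_by_minute(text, bucket_seconds=60):
    """Sum cue scores into time buckets keyed by bucket index."""
    totals = defaultdict(int)
    for start, line in parse_srt(text):
        totals[int(start // bucket_seconds)] += score(line)
    return dict(sorted(totals.items()))


SAMPLE = """1
00:00:05,000 --> 00:00:07,000
I love this great city.

2
00:01:10,500 --> 00:01:12,000
This is a terrible, sad day.
"""

print(sentiment_by_minute(SAMPLE))
```

    Bucketing by start time keeps the sketch simple; a finer analysis might weight each cue by its on-screen duration instead.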

  2. open_subtitles

    • huggingface.co
    • marketplace.sshopencloud.eu
    Updated May 13, 2024
    + more versions
    Language Technology Research Group at the University of Helsinki (2024). open_subtitles [Dataset]. https://huggingface.co/datasets/Helsinki-NLP/open_subtitles
    Dataset updated
    May 13, 2024
    Dataset authored and provided by
    Language Technology Research Group at the University of Helsinki
    License

    https://choosealicense.com/licenses/unknown/

    Description

    This is a new collection of translated movie subtitles from http://www.opensubtitles.org/.

    IMPORTANT: If you use the OpenSubtitle corpus: Please, add a link to http://www.opensubtitles.org/ to your website and to your reports and publications produced with the data!

    This is a slightly cleaner version of the subtitle collection using improved sentence alignment and better language checking.

    • 62 languages, 1,782 bitexts
    • Total number of files: 3,735,070
    • Total number of tokens: 22.10G
    • Total number of sentence fragments: 3.35G

  3. English Movie Subtitle Collection

    • kaggle.com
    zip
    Updated Apr 24, 2024
    Sandun De Silva (2024). English Movie Subtitle Collection [Dataset]. https://www.kaggle.com/datasets/sandundesilva/movie-genre-dataset
    Available download formats
    zip (5081555 bytes)
    Dataset updated
    Apr 24, 2024
    Authors
    Sandun De Silva
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset

    This dataset was created by Sandun De Silva

    Released under Apache 2.0


  4. subscene

    • huggingface.co
    Updated Mar 12, 2025
    REFINE ai (2025). subscene [Dataset]. https://huggingface.co/datasets/refine-ai/subscene
    Dataset updated
    Mar 12, 2025
    Dataset authored and provided by
    REFINE ai
    Description

    Subscene is a vast collection of multilingual subtitles, encompassing 65 different languages and consisting of more than 30 billion tokens with a total size of 410.70 GB. This dataset includes subtitles for movies, series, and animations gathered from the Subscene dump. It provides a rich resource for studying language variations and building multilingual NLP models. We have carefully applied a fastText classifier to remove any non-language content from incorrect subsets. Additionally, we performed basic cleaning and filtration. However, there is still room for further cleaning and refinement.

  5. Movie Parallel Subtitles (EN-IT-RU)

    • kaggle.com
    zip
    Updated Jun 26, 2025
    Timur Sharifullin (2025). Movie Parallel Subtitles (EN-IT-RU) [Dataset]. https://www.kaggle.com/datasets/timursharifullindata/movie-parallel-subtitles-small-sentiment-dataset
    Available download formats
    zip (6056 bytes)
    Dataset updated
    Jun 26, 2025
    Authors
    Timur Sharifullin
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This dataset contains 25 aligned movie subtitle segments in English, Russian, and Italian, extracted from the ParTree corpus. Each row provides a short, context-rich movie line with its translations in all three languages, making it ideal for research and development in machine translation, multilingual NLP, and cross-lingual transfer learning.

    Key features:

    • Parallel triplets: English, Russian, Italian
    • Sourced from authentic movie subtitles for natural, conversational language
    • Suitable for training, validation, and benchmarking of translation and multilingual models

    Data originally from the ParTree corpus, available via Swiss-AL

  6. german-opus-subtitles

    • huggingface.co
    Updated Jan 12, 2025
    arnomatic (2025). german-opus-subtitles [Dataset]. https://huggingface.co/datasets/arnomatic/german-opus-subtitles
    Dataset updated
    Jan 12, 2025
    Authors
    arnomatic
    License

    https://choosealicense.com/licenses/unknown/

    Description

    German OPUS OpenSubtitles

    Dataset Description

    This dataset contains German movie and TV subtitles from the OPUS OpenSubtitles corpus. It provides a large collection of natural, conversational German text extracted from movie and TV show subtitles.

    Key Features

    • 141,565,623 lines of German dialogue
    • 4.2 GB of clean text data
    • 92.5% unique lines (low duplication rate)
    • Natural conversational German across diverse genres
    • Minimal contamination (0.2% English, 0.8% ALL…

    See the full description on the dataset page: https://huggingface.co/datasets/arnomatic/german-opus-subtitles.

  7. Subtitles Dataset

    • universe.roboflow.com
    zip
    Updated Oct 9, 2022
    subtitles (2022). Subtitles Dataset [Dataset]. https://universe.roboflow.com/subtitles-jtdc8/subtitles-xmseb/model/1
    Available download formats
    zip
    Dataset updated
    Oct 9, 2022
    Dataset authored and provided by
    subtitles
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Letters Bounding Boxes
    Description

    Subtitles

    ## Overview
    
    Subtitles is a dataset for object detection tasks - it contains Letters annotations for 500 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
    ## License

    This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  8. ParTree - Parallel Treebanks: A multilingual corpus of movie subtitles.

    • doi.org
    • swissubase.ch
    Updated Dec 5, 2023
    (2023). ParTree - Parallel Treebanks: A multilingual corpus of movie subtitles. [Dataset]. http://doi.org/10.48656/5mz4-x435
    Dataset updated
    Dec 5, 2023
    Description

    A multilingual corpus of movie subtitles aligned at the sentence level. Contains data on more than 50 languages, with a focus on the Indo-European language family. Morphosyntactic annotation (part-of-speech, features, dependencies) in Universal Dependencies style is available for 47 languages.

  9. YouTube-Subtitles

    • huggingface.co
    Updated Apr 16, 2025
    Language Technologies, Bangor University (2025). YouTube-Subtitles [Dataset]. https://huggingface.co/datasets/techiaith/YouTube-Subtitles
    Dataset updated
    Apr 16, 2025
    Dataset authored and provided by
    Language Technologies, Bangor University
    License

    Attribution-NonCommercial-NoDerivs 3.0 (CC BY-NC-ND 3.0): https://creativecommons.org/licenses/by-nc-nd/3.0/
    License information was derived automatically

    Area covered
    YouTube
    Description

    techiaith/YouTube-Subtitles dataset hosted on Hugging Face and contributed by the HF Datasets community

  10. Movie Subtitles - Dataset - LDM

    • service.tib.eu
    Updated Jan 3, 2025
    (2025). Movie Subtitles - Dataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/movie-subtitles
    Dataset updated
    Jan 3, 2025
    Description

    The dataset is used to test the proposed methodologies for mining parallel data from comparable corpora.

  11. SubIMDB: A Structured Corpus of Subtitles

    • live.european-language-grid.eu
    • zenodo.org
    • +1 more
    txt
    Updated Nov 15, 2022
    (2022). SubIMDB: A Structured Corpus of Subtitles [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7453
    Available download formats
    txt
    Dataset updated
    Nov 15, 2022
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Exploring language usage through frequency analysis in large corpora is a defining feature in most recent work in corpus and computational linguistics. From a psycholinguistic perspective, however, the corpora used in these contributions are often not representative of language usage: they are either domain-specific, limited in size, or extracted from unreliable sources. In an effort to address this limitation, we introduce SubIMDB, a corpus of everyday language spoken text we created which contains over 225 million words. The corpus was extracted from 38,102 subtitles of family, comedy and children movies and series, and is the first sizeable structured corpus of subtitles made available. Our experiments show that word frequency norms extracted from this corpus are more effective than those from well-known norms such as Kucera-Francis, HAL and SUBTLEXus in predicting various psycholinguistic properties of words, such as lexical decision times, familiarity, age of acquisition and simplicity. We also provide evidence that contradicts the long-standing assumption that the ideal size for a corpus can be determined solely based on how well its word frequencies correlate with lexical decision times.

  12. YouTube Video Statistics and Subtitles Dataset

    • kaggle.com
    zip
    Updated Jul 3, 2023
    Hamza (2023). YouTube Video Statistics and Subtitles Dataset [Dataset]. https://www.kaggle.com/datasets/hamza3692/youtube-video-statistics-and-subtitles-dataset
    Available download formats
    zip (11716047 bytes)
    Dataset updated
    Jul 3, 2023
    Authors
    Hamza
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    YouTube
    Description

    The YouTube Insights dataset offers valuable data for researchers, data scientists, and YouTube enthusiasts to explore video performance and engagement. This dataset focuses on key elements such as video titles, view counts, analytics, and subtitles.

    With a wide range of YouTube videos, spanning various genres and upload dates, this dataset provides insights into video popularity and audience engagement. Researchers can analyze video titles to understand effective strategies for capturing viewer attention. View counts offer quantitative measures of video popularity, while analytics data provides metrics like likes, dislikes, comments, and shares.

    The inclusion of subtitles enhances the dataset, enabling language pattern analysis, sentiment analysis, and keyword extraction. Researchers can uncover correlations between subtitles and video content to gain a deeper understanding of audience preferences and behavior.

    The YouTube Insights dataset empowers users to discover valuable insights into YouTube's ecosystem, optimizing content creation and engagement strategies. It serves as a foundation for research, analysis, and innovation in the realm of online video platforms.

  13. french-conversations-from-movie-subtitles

    • huggingface.co
    Updated Aug 4, 2023
    daliselmi (2023). french-conversations-from-movie-subtitles [Dataset]. https://huggingface.co/datasets/daliselmi/french-conversations-from-movie-subtitles
    Dataset updated
    Aug 4, 2023
    Authors
    daliselmi
    Area covered
    French
    Description

    daliselmi/french-conversations-from-movie-subtitles dataset hosted on Hugging Face and contributed by the HF Datasets community

  14. Movie Subtitle Durations

    • kaggle.com
    zip
    Updated Oct 9, 2023
    Nevo Itzhak (2023). Movie Subtitle Durations [Dataset]. https://www.kaggle.com/datasets/nevoit/movie-subtitle-durations
    Available download formats
    zip (432921 bytes)
    Dataset updated
    Oct 9, 2023
    Authors
    Nevo Itzhak
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset includes statistics about durations between two consecutive subtitles in 5,000 top-ranked IMDB movies. The dataset can be used to understand how dialogue is used in films and to develop tools to improve the watching experience. This notebook contains the code and data that were used to create this dataset.

    Dataset statistics:

    • Average duration between subtitles
    • Average duration between subtitles with a duration greater than 10, 30, 60, 120, and 300 seconds
    • Maximum duration between subtitles
    • Percentage of duration between subtitles from the runtime
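
    The statistics listed above could be computed along these lines. This is a sketch under stated assumptions, not the dataset author's notebook: it assumes each subtitle is a `(start, end)` pair in seconds, that a gap runs from the end of one subtitle to the start of the next, and that the runtime share is a fraction rather than a percentage. The function name `gap_statistics` and the result keys are hypothetical.

```python
from statistics import mean


def gap_statistics(cues, runtime_seconds, thresholds=(10, 30, 60, 120, 300)):
    """Summarise the silent gaps between consecutive subtitle cues.

    cues: iterable of (start_seconds, end_seconds) pairs (assumed format).
    runtime_seconds: total movie runtime in seconds.
    """
    cues = sorted(cues)
    # Gap = time from the end of one cue to the start of the next;
    # overlapping cues are clamped to a zero-length gap.
    gaps = [
        max(0.0, next_start - prev_end)
        for (_, prev_end), (next_start, _) in zip(cues, cues[1:])
    ]
    stats = {
        "average_gap": mean(gaps) if gaps else 0.0,
        "max_gap": max(gaps, default=0.0),
        "gap_share_of_runtime": sum(gaps) / runtime_seconds,
    }
    # Average over only those gaps longer than each threshold.
    for t in thresholds:
        long_gaps = [g for g in gaps if g > t]
        stats[f"average_gap_over_{t}s"] = mean(long_gaps) if long_gaps else None
    return stats


example_cues = [(0, 2), (5, 6), (50, 55), (400, 410)]
example_stats = gap_statistics(example_cues, runtime_seconds=600)
for name, value in example_stats.items():
    print(name, value)
```

    Returning `None` when no gap exceeds a threshold keeps "no such gaps" distinct from "average of zero".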

    Dataset use cases:

    • Understanding how dialogue is used in movies, such as the average duration of a dialogue scene and how the duration of dialogue varies between different genres
    • Developing tools to improve the watching experience by adjusting the playback speed of dialogue scenes
    • Evaluating the effectiveness of tools like the VLC extension mentioned below

    Data Analysis:

    The next histogram shows the distribution of movie runtimes in minutes. The mean runtime is 99.903 minutes, the maximum runtime is 877 minutes, and the median runtime is 98.5 minutes.

    https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F3228936%2F5c78e4866f203dfe5f7a7f55e41f69d0%2Ffig%201.png?generation=1696861842737260&alt=media

    Figure 1: Histogram of the runtime in minutes

    The next histogram shows the distribution of the percentage of gaps (duration between two consecutive subtitles) out of all the movie runtime. The mean percentage of gaps is 0.187, the maximum percentage of gaps is 0.033, and the median percentage of gaps is 327.586.

    https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F3228936%2F235453706269472da11082f080b1f41d%2Ffig%202.png?generation=1696862163125288&alt=media

    Figure 2: Histogram of the percentage of gaps (duration between two consecutive subtitles) out of all the movie runtime

    The next histogram shows the distribution of the total movie's subtitle duration (seconds) between two consecutive subtitles. The mean subtitle duration is 4,837.089 seconds and the median subtitle duration is 2,906.435 seconds.

    https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F3228936%2F234d31e3abaf6c4d174f494bf5cb86fa%2Ffig%203.png?generation=1696862309880510&alt=media

    Figure 3: Histogram of the total movie's subtitle duration (seconds) between two consecutive subtitles

    Example use case:

    The Dynamic Adjustment of Playback Speed (DAPS), a VLC extension, can be used to save time while watching movies by increasing the playback speed between dialogues. However, it is essential to choose the appropriate settings for the extension, as increasing the playback speed can impact the overall tone and impact of the film.

    The dataset of 5,000 top-ranked movie subtitle durations can be used to help users choose the appropriate settings for the DAPS extension. For example, users who are watching a fast-paced action movie may want to set a higher minimum duration between subtitles before speeding up, while users who are watching a slow-paced drama movie may want to set a lower minimum duration.

    Additionally, users can use the dataset to understand how the different settings of the DAPS extension impact the overall viewing experience. For example, users can experiment with different settings to see how they affect the pacing of the movie and the overall impact of the dialogue scenes.

    Conclusion

    This dataset is a valuable resource for researchers and developers who are interested in understanding and improving the use of dialogue in movies or in tools for watching movies.

  15. survivor-subtitles

    • huggingface.co
    Updated Jan 3, 2025
    Paul Lambert (2025). survivor-subtitles [Dataset]. https://huggingface.co/datasets/hipml/survivor-subtitles
    Dataset updated
    Jan 3, 2025
    Authors
    Paul Lambert
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Survivor Subtitles Dataset

    Dataset Description

    A collection of subtitles from the American reality television show "Survivor", spanning seasons 1 through 47. The dataset contains subtitle text extracted from episode broadcasts.

    Source

    The subtitles were obtained from OpenSubtitles.com.

    Dataset Details

    Coverage:

    • Seasons: 1-47
    • Episodes per season: ~13-14
    • Total episodes: ~600

    Format:

    • Text files containing timestamped subtitle data
    • Character…

    See the full description on the dataset page: https://huggingface.co/datasets/hipml/survivor-subtitles.

  16. English Subtitles (opensubtitles.org)

    • kaggle.com
    Updated Dec 3, 2024
    Rahul Kaushik (2024). English Subtitles (opensubtitles.org) [Dataset]. https://www.kaggle.com/datasets/kaushikrahul/english-subtitles-opensubtitles-org
    Dataset updated
    Dec 3, 2024
    Dataset provided by
    Kaggle
    Authors
    Rahul Kaushik
    Description

    Dataset

    This dataset was created by Rahul Kaushik

    Released under Other (specified in description)


  17. Subtitles Editor Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated Apr 25, 2025
    Data Insights Market (2025). Subtitles Editor Report [Dataset]. https://www.datainsightsmarket.com/reports/subtitles-editor-512222
    Available download formats
    ppt, pdf, doc
    Dataset updated
    Apr 25, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    Discover the booming subtitles editor market! This in-depth analysis reveals key trends, growth drivers, and leading companies shaping the future of video accessibility and localization. Explore market size, CAGR, and regional insights for 2025-2033.

  18. Subtitling and Captioning Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated Jun 28, 2025
    Data Insights Market (2025). Subtitling and Captioning Report [Dataset]. https://www.datainsightsmarket.com/reports/subtitling-and-captioning-1393307
    Available download formats
    ppt, doc, pdf
    Dataset updated
    Jun 28, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    Discover the booming subtitling and captioning market! This in-depth analysis reveals key trends, growth drivers, leading companies (SDI Media, IYUNO, Deluxe Media, ZOO Digital), and regional market shares from 2019-2033. Learn about the impact of AI and increasing demand for multilingual content.

  19. Subtitles

    • huggingface.co
    Updated Apr 1, 2002
    + more versions
    Peanut Jar Mixers Development (2002). Subtitles [Dataset]. https://huggingface.co/datasets/PJMixers-Dev/Subtitles
    Dataset updated
    Apr 1, 2002
    Dataset authored and provided by
    Peanut Jar Mixers Development
    Description

    PJMixers-Dev/Subtitles dataset hosted on Hugging Face and contributed by the HF Datasets community

  20. Audience preferences for subtitles or dubbing 2021, by country

    • statista.com
    Updated Feb 8, 2022
    Statista (2022). Audience preferences for subtitles or dubbing 2021, by country [Dataset]. https://www.statista.com/statistics/1289864/subtitles-dubbing-audience-preference-by-country/
    Dataset updated
    Feb 8, 2022
    Dataset authored and provided by
    Statista: http://statista.com/
    Time period covered
    Nov 2021
    Area covered
    Worldwide
    Description

    According to a survey of viewers who watch foreign content, as of November 2021, subtitling video content was preferred over dubbing in the United States and the United Kingdom, with ** percent and ** percent of respondents reporting preferring the first method, respectively. By comparison, ** percent of video viewers in Italy reported preferring dubbing, while in Germany, this number rose to ********* respondents.


Movie Subtitle Dataset

5k timestamped subtitles with IMDB meta data
