100+ datasets found
  1. open_subtitles

    • huggingface.co
    • marketplace.sshopencloud.eu
    Updated Dec 10, 2020
    Cite
    Language Technology Research Group at the University of Helsinki (2020). open_subtitles [Dataset]. https://huggingface.co/datasets/Helsinki-NLP/open_subtitles
    Explore at:
    Dataset updated
    Dec 10, 2020
    Dataset authored and provided by
    Language Technology Research Group at the University of Helsinki
    License

    https://choosealicense.com/licenses/unknown/

    Description

    This is a new collection of translated movie subtitles from http://www.opensubtitles.org/.

    IMPORTANT: If you use the OpenSubtitles corpus, please add a link to http://www.opensubtitles.org/ to your website and to your reports and publications produced with the data.

    This is a slightly cleaner version of the subtitle collection using improved sentence alignment and better language checking.

    62 languages, 1,782 bitexts; total number of files: 3,735,070; total number of tokens: 22.10G; total number of sentence fragments: 3.35G
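    As an aside, here is a hedged loading sketch using the Hugging Face `datasets` library; the en-fr language pair, the lang1/lang2 configuration keywords, and the trust_remote_code flag are assumptions to be checked against the dataset card rather than a verbatim recipe from it.

    ```python
    # Minimal sketch: load one OpenSubtitles language pair from Hugging Face.
    from datasets import load_dataset

    ds = load_dataset(
        "Helsinki-NLP/open_subtitles",
        lang1="en",              # assumed configuration keywords; see the dataset card
        lang2="fr",
        split="train",
        trust_remote_code=True,  # script-based OPUS loaders may require this
    )

    # Each example is expected to expose a translation dict keyed by language code.
    example = ds[0]
    print(example["translation"]["en"])
    print(example["translation"]["fr"])
    ```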

  2. OpenSubtitles Dataset

    • paperswithcode.com
    Updated Jul 10, 2022
    Cite
    Pierre Lison; Jörg Tiedemann (2022). OpenSubtitles Dataset [Dataset]. https://paperswithcode.com/dataset/opensubtitles
    Explore at:
    Dataset updated
    Jul 10, 2022
    Authors
    Pierre Lison; Jörg Tiedemann
    Description

    OpenSubtitles is a collection of multilingual parallel corpora. The dataset is compiled from a large database of movie and TV subtitles and includes a total of 1,689 bitexts spanning 2.6 billion sentences across 60 languages.

  3. French Conversations (from movie subtitles)

    • kaggle.com
    Updated Aug 3, 2023
    Cite
    Dali Selmi (2023). French Conversations (from movie subtitles) [Dataset]. https://www.kaggle.com/datasets/daliselmi/french-conversational-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 3, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Dali Selmi
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    French
    Description

    French Movie Subtitle Conversations Dataset

    Description

    Dive into the world of French dialogue with the French Movie Subtitle Conversations dataset – a comprehensive collection of over 127,000 movie subtitle conversations. This dataset offers a deep exploration of authentic and diverse conversational contexts spanning various genres, eras, and scenarios. It is thoughtfully organized into three distinct sets: training, testing, and validation.

    Content Overview

    Each conversation in this dataset is structured as a JSON object, featuring three key attributes:

    1. Context: Get a holistic view of the conversation's flow with the preceding 9 lines of dialogue. This context provides invaluable insights into the conversation's dynamics and contextual cues.
    2. Knowledge: Immerse yourself in a wide range of thematic knowledge. This dataset covers an array of topics, ensuring that your models receive exposure to diverse information sources for generating well-informed responses.
    3. Response: Explore how characters react and respond across various scenarios. From casual conversations to intense emotional exchanges, this dataset encapsulates the authenticity of genuine human interaction.

    Data Sample

    Here's a snippet from the dataset to give you an idea of its structure:

    [
     {
      "context": [
       "Tu as attendu longtemps?",
       "Oui en effet.",
       "Je pense que c' est grossier pour un premier rencard.",
       // ... (6 more lines of context)
      ],
      "knowledge": "",
      "response": "On n' avait pas dit 9h?"
     },
     // ... (more data samples)
    ]
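    For illustration, here is a minimal loading sketch; the file name is hypothetical and the field layout is taken from the sample above.

    ```python
    import json

    # Hypothetical file name for one of the three splits (training/testing/validation).
    with open("train.json", encoding="utf-8") as f:
        conversations = json.load(f)

    for conv in conversations[:3]:
        history = " / ".join(conv["context"])   # the preceding lines of dialogue
        print("CONTEXT:  ", history)
        print("KNOWLEDGE:", conv["knowledge"])
        print("RESPONSE: ", conv["response"])
        print()
    ```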
    

    Use Cases

    The French Movie Subtitle Conversations dataset serves as a valuable resource for several applications:

    • Conversational AI: Train advanced chatbots and dialogue systems in French that can engage users in fluid, contextually aware conversations.
    • Language Modeling: Enhance your language models by leveraging diverse dialogue patterns, colloquialisms, and contextual dependencies present in real-world conversations.
    • Sentiment Analysis: Investigate the emotional tones of conversations across different movie genres and periods, contributing to a better understanding of sentiment variation.

    Why This Dataset

    • Size and Diversity: With a vast collection of over 127,000 conversations spanning diverse genres and tones, this dataset offers an unparalleled breadth and depth in French dialogue data.
    • Contextual Richness: The inclusion of context empowers researchers and practitioners to explore the dynamics of conversation flow, leading to more accurate and contextually relevant responses.
    • Real-world Relevance: Originating from movie subtitles, this dataset mirrors real-world interactions, making it a valuable asset for training models that understand and generate human-like dialogue.

    Acknowledgments

    We extend our gratitude to the movie subtitle community for their contributions, which have enabled the creation of this diverse and comprehensive French dialogue dataset.

    Unlock the potential of authentic French conversations today with the French Movie Subtitle Conversations dataset. Engage in state-of-the-art research, enhance language models, and create applications that resonate with the nuances of real dialogue.

  4. YouTube-Subtitles

    • huggingface.co
    Updated Apr 16, 2025
    Cite
    Language Technologies, Bangor University (2025). YouTube-Subtitles [Dataset]. https://huggingface.co/datasets/techiaith/YouTube-Subtitles
    Explore at:
    Dataset updated
    Apr 16, 2025
    Dataset authored and provided by
    Language Technologies, Bangor University
    License

    Attribution-NonCommercial-NoDerivs 3.0 (CC BY-NC-ND 3.0): https://creativecommons.org/licenses/by-nc-nd/3.0/
    License information was derived automatically

    Area covered
    YouTube
    Description

    The techiaith/YouTube-Subtitles dataset is hosted on Hugging Face and was contributed by the HF Datasets community.

  5. Movie Subtitles

    • kaggle.com
    Updated Nov 30, 2024
    Cite
    Ahwar (2024). Movie Subtitles [Dataset]. https://www.kaggle.com/datasets/ahwardev/movie-subtitles
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 30, 2024
    Dataset provided by
    Kaggle
    Authors
    Ahwar
    Description

    This dataset includes subtitle files in the SRT (SubRip Subtitle) format for several popular movies, such as Oppenheimer and Tenet. SRT files are plain-text files widely used for subtitles, containing a series of structured entries to synchronize text with video content. Each entry in an SRT file comprises:

    1. A sequential index number to indicate the order of the subtitles.
    2. Timestamps that specify when a subtitle should appear and disappear, formatted as hours:minutes:seconds,milliseconds.
    3. The subtitle text, which is displayed during the designated time interval.

    For example:
    ```
    1
    00:00:01,500 --> 00:00:04,000
    This is a sample subtitle.

    2
    00:00:04,500 --> 00:00:07,000
    Here is another subtitle to demonstrate multiple entries.
    ```

    This straightforward format is highly compatible with media players and easy to edit. SRT files enhance accessibility by providing subtitles for different languages or accommodating viewers with hearing impairments, enriching the experience of enjoying these popular movies.
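    As a companion to the example above, here is a minimal parsing sketch using only the Python standard library; it is not part of the dataset, and it simply follows the SRT conventions described above (blank-line-separated entries, comma-separated milliseconds).

    ```python
    import re
    from datetime import timedelta

    TIME_RE = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")

    def parse_timestamp(ts: str) -> timedelta:
        """Convert 'HH:MM:SS,mmm' into a timedelta."""
        h, m, s, ms = map(int, TIME_RE.match(ts).groups())
        return timedelta(hours=h, minutes=m, seconds=s, milliseconds=ms)

    def parse_srt(text: str):
        """Yield (index, start, end, subtitle_text) for each SRT block."""
        for block in text.strip().split("\n\n"):
            lines = block.strip().splitlines()
            if len(lines) < 3:
                continue
            index = int(lines[0])
            start_str, end_str = [part.strip() for part in lines[1].split("-->")]
            yield index, parse_timestamp(start_str), parse_timestamp(end_str), "\n".join(lines[2:])

    sample = """1
    00:00:01,500 --> 00:00:04,000
    This is a sample subtitle.

    2
    00:00:04,500 --> 00:00:07,000
    Here is another subtitle to demonstrate multiple entries."""

    for idx, start, end, line in parse_srt(sample):
        print(idx, start, end, line)
    ```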
    
  6. ParTree - Parallel Treebanks: A multilingual corpus of movie subtitles.

    • doi.org
    • swissubase.ch
    Updated Mar 21, 2023
    Cite
    (2023). ParTree - Parallel Treebanks: A multilingual corpus of movie subtitles. [Dataset]. http://doi.org/10.48656/5mz4-x435
    Explore at:
    Dataset updated
    Mar 21, 2023
    Description

    A multilingual corpus of movie subtitles aligned on the sentence-level. Contains data on more than 50 languages with a focus on the Indo-European language family. Morphosyntactic annotation (part-of-speech, features, dependencies) in Universal Dependency-style is available for 47 languages.
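    As an illustration of what UD-style annotation looks like in practice, here is a hedged reading sketch that assumes the treebanks are distributed in the standard CoNLL-U format (ten tab-separated columns per token); the file name is hypothetical.

    ```python
    # Field order defined by the CoNLL-U specification.
    CONLLU_FIELDS = [
        "id", "form", "lemma", "upos", "xpos",
        "feats", "head", "deprel", "deps", "misc",
    ]

    def read_conllu(path):
        """Yield each sentence as a list of token dicts."""
        sentence = []
        with open(path, encoding="utf-8") as f:
            for raw in f:
                line = raw.rstrip("\n")
                if not line:                    # blank line closes a sentence
                    if sentence:
                        yield sentence
                        sentence = []
                elif not line.startswith("#"):  # skip sentence-level comments
                    sentence.append(dict(zip(CONLLU_FIELDS, line.split("\t"))))
        if sentence:
            yield sentence

    # Hypothetical file name: print (form, POS, dependency relation) for the first sentence.
    # for sent in read_conllu("partree_en.conllu"):
    #     print([(tok["form"], tok["upos"], tok["deprel"]) for tok in sent])
    #     break
    ```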

  7. Real-time Subtitles Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated May 23, 2025
    Cite
    Data Insights Market (2025). Real-time Subtitles Report [Dataset]. https://www.datainsightsmarket.com/reports/real-time-subtitles-1989001
    Explore at:
    Available download formats: ppt, doc, pdf
    Dataset updated
    May 23, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The real-time subtitles market is experiencing robust growth, driven by the increasing demand for accessible content across diverse platforms and languages. The market's expansion is fueled by several key factors: the rising adoption of streaming services and online video platforms, growing accessibility regulations mandating subtitles for various media, and the proliferation of multilingual content consumption. Technological advancements, such as improved speech-to-text accuracy and AI-powered subtitle generation, are further accelerating market growth. The market is segmented by technology (e.g., cloud-based, on-premise), application (e.g., live streaming, video conferencing, education), and end-user (e.g., media & entertainment, corporate, education). Competitive landscape analysis reveals a mix of established players and emerging technology companies, vying for market share through innovation in accuracy, speed, and integration with existing workflows. The forecast period (2025-2033) anticipates continued expansion, with a projected compound annual growth rate (CAGR) reflecting the increasing penetration of real-time subtitling across diverse industries and regions. Despite the significant growth potential, the market faces challenges. High initial investment costs for advanced technologies, the need for highly skilled professionals for accurate transcription and quality control, and variations in language complexities and accents can all constrain market penetration. However, these challenges are being addressed through continuous innovation, including the development of more affordable and user-friendly solutions, improvements in automated transcription technology, and increased accessibility of training programs. Overcoming these hurdles will be crucial for ensuring the continued and sustainable growth of the real-time subtitles market throughout the forecast period. The market is expected to reach a substantial value by 2033, driven by consistent technological advancements, regulatory support, and rising demand.

  8. Movie Subtitle Durations

    • kaggle.com
    Updated Oct 9, 2023
    Cite
    Nevo Itzhak (2023). Movie Subtitle Durations [Dataset]. https://www.kaggle.com/datasets/nevoit/movie-subtitle-durations
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 9, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Nevo Itzhak
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset includes statistics about durations between two consecutive subtitles in 5,000 top-ranked IMDB movies. The dataset can be used to understand how dialogue is used in films and to develop tools to improve the watching experience. This notebook contains the code and data that were used to create this dataset.

    Dataset statistics:

    • Average duration between subtitles
    • Average duration between subtitles with a duration greater than 10, 30, 60, 120, and 300 seconds
    • Maximum duration between subtitles
    • Percentage of duration between subtitles from the runtime
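    For concreteness, here is a minimal sketch (not the notebook that produced the dataset) of how such gap durations could be computed once subtitle timestamps have been parsed, for example with an SRT parser like the one shown under dataset 5.

    ```python
    from datetime import timedelta

    def gap_durations(entries):
        """entries: (start, end) timedeltas sorted by start time.
        Returns the gap in seconds between the end of each subtitle
        and the start of the next one."""
        gaps = []
        for (_, prev_end), (next_start, _) in zip(entries, entries[1:]):
            gaps.append(max((next_start - prev_end).total_seconds(), 0.0))
        return gaps

    entries = [
        (timedelta(seconds=1.5), timedelta(seconds=4.0)),
        (timedelta(seconds=4.5), timedelta(seconds=7.0)),
        (timedelta(seconds=130.0), timedelta(seconds=133.0)),
    ]
    gaps = gap_durations(entries)
    print(gaps)                              # [0.5, 123.0]
    print(sum(g for g in gaps if g > 60.0))  # total time spent in gaps longer than 60 s
    ```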

    Dataset use cases:

    • Understanding how dialogue is used in movies, such as the average duration of a dialogue scene and how the duration of dialogue varies between different genres
    • Developing tools to improve the watching experience by adjusting the playback speed of dialogue scenes
    • Evaluating the effectiveness of tools like the VLC extension mentioned below

    Data Analysis:

    The next histogram shows the distribution of movie runtimes in minutes. The mean runtime is 99.903 minutes, the maximum runtime is 877 minutes, and the median runtime is 98.5 minutes.


    Figure 1: Histogram of the runtime in minutes

    The next histogram shows the distribution of the percentage of gaps (duration between two consecutive subtitles) out of all the movie runtime. The mean percentage of gaps is 0.187, the maximum percentage of gaps is 0.033, and the median percentage of gaps is 327.586.


    Figure 2: Histogram of the percentage of gaps (duration between two consecutive subtitles) out of all the movie runtime

    The next histogram shows the distribution of the total movie's subtitle duration (seconds) between two consecutive subtitles. The mean subtitle duration is 4,837.089 seconds and the median subtitle duration is 2,906.435 seconds.


    Figure 3: Histogram of the total movie's subtitle duration (seconds) between two consecutive subtitles

    Example use case:

    The Dynamic Adjustment of Playback Speed (DAPS), a VLC extension, can be used to save time while watching movies by increasing the playback speed between dialogues. However, it is essential to choose the appropriate settings for the extension, as increasing the playback speed can impact the overall tone and impact of the film.

    The dataset of 5,000 top-ranked movie subtitle durations can be used to help users choose the appropriate settings for the DAPS extension. For example, users who are watching a fast-paced action movie may want to set a higher minimum duration between subtitles before speeding up, while users who are watching a slow-paced drama movie may want to set a lower minimum duration.

    Additionally, users can use the dataset to understand how the different settings of the DAPS extension impact the overall viewing experience. For example, users can experiment with different settings to see how they affect the pacing of the movie and the overall impact of the dialogue scenes.

    Conclusion

    This dataset is a valuable resource for researchers and developers who are interested in understanding and improving the use of dialogue in movies or in tools for watching movies.

  9. English - Indonesia Movie Subtitles

    • kaggle.com
    Updated Dec 6, 2022
    Cite
    Greeg Titan (2022). English - Indonesia Movie Subtitles [Dataset]. https://www.kaggle.com/datasets/greegtitan/english-indonesia-movie-subtitles/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 6, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Greeg Titan
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    A collection of movie translations from opensubtitles.org. The dataset contains two features: the Indonesian translation and the English source subtitle.

    Features

    feature | description
    --------|------------------------------------
    id      | Indonesian translation of subtitle
    en      | English source subtitle

    How to use this dataset

    Translation EDA
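    A hedged loading sketch follows; the file name and CSV layout are assumptions based on the feature table above, so adjust them to the files actually shipped with the dataset.

    ```python
    import pandas as pd

    # Hypothetical file name; the "en" and "id" columns follow the feature table above.
    pairs = pd.read_csv("english-indonesia-subtitles.csv")
    print(pairs[["en", "id"]].head())

    # A simple length-ratio filter often applied before training a translation model.
    ratio = pairs["en"].str.split().str.len() / pairs["id"].str.split().str.len()
    filtered = pairs[(ratio > 0.5) & (ratio < 2.0)]
    print(f"kept {len(filtered)} of {len(pairs)} sentence pairs")
    ```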

    Acknowledgement

    For more Information visit:

    J. Tiedemann, 2016, Finding Alternative Translations in a Large Corpus of Movie Subtitles. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016)

    Citation

    P. Lison and J. Tiedemann, 2016, OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016)

  10. Reasons why adults use subtitles when watching TV in known language in the...

    • statista.com
    Updated Jun 4, 2024
    Cite
    Statista (2024). Reasons why adults use subtitles when watching TV in known language in the U.S. 2023 [Dataset]. https://www.statista.com/statistics/1459167/reasons-use-subtitles-watching-tv-known-language-us/
    Explore at:
    Dataset updated
    Jun 4, 2024
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Jun 29, 2023 - Jul 5, 2023
    Area covered
    United States
    Description

    Enhancement of comprehension and more profound understanding of accents were the most common reasons why American adults use subtitles while watching TV in a known language, according to a survey conducted between June and July 2023. Another 33 percent of the respondents stated that they did so because they were in a noisy environment.

  11. SubIMDB: A Structured Corpus of Subtitles

    • live.european-language-grid.eu
    • zenodo.org
    txt
    Updated Nov 15, 2022
    Cite
    (2022). SubIMDB: A Structured Corpus of Subtitles [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7453
    Explore at:
    Available download formats: txt
    Dataset updated
    Nov 15, 2022
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Exploring language usage through frequency analysis in large corpora is a defining feature of most recent work in corpus and computational linguistics. From a psycholinguistic perspective, however, the corpora used in these contributions are often not representative of language usage: they are either domain-specific, limited in size, or extracted from unreliable sources. In an effort to address this limitation, we introduce SubIMDB, a corpus of everyday spoken language containing over 225 million words. The corpus was extracted from 38,102 subtitles of family, comedy and children's movies and series, and is the first sizeable structured corpus of subtitles made available. Our experiments show that word frequency norms extracted from this corpus are more effective than those from well-known norms such as Kucera-Francis, HAL and SUBTLEXus in predicting various psycholinguistic properties of words, such as lexical decision times, familiarity, age of acquisition and simplicity. We also provide evidence that contradicts the long-standing assumption that the ideal size for a corpus can be determined solely based on how well its word frequencies correlate with lexical decision times.

  12. Real Time Subtitles Market Report | Global Forecast From 2025 To 2033

    • dataintelo.com
    csv, pdf, pptx
    Updated Jan 7, 2025
    Cite
    Dataintelo (2025). Real Time Subtitles Market Report | Global Forecast From 2025 To 2033 [Dataset]. https://dataintelo.com/report/real-time-subtitles-market
    Explore at:
    Available download formats: csv, pdf, pptx
    Dataset updated
    Jan 7, 2025
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Real Time Subtitles Market Outlook



    The global real time subtitles market size was valued at approximately USD 2.5 billion in 2023 and is expected to surge to around USD 6.8 billion by 2032, growing at a compound annual growth rate (CAGR) of 11.5% during the forecast period. This notable growth can be attributed to several factors, including the rising demand for accessible content, advancements in artificial intelligence (AI) and machine learning (ML) technologies, and the increasing globalization of media and corporate communications.



    One of the primary growth factors driving the real time subtitles market is the increasing emphasis on accessibility and inclusiveness in media and communications. Governments and organizations worldwide are instituting regulations and policies requiring content to be accessible to individuals who are deaf or hard of hearing. For instance, the Americans with Disabilities Act (ADA) in the United States mandates that video content be accessible, propelling the adoption of real-time subtitle solutions. This regulatory environment, coupled with growing social awareness, significantly fuels market growth.



    Another critical driver is the rapid advancement of AI and ML technologies, which have revolutionized the accuracy and efficiency of real-time subtitle generation. Modern AI-driven subtitle solutions can now offer near-perfect synchronization and error-free transcription, enhancing user experience. These technological advancements are making real-time subtitles more reliable and scalable, thereby increasing their adoption across various sectors such as broadcasting, education, and corporate communications.



    The globalization of media content and corporate operations further contributes to the market's expansion. As companies and content creators aim to reach a global audience, the need for multilingual subtitle solutions becomes imperative. Real-time subtitles facilitate effective communication across different languages and cultural contexts, thereby broadening the reach and appeal of content. This globalization trend is particularly evident in the streaming services sector, where platforms are increasingly providing real-time subtitles in multiple languages to cater to diverse audiences.



    Film Subtitling plays a crucial role in the globalization of media content, as it allows films to reach audiences across different linguistic and cultural backgrounds. With the rise of streaming platforms and international film festivals, the demand for high-quality film subtitling services has surged. These services not only enhance the accessibility of films for non-native speakers but also preserve the original context and cultural nuances of the content. As the film industry continues to expand its global footprint, the importance of accurate and culturally sensitive film subtitling cannot be overstated. This trend is particularly significant for independent filmmakers and studios aiming to distribute their content internationally, as it opens up new markets and increases viewership.



    Regionally, North America and Europe are currently the largest markets for real-time subtitles, driven by stringent accessibility regulations and the advanced state of digital infrastructure. However, the Asia Pacific region is expected to witness the highest growth rate during the forecast period, owing to increasing internet penetration, the proliferation of digital content, and rising awareness about accessibility. China and India, with their massive consumer bases and growing digital economies, are poised to be significant contributors to this regional market growth.



    Component Analysis



    The real time subtitles market by component can be broadly categorized into software, hardware, and services. Each of these segments plays a crucial role in the comprehensive ecosystem of real-time subtitle solutions. The software segment includes various applications and platforms that facilitate subtitle generation and synchronization. This segment is expected to dominate the market due to continuous advancements in AI and ML algorithms that significantly improve the accuracy and efficiency of subtitle generation. Companies are investing heavily in R&D to develop innovative software solutions that cater to diverse linguistic and accessibility needs.



    The hardware segment encompasses the physical devices required to support real-time subtitle generation and display. These include specialized subtitle generation hardware,

  13. Film Subtitling Report

    • marketresearchforecast.com
    doc, pdf, ppt
    Updated Apr 27, 2025
    Cite
    Market Research Forecast (2025). Film Subtitling Report [Dataset]. https://www.marketresearchforecast.com/reports/film-subtitling-541433
    Explore at:
    Available download formats: doc, ppt, pdf
    Dataset updated
    Apr 27, 2025
    Dataset authored and provided by
    Market Research Forecast
    License

    https://www.marketresearchforecast.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The global film subtitling market is experiencing robust growth, driven by the increasing consumption of on-demand video content across diverse languages and regions. The rising popularity of streaming platforms like Netflix, Amazon Prime, and Disney+, coupled with the expansion of international film productions, fuels the demand for high-quality subtitling services. This market is segmented by language (Native, Foreign, Minority, Special) and application (Drama, Comedy, Horror, Romance, Action, Other). While precise market sizing requires further data, considering the significant growth in streaming and global film production, a reasonable estimation for the 2025 market size could be in the range of $2.5 to $3 billion USD. A Compound Annual Growth Rate (CAGR) of 8-10% is plausible over the forecast period (2025-2033), reflecting continued market expansion. Key growth drivers include increased globalization of media content, the rise of multilingual audiences, accessibility requirements for diverse viewers, and technological advancements in subtitling software and automation. However, the market also faces restraints. These include fluctuating language-specific demand, varying quality standards across subtitling providers, the need for skilled and experienced linguists, and the potential for copyright issues surrounding unauthorized subtitling. The competition is fierce among established players like PoliLingua, JBI Studios, and BTI Studios, as well as smaller, specialized firms. The geographical distribution of the market is expected to be broadly diversified, with North America and Europe representing significant market shares, though Asia-Pacific is projected to experience substantial growth in the coming years due to the booming entertainment industry and expanding internet penetration. The increasing adoption of AI-powered subtitling tools might streamline the process but poses challenges related to maintaining accuracy and cultural nuances. Strategic partnerships and investments in technological innovation are crucial for companies to maintain competitiveness and cater to the evolving demands of the film subtitling industry.

  14. Video Subtitle Translation Service Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated Jun 22, 2025
    Cite
    Data Insights Market (2025). Video Subtitle Translation Service Report [Dataset]. https://www.datainsightsmarket.com/reports/video-subtitle-translation-service-538596
    Explore at:
    Available download formats: doc, pdf, ppt
    Dataset updated
    Jun 22, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The global video subtitle translation services market is experiencing robust growth, driven by the proliferation of video content across various platforms and the increasing demand for accessibility and global reach. The market's expansion is fueled by several key factors. Firstly, the rise of streaming services and online video platforms necessitates multilingual subtitles to cater to a diverse global audience. Secondly, the growing emphasis on accessibility for individuals with hearing impairments is driving demand for accurate and high-quality subtitles. Thirdly, advancements in artificial intelligence (AI) and machine learning (ML) technologies are enhancing the speed and efficiency of translation processes, making the service more cost-effective. Finally, globalization and increased cross-border communication are further propelling market growth. We estimate the market size in 2025 to be approximately $2.5 billion, with a Compound Annual Growth Rate (CAGR) of 15% from 2025 to 2033, leading to a projected market value of around $7.8 billion by 2033. This growth trajectory is anticipated despite certain restraints, such as the need for human oversight to ensure accuracy and cultural nuances in translations, and the challenges associated with handling diverse dialects and languages. Market segmentation plays a crucial role in understanding the landscape. While specific segment breakdowns aren't provided, we can infer significant segments based on industry trends. These likely include language pairs (e.g., English to Spanish, English to Mandarin), video type (e.g., corporate videos, films, educational content), and service type (e.g., human translation, machine translation with post-editing). The competitive landscape is characterized by a mix of established players like Stepes, Ai-Media, and 3Play Media, and smaller, specialized companies catering to niche markets. The ongoing technological advancements and increasing market demand indicate that the video subtitle translation services market is poised for sustained, considerable growth in the coming years, creating opportunities for both established and emerging players.

  15. Audience preferences for subtitles or dubbing 2021, by country

    • statista.com
    Updated Jun 25, 2025
    Cite
    Statista (2025). Audience preferences for subtitles or dubbing 2021, by country [Dataset]. https://www.statista.com/statistics/1289864/subtitles-dubbing-audience-preference-by-country/
    Explore at:
    Dataset updated
    Jun 25, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Nov 2021
    Area covered
    Worldwide
    Description

    According to a survey of viewers who watch foreign content, as of November 2021, subtitled video content was preferred over dubbing in the United States and the United Kingdom, with ** percent and ** percent of respondents, respectively, reporting a preference for subtitles. By comparison, ** percent of video viewers in Italy reported preferring dubbing, while in Germany this figure rose to ********* respondents.

  16. Data from: JESC (Japanese-English Subtitle Corpus)

    • opendatalab.com
    zip
    Updated Sep 21, 2022
    Cite
    Stanford University (2022). JESC (Japanese-English Subtitle Corpus) [Dataset]. https://opendatalab.com/OpenDataLab/JESC
    Explore at:
    Available download formats: zip (447883426 bytes)
    Dataset updated
    Sep 21, 2022
    Dataset provided by
    Google
    Stanford University
    Rakuten Institute of Technology
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Japanese-English Subtitle Corpus is a large Japanese-English parallel corpus covering the underrepresented domain of conversational dialogue. It consists of more than 3.2 million examples, making it the largest freely available dataset of its kind. The corpus was assembled by crawling and aligning subtitles found on the web.

  17. SURE Project Subtitle Files

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Jan 21, 2020
    Cite
    Szarkowska, Agnieszka (2020). SURE Project Subtitle Files [Dataset]. http://doi.org/10.5281/zenodo.1160582
    Explore at:
    Available download formats: zip
    Dataset updated
    Jan 21, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Szarkowska, Agnieszka
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    Subtitle files (.xlsx format) from Experiment 1 and Experiment 2 from the project "SURE - Exploring Subtitle Reading Process with Eyetracking Technology", supported by a grant from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie Grant Agreement No. 702606.

    In each experiment, there are files in English, Polish and Spanish, with time codes, as used in the study.

    Each file contains three versions: subtitled at 12, 16 and 20 characters per second (cps).
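    For readers unfamiliar with the cps measure, here is a minimal sketch of how it is defined: the number of characters displayed divided by the time the subtitle stays on screen. The example values below are illustrative, not taken from the project files.

    ```python
    from datetime import timedelta

    def chars_per_second(text: str, start: timedelta, end: timedelta) -> float:
        """Characters per second for one subtitle; line breaks are not counted."""
        visible_chars = len(text.replace("\n", ""))
        duration = (end - start).total_seconds()
        return visible_chars / duration if duration > 0 else float("inf")

    cps = chars_per_second(
        "Exploring subtitle reading with eye tracking.",
        start=timedelta(seconds=10.0),
        end=timedelta(seconds=13.0),
    )
    print(round(cps, 1))  # 15.0, i.e. between the 12 cps and 16 cps conditions
    ```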

  18. Subtitling Services Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated May 27, 2025
    Cite
    Data Insights Market (2025). Subtitling Services Report [Dataset]. https://www.datainsightsmarket.com/reports/subtitling-services-1462258
    Explore at:
    Available download formats: pdf, ppt, doc
    Dataset updated
    May 27, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The global subtitling services market, currently valued at $1259 million in 2025, is projected to experience robust growth, driven by the increasing consumption of video content across diverse platforms and languages. The compound annual growth rate (CAGR) of 6.4% from 2025 to 2033 indicates a significant expansion in market size, exceeding $2000 million by the end of the forecast period. This growth is fueled by several key factors, including the rise of streaming services, the increasing demand for multilingual content accessibility, and the growing popularity of online education and e-learning platforms that necessitate subtitling for wider reach and inclusivity. Furthermore, advancements in automated subtitling technologies are streamlining the process, reducing costs, and improving turnaround times, contributing positively to market expansion. However, challenges such as ensuring high-quality translations that accurately convey nuances and cultural contexts, and managing the complexities of different dialects and accents, remain crucial aspects for service providers to address. Competitive intensity is moderate, with a range of companies, including both established players like 3Play Media and emerging players like GoPhrazy, catering to diverse client needs. The market is segmented based on factors such as service type (e.g., live subtitling, post-production subtitling), industry vertical (e.g., media and entertainment, education, corporate), and language pairs. The geographical distribution of revenue likely shows significant concentration in North America and Europe, reflecting high internet penetration and media consumption in these regions, but emerging markets in Asia and Latin America also present significant growth opportunities. Continued innovation in AI-powered subtitling tools and a focus on providing accurate, culturally sensitive translations will be crucial for companies to maintain a competitive edge in this dynamic market.

  19. Anime Subtitles

    • kaggle.com
    Updated Aug 19, 2021
    Cite
    Jess Fan (2021). Anime Subtitles [Dataset]. https://www.kaggle.com/datasets/jef1056/anime-subtitles/discussion
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 19, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Jess Fan
    Description

    Content

    The original extracted versions (in .srt and .ass format) are also included in this release (which Kaggle decompressed, for reasons unknown).

    This dataset contains 1,497,770 messages across 3,836 episodes of anime. The raw dataset contains 1,563,442 messages, some of which were removed during cleaning.

    This version (V4) adapts the original (frankly, terrible) format into the newer format I developed, which is used in https://github.com/JEF1056/clean-discord. The Dataset folder contains compressed text files, which are compatible with TensorFlow datasets. These can be streamed as a TextLineDataset in the TSV format.

    V4 also fixes many (but not all) issues that the original cleaning script was too simple to realistically take care of. It also uses the clean-discord cleaner algorithms to make sentences read more like natural language than formatting. The script has also been optimized to run on multi-core systems, allowing it to clean this entire dataset in under 30 seconds on a 4-core machine. See the new and improved script here: https://github.com/JEF1056/clean-discord/blob/v1.2/misc/anime.py (no longer bundled in the dataset files)

    Format

    The files are now all compressed to save space and are compatible with TensorFlow datasets. You can initialize a dataset function as follows:

    ```python
    import functools
    import os

    import tensorflow as tf

    def dataset_fn_local(split, shuffle_files=False):
        global nq_tsv_path
        del shuffle_files
        # Load lines from the text files as examples.
        files_to_read = [
            os.path.join(nq_tsv_path[split], filename)
            for filename in os.listdir(nq_tsv_path[split])
            if filename.startswith(split)
        ]
        print(f"Split {split} contains {len(files_to_read)} files. First 10: {files_to_read[0:10]}")
        # Read GZIP-compressed TSV shards, dropping empty lines.
        ds = tf.data.TextLineDataset(files_to_read, compression_type="GZIP").filter(
            lambda line: tf.not_equal(tf.strings.length(line), 0)
        )
        ds = ds.shuffle(buffer_size=600000)
        # Each line is a tab-separated (question, answer) pair.
        ds = ds.map(
            functools.partial(
                tf.io.decode_csv,
                record_defaults=["", ""],
                field_delim="\t",
                use_quote_delim=False,
            ),
            num_parallel_calls=tf.data.experimental.AUTOTUNE,
        )
        ds = ds.map(lambda *ex: dict(zip(["question", "answer"], ex)))
        return ds
    ```
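    A hedged usage sketch follows; the split names and directory layout are assumptions, so point each entry of nq_tsv_path at wherever the compressed TSV shards were actually unpacked.

    ```python
    # Hypothetical local paths for the unpacked shards (one directory per split).
    nq_tsv_path = {"train": "data/train", "validation": "data/validation"}

    train_ds = dataset_fn_local("train")   # defined in the snippet above
    for example in train_ds.take(2):
        print(example["question"].numpy().decode(), "->", example["answer"].numpy().decode())
    ```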

    Acknowledgements

    A sincere thanks to all of my friends for helping me come up with anime titles, a shoutout to the talented and dedicated people translating Japanese anime, and an even bigger thanks to Leen Chan for compiling the actual subtitles.

    This dataset is far from complete! I hope that people who are willing to find, add, and clean the data are out there and will do their best to help grow this dataset.

  20. HowTo100M-subtitles-small

    • huggingface.co
    Updated Nov 2, 2023
    Cite
    Diyar Hamedi (2023). HowTo100M-subtitles-small [Dataset]. https://huggingface.co/datasets/diyarhamedi/HowTo100M-subtitles-small
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 2, 2023
    Authors
    Diyar Hamedi
    Description

    HowTo100M-subtitles-small

    The subtitles from a subset of the HowTo100M dataset.
