https://choosealicense.com/licenses/unknown/
This is a new collection of translated movie subtitles from http://www.opensubtitles.org/.
IMPORTANT: If you use the OpenSubtitles corpus, please add a link to http://www.opensubtitles.org/ on your website and in any reports and publications produced with the data!
This is a slightly cleaner version of the subtitle collection using improved sentence alignment and better language checking.
62 languages, 1,782 bitexts; total number of files: 3,735,070; total number of tokens: 22.10G; total number of sentence fragments: 3.35G
OpenSubtitles is a collection of multilingual parallel corpora. The dataset is compiled from a large database of movie and TV subtitles and includes a total of 1,689 bitexts spanning 2.6 billion sentences across 60 languages.
https://creativecommons.org/publicdomain/zero/1.0/
Dive into the world of French dialogue with the French Movie Subtitle Conversations dataset – a comprehensive collection of over 127,000 movie subtitle conversations. This dataset offers a deep exploration of authentic and diverse conversational contexts spanning various genres, eras, and scenarios. It is thoughtfully organized into three distinct sets: training, testing, and validation.
Each conversation in this dataset is structured as a JSON object, featuring three key attributes: context, knowledge, and response.
Here's a snippet from the dataset to give you an idea of its structure:
```
[
  {
    "context": [
      "Tu as attendu longtemps?",
      "Oui en effet.",
      "Je pense que c' est grossier pour un premier rencard.",
      // ... (6 more lines of context)
    ],
    "knowledge": "",
    "response": "On n' avait pas dit 9h?"
  },
  // ... (more data samples)
]
```
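Records in this shape can be loaded with the standard `json` module. The sketch below is illustrative: `load_conversations` and `to_prompt` are helper names of my own, and the actual split filenames may differ.

```python
import json

def load_conversations(path):
    """Load a list of conversation objects with context/knowledge/response keys."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def to_prompt(sample):
    """Join the context turns into a single dialogue prompt string."""
    return "\n".join(sample["context"])

# A miniature sample mirroring the structure shown above.
sample = {
    "context": ["Tu as attendu longtemps?", "Oui en effet."],
    "knowledge": "",
    "response": "On n' avait pas dit 9h?",
}
print(to_prompt(sample))
```

The `response` field is the target turn, so a `(to_prompt(sample), sample["response"])` pair is one training example for a dialogue model.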
The French Movie Subtitle Conversations dataset serves as a valuable resource for several applications:
We extend our gratitude to the movie subtitle community for their contributions, which have enabled the creation of this diverse and comprehensive French dialogue dataset.
Unlock the potential of authentic French conversations today with the French Movie Subtitle Conversations dataset. Engage in state-of-the-art research, enhance language models, and create applications that resonate with the nuances of real dialogue.
Attribution-NonCommercial-NoDerivs 3.0 (CC BY-NC-ND 3.0): https://creativecommons.org/licenses/by-nc-nd/3.0/
License information was derived automatically
techiaith/YouTube-Subtitles dataset hosted on Hugging Face and contributed by the HF Datasets community
This dataset includes subtitle files in the SRT (SubRip Subtitle) format for several popular movies, such as Oppenheimer and Tenet. SRT files are plain-text files widely used for subtitles, containing a series of structured entries to synchronize text with video content. Each entry in an SRT file comprises a sequential index number, a timestamp line giving the start and end times in the format hours:minutes:seconds,milliseconds, and one or more lines of subtitle text. For example:
```
1
00:00:01,500 --> 00:00:04,000
This is a sample subtitle.

2
00:00:04,500 --> 00:00:07,000
Here is another subtitle to demonstrate multiple entries.
```
This straightforward format is highly compatible with media players and easy to edit. SRT files enhance accessibility by providing subtitles for different languages or accommodating viewers with hearing impairments, enriching the experience of enjoying these popular movies.
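Because the format is so simple, a basic parser fits in a few lines. This is a minimal sketch of my own (real-world files may additionally need handling for BOMs, CRLF line endings, and styling tags):

```python
import re

# Matches the hours:minutes:seconds,milliseconds timestamp format.
TIME = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")

def parse_time(ts):
    """Convert an SRT timestamp string to seconds as a float."""
    h, m, s, ms = map(int, TIME.match(ts).groups())
    return h * 3600 + m * 60 + s + ms / 1000.0

def parse_srt(text):
    """Return a list of (index, start_seconds, end_seconds, text) entries."""
    entries = []
    for block in text.strip().split("\n\n"):
        lines = block.splitlines()
        idx = int(lines[0])
        start, _, end = lines[1].partition(" --> ")
        entries.append((idx, parse_time(start), parse_time(end), "\n".join(lines[2:])))
    return entries

sample = """1
00:00:01,500 --> 00:00:04,000
This is a sample subtitle.

2
00:00:04,500 --> 00:00:07,000
Here is another subtitle to demonstrate multiple entries."""

entries = parse_srt(sample)
```

Applied to the example above, this yields two entries, the first starting at 1.5 seconds and ending at 4.0 seconds.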
A multilingual corpus of movie subtitles aligned on the sentence-level. Contains data on more than 50 languages with a focus on the Indo-European language family. Morphosyntactic annotation (part-of-speech, features, dependencies) in Universal Dependency-style is available for 47 languages.
https://www.datainsightsmarket.com/privacy-policy
The real-time subtitles market is experiencing robust growth, driven by the increasing demand for accessible content across diverse platforms and languages. The market's expansion is fueled by several key factors: the rising adoption of streaming services and online video platforms, growing accessibility regulations mandating subtitles for various media, and the proliferation of multilingual content consumption. Technological advancements, such as improved speech-to-text accuracy and AI-powered subtitle generation, are further accelerating market growth. The market is segmented by technology (e.g., cloud-based, on-premise), application (e.g., live streaming, video conferencing, education), and end-user (e.g., media & entertainment, corporate, education). Competitive landscape analysis reveals a mix of established players and emerging technology companies, vying for market share through innovation in accuracy, speed, and integration with existing workflows. The forecast period (2025-2033) anticipates continued expansion, with a projected compound annual growth rate (CAGR) reflecting the increasing penetration of real-time subtitling across diverse industries and regions.

Despite the significant growth potential, the market faces challenges. High initial investment costs for advanced technologies, the need for highly skilled professionals for accurate transcription and quality control, and variations in language complexities and accents can all constrain market penetration. However, these challenges are being addressed through continuous innovation, including the development of more affordable and user-friendly solutions, improvements in automated transcription technology, and increased accessibility of training programs. Overcoming these hurdles will be crucial for ensuring the continued and sustainable growth of the real-time subtitles market throughout the forecast period.
The market is expected to reach a substantial value by 2033, driven by consistent technological advancements, regulatory support, and rising demand.
https://creativecommons.org/publicdomain/zero/1.0/
This dataset includes statistics about durations between two consecutive subtitles in 5,000 top-ranked IMDB movies. The dataset can be used to understand how dialogue is used in films and to develop tools to improve the watching experience. This notebook contains the code and data that were used to create this dataset.
Dataset statistics:
Dataset use cases:
Data Analysis:
The next histogram shows the distribution of movie runtimes in minutes. The mean runtime is 99.903 minutes, the maximum runtime is 877 minutes, and the median runtime is 98.5 minutes.
Figure 1: Histogram of the runtime in minutes
The next histogram shows the distribution of the percentage of gaps (durations between two consecutive subtitles) relative to total movie runtime. The reported mean percentage of gaps is 0.187, the reported maximum is 0.033, and the reported median is 327.586.
Figure 2: Histogram of the percentage of gaps (durations between two consecutive subtitles) relative to total movie runtime
The next histogram shows the distribution of each movie's total subtitle duration in seconds. The mean subtitle duration is 4,837.089 seconds and the median subtitle duration is 2,906.435 seconds.
Figure 3: Histogram of each movie's total subtitle duration (seconds)
Example use case:
The Dynamic Adjustment of Playback Speed (DAPS), a VLC extension, can be used to save time while watching movies by increasing the playback speed between dialogues. However, it is essential to choose appropriate settings for the extension, as increasing the playback speed can affect the overall tone and impact of the film.
The dataset of 5,000 top-ranked movie subtitle durations can be used to help users choose the appropriate settings for the DAPS extension. For example, users who are watching a fast-paced action movie may want to set a higher minimum duration between subtitles before speeding up, while users who are watching a slow-paced drama movie may want to set a lower minimum duration.
Additionally, users can use the dataset to understand how the different settings of the DAPS extension impact the overall viewing experience. For example, users can experiment with different settings to see how they affect the pacing of the movie and the overall impact of the dialogue scenes.
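As a sketch of how such gap-based tuning might work, the function below finds gaps between consecutive subtitle cues that exceed a minimum duration, i.e. the candidate ranges for speeding up playback. The function name and threshold are illustrative, not part of the dataset or the DAPS extension:

```python
def subtitle_gaps(cues, min_gap=2.0):
    """Given a list of (start, end) cue times in seconds, return the
    (gap_start, gap_end) ranges between consecutive cues that last at
    least min_gap seconds."""
    gaps = []
    for (_, e1), (s2, _) in zip(cues, cues[1:]):
        if s2 - e1 >= min_gap:
            gaps.append((e1, s2))
    return gaps

# Three cues: two close together, then a long silence.
cues = [(1.5, 4.0), (4.5, 7.0), (12.0, 15.0)]
print(subtitle_gaps(cues))  # gaps of at least 2s: [(7.0, 12.0)]
```

Raising `min_gap` (e.g. for fast-paced action films) yields fewer, longer speed-up ranges; lowering it (e.g. for slow dramas) yields more.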
Conclusion
This dataset is a valuable resource for researchers and developers who are interested in understanding and improving the use of dialogue in movies or in tools for watching movies.
https://creativecommons.org/publicdomain/zero/1.0/
A collection of movie subtitle translations from opensubtitles.org. The dataset contains two features: the Indonesian translation and the English source subtitle.
| feature | description |
|---|---|
| id | Indonesian translation of the subtitle |
| en | English source subtitle |
Translation EDA
For more information, see:
J. Tiedemann, 2016, Finding Alternative Translations in a Large Corpus of Movie Subtitles. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016)
Citation
P. Lison and J. Tiedemann, 2016, OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016)
Improved comprehension and a deeper understanding of accents were the most common reasons American adults gave for using subtitles while watching TV in a language they knew, according to a survey conducted between June and July 2023. Another 33 percent of respondents stated that they did so because they were in a noisy environment.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Exploring language usage through frequency analysis in large corpora is a defining feature of most recent work in corpus and computational linguistics. From a psycholinguistic perspective, however, the corpora used in these contributions are often not representative of language usage: they are either domain-specific, limited in size, or extracted from unreliable sources. In an effort to address this limitation, we introduce SubIMDB, a corpus of everyday spoken language containing over 225 million words. The corpus was extracted from 38,102 subtitles of family, comedy, and children's movies and series, and is the first sizeable structured corpus of subtitles made available. Our experiments show that word frequency norms extracted from this corpus are more effective than those from well-known norms such as Kucera-Francis, HAL, and SUBTLEXus in predicting various psycholinguistic properties of words, such as lexical decision times, familiarity, age of acquisition, and simplicity. We also provide evidence that contradicts the long-standing assumption that the ideal size for a corpus can be determined solely based on how well its word frequencies correlate with lexical decision times.
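Frequency norms of this kind ultimately reduce to counting token occurrences across a corpus and normalizing. A toy sketch (the function name and log-per-million normalization are illustrative, not SubIMDB's exact procedure):

```python
from collections import Counter
import math

def frequency_norms(documents):
    """Count whitespace tokens across documents and return log10
    frequency per million tokens, a common psycholinguistic norm."""
    counts = Counter()
    for doc in documents:
        counts.update(doc.lower().split())
    total = sum(counts.values())
    return {w: math.log10(c / total * 1_000_000) for w, c in counts.items()}

norms = frequency_norms(["the cat sat", "the dog ran"])
```

Here "the" occurs twice among six tokens, so its norm is log10(2/6 x 10^6), roughly 5.52; rarer words score lower, mirroring how high-frequency words tend to have faster lexical decision times.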
https://dataintelo.com/privacy-and-policy
The global real time subtitles market size was valued at approximately USD 2.5 billion in 2023 and is expected to surge to around USD 6.8 billion by 2032, growing at a compound annual growth rate (CAGR) of 11.5% during the forecast period. This notable growth can be attributed to several factors, including the rising demand for accessible content, advancements in artificial intelligence (AI) and machine learning (ML) technologies, and the increasing globalization of media and corporate communications.
One of the primary growth factors driving the real time subtitles market is the increasing emphasis on accessibility and inclusiveness in media and communications. Governments and organizations worldwide are instituting regulations and policies requiring content to be accessible to individuals who are deaf or hard of hearing. For instance, the Americans with Disabilities Act (ADA) in the United States mandates that video content be accessible, propelling the adoption of real-time subtitle solutions. This regulatory environment, coupled with growing social awareness, significantly fuels market growth.
Another critical driver is the rapid advancement of AI and ML technologies, which have revolutionized the accuracy and efficiency of real-time subtitle generation. Modern AI-driven subtitle solutions can now offer near-perfect synchronization and error-free transcription, enhancing user experience. These technological advancements are making real-time subtitles more reliable and scalable, thereby increasing their adoption across various sectors such as broadcasting, education, and corporate communications.
The globalization of media content and corporate operations further contributes to the market's expansion. As companies and content creators aim to reach a global audience, the need for multilingual subtitle solutions becomes imperative. Real-time subtitles facilitate effective communication across different languages and cultural contexts, thereby broadening the reach and appeal of content. This globalization trend is particularly evident in the streaming services sector, where platforms are increasingly providing real-time subtitles in multiple languages to cater to diverse audiences.
Film Subtitling plays a crucial role in the globalization of media content, as it allows films to reach audiences across different linguistic and cultural backgrounds. With the rise of streaming platforms and international film festivals, the demand for high-quality film subtitling services has surged. These services not only enhance the accessibility of films for non-native speakers but also preserve the original context and cultural nuances of the content. As the film industry continues to expand its global footprint, the importance of accurate and culturally sensitive film subtitling cannot be overstated. This trend is particularly significant for independent filmmakers and studios aiming to distribute their content internationally, as it opens up new markets and increases viewership.
Regionally, North America and Europe are currently the largest markets for real-time subtitles, driven by stringent accessibility regulations and the advanced state of digital infrastructure. However, the Asia Pacific region is expected to witness the highest growth rate during the forecast period, owing to increasing internet penetration, the proliferation of digital content, and rising awareness about accessibility. China and India, with their massive consumer bases and growing digital economies, are poised to be significant contributors to this regional market growth.
The real time subtitles market by component can be broadly categorized into software, hardware, and services. Each of these segments plays a crucial role in the comprehensive ecosystem of real-time subtitle solutions. The software segment includes various applications and platforms that facilitate subtitle generation and synchronization. This segment is expected to dominate the market due to continuous advancements in AI and ML algorithms that significantly improve the accuracy and efficiency of subtitle generation. Companies are investing heavily in R&D to develop innovative software solutions that cater to diverse linguistic and accessibility needs.
The hardware segment encompasses the physical devices required to support real-time subtitle generation and display. These include specialized subtitle generation hardware,
https://www.marketresearchforecast.com/privacy-policy
The global film subtitling market is experiencing robust growth, driven by the increasing consumption of on-demand video content across diverse languages and regions. The rising popularity of streaming platforms like Netflix, Amazon Prime, and Disney+, coupled with the expansion of international film productions, fuels the demand for high-quality subtitling services. This market is segmented by language (Native, Foreign, Minority, Special) and application (Drama, Comedy, Horror, Romance, Action, Other). While precise market sizing requires further data, considering the significant growth in streaming and global film production, a reasonable estimation for the 2025 market size could be in the range of $2.5 to $3 billion USD. A Compound Annual Growth Rate (CAGR) of 8-10% is plausible over the forecast period (2025-2033), reflecting continued market expansion.

Key growth drivers include increased globalization of media content, the rise of multilingual audiences, accessibility requirements for diverse viewers, and technological advancements in subtitling software and automation. However, the market also faces restraints. These include fluctuating language-specific demand, varying quality standards across subtitling providers, the need for skilled and experienced linguists, and the potential for copyright issues surrounding unauthorized subtitling.

The competition is fierce among established players like PoliLingua, JBI Studios, and BTI Studios, as well as smaller, specialized firms. The geographical distribution of the market is expected to be broadly diversified, with North America and Europe representing significant market shares, though Asia-Pacific is projected to experience substantial growth in the coming years due to the booming entertainment industry and expanding internet penetration. The increasing adoption of AI-powered subtitling tools might streamline the process but poses challenges related to maintaining accuracy and cultural nuances.
Strategic partnerships and investments in technological innovation are crucial for companies to maintain competitiveness and cater to the evolving demands of the film subtitling industry.
https://www.datainsightsmarket.com/privacy-policy
The global video subtitle translation services market is experiencing robust growth, driven by the proliferation of video content across various platforms and the increasing demand for accessibility and global reach. The market's expansion is fueled by several key factors. Firstly, the rise of streaming services and online video platforms necessitates multilingual subtitles to cater to a diverse global audience. Secondly, the growing emphasis on accessibility for individuals with hearing impairments is driving demand for accurate and high-quality subtitles. Thirdly, advancements in artificial intelligence (AI) and machine learning (ML) technologies are enhancing the speed and efficiency of translation processes, making the service more cost-effective. Finally, globalization and increased cross-border communication are further propelling market growth.

We estimate the market size in 2025 to be approximately $2.5 billion, with a Compound Annual Growth Rate (CAGR) of 15% from 2025 to 2033, leading to a projected market value of around $7.8 billion by 2033. This growth trajectory is anticipated despite certain restraints, such as the need for human oversight to ensure accuracy and cultural nuances in translations, and the challenges associated with handling diverse dialects and languages.

Market segmentation plays a crucial role in understanding the landscape. While specific segment breakdowns aren't provided, we can infer significant segments based on industry trends. These likely include language pairs (e.g., English to Spanish, English to Mandarin), video type (e.g., corporate videos, films, educational content), and service type (e.g., human translation, machine translation with post-editing). The competitive landscape is characterized by a mix of established players like Stepes, Ai-Media, and 3Play Media, and smaller, specialized companies catering to niche markets.
The ongoing technological advancements and increasing market demand indicate that the video subtitle translation services market is poised for sustained, considerable growth in the coming years, creating opportunities for both established and emerging players.
According to a survey of viewers who watch foreign content, as of November 2021, subtitling video content was preferred over dubbing in the United States and the United Kingdom, with ** percent and ** percent of respondents, respectively, preferring the former. By comparison, ** percent of video viewers in Italy reported preferring dubbing, while in Germany this number rose to ********* respondents.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The Japanese-English Subtitle Corpus is a large Japanese-English parallel corpus covering the underrepresented domain of conversational dialogue. It consists of more than 3.2 million examples, making it the largest freely available dataset of its kind. The corpus was assembled by crawling and aligning subtitles found on the web.
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Subtitle files (.xlsx format) from Experiment 1 and Experiment 2 from the project "SURE - Exploring Subtitle Reading Process with Eyetracking Technology", supported by a grant from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie Grant Agreement No. 702606.
In each experiment, there are files in English, Polish and Spanish, with time codes, as used in the study.
Each file contains three versions: subtitled at 12, 16 and 20 characters per second (cps).
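For reference, a cps value is simply a cue's visible character count divided by its display time in seconds. A minimal illustration (the function name is mine, and real subtitling workflows may exclude spaces or formatting tags from the count):

```python
def chars_per_second(text, start, end):
    """Presentation rate: characters shown divided by display time in seconds."""
    return len(text) / (end - start)

# A 40-character line displayed for 2.5 seconds reads at 16 cps,
# matching the middle condition in the study above.
rate = chars_per_second("x" * 40, 10.0, 12.5)
```

At 12 cps the same line would need to stay on screen longer (about 3.3 seconds), which is why slower rates are often considered more comfortable to read.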
https://www.datainsightsmarket.com/privacy-policy
The global subtitling services market, currently valued at $1259 million in 2025, is projected to experience robust growth, driven by the increasing consumption of video content across diverse platforms and languages. The compound annual growth rate (CAGR) of 6.4% from 2025 to 2033 indicates a significant expansion in market size, exceeding $2000 million by the end of the forecast period. This growth is fueled by several key factors, including the rise of streaming services, the increasing demand for multilingual content accessibility, and the growing popularity of online education and e-learning platforms that necessitate subtitling for wider reach and inclusivity. Furthermore, advancements in automated subtitling technologies are streamlining the process, reducing costs, and improving turnaround times, contributing positively to market expansion. However, challenges such as ensuring high-quality translations that accurately convey nuances and cultural contexts, and managing the complexities of different dialects and accents, remain crucial aspects for service providers to address.

Competitive intensity is moderate, with a range of companies, including both established players like 3Play Media and emerging players like GoPhrazy, catering to diverse client needs. The market is segmented based on factors such as service type (e.g., live subtitling, post-production subtitling), industry vertical (e.g., media and entertainment, education, corporate), and language pairs. The geographical distribution of revenue likely shows significant concentration in North America and Europe, reflecting high internet penetration and media consumption in these regions, but emerging markets in Asia and Latin America also present significant growth opportunities. Continued innovation in AI-powered subtitling tools and a focus on providing accurate, culturally sensitive translations will be crucial for companies to maintain a competitive edge in this dynamic market.
The original extracted versions (in .srt and .ass format) are also included in this release (which Kaggle decompressed, for reasons unknown).
This dataset contains 1,497,770 messages across 3,836 episodes of anime. The raw dataset contains 1,563,442 messages, some of which were removed during cleaning.
This version (V4) adapts the original (frankly, terrible) format into the newer format I developed, which is used in https://github.com/JEF1056/clean-discord. The Dataset folder contains compressed text files, which are compatible with TensorFlow Datasets. These can be streamed as a TextLineDataset in TSV format.
V4 also fixes many (but not all) issues that the original cleaning script was too simple to handle. It also uses the clean-discord cleaning algorithms to make sentences read more like natural language than chat formatting. The script has also been optimized to run on multi-core systems, allowing it to clean this entire dataset in under 30 seconds on a 4-core machine. See the new and improved script here: https://github.com/JEF1056/clean-discord/blob/v1.2/misc/anime.py (no longer bundled in the dataset files).
The files are now all compressed to save space and are compatible with TensorFlow Datasets. You can initialize a dataset function as follows:

```python
import functools
import os

import tensorflow as tf

def dataset_fn_local(split, shuffle_files=False):
    global nq_tsv_path
    del shuffle_files
    # Load lines from the text files as examples.
    files_to_read = [os.path.join(nq_tsv_path[split], filename)
                     for filename in os.listdir(nq_tsv_path[split])
                     if filename.startswith(split)]
    print(f"Split {split} contains {len(files_to_read)} files.\n"
          f"First 10: {files_to_read[0:10]}")
    # Stream non-empty lines from the gzip-compressed TSV files.
    ds = tf.data.TextLineDataset(files_to_read, compression_type="GZIP").filter(
        lambda line: tf.not_equal(tf.strings.length(line), 0))
    ds = ds.shuffle(buffer_size=600000)
    # Split each line into its two tab-separated fields.
    ds = ds.map(functools.partial(tf.io.decode_csv, record_defaults=["", ""],
                                  field_delim="\t", use_quote_delim=False),
                num_parallel_calls=tf.data.experimental.AUTOTUNE)
    ds = ds.map(lambda *ex: dict(zip(["question", "answer"], ex)))
    return ds
```
A sincere thanks to all of my friends for helping me come up with anime titles, a shoutout to the talented and dedicated people translating Japanese anime, and an even bigger thanks to Leen Chan for compiling the actual subtitles.
This dataset is far from complete! I hope that people willing to find, add, and clean data are out there and will help out in the effort to grow this dataset.
HowTo100M-subtitles-small
The subtitles from a subset of the HowTo100M dataset.