A multilingual corpus of movie subtitles aligned on the sentence-level. Contains data on more than 50 languages with a focus on the Indo-European language family. Morphosyntactic annotation (part-of-speech, features, dependencies) in Universal Dependency-style is available for 47 languages.
This dataset was created by Rahul Kaushik.
Released under Other (specified in description)
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Subtitles is a dataset for object detection tasks - it contains Letters annotations for 500 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
Subscene is a vast collection of multilingual subtitles, encompassing 65 different languages and consisting of more than 30 billion tokens with a total size of 410.70 GB. This dataset includes subtitles for movies, series, and animations gathered from the Subscene dump. It provides a rich resource for studying language variations and building multilingual NLP models. We have carefully applied a fastText classifier to remove any non-language content from incorrect subsets. Additionally, we performed basic cleaning and filtration. However, there is still room for further cleaning and refinement.
Attribution-NonCommercial-NoDerivs 3.0 (CC BY-NC-ND 3.0) https://creativecommons.org/licenses/by-nc-nd/3.0/
License information was derived automatically
techiaith/YouTube-Subtitles dataset hosted on Hugging Face and contributed by the HF Datasets community
The dataset is used to test the proposed methodologies for mining parallel data from comparable corpora.
daliselmi/french-conversations-from-movie-subtitles dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Alan Wake 2 Subtitles is a dataset for object detection tasks - it contains Subtitle annotations for 565 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
https://www.datainsightsmarket.com/privacy-policy
The real-time subtitles market is experiencing robust growth, driven by the increasing demand for accessible content across diverse platforms and languages. The market's expansion is fueled by several key factors: the rising adoption of streaming services and online video platforms, growing accessibility regulations mandating subtitles for various media, and the proliferation of multilingual content consumption. Technological advancements, such as improved speech-to-text accuracy and AI-powered subtitle generation, are further accelerating market growth. The market is segmented by technology (e.g., cloud-based, on-premise), application (e.g., live streaming, video conferencing, education), and end-user (e.g., media & entertainment, corporate, education). Competitive landscape analysis reveals a mix of established players and emerging technology companies, vying for market share through innovation in accuracy, speed, and integration with existing workflows. The forecast period (2025-2033) anticipates continued expansion, with a projected compound annual growth rate (CAGR) reflecting the increasing penetration of real-time subtitling across diverse industries and regions. Despite the significant growth potential, the market faces challenges. High initial investment costs for advanced technologies, the need for highly skilled professionals for accurate transcription and quality control, and variations in language complexities and accents can all constrain market penetration. However, these challenges are being addressed through continuous innovation, including the development of more affordable and user-friendly solutions, improvements in automated transcription technology, and increased accessibility of training programs. Overcoming these hurdles will be crucial for ensuring the continued and sustainable growth of the real-time subtitles market throughout the forecast period. 
The market is expected to reach a substantial value by 2033, driven by consistent technological advancements, regulatory support, and rising demand.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Exploring language usage through frequency analysis in large corpora is a defining feature of most recent work in corpus and computational linguistics. From a psycholinguistic perspective, however, the corpora used in these contributions are often not representative of language usage: they are either domain-specific, limited in size, or extracted from unreliable sources. In an effort to address this limitation, we introduce SubIMDB, a corpus of everyday spoken language that we created, which contains over 225 million words. The corpus was extracted from 38,102 subtitles of family, comedy and children's movies and series, and is the first sizeable structured corpus of subtitles made available. Our experiments show that word frequency norms extracted from this corpus are more effective than those from well-known norms such as Kucera-Francis, HAL and SUBTLEXus in predicting various psycholinguistic properties of words, such as lexical decision times, familiarity, age of acquisition and simplicity. We also provide evidence that contradicts the long-standing assumption that the ideal size for a corpus can be determined solely based on how well its word frequencies correlate with lexical decision times.
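As a minimal illustration of what a word frequency norm is, the sketch below counts tokens in a few subtitle lines and scales the counts to occurrences per million words. The tokenizer regex, the sample lines, and the function name `frequency_per_million` are assumptions made for illustration; this is not the SubIMDB pipeline.

```python
import re
from collections import Counter

def frequency_per_million(lines):
    """Return {word: occurrences per million tokens} for an iterable of text lines."""
    counts = Counter()
    for line in lines:
        # Lowercase and keep runs of letters and apostrophes as tokens (assumed tokenizer).
        counts.update(re.findall(r"[a-z']+", line.lower()))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: count * 1_000_000 / total for word, count in counts.items()}

# Made-up subtitle lines, for illustration only.
sample = ["Where are you going?", "You are late.", "I am going home."]
norms = frequency_per_million(sample)
```

Scaling to a per-million rate makes counts comparable across corpora of different sizes.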
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset contains processed movie subtitle data prepared for sentiment analysis, which was performed using IBM Watson Natural Language Understanding (IBM NLU). The source data contains Slovak and English subtitles from 10 movies, which are matched into pairs. Each subtitle is matched with a machine translation generated using Google Translate and a sentiment score identified using the OpenAI GPT model. The next matrix holds the processed results of the IBM NLU sentiment analysis for each segment. The third file contains the results of validating the accuracy and error rates of the machine translations using the BLEU and TER metrics.
Enhancement of comprehension and more profound understanding of accents were the most common reasons why American adults use subtitles while watching TV in a known language, according to a survey conducted between June and July 2023. Another ** percent of the respondents stated that they did so because they were in a noisy environment.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Survivor Subtitles Dataset
Dataset Description
A collection of subtitles from the American reality television show "Survivor", spanning seasons 1 through 47. The dataset contains subtitle text extracted from episode broadcasts.
Source
The subtitles were obtained from OpenSubtitles.com.
Dataset Details
Coverage:
- Seasons: 1-47
- Episodes per season: ~13-14
- Total episodes: ~600
Format:
Text files containing timestamped subtitle data Character… See the full description on the dataset page: https://huggingface.co/datasets/hipml/survivor-subtitles.
https://www.datainsightsmarket.com/privacy-policy
The global subtitles editor market is experiencing robust growth, driven by the increasing consumption of video content across various languages and platforms. The market's expansion is fueled by several key factors. The rise of streaming services and online video platforms necessitates accurate and efficient subtitling for broader audience reach. Furthermore, the increasing demand for accessible media for individuals with hearing impairments is a significant driver. Educational institutions and businesses increasingly utilize subtitles for training materials and online courses, further boosting market demand. Technological advancements, such as the development of AI-powered automated subtitling tools, are streamlining the subtitling process, leading to increased efficiency and reduced costs. However, challenges remain, including the need for skilled human editors to ensure accuracy and quality, as well as the linguistic nuances that automated tools may overlook. Market segmentation reveals strong demand from media workers, subtitle translators, and educators, with software solutions currently dominating the market share. The market is geographically diverse, with North America and Europe representing significant portions of the market, but strong growth potential exists in Asia-Pacific and other emerging regions as internet penetration and video consumption continue to rise. We estimate a current market size of approximately $300 million in 2025, with a projected CAGR of 15% from 2025 to 2033. This growth trajectory suggests a sizeable market opportunity for established players and new entrants alike. The competitive landscape is fragmented, with a mix of established software providers and newer AI-powered solutions vying for market share. Companies are focusing on developing user-friendly interfaces, advanced features like real-time subtitling and multilingual support, and efficient integration with video editing platforms. 
The ongoing innovation in AI-powered transcription and translation technologies is expected to further transform the market, potentially leading to greater efficiency and affordability. However, maintaining accuracy and addressing the ethical considerations of AI implementation will remain critical for sustained growth and market acceptance. The focus on providing highly accurate and culturally sensitive translations will also be vital in penetrating new markets globally, particularly in regions with diverse languages and dialects. Future growth hinges on delivering value-added services, such as quality control, streamlined workflows, and collaborative platforms, in response to the ever-evolving needs of video content creators and consumers.
https://www.datainsightsmarket.com/privacy-policy
The global video subtitle translation services market is experiencing robust growth, driven by the proliferation of video content across various platforms and the increasing demand for accessibility and global reach. The market's expansion is fueled by several key factors. Firstly, the rise of streaming services and online video platforms necessitates multilingual subtitles to cater to a diverse global audience. Secondly, the growing emphasis on accessibility for individuals with hearing impairments is driving demand for accurate and high-quality subtitles. Thirdly, advancements in artificial intelligence (AI) and machine learning (ML) technologies are enhancing the speed and efficiency of translation processes, making the service more cost-effective. Finally, globalization and increased cross-border communication are further propelling market growth. We estimate the market size in 2025 to be approximately $2.5 billion, with a Compound Annual Growth Rate (CAGR) of 15% from 2025 to 2033, leading to a projected market value of around $7.8 billion by 2033. This growth trajectory is anticipated despite certain restraints, such as the need for human oversight to ensure accuracy and cultural nuances in translations, and the challenges associated with handling diverse dialects and languages. Market segmentation plays a crucial role in understanding the landscape. While specific segment breakdowns aren't provided, we can infer significant segments based on industry trends. These likely include language pairs (e.g., English to Spanish, English to Mandarin), video type (e.g., corporate videos, films, educational content), and service type (e.g., human translation, machine translation with post-editing). The competitive landscape is characterized by a mix of established players like Stepes, Ai-Media, and 3Play Media, and smaller, specialized companies catering to niche markets. 
The ongoing technological advancements and increasing market demand indicate that the video subtitle translation services market is poised for sustained, considerable growth in the coming years, creating opportunities for both established and emerging players.
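The projection quoted above can be checked with the standard compound-growth formula. The sketch below assumes annual compounding over the eight years from 2025 to 2033; the function name `project_value` is hypothetical.

```python
def project_value(start_value, cagr, years):
    """Compound start_value forward by `years` periods at annual rate `cagr`."""
    return start_value * (1 + cagr) ** years

# Figures quoted in the text: $2.5 billion in 2025, 15% CAGR, through 2033.
projected_2033 = project_value(2.5, 0.15, 2033 - 2025)
print(f"Projected 2033 value: ${projected_2033:.2f} billion")
```

Under these assumptions the projection comes to about $7.65 billion, in the same range as the quoted figure of around $7.8 billion; the gap may come from rounding or a different base year.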
https://dataintelo.com/privacy-and-policy
The global real-time subtitles market size was valued at approximately USD 2.5 billion in 2023 and is expected to surge to around USD 6.8 billion by 2032, growing at a compound annual growth rate (CAGR) of 11.5% during the forecast period. This notable growth can be attributed to several factors, including the rising demand for accessible content, advancements in artificial intelligence (AI) and machine learning (ML) technologies, and the increasing globalization of media and corporate communications.
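Going the other way, the growth rate implied by a start and end value can be recovered by inverting the compound-growth formula. This is a sketch that assumes annual compounding over the nine years from 2023 to 2032; the function name `implied_cagr` is hypothetical.

```python
def implied_cagr(start_value, end_value, years):
    """Return the annual growth rate that turns start_value into end_value over `years`."""
    return (end_value / start_value) ** (1 / years) - 1

# Figures quoted in the text: USD 2.5 billion (2023) to USD 6.8 billion (2032).
rate = implied_cagr(2.5, 6.8, 2032 - 2023)
print(f"Implied CAGR: {rate:.1%}")
```

With these inputs the implied rate is a little under 12%, in the neighborhood of the quoted 11.5%; the exact figure depends on which years the report counts as the forecast period.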
One of the primary growth factors driving the real-time subtitles market is the increasing emphasis on accessibility and inclusiveness in media and communications. Governments and organizations worldwide are instituting regulations and policies requiring content to be accessible to individuals who are deaf or hard of hearing. For instance, the Americans with Disabilities Act (ADA) in the United States mandates that video content be accessible, propelling the adoption of real-time subtitle solutions. This regulatory environment, coupled with growing social awareness, significantly fuels market growth.
Another critical driver is the rapid advancement of AI and ML technologies, which have revolutionized the accuracy and efficiency of real-time subtitle generation. Modern AI-driven subtitle solutions can now offer near-perfect synchronization and error-free transcription, enhancing user experience. These technological advancements are making real-time subtitles more reliable and scalable, thereby increasing their adoption across various sectors such as broadcasting, education, and corporate communications.
The globalization of media content and corporate operations further contributes to the market's expansion. As companies and content creators aim to reach a global audience, the need for multilingual subtitle solutions becomes imperative. Real-time subtitles facilitate effective communication across different languages and cultural contexts, thereby broadening the reach and appeal of content. This globalization trend is particularly evident in the streaming services sector, where platforms are increasingly providing real-time subtitles in multiple languages to cater to diverse audiences.
Film subtitling plays a crucial role in the globalization of media content, as it allows films to reach audiences across different linguistic and cultural backgrounds. With the rise of streaming platforms and international film festivals, the demand for high-quality film subtitling services has surged. These services not only enhance the accessibility of films for non-native speakers but also preserve the original context and cultural nuances of the content. As the film industry continues to expand its global footprint, the importance of accurate and culturally sensitive film subtitling cannot be overstated. This trend is particularly significant for independent filmmakers and studios aiming to distribute their content internationally, as it opens up new markets and increases viewership.
Regionally, North America and Europe are currently the largest markets for real-time subtitles, driven by stringent accessibility regulations and the advanced state of digital infrastructure. However, the Asia Pacific region is expected to witness the highest growth rate during the forecast period, owing to increasing internet penetration, the proliferation of digital content, and rising awareness about accessibility. China and India, with their massive consumer bases and growing digital economies, are poised to be significant contributors to this regional market growth.
The real-time subtitles market by component can be broadly categorized into software, hardware, and services. Each of these segments plays a crucial role in the comprehensive ecosystem of real-time subtitle solutions. The software segment includes various applications and platforms that facilitate subtitle generation and synchronization. This segment is expected to dominate the market due to continuous advancements in AI and ML algorithms that significantly improve the accuracy and efficiency of subtitle generation. Companies are investing heavily in R&D to develop innovative software solutions that cater to diverse linguistic and accessibility needs.
The hardware segment encompasses the physical devices required to support real-time subtitle generation and display. These include specialized subtitle generation hardware,
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Percentage of subtitles recognized correctly in Experiment 2.
The original extracted versions (in .srt and .ass format) are also included in this release (which, idk why, but kaggle decompressed >:U)
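For anyone working from the original .srt files, the sketch below shows one way to pull timed cues out of SRT text. It assumes the common SRT layout (an index line, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` timing line, then one or more text lines, with cues separated by blank lines); the helper names and the sample text are made up for illustration, and .ass files would need a different parser.

```python
import re

# Matches an SRT timing line such as "00:00:01,000 --> 00:00:03,500".
TIMING = re.compile(
    r"(\d{2}):(\d{2}):(\d{2}),(\d{3}) --> (\d{2}):(\d{2}):(\d{2}),(\d{3})"
)

def to_seconds(h, m, s, ms):
    """Convert hour, minute, second, millisecond strings to seconds."""
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000

def parse_srt(text):
    """Return a list of (start_seconds, end_seconds, text) cues from SRT content."""
    cues = []
    # Cues are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            match = TIMING.search(line)
            if match:
                parts = match.groups()
                start = to_seconds(*parts[:4])
                end = to_seconds(*parts[4:])
                # Everything after the timing line is cue text; join wrapped lines.
                cues.append((start, end, " ".join(lines[i + 1:]).strip()))
                break
    return cues

sample = """1
00:00:01,000 --> 00:00:03,500
Hello there.

2
00:00:04,000 --> 00:00:06,250
How are you
today?
"""
cues = parse_srt(sample)
```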
This dataset contains 1,497,770 messages across 3,836 episodes of anime. The raw dataset contains 1,563,442 messages, some of which were removed during cleaning.
This version (V4) adapts the original (frankly, terrible) format into the newer format I developed, which is used in https://github.com/JEF1056/clean-discord. The Dataset folder contains compressed text files, which are compatible with TensorFlow datasets. These can be streamed as a TextLineDataset in the TSV format.
V4 also fixes many (but not all) issues that the original cleaning script was too simple to realistically take care of. It also uses the clean-discord cleaner algorithms to make sentences read more like natural language than formatting. The script has also been optimized to run on multi-core systems, allowing it to complete cleaning this entire dataset in under 30 seconds on a 4-core machine. See the new and improved script here: https://github.com/JEF1056/clean-discord/blob/v1.2/misc/anime.py (no longer bundled in the dataset files)
The files are now all compressed to save space, and are compatible with TensorFlow datasets. You can initialize a dataset function as such:
import functools
import os

import tensorflow as tf

def dataset_fn_local(split, shuffle_files=False):
    # nq_tsv_path must be defined elsewhere; it is indexed by split name to get a directory.
    global nq_tsv_path
    del shuffle_files  # Unused; shuffling is done on the dataset below.
    # Load lines from the gzip-compressed text files as examples.
    files_to_read = [
        os.path.join(nq_tsv_path[split], filename)
        for filename in os.listdir(nq_tsv_path[split])
        if filename.startswith(split)
    ]
    print(f"~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Split {split} contains {len(files_to_read)} files.\nFirst 10: {files_to_read[0:10]}")
    # Drop empty lines before parsing.
    ds = tf.data.TextLineDataset(files_to_read, compression_type="GZIP").filter(lambda line: tf.not_equal(tf.strings.length(line), 0))
    ds = ds.shuffle(buffer_size=600000)
    # Parse each line as two tab-separated string fields, without quote handling.
    ds = ds.map(functools.partial(tf.io.decode_csv, record_defaults=["", ""], field_delim="\t", use_quote_delim=False), num_parallel_calls=tf.data.experimental.AUTOTUNE)
    ds = ds.map(lambda *ex: dict(zip(["question", "answer"], ex)))
    return ds
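For a quick look at the data without TensorFlow, the same files can be read with the standard library. This is a sketch that assumes, as in the function above, gzip-compressed files with two tab-separated fields per line and no quoting; the helper name `read_pairs` and the demo file contents are hypothetical.

```python
import gzip
import os
import tempfile

def read_pairs(path):
    """Yield {"question", "answer"} dicts from a gzip-compressed TSV file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue  # Skip empty lines, like the filter in the TensorFlow pipeline.
            fields = line.split("\t")
            if len(fields) != 2:
                continue  # Skip malformed rows instead of raising (a choice made for this sketch).
            yield {"question": fields[0], "answer": fields[1]}

# Demo on a small temporary file; the contents are made up for illustration.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "train-0.txt.gz")
    with gzip.open(demo_path, "wt", encoding="utf-8") as handle:
        handle.write("hello\thi there\n\nhow are you\tfine\n")
    pairs = list(read_pairs(demo_path))
```

The field names mirror the `question` and `answer` keys used in the TensorFlow function.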
A sincere thanks to all of my friends for helping me come up with anime titles, a shoutout to the talented and dedicated people translating Japanese anime, and an even bigger thanks to Leen Chan for compiling the actual subtitles.
This dataset is far from complete! I hope that people who are willing to find, add and clean the data are out there, and could do their best to help out in the effort to grow this data.
https://www.archivemarketresearch.com/privacy-policy
The global subtitle generator market is experiencing robust growth, driven by the increasing demand for multilingual content across various platforms. The market, valued at approximately $2.5 billion in 2025, is projected to exhibit a Compound Annual Growth Rate (CAGR) of 15% from 2025 to 2033. This significant expansion is fueled by several key factors. The rise of streaming services and online video platforms necessitates accessible content for diverse global audiences, boosting the adoption of subtitle generators. Furthermore, advancements in Artificial Intelligence (AI) and Natural Language Processing (NLP) technologies are leading to more accurate, efficient, and cost-effective subtitle generation solutions. The increasing accessibility of these technologies is also empowering independent content creators and smaller businesses to leverage subtitles for enhanced reach and engagement. Market segmentation reveals strong demand across both cloud-based and on-premise solutions, with the cloud-based segment experiencing faster growth due to its scalability and cost-effectiveness. Enterprise users account for a larger market share, driven by their need to manage large volumes of content and ensure multilingual compliance. Geographic analysis shows strong growth in North America and Asia Pacific, fueled by the large user bases and robust technological infrastructure in these regions. However, challenges such as maintaining accuracy in complex audio and ensuring the quality of machine-translated subtitles continue to present some restraints to market growth. Looking ahead, the subtitle generator market is poised for continued expansion. The integration of AI and machine learning will continue to enhance the accuracy and efficiency of subtitle generation. Increased demand for personalized and interactive subtitles, and the emergence of new applications in sectors like education and healthcare, will further propel market growth. 
Competitive landscape analysis reveals a mix of established players and innovative startups, leading to continuous innovation and improvement in the market offerings. This combination of technological advancements, increasing demand, and diverse applications promises a bright future for the subtitle generator market, setting the stage for significant expansion and market penetration in the coming years.