https://choosealicense.com/licenses/unknown/
This is a new collection of translated movie subtitles from http://www.opensubtitles.org/.
IMPORTANT: If you use the OpenSubtitles corpus, please add a link to http://www.opensubtitles.org/ on your website and in any reports and publications produced with the data!
This is a slightly cleaner version of the subtitle collection using improved sentence alignment and better language checking.
62 languages, 1,782 bitexts; total number of files: 3,735,070; total number of tokens: 22.10G; total number of sentence fragments: 3.35G
OpenSubtitles is a collection of multilingual parallel corpora. The dataset is compiled from a large database of movie and TV subtitles and includes a total of 1,689 bitexts spanning 2.6 billion sentences across 60 languages.
https://creativecommons.org/publicdomain/zero/1.0/
Dive into the world of French dialogue with the French Movie Subtitle Conversations dataset – a comprehensive collection of over 127,000 movie subtitle conversations. This dataset offers a deep exploration of authentic and diverse conversational contexts spanning various genres, eras, and scenarios. It is thoughtfully organized into three distinct sets: training, testing, and validation.
Each conversation in this dataset is structured as a JSON object, featuring three key attributes: context, knowledge, and response.
Here's a snippet from the dataset to give you an idea of its structure:
```
[
  {
    "context": [
      "Tu as attendu longtemps?",
      "Oui en effet.",
      "Je pense que c' est grossier pour un premier rencard.",
      // ... (6 more lines of context)
    ],
    "knowledge": "",
    "response": "On n' avait pas dit 9h?"
  },
  // ... (more data samples)
]
```
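Records in this shape can be loaded with the standard `json` module. The sketch below is illustrative: `load_conversations` and `to_prompt` are helper names of my own, and the actual split filenames may differ.

```python
import json

def load_conversations(path):
    """Load a list of conversation objects with context/knowledge/response keys."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def to_prompt(sample):
    """Join the context turns into a single dialogue prompt string."""
    return "\n".join(sample["context"])

# A miniature sample mirroring the structure shown above.
sample = {
    "context": ["Tu as attendu longtemps?", "Oui en effet."],
    "knowledge": "",
    "response": "On n' avait pas dit 9h?",
}
print(to_prompt(sample))
```

The `response` field is the target turn, so a `(to_prompt(sample), sample["response"])` pair is one training example for a dialogue model.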
The French Movie Subtitle Conversations dataset serves as a valuable resource for several applications:
We extend our gratitude to the movie subtitle community for their contributions, which have enabled the creation of this diverse and comprehensive French dialogue dataset.
Unlock the potential of authentic French conversations today with the French Movie Subtitle Conversations dataset. Engage in state-of-the-art research, enhance language models, and create applications that resonate with the nuances of real dialogue.
Attribution-NonCommercial-NoDerivs 3.0 (CC BY-NC-ND 3.0): https://creativecommons.org/licenses/by-nc-nd/3.0/
License information was derived automatically
techiaith/YouTube-Subtitles dataset hosted on Hugging Face and contributed by the HF Datasets community
This dataset includes subtitle files in the SRT (SubRip Subtitle) format for several popular movies, such as Oppenheimer and Tenet. SRT files are plain-text files widely used for subtitles, containing a series of structured entries to synchronize text with video content. Each entry in an SRT file comprises a sequential index number, a timestamp line giving the start and end times in the format hours:minutes:seconds,milliseconds, and one or more lines of subtitle text. For example:
```
1
00:00:01,500 --> 00:00:04,000
This is a sample subtitle.

2
00:00:04,500 --> 00:00:07,000
Here is another subtitle to demonstrate multiple entries.
```
This straightforward format is highly compatible with media players and easy to edit. SRT files enhance accessibility by providing subtitles for different languages or accommodating viewers with hearing impairments, enriching the experience of enjoying these popular movies.
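Because the format is so simple, a basic parser fits in a few lines. This is a minimal sketch of my own (real-world files may additionally need handling for BOMs, CRLF line endings, and styling tags):

```python
import re

# Matches the hours:minutes:seconds,milliseconds timestamp format.
TIME = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")

def parse_time(ts):
    """Convert an SRT timestamp string to seconds as a float."""
    h, m, s, ms = map(int, TIME.match(ts).groups())
    return h * 3600 + m * 60 + s + ms / 1000.0

def parse_srt(text):
    """Return a list of (index, start_seconds, end_seconds, text) entries."""
    entries = []
    for block in text.strip().split("\n\n"):
        lines = block.splitlines()
        idx = int(lines[0])
        start, _, end = lines[1].partition(" --> ")
        entries.append((idx, parse_time(start), parse_time(end), "\n".join(lines[2:])))
    return entries

sample = """1
00:00:01,500 --> 00:00:04,000
This is a sample subtitle.

2
00:00:04,500 --> 00:00:07,000
Here is another subtitle to demonstrate multiple entries."""

entries = parse_srt(sample)
```

Applied to the example above, this yields two entries, the first starting at 1.5 seconds and ending at 4.0 seconds.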
A multilingual corpus of movie subtitles aligned on the sentence-level. Contains data on more than 50 languages with a focus on the Indo-European language family. Morphosyntactic annotation (part-of-speech, features, dependencies) in Universal Dependency-style is available for 47 languages.
https://www.datainsightsmarket.com/privacy-policy
The real-time subtitles market is experiencing robust growth, driven by the increasing demand for accessible content across diverse platforms and languages. The market's expansion is fueled by several key factors: the rising adoption of streaming services and online video platforms, growing accessibility regulations mandating subtitles for various media, and the proliferation of multilingual content consumption. Technological advancements, such as improved speech-to-text accuracy and AI-powered subtitle generation, are further accelerating market growth. The market is segmented by technology (e.g., cloud-based, on-premise), application (e.g., live streaming, video conferencing, education), and end-user (e.g., media & entertainment, corporate, education). Competitive landscape analysis reveals a mix of established players and emerging technology companies, vying for market share through innovation in accuracy, speed, and integration with existing workflows. The forecast period (2025-2033) anticipates continued expansion, with a projected compound annual growth rate (CAGR) reflecting the increasing penetration of real-time subtitling across diverse industries and regions.

Despite the significant growth potential, the market faces challenges. High initial investment costs for advanced technologies, the need for highly skilled professionals for accurate transcription and quality control, and variations in language complexities and accents can all constrain market penetration. However, these challenges are being addressed through continuous innovation, including the development of more affordable and user-friendly solutions, improvements in automated transcription technology, and increased accessibility of training programs. Overcoming these hurdles will be crucial for ensuring the continued and sustainable growth of the real-time subtitles market throughout the forecast period.
The market is expected to reach a substantial value by 2033, driven by consistent technological advancements, regulatory support, and rising demand.
https://creativecommons.org/publicdomain/zero/1.0/
This dataset includes statistics about durations between two consecutive subtitles in 5,000 top-ranked IMDB movies. The dataset can be used to understand how dialogue is used in films and to develop tools to improve the watching experience. This notebook contains the code and data that were used to create this dataset.
Dataset statistics:
Dataset use cases:
Data Analysis:
The next histogram shows the distribution of movie runtimes in minutes. The mean runtime is 99.903 minutes, the maximum runtime is 877 minutes, and the median runtime is 98.5 minutes.
Figure 1: Histogram of the runtime in minutes
The next histogram shows the distribution of the percentage of gaps (durations between two consecutive subtitles) relative to total movie runtime. The reported mean percentage of gaps is 0.187, the reported maximum is 0.033, and the reported median is 327.586.
Figure 2: Histogram of the percentage of gaps (durations between two consecutive subtitles) relative to total movie runtime
The next histogram shows the distribution of each movie's total subtitle duration in seconds. The mean subtitle duration is 4,837.089 seconds and the median subtitle duration is 2,906.435 seconds.
Figure 3: Histogram of each movie's total subtitle duration (seconds)
Example use case:
The Dynamic Adjustment of Playback Speed (DAPS), a VLC extension, can be used to save time while watching movies by increasing the playback speed between dialogues. However, it is essential to choose appropriate settings for the extension, as increasing the playback speed can affect the overall tone and impact of the film.
The dataset of 5,000 top-ranked movie subtitle durations can be used to help users choose the appropriate settings for the DAPS extension. For example, users who are watching a fast-paced action movie may want to set a higher minimum duration between subtitles before speeding up, while users who are watching a slow-paced drama movie may want to set a lower minimum duration.
Additionally, users can use the dataset to understand how the different settings of the DAPS extension impact the overall viewing experience. For example, users can experiment with different settings to see how they affect the pacing of the movie and the overall impact of the dialogue scenes.
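As a sketch of how such gap-based tuning might work, the function below finds gaps between consecutive subtitle cues that exceed a minimum duration, i.e. the candidate ranges for speeding up playback. The function name and threshold are illustrative, not part of the dataset or the DAPS extension:

```python
def subtitle_gaps(cues, min_gap=2.0):
    """Given a list of (start, end) cue times in seconds, return the
    (gap_start, gap_end) ranges between consecutive cues that last at
    least min_gap seconds."""
    gaps = []
    for (_, e1), (s2, _) in zip(cues, cues[1:]):
        if s2 - e1 >= min_gap:
            gaps.append((e1, s2))
    return gaps

# Three cues: two close together, then a long silence.
cues = [(1.5, 4.0), (4.5, 7.0), (12.0, 15.0)]
print(subtitle_gaps(cues))  # gaps of at least 2s: [(7.0, 12.0)]
```

Raising `min_gap` (e.g. for fast-paced action films) yields fewer, longer speed-up ranges; lowering it (e.g. for slow dramas) yields more.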
Conclusion
This dataset is a valuable resource for researchers and developers who are interested in understanding and improving the use of dialogue in movies or in tools for watching movies.
https://creativecommons.org/publicdomain/zero/1.0/
A collection of movie subtitle translations from opensubtitles.org. The dataset contains two features: the Indonesian translation and the English source subtitle.
| feature | description |
|---|---|
| id | Indonesian translation of the subtitle |
| en | English source subtitle |
Translation EDA
For more information, see:
J. Tiedemann, 2016, Finding Alternative Translations in a Large Corpus of Movie Subtitles. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016)
Citation
P. Lison and J. Tiedemann, 2016, OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016)
Improved comprehension and a deeper understanding of accents were the most common reasons American adults gave for using subtitles while watching TV in a language they knew, according to a survey conducted between June and July 2023. Another 33 percent of respondents stated that they did so because they were in a noisy environment.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Exploring language usage through frequency analysis in large corpora is a defining feature of most recent work in corpus and computational linguistics. From a psycholinguistic perspective, however, the corpora used in these contributions are often not representative of language usage: they are either domain-specific, limited in size, or extracted from unreliable sources. In an effort to address this limitation, we introduce SubIMDB, a corpus of everyday spoken language containing over 225 million words. The corpus was extracted from 38,102 subtitles of family, comedy, and children's movies and series, and is the first sizeable structured corpus of subtitles made available. Our experiments show that word frequency norms extracted from this corpus are more effective than those from well-known norms such as Kucera-Francis, HAL, and SUBTLEXus in predicting various psycholinguistic properties of words, such as lexical decision times, familiarity, age of acquisition, and simplicity. We also provide evidence that contradicts the long-standing assumption that the ideal size for a corpus can be determined solely based on how well its word frequencies correlate with lexical decision times.
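Frequency norms of this kind ultimately reduce to counting token occurrences across a corpus and normalizing. A toy sketch (the function name and log-per-million normalization are illustrative, not SubIMDB's exact procedure):

```python
from collections import Counter
import math

def frequency_norms(documents):
    """Count whitespace tokens across documents and return log10
    frequency per million tokens, a common psycholinguistic norm."""
    counts = Counter()
    for doc in documents:
        counts.update(doc.lower().split())
    total = sum(counts.values())
    return {w: math.log10(c / total * 1_000_000) for w, c in counts.items()}

norms = frequency_norms(["the cat sat", "the dog ran"])
```

Here "the" occurs twice among six tokens, so its norm is log10(2/6 x 10^6), roughly 5.52; rarer words score lower, mirroring how high-frequency words tend to have faster lexical decision times.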
https://dataintelo.com/privacy-and-policy
The global real time subtitles market size was valued at approximately USD 2.5 billion in 2023 and is expected to surge to around USD 6.8 billion by 2032, growing at a compound annual growth rate (CAGR) of 11.5% during the forecast period. This notable growth can be attributed to several factors, including the rising demand for accessible content, advancements in artificial intelligence (AI) and machine learning (ML) technologies, and the increasing globalization of media and corporate communications.
One of the primary growth factors driving the real time subtitles market is the increasing emphasis on accessibility and inclusiveness in media and communications. Governments and organizations worldwide are instituting regulations and policies requiring content to be accessible to individuals who are deaf or hard of hearing. For instance, the Americans with Disabilities Act (ADA) in the United States mandates that video content be accessible, propelling the adoption of real-time subtitle solutions. This regulatory environment, coupled with growing social awareness, significantly fuels market growth.
Another critical driver is the rapid advancement of AI and ML technologies, which have revolutionized the accuracy and efficiency of real-time subtitle generation. Modern AI-driven subtitle solutions can now offer near-perfect synchronization and error-free transcription, enhancing user experience. These technological advancements are making real-time subtitles more reliable and scalable, thereby increasing their adoption across various sectors such as broadcasting, education, and corporate communications.
The globalization of media content and corporate operations further contributes to the market's expansion. As companies and content creators aim to reach a global audience, the need for multilingual subtitle solutions becomes imperative. Real-time subtitles facilitate effective communication across different languages and cultural contexts, thereby broadening the reach and appeal of content. This globalization trend is particularly evident in the streaming services sector, where platforms are increasingly providing real-time subtitles in multiple languages to cater to diverse audiences.
Film Subtitling plays a crucial role in the globalization of media content, as it allows films to reach audiences across different linguistic and cultural backgrounds. With the rise of streaming platforms and international film festivals, the demand for high-quality film subtitling services has surged. These services not only enhance the accessibility of films for non-native speakers but also preserve the original context and cultural nuances of the content. As the film industry continues to expand its global footprint, the importance of accurate and culturally sensitive film subtitling cannot be overstated. This trend is particularly significant for independent filmmakers and studios aiming to distribute their content internationally, as it opens up new markets and increases viewership.
Regionally, North America and Europe are currently the largest markets for real-time subtitles, driven by stringent accessibility regulations and the advanced state of digital infrastructure. However, the Asia Pacific region is expected to witness the highest growth rate during the forecast period, owing to increasing internet penetration, the proliferation of digital content, and rising awareness about accessibility. China and India, with their massive consumer bases and growing digital economies, are poised to be significant contributors to this regional market growth.
The real time subtitles market by component can be broadly categorized into software, hardware, and services. Each of these segments plays a crucial role in the comprehensive ecosystem of real-time subtitle solutions. The software segment includes various applications and platforms that facilitate subtitle generation and synchronization. This segment is expected to dominate the market due to continuous advancements in AI and ML algorithms that significantly improve the accuracy and efficiency of subtitle generation. Companies are investing heavily in R&D to develop innovative software solutions that cater to diverse linguistic and accessibility needs.
The hardware segment encompasses the physical devices required to support real-time subtitle generation and display. These include specialized subtitle generation hardware,
https://www.marketresearchforecast.com/privacy-policy
The global film subtitling market is experiencing robust growth, driven by the increasing consumption of on-demand video content across diverse languages and regions. The rising popularity of streaming platforms like Netflix, Amazon Prime, and Disney+, coupled with the expansion of international film productions, fuels the demand for high-quality subtitling services. This market is segmented by language (Native, Foreign, Minority, Special) and application (Drama, Comedy, Horror, Romance, Action, Other). While precise market sizing requires further data, considering the significant growth in streaming and global film production, a reasonable estimation for the 2025 market size could be in the range of $2.5 to $3 billion USD. A Compound Annual Growth Rate (CAGR) of 8-10% is plausible over the forecast period (2025-2033), reflecting continued market expansion.

Key growth drivers include increased globalization of media content, the rise of multilingual audiences, accessibility requirements for diverse viewers, and technological advancements in subtitling software and automation. However, the market also faces restraints. These include fluctuating language-specific demand, varying quality standards across subtitling providers, the need for skilled and experienced linguists, and the potential for copyright issues surrounding unauthorized subtitling.

The competition is fierce among established players like PoliLingua, JBI Studios, and BTI Studios, as well as smaller, specialized firms. The geographical distribution of the market is expected to be broadly diversified, with North America and Europe representing significant market shares, though Asia-Pacific is projected to experience substantial growth in the coming years due to the booming entertainment industry and expanding internet penetration. The increasing adoption of AI-powered subtitling tools might streamline the process but poses challenges related to maintaining accuracy and cultural nuances.
Strategic partnerships and investments in technological innovation are crucial for companies to maintain competitiveness and cater to the evolving demands of the film subtitling industry.
https://www.datainsightsmarket.com/privacy-policy
The global video subtitle translation services market is experiencing robust growth, driven by the proliferation of video content across various platforms and the increasing demand for accessibility and global reach. The market's expansion is fueled by several key factors. Firstly, the rise of streaming services and online video platforms necessitates multilingual subtitles to cater to a diverse global audience. Secondly, the growing emphasis on accessibility for individuals with hearing impairments is driving demand for accurate and high-quality subtitles. Thirdly, advancements in artificial intelligence (AI) and machine learning (ML) technologies are enhancing the speed and efficiency of translation processes, making the service more cost-effective. Finally, globalization and increased cross-border communication are further propelling market growth.

We estimate the market size in 2025 to be approximately $2.5 billion, with a Compound Annual Growth Rate (CAGR) of 15% from 2025 to 2033, leading to a projected market value of around $7.8 billion by 2033. This growth trajectory is anticipated despite certain restraints, such as the need for human oversight to ensure accuracy and cultural nuances in translations, and the challenges associated with handling diverse dialects and languages.

Market segmentation plays a crucial role in understanding the landscape. While specific segment breakdowns aren't provided, we can infer significant segments based on industry trends. These likely include language pairs (e.g., English to Spanish, English to Mandarin), video type (e.g., corporate videos, films, educational content), and service type (e.g., human translation, machine translation with post-editing). The competitive landscape is characterized by a mix of established players like Stepes, Ai-Media, and 3Play Media, and smaller, specialized companies catering to niche markets.
The ongoing technological advancements and increasing market demand indicate that the video subtitle translation services market is poised for sustained, considerable growth in the coming years, creating opportunities for both established and emerging players.
According to a survey of viewers who watch foreign content, as of November 2021, subtitling video content was preferred over dubbing in the United States and the United Kingdom, with ** percent and ** percent of respondents, respectively, preferring the former. By comparison, ** percent of video viewers in Italy reported preferring dubbing, while in Germany this number rose to ********* respondents.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The Japanese-English Subtitle Corpus is a large Japanese-English parallel corpus covering the underrepresented domain of conversational dialogue. It consists of more than 3.2 million examples, making it the largest freely available dataset of its kind. The corpus was assembled by crawling and aligning subtitles found on the web.
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Subtitle files (.xlsx format) from Experiment 1 and Experiment 2 from the project "SURE - Exploring Subtitle Reading Process with Eyetracking Technology", supported by a grant from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie Grant Agreement No. 702606.
In each experiment, there are files in English, Polish and Spanish, with time codes, as used in the study.
Each file contains three versions: subtitled at 12, 16 and 20 characters per second (cps).
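For reference, a cps value is simply a cue's visible character count divided by its display time in seconds. A minimal illustration (the function name is mine, and real subtitling workflows may exclude spaces or formatting tags from the count):

```python
def chars_per_second(text, start, end):
    """Presentation rate: characters shown divided by display time in seconds."""
    return len(text) / (end - start)

# A 40-character line displayed for 2.5 seconds reads at 16 cps,
# matching the middle condition in the study above.
rate = chars_per_second("x" * 40, 10.0, 12.5)
```

At 12 cps the same line would need to stay on screen longer (about 3.3 seconds), which is why slower rates are often considered more comfortable to read.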
https://www.datainsightsmarket.com/privacy-policy
The global subtitling services market, currently valued at $1259 million in 2025, is projected to experience robust growth, driven by the increasing consumption of video content across diverse platforms and languages. The compound annual growth rate (CAGR) of 6.4% from 2025 to 2033 indicates a significant expansion in market size, exceeding $2000 million by the end of the forecast period. This growth is fueled by several key factors, including the rise of streaming services, the increasing demand for multilingual content accessibility, and the growing popularity of online education and e-learning platforms that necessitate subtitling for wider reach and inclusivity. Furthermore, advancements in automated subtitling technologies are streamlining the process, reducing costs, and improving turnaround times, contributing positively to market expansion. However, challenges such as ensuring high-quality translations that accurately convey nuances and cultural contexts, and managing the complexities of different dialects and accents, remain crucial aspects for service providers to address.

Competitive intensity is moderate, with a range of companies, including both established players like 3Play Media and emerging players like GoPhrazy, catering to diverse client needs. The market is segmented based on factors such as service type (e.g., live subtitling, post-production subtitling), industry vertical (e.g., media and entertainment, education, corporate), and language pairs. The geographical distribution of revenue likely shows significant concentration in North America and Europe, reflecting high internet penetration and media consumption in these regions, but emerging markets in Asia and Latin America also present significant growth opportunities. Continued innovation in AI-powered subtitling tools and a focus on providing accurate, culturally sensitive translations will be crucial for companies to maintain a competitive edge in this dynamic market.
The original extracted versions (in .srt and .ass format) are also included in this release (which Kaggle decompressed, for reasons unknown).
This dataset contains 1,497,770 messages across 3,836 episodes of anime. The raw dataset contains 1,563,442 messages, some of which were removed during cleaning.
This version (V4) adapts the original (frankly, terrible) format into the newer format I developed, which is used in https://github.com/JEF1056/clean-discord. The Dataset folder contains compressed text files, which are compatible with TensorFlow Datasets. These can be streamed as a TextLineDataset in TSV format.
V4 also fixes many (but not all) issues that the original cleaning script was too simple to handle. It also uses the clean-discord cleaning algorithms to make sentences read more like natural language than chat formatting. The script has also been optimized to run on multi-core systems, allowing it to clean this entire dataset in under 30 seconds on a 4-core machine. See the new and improved script here: https://github.com/JEF1056/clean-discord/blob/v1.2/misc/anime.py (no longer bundled in the dataset files).
The files are now all compressed to save space and are compatible with TensorFlow Datasets. You can initialize a dataset function as follows:

```python
import functools
import os

import tensorflow as tf

def dataset_fn_local(split, shuffle_files=False):
    global nq_tsv_path
    del shuffle_files
    # Load lines from the text files as examples.
    files_to_read = [os.path.join(nq_tsv_path[split], filename)
                     for filename in os.listdir(nq_tsv_path[split])
                     if filename.startswith(split)]
    print(f"Split {split} contains {len(files_to_read)} files.\n"
          f"First 10: {files_to_read[0:10]}")
    # Stream non-empty lines from the gzip-compressed TSV files.
    ds = tf.data.TextLineDataset(files_to_read, compression_type="GZIP").filter(
        lambda line: tf.not_equal(tf.strings.length(line), 0))
    ds = ds.shuffle(buffer_size=600000)
    # Split each line into its two tab-separated fields.
    ds = ds.map(functools.partial(tf.io.decode_csv, record_defaults=["", ""],
                                  field_delim="\t", use_quote_delim=False),
                num_parallel_calls=tf.data.experimental.AUTOTUNE)
    ds = ds.map(lambda *ex: dict(zip(["question", "answer"], ex)))
    return ds
```
A sincere thanks to all of my friends for helping me come up with anime titles, a shoutout to the talented and dedicated people translating Japanese anime, and an even bigger thanks to Leen Chan for compiling the actual subtitles.
This dataset is far from complete! I hope that people willing to find, add, and clean data are out there and will help out in the effort to grow this dataset.
HowTo100M-subtitles-small
The subtitles from a subset of the HowTo100M dataset.