https://choosealicense.com/licenses/unknown/
This is a new collection of translated movie subtitles from http://www.opensubtitles.org/.
IMPORTANT: If you use the OpenSubtitle corpus: Please, add a link to http://www.opensubtitles.org/ to your website and to your reports and publications produced with the data!
This is a slightly cleaner version of the subtitle collection using improved sentence alignment and better language checking.
62 languages, 1,782 bitexts; total number of files: 3,735,070; total number of tokens: 22.10G; total number of sentence fragments: 3.35G
Subtitles are a text representation of the spoken dialogue and other relevant audio information in a video, such as background sounds or music. This dataset is likely to be useful for natural language processing (NLP) tasks, such as language modeling, sentiment analysis, and named entity recognition. It could also be used for machine learning tasks, such as text classification or clustering. With this dataset, researchers and developers can analyze the language used in movies, study how language evolves over time, and train models to perform various NLP tasks on movie subtitles.
https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub [source]
This dataset provides an invaluable opportunity to train a neural network model to effectively and accurately translate text between an array of nine different languages, including Finnish, Hindi, Basque, Esperanto, French, Armenian, Bengali, Icelandic and Russian. Each language CSV file includes three columns: an ID column; a meta column which provides information about the source of the sentence; and finally a 'translation' column that contains the translated sentence. The aim is to build a dataset suitable for training models capable of mastering multilingual translation tasks in order to bridge gaps between languages. Train your model with this unique dataset today!
This dataset is a great resource for anyone looking to build a translation model using neural networks. Here is a guide on how to use it:
- Download the appropriate .csv files for the languages you need from the Kaggle dataset.
- The data comes as CSV files in which every row has ID, meta and translation columns. The ID column holds integer values that identify each row and can serve as unique labels for examples when training your model. The meta column records where each sentence came from, so you can quickly filter out sentences of doubtful origin if needed. The translation column should hold the English sentence together with its equivalent in the other language of the pair (depending on which language you are working with).
To train a neural network model you need enough training data, so where subsets for individual language pairs are available, experiment with those before assembling the full training set. This Kaggle dataset should provide sufficient sample sizes for each language pair, so download the subsets you need first and proceed from there.
Now it's time to construct the input feature sets for the neural network: gather the relevant variables from the files you downloaded into separate lists or arrays, in whatever form suits the code you will use to define the network's layers.
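The loading step described above can be sketched in Python as follows. This is a minimal illustration, not the dataset author's code: the sample rows are hypothetical, and it assumes that each translation cell holds a dict-like string keyed by language code, which the description does not confirm.

```python
import ast
import csv
import io

# Hypothetical sample in the described layout (id, meta, translation).
# Assumption: the translation cell is a dict-like string keyed by language code.
SAMPLE_CSV = """id,meta,translation
0,"{'year': 1999}","{'en': 'Hello.', 'fr': 'Bonjour.'}"
1,"{'year': 2004}","{'en': 'Thank you.', 'fr': 'Merci.'}"
"""


def load_pairs(handle, src="en", tgt="fr"):
    """Collect source and target sentences into two parallel lists."""
    sources, targets = [], []
    for row in csv.DictReader(handle):
        # Parse the dict-like string without executing arbitrary code.
        pair = ast.literal_eval(row["translation"])
        if src in pair and tgt in pair:
            sources.append(pair[src])
            targets.append(pair[tgt])
    return sources, targets


sources, targets = load_pairs(io.StringIO(SAMPLE_CSV))
```

For a real file, pass an open file handle instead of the in-memory sample, and adjust the parsing if the translation column turns out to use a different encoding.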
- Creating a neural network to automatically translate texts from any of the 9 languages in this dataset into any other language.
- Developing an AI-powered chatbot that can reply in multiple languages that the users prefer.
- Building an automatic translation system with real-time video conversation capabilities for use by professionals such as interpreters and international translators
If you use this dataset in your research, please credit the original authors. Data Source
**License: [CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication](https://creativecommon...
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Subscene is a vast collection of multilingual subtitles, encompassing 65 different languages and consisting of more than 30 billion tokens with a total size of 410.70 GB. This dataset includes subtitles for movies, series, and animations gathered from the Subscene dump. It provides a rich resource for studying language variations and building multilingual NLP models. We have carefully applied a fastText classifier to remove any non-language content from incorrect subsets. Additionally, we performed basic cleaning and filtration. However, there is still room for further cleaning and refinement.
chenrm/yyets-subtitles dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Subtitles is a dataset for object detection tasks - it contains Letters annotations for 500 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
https://creativecommons.org/publicdomain/zero/1.0/
The YouTube Insights dataset offers valuable data for researchers, data scientists, and YouTube enthusiasts to explore video performance and engagement. This dataset focuses on key elements such as video titles, view counts, analytics, and subtitles.
With a wide range of YouTube videos, spanning various genres and upload dates, this dataset provides insights into video popularity and audience engagement. Researchers can analyze video titles to understand effective strategies for capturing viewer attention. View counts offer quantitative measures of video popularity, while analytics data provides metrics like likes, dislikes, comments, and shares.
The inclusion of subtitles enhances the dataset, enabling language pattern analysis, sentiment analysis, and keyword extraction. Researchers can uncover correlations between subtitles and video content to gain a deeper understanding of audience preferences and behavior.
The YouTube Insights dataset empowers users to discover valuable insights into YouTube's ecosystem, optimizing content creation and engagement strategies. It serves as a foundation for research, analysis, and innovation in the realm of online video platforms.
Attribution-NonCommercial-NoDerivs 3.0 (CC BY-NC-ND 3.0): https://creativecommons.org/licenses/by-nc-nd/3.0/
License information was derived automatically
techiaith/YouTube-Subtitles dataset hosted on Hugging Face and contributed by the HF Datasets community
A multilingual corpus of movie subtitles aligned on the sentence-level. Contains data on more than 50 languages with a focus on the Indo-European language family. Morphosyntactic annotation (part-of-speech, features, dependencies) in Universal Dependency-style is available for 47 languages.
https://creativecommons.org/publicdomain/zero/1.0/
Dive into the world of French dialogue with the French Movie Subtitle Conversations dataset, a comprehensive collection of over 127,000 movie subtitle conversations. This dataset offers a deep exploration of authentic and diverse conversational contexts spanning various genres, eras, and scenarios. It is thoughtfully organized into three distinct sets: training, testing, and validation.
Each conversation in this dataset is structured as a JSON object, featuring three key attributes: "context", "knowledge", and "response".
Here's a snippet from the dataset to give you an idea of its structure:
[
{
"context": [
"Tu as attendu longtemps?",
"Oui en effet.",
"Je pense que c' est grossier pour un premier rencard.",
// ... (6 more lines of context)
],
"knowledge": "",
"response": "On n' avait pas dit 9h?"
},
// ... (more data samples)
]
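A structure like the one shown above could be turned into prompt and response pairs along these lines. This is a sketch only: the sample is an abbreviated copy of the snippet with the elided context lines left out, and the function name and the choice to join context lines with newlines are assumptions for illustration.

```python
import json

# Abbreviated copy of the snippet above; elided context lines are omitted.
SAMPLE_JSON = """
[
  {
    "context": ["Tu as attendu longtemps?", "Oui en effet."],
    "knowledge": "",
    "response": "On n' avait pas dit 9h?"
  }
]
"""


def to_training_pairs(raw, separator="\n"):
    """Join each conversation's context into one prompt, paired with its response."""
    pairs = []
    for item in json.loads(raw):
        prompt = separator.join(item["context"])
        pairs.append((prompt, item["response"]))
    return pairs


pairs = to_training_pairs(SAMPLE_JSON)
```

The "knowledge" field is ignored here because it is empty in the sample; a real pipeline would need to decide how to use it when it is populated.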
The French Movie Subtitle Conversations dataset serves as a valuable resource for several applications:
We extend our gratitude to the movie subtitle community for their contributions, which have enabled the creation of this diverse and comprehensive French dialogue dataset.
Unlock the potential of authentic French conversations today with the French Movie Subtitle Conversations dataset. Engage in state-of-the-art research, enhance language models, and create applications that resonate with the nuances of real dialogue.
Attribution-NonCommercial 3.0 (CC BY-NC 3.0): https://creativecommons.org/licenses/by-nc/3.0/
License information was derived automatically
The IndicDialogue dataset contains raw subtitle SRT files and dialogues extracted from them. The subtitles are in 10 Indic languages, namely Hindi, Bengali, Marathi, Telugu, Tamil, Urdu, Odia, Sindhi, Nepali and Assamese. This dataset provides a corpus for performing various NLP tasks in low-resource languages using SLMs (Small Language Models) and LLMs (Large Language Models).
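Extracting dialogue from an SRT file can be sketched as below. This is an illustrative parser for the common SRT layout (cue number, timing line, text lines, blank line), not the extraction code used to build the dataset; the sample cues and the function name are hypothetical.

```python
import re

# Hypothetical two-cue sample in the usual SRT layout.
SAMPLE_SRT = """1
00:00:01,000 --> 00:00:03,500
Where have you been?

2
00:00:04,000 --> 00:00:06,000
I was waiting
at the station.
"""

# Matches a timing line such as "00:00:01,000 --> 00:00:03,500".
TIMING = re.compile(r"\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}")


def extract_dialogue(srt_text):
    """Return one string per cue, keeping only the text after the timing line."""
    dialogue = []
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        rows = block.splitlines()
        for i, row in enumerate(rows):
            if TIMING.match(row.strip()):
                text = " ".join(r.strip() for r in rows[i + 1:] if r.strip())
                if text:
                    dialogue.append(text)
                break
    return dialogue


lines = extract_dialogue(SAMPLE_SRT)
```

Taking everything after the timing line, rather than filtering out numeric rows, avoids dropping a spoken line that happens to consist only of digits.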
This dataset was created by Aleksandr Kliukin
https://dataintelo.com/privacy-and-policy
The global AI subtitle generation market was valued at $3.8 billion in 2025 and is projected to reach $18.6 billion by 2034, expanding at a compound annual growth rate (CAGR) of 19.3% during the forecast period from 2026 to 2034, driven by an unprecedented surge in digital video content, tightening accessibility regulations across major economies, and the rapid maturation of deep learning-based speech recognition technologies. The proliferation of over-the-top (OTT) streaming platforms, corporate video communications, and e-learning ecosystems has catalyzed demand for fast, accurate, and scalable subtitle generation solutions across virtually every industry vertical. Enterprises and content creators alike are increasingly abandoning manual captioning workflows in favor of AI-powered platforms that can deliver near real-time transcription at a fraction of the traditional cost. Advances in transformer-based language models, including architectures derived from OpenAI's Whisper and Google's Universal Speech Model (USM), have dramatically reduced word error rates (WER) to below 5% for major global languages, making AI-generated subtitles commercially viable for broadcast-grade applications. The integration of large language models (LLMs) with automatic speech recognition (ASR) engines has further enabled context-aware subtitle formatting, speaker diarization, and on-the-fly translation into more than 100 languages. Regulatory tailwinds such as the European Accessibility Act (EAA), scheduled for full enforcement in June 2025, and the U.S. Federal Communications Commission (FCC) mandates on video captioning have compelled media companies to invest heavily in automated captioning infrastructure. Simultaneously, the explosion of short-form video content on platforms such as TikTok, Instagram Reels, and YouTube Shorts has created a massive long-tail demand among individual content creators for quick, affordable subtitle solutions. 
The market is also benefiting from the hybridization of AI models with human review workflows, where AI handles the heavy lifting at scale while human editors perform quality assurance, creating a services layer that is growing in parallel with pure software revenues.
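The growth figures quoted above can be checked with a short calculation. This sketch assumes nine annual compounding periods, with 2025 as the base year and 2034 as the end year; the function names are illustrative.

```python
def implied_cagr(start_value, end_value, periods):
    """Compound annual growth rate implied by a start value, end value and period count."""
    return (end_value / start_value) ** (1 / periods) - 1


def project(start_value, rate, periods):
    """Value reached after compounding start_value at rate for the given periods."""
    return start_value * (1 + rate) ** periods


# Figures from the text: $3.8 billion (2025) to $18.6 billion (2034).
# Assumption: nine compounding periods between the two years.
rate = implied_cagr(3.8, 18.6, 9)
projected_2034 = project(3.8, 0.193, 9)
```

Under that assumption the implied rate comes out at roughly 19.3%, which agrees with the stated CAGR.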
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Over 12k scraped YouTube EN subtitles for videos on GitHub topics.
How? Based on the topics https://github.com/topics I searched YouTube with the phrase "What is {topic}?" and downloaded up to 100 video subtitles for a given topic. The extracted text can be found in the dataset together with the topic name, video title and video URL.
Why? I want to know if we can rate videos based on their information value, especially when we use YouTube as an information source.
You can find the source code here: https://github.com/detrin/text-info-value
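The collection procedure described above could be organised roughly as follows. This is a hypothetical sketch, not the code from the linked repository: the record class, function names and example topics are assumptions, and the actual search and download steps are left out.

```python
from dataclasses import dataclass

# Cap stated in the description: up to 100 videos per topic.
MAX_VIDEOS_PER_TOPIC = 100


@dataclass
class SubtitleRecord:
    """One row of the dataset as described: topic, video title, video URL, text."""
    topic: str
    title: str
    url: str
    text: str


def build_query(topic):
    """Build the search phrase used for a given topic."""
    return f"What is {topic}?"


def cap_results(records, limit=MAX_VIDEOS_PER_TOPIC):
    """Keep at most `limit` records for a topic."""
    return records[:limit]


# Hypothetical topic names for illustration.
queries = [build_query(topic) for topic in ["rust", "machine-learning"]]
```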
daliselmi/french-conversations-from-movie-subtitles dataset hosted on Hugging Face and contributed by the HF Datasets community
https://www.datainsightsmarket.com/privacy-policy
Discover the booming subtitling and captioning market! This in-depth analysis reveals key trends, growth drivers, leading companies (SDI Media, IYUNO, Deluxe Media, ZOO Digital), and regional market shares from 2019-2033. Learn about the impact of AI and increasing demand for multilingual content.
https://creativecommons.org/publicdomain/zero/1.0/
I was thinking about movie sentiments and wanted to see whether there is any strong pattern linking how sentiment fluctuates across a movie to how that movie is received or performs.
To track movie sentiments across the run time, the easy way is to get the movie subtitles and identify the sentiment for each text in the subtitle. The advantage of this approach is that movie subtitles are easy to get, parse, and process, and NLP frameworks can easily help with the task. This approach is scalable since, irrespective of the original language, English subtitles are available for almost all movies, albeit with translation errors.
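The idea of tracking sentiment across the run time can be sketched as below. The tiny word lists are a toy stand-in for a real sentiment model from an NLP framework, and the cue data, bucket size and function names are all hypothetical.

```python
from collections import defaultdict

# Toy lexicon, used only as a placeholder for a real sentiment model.
POSITIVE = {"love", "great", "happy"}
NEGATIVE = {"hate", "terrible", "sad"}


def score(text):
    """Count positive words minus negative words in one subtitle line."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def sentiment_curve(cues, bucket_seconds=600):
    """Average the score of subtitle lines within fixed-length time buckets."""
    buckets = defaultdict(list)
    for start_seconds, text in cues:
        buckets[int(start_seconds // bucket_seconds)].append(score(text))
    return {b: sum(v) / len(v) for b, v in sorted(buckets.items())}


# Hypothetical (start time in seconds, subtitle text) pairs.
cues = [
    (30, "I love this place!"),
    (400, "What a great day."),
    (700, "I hate goodbyes."),
    (1100, "This is terrible, so sad."),
]
curve = sentiment_curve(cues)
```

The resulting mapping from bucket index to mean score gives a coarse sentiment trajectory that could then be compared with how each movie was received.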
PJMixers-Dev/Subtitles dataset hosted on Hugging Face and contributed by the HF Datasets community
https://www.marketresearchforecast.com/privacy-policy
Discover the booming real-time subtitles market! Explore its growth drivers, key trends, and leading companies shaping this dynamic industry. Learn about market size, segmentation, and regional variations in this comprehensive analysis of the 2025-2033 forecast.
https://www.datainsightsmarket.com/privacy-policy
Explore the booming Captioning and Subtitling Service market, valued at USD 2.5 billion in 2025 and growing at a 10% CAGR. Discover key drivers like broadcast, streaming, and education, and understand regional market shares and future trends.