License: https://choosealicense.com/licenses/unknown/
This is a new collection of translated movie subtitles from http://www.opensubtitles.org/.
IMPORTANT: If you use the OpenSubtitles corpus, please add a link to http://www.opensubtitles.org/ on your website and in any reports and publications produced with the data!
This is a slightly cleaner version of the subtitle collection using improved sentence alignment and better language checking.
62 languages, 1,782 bitexts; total number of files: 3,735,070; total number of tokens: 22.10G; total number of sentence fragments: 3.35G
License: https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub [source]
This dataset provides an invaluable opportunity to train a neural network model to effectively and accurately translate text between an array of nine different languages, including Finnish, Hindi, Basque, Esperanto, French, Armenian, Bengali, Icelandic and Russian. Each language CSV file includes three columns: an ID column; a meta column which provides information about the source of the sentence; and finally a 'translation' column that contains the translated sentence. The aim is to build a dataset suitable for training models capable of mastering multilingual translation tasks in order to bridge gaps between languages. Train your model with this unique dataset today!
For more datasets, click here.
This dataset is a great resource for anyone looking to build a translation model using neural networks. Here is a guide on how to use it:
- Download the appropriate .csv files for the languages you need from the Kaggle dataset.
- The data comes in easily accessible CSV files, with ID, meta, and translation columns in each row. The ID column holds integer values that uniquely identify each row and can serve as labels when training your model; the meta column records where each sentence originated, letting you quickly filter out sentences with suspect origins if needed; and the translation column contains the English sentence together with its foreign-language equivalent (depending on which language you are working with).
- To train your neural network model, make sure you have enough training data available, and experiment with language-pair subsets where available before assembling your final full training dataset. This Kaggle set should provide sufficient sample sizes per language pair, so download whatever subsets you need from the main database first.
- Now it's time to construct the input feature vectors for your neural network: gather the relevant variables (IDs, source metadata, and translations) into separate lists or arrays, depending on your preferred coding approach, and feed these into your network's layer setup.
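As a concrete sketch of the loading step above (the column names `id`, `meta`, and `translation` are assumed from the description, and the sample rows are invented for illustration; the real files may differ):

```python
import csv
import io

# Hypothetical sample mirroring the described layout: an integer ID,
# a meta column naming the sentence's origin, and a translation column.
sample_csv = """id,meta,translation
1,open_subtitles,Bonjour le monde
2,open_subtitles,Comment ca va ?
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))

# Gather the relevant variables into separate lists for later use as
# inputs when configuring a neural network.
ids = [int(r["id"]) for r in rows]
sources = [r["meta"] for r in rows]
sentences = [r["translation"] for r in rows]
```

With a real download, the `io.StringIO` wrapper would simply be replaced by `open("path/to/language.csv")`.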
- Creating a neural network to automatically translate texts from any of the 9 languages in this dataset into any other language.
- Developing an AI-powered chatbot that can reply in multiple languages that the users prefer.
- Building an automatic translation system with real-time video conversation capabilities for use by professionals such as interpreters and international translators.
If you use this dataset in your research, please credit the original authors. Data Source
**License: [CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication](https://creativecommon...
License: https://creativecommons.org/publicdomain/zero/1.0/
Dive into the world of French dialogue with the French Movie Subtitle Conversations dataset, a comprehensive collection of over 127,000 movie subtitle conversations. This dataset offers a deep exploration of authentic and diverse conversational contexts spanning various genres, eras, and scenarios. It is thoughtfully organized into three distinct sets: training, testing, and validation.
Each conversation in this dataset is structured as a JSON object, featuring three key attributes: context, knowledge, and response.
Here's a snippet from the dataset to give you an idea of its structure:
[
{
"context": [
"Tu as attendu longtemps?",
"Oui en effet.",
"Je pense que c' est grossier pour un premier rencard.",
// ... (6 more lines of context)
],
"knowledge": "",
"response": "On n' avait pas dit 9h?"
},
// ... (more data samples)
]
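A minimal sketch of loading conversations in this format, using a sample like the one above inline (with the real dataset you would read the JSON file from disk instead):

```python
import json

# One conversation in the format shown above: a context list of prior
# dialogue lines, a knowledge string, and a response string.
raw = """
[
  {
    "context": ["Tu as attendu longtemps?", "Oui en effet."],
    "knowledge": "",
    "response": "On n' avait pas dit 9h?"
  }
]
"""

conversations = json.loads(raw)

# Flatten each dialogue history and pair it with the reply a model
# should learn to predict.
pairs = [(" ".join(c["context"]), c["response"]) for c in conversations]
```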
The French Movie Subtitle Conversations dataset serves as a valuable resource for several applications:
We extend our gratitude to the movie subtitle community for their contributions, which have enabled the creation of this diverse and comprehensive French dialogue dataset.
Unlock the potential of authentic French conversations today with the French Movie Subtitle Conversations dataset. Engage in state-of-the-art research, enhance language models, and create applications that resonate with the nuances of real dialogue.
The dataset is used to test the proposed methodologies for mining parallel data from comparable corpora.
A multilingual corpus of movie subtitles aligned at the sentence level. It contains data on more than 50 languages, with a focus on the Indo-European language family. Morphosyntactic annotation (part-of-speech, features, dependencies) in Universal Dependencies style is available for 47 languages.
The Open Subtitles dataset consists of transcriptions of spoken dialog in movies and television shows.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset contains 25 aligned movie subtitle segments in English, Russian, and Italian, extracted from the ParTree corpus. Each row provides a short, context-rich movie line with its translations in all three languages, making it ideal for research and development in machine translation, multilingual NLP, and cross-lingual transfer learning.
Key features:
- Parallel triplets: English, Russian, Italian
- Sourced from authentic movie subtitles for natural, conversational language
- Suitable for training, validation, and benchmarking of translation and multilingual models
Data originally from the ParTree corpus, available via Swiss-AL
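To illustrate how such parallel triplets could be consumed (the CSV layout and column names `en`, `ru`, `it` here are assumptions for illustration, not the corpus's actual schema):

```python
import csv
import io

# Hypothetical row: one aligned movie line per row, one column per language.
sample = """en,ru,it
Where were you last night?,Gde ty byl proshloy nochyu?,Dov'eri ieri sera?
"""

triplets = list(csv.DictReader(io.StringIO(sample)))

# Select any language pair from the triplets, e.g. for an en->it model.
pairs_en_it = [(r["en"], r["it"]) for r in triplets]
```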
PJMixers-Dev/Subtitles dataset hosted on Hugging Face and contributed by the HF Datasets community
This dataset includes subtitle files in the SRT (SubRip Subtitle) format for several popular movies, such as Oppenheimer and Tenet. SRT files are plain-text files widely used for subtitles, containing a series of structured entries to synchronize text with video content. Each entry in an SRT file comprises a sequential index, a timestamp line giving the start and end times in the format hours:minutes:seconds,milliseconds, and one or more lines of subtitle text. For example:
```
1
00:00:01,500 --> 00:00:04,000
This is a sample subtitle.

2
00:00:04,500 --> 00:00:07,000
Here is another subtitle to demonstrate multiple entries.
```
This straightforward format is highly compatible with media players and easy to edit. SRT files enhance accessibility by providing subtitles in different languages or accommodating viewers with hearing impairments, enriching the experience of enjoying these popular movies.
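The format described above can be parsed in a few lines of Python; this is a minimal sketch for well-formed files (it does not handle BOMs, encoding quirks, or malformed blocks):

```python
import re

srt_text = """1
00:00:01,500 --> 00:00:04,000
This is a sample subtitle.

2
00:00:04,500 --> 00:00:07,000
Here is another subtitle to demonstrate multiple entries.
"""

def parse_srt(text):
    """Split an SRT document into (index, start, end, text) entries."""
    entries = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        index = int(lines[0])                      # sequential entry number
        start, end = lines[1].split(" --> ")       # timestamp line
        entries.append((index, start, end, "\n".join(lines[2:])))
    return entries

entries = parse_srt(srt_text)
```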
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
## Overview
Alan Wake 2 Subtitles is a dataset for object detection tasks - it contains Subtitle annotations for 565 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Exploring language usage through frequency analysis in large corpora is a defining feature of most recent work in corpus and computational linguistics. From a psycholinguistic perspective, however, the corpora used in these contributions are often not representative of language usage: they are either domain-specific, limited in size, or extracted from unreliable sources. In an effort to address this limitation, we introduce SubIMDB, a corpus of everyday spoken language that we created, containing over 225 million words. The corpus was extracted from 38,102 subtitles of family, comedy and children's movies and series, and is the first sizeable structured corpus of subtitles made available. Our experiments show that word frequency norms extracted from this corpus are more effective than those from well-known norms such as Kucera-Francis, HAL and SUBTLEXus in predicting various psycholinguistic properties of words, such as lexical decision times, familiarity, age of acquisition and simplicity. We also provide evidence that contradicts the long-standing assumption that the ideal size for a corpus can be determined solely based on how well its word frequencies correlate with lexical decision times.
License: Attribution-NonCommercial 3.0 (CC BY-NC 3.0), https://creativecommons.org/licenses/by-nc/3.0/
License information was derived automatically
The IndicDialogue dataset contains raw subtitle SRT files and dialogues extracted from them. The subtitles are in 10 indic languages, namely Hindi, Bengali, Marathi, Telugu, Tamil, Urdu, Odia, Sindhi, Nepali and Assamese. This dataset provides a corpus for performing various NLP tasks in low-resource languages using SLMs(Small Language Models) and LLMs(Large Language Models).
Discover the booming subtitling and captioning market! This in-depth analysis reveals key trends, growth drivers, leading companies (SDI Media, IYUNO, Deluxe Media, ZOO Digital), and regional market shares from 2019-2033. Learn about the impact of AI and increasing demand for multilingual content.
The global video subtitle translation services market is booming, driven by streaming, accessibility needs, and AI advancements. Learn about market size, growth trends, key players (Stepes, Ai-Media, 3Play Media), and future projections in this comprehensive analysis. Discover how this $2.5 billion market is set to reach $7.8 billion by 2033.
License: https://creativecommons.org/publicdomain/zero/1.0/
I was thinking about movie sentiment and wanted to see whether there is any strong pattern linking how sentiment fluctuates across a movie to how that movie is received or performs.
To track movie sentiment across the run time, the easiest approach is to get the movie's subtitles and identify the sentiment of each line of subtitle text. The advantage of this approach is that movie subtitles are easy to obtain, parse, and process, and NLP frameworks can readily help with the task. It is also scalable since, irrespective of a movie's original language, English subtitles are available for almost all movies, albeit with occasional translation errors.
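A minimal sketch of this per-line sentiment idea, using a toy word lexicon in place of a real NLP framework (in practice one would use something like NLTK's VADER analyzer); the lexicon and sample lines are invented for illustration:

```python
# Toy lexicon standing in for a real sentiment model.
POSITIVE = {"love", "great", "happy"}
NEGATIVE = {"hate", "terrible", "sad"}

def line_sentiment(line):
    """Crude polarity score: positive-word count minus negative-word count."""
    words = line.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

# Subtitle lines in broadcast order stand in for a parsed SRT file.
subtitle_lines = [
    "I love this place",
    "This is terrible",
    "See you tomorrow",
]

# Sentiment trajectory across the movie's run time.
trajectory = [line_sentiment(l) for l in subtitle_lines]
```

The resulting trajectory can then be smoothed and compared against box-office or rating data to look for the patterns described above.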
Enhancement of comprehension and a more profound understanding of accents were the most common reasons why American adults used subtitles while watching TV in a known language, according to a survey conducted between June and July 2023. Another ** percent of the respondents stated that they did so because they were in a noisy environment.
PJMixers-Dev/Subtitles-rag-questions-qwq-all-aphrodite dataset hosted on Hugging Face and contributed by the HF Datasets community
Discover the booming subtitles editor market! This in-depth analysis reveals key trends, growth drivers, and leading companies shaping the future of video accessibility and localization. Explore market size, CAGR, and regional insights for 2025-2033.
Survivor Subtitles Dataset (cleaned)
Dataset Description
A collection of subtitles from the American reality television show "Survivor", spanning seasons 1 through 47. The dataset contains subtitle text extracted from episode broadcasts. This dataset is a modification of the original Survivor Subtitles dataset after cleaning up and joining subtitle fragments. This dataset is a work in progress and any contributions are welcome.
Source
The subtitles were… See the full description on the dataset page: https://huggingface.co/datasets/hipml/survivor-subtitles-cleaned.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Transcript of the focus group