100+ datasets found
  1. open_subtitles

    • huggingface.co
    • modeldatabase.com
    • +1 more
    Updated Dec 10, 2020
    + more versions
    Cite
    Language Technology Research Group at the University of Helsinki (2020). open_subtitles [Dataset]. https://huggingface.co/datasets/Helsinki-NLP/open_subtitles
    Explore at:
    Dataset updated
    Dec 10, 2020
    Dataset authored and provided by
    Language Technology Research Group at the University of Helsinki
    License

    https://choosealicense.com/licenses/unknown/

    Description

    This is a new collection of translated movie subtitles from http://www.opensubtitles.org/.

    IMPORTANT: If you use the OpenSubtitle corpus: Please, add a link to http://www.opensubtitles.org/ to your website and to your reports and publications produced with the data!

    This is a slightly cleaner version of the subtitle collection using improved sentence alignment and better language checking.

    62 languages, 1,782 bitexts
    Total number of files: 3,735,070
    Total number of tokens: 22.10G
    Total number of sentence fragments: 3.35G
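The headline statistics above imply some useful back-of-the-envelope figures; a quick sketch, using only the numbers quoted in the summary:

```python
# Corpus-level figures quoted in the dataset summary above.
TOKENS = 22.10e9          # total tokens
FILES = 3_735_070         # total subtitle files
FRAGMENTS = 3.35e9        # total sentence fragments
BITEXTS = 1_782           # language-pair corpora

tokens_per_file = TOKENS / FILES
fragments_per_bitext = FRAGMENTS / BITEXTS

print(f"~{tokens_per_file:,.0f} tokens per subtitle file")
print(f"~{fragments_per_bitext:,.0f} sentence fragments per bitext")
```

So each subtitle file averages roughly six thousand tokens, and each language pair carries close to two million aligned fragments.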

  2. YouTube Subtitles Dataset

    • paperswithcode.com
    • kaggle.com
    Updated Mar 31, 2024
    Cite
    (2024). YouTube Subtitles Dataset [Dataset]. https://paperswithcode.com/dataset/youtube-subtitles
    Explore at:
    Dataset updated
    Mar 31, 2024
    Area covered
    YouTube
    Description

    YT_subtitles is a tool for building a dataset from YouTube subtitles.

    Purpose: The primary goal of this tool is to extract non-machine-generated subtitles from YouTube videos. These subtitles are obtained by searching for specific terms and collecting the relevant video content.

    How It Works:

    You provide a list of search terms (such as "movie review," "GPT-3," or "true crime documentary"). The tool retrieves videos related to these search terms. For each video, it extracts the subtitles (in various languages) and organizes them into minute-by-minute segments. The resulting files contain a string of text per language, with the language name included as a header.
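The minute-by-minute segmentation described above can be sketched in a few lines. This is an illustrative sketch, not the actual YT_subtitles code; it assumes subtitle cues arrive as (start-time-in-seconds, text) pairs:

```python
from collections import defaultdict

def group_by_minute(cues):
    """Group subtitle cues into minute-by-minute segments.

    `cues` is a list of (start_seconds, text) pairs, as a subtitle
    extractor might produce; returns {minute_index: joined_text}.
    """
    segments = defaultdict(list)
    for start, text in cues:
        segments[int(start // 60)].append(text)
    return {minute: " ".join(parts) for minute, parts in sorted(segments.items())}

cues = [(3.2, "Welcome back."), (41.0, "Today we review a movie."),
        (65.5, "First, the plot.")]
print(group_by_minute(cues))   # cues split across minute 0 and minute 1
```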

    Dataset Format:

    The dataset is stored in a JSONL (JSON Lines) file format. Each entry corresponds to a minute of subtitles, with language-specific content. If only one language is available, the output consists of a plain text version of the subtitles without additional metadata.
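The JSONL layout described above, one object per minute keyed by language, can be sketched as follows. The field names here are illustrative, not the tool's exact schema:

```python
import io
import json

# Hypothetical entries: one JSON object per minute of subtitles,
# keyed by language name, as described above.
entries = [
    {"minute": 0, "English": "Welcome back.", "Spanish": "Bienvenidos de nuevo."},
    {"minute": 1, "English": "First, the plot."},
]

# Serialize as JSON Lines: one object per line.
buf = io.StringIO()
for entry in entries:
    buf.write(json.dumps(entry, ensure_ascii=False) + "\n")

# Parse it back, line by line.
loaded = [json.loads(line) for line in buf.getvalue().splitlines()]
print(loaded[1]["English"])
```

JSONL is convenient here because each minute can be streamed and parsed independently, without loading the whole file.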

    Use Cases:

    Researchers and developers can utilize this dataset to enhance the multilingual performance of language models. It's particularly valuable for training models that work with diverse languages and real-world video content.

    (1) sdtblck/youtube_subtitle_dataset: YT_subtitles - GitHub. https://github.com/sdtblck/youtube_subtitle_dataset
    (2) Youtubean Dataset | Papers With Code. https://paperswithcode.com/dataset/youtubean
    (3) youtube subtitles | Kaggle. https://www.kaggle.com/datasets/wadzim/youtube-subtitles
    (4) YouTube-8M Segments Dataset - Google Research. https://research.google.com/youtube8m/

  3. OpenSubtitles Dataset

    • paperswithcode.com
    Updated Jul 10, 2022
    + more versions
    Cite
    Pierre Lison; Jörg Tiedemann (2022). OpenSubtitles Dataset [Dataset]. https://paperswithcode.com/dataset/opensubtitles
    Explore at:
    Dataset updated
    Jul 10, 2022
    Authors
    Pierre Lison; Jörg Tiedemann
    Description

    OpenSubtitles is a collection of multilingual parallel corpora. The dataset is compiled from a large database of movie and TV subtitles and includes a total of 1689 bitexts spanning 2.6 billion sentences across 60 languages.

  4. English Movie Subtitle Collection

    • kaggle.com
    zip
    Updated Apr 24, 2024
    Cite
    Sandun De Silva (2024). English Movie Subtitle Collection [Dataset]. https://www.kaggle.com/datasets/sandundesilva/movie-genre-dataset
    Explore at:
    zip (5,081,555 bytes)
    Dataset updated
    Apr 24, 2024
    Authors
    Sandun De Silva
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset

    This dataset was created by Sandun De Silva

    Released under Apache 2.0

    Contents

  5. subtitles

    • kaggle.com
    zip
    Updated Aug 1, 2023
    Cite
    Sergey Sohackiy (2023). subtitles [Dataset]. https://www.kaggle.com/datasets/sergeysohackiy/subtitles
    Explore at:
    zip (9,239,834 bytes)
    Dataset updated
    Aug 1, 2023
    Authors
    Sergey Sohackiy
    Description

    Dataset

    This dataset was created by Sergey Sohackiy

    Contents

  6. SubIMDB: A Structured Corpus of Subtitles

    • live.european-language-grid.eu
    • zenodo.org
    txt
    Updated Nov 15, 2022
    Cite
    (2022). SubIMDB: A Structured Corpus of Subtitles [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7453
    Explore at:
    txt
    Dataset updated
    Nov 15, 2022
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Exploring language usage through frequency analysis in large corpora is a defining feature in most recent work in corpus and computational linguistics. From a psycholinguistic perspective, however, the corpora used in these contributions are often not representative of language usage: they are either domain-specific, limited in size, or extracted from unreliable sources. In an effort to address this limitation, we introduce SubIMDB, a corpus of everyday spoken language that we created, containing over 225 million words. The corpus was extracted from 38,102 subtitles of family, comedy and children movies and series, and is the first sizeable structured corpus of subtitles made available. Our experiments show that word frequency norms extracted from this corpus are more effective than those from well-known norms such as Kucera-Francis, HAL and SUBTLEXus in predicting various psycholinguistic properties of words, such as lexical decision times, familiarity, age of acquisition and simplicity. We also provide evidence that contradicts the long-standing assumption that the ideal size for a corpus can be determined solely based on how well its word frequencies correlate with lexical decision times.
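Word frequency norms of the kind the abstract compares (Kucera-Francis, HAL, SUBTLEXus) are conventionally expressed as counts per million tokens. A minimal sketch of that computation on a toy corpus (illustrative only, not SubIMDB's actual pipeline):

```python
from collections import Counter
import re

def per_million_frequencies(text):
    """Compute word frequency per million tokens, the usual unit
    for psycholinguistic norms such as SUBTLEXus."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    return {w: c * 1_000_000 / total for w, c in counts.items()}

corpus = "the cat sat on the mat and the dog sat too"
freqs = per_million_frequencies(corpus)
print(freqs["the"])   # 3 occurrences out of 11 tokens, scaled per million
```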

  7. ParTree - Parallel Treebanks: A multilingual corpus of movie subtitles.

    • doi.org
    • swissubase.ch
    Updated Mar 21, 2023
    Cite
    (2023). ParTree - Parallel Treebanks: A multilingual corpus of movie subtitles. [Dataset]. http://doi.org/10.48656/5mz4-x435
    Explore at:
    Dataset updated
    Mar 21, 2023
    Description

    A multilingual corpus of movie subtitles aligned on the sentence level. Contains data on more than 50 languages with a focus on the Indo-European language family. Morphosyntactic annotation (part-of-speech, features, dependencies) in Universal Dependencies style is available for 47 languages.
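Universal Dependencies annotation is conventionally distributed in the CoNLL-U format (ten tab-separated columns per token); ParTree's exact release format is not stated here, but a minimal CoNLL-U reader looks like this:

```python
def parse_conllu(block):
    """Parse one CoNLL-U sentence block into (id, form, upos, head, deprel)
    tuples, skipping comment lines and multiword-token ranges."""
    rows = []
    for line in block.strip().splitlines():
        if line.startswith("#"):
            continue
        cols = line.split("\t")
        if "-" in cols[0] or "." in cols[0]:
            continue  # multiword tokens and empty nodes
        # Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
        rows.append((int(cols[0]), cols[1], cols[3], int(cols[6]), cols[7]))
    return rows

sample = ("# text = I left.\n"
          "1\tI\tI\tPRON\tPRP\t_\t2\tnsubj\t_\t_\n"
          "2\tleft\tleave\tVERB\tVBD\t_\t0\troot\t_\t_\n"
          "3\t.\t.\tPUNCT\t.\t_\t2\tpunct\t_\t_")
print(parse_conllu(sample))
```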

  8. English-Montenegrin parallel corpus of subtitles Opus-MontenegrinSubs 1.0 -...

    • b2find.dkrz.de
    Updated Oct 28, 2023
    + more versions
    Cite
    (2023). English-Montenegrin parallel corpus of subtitles Opus-MontenegrinSubs 1.0 - Dataset - B2FIND [Dataset]. https://b2find.dkrz.de/dataset/cf233eb6-8f38-50fe-87b9-97ad777843b2
    Explore at:
    Dataset updated
    Oct 28, 2023
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This corpus contains parallel English-Montenegrin subtitles collected in the scope of linguistic and translatological research conducted by Petar Božović for his PhD thesis "Audiovisual Translation and Elements of Culture: A Comparative Analysis of Transfer with Reception Study in Montenegro". The data and permission to redistribute were obtained from the Radio and Television of Montenegro (http://www.rtcg.me), the public service broadcaster of Montenegro. The corpus consists of English and Montenegrin subtitles of three TV series: House of Cards (686 minutes), Damages (2878 minutes), and Tudors (1999 minutes). The corpus covers 10 seasons, 110 episodes, and 5,563 minutes in terms of duration. Sentence alignment and basic encoding were performed inside the OPUS project (http://opus.nlpl.eu/MontenegrinSubs.php), while MSD tagging, lemmatisation, and TEI conversion were performed by the CLARIN.SI infrastructure. The English texts were tagged by TreeTagger (http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/) and the Montenegrin texts by ReLDI Tagger (https://github.com/clarinsi/reldi-tagger) using the Serbian language model. The TreeTagger (Penn Treebank) tagset was mapped to the SPOOK MSD tagset for English (https://nl.ijs.si/spook/msd/html-en/msd-en.html). The corpus is available in TEI format and in the derived vertical format used by CQP and Manatee (Sketch Engine). The alignments in the vertical file are given separately as tables linking the alignment elements of the two languages.
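The alignment tables mentioned above link sentence elements across the two languages, typically with many-to-one links allowed. A minimal sketch of consuming such links to produce bitext pairs; the link representation here is illustrative, not the exact OPUS layout:

```python
def pair_sentences(en_sents, me_sents, links):
    """Pair aligned sentences given 1-based alignment links such as
    [([1], [1]), ([2, 3], [2])], where each link maps a group of
    source sentences to a group of target sentences."""
    pairs = []
    for en_ids, me_ids in links:
        en = " ".join(en_sents[i - 1] for i in en_ids)
        me = " ".join(me_sents[i - 1] for i in me_ids)
        pairs.append((en, me))
    return pairs

en = ["Good evening.", "Sit down,", "please."]
me = ["Dobro veče.", "Sjedite, molim vas."]
print(pair_sentences(en, me, [([1], [1]), ([2, 3], [2])]))
```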

  9. yyets-subtitles

    • huggingface.co
    Updated Dec 8, 2024
    Cite
    chenrm (2024). yyets-subtitles [Dataset]. https://huggingface.co/datasets/chenrm/yyets-subtitles
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Dec 8, 2024
    Authors
    chenrm
    Description

    The chenrm/yyets-subtitles dataset is hosted on Hugging Face and contributed by the HF Datasets community.

  10. Subtitles

    • huggingface.co
    Updated Apr 1, 2002
    + more versions
    Cite
    Peanut Jar Mixers Development (2002). Subtitles [Dataset]. https://huggingface.co/datasets/PJMixers-Dev/Subtitles
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Apr 1, 2002
    Dataset authored and provided by
    Peanut Jar Mixers Development
    Description

    The PJMixers-Dev/Subtitles dataset is hosted on Hugging Face and contributed by the HF Datasets community.

  11. IRL Subtitles

    • osf.io
    Updated Apr 26, 2024
    Cite
    Stephen T Metcalfe; Luc Prisby (2024). IRL Subtitles [Dataset]. https://osf.io/a8ky7
    Explore at:
    Dataset updated
    Apr 26, 2024
    Dataset provided by
    Center For Open Science
    Authors
    Stephen T Metcalfe; Luc Prisby
    Description

    The goal of the project is to create wearable technology that displays highly embedded real-time subtitles from the environment to the user through an augmented reality solution. An array of microphones and a camera precisely identify the origin of a sound, and an implementation of open-source speech-to-text software produces displayable captions. This project may be useful for people who are deaf or hard of hearing. Furthermore, when used in conjunction with live translation, it may help people of different languages communicate effectively.

  12. Real Time Subtitles Market Research Report 2032

    • dataintelo.com
    csv, pdf, pptx
    Updated Oct 4, 2024
    Cite
    Dataintelo (2024). Real Time Subtitles Market Research Report 2032 [Dataset]. https://dataintelo.com/report/real-time-subtitles-market
    Explore at:
    csv, pdf, pptx
    Dataset updated
    Oct 4, 2024
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Real Time Subtitles Market Outlook



    The global real time subtitles market size was valued at approximately USD 2.5 billion in 2023 and is expected to surge to around USD 6.8 billion by 2032, growing at a compound annual growth rate (CAGR) of 11.5% during the forecast period. This notable growth can be attributed to several factors, including the rising demand for accessible content, advancements in artificial intelligence (AI) and machine learning (ML) technologies, and the increasing globalization of media and corporate communications.
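The projection above follows the standard compound-growth formula; a quick check of the quoted figures (USD 2.5 billion in 2023, 11.5% CAGR over the nine years to 2032):

```python
def project(value, cagr, years):
    """Project a value forward under a constant compound annual growth rate."""
    return value * (1 + cagr) ** years

projected = project(2.5, 0.115, 2032 - 2023)
print(f"USD {projected:.2f} billion by 2032")  # lands close to the quoted 6.8
```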



    One of the primary growth factors driving the real time subtitles market is the increasing emphasis on accessibility and inclusiveness in media and communications. Governments and organizations worldwide are instituting regulations and policies requiring content to be accessible to individuals who are deaf or hard of hearing. For instance, the Americans with Disabilities Act (ADA) in the United States mandates that video content be accessible, propelling the adoption of real-time subtitle solutions. This regulatory environment, coupled with growing social awareness, significantly fuels market growth.



    Another critical driver is the rapid advancement of AI and ML technologies, which have revolutionized the accuracy and efficiency of real-time subtitle generation. Modern AI-driven subtitle solutions can now offer near-perfect synchronization and error-free transcription, enhancing user experience. These technological advancements are making real-time subtitles more reliable and scalable, thereby increasing their adoption across various sectors such as broadcasting, education, and corporate communications.



    The globalization of media content and corporate operations further contributes to the market's expansion. As companies and content creators aim to reach a global audience, the need for multilingual subtitle solutions becomes imperative. Real-time subtitles facilitate effective communication across different languages and cultural contexts, thereby broadening the reach and appeal of content. This globalization trend is particularly evident in the streaming services sector, where platforms are increasingly providing real-time subtitles in multiple languages to cater to diverse audiences.



    Regionally, North America and Europe are currently the largest markets for real-time subtitles, driven by stringent accessibility regulations and the advanced state of digital infrastructure. However, the Asia Pacific region is expected to witness the highest growth rate during the forecast period, owing to increasing internet penetration, the proliferation of digital content, and rising awareness about accessibility. China and India, with their massive consumer bases and growing digital economies, are poised to be significant contributors to this regional market growth.



    Component Analysis



    The real time subtitles market by component can be broadly categorized into software, hardware, and services. Each of these segments plays a crucial role in the comprehensive ecosystem of real-time subtitle solutions. The software segment includes various applications and platforms that facilitate subtitle generation and synchronization. This segment is expected to dominate the market due to continuous advancements in AI and ML algorithms that significantly improve the accuracy and efficiency of subtitle generation. Companies are investing heavily in R&D to develop innovative software solutions that cater to diverse linguistic and accessibility needs.



    The hardware segment encompasses the physical devices required to support real-time subtitle generation and display. These include specialized subtitle generation hardware, robust computing systems, and display units. While the hardware segment is smaller compared to software, it remains vital for environments where high reliability and performance are crucial, such as live broadcasting and large corporate events. Technological advancements in computing power and display technologies are also driving growth in this segment, making hardware more compact, efficient, and cost-effective.



    The services segment includes implementation, maintenance, and support services for real-time subtitle solutions. As businesses and organizations increasingly adopt these solutions, the demand for professional services to ensure smooth integration and operation is growing. This segment is expected to witness steady growth as it provides essential support for the seamless deployment and ongoing performance of real-time subtitle systems. Services such as training, customization, and technical

  13. Sentiment in Machine Translation of Slovak Movie Subtitles

    • data.mendeley.com
    Updated Aug 1, 2023
    Cite
    Jaroslav Reichel (2023). Sentiment in Machine Translation of Slovak Movie Subtitles [Dataset]. http://doi.org/10.17632/dp58jkhy8g.1
    Explore at:
    Dataset updated
    Aug 1, 2023
    Authors
    Jaroslav Reichel
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset represents processed movie subtitle data adjusted for sentiment analysis, which was implemented using IBM Watson Natural Language Understanding (IBM NLU). The source data contains Slovak and English subtitles from 10 movies, matched into pairs. Each subtitle is matched with a machine translation generated using Google Translate and a sentiment score identified using the OpenAI GPT model. The next matrix processes the results of the sentiment analysis from the IBM NLU service for each segment. The third file contains the results of validating the accuracy and error rates of the machine translations using the BLEU and TER metrics.
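BLEU, one of the two validation metrics mentioned, scores a translation by its n-gram overlap with a reference, discounted by a brevity penalty. A deliberately simplified, unigram-only sketch (real BLEU averages 1- to 4-gram precisions; sacrebleu is the usual implementation):

```python
import math
from collections import Counter

def unigram_bleu(hypothesis, reference):
    """Simplified BLEU: clipped unigram precision times brevity penalty.
    Illustrative only; real BLEU combines 1- to 4-gram precisions."""
    hyp, ref = hypothesis.split(), reference.split()
    # Clip each word's count by its count in the reference.
    overlap = sum((Counter(hyp) & Counter(ref)).values())
    precision = overlap / len(hyp)
    # Penalize hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * precision

print(unigram_bleu("the cat sat on the mat", "the cat is on the mat"))
```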

  14. SBU Captions Dataset

    • paperswithcode.com
    Cite
    Vicente Ordonez; Girish Kulkarni; Tamara L. Berg. SBU Captions Dataset [Dataset]. https://paperswithcode.com/dataset/sbu-captions-dataset
    Explore at:
    Authors
    Vicente Ordonez; Girish Kulkarni; Tamara L. Berg
    Description

    A collection that allows researchers to approach the extremely challenging problem of description generation using relatively simple non-parametric methods that produce surprisingly effective results.

  15. MultiSubs Dataset

    • paperswithcode.com
    Updated Jul 6, 2021
    + more versions
    Cite
    Josiah Wang; Pranava Madhyastha; Josiel Figueiredo; Chiraag Lala; Lucia Specia (2021). MultiSubs Dataset [Dataset]. https://paperswithcode.com/dataset/multisubs
    Explore at:
    Dataset updated
    Jul 6, 2021
    Authors
    Josiah Wang; Pranava Madhyastha; Josiel Figueiredo; Chiraag Lala; Lucia Specia
    Description

    MultiSubs is a dataset of multilingual subtitles gathered from the OPUS OpenSubtitles dataset, which in turn was sourced from opensubtitles.org. We have supplemented some text fragments (visually salient nouns in this release) within the subtitles with web images, where the word sense of the fragment has been disambiguated using a cross-lingual approach. We have introduced a fill-in-the-blank task and a lexical translation task to demonstrate the utility of the dataset. Please refer to our paper for a more detailed description of the dataset and tasks. MultiSubs will benefit research on visual grounding of words, especially in the context of free-form sentences.
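The fill-in-the-blank task mentioned above masks a visually salient noun in a subtitle; a minimal sketch of constructing such an item (the item format is illustrative, not the dataset's exact schema):

```python
import re

def make_blank(sentence, target):
    """Create a fill-in-the-blank item by masking one occurrence of the
    target word, as in the task MultiSubs introduces."""
    pattern = r"\b" + re.escape(target) + r"\b"
    blanked = re.sub(pattern, "_____", sentence, count=1)
    return {"text": blanked, "answer": target}

item = make_blank("He slid the letter under the door.", "door")
print(item)   # {'text': 'He slid the letter under the _____.', 'answer': 'door'}
```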

    Josiah Wang, Pranava Madhyastha, Josiel Figueiredo, Chiraag Lala, Lucia Specia (2021). MultiSubs: A Large-scale Multimodal and Multilingual Dataset. CoRR, abs/2103.01910. Available at: https://arxiv.org/abs/2103.01910

  16. open-subtitles-bitext-mining

    • huggingface.co
    Updated Apr 9, 2024
    + more versions
    Cite
    Loïc Magne (2024). open-subtitles-bitext-mining [Dataset]. https://huggingface.co/datasets/loicmagne/open-subtitles-bitext-mining
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Apr 9, 2024
    Authors
    Loïc Magne
    Description

    The loicmagne/open-subtitles-bitext-mining dataset is hosted on Hugging Face and contributed by the HF Datasets community.

  17. Reasons why adults use subtitles when watching TV in known language in the...

    • statista.com
    Updated Jun 4, 2024
    Cite
    Statista (2024). Reasons why adults use subtitles when watching TV in known language in the U.S. 2023 [Dataset]. https://www.statista.com/statistics/1459167/reasons-use-subtitles-watching-tv-known-language-us/
    Explore at:
    Dataset updated
    Jun 4, 2024
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Jun 29, 2023 - Jul 5, 2023
    Area covered
    United States
    Description

    Enhancement of comprehension and more profound understanding of accents were the most common reasons why American adults use subtitles while watching TV in a known language, according to a survey conducted between June and July 2023. Another 33 percent of the respondents stated that they did so because they were in a noisy environment.

  18. Movie Subtitles CSV Hindi English

    • kaggle.com
    zip
    Updated Oct 29, 2021
    Cite
    Manish Tripathi (2021). Movie Subtitles CSV Hindi English [Dataset]. https://www.kaggle.com/datasets/manishtripathi86/movie-subtitles-csv-hindi-english
    Explore at:
    zip (496,667 bytes)
    Dataset updated
    Oct 29, 2021
    Authors
    Manish Tripathi
    Description

    Dataset

    This dataset was created by Manish Tripathi

    Contents

  19. Audience preferences for subtitles or dubbing 2021, by country

    • statista.com
    Updated Mar 18, 2022
    Cite
    Statista (2022). Audience preferences for subtitles or dubbing 2021, by country [Dataset]. https://www.statista.com/statistics/1289864/subtitles-dubbing-audience-preference-by-country/
    Explore at:
    Dataset updated
    Mar 18, 2022
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Nov 2021
    Area covered
    Worldwide
    Description

    According to a November 2021 survey of viewers who watch foreign content, subtitling was preferred over dubbing in the United States and the United Kingdom, with 76 percent and 75 percent of respondents, respectively, reporting a preference for subtitles. By comparison, 54 percent of video viewers in Italy reported preferring dubbing, while in Germany this figure rose to six in ten respondents.

  20. Global Film subtitling Market Size is USD 8514.2 million in 2024.

    • cognitivemarketresearch.com
    pdf,excel,csv,ppt
    Updated Jul 29, 2024
    Cite
    Cognitive Market Research (2024). Global Film subtitling Market Size is USD 8514.2 million in 2024. [Dataset]. https://www.cognitivemarketresearch.com/film-subtitling-market-report
    Explore at:
    pdf, excel, csv, ppt
    Dataset updated
    Jul 29, 2024
    Dataset authored and provided by
    Cognitive Market Research
    License

    https://www.cognitivemarketresearch.com/privacy-policy

    Time period covered
    2019 - 2031
    Area covered
    Global
    Description

    According to Cognitive Market Research, the global Film subtitling market size is USD 8514.2 million in 2024. It will expand at a compound annual growth rate (CAGR) of 8.00% from 2024 to 2031.

    North America held the major market share for more than 40% of the global revenue with a market size of USD 3405.68 million in 2024 and will grow at a compound annual growth rate (CAGR) of 10.0% from 2024 to 2031.
    Europe accounted for a market share of over 30% of the global revenue with a market size of USD 2554.26 million.
    Asia Pacific held a market share of around 23% of the global revenue with a market size of USD 1958.27 million in 2024 and will grow at a compound annual growth rate (CAGR) of 10.0% from 2024 to 2031.
    Latin America had a market share for more than 5% of the global revenue with a market size of USD 425.71 million in 2024 and will grow at a compound annual growth rate (CAGR) of 7.4% from 2024 to 2031.
    Middle East and Africa had a market share of around 2% of the global revenue and was estimated at a market size of USD 170.28 million in 2024 and will grow at a compound annual growth rate (CAGR) of 7.7% from 2024 to 2031.
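The regional figures above can be cross-checked against the quoted global total; a quick sketch using only the numbers in this report:

```python
# Regional 2024 market sizes quoted above, in USD millions.
regions = {
    "North America": 3405.68,
    "Europe": 2554.26,
    "Asia Pacific": 1958.27,
    "Latin America": 425.71,
    "Middle East and Africa": 170.28,
}

total = sum(regions.values())
print(f"Sum of regions: USD {total:.2f} million")  # matches the USD 8514.2 million total

# Share of the global market implied by each region's figure.
for name, size in regions.items():
    print(f"{name}: {size / total:.1%}")
```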

    Market Dynamics of Film Subtitling Market

    Key Drivers for Film Subtitling Market

    Growing international distribution of films necessitates multilingual subtitling to reach diverse audiences

    The growing popularity of streaming services and a worldwide audience are driving a dramatic shift in the global cinema business towards international distribution. To reach a wide range of consumers effectively, multilingual subtitling is now required. Beyond translating language, subtitles preserve cultural nuances, ensuring that films can be enjoyed by audiences around the world. The increasing need for content localization, the process of adapting films to the linguistic and cultural preferences of different regions, amplifies this trend. Film producers and distributors are investing more and more in high-quality subtitling services to increase accessibility and audience engagement internationally, as well as to tap into new revenue streams and broaden their global reach.

    Legal requirements and cultural initiatives promote subtitling for accessibility, including for the hearing impaired

    Globally, a growing body of legal frameworks and cultural initiatives promotes accessibility in media, including film. Subtitles are essential to making material accessible to a variety of audiences, including people who are hard of hearing. Many nations have regulations requiring subtitles as part of accessibility guidelines, with the goal of granting everyone equitable access to cultural and entertainment content. Cultural initiatives encourage producers to prioritize subtitling services that address language diversity and accessibility requirements, thereby advancing inclusive practices in the film business. This growing focus promotes social inclusion and creates new business prospects for subtitling experts and for technology companies that specialize in accessible solutions for the entertainment industry.

    Restraint Factor For The Film Subtitling Market

    Compatibility issues with different video formats and platforms can hinder seamless subtitling implementation

    The smooth application of subtitling in the film business is significantly hampered by compatibility concerns with various video formats and platforms. With their precise translations and accessibility features, subtitles play a crucial role in improving the viewing experience. Subtitle integration might become more challenging due to differences in technical requirements between systems and video formats (e.g., encoding, frame rate, and resolution). A less-than-ideal viewing experience is frequently caused by these disparities, which can also lead to synchronization faults, display problems, or even total incompatibility. Technology suppliers and subtitling experts need to create adaptable solutions that work with a variety of platforms and formats in order to overcome these obstacles. In an increasingly digitized and diverse media world, this entails utilizing cutting-edge technology and standards that guarantee flawless subtitling integration, enhancing overa...
