100+ datasets found
  1. open_subtitles

    • huggingface.co
    • marketplace.sshopencloud.eu
    Updated Mar 21, 2023
    + more versions
    Cite
    Helsinki-NLP Research Group (2023). open_subtitles [Dataset]. https://huggingface.co/datasets/Helsinki-NLP/open_subtitles
    Explore at:
    Dataset updated
    Mar 21, 2023
    Dataset authored and provided by
    Helsinki-NLP Research Group
    License

    https://choosealicense.com/licenses/unknown/

    Description

    This is a new collection of translated movie subtitles from http://www.opensubtitles.org/.

    IMPORTANT: If you use the OpenSubtitle corpus: Please, add a link to http://www.opensubtitles.org/ to your website and to your reports and publications produced with the data!

    This is a slightly cleaner version of the subtitle collection using improved sentence alignment and better language checking.

    62 languages, 1,782 bitexts. Total number of files: 3,735,070; total number of tokens: 22.10G; total number of sentence fragments: 3.35G.

  2. Open Subtitles Multilingual Translation

    • kaggle.com
    zip
    Updated Nov 26, 2023
    Cite
    The Devastator (2023). Open Subtitles Multilingual Translation [Dataset]. https://www.kaggle.com/datasets/thedevastator/open-subtitles-multilingual-translation
    Explore at:
    Available download formats: zip (403304423 bytes)
    Dataset updated
    Nov 26, 2023
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Open Subtitles Multilingual Translation

    Train Sequential Neural Networks in Nine Languages

    By Huggingface Hub [source]

    About this dataset

    This dataset provides an invaluable opportunity to train a neural network model to effectively and accurately translate text between an array of nine different languages, including Finnish, Hindi, Basque, Esperanto, French, Armenian, Bengali, Icelandic and Russian. Each language CSV file includes three columns: an ID column; a meta column which provides information about the source of the sentence; and finally a 'translation' column that contains the translated sentence. The aim is to build a dataset suitable for training models capable of mastering multilingual translation tasks in order to bridge gaps between languages. Train your model with this unique dataset today!

    More Datasets

    For more datasets, click here.

    Featured Notebooks

    • 🚨 Your notebook can be here! 🚨

    How to use the dataset

    This dataset is a great resource for anyone looking to build a translation model using neural networks. Here is a guide on how to use it:

    • Download the appropriate .csv files for the languages you need from the Kaggle dataset.
    • Each row contains ID, meta and translation columns. The integer ID uniquely identifies the row; the meta column records where each sentence originated, so you can quickly filter out sentences with suspect origins; the translation column pairs the English sentence with its equivalent in the foreign language you are working with.
    • Make sure you have enough training data. If available, experiment with individual language-pair subsets before assembling your final full dataset; this Kaggle set should provide sufficient sample sizes per language pair.
    • Finally, construct the input feature vectors for your neural network by gathering the relevant variables into separate lists or arrays, then use them when defining your network's layers.
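    The loading steps above can be sketched in a few lines of Python. This is a minimal sketch: the file name is hypothetical, and the assumption that the translation column stores both sides as a dict-like string follows the common Hugging Face OPUS export format, not a documented property of this exact dataset.

```python
# Sketch: loading one language-pair CSV from this Kaggle dataset.
# Column names (id, meta, translation) follow the description above; the
# exact file name and the translation column's encoding are assumptions.
import ast
import csv

def load_pairs(csv_path, src="en", tgt="fr"):
    """Yield (id, source sentence, target sentence) tuples."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            # The translation column may store both sides as a dict-like
            # string, e.g. "{'en': 'Hello', 'fr': 'Bonjour'}".
            d = ast.literal_eval(row["translation"])
            yield row["id"], d.get(src), d.get(tgt)
```

    From the resulting tuples you can build whatever list/array feature sets your network setup expects.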

    Research Ideas

    • Creating a neural network to automatically translate texts from any of the 9 languages in this dataset into any other language.
    • Developing an AI-powered chatbot that can reply in multiple languages that the users prefer.
    • Building an automatic translation system with real-time video conversation capabilities for use by professionals such as interpreters and international translators.

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    **License: [CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication](https://creativecommon...

  3. French Conversations (from movie subtitles)

    • kaggle.com
    zip
    Updated Aug 3, 2023
    + more versions
    Cite
    Dali Selmi (2023). French Conversations (from movie subtitles) [Dataset]. https://www.kaggle.com/datasets/daliselmi/french-conversational-dataset
    Explore at:
    Available download formats: zip (2880370702 bytes)
    Dataset updated
    Aug 3, 2023
    Authors
    Dali Selmi
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    French
    Description

    French Movie Subtitle Conversations Dataset

    Description

    Dive into the world of French dialogue with the French Movie Subtitle Conversations dataset โ€“ a comprehensive collection of over 127,000 movie subtitle conversations. This dataset offers a deep exploration of authentic and diverse conversational contexts spanning various genres, eras, and scenarios. It is thoughtfully organized into three distinct sets: training, testing, and validation.

    Content Overview

    Each conversation in this dataset is structured as a JSON object, featuring three key attributes:

    1. Context: Get a holistic view of the conversation's flow with the preceding 9 lines of dialogue. This context provides invaluable insights into the conversation's dynamics and contextual cues.
    2. Knowledge: Immerse yourself in a wide range of thematic knowledge. This dataset covers an array of topics, ensuring that your models receive exposure to diverse information sources for generating well-informed responses.
    3. Response: Explore how characters react and respond across various scenarios. From casual conversations to intense emotional exchanges, this dataset encapsulates the authenticity of genuine human interaction.

    Data Sample

    Here's a snippet from the dataset to give you an idea of its structure:

    ```
    [
     {
      "context": [
       "Tu as attendu longtemps?",
       "Oui en effet.",
       "Je pense que c' est grossier pour un premier rencard.",
       // ... (6 more lines of context)
      ],
      "knowledge": "",
      "response": "On n' avait pas dit 9h?"
     },
     // ... (more data samples)
    ]
    ```

    Use Cases

    The French Movie Subtitle Conversations dataset serves as a valuable resource for several applications:

    • Conversational AI: Train advanced chatbots and dialogue systems in French that can engage users in fluid, contextually aware conversations.
    • Language Modeling: Enhance your language models by leveraging diverse dialogue patterns, colloquialisms, and contextual dependencies present in real-world conversations.
    • Sentiment Analysis: Investigate the emotional tones of conversations across different movie genres and periods, contributing to a better understanding of sentiment variation.
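    For the conversational-AI and language-modeling uses above, the JSON structure shown in the Data Sample can be turned into training pairs. A minimal sketch, assuming one split per file (the file name is hypothetical; the attribute names follow the dataset's description):

```python
# Sketch: turn the conversation JSON described above into
# (context, response) training pairs for a dialogue model.
import json

def load_conversations(path):
    """Return (joined context, response) pairs from one dataset split."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    # Join the preceding dialogue lines into a single context string.
    return [(" ".join(item["context"]), item["response"]) for item in data]
```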

    Why This Dataset

    • Size and Diversity: With a vast collection of over 127,000 conversations spanning diverse genres and tones, this dataset offers an unparalleled breadth and depth in French dialogue data.
    • Contextual Richness: The inclusion of context empowers researchers and practitioners to explore the dynamics of conversation flow, leading to more accurate and contextually relevant responses.
    • Real-world Relevance: Originating from movie subtitles, this dataset mirrors real-world interactions, making it a valuable asset for training models that understand and generate human-like dialogue.

    Acknowledgments

    We extend our gratitude to the movie subtitle community for their contributions, which have enabled the creation of this diverse and comprehensive French dialogue dataset.

    Unlock the potential of authentic French conversations today with the French Movie Subtitle Conversations dataset. Engage in state-of-the-art research, enhance language models, and create applications that resonate with the nuances of real dialogue.

  4. Movie Subtitles - Dataset - LDM

    • service.tib.eu
    • resodate.org
    Updated Jan 3, 2025
    Cite
    (2025). Movie Subtitles - Dataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/movie-subtitles
    Explore at:
    Dataset updated
    Jan 3, 2025
    Description

    The dataset is used to test the proposed methodologies for mining parallel data from comparable corpora.

  5. ParTree - Parallel Treebanks: A multilingual corpus of movie subtitles.

    • doi.org
    • swissubase.ch
    Updated Mar 21, 2023
    Cite
    (2023). ParTree - Parallel Treebanks: A multilingual corpus of movie subtitles. [Dataset]. http://doi.org/10.48656/5mz4-x435
    Explore at:
    Dataset updated
    Mar 21, 2023
    Description

    A multilingual corpus of movie subtitles aligned on the sentence-level. Contains data on more than 50 languages with a focus on the Indo-European language family. Morphosyntactic annotation (part-of-speech, features, dependencies) in Universal Dependency-style is available for 47 languages.

  6. Open Subtitles dataset

    • resodate.org
    • service.tib.eu
    Updated Dec 17, 2024
    Cite
    Pavel Sountsov; Sunita Sarawagi (2024). Open Subtitles dataset [Dataset]. https://resodate.org/resources/aHR0cHM6Ly9zZXJ2aWNlLnRpYi5ldS9sZG1zZXJ2aWNlL2RhdGFzZXQvb3Blbi1zdWJ0aXRsZXMtZGF0YXNldA==
    Explore at:
    Dataset updated
    Dec 17, 2024
    Dataset provided by
    Leibniz Data Manager
    Authors
    Pavel Sountsov; Sunita Sarawagi
    Description

    The Open Subtitles dataset consists of transcriptions of spoken dialog in movies and television shows.

  7. Movie Parallel Subtitles (EN-IT-RU)

    • kaggle.com
    zip
    Updated Jun 26, 2025
    Cite
    Timur Sharifullin (2025). Movie Parallel Subtitles (EN-IT-RU) [Dataset]. https://www.kaggle.com/datasets/timursharifullindata/movie-parallel-subtitles-small-sentiment-dataset
    Explore at:
    Available download formats: zip (6056 bytes)
    Dataset updated
    Jun 26, 2025
    Authors
    Timur Sharifullin
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This dataset contains 25 aligned movie subtitle segments in English, Russian, and Italian, extracted from the ParTree corpus. Each row provides a short, context-rich movie line with its translations in all three languages, making it ideal for research and development in machine translation, multilingual NLP, and cross-lingual transfer learning.

    Key features: - Parallel triplets: English, Russian, Italian - Sourced from authentic movie subtitles for natural, conversational language - Suitable for training, validation, and benchmarking of translation and multilingual models

    Data originally from the ParTree corpus, available via Swiss-AL

  8. Subtitles

    • huggingface.co
    Updated Apr 1, 2002
    + more versions
    Cite
    Peanut Jar Mixers Development (2002). Subtitles [Dataset]. https://huggingface.co/datasets/PJMixers-Dev/Subtitles
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets; learn more about it at mlcommons.org/croissant.
    Dataset updated
    Apr 1, 2002
    Dataset authored and provided by
    Peanut Jar Mixers Development
    Description

    PJMixers-Dev/Subtitles dataset hosted on Hugging Face and contributed by the HF Datasets community

  9. Movie Subtitle Dataset

    • kaggle.com
    zip
    Updated Aug 8, 2021
    Cite
    Adiamaan (2021). Movie Subtitle Dataset [Dataset]. https://www.kaggle.com/adiamaan/movie-subtitle-dataset
    Explore at:
    Available download formats: zip (254871718 bytes)
    Dataset updated
    Aug 8, 2021
    Authors
    Adiamaan
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    💡 Motive

    I was thinking about movie sentiments and wanted to see if there is any strong pattern behind how sentiment fluctuates across the movie to how that movie is received or performed.

    ๐ŸŽ Lowest hanging fruit

    To track movie sentiments across the run time, the easy way is to get the movie subtitles and identify the sentiment for each text in the subtitle. The advantage of this approach is that movie subtitles are easy to get, parse, and process, and NLP frameworks can readily help with the task. The approach is also scalable: irrespective of the original language, English subtitles are available for almost all movies, albeit with translation errors.
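    The per-line scoring idea can be sketched as follows. The tiny word lists are toy placeholders standing in for a real sentiment model, and the rolling-window smoothing is one reasonable way (an assumption, not the dataset author's method) to turn per-line scores into a curve over the film's run time:

```python
# Sketch of the approach described above: score each subtitle line and
# track how sentiment evolves across the movie's run time.
POSITIVE = {"love", "great", "happy", "wonderful"}  # placeholder lexicon
NEGATIVE = {"hate", "terrible", "sad", "awful"}     # placeholder lexicon

def line_sentiment(text):
    """Naive lexicon score: positive hits minus negative hits."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def sentiment_curve(subtitle_lines, window=50):
    """Rolling sum of per-line sentiment over the film's timeline."""
    scores = [line_sentiment(t) for t in subtitle_lines]
    return [sum(scores[max(0, i - window):i + 1]) for i in range(len(scores))]
```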

  10. Alan Wake 2 Subtitles Dataset

    • universe.roboflow.com
    zip
    Updated Nov 6, 2023
    Cite
    kopyl (2023). Alan Wake 2 Subtitles Dataset [Dataset]. https://universe.roboflow.com/kopyl/alan-wake-2-subtitles
    Explore at:
    Available download formats: zip
    Dataset updated
    Nov 6, 2023
    Dataset authored and provided by
    kopyl
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Subtitle Bounding Boxes
    Description

    Alan Wake 2 Subtitles

    ## Overview
    
    Alan Wake 2 Subtitles is a dataset for object detection tasks - it contains Subtitle annotations for 565 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
    ## License
    
    This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  11. SubIMDB: A Structured Corpus of Subtitles

    • live.european-language-grid.eu
    • zenodo.org
    • +1more
    txt
    Updated Nov 15, 2022
    Cite
    (2022). SubIMDB: A Structured Corpus of Subtitles [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7453
    Explore at:
    Available download formats: txt
    Dataset updated
    Nov 15, 2022
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Exploring language usage through frequency analysis in large corpora is a defining feature in most recent work in corpus and computational linguistics. From a psycholinguistic perspective, however, the corpora used in these contributions are often not representative of language usage: they are either domain-specific, limited in size, or extracted from unreliable sources. In an effort to address this limitation, we introduce SubIMDB, a corpus of everyday language spoken text we created which contains over 225 million words. The corpus was extracted from 38,102 subtitles of family, comedy and children movies and series, and is the first sizeable structured corpus of subtitles made available. Our experiments show that word frequency norms extracted from this corpus are more effective than those from well-known norms such as Kucera-Francis, HAL and SUBTLEXus in predicting various psycholinguistic properties of words, such as lexical decision times, familiarity, age of acquisition and simplicity. We also provide evidence that contradict the long-standing assumption that the ideal size for a corpus can be determined solely based on how well its word frequencies correlate with lexical decision times.

  12. IndicDialogue Dataset

    • data.mendeley.com
    Updated Jun 11, 2024
    Cite
    Noor Mairukh Khan Arnob (2024). IndicDialogue Dataset [Dataset]. http://doi.org/10.17632/wcb4bxbyxx.2
    Explore at:
    Dataset updated
    Jun 11, 2024
    Authors
    Noor Mairukh Khan Arnob
    License

    Attribution-NonCommercial 3.0 (CC BY-NC 3.0): https://creativecommons.org/licenses/by-nc/3.0/
    License information was derived automatically

    Description

    The IndicDialogue dataset contains raw subtitle SRT files and dialogues extracted from them. The subtitles are in 10 indic languages, namely Hindi, Bengali, Marathi, Telugu, Tamil, Urdu, Odia, Sindhi, Nepali and Assamese. This dataset provides a corpus for performing various NLP tasks in low-resource languages using SLMs(Small Language Models) and LLMs(Large Language Models).

  13. Subtitling and Captioning Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated Jan 16, 2026
    Cite
    Data Insights Market (2026). Subtitling and Captioning Report [Dataset]. https://www.datainsightsmarket.com/reports/subtitling-and-captioning-1393307
    Explore at:
    Available download formats: ppt, doc, pdf
    Dataset updated
    Jan 16, 2026
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2026 - 2034
    Area covered
    Global
    Variables measured
    Market Size
    Description

    Discover the booming subtitling and captioning market! This in-depth analysis reveals key trends, growth drivers, leading companies (SDI Media, IYUNO, Deluxe Media, ZOO Digital), and regional market shares from 2019-2033. Learn about the impact of AI and increasing demand for multilingual content.

  14. Video Subtitle Translation Service Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated Feb 2, 2026
    Cite
    Data Insights Market (2026). Video Subtitle Translation Service Report [Dataset]. https://www.datainsightsmarket.com/reports/video-subtitle-translation-service-538596
    Explore at:
    Available download formats: doc, pdf, ppt
    Dataset updated
    Feb 2, 2026
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2026 - 2034
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The global video subtitle translation services market is booming, driven by streaming, accessibility needs, and AI advancements. Learn about market size, growth trends, key players (Stepes, Ai-Media, 3Play Media), and future projections in this comprehensive analysis. Discover how this $2.5 billion market is set to reach $7.8 billion by 2033.

  15. Movie Subtitles

    • kaggle.com
    zip
    Updated Nov 26, 2024
    Cite
    Ahwar (2024). Movie Subtitles [Dataset]. https://www.kaggle.com/datasets/ahwardev/movie-subtitles
    Explore at:
    Available download formats: zip (133455 bytes)
    Dataset updated
    Nov 26, 2024
    Authors
    Ahwar
    Description

    This dataset includes subtitle files in the SRT (SubRip Subtitle) format for several popular movies, such as Oppenheimer and Tenet. SRT files are plain-text files widely used for subtitles, containing a series of structured entries to synchronize text with video content. Each entry in an SRT file comprises:

    1. A sequential index number to indicate the order of the subtitles.
    2. Timestamps that specify when a subtitle should appear and disappear, formatted as hours:minutes:seconds,milliseconds.
    3. The subtitle text, which is displayed during the designated time interval.

    For example:
    
    ```
    1
    00:00:01,500 --> 00:00:04,000
    This is a sample subtitle.
    
    2
    00:00:04,500 --> 00:00:07,000
    Here is another subtitle to demonstrate multiple entries.
    ```
    This straightforward format is highly compatible with media players and easy to edit. SRT files enhance accessibility by providing subtitles for different languages or accommodating viewers with hearing impairments, enriching the experience of enjoying these popular movies.
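    The cue structure described above (index, timestamps, text) is simple enough to parse directly. A minimal regex-based sketch, assuming well-formed SRT input; production code would also handle BOMs, CRLF line endings, and malformed cues:

```python
# Sketch: a minimal parser for SRT files like those in this dataset.
# Returns (index, start, end, text) tuples, one per subtitle cue.
import re

CUE = re.compile(
    r"(\d+)\s*\n"                                        # sequential index
    r"(\d{2}:\d{2}:\d{2},\d{3}) --> (\d{2}:\d{2}:\d{2},\d{3})\s*\n"  # timestamps
    r"(.*?)(?:\n\n|\Z)",                                 # cue text (may span lines)
    re.S,
)

def parse_srt(text):
    return [(int(i), start, end, body.strip())
            for i, start, end, body in CUE.findall(text)]
```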
    
  16. Reasons why adults use subtitles when watching TV in known language in the U.S. 2023

    • statista.com
    Updated Nov 27, 2025
    Cite
    Statista (2025). Reasons why adults use subtitles when watching TV in known language in the U.S. 2023 [Dataset]. https://www.statista.com/statistics/1459167/reasons-use-subtitles-watching-tv-known-language-us/
    Explore at:
    Dataset updated
    Nov 27, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    Jun 29, 2023 - Jul 5, 2023
    Area covered
    United States
    Description

    Enhancement of comprehension and more profound understanding of accents were the most common reasons why American adults use subtitles while watching TV in a known language, according to a survey conducted between June and July 2023. Another ** percent of the respondents stated that they did so because they were in a noisy environment.

  17. Subtitles-rag-questions-qwq-all-aphrodite

    • huggingface.co
    Cite
    Peanut Jar Mixers Development, Subtitles-rag-questions-qwq-all-aphrodite [Dataset]. https://huggingface.co/datasets/PJMixers-Dev/Subtitles-rag-questions-qwq-all-aphrodite
    Explore at:
    Dataset authored and provided by
    Peanut Jar Mixers Development
    Description

    PJMixers-Dev/Subtitles-rag-questions-qwq-all-aphrodite dataset hosted on Hugging Face and contributed by the HF Datasets community

  18. Subtitles Editor Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated Apr 25, 2025
    Cite
    Data Insights Market (2025). Subtitles Editor Report [Dataset]. https://www.datainsightsmarket.com/reports/subtitles-editor-512222
    Explore at:
    Available download formats: ppt, pdf, doc
    Dataset updated
    Apr 25, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2026 - 2034
    Area covered
    Global
    Variables measured
    Market Size
    Description

    Discover the booming subtitles editor market! This in-depth analysis reveals key trends, growth drivers, and leading companies shaping the future of video accessibility and localization. Explore market size, CAGR, and regional insights for 2025-2033.

  19. survivor-subtitles-cleaned

    • huggingface.co
    Cite
    Paul Lambert, survivor-subtitles-cleaned [Dataset]. https://huggingface.co/datasets/hipml/survivor-subtitles-cleaned
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets; learn more about it at mlcommons.org/croissant.
    Authors
    Paul Lambert
    Description

    Survivor Subtitles Dataset (cleaned)

      Dataset Description
    

    A collection of subtitles from the American reality television show "Survivor", spanning seasons 1 through 47. The dataset contains subtitle text extracted from episode broadcasts. This dataset is a modification of the original Survivor Subtitles dataset after cleaning up and joining subtitle fragments. This dataset is a work in progress and any contributions are welcome.

      Source
    

    The subtitles were… See the full description on the dataset page: https://huggingface.co/datasets/hipml/survivor-subtitles-cleaned.

  20. Focus Group for Block Chain for Subtitles

    • data.europa.eu
    • zenodo.org
    unknown
    Updated Jan 23, 2022
    Cite
    Zenodo (2022). Focus Group for Block Chain for Subtitles [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-7313421?locale=no
    Explore at:
    Available download formats: unknown (188462 bytes)
    Dataset updated
    Jan 23, 2022
    Dataset authored and provided by
    Zenodo (http://zenodo.org/)
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Transcript of the focus group
