1 dataset found
  1. h

    open_subtitles

    • huggingface.co
    • marketplace.sshopencloud.eu
    Updated Dec 10, 2020
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Language Technology Research Group at the University of Helsinki (2020). open_subtitles [Dataset]. https://huggingface.co/datasets/Helsinki-NLP/open_subtitles
    Explore at:
    Dataset updated
    Dec 10, 2020
    Dataset authored and provided by
    Language Technology Research Group at the University of Helsinki
    License

    https://choosealicense.com/licenses/unknown/https://choosealicense.com/licenses/unknown/

    Description

    This is a new collection of translated movie subtitles from http://www.opensubtitles.org/.

    IMPORTANT: If you use the OpenSubtitle corpus: Please, add a link to http://www.opensubtitles.org/ to your website and to your reports and publications produced with the data!

    This is a slightly cleaner version of the subtitle collection using improved sentence alignment and better language checking.

    62 languages, 1,782 bitexts total number of files: 3,735,070 total number of tokens: 22.10G total number of sentence fragments: 3.35G

  2. Not seeing a result you expected?
    Learn how you can add new datasets to our index.

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
Language Technology Research Group at the University of Helsinki (2020). open_subtitles [Dataset]. https://huggingface.co/datasets/Helsinki-NLP/open_subtitles

open_subtitles

OpenSubtitles

Helsinki-NLP/open_subtitles

Explore at:
4 scholarly articles cite this dataset (View in Google Scholar)
Dataset updated
Dec 10, 2020
Dataset authored and provided by
Language Technology Research Group at the University of Helsinki
License

https://choosealicense.com/licenses/unknown/https://choosealicense.com/licenses/unknown/

Description

This is a new collection of translated movie subtitles from http://www.opensubtitles.org/.

IMPORTANT: If you use the OpenSubtitle corpus: Please, add a link to http://www.opensubtitles.org/ to your website and to your reports and publications produced with the data!

This is a slightly cleaner version of the subtitle collection using improved sentence alignment and better language checking.

62 languages, 1,782 bitexts total number of files: 3,735,070 total number of tokens: 22.10G total number of sentence fragments: 3.35G

Search
Clear search
Close search
Google apps
Main menu