1 dataset found

h
open_subtitles
huggingface.co
marketplace.sshopencloud.eu
Updated May 13, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Language Technology Research Group at the University of Helsinki (2024). open_subtitles [Dataset]. https://huggingface.co/datasets/Helsinki-NLP/open_subtitles
Explore at:
Dataset updated
May 13, 2024
Dataset authored and provided by
Language Technology Research Group at the University of Helsinki
License
https://choosealicense.com/licenses/unknown/https://choosealicense.com/licenses/unknown/
Description
This is a new collection of translated movie subtitles from http://www.opensubtitles.org/.

IMPORTANT: If you use the OpenSubtitle corpus: Please, add a link to http://www.opensubtitles.org/ to your website and to your reports and publications produced with the data!

This is a slightly cleaner version of the subtitle collection using improved sentence alignment and better language checking.

62 languages, 1,782 bitexts total number of files: 3,735,070 total number of tokens: 22.10G total number of sentence fragments: 3.35G
Not seeing a result you expected?
Learn how you can add new datasets to our index.

Facebook

Twitter

Click to copy link

Link copied

Cite

Language Technology Research Group at the University of Helsinki (2024). open_subtitles [Dataset]. https://huggingface.co/datasets/Helsinki-NLP/open_subtitles

open_subtitles

OpenSubtitles

Helsinki-NLP/open_subtitles

Explore at:

5 scholarly articles cite this dataset (View in Google Scholar)

Dataset updated

May 13, 2024

Dataset authored and provided by

Language Technology Research Group at the University of Helsinki

License

https://choosealicense.com/licenses/unknown/https://choosealicense.com/licenses/unknown/

Description

This is a new collection of translated movie subtitles from http://www.opensubtitles.org/.

IMPORTANT: If you use the OpenSubtitle corpus: Please, add a link to http://www.opensubtitles.org/ to your website and to your reports and publications produced with the data!

This is a slightly cleaner version of the subtitle collection using improved sentence alignment and better language checking.

62 languages, 1,782 bitexts total number of files: 3,735,070 total number of tokens: 22.10G total number of sentence fragments: 3.35G

Clear search

Close search

Google apps

Main menu