2 datasets found

YouTube-Commons
huggingface.co
Updated Apr 17, 2024
Cite
PleIAs (2024). YouTube-Commons [Dataset]. https://huggingface.co/datasets/PleIAs/YouTube-Commons
Dataset authored and provided by
PleIAs
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
YouTube
Description
📺 YouTube-Commons 📺

YouTube-Commons is a collection of audio transcripts of 2,063,066 videos shared on YouTube under a CC-BY license.

Content

The collection comprises 22,709,724 original and automatically translated transcripts from 3,156,703 videos (721,136 individual channels). In total, this represents nearly 45 billion words (44,811,518,375). All the videos were shared on YouTube with a CC-BY license: the dataset provides all the necessary provenance information… See the full description on the dataset page: https://huggingface.co/datasets/PleIAs/YouTube-Commons.
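
To give a rough sense of scale, the totals quoted in the description can be turned into simple averages. This is only a sketch built on the stated counts: the constant and function names are illustrative, the 3,156,703 video figure from the Content paragraph is assumed as the denominator, and because the transcript count mixes originals with automatic translations, the per-transcript average says nothing about either group on its own.

```python
# Totals copied from the dataset description; names are illustrative only.
TRANSCRIPTS = 22_709_724   # original + automatically translated transcripts
VIDEOS = 3_156_703         # video count from the Content paragraph (assumed denominator)
CHANNELS = 721_136         # individual channels
WORDS = 44_811_518_375     # total words


def summarize() -> dict[str, float]:
    """Return simple averages derived from the stated totals."""
    return {
        "words_per_transcript": WORDS / TRANSCRIPTS,
        "transcripts_per_video": TRANSCRIPTS / VIDEOS,
        "videos_per_channel": VIDEOS / CHANNELS,
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value:,.2f}")
```

These are plain means over the whole collection; the actual distribution across languages and channels may be very uneven.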
YouTube-Commons
huggingface.co
Updated Sep 20, 2025
Cite
Gautam Jha (2025). YouTube-Commons [Dataset]. https://huggingface.co/datasets/gautijha37/YouTube-Commons
Authors
Gautam Jha
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
YouTube
Description
📺 YouTube-Commons 📺

YouTube-Commons is a collection of audio transcripts of 2,063,066 videos shared on YouTube under a CC-BY license.

Content

The collection comprises 22,709,724 original and automatically translated transcripts from 3,156,703 videos (721,136 individual channels). In total, this represents nearly 45 billion words (44,811,518,375). All the videos were shared on YouTube with a CC-BY license: the dataset provides all the necessary provenance information… See the full description on the dataset page: https://huggingface.co/datasets/gautijha37/YouTube-Commons.
YouTube-Commons

PleIAs/YouTube-Commons

Youtube Commons Corpus


28 scholarly articles cite the PleIAs/YouTube-Commons dataset, according to Google Scholar.
