HowTo100M is a large-scale dataset of narrated videos with an emphasis on instructional videos where content creators teach complex tasks with an explicit intention of explaining the visual content on screen. HowTo100M features a total of:
- 136M video clips with captions sourced from 1.2M YouTube videos (15 years of video)
- 23k activities from domains such as cooking, hand crafting, personal care, gardening or fitness
Each video is associated with a narration available as subtitles automatically downloaded from YouTube.
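A minimal sketch of reading those narrations once the official metadata and caption files have been downloaded from the HowTo100M project page; the file names (HowTo100M_v1.csv, raw_caption.json) and the column/record layout are assumptions based on the standard release and may differ from your copy.

import json
import pandas as pd

# Assumed file names from the official HowTo100M release; adjust to your download.
videos = pd.read_csv("HowTo100M_v1.csv")        # per-video metadata (video id, task/category fields)
with open("raw_caption.json") as f:
    captions = json.load(f)                     # assumed layout: video_id -> {"start": [...], "end": [...], "text": [...]}

video_id = videos.iloc[0]["video_id"]           # the "video_id" column name is an assumption
subs = captions.get(video_id, {})
for start, end, text in zip(subs.get("start", []), subs.get("end", []), subs.get("text", [])):
    print(f"[{start:.1f}s - {end:.1f}s] {text}")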
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for youtube_subs_howto100M
Dataset Summary
The youtube_subs_howto100M dataset is an English-language dataset of instruction-response pairs extracted from 309,136 YouTube videos. The dataset was originally inspired by and sourced from the HowTo100M dataset, which was developed for natural language search for video clips.
Supported Tasks and Leaderboards
conversational: The dataset can be used to train a model for instruction (request) and a long-form… See the full description on the dataset page: https://huggingface.co/datasets/totuta/youtube_subs_howto100M.
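A minimal sketch for loading the pairs with the Hugging Face datasets library; only the repository id from the URL above is taken from the source, so the available splits and column names should be checked on the dataset page.

import datasets

# Load the instruction-response pairs from the Hugging Face Hub.
ds = datasets.load_dataset("totuta/youtube_subs_howto100M")
print(ds)        # shows available splits, features, and row counts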
HowTo100M-subtitles-small
The subtitles from a subset of the HowTo100M dataset.
Procedural videos show step-by-step demonstrations of tasks like recipe preparation. Understanding such videos is challenging, involving the precise localization of steps and the generation of textual instructions. Manually annotating steps and writing instructions is costly, which limits the size of current datasets and hinders effective learning. Leveraging large but noisy video-transcript datasets for pre-training can boost performance, but demands significant computational resources. Furthermore, transcripts contain irrelevant content and exhibit style variation compared to instructions written by human annotators. To mitigate both issues, we propose a technique, Sieve-&-Swap, to automatically curate a smaller dataset: (i) Sieve filters irrelevant transcripts and (ii) Swap enhances the quality of the text instructions by automatically replacing the transcripts with human-written instructions from a text-only recipe dataset. The curated dataset, three orders of magnitude smaller than current web-scale datasets, enables efficient training of large-scale models with competitive performance.
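A rough, illustrative sketch of the two curation steps described above using off-the-shelf sentence embeddings; the embedding model, threshold, and retrieval scheme are assumptions for exposition, not the authors' implementation.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")   # illustrative embedding model, not the paper's choice

def sieve_and_swap(transcript_sentences, recipe_steps, keep_threshold=0.4):
    """Toy Sieve-&-Swap: drop transcript sentences with no close recipe step (Sieve),
    and replace the survivors with the closest human-written instruction (Swap)."""
    t_emb = model.encode(transcript_sentences, convert_to_tensor=True)
    r_emb = model.encode(recipe_steps, convert_to_tensor=True)
    sims = util.cos_sim(t_emb, r_emb)             # (num_sentences, num_steps) cosine similarities
    curated = []
    for i in range(len(transcript_sentences)):
        best = int(sims[i].argmax())
        if float(sims[i][best]) >= keep_threshold:  # Sieve: filter irrelevant transcript sentences
            curated.append(recipe_steps[best])      # Swap: keep the human-written instruction instead
    return curated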
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
ACAV100M processes 140 million full-length videos (total duration 1,030 years) which are used to produce a dataset of 100 million 10-second clips (31 years) with high audio-visual correspondence. This is two orders of magnitude larger than the current largest video dataset used in the audio-visual learning literature, i.e., AudioSet (8 months), and twice as large as the largest video dataset in the literature, i.e., HowTo100M (15 years).
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Tigerbot SFT data built from open-source data: the YouTube "how to" (howto) series. Original source: https://www.di.ens.fr/willow/research/howto100m/
Usage
import datasets

# Load the 50k English YouTube how-to SFT examples from the Hugging Face Hub.
ds_sft = datasets.load_dataset('TigerResearch/tigerbot-youtube-howto-en-50k')
Adverbs in Recipes (AIR) is a dataset specifically collected for adverb recognition. AIR is a subset of HowTo100M where recipe videos show actions performed in ways that change according to an adverb (e.g. chop thinly/coarsely). AIR was carefully reviewed to ensure reliable annotations.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Howto-Interlink7M
📙 Overview
Howto-Interlink7M presents a unique interleaved video-text dataset, carefully derived from the raw video content of Howto100M.
In the creation of this dataset, we turn each long video into a vision-text interleaved document using BLIP2 (image captioner), GRIT (image detector), and Whisper (ASR), similar to VLog. We then employed GPT-4 to produce an extensive 7 million high-quality pretraining documents. During this process, we meticulously filtered out clips… See the full description on the dataset page: https://huggingface.co/datasets/Awiny/Howto-Interlink7M.
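A rough, hypothetical outline of how such an interleaved document could be assembled; the helper callables below stand in for the captioning, detection, and ASR models named above and are not the authors' code.

# Hypothetical sketch: assemble an interleaved vision-text document for one long video.
# extract_clips, caption_clip, detect_objects, and transcribe_speech are placeholder
# callables standing in for the actual BLIP2 / GRIT / Whisper models.
def build_interleaved_document(video_path, extract_clips, caption_clip, detect_objects, transcribe_speech):
    document = []
    for clip in extract_clips(video_path):             # split the long video into short clips
        document.append({
            "clip": clip,
            "caption": caption_clip(clip),             # image captioner output
            "objects": detect_objects(clip),           # object detector output
            "speech": transcribe_speech(clip),         # ASR transcript
        })
    return document                                    # later filtered and rewritten with GPT-4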
No license specified: https://academictorrents.com/nolicensespecified
A BitTorrent file to download data with the title 'youcook2_features_howto100m'
RareAct is a video dataset of unusual actions, including actions like “blend phone”, “cut keyboard” and “microwave shoes”. It aims at evaluating the zero-shot and few-shot compositionality of action recognition models for unlikely compositions of common action verbs and object nouns. It contains 122 different actions which were obtained by combining verbs and nouns rarely co-occurring together in the large-scale textual corpus from HowTo100M, but that frequently appear separately.
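A toy illustration of the selection idea described above: keep verb-noun pairs whose components are individually frequent in a transcript corpus but rarely co-occur. The counting scheme and thresholds are illustrative assumptions, not the authors' exact procedure.

from collections import Counter
from itertools import product

def rare_verb_noun_pairs(sentences, verbs, nouns, min_unigram=100, max_cooccur=5):
    """Illustrative selection of unlikely verb-noun compositions from a text corpus."""
    verb_counts, noun_counts, pair_counts = Counter(), Counter(), Counter()
    for sentence in sentences:
        tokens = set(sentence.lower().split())
        present_verbs = [v for v in verbs if v in tokens]
        present_nouns = [n for n in nouns if n in tokens]
        verb_counts.update(present_verbs)
        noun_counts.update(present_nouns)
        pair_counts.update(product(present_verbs, present_nouns))
    return [
        (v, n) for v, n in product(verbs, nouns)
        if verb_counts[v] >= min_unigram
        and noun_counts[n] >= min_unigram
        and pair_counts[(v, n)] <= max_cooccur
    ]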
Amazon Mechanical Turk (AMT) is used to collect annotations on HowTo100M videos. 30k 60-second clips are randomly sampled from 9,421 videos, and each clip is presented to turkers, who are asked to select a video segment containing a single, self-contained scene. After this segment-selection step, another group of workers is asked to write a description for each selected segment. Narrations are not provided to the workers, to ensure that their written queries are based on visual content only. The final video segments are 10-20 seconds long on average, and query length ranges from 8 to 20 words. From this process, 51,390 queries are collected for 24k 60-second clips from 9,371 videos in HowTo100M, on average 2-3 queries per clip. The video clips and their associated queries are split into 80% train, 10% val and 10% test.
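A minimal sketch of the final 80/10/10 split mentioned above; splitting at the clip level with a fixed seed is an assumption, since the exact procedure is not described.

import random

def split_80_10_10(clip_ids, seed=0):
    """Illustrative 80/10/10 train/val/test split over clip ids."""
    rng = random.Random(seed)
    ids = list(clip_ids)
    rng.shuffle(ids)
    n_train = int(0.8 * len(ids))
    n_val = int(0.1 * len(ids))
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]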
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains two datasets for instructional video analysis tasks:
1. DenseStep200K.json
Description
A large-scale dataset containing 222,000 detailed, temporally grounded instructional steps annotated across 10,000 high-quality instructional videos (totaling 732 hours). It was constructed through a training-free automated pipeline that leverages multimodal foundation models (Qwen2.5-VL-72B and DeepSeek-R1-671B) to process noisy HowTo100M videos, achieving precise… See the full description on the dataset page: https://huggingface.co/datasets/gmj03/DenseStep200K.
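A minimal sketch for inspecting the annotations after downloading DenseStep200K.json from the dataset page; only the file name from the listing above is assumed, and the record fields are discovered by printing one entry rather than guessed.

import json

with open("DenseStep200K.json") as f:
    annotations = json.load(f)

print(type(annotations), len(annotations))   # overall structure and number of entries
# Peek at one record to see the available fields (step text, timestamps, video id, ...)
sample = annotations[0] if isinstance(annotations, list) else next(iter(annotations.values()))
print(sample)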