HowTo100M is a large-scale dataset of narrated videos with an emphasis on instructional videos, where content creators teach complex tasks with the explicit intention of explaining the visual content on screen. HowTo100M features a total of:
- 136M video clips with captions sourced from 1.2M YouTube videos (15 years of video)
- 23k activities from domains such as cooking, hand crafting, personal care, gardening, and fitness
Each video is associated with a narration available as subtitles automatically downloaded from YouTube.
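As a rough illustration, here is a minimal sketch of reading such caption data, assuming a JSON file that maps each video id to parallel lists of segment start times, end times, and narration text (the file name and schema are assumptions, not the official release format):

import json

# Sketch only: the file name "caption.json" and its schema are assumptions.
with open("caption.json") as f:
    captions = json.load(f)

# Print the narrated segments of one video.
vid, segs = next(iter(captions.items()))
for start, end, text in zip(segs["start"], segs["end"], segs["text"]):
    print(f"{vid} [{start:.1f}-{end:.1f} s]: {text}")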
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Dataset Card for youtube_subs_howto100M
Dataset Summary
The youtube_subs_howto100M dataset is an English-language dataset of instruction-response pairs extracted from 309,136 YouTube videos. The dataset was originally inspired by and sourced from the HowTo100M dataset, which was developed for natural language search for video clips.
Supported Tasks and Leaderboards
conversational: The dataset can be used to train a model for instruction (request) and long-form… See the full description on the dataset page: https://huggingface.co/datasets/totuta/youtube_subs_howto100M.
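A minimal sketch of loading the dataset with the Hugging Face datasets library; the split name and column layout are assumptions, so consult the dataset page for the actual schema:

from datasets import load_dataset

# Sketch only: the split name is an assumption.
ds = load_dataset("totuta/youtube_subs_howto100M", split="train")
print(ds[0])  # one instruction-response pair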
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
📙 Overview
Metadata for the Ego4D training set, with paired HowTo100M video clips. Ego-exo pairs are constructed by choosing clips that share nouns/verbs.
Each sample represents a short video clip and consists of:
- vid: the initial video id
- start_second: the start timestamp of the narration
- end_second: the end timestamp of the narration
- text: the original narration
- noun: a list containing the index of nouns in the Ego4d noun vocabulary
- verb: a list containing the… See the full description on the dataset page: https://huggingface.co/datasets/Jazzcharles/ego4d_train_pair_howto100m.
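A minimal sketch of inspecting one sample, using the field names listed above (the split name is an assumption):

from datasets import load_dataset

# Sketch only: split name is assumed to be "train".
ds = load_dataset("Jazzcharles/ego4d_train_pair_howto100m", split="train")
sample = ds[0]
print(sample["vid"], sample["start_second"], sample["end_second"])
print(sample["text"], sample["noun"], sample["verb"])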
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
📙 Overview
Metadata for HowTo100M. The original ASR transcripts are refined with the LLaMA-3 language model.
Each sample represents a short video clip and consists of:
- vid: the initial video id
- uid: a unique id assigned to index the clip
- start_second: the start timestamp of the narration
- end_second: the end timestamp of the narration (simply set to start + 1)
- text: the original ASR transcript
- noun: a list containing the index of nouns in the noun vocabulary
- verb: a list containing the… See the full description on the dataset page: https://huggingface.co/datasets/Jazzcharles/HowTo100M_llama3_refined_caption.
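A minimal sketch of indexing the refined captions by clip uid, using the field names listed above; the split name and the small slice are assumptions for illustration:

from datasets import load_dataset

ds = load_dataset("Jazzcharles/HowTo100M_llama3_refined_caption", split="train")
subset = ds.select(range(1000))  # small slice for illustration only
by_uid = {row["uid"]: row for row in subset}
clip = next(iter(by_uid.values()))
print(clip["vid"], clip["start_second"], clip["end_second"], clip["text"])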
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
TigerBot SFT data built from open-source sources: the YouTube how-to (HowTo) series. Original source: https://www.di.ens.fr/willow/research/howto100m/
Usage
import datasets
ds_sft = datasets.load_dataset('TigerResearch/tigerbot-youtube-howto-en-50k')
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Amazon Mechanical Turk (AMT) is used to collect annotations on HowTo100M videos. 30k 60-second clips are randomly sampled from 9,421 videos, and each clip is presented to turkers who are asked to select a video segment containing a single, self-contained scene. After this segment-selection step, another group of workers is asked to write descriptions for each displayed segment. Narrations are not provided to the workers, to ensure that their written queries are based on visual content only. The final video segments are 10-20 seconds long on average, and query length ranges from 8 to 20 words. From this process, 51,390 queries are collected for 24k 60-second clips from 9,371 videos in HowTo100M, on average 2-3 queries per clip. The video clips and their associated queries are split into 80% train, 10% val, and 10% test.
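For illustration, a sketch of an 80/10/10 split performed at the video level so that all queries from one video land in a single split; the grouping rule is an assumption, since the card only states the split ratios:

import random

def split_videos(video_ids, seed=0):
    # Shuffle the unique video ids deterministically, then cut 80/10/10.
    ids = sorted(set(video_ids))
    random.Random(seed).shuffle(ids)
    n = len(ids)
    train = set(ids[: int(0.8 * n)])
    val = set(ids[int(0.8 * n) : int(0.9 * n)])
    test = set(ids[int(0.9 * n) :])
    return train, val, test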
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
ACAV100M processes 140 million full-length videos (total duration 1,030 years), which are used to produce a dataset of 100 million 10-second clips (31 years) with high audio-visual correspondence. This is two orders of magnitude larger than the current largest video dataset used in the audio-visual learning literature, i.e., AudioSet (8 months), and twice as large as the largest video dataset in the literature, i.e., HowTo100M (15 years).
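A quick arithmetic check of the reported clip duration:

# 100 million clips of 10 seconds each.
total_seconds = 100_000_000 * 10
years = total_seconds / (365.25 * 24 * 3600)
print(f"{years:.1f} years")  # ~31.7 years, consistent with the quoted 31 years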
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
DIBS Features
Pre-extracted CLIP and UniVL features for YouCook2, ActivityNet, and the custom HowTo100M subset used in DIBS. To process the HowTo100M subset features, first combine all the split files and then extract them with the following commands:
cat howto_subset_features.tar.gz.part* > howto_subset_features.tar.gz
tar -xvzf howto_subset_features.tar.gz
File Structure
├── yc2
│   ├── clip_features
│   │   ├── video
│   │   …
See the full description on the dataset page: https://huggingface.co/datasets/Exclibur/dibs-feature.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This repository contains two datasets for instructional video analysis tasks:
1. DenseStep200K.json
Description
A large-scale dataset containing 222,000 detailed, temporally grounded instructional steps annotated across 10,000 high-quality instructional videos (totaling 732 hours). It is constructed through a training-free automated pipeline that leverages multimodal foundation models (Qwen2.5-VL-72B and DeepSeek-R1-671B) to process noisy HowTo100M videos, achieving precise… See the full description on the dataset page: https://huggingface.co/datasets/gmj03/DenseStep200K.
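A minimal sketch of opening the annotation file; the top-level structure and any per-step fields are assumptions, so consult the dataset page for the actual schema:

import json

# Sketch only: the structure of DenseStep200K.json is not documented here.
with open("DenseStep200K.json") as f:
    data = json.load(f)
print(type(data), len(data))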
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
RareAct is a video dataset of unusual actions, including actions like “blend phone”, “cut keyboard”, and “microwave shoes”. It aims to evaluate the zero-shot and few-shot compositionality of action recognition models for unlikely compositions of common action verbs and object nouns. It contains 122 different actions, obtained by combining verbs and nouns that rarely co-occur in the large-scale textual corpus of HowTo100M but frequently appear separately.
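As a sketch of the construction idea described above (not the authors' actual pipeline), one can count verb-noun co-occurrences in a caption corpus and keep pairs whose words are individually frequent but rarely appear together; the thresholds below are placeholder assumptions:

from collections import Counter
from itertools import product

def rare_pairs(captions, verbs, nouns, max_cooccur=2, min_freq=100):
    # Individual word frequencies across the corpus.
    word_freq = Counter(w for c in captions for w in c.split())
    # Co-occurrence counts of each candidate verb-noun pair within a caption.
    pair_freq = Counter()
    for c in captions:
        tokens = set(c.split())
        for v, n in product(verbs, nouns):
            if v in tokens and n in tokens:
                pair_freq[(v, n)] += 1
    return [(v, n) for v, n in product(verbs, nouns)
            if word_freq[v] >= min_freq and word_freq[n] >= min_freq
            and pair_freq[(v, n)] <= max_cooccur]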