HowTo100M-subtitles-small
The subtitles from a subset of the HowTo100M dataset.
License: Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
📙 Overview
The metadata for the Ego4d training set, with paired HowTo100M video clips. Each ego-exo pair is constructed by selecting clips that share nouns/verbs.
Each sample represents a short video clip, which consists of:
vid: the original video id.
start_second: the start timestamp of the narration.
end_second: the end timestamp of the narration.
text: the original narration.
noun: a list containing the indices of nouns in the Ego4d noun vocabulary.
verb: a list containing the… See the full description on the dataset page: https://huggingface.co/datasets/Jazzcharles/ego4d_train_pair_howto100m.
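A minimal loading sketch with the Hugging Face datasets library, assuming the metadata loads directly via load_dataset (field names follow the description above):

from datasets import load_dataset
ds = load_dataset("Jazzcharles/ego4d_train_pair_howto100m", split="train")
sample = ds[0]
print(sample["vid"], sample["start_second"], sample["end_second"])
print(sample["text"])                 # original narration
print(sample["noun"], sample["verb"])  # indices into the Ego4d noun/verb vocabularies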
License: Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
📙 Overview
The metadata for HowTo100M. The original ASR transcripts are refined by the LLAMA-3 language model.
Each sample represents a short video clip, which consists of:
vid: the original video id.
uid: a unique id assigned to index the clip.
start_second: the start timestamp of the narration.
end_second: the end timestamp of the narration (simply set to start_second + 1).
text: the original ASR transcript.
noun: a list containing the indices of nouns in the noun vocabulary.
verb: a list containing the… See the full description on the dataset page: https://huggingface.co/datasets/Jazzcharles/HowTo100M_llama3_refined_caption.
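Because the full HowTo100M metadata is large, a streaming load avoids downloading everything up front (again a sketch; field names follow the description above):

from datasets import load_dataset
ds = load_dataset("Jazzcharles/HowTo100M_llama3_refined_caption", split="train", streaming=True)
for sample in ds:
    # end_second is start_second + 1 by construction (see above)
    print(sample["uid"], sample["start_second"], sample["text"])
    break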
License: MIT License (https://opensource.org/licenses/MIT)
ACAV100M processes 140 million full-length videos (total duration 1,030 years) which are used to produce a dataset of 100 million 10-second clips (31 years) with high audio-visual correspondence. This is two orders of magnitude larger than the current largest video dataset used in the audio-visual learning literature, i.e., AudioSet (8 months), and twice as large as the largest video dataset in the literature, i.e., HowTo100M (15 years).
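As a quick sanity check on these durations (plain arithmetic, not taken from the source):

SECONDS_PER_YEAR = 365.25 * 24 * 3600
print(100_000_000 * 10 / SECONDS_PER_YEAR)    # ~31.7, matching the quoted 31 years of clips
print(1030 * SECONDS_PER_YEAR / 140_000_000)  # ~232 s average length per source video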
We evaluate our approach on HowTo100M Adverbs, which mined adverbs from 83 tasks in HowTo100M. Since the annotations were obtained from automatically transcribed narrations of instructional videos, they are noisy: ∼44% of the annotated action-adverb pairs are not visible in the video clip. The dataset contains 5,824 clips annotated with action-adverb pairs from 72 verbs and 6 adverbs. A clear limitation of this dataset is the small number of adverbs it contains; we therefore create three new adverb datasets from existing video retrieval datasets: VATEX Adverbs, MSR-VTT Adverbs, and ActivityNet Adverbs. These contain less noise and a greater variety of adverbs.
License: Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
DIBS Features
Pre-extracted CLIP and UniVL features of the YouCook2, ActivityNet and HowTo100M custom subset used in DIBS. To process the HowTo100M subset features, first combine all the split files and then extract them using the following commands:
cat howto_subset_features.tar.gz.part* > howto_subset_features.tar.gz
tar -xvzf howto_subset_features.tar.gz
File Structure
├── yc2
│   ├── clip_features
│   │   ├── video
│   │   … See the full description on the dataset page: https://huggingface.co/datasets/Exclibur/dibs-feature.
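A minimal sketch for reading one extracted feature file, assuming the features are stored as .npy arrays (the exact file names and format are defined by the repository, not confirmed here):

import numpy as np
# Hypothetical path following the File Structure tree above
features = np.load("yc2/clip_features/video/VIDEO_ID.npy")
print(features.shape)  # e.g. (num_frames, feature_dim)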
License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
This repository contains two datasets for instructional video analysis tasks:
1. DenseStep200K.json
Description
A large-scale dataset containing 222,000 detailed, temporally grounded instructional steps annotated across 10,000 high-quality instructional videos (totaling 732 hours). Constructed through a training-free automated pipeline leveraging multimodal foundation models (Qwen2.5-VL-72B and DeepSeek-R1-671B) to process noisy HowTo100M videos, achieving precise… See the full description on the dataset page: https://huggingface.co/datasets/gmj03/DenseStep200K.
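A sketch for downloading and inspecting DenseStep200K.json with huggingface_hub; the record schema is not documented in this summary, so the code only checks the size and the first entry:

import json
from huggingface_hub import hf_hub_download
path = hf_hub_download(repo_id="gmj03/DenseStep200K", filename="DenseStep200K.json", repo_type="dataset")
with open(path) as f:
    steps = json.load(f)
print(len(steps))  # expected on the order of 222,000 annotated steps
print(steps[0] if isinstance(steps, list) else next(iter(steps.items())))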
License: Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
RareAct is a video dataset of unusual actions, including actions like “blend phone”, “cut keyboard” and “microwave shoes”. It aims to evaluate the zero-shot and few-shot compositionality of action recognition models for unlikely compositions of common action verbs and object nouns. It contains 122 different actions, obtained by combining verbs and nouns that rarely co-occur in the large-scale textual corpus of HowTo100M but frequently appear separately.
License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
Dataset Description:
Each sample pairs a video with its text annotations. Our dataset comprises four categories:
EgoRe: QA pairs annotated on our egocentric videos, comprising three types of data (short, long, and chain-of-thought (CoT)), with video sources derived from Ego4D and HowTo100M.
General: A comprehensive collection of general-purpose image and video datasets, including K400, NextQA, SSV2, VideoChatGPT, and GPT-4o annotated QA data.
Ego-Related: Collection of publicly released… See the full description on the dataset page: https://huggingface.co/datasets/hyf015/EgoThinker-SFT-Dataset.
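A minimal way to inspect what the repository actually ships before downloading, assuming standard Hub access (file names are not listed in this summary):

from huggingface_hub import list_repo_files
files = list_repo_files("hyf015/EgoThinker-SFT-Dataset", repo_type="dataset")
print(files[:20])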