YouCook2 is the largest task-oriented instructional video dataset in the vision community. It contains 2,000 long, untrimmed videos covering 89 cooking recipes, with an average of 22 videos per recipe. The procedure steps in each video are annotated with temporal boundaries and described by imperative English sentences (see the example below). The videos were downloaded from YouTube and are all shot from a third-person viewpoint. They are unconstrained: they are performed by individuals in their own homes with unfixed cameras. YouCook2 covers a rich set of recipe types and cooking styles from around the world.
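As an illustrative sketch of such a procedure-step annotation (the video id, recipe type, times, and sentence below are invented, and the field layout is an assumption modeled on the commonly distributed YouCook2 annotation JSON), one annotated entry might look like:

```python
# Hedged, illustrative example of one annotated procedure step.
# All values are made up; the field layout is an assumption based on the
# commonly distributed YouCook2 annotation JSON, not an official spec.
example_annotation = {
    "video_id": "GLd3aX16zBg",        # hypothetical YouTube video id
    "recipe_type": "113",             # hypothetical recipe-type code
    "annotations": [
        {
            "id": 0,
            "segment": [14.0, 41.0],  # temporal boundary in seconds
            "sentence": "spread butter on two slices of white bread",
        },
    ],
}
```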
morpheushoc/youcook2 dataset hosted on Hugging Face and contributed by the HF Datasets community
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
merve/YouCook2 dataset hosted on Hugging Face and contributed by the HF Datasets community
Due to requests and the inaccessibility of some of the original online videos, we are sharing the raw video files. By downloading these files, you agree to use them for non-commercial, research purposes only.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Overview
YouCook2 video features extracted by InternVideo_MM_L14 at 8 fps, used for evaluating the video-text retrieval ability of EgoInstructor. Each file (e.g. 10dZTHlkb8w.pth.tar) stores a T×D feature matrix, where T is the temporal length of the video and D = 768 is the feature dimension.
How-To-Use
Please refer to the EgoInstructor code for details.
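As a minimal usage sketch (assuming each per-video .pth.tar file was written with torch.save and deserializes into a single T×D tensor), loading one feature file might look like:

```python
import torch

# Minimal sketch, assuming each per-video file was written with torch.save and
# deserializes into a single T x D tensor (D = 768, frames sampled at 8 fps).
# The filename is the example given above; the loading approach is an assumption.
features = torch.load("10dZTHlkb8w.pth.tar", map_location="cpu")
print(features.shape)  # expected: torch.Size([T, 768])
```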
Citation
@article{xu2024retrieval, title={Retrieval-augmented egocentric video captioning}, author={Xu, Jilan and Huang… See the full description on the dataset page: https://huggingface.co/datasets/Jazzcharles/youcook2_internvideo_MM_L14_features_fps8.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset was created by hello518123
Released under Apache 2.0
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for escher-kitchen-action
MPII and YouCook2 dataset
Dataset Structure
Data Instances
Each instance contains:
- source_image: The original image
- edited_image: The edited version of the image
- edit_instruction: The instruction used to edit the image
- source_image_caption: Caption for the source image
- target_image_caption: Caption for the edited image
- Additional metadata fields
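As a hedged illustration of accessing these fields (the repository id below is a placeholder, and only the field names come from this card; the exact feature types are not specified), an instance could be inspected like so:

```python
from datasets import load_dataset

# Hypothetical sketch: the repository id is a placeholder, and only the field
# names come from the card above; actual feature types are assumptions.
ds = load_dataset("your-org/escher-kitchen-action", split="train")  # repo id assumed

example = ds[0]
print(example["edit_instruction"])       # instruction used to edit the image
print(example["source_image_caption"])   # caption for the source image
print(example["target_image_caption"])   # caption for the edited image
```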
Data Splits
VALUE is a Video-And-Language Understanding Evaluation benchmark for testing models that generalize across diverse tasks, domains, and datasets. It assembles 11 VidL (video-and-language) datasets over 3 popular tasks: (i) text-to-video retrieval; (ii) video question answering; and (iii) video captioning. The VALUE benchmark aims to cover a broad range of video genres, video lengths, data volumes, and task difficulty levels. Rather than focusing on single-channel videos with visual information only, VALUE promotes models that leverage information from both video frames and their associated subtitles, as well as models that share knowledge across multiple tasks.
The datasets used for the VALUE benchmark are: TVQA, TVR, TVC, How2R, How2QA, VIOLIN, VLEP, YouCook2 (YC2C, YC2R), VATEX