MSRVTT contains 10K video clips and 200K captions. We adopt the standard 1K-A split protocol, which was introduced in JSFusion and has since become the de facto benchmark split for text-video retrieval.
Train:
- train_7k: 7,010 videos, 140,200 captions
- train_9k: 9,000 videos, 180,000 captions
Test:
- test_1k: 1,000 videos, 1,000 captions
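A minimal loading sketch with the Hugging Face `datasets` library is shown below; the split names mirror the list above, but the exact split/config names exposed by this repository are an assumption and should be checked on the dataset card.

```python
# Sketch: load the MSR-VTT retrieval splits from the Hugging Face Hub.
# Assumption: the repository exposes splits named like the ones listed above
# (e.g. "train_9k" and "test_1k"); verify the names on the dataset card.
from datasets import load_dataset

train = load_dataset("friedrichor/MSR-VTT", split="train_9k")  # 9,000 videos / 180,000 captions
test = load_dataset("friedrichor/MSR-VTT", split="test_1k")    # 1,000 videos / 1,000 captions

print(train)    # inspect the features (video id, caption, ...)
print(test[0])  # one video-caption pair from the 1K-A test split
```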
🌟 Citation
@inproceedings{xu2016msrvtt,
  title={Msr-vtt: A large video description dataset for bridging video and language},
  author={Xu, Jun and Mei, Tao and Yao, Ting and Rui, Yong},
  booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2016}
}
See the full description on the dataset page: https://huggingface.co/datasets/friedrichor/MSR-VTT.
MIT License (https://opensource.org/licenses/MIT)
iejMac/CLIP-MSR-VTT dataset hosted on Hugging Face and contributed by the HF Datasets community
MSR-VTT (Microsoft Research Video to Text) is a large-scale dataset for open-domain video captioning. It consists of 10,000 video clips from 20 categories, and each clip is annotated with 20 English sentences by Amazon Mechanical Turk workers, giving about 29,000 unique words across all captions. The standard split uses 6,513 clips for training, 497 clips for validation, and 2,990 clips for testing.
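As a quick sanity check on these numbers, the sketch below computes the caption counts implied by the standard split, assuming all 20 captions per clip are kept in every split.

```python
# Sketch: caption counts implied by the standard MSR-VTT split
# (each of the 10,000 clips carries 20 crowd-sourced captions).
CAPTIONS_PER_CLIP = 20
splits = {"train": 6513, "val": 497, "test": 2990}

for name, clips in splits.items():
    print(f"{name}: {clips} clips -> {clips * CAPTIONS_PER_CLIP} captions")

print("total clips:", sum(splits.values()))                          # 10000
print("total captions:", sum(splits.values()) * CAPTIONS_PER_CLIP)   # 200000
```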
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
MSRVTT-CTN Dataset
This dataset contains Causal-Temporal Narrative (CTN) annotations for the MSRVTT-CTN benchmark in JSON format, with one file each for the train, test, and validation splits. For project details, visit https://narrativebridge.github.io/.
Dataset Structure
Each JSON file contains a dictionary where the keys are the video IDs and the values are the corresponding CTN captions. Each CTN caption is itself a dictionary with two keys: "Cause"… See the full description on the dataset page: https://huggingface.co/datasets/narrativebridge/MSRVTT-CTN.
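A minimal sketch of inspecting one of these files is given below; the local filename "train.json" is an assumption and should be checked against the actual filenames in the repository.

```python
# Sketch: inspect the MSRVTT-CTN annotations for one split.
# Assumption: the train split has been downloaded locally as "train.json";
# check the dataset repository for the actual filenames.
import json

with open("train.json", "r", encoding="utf-8") as f:
    annotations = json.load(f)  # dict: video ID -> CTN caption dictionary

for video_id, ctn in list(annotations.items())[:3]:
    print(video_id)
    for key, text in ctn.items():  # e.g. the "Cause" field described above
        print(f"  {key}: {text}")
```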
aircrypto/msr-vtt-clipped-large-embedded-test dataset hosted on Hugging Face and contributed by the HF Datasets community
This dataset was created by khoa Doãn
This dataset was created by ody kon
This dataset was created by Vishnutheep B
morpheushoc/msrvtt dataset hosted on Hugging Face and contributed by the HF Datasets community
Tevatron/msrvtt dataset hosted on Hugging Face and contributed by the HF Datasets community
CharmingDog/msrvtt dataset hosted on Hugging Face and contributed by the HF Datasets community
MIT License (https://opensource.org/licenses/MIT)
MSRVTT-Personalization
Follow the instructions to obtain the MSRVTT-Personalization data.
LICENSE
See the license of MSRVTT-Personalization.
morpheushoc/msrvtt-qa dataset hosted on Hugging Face and contributed by the HF Datasets community
MIT License (https://opensource.org/licenses/MIT)
Multi-source Video Captioning (MSVC) Dataset Card
Dataset details
Dataset type: MSVC is a collection of video captioning data constructed to ensure a robust and thorough evaluation of Video-LLMs' video-captioning capabilities. Dataset detail: MSVC is introduced to address limitations in existing video captioning benchmarks; it samples a total of 1,500 videos with human-annotated captions from MSVD, MSRVTT, and VATEX, ensuring diverse scenarios and domains. See the full description on the dataset page: https://huggingface.co/datasets/DAMO-NLP-SG/Multi-Source-Video-Captioning.
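A minimal loading sketch for this benchmark is shown below; the split name "test" is an assumption and should be verified on the dataset card.

```python
# Sketch: load the MSVC benchmark and inspect its size and schema.
# Assumption: the data is exposed as a "test" split; verify on the dataset card.
from datasets import load_dataset

msvc = load_dataset("DAMO-NLP-SG/Multi-Source-Video-Captioning", split="test")
print(len(msvc))      # expected to be on the order of the 1,500 sampled videos
print(msvc.features)  # video / caption fields as defined by the card
```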
https://antoyang.github.io/just-ask.html#ivqa
Recent methods for visual question answering rely on large-scale annotated datasets. Manual annotation of questions and answers for videos, however, is tedious, expensive and prevents scalability. In this work, we propose to avoid manual annotation and generate a large-scale training dataset for video question answering by making use of automatic cross-modal supervision. We leverage a question generation transformer trained on text data and use it to generate question-answer pairs from transcribed video narrations. Given narrated videos, we then automatically generate the HowToVQA69M dataset with 69M video-question-answer triplets. To handle the open vocabulary of diverse answers in this dataset, we propose a training procedure based on a contrastive loss between a video-question multi-modal transformer and an answer transformer. We introduce the zero-shot VideoQA task and show excellent results, in particular for rare answers. Furthermore, we demonstrate that our method significantly outperforms the state of the art on MSRVTT-QA, MSVD-QA, ActivityNet-QA and How2QA. Finally, for a detailed evaluation we introduce iVQA, a new VideoQA dataset with reduced language biases and high-quality redundant manual annotations.
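The training objective described above, a contrastive loss between a video-question encoder and an answer encoder, can be sketched roughly as follows. This is a minimal InfoNCE-style sketch in PyTorch, not the authors' implementation; the encoder outputs, embedding dimension, and temperature value are assumptions.

```python
# Sketch: contrastive objective between video-question embeddings and answer
# embeddings, in the spirit of the training procedure described above.
# Encoder architectures, dimensions, and temperature are assumptions.
import torch
import torch.nn.functional as F

def contrastive_vqa_loss(vq_emb: torch.Tensor, ans_emb: torch.Tensor, temperature: float = 0.07):
    """vq_emb: (B, D) outputs of a video-question multi-modal encoder.
    ans_emb: (B, D) outputs of an answer encoder for the matching answers."""
    vq = F.normalize(vq_emb, dim=-1)
    ans = F.normalize(ans_emb, dim=-1)
    logits = vq @ ans.t() / temperature                    # (B, B) similarity matrix
    targets = torch.arange(vq.size(0), device=vq.device)   # matching pairs lie on the diagonal
    # Each video-question pair should score highest with its own answer, using the
    # other answers in the batch as negatives (and symmetrically for answers).
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Usage with random embeddings:
loss = contrastive_vqa_loss(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```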