Cloned from "friedrichor/MSR-VTT". MSR-VTT contains 10K video clips and 200K captions. We adopt the standard 1K-A split protocol, introduced in JSFusion, which has since become the de facto benchmark split for text-video retrieval.

Train:
train_7k: 7,010 videos, 140,200 captions
train_9k: 9,000 videos, 180,000 captions
Test:
test_1k: 1,000 videos, 1,000 captions
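As a quick sanity check, the training caption counts above follow directly from MSR-VTT's 20 human-written captions per clip (the split names are the ones used in this repository); the 1K-A test split instead pairs each of its 1,000 videos with a single caption. A minimal sketch:

```python
# MSR-VTT training clips each carry 20 captions, so caption counts
# should be exactly 20x the video counts quoted in the card above.
CAPTIONS_PER_TRAIN_VIDEO = 20

train_splits = {"train_7k": 7_010, "train_9k": 9_000}

expected_captions = {
    name: n_videos * CAPTIONS_PER_TRAIN_VIDEO
    for name, n_videos in train_splits.items()
}

print(expected_captions)  # {'train_7k': 140200, 'train_9k': 180000}
```

Both results match the caption counts listed for `train_7k` and `train_9k`.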
🌟 Citation
@inproceedings{xu2016msrvtt, title={Msr-vtt: A large video description dataset for bridging video and language}…

Full description on the dataset page: https://huggingface.co/datasets/VLM2Vec/MSR-VTT