https://choosealicense.com/licenses/openrail/
🤿 DENSE VIDEO UNDERSTANDING WITH GATED RESIDUAL TOKENIZATION
Dense Information Video Evaluation (DIVE) Benchmark
The first-ever benchmark dedicated to the task of Dense Video Understanding, focusing on QA-driven high-frame-rate video comprehension, where the answer-relevant information is present in nearly every frame.
👥 Authors
Haichao Zhang1 · Wenhao Chai2 · Shwai He3 · Ang Li3 · Yun Fu1
1… See the full description on the dataset page: https://huggingface.co/datasets/haichaozhang/DenseVideoEvaluation.
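For quick programmatic access, a minimal sketch using the Hugging Face datasets library is shown below; the split and field names are assumptions, so check the dataset page for the actual layout.

```python
from datasets import load_dataset

# Minimal sketch: load the DIVE benchmark from the Hugging Face Hub.
# Split and field names below are assumptions; check the dataset page
# (haichaozhang/DenseVideoEvaluation) for the actual layout.
dive = load_dataset("haichaozhang/DenseVideoEvaluation")

print(dive)                      # DatasetDict listing the available splits
first_split = next(iter(dive))   # whichever split is listed first
print(dive[first_split][0])      # inspect one QA example
```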
Dense-World/video-res dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Card for PLM-Video Human
PLM-Video-Human is a collection of human-annotated resources for training Vision Language Models, focused on detailed video understanding. Training tasks include: fine-grained open-ended question answering (FGQA), Region-based Video Captioning (RCap), Region-based Dense Video Captioning (RDCap) and Region-based Temporal Localization (RTLoc). [📃 Tech Report] [📂 Github]
Dataset Structure
Fine-Grained Question Answering (FGQA)… See the full description on the dataset page: https://huggingface.co/datasets/facebook/PLM-Video-Human.
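Since the collection is organized into per-task subsets, one way to inspect a few records without downloading everything is streaming mode. The sketch below assumes a config named "fgqa" and a "train" split, taken from the task list above; verify the exact names on the dataset page.

```python
from itertools import islice
from datasets import load_dataset

# Sketch: stream a few FGQA records instead of downloading the full set.
# The config name "fgqa" and the "train" split are assumptions based on the
# task list above; verify the exact names on facebook/PLM-Video-Human.
fgqa = load_dataset("facebook/PLM-Video-Human", "fgqa",
                    split="train", streaming=True)

for record in islice(fgqa, 3):
    print(record)  # each record pairs a video reference with a fine-grained QA annotation
```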
This dataset was created by debabee
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset contains traffic surveillance video of the area of Nevsky Prospekt (the central street of Saint Petersburg) between the Moika River and Bolshaya Konyushennaya Street. The covered area contains a two-way road with dense vehicle traffic. The selected area is 102 meters long and 17.5 meters wide. The resolution of each video is 960x720 pixels. The dataset was collected in November 2017, December 2017, and January 2018. Seven days of footage from April 2017 are available here: https://figshare.com/articles/St_Petersburg_traffic_videos/5439706. The dataset and a compilation of movement-by-the-opposite-lane cases are available here: https://figshare.com/articles/Nevsky_prospect_traffic_surveillance_video_MBOL-cases_hours_/5841267
This dataset contains over 100,000 hours of video recordings captured worldwide. Designed for AI and machine-learning applications, it offers richly annotated, context-dense video data ideal for training vision-language models, action-recognition systems, and multimodal reasoning.
Key Features
1. Comprehensive Video Annotation Layers
Each video includes synchronized metadata across visual and audio channels, such as:
- Object annotations (bounding boxes, segmentation masks)
- Action labels and activity timelines
- Temporal event boundaries
- Transcripts for scenes containing speech
- Visual scene descriptions covering environment, objects, actions, and context
- Camera metadata (motion type, angle, field of view, lighting conditions)
This enables training for activity detection, video captioning, tracking, VLM grounding, and multimodal understanding (a schematic example of such a record follows this entry).
2. Unique Sourcing Capabilities
Videos are collected through controlled contribution pipelines designed to generate authentic, unscripted real-world footage. This provides:
- Natural human movement and behavior
- Diverse environments and camera devices
- Continuous flow of fresh recordings
- Ability to generate custom datasets (e.g., specific actions, locations, lighting, demographics, or motion patterns)
3. Global Visual & Cultural Diversity
Contributors from 100+ countries supply:
- Indoor and outdoor recordings
- Urban, rural, and specialized environments
- Varied cultural behaviors, activities, and settings
- Multiple languages and speaking styles where speech is present
This ensures robust generalization for global deployment.
4. High-Quality, Realistic Video Capture
Data includes a wide range of visual conditions:
- 4K, HD, and consumer-grade recordings
- Static, handheld, and moving cameras
- Low-light, daylight, and variable lighting
- Clean vs. noisy audio channels
- Natural occlusions, motion blur, and complex backgrounds
This diversity supports training models for real-world reliability and robustness.
5. AI-Ready Dataset Architecture
Optimized for modern ML workflows, enabling:
- Video classification & action recognition
- Video captioning & summarization
- Vision-language model (VLM) alignment
- Multimodal reasoning & grounding
- Safety, moderation, and risk detection
- Tracking, segmentation, and object detection
Compatible with leading ML frameworks and training pipelines.
Licensing & Compliance
- Fully compliant with global privacy standards
- Explicit contributor consent for video usage
- Documented rights and usage permissions
- Vetted for commercial and research use
Use Cases
- Training video classification and action-recognition models
- Vision-language model pretraining
- Multimodal AI for enterprise and consumer applications
- Safety, moderation, and anomaly detection
- Video captioning, retrieval, and summarization
- Research in activity analysis, human behavior, and multimodal grounding
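To make the annotation layers above concrete, the sketch below lays out a hypothetical per-clip record; the field names are illustrative only and do not reflect the provider's actual delivery schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Hypothetical per-clip record mirroring the annotation layers listed above.
# Field names are illustrative only, not the provider's delivery schema.

@dataclass
class BoxAnnotation:
    frame_index: int
    label: str                  # e.g., "person", "vehicle"
    bbox_xyxy: List[float]      # [x1, y1, x2, y2] in pixels

@dataclass
class ActionSegment:
    label: str                  # e.g., "opening a door"
    start_sec: float
    end_sec: float

@dataclass
class ClipRecord:
    video_path: str
    duration_sec: float
    boxes: List[BoxAnnotation] = field(default_factory=list)
    actions: List[ActionSegment] = field(default_factory=list)
    transcript: Optional[str] = None              # only for scenes containing speech
    scene_description: str = ""                   # environment, objects, actions, context
    camera_meta: Dict[str, str] = field(default_factory=dict)  # motion type, angle, FOV, lighting
```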
This dataset contains over 1,000 hours of facial expression selfie video recordings captured worldwide. Designed for AI and machine-learning applications, it offers richly annotated, context-dense video data ideal for training vision-language models, action-recognition systems, and multimodal reasoning.
Key Features
1. Comprehensive Video Annotation Layers
Each video includes synchronized metadata across visual and audio channels, such as:
- Object annotations (bounding boxes, segmentation masks)
- Action labels and activity timelines
- Temporal event boundaries
- Transcripts for scenes containing speech
- Visual scene descriptions covering environment, objects, actions, and context
- Camera metadata (motion type, angle, field of view, lighting conditions)
This enables training for activity detection, video captioning, tracking, VLM grounding, and multimodal understanding.
2. Unique Sourcing Capabilities
Videos are collected through controlled contribution pipelines designed to generate authentic, unscripted real-world footage. This provides:
- Natural human movement and behavior
- Diverse environments and camera devices
- Continuous flow of fresh recordings
- Ability to generate custom datasets (e.g., specific actions, locations, lighting, demographics, or motion patterns)
3. Global Visual & Cultural Diversity
Contributors from 100+ countries supply:
- Indoor and outdoor recordings
- Urban, rural, and specialized environments
- Varied cultural behaviors, activities, and settings
- Multiple languages and speaking styles where speech is present
This ensures robust generalization for global deployment.
4. High-Quality, Realistic Video Capture
Data includes a wide range of visual conditions:
- 4K, HD, and consumer-grade recordings
- Static, handheld, and moving cameras
- Low-light, daylight, and variable lighting
- Clean vs. noisy audio channels
- Natural occlusions, motion blur, and complex backgrounds
This diversity supports training models for real-world reliability and robustness.
5. AI-Ready Dataset Architecture
Optimized for modern ML workflows, enabling:
- Video classification and action recognition
- Video captioning and summarization
- Vision-language model (VLM) alignment
- Multimodal reasoning and grounding
- Safety, moderation, and risk detection
- Tracking, segmentation, and object detection
Compatible with leading ML frameworks and training pipelines.
Licensing & Compliance
- Fully compliant with global privacy standards
- Explicit contributor consent for video usage
- Documented rights and usage permissions
- Vetted for commercial and research use
Use Cases
- Training video classification and action-recognition models
- Vision-language model pretraining
- Multimodal AI for enterprise and consumer applications
- Safety, moderation, and anomaly detection
- Video captioning, retrieval, and summarization
- Research in activity analysis, human behavior, and multimodal grounding
Contextual reasoning is essential for understanding events in long, untrimmed videos. In this work, we systematically explore different captioning models with various contexts for the dense-captioning-events-in-video task, which aims to generate captions for the different events in an untrimmed video.
This dataset contains over 2,000 hours of face ID selfie video recordings captured worldwide. Designed for AI and machine-learning applications, it provides richly annotated, context-dense video data suitable for training vision-language models, action-recognition systems, identity-aware AI, and multimodal reasoning.
Key Features
1. Comprehensive Video Annotation Layers
Each video includes synchronized metadata across visual and audio channels, such as:
- Object annotations (bounding boxes, segmentation masks)
- Action labels and activity timelines
- Temporal event boundaries
- Transcripts for scenes containing speech
- Visual scene descriptions covering environment, objects, actions, and context
- Camera metadata (motion type, angle, field of view, lighting conditions)
This supports training for identity-aware video analysis, activity detection, video captioning, tracking, VLM grounding, and multimodal understanding.
2. Unique Sourcing Capabilities
Videos are collected through controlled contribution pipelines designed to generate authentic, unscripted real-world footage. This provides:
- Natural human movement and behavior
- Diverse environments and camera devices
- Continuous flow of fresh recordings
- Ability to generate custom datasets (e.g., specific actions, environments, lighting conditions, demographics, or motion patterns)
3. Global Visual & Cultural Diversity
Contributors from 100+ countries supply:
- Indoor and outdoor recordings
- Urban, rural, and specialized environments
- Varied cultural behaviors, activities, and settings
- Multiple languages and speaking styles where speech is present
This diversity ensures strong generalization for global identity-aware deployments.
4. High-Quality, Realistic Video Capture
Data includes a wide range of visual conditions:
- 4K, HD, and consumer-grade recordings
- Static, handheld, and moving cameras
- Low-light, daylight, and variable lighting
- Clean vs. noisy audio channels
- Natural occlusions, motion blur, and complex backgrounds
This supports robust performance in real-world face ID and video analysis systems.
5. AI-Ready Dataset Architecture
Optimized for modern ML workflows, enabling:
- Face ID model training and evaluation
- Video classification and action recognition
- Vision-language model (VLM) alignment
- Multimodal reasoning and grounding
- Safety, moderation, and risk detection
- Tracking, segmentation, and object detection
Compatible with leading ML frameworks and training pipelines.
Licensing & Compliance
- Fully compliant with global privacy standards
- Explicit contributor consent for face ID video usage
- Documented rights and usage permissions
- Vetted for commercial and research use
Use Cases
- Face ID and identity-aware model training
- Vision-language model pretraining
- Multimodal AI for enterprise and consumer applications
- Safety, moderation, and fraud prevention
- Video retrieval, indexing, and summarization
- Research in identity recognition, activity analysis, and multimodal grounding
This dataset contains over 5,000 hours of CCTV video recordings captured worldwide. Designed for AI and machine-learning applications, it offers richly annotated, context-dense video data ideal for training vision-language models, action-recognition systems, and multimodal reasoning.
Key Features
1. Comprehensive Video Annotation Layers
Each video includes synchronized metadata across visual and audio channels, such as:
- Object annotations (bounding boxes, segmentation masks)
- Action labels and activity timelines
- Temporal event boundaries
- Transcripts for scenes containing speech
- Visual scene descriptions covering environment, objects, actions, and context
- Camera metadata (motion type, angle, field of view, lighting conditions)
This enables training for activity detection, video captioning, tracking, VLM grounding, and multimodal understanding.
2. Unique Sourcing Capabilities
Videos are collected through controlled contribution pipelines designed to generate authentic, unscripted real-world footage. This provides:
- Natural human movement and behavior
- Diverse environments and camera devices
- Continuous flow of fresh recordings
- Ability to generate custom datasets (e.g., specific actions, locations, lighting, demographics, or motion patterns)
3. Global Visual & Cultural Diversity
Contributors from 100+ countries supply:
- Indoor and outdoor recordings
- Urban, rural, and specialized environments
- Varied cultural behaviors, activities, and settings
- Multiple languages and speaking styles where speech is present
This ensures robust generalization for global deployment.
4. High-Quality, Realistic Video Capture
Data includes a wide range of visual conditions:
- 4K, HD, and consumer-grade recordings
- Static, handheld, and moving cameras
- Low-light, daylight, and variable lighting
- Clean vs. noisy audio channels
- Natural occlusions, motion blur, and complex backgrounds
This diversity supports training models for real-world reliability and robustness.
5. AI-Ready Dataset Architecture
Optimized for modern ML workflows, enabling:
- Video classification & action recognition
- Video captioning & summarization
- Vision-language model (VLM) alignment
- Multimodal reasoning & grounding
- Safety, moderation, and risk detection
- Tracking, segmentation, and object detection
Compatible with leading ML frameworks and training pipelines.
Licensing & Compliance
- Fully compliant with global privacy standards
- Explicit contributor consent for video usage
- Documented rights and usage permissions
- Vetted for commercial and research use
Use Cases
- Training video classification and action-recognition models
- Vision-language model pretraining
- Multimodal AI for enterprise and consumer applications
- Safety, moderation, and anomaly detection
- Video captioning, retrieval, and summarization
- Research in activity analysis, human behavior, and multimodal grounding
This dataset comes from the research work at https://xuange923.github.io/Surveillance-Video-Understanding
All credit goes to the researchers involved. I highly recommend reading the research paper for a better, more concrete understanding of the dataset and of the experiments the researchers performed on Temporal Sentence Grounding in Videos, Video Captioning, Dense Video Captioning, and Multimodal Anomaly Detection.
The description given here covers the key takeaways about the dataset.
Current surveillance video tasks mainly focus on classifying and localizing anomalous events, and existing surveillance video datasets lack sentence-level language annotations. The researchers therefore propose a new research direction of surveillance video-and-language understanding by constructing the UCA (UCF-Crime Annotation) dataset.
The researchers manually annotated the event content and event occurrence time for 1,854 videos from UCF-Crime, producing the UCF-Crime Annotation (UCA) dataset. The dataset contains 23,542 sentences, with an average length of 20 words, and its annotated videos total 110.7 hours.
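As a rough illustration of how such sentence-level temporal annotations are typically consumed, the sketch below parses hypothetical (start, end, sentence) entries and recomputes the kind of corpus statistics quoted above; the JSON layout is an assumption, not the actual UCA release format.

```python
import json

# Sketch: recompute corpus-level statistics from sentence-level temporal
# annotations. The JSON layout is a hypothetical example, not the actual
# UCA release format.
annotations = json.loads("""
{
  "video_0001": [
    {"start": 12.0, "end": 25.5, "sentence": "A man breaks the window of a parked car."},
    {"start": 40.2, "end": 58.0, "sentence": "He reaches inside and takes a bag from the seat."}
  ]
}
""")

segments = [seg for segs in annotations.values() for seg in segs]
avg_words = sum(len(s["sentence"].split()) for s in segments) / len(segments)
annotated_hours = sum(s["end"] - s["start"] for s in segments) / 3600

print(f"{len(segments)} sentences, avg {avg_words:.1f} words, {annotated_hours:.4f} annotated hours")
```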
@misc{yuan2023surveillance,
title={Towards Surveillance Video-and-Language Understanding: New Dataset, Baselines, and Challenges},
author={Tongtong Yuan and Xuange Zhang and Kun Liu and Bo Liu and Chen Chen and Jian Jin and Zhenzhen Jiao},
year={2023},
eprint={2309.13925},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
This dataset contains over 5,000 hours of facial expression video recordings captured worldwide. Designed for AI and machine-learning applications, it offers richly annotated, context-dense video data ideal for training vision-language models, action-recognition systems, and multimodal reasoning.
Key Features
1. Comprehensive Video Annotation Layers
Each video includes synchronized metadata across visual and audio channels, such as:
- Object annotations (bounding boxes, segmentation masks)
- Action labels and activity timelines
- Temporal event boundaries
- Transcripts for scenes containing speech
- Visual scene descriptions covering environment, objects, actions, and context
- Camera metadata (motion type, angle, field of view, lighting conditions)
This enables training for activity detection, video captioning, tracking, VLM grounding, and multimodal understanding.
2. Unique Sourcing Capabilities
Videos are collected through controlled contribution pipelines designed to generate authentic, unscripted real-world footage. This provides:
- Natural human movement and behavior
- Diverse environments and camera devices
- Continuous flow of fresh recordings
- Ability to generate custom datasets (e.g., specific actions, locations, lighting, demographics, or motion patterns)
3. Global Visual & Cultural Diversity
Contributors from 100+ countries supply:
- Indoor and outdoor recordings
- Urban, rural, and specialized environments
- Varied cultural behaviors, activities, and settings
- Multiple languages and speaking styles where speech is present
This ensures robust generalization for global deployment.
4. High-Quality, Realistic Video Capture
Data includes a wide range of visual conditions:
- 4K, HD, and consumer-grade recordings
- Static, handheld, and moving cameras
- Low-light, daylight, and variable lighting
- Clean vs. noisy audio channels
- Natural occlusions, motion blur, and complex backgrounds
This diversity supports training models for real-world reliability and robustness.
5. AI-Ready Dataset Architecture
Optimized for modern ML workflows, enabling:
- Video classification & action recognition
- Video captioning & summarization
- Vision-language model (VLM) alignment
- Multimodal reasoning & grounding
- Safety, moderation, and risk detection
- Tracking, segmentation, and object detection
Compatible with leading ML frameworks and training pipelines.
Licensing & Compliance
- Fully compliant with global privacy standards
- Explicit contributor consent for video usage
- Documented rights and usage permissions
- Vetted for commercial and research use
Use Cases
- Training video classification and action-recognition models
- Vision-language model pretraining
- Multimodal AI for enterprise and consumer applications
- Safety, moderation, and anomaly detection
- Video captioning, retrieval, and summarization
- Research in activity analysis, human behavior, and multimodal grounding
SceneWalk Dataset Card
Dataset details
Dataset type: SceneWalk is a new high-quality video dataset with thorough captioning for each video. It includes dense and detailed descriptions for every video segment across the entire scene context. The SceneWalk dataset, sourced from 87.8K long, untrimmed YouTube videos (avg. 486 seconds each), features frequent scene transitions across a total of 11.8K hours of video and 1.3M segmented video clips.… See the full description on the dataset page: https://huggingface.co/datasets/IVLLab/SceneWalk.
Video scene parsing in the wild, with its diverse scenarios, is a challenging task of great significance, especially with the rapid development of autonomous driving technology. The Video Scene Parsing in the Wild (VSPW) dataset contains well-trimmed, long-temporal, densely annotated, high-resolution clips.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Event cameras are sensors that are inspired by biological systems and specialize in capturing changes in brightness. These emerging cameras offer numerous advantages over conventional frame-based cameras, including high dynamic range, high frame rates, and extremely low power consumption. As a result, event cameras are increasingly being used in various fields, such as object detection and tracking, autonomous driving, 3D reconstruction, visual odometry, and SLAM.
We have created the first large-scale synthetic event camera voxel 3D reconstruction dataset, comprising over 39,739 simulated event camera 3D object scans from 13 different object categories. Each entry in the dataset contains a 0.5-second, 240fps high frame rate RGB video scan, simulated event camera data, the original 3D model, and a converted 32x32x32 voxel model.
The 3D models used in this dataset are from ShapeNet (Link: https://shapenet.org/).
Although this dataset provides only voxel representations as ground truth, obtaining other types of ground truth, such as point clouds, is trivial with the provided glTF 3D models. We hope that by publishing this dataset, we can accelerate the advancement of event-based 3D reconstruction.
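Since the ground truth is a 32x32x32 occupancy grid, converting it into a point cloud of occupied-cell centers takes only a few lines of NumPy. The sketch below assumes the voxel model is available as a boolean array saved with NumPy; the actual on-disk format in the release may differ.

```python
import numpy as np

# Sketch: turn a 32x32x32 occupancy grid into a point cloud of occupied-cell
# centers. Assumes the voxel model can be loaded as a boolean NumPy array
# (the file name and on-disk format here are illustrative).
voxels = np.load("voxel_32.npy").astype(bool)       # shape (32, 32, 32)
assert voxels.shape == (32, 32, 32)

occupied = np.argwhere(voxels)                       # (N, 3) indices of filled cells
points = (occupied + 0.5) / 32.0 - 0.5               # cell centers, normalized to [-0.5, 0.5]^3

print(f"{len(points)} occupied voxels out of {voxels.size}")
np.savetxt("voxel_points.xyz", points, fmt="%.4f")   # simple .xyz point cloud for inspection
```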
The original paper is available at:
IEEE Xplore - https://ieeexplore.ieee.org/document/10169359
ArXiv - https://arxiv.org/abs/2309.00385
!!! Due to limited resources, we are unable to release the full dataset, which is around 1.2 TB in size. We would greatly appreciate any organizations willing to host the full dataset for us. (Contact: haodong.chen@sydney.edu.au)
!!! The released SynthEVox3D-Tiny dataset, which is the dataset used in the original paper, is around 32 GB. We also provide scripts in the utils folder of the dataset to reproduce our results, making it possible for other researchers to recreate the entire dataset from scratch.
H. Chen, V. Chung, L. Tan and X. Chen, "Dense Voxel 3D Reconstruction Using a Monocular Event Camera," 2023 9th International Conference on Virtual Reality (ICVR), Xianyang, China, 2023, pp. 30-35, doi: 10.1109/ICVR57957.2023.10169359.
@INPROCEEDINGS{10169359,
author={Chen, Haodong and Chung, Vera and Tan, Li and Chen, Xiaoming},
booktitle={2023 9th International Conference on Virtual Reality (ICVR)},
title={Dense Voxel 3D Reconstruction Using a Monocular Event Camera},
year={2023},
volume={},
number={},
pages={30-35},
doi={10.1109/ICVR57957.2023.10169359}}
This dataset contains over 10,000 hours of human actions video recordings captured worldwide. Designed for AI and machine-learning applications, it offers richly annotated, context-dense video data ideal for training vision-language models, action-recognition systems, and multimodal reasoning.
Key Features
1. Comprehensive Video Annotation Layers
Each video includes synchronized metadata across visual and audio channels, such as:
- Object annotations (bounding boxes, segmentation masks)
- Action labels and activity timelines
- Temporal event boundaries
- Transcripts for scenes containing speech
- Visual scene descriptions covering environment, objects, actions, and context
- Camera metadata (motion type, angle, field of view, lighting conditions)
This enables training for activity detection, video captioning, tracking, VLM grounding, and multimodal understanding.
2. Unique Sourcing Capabilities
Videos are collected through controlled contribution pipelines designed to generate authentic, unscripted real-world footage. This provides:
- Natural human movement and behavior
- Diverse environments and camera devices
- Continuous flow of fresh recordings
- Ability to generate custom datasets (e.g., specific actions, locations, lighting, demographics, or motion patterns)
3. Global Visual & Cultural Diversity
Contributors from 100+ countries supply:
- Indoor and outdoor recordings
- Urban, rural, and specialized environments
- Varied cultural behaviors, activities, and settings
- Multiple languages and speaking styles where speech is present
This ensures robust generalization for global deployment.
4. High-Quality, Realistic Video Capture
Data includes a wide range of visual conditions:
- 4K, HD, and consumer-grade recordings
- Static, handheld, and moving cameras
- Low-light, daylight, and variable lighting
- Clean vs. noisy audio channels
- Natural occlusions, motion blur, and complex backgrounds
This diversity supports training models for real-world reliability and robustness.
5. AI-Ready Dataset Architecture
Optimized for modern ML workflows, enabling:
- Video classification & action recognition
- Video captioning & summarization
- Vision-language model (VLM) alignment
- Multimodal reasoning & grounding
- Safety, moderation, and risk detection
- Tracking, segmentation, and object detection
Compatible with leading ML frameworks and training pipelines.
Licensing & Compliance
- Fully compliant with global privacy standards
- Explicit contributor consent for video usage
- Documented rights and usage permissions
- Vetted for commercial and research use
Use Cases
- Training video classification and action-recognition models
- Vision-language model pretraining
- Multimodal AI for enterprise and consumer applications
- Safety, moderation, and anomaly detection
- Video captioning, retrieval, and summarization
- Research in activity analysis, human behavior, and multimodal grounding
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Video Detailed Caption Benchmark
Resources
Website · arXiv: Paper · GitHub: Code · Hugging Face: AuroraCap Model · Hugging Face: VDC Benchmark · Hugging Face: Trainset
Features
Benchmark Collection and Processing
We build VDC upon Panda-70M, Ego4D, Mixkit, Pixabay, and Pexels. Structured detailed captions construction pipeline: we develop a structured detailed captions construction pipeline to generate extra detailed descriptions from various… See the full description on the dataset page: https://huggingface.co/datasets/wchai/Video-Detailed-Caption.
This dataset contains over 10,000 hours of object manipulation video recordings captured worldwide. Designed for AI and machine-learning applications, it offers richly annotated, context-dense video data ideal for training vision-language models, action-recognition systems, and multimodal reasoning.
Key Features
1. Comprehensive Video Annotation Layers
Each video includes synchronized metadata across visual and audio channels, such as:
- Object annotations (bounding boxes, segmentation masks)
- Action labels and activity timelines
- Temporal event boundaries
- Transcripts for scenes containing speech
- Visual scene descriptions covering environment, objects, actions, and context
- Camera metadata (motion type, angle, field of view, lighting conditions)
This enables training for activity detection, video captioning, tracking, VLM grounding, and multimodal understanding.
2. Unique Sourcing Capabilities
Videos are collected through controlled contribution pipelines designed to generate authentic, unscripted real-world footage. This provides:
- Natural human movement and behavior
- Diverse environments and camera devices
- Continuous flow of fresh recordings
- Ability to generate custom datasets (e.g., specific actions, locations, lighting, demographics, or motion patterns)
3. Global Visual & Cultural Diversity
Contributors from 100+ countries supply:
- Indoor and outdoor recordings
- Urban, rural, and specialized environments
- Varied cultural behaviors, activities, and settings
- Multiple languages and speaking styles where speech is present
This ensures robust generalization for global deployment.
4. High-Quality, Realistic Video Capture
Data includes a wide range of visual conditions:
- 4K, HD, and consumer-grade recordings
- Static, handheld, and moving cameras
- Low-light, daylight, and variable lighting
- Clean vs. noisy audio channels
- Natural occlusions, motion blur, and complex backgrounds
This diversity supports training models for real-world reliability and robustness.
5. AI-Ready Dataset Architecture
Optimized for modern ML workflows, enabling:
- Video classification & action recognition
- Video captioning & summarization
- Vision-language model (VLM) alignment
- Multimodal reasoning & grounding
- Safety, moderation, and risk detection
- Tracking, segmentation, and object detection
Compatible with leading ML frameworks and training pipelines.
Licensing & Compliance
- Fully compliant with global privacy standards
- Explicit contributor consent for video usage
- Documented rights and usage permissions
- Vetted for commercial and research use
Use Cases
- Training video classification and action-recognition models
- Vision-language model pretraining
- Multimodal AI for enterprise and consumer applications
- Safety, moderation, and anomaly detection
- Video captioning, retrieval, and summarization
- Research in activity analysis, human behavior, and multimodal grounding
http://dcat-ap.de/def/licenses/cc-by
This dataset consists of 28 video sequences of driving recorded in the CARLA simulator, for a total of 10,767 frames. Pixel-wise semantic labels are provided for each frame. The scenes are recorded in dynamic weather and traffic conditions.
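A minimal sketch for pairing a frame with its pixel-wise label map is shown below; the directory layout, file names, and class-id palette are assumptions, since the actual structure is described in the dataset's own documentation.

```python
import numpy as np
from PIL import Image

# Sketch: pair one RGB frame with its pixel-wise semantic label map.
# The directory layout and the assumption that labels are stored as
# single-channel class-id images are illustrative; check the dataset's
# own documentation for the actual structure and palette.
frame = np.array(Image.open("sequence_01/frames/000123.png").convert("RGB"))
labels = np.array(Image.open("sequence_01/labels/000123.png"))  # one class id per pixel

print(frame.shape, labels.shape)              # (H, W, 3) and (H, W)
print("classes present:", np.unique(labels))  # semantic class ids in this frame

# Fraction of pixels with class id 7 ("road" in common CARLA palettes)
print(f"road pixels: {float((labels == 7).mean()):.2%}")
```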