68 datasets found
  1. DenseVideoEvaluation

    • huggingface.co
    Updated Sep 17, 2025
    Cite
    Haichao Zhang (2025). DenseVideoEvaluation [Dataset]. http://doi.org/10.57967/hf/6523
    Explore at:
    Dataset updated
    Sep 17, 2025
    Authors
    Haichao Zhang
    License

    https://choosealicense.com/licenses/openrail/

    Description

    🤿 DENSE VIDEO UNDERSTANDING WITH GATED RESIDUAL TOKENIZATION

      Dense Information Video Evaluation (DIVE) Benchmark
    

    The first-ever benchmark dedicated to the task of Dense Video Understanding, focusing on QA-driven high-frame-rate video comprehension, where the answer-relevant information is present in nearly every frame.

      👥 Authors
    

    Haichao Zhang1 · Wenhao Chai2 · Shwai He3 · Ang Li3 · Yun Fu1

    1… See the full description on the dataset page: https://huggingface.co/datasets/haichaozhang/DenseVideoEvaluation.
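
    A quick way to inspect this benchmark is to pull it from the Hugging Face Hub with the datasets library. This is a minimal sketch: the split and field names are not stated above, so treat them as assumptions and check the dataset card for the actual schema.

    from datasets import load_dataset  # pip install datasets

    # Download the benchmark; the available splits are printed rather than assumed.
    dive = load_dataset("haichaozhang/DenseVideoEvaluation")
    print(dive)  # shows split names and per-split features

    # Inspect one record from the first split (field names depend on the card).
    first_split = next(iter(dive.values()))
    print(first_split[0])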
    
  2. video-res

    • huggingface.co
    Updated Sep 13, 2024
    Cite
    Dense World (2024). video-res [Dataset]. https://huggingface.co/datasets/Dense-World/video-res
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 13, 2024
    Dataset authored and provided by
    Dense World
    Description

    Dense-World/video-res dataset hosted on Hugging Face and contributed by the HF Datasets community

  3. PLM-Video-Human

    • huggingface.co
    Updated Apr 17, 2025
    Cite
    AI at Meta (2025). PLM-Video-Human [Dataset]. https://huggingface.co/datasets/facebook/PLM-Video-Human
    Explore at:
    Dataset updated
    Apr 17, 2025
    Dataset authored and provided by
    AI at Meta
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Card for PLM-Video Human

    PLM-Video-Human is a collection of human-annotated resources for training Vision Language Models, focused on detailed video understanding. Training tasks include: fine-grained open-ended question answering (FGQA), Region-based Video Captioning (RCap), Region-based Dense Video Captioning (RDCap) and Region-based Temporal Localization (RTLoc). [📃 Tech Report] [📂 Github]

      Dataset Structure

      Fine-Grained Question Answering (FGQA)… See the full description on the dataset page: https://huggingface.co/datasets/facebook/PLM-Video-Human.
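
    The four training tasks above ship in one dataset repository. A minimal loading sketch follows; the configuration name "fgqa" is an assumption derived from the task list, so consult the dataset card for the real subset names and fields.

    from datasets import load_dataset

    # Hypothetical config name for the fine-grained QA subset.
    fgqa = load_dataset("facebook/PLM-Video-Human", "fgqa", split="train")

    # Assumed fields for open-ended QA: a video reference, a question, an answer.
    print(fgqa[0])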
    
  4. video-captions

    • kaggle.com
    zip
    Updated Jul 17, 2024
    Cite
    debabee (2024). video-captions [Dataset]. https://www.kaggle.com/datasets/debakshii/video-captions
    Explore at:
    Available download formats: zip (698,861 bytes)
    Dataset updated
    Jul 17, 2024
    Authors
    debabee
    Description

    Dataset

    This dataset was created by debabee


  5. Nevsky prospect traffic surveillance video

    • figshare.com
    bin
    Updated May 16, 2018
    + more versions
    Cite
    Artur Grigorev (2018). Nevsky prospect traffic surveillance video [Dataset]. http://doi.org/10.6084/m9.figshare.5841846.v24
    Explore at:
    Available download formats: bin
    Dataset updated
    May 16, 2018
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Artur Grigorev
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Nevsky Avenue
    Description

    The dataset contains traffic surveillance video of Nevsky Prospect (the central street of Saint Petersburg) between the Moika River and Bolshaya Konyushennaya Street. The covered area contains a two-way road with dense vehicle traffic; the selected stretch is 102 meters long and 17.5 meters wide. The resolution of each video is 960x720 pixels. The dataset was collected in November 2017, December 2017, and January 2018. Seven days from April 2017 are available at https://figshare.com/articles/St_Petersburg_traffic_videos/5439706. The dataset and a compilation of movement-by-the-opposite-lane cases are available at https://figshare.com/articles/Nevsky_prospect_traffic_surveillance_video_MBOL-cases_hours_/5841267.
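
    The stated geometry (a 102 m x 17.5 m stretch filmed at 960x720) gives a rough ground-sampling estimate, useful when converting tracked pixel displacements into vehicle speeds. The sketch below assumes the scene roughly fills the frame; a real pipeline would calibrate with a homography from road markings instead. The file name is hypothetical.

    import cv2  # pip install opencv-python

    scene_length_m, scene_width_m = 102.0, 17.5
    frame_w_px, frame_h_px = 960, 720
    print(f"~{scene_length_m / frame_w_px:.3f} m/px along the street")   # ~0.106 m/px
    print(f"~{scene_width_m / frame_h_px:.3f} m/px across the street")   # ~0.024 m/px

    cap = cv2.VideoCapture("nevsky_clip.mp4")  # hypothetical file from the download
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        # ... run a vehicle detector/tracker on `frame` here ...
    cap.release()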

  6. 100K+ Hours of Video Data | AI Training Data | Annotated Video for AI |...

    • datarade.ai
    Updated Dec 15, 2025
    + more versions
    Cite
    DataSeeds.AI (2025). 100K+ Hours of Video Data | AI Training Data | Annotated Video for AI | Bounding Boxes, Action Labels & Scene Descriptions | Global Coverage [Dataset]. https://datarade.ai/data-products/100k-hours-of-video-data-ai-training-data-annotated-vide-data-seeds
    Explore at:
    Available download formats: .json, .xml, .csv, .txt, .mp4, .mov
    Dataset updated
    Dec 15, 2025
    Dataset provided by
    DataSeeds.AI
    Area covered
    Namibia, France, Botswana, Burkina Faso, Maldives, Slovakia, Kyrgyzstan, Norfolk Island, Estonia, Cameroon
    Description

    This dataset contains over 100,000 hours of video recordings captured worldwide. Designed for AI and machine-learning applications, it offers richly annotated, context-dense video data ideal for training vision-language models, action-recognition systems, and multimodal reasoning.

    Key Features

    1. Comprehensive Video Annotation Layers
    Each video includes synchronized metadata across visual and audio channels, such as:
    - Object annotations (bounding boxes, segmentation masks)
    - Action labels and activity timelines
    - Temporal event boundaries
    - Transcripts for scenes containing speech
    - Visual scene descriptions covering environment, objects, actions, and context
    - Camera metadata (motion type, angle, field of view, lighting conditions)
    This enables training for activity detection, video captioning, tracking, VLM grounding, and multimodal understanding. (A sketch of parsing one such annotation record follows this entry.)

    2. Unique Sourcing Capabilities
    Videos are collected through controlled contribution pipelines designed to generate authentic, unscripted real-world footage. This provides:
    - Natural human movement and behavior
    - Diverse environments and camera devices
    - Continuous flow of fresh recordings
    - Ability to generate custom datasets (e.g., specific actions, locations, lighting, demographics, or motion patterns)

    3. Global Visual & Cultural Diversity
    Contributors from 100+ countries supply:
    - Indoor and outdoor recordings
    - Urban, rural, and specialized environments
    - Varied cultural behaviors, activities, and settings
    - Multiple languages and speaking styles where speech is present
    This ensures robust generalization for global deployment.

    4. High-Quality, Realistic Video Capture
    Data includes a wide range of visual conditions:
    - 4K, HD, and consumer-grade recordings
    - Static, handheld, and moving cameras
    - Low-light, daylight, and variable lighting
    - Clean vs. noisy audio channels
    - Natural occlusions, motion blur, and complex backgrounds
    This diversity supports training models for real-world reliability and robustness.

    5. AI-Ready Dataset Architecture
    Optimized for modern ML workflows, enabling:
    - Video classification & action recognition
    - Video captioning & summarization
    - Vision-language model (VLM) alignment
    - Multimodal reasoning & grounding
    - Safety, moderation, and risk detection
    - Tracking, segmentation, and object detection
    Compatible with leading ML frameworks and training pipelines.

    6. Licensing & Compliance
    - Fully compliant with global privacy standards
    - Explicit contributor consent for video usage
    - Documented rights and usage permissions
    - Vetted for commercial and research use

    Use Cases
    - Training video classification and action-recognition models
    - Vision-language model pretraining
    - Multimodal AI for enterprise and consumer applications
    - Safety, moderation, and anomaly detection
    - Video captioning, retrieval, and summarization
    - Research in activity analysis, human behavior, and multimodal grounding
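
    The vendor lists .json among the delivery formats but publishes no schema in this listing, so the record below is purely illustrative: it mirrors the annotation layers named above (object boxes, action labels, event boundaries, transcripts, camera metadata) and shows how such a record might be consumed.

    import json

    # Hypothetical annotation record; the real field names may differ.
    record = json.loads("""
    {
      "video": "clip_000123.mp4",
      "camera": {"motion": "handheld", "angle": "eye-level", "lighting": "daylight"},
      "objects": [{"track_id": 1, "label": "person", "frame": 42, "bbox_xyxy": [120, 80, 310, 540]}],
      "actions": [{"label": "walking", "start_s": 0.0, "end_s": 3.2}],
      "transcript": [{"start_s": 1.1, "end_s": 2.4, "text": "over here"}]
    }
    """)

    for action in record["actions"]:
        print(f'{record["video"]}: {action["label"]} {action["start_s"]}s-{action["end_s"]}s')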

  7. 1K+ Hours of Selfie Video Data | AI Training Data | Annotated Video for AI |...

    • datarade.ai
    Updated Dec 15, 2025
    Cite
    DataSeeds.AI (2025). 1K+ Hours of Selfie Video Data | AI Training Data | Annotated Video for AI | Bounding Boxes, Action Labels & Scene Descriptions | Global Coverage [Dataset]. https://datarade.ai/data-products/1k-hours-of-selfie-video-data-ai-training-data-annotated-data-seeds
    Explore at:
    Available download formats: .json, .xml, .csv, .txt, .mp4, .mov
    Dataset updated
    Dec 15, 2025
    Dataset provided by
    DataSeeds.AI
    Area covered
    Tanzania, Mexico, American Samoa, Nicaragua, Denmark, India, Djibouti, United Arab Emirates, Nigeria, Tajikistan
    Description

    This dataset contains over 1,000 hours of facial expression selfie video recordings captured worldwide. Designed for AI and machine-learning applications, it offers richly annotated, context-dense video data ideal for training vision-language models, action-recognition systems, and multimodal reasoning.

    Key Features

    1. Comprehensive Video Annotation Layers
    Each video includes synchronized metadata across visual and audio channels, such as:
    - Object annotations (bounding boxes, segmentation masks)
    - Action labels and activity timelines
    - Temporal event boundaries
    - Transcripts for scenes containing speech
    - Visual scene descriptions covering environment, objects, actions, and context
    - Camera metadata (motion type, angle, field of view, lighting conditions)
    This enables training for activity detection, video captioning, tracking, VLM grounding, and multimodal understanding.

    2. Unique Sourcing Capabilities
    Videos are collected through controlled contribution pipelines designed to generate authentic, unscripted real-world footage. This provides:
    - Natural human movement and behavior
    - Diverse environments and camera devices
    - Continuous flow of fresh recordings
    - Ability to generate custom datasets (e.g., specific actions, locations, lighting, demographics, or motion patterns)

    3. Global Visual & Cultural Diversity
    Contributors from 100+ countries supply:
    - Indoor and outdoor recordings
    - Urban, rural, and specialized environments
    - Varied cultural behaviors, activities, and settings
    - Multiple languages and speaking styles where speech is present
    This ensures robust generalization for global deployment.

    4. High-Quality, Realistic Video Capture
    Data includes a wide range of visual conditions:
    - 4K, HD, and consumer-grade recordings
    - Static, handheld, and moving cameras
    - Low-light, daylight, and variable lighting
    - Clean vs. noisy audio channels
    - Natural occlusions, motion blur, and complex backgrounds
    This diversity supports training models for real-world reliability and robustness.

    5. AI-Ready Dataset Architecture
    Optimized for modern ML workflows, enabling:
    - Video classification and action recognition
    - Video captioning and summarization
    - Vision-language model (VLM) alignment
    - Multimodal reasoning and grounding
    - Safety, moderation, and risk detection
    - Tracking, segmentation, and object detection
    Compatible with leading ML frameworks and training pipelines.

    6. Licensing & Compliance
    - Fully compliant with global privacy standards
    - Explicit contributor consent for video usage
    - Documented rights and usage permissions
    - Vetted for commercial and research use

    Use Cases
    - Training video classification and action-recognition models
    - Vision-language model pretraining
    - Multimodal AI for enterprise and consumer applications
    - Safety, moderation, and anomaly detection
    - Video captioning, retrieval, and summarization
    - Research in activity analysis, human behavior, and multimodal grounding

  8. Data from: ActivityNet 2019 Task 3: Exploring Contexts for Dense Captioning...

    • resodate.org
    Updated Dec 16, 2024
    Cite
    Shizhe Chen; Yuqing Song; Yida Zhao; Qin Jin; Zhaoyang Zeng; Bei Liu; Jianlong Fu; Alexander Hauptmann (2024). ActivityNet 2019 Task 3: Exploring Contexts for Dense Captioning Events in Videos [Dataset]. https://resodate.org/resources/aHR0cHM6Ly9zZXJ2aWNlLnRpYi5ldS9sZG1zZXJ2aWNlL2RhdGFzZXQvYWN0aXZpdHluZXQtMjAxOS10YXNrLTMtLWV4cGxvcmluZy1jb250ZXh0cy1mb3ItZGVuc2UtY2FwdGlvbmluZy1ldmVudHMtaW4tdmlkZW9z
    Explore at:
    Dataset updated
    Dec 16, 2024
    Dataset provided by
    Leibniz Data Manager
    Authors
    Shizhe Chen; Yuqing Song; Yida Zhao; Qin Jin; Zhaoyang Zeng; Bei Liu; Jianlong Fu; Alexander Hauptmann
    Description

    Contextual reasoning is essential to understand events in long untrimmed videos. In this work, we systematically explore different captioning models with various contexts for the dense-captioning events in video task, which aims to generate captions for different events in the untrimmed video.

  9. 2K+ Hours of Face id Video Data | AI Training Data | Annotated Video for AI...

    • datarade.ai
    Updated Feb 7, 2026
    Cite
    DataSeeds.AI (2026). 2K+ Hours of Face id Video Data | AI Training Data | Annotated Video for AI | Bounding Boxes, Action Labels & Scene Descriptions | Global [Dataset]. https://datarade.ai/data-products/2k-hours-of-face-id-video-data-ai-training-data-annotate-data-seeds
    Explore at:
    Available download formats: .json, .xml, .csv, .txt, .mp4, .mov
    Dataset updated
    Feb 7, 2026
    Dataset provided by
    DataSeeds.AI
    Area covered
    Luxembourg, Macao, Bosnia and Herzegovina, Philippines, Guernsey, British Indian Ocean Territory, Colombia, Armenia, Belarus, Niger
    Description

    This dataset contains over 2,000 hours of face ID selfie video recordings captured worldwide. Designed for AI and machine-learning applications, it provides richly annotated, context-dense video data suitable for training vision-language models, action-recognition systems, identity-aware AI, and multimodal reasoning.

    Key Features

    1. Comprehensive Video Annotation Layers
    Each video includes synchronized metadata across visual and audio channels, such as:
    - Object annotations (bounding boxes, segmentation masks)
    - Action labels and activity timelines
    - Temporal event boundaries
    - Transcripts for scenes containing speech
    - Visual scene descriptions covering environment, objects, actions, and context
    - Camera metadata (motion type, angle, field of view, lighting conditions)
    This supports training for identity-aware video analysis, activity detection, video captioning, tracking, VLM grounding, and multimodal understanding.

    2. Unique Sourcing Capabilities
    Videos are collected through controlled contribution pipelines designed to generate authentic, unscripted real-world footage. This provides:
    - Natural human movement and behavior
    - Diverse environments and camera devices
    - Continuous flow of fresh recordings
    - Ability to generate custom datasets (e.g., specific actions, environments, lighting conditions, demographics, or motion patterns)

    3. Global Visual & Cultural Diversity
    Contributors from 100+ countries supply:
    - Indoor and outdoor recordings
    - Urban, rural, and specialized environments
    - Varied cultural behaviors, activities, and settings
    - Multiple languages and speaking styles where speech is present
    This diversity ensures strong generalization for global identity-aware deployments.

    4. High-Quality, Realistic Video Capture
    Data includes a wide range of visual conditions:
    - 4K, HD, and consumer-grade recordings
    - Static, handheld, and moving cameras
    - Low-light, daylight, and variable lighting
    - Clean vs. noisy audio channels
    - Natural occlusions, motion blur, and complex backgrounds
    This supports robust performance in real-world face ID and video analysis systems.

    5. AI-Ready Dataset Architecture
    Optimized for modern ML workflows, enabling:
    - Face ID model training and evaluation
    - Video classification and action recognition
    - Vision-language model (VLM) alignment
    - Multimodal reasoning and grounding
    - Safety, moderation, and risk detection
    - Tracking, segmentation, and object detection
    Compatible with leading ML frameworks and training pipelines.

    6. Licensing & Compliance
    - Fully compliant with global privacy standards
    - Explicit contributor consent for face ID video usage
    - Documented rights and usage permissions
    - Vetted for commercial and research use

    Use Cases
    - Face ID and identity-aware model training
    - Vision-language model pretraining
    - Multimodal AI for enterprise and consumer applications
    - Safety, moderation, and fraud prevention
    - Video retrieval, indexing, and summarization
    - Research in identity recognition, activity analysis, and multimodal grounding

  10. 5K+ Hours of CCTV Video Data | AI Training Data | Annotated Video for AI |...

    • datarade.ai
    Updated Dec 15, 2025
    Cite
    DataSeeds.AI (2025). 5K+ Hours of CCTV Video Data | AI Training Data | Annotated Video for AI | Bounding Boxes, Action Labels & Scene Descriptions | Global Covera [Dataset]. https://datarade.ai/data-products/5k-hours-of-cctv-video-data-ai-training-data-annotated-v-data-seeds
    Explore at:
    Available download formats: .json, .xml, .csv, .txt, .mp4, .mov
    Dataset updated
    Dec 15, 2025
    Dataset provided by
    DataSeeds.AI
    Area covered
    Heard Island and McDonald Islands, Portugal, Belize, Hungary, Serbia, Turkey, Niue, Christmas Island, Bermuda, Dominica
    Description

    This dataset contains over 5,000 hours of CCTV video recordings captured worldwide. Designed for AI and machine-learning applications, it offers richly annotated, context-dense video data ideal for training vision-language models, action-recognition systems, and multimodal reasoning.

    Key Features

    1. Comprehensive Video Annotation Layers
    Each video includes synchronized metadata across visual and audio channels, such as:
    - Object annotations (bounding boxes, segmentation masks)
    - Action labels and activity timelines
    - Temporal event boundaries
    - Transcripts for scenes containing speech
    - Visual scene descriptions covering environment, objects, actions, and context
    - Camera metadata (motion type, angle, field of view, lighting conditions)
    This enables training for activity detection, video captioning, tracking, VLM grounding, and multimodal understanding.

    2. Unique Sourcing Capabilities
    Videos are collected through controlled contribution pipelines designed to generate authentic, unscripted real-world footage. This provides:
    - Natural human movement and behavior
    - Diverse environments and camera devices
    - Continuous flow of fresh recordings
    - Ability to generate custom datasets (e.g., specific actions, locations, lighting, demographics, or motion patterns)

    3. Global Visual & Cultural Diversity
    Contributors from 100+ countries supply:
    - Indoor and outdoor recordings
    - Urban, rural, and specialized environments
    - Varied cultural behaviors, activities, and settings
    - Multiple languages and speaking styles where speech is present
    This ensures robust generalization for global deployment.

    4. High-Quality, Realistic Video Capture
    Data includes a wide range of visual conditions:
    - 4K, HD, and consumer-grade recordings
    - Static, handheld, and moving cameras
    - Low-light, daylight, and variable lighting
    - Clean vs. noisy audio channels
    - Natural occlusions, motion blur, and complex backgrounds
    This diversity supports training models for real-world reliability and robustness.

    5. AI-Ready Dataset Architecture
    Optimized for modern ML workflows, enabling:
    - Video classification & action recognition
    - Video captioning & summarization
    - Vision-language model (VLM) alignment
    - Multimodal reasoning & grounding
    - Safety, moderation, and risk detection
    - Tracking, segmentation, and object detection
    Compatible with leading ML frameworks and training pipelines.

    6. Licensing & Compliance
    - Fully compliant with global privacy standards
    - Explicit contributor consent for video usage
    - Documented rights and usage permissions
    - Vetted for commercial and research use

    Use Cases
    - Training video classification and action-recognition models
    - Vision-language model pretraining
    - Multimodal AI for enterprise and consumer applications
    - Safety, moderation, and anomaly detection
    - Video captioning, retrieval, and summarization
    - Research in activity analysis, human behavior, and multimodal grounding

  11. UCA (UCF Crime Annotation) Dataset

    • kaggle.com
    zip
    Updated Sep 18, 2024
    Cite
    Keesari Vigneshwar Reddy (2024). UCA(UCF Crime Annotation) Dataset [Dataset]. https://www.kaggle.com/datasets/vigneshwar472/ucaucf-crime-annotation-dataset/code
    Explore at:
    Available download formats: zip (103,085,257,090 bytes)
    Dataset updated
    Sep 18, 2024
    Authors
    Keesari Vigneshwar Reddy
    Description

    This dataset comes from the research project at https://xuange923.github.io/Surveillance-Video-Understanding

    All credit goes to the researchers involved. I highly recommend reading the research paper for a better, more concrete understanding of the dataset and of the researchers' experiments on Temporal Sentence Grounding in Videos, Video Captioning, Dense Video Captioning, and Multimodal Anomaly Detection.

    The description here covers the key takeaways about the dataset.

    Need

    Current surveillance video tasks mainly focus on classifying and localizing anomalous events, and surveillance video datasets lack sentence-level language annotations. The researchers therefore propose a new research direction, surveillance video-and-language understanding, by constructing the UCA (UCF-Crime Annotation) Dataset.

    Description

    The researchers manually annotated the event content and event occurrence time for 1,854 videos from UCF-Crime, calling the result UCF-Crime Annotation (UCA). The dataset contains 23,542 sentences, with an average length of 20 words, and its annotated videos total 110.7 hours.
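
    Each UCA annotation pairs an event sentence with its start and end time in the source UCF-Crime video, which is exactly the input format temporal-sentence-grounding models consume. The structure below is an assumption for illustration (the description above does not give the file layout); see the project page for the real annotation files.

    from dataclasses import dataclass

    @dataclass
    class UCAAnnotation:
        video_id: str   # source UCF-Crime video
        start_s: float  # event start time (seconds)
        end_s: float    # event end time (seconds)
        sentence: str   # event description (~20 words on average, per the stats above)

    # Hypothetical record.
    ann = UCAAnnotation("Abuse001_x264", 12.0, 31.5,
                        "A man in a black jacket pushes another man to the ground.")
    print(f"[{ann.start_s:6.1f}-{ann.end_s:6.1f}] {ann.video_id}: {ann.sentence}")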

    [Figures: screenshots of dataset statistics and sample annotations; see the Kaggle dataset page.]

    How did the researchers annotate the data?

    [Figure: screenshot of the researchers' annotation workflow; see the Kaggle dataset page.]

    Citation

    @misc{yuan2023surveillance,
    title={Towards Surveillance Video-and-Language Understanding: New Dataset, Baselines, and Challenges}, 
    author={Tongtong Yuan and Xuange Zhang and Kun Liu and Bo Liu and Chen Chen and Jian Jin and Zhenzhen Jiao},
    year={2023},
    eprint={2309.13925},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
    }    
    
  12. DAVIS_data

    • kaggle.com
    zip
    Updated Mar 14, 2019
    Cite
    Monson (2019). DAVIS_data [Dataset]. https://www.kaggle.com/mengzj/davis-data
    Explore at:
    Available download formats: zip (3,916,975,760 bytes)
    Dataset updated
    Mar 14, 2019
    Authors
    Monson
    Description

    Dataset

    This dataset was created by Monson


  13. 5K+ Hours of Facial Expression Video Data | AI Training Data | Annotated...

    • datarade.ai
    Updated Dec 15, 2025
    Cite
    DataSeeds.AI (2025). 5K+ Hours of Facial Expression Video Data | AI Training Data | Annotated Video for AI | Bounding Boxes, Action Labels & Scene Descriptions | Global [Dataset]. https://datarade.ai/data-products/5k-hours-of-facial-expression-video-data-ai-training-data-data-seeds
    Explore at:
    Available download formats: .json, .xml, .csv, .txt, .mp4, .mov
    Dataset updated
    Dec 15, 2025
    Dataset provided by
    DataSeeds.AI
    Area covered
    Turkey, Guyana, Croatia, Liberia, Samoa, Monaco, Nigeria, Japan, Gibraltar, India
    Description

    This dataset contains over 5,000 hours of facial expression video recordings captured worldwide. Designed for AI and machine-learning applications, it offers richly annotated, context-dense video data ideal for training vision-language models, action-recognition systems, and multimodal reasoning.

    Key Features

    1. Comprehensive Video Annotation Layers
    Each video includes synchronized metadata across visual and audio channels, such as:
    - Object annotations (bounding boxes, segmentation masks)
    - Action labels and activity timelines
    - Temporal event boundaries
    - Transcripts for scenes containing speech
    - Visual scene descriptions covering environment, objects, actions, and context
    - Camera metadata (motion type, angle, field of view, lighting conditions)
    This enables training for activity detection, video captioning, tracking, VLM grounding, and multimodal understanding.

    2. Unique Sourcing Capabilities
    Videos are collected through controlled contribution pipelines designed to generate authentic, unscripted real-world footage. This provides:
    - Natural human movement and behavior
    - Diverse environments and camera devices
    - Continuous flow of fresh recordings
    - Ability to generate custom datasets (e.g., specific actions, locations, lighting, demographics, or motion patterns)

    3. Global Visual & Cultural Diversity
    Contributors from 100+ countries supply:
    - Indoor and outdoor recordings
    - Urban, rural, and specialized environments
    - Varied cultural behaviors, activities, and settings
    - Multiple languages and speaking styles where speech is present
    This ensures robust generalization for global deployment.

    4. High-Quality, Realistic Video Capture
    Data includes a wide range of visual conditions:
    - 4K, HD, and consumer-grade recordings
    - Static, handheld, and moving cameras
    - Low-light, daylight, and variable lighting
    - Clean vs. noisy audio channels
    - Natural occlusions, motion blur, and complex backgrounds
    This diversity supports training models for real-world reliability and robustness.

    5. AI-Ready Dataset Architecture
    Optimized for modern ML workflows, enabling:
    - Video classification & action recognition
    - Video captioning & summarization
    - Vision-language model (VLM) alignment
    - Multimodal reasoning & grounding
    - Safety, moderation, and risk detection
    - Tracking, segmentation, and object detection
    Compatible with leading ML frameworks and training pipelines.

    6. Licensing & Compliance
    - Fully compliant with global privacy standards
    - Explicit contributor consent for video usage
    - Documented rights and usage permissions
    - Vetted for commercial and research use

    Use Cases
    - Training video classification and action-recognition models
    - Vision-language model pretraining
    - Multimodal AI for enterprise and consumer applications
    - Safety, moderation, and anomaly detection
    - Video captioning, retrieval, and summarization
    - Research in activity analysis, human behavior, and multimodal grounding

  14. SceneWalk

    • huggingface.co
    Updated Nov 25, 2024
    Cite
    Integrated Vision Language Lab (2024). SceneWalk [Dataset]. https://huggingface.co/datasets/IVLLab/SceneWalk
    Explore at:
    Dataset updated
    Nov 25, 2024
    Dataset authored and provided by
    Integrated Vision Language Lab
    Description

    SceneWalk Dataset Card

      Dataset details
    

    Dataset type: SceneWalk is a new high-quality video dataset with thorough captioning for each video. It includes dense and detailed descriptions for every video segment across the entire scene context. The SceneWalk dataset, sourced from 87.8K long, untrimmed YouTube videos (avg. 486 seconds each), features frequent scene transitions across a total of 11.8K hours of video and 1.3M segmented video clips. … See the full description on the dataset page: https://huggingface.co/datasets/IVLLab/SceneWalk.
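
    The headline numbers are internally consistent, which is worth a quick check when evaluating a dataset card:

    # 87.8K videos at an average of 486 s each should give roughly the stated 11.8K hours.
    num_videos = 87_800
    avg_len_s = 486
    print(f"{num_videos * avg_len_s / 3600:,.0f} hours")               # ~11,853 hours
    print(f"~{1_300_000 / num_videos:.0f} segmented clips per video")  # ~15 clips/video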

  15. Video Scene Parsing in the Wild (VSPW)

    • resodate.org
    Updated Dec 16, 2024
    Cite
    Bo Yan; Leilei Cao; Hongbin Wang (2024). Video Scene Parsing in the Wild (VSPW) [Dataset]. https://resodate.org/resources/aHR0cHM6Ly9zZXJ2aWNlLnRpYi5ldS9sZG1zZXJ2aWNlL2RhdGFzZXQvdmlkZW8tc2NlbmUtcGFyc2luZy1pbi10aGUtd2lsZC0tdnNwdy0=
    Explore at:
    Dataset updated
    Dec 16, 2024
    Dataset provided by
    Leibniz Data Manager
    Authors
    Bo Yan; Leilei Cao; Hongbin Wang
    Description

    Video scene parsing in the wild, with its diverse scenarios, is a challenging task of great significance, especially given the rapid development of autonomous-driving technology. The Video Scene Parsing in the Wild (VSPW) dataset contains well-trimmed, long-duration, densely annotated, high-resolution clips.

  16. SynthEVox3D

    • kaggle.com
    zip
    Updated Sep 5, 2023
    Cite
    H.C. (2023). SynthEVox3D [Dataset]. https://www.kaggle.com/datasets/hche8927/synthevox3d
    Explore at:
    Available download formats: zip (30,136,019,764 bytes)
    Dataset updated
    Sep 5, 2023
    Authors
    H.C.
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Event cameras are sensors that are inspired by biological systems and specialize in capturing changes in brightness. These emerging cameras offer numerous advantages over conventional frame-based cameras, including high dynamic range, high frame rates, and extremely low power consumption. As a result, event cameras are increasingly being used in various fields, such as object detection and tracking, autonomous driving, 3D reconstruction, visual odometry, and SLAM.

    We have created the first large-scale synthetic event-camera voxel 3D reconstruction dataset, comprising over 39,739 simulated event-camera 3D object scans from 13 different object categories. Each entry in the dataset contains a 0.5-second, 240 fps high-frame-rate RGB video scan, simulated event-camera data, the original 3D model, and a converted 32x32x32 voxel model.

    The 3D models used in this dataset are from ShapeNet (Link: https://shapenet.org/).

    Although this dataset provides only voxel representations as ground truth, obtaining other representations, such as point clouds, is trivial with the provided glTF 3D models. We hope that by publishing this dataset we can accelerate the advancement of event-based 3D reconstruction.
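
    As a concrete example of that conversion, an occupied 32x32x32 voxel grid turns into a point cloud with a single NumPy call. The .npy file name and 0/1 occupancy encoding below are assumptions; the summary above does not specify the on-disk voxel format.

    import numpy as np

    voxels = np.load("chair_0001_voxels.npy")  # assumed shape (32, 32, 32), 0/1 occupancy
    assert voxels.shape == (32, 32, 32)

    # Coordinates of occupied cells, shifted to voxel centers and normalized to [-0.5, 0.5).
    points = np.argwhere(voxels > 0).astype(np.float32)
    points = (points + 0.5) / 32.0 - 0.5
    print(f"{len(points)} occupied voxels -> point cloud of shape {points.shape}")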

    The original paper is available at:

    IEEE Xplore - https://ieeexplore.ieee.org/document/10169359

    ArXiv - https://arxiv.org/abs/2309.00385

    !!! Due to limited resources, we are unable to release the full dataset, which is around 1.2 TB in size. We would greatly appreciate any organizations willing to host the full dataset for us. (Contact: haodong.chen@sydney.edu.au)

    !!! The released SynthEVox3D-Tiny dataset, which is the dataset used in the original paper, is around 32 GB. We also provide scripts in the utils folder of the dataset to reproduce our results, making it possible for other researchers to recreate the entire dataset from scratch.

    Citation

    Plain text

    H. Chen, V. Chung, L. Tan and X. Chen, "Dense Voxel 3D Reconstruction Using a Monocular Event Camera," 2023 9th International Conference on Virtual Reality (ICVR), Xianyang, China, 2023, pp. 30-35, doi: 10.1109/ICVR57957.2023.10169359.
    

    BibTex

    @INPROCEEDINGS{10169359,
     author={Chen, Haodong and Chung, Vera and Tan, Li and Chen, Xiaoming},
     booktitle={2023 9th International Conference on Virtual Reality (ICVR)}, 
     title={Dense Voxel 3D Reconstruction Using a Monocular Event Camera}, 
     year={2023},
     volume={},
     number={},
     pages={30-35},
     doi={10.1109/ICVR57957.2023.10169359}}
    
  17. 10K+ Hours of Human Actions Video Data | AI Training Data | Annotated Video...

    • datarade.ai
    Updated Dec 15, 2025
    + more versions
    Cite
    DataSeeds.AI (2025). 10K+ Hours of Human Actions Video Data | AI Training Data | Annotated Video for AI | Bounding Boxes, Action Labels & Scene Description | Global Covera [Dataset]. https://datarade.ai/data-products/10k-hours-of-human-actions-video-data-ai-training-data-a-data-seeds
    Explore at:
    Available download formats: .json, .xml, .csv, .txt, .mp4, .mov
    Dataset updated
    Dec 15, 2025
    Dataset provided by
    DataSeeds.AI
    Area covered
    Egypt, Antarctica, Lesotho, Bulgaria, Macao, Pitcairn, British Indian Ocean Territory, Madagascar, Guam, Mauritius
    Description

    This dataset contains over 10,000 hours of human actions video recordings captured worldwide. Designed for AI and machine-learning applications, it offers richly annotated, context-dense video data ideal for training vision-language models, action-recognition systems, and multimodal reasoning.

    Key Features

    1. Comprehensive Video Annotation Layers
    Each video includes synchronized metadata across visual and audio channels, such as:
    - Object annotations (bounding boxes, segmentation masks)
    - Action labels and activity timelines
    - Temporal event boundaries
    - Transcripts for scenes containing speech
    - Visual scene descriptions covering environment, objects, actions, and context
    - Camera metadata (motion type, angle, field of view, lighting conditions)
    This enables training for activity detection, video captioning, tracking, VLM grounding, and multimodal understanding.

    2. Unique Sourcing Capabilities
    Videos are collected through controlled contribution pipelines designed to generate authentic, unscripted real-world footage. This provides:
    - Natural human movement and behavior
    - Diverse environments and camera devices
    - Continuous flow of fresh recordings
    - Ability to generate custom datasets (e.g., specific actions, locations, lighting, demographics, or motion patterns)

    3. Global Visual & Cultural Diversity
    Contributors from 100+ countries supply:
    - Indoor and outdoor recordings
    - Urban, rural, and specialized environments
    - Varied cultural behaviors, activities, and settings
    - Multiple languages and speaking styles where speech is present
    This ensures robust generalization for global deployment.

    4. High-Quality, Realistic Video Capture
    Data includes a wide range of visual conditions:
    - 4K, HD, and consumer-grade recordings
    - Static, handheld, and moving cameras
    - Low-light, daylight, and variable lighting
    - Clean vs. noisy audio channels
    - Natural occlusions, motion blur, and complex backgrounds
    This diversity supports training models for real-world reliability and robustness.

    5. AI-Ready Dataset Architecture
    Optimized for modern ML workflows, enabling:
    - Video classification & action recognition
    - Video captioning & summarization
    - Vision-language model (VLM) alignment
    - Multimodal reasoning & grounding
    - Safety, moderation, and risk detection
    - Tracking, segmentation, and object detection
    Compatible with leading ML frameworks and training pipelines.

    6. Licensing & Compliance
    - Fully compliant with global privacy standards
    - Explicit contributor consent for video usage
    - Documented rights and usage permissions
    - Vetted for commercial and research use

    Use Cases
    - Training video classification and action-recognition models
    - Vision-language model pretraining
    - Multimodal AI for enterprise and consumer applications
    - Safety, moderation, and anomaly detection
    - Video captioning, retrieval, and summarization
    - Research in activity analysis, human behavior, and multimodal grounding

  18. Video-Detailed-Caption

    • huggingface.co
    Updated Oct 4, 2024
    Cite
    Wenhao Chai (2024). Video-Detailed-Caption [Dataset]. https://huggingface.co/datasets/wchai/Video-Detailed-Caption
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 4, 2024
    Authors
    Wenhao Chai
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Video Detailed Caption Benchmark

      Resources
    

    - Website
    - arXiv: Paper
    - GitHub: Code
    - Huggingface: AuroraCap Model
    - Huggingface: VDC Benchmark
    - Huggingface: Trainset

      Features

      Benchmark Collection and Processing

    We build VDC upon Panda-70M, Ego4D, Mixkit, Pixabay, and Pexels.

    Structured detailed captions construction pipeline: we develop a structured detailed-captions construction pipeline to generate extra detailed descriptions from various… See the full description on the dataset page: https://huggingface.co/datasets/wchai/Video-Detailed-Caption.

  19. 10K+ Hours of Object Manipulation Video Data | AI Training Data | Annotated...

    • datarade.ai
    Updated Dec 15, 2025
    + more versions
    Cite
    DataSeeds.AI (2025). 10K+ Hours of Object Manipulation Video Data | AI Training Data | Annotated Video for AI | Bounding Boxes, Action Labels & Scene Description | Global [Dataset]. https://datarade.ai/data-products/10k-hours-of-object-manipulation-video-data-ai-training-da-data-seeds
    Explore at:
    Available download formats: .json, .xml, .csv, .txt, .mp4, .mov
    Dataset updated
    Dec 15, 2025
    Dataset provided by
    DataSeeds.AI
    Area covered
    Monaco, United Kingdom, Estonia, Jordan, Tajikistan, Timor-Leste, Saint Kitts and Nevis, Côte d'Ivoire, Gabon, Turks and Caicos Islands
    Description

    This dataset contains over 10,000 hours of object manipulation video recordings captured worldwide. Designed for AI and machine-learning applications, it offers richly annotated, context-dense video data ideal for training vision-language models, action-recognition systems, and multimodal reasoning.

    Key Features

    1. Comprehensive Video Annotation Layers
    Each video includes synchronized metadata across visual and audio channels, such as:
    - Object annotations (bounding boxes, segmentation masks)
    - Action labels and activity timelines
    - Temporal event boundaries
    - Transcripts for scenes containing speech
    - Visual scene descriptions covering environment, objects, actions, and context
    - Camera metadata (motion type, angle, field of view, lighting conditions)
    This enables training for activity detection, video captioning, tracking, VLM grounding, and multimodal understanding.

    2. Unique Sourcing Capabilities
    Videos are collected through controlled contribution pipelines designed to generate authentic, unscripted real-world footage. This provides:
    - Natural human movement and behavior
    - Diverse environments and camera devices
    - Continuous flow of fresh recordings
    - Ability to generate custom datasets (e.g., specific actions, locations, lighting, demographics, or motion patterns)

    3. Global Visual & Cultural Diversity
    Contributors from 100+ countries supply:
    - Indoor and outdoor recordings
    - Urban, rural, and specialized environments
    - Varied cultural behaviors, activities, and settings
    - Multiple languages and speaking styles where speech is present
    This ensures robust generalization for global deployment.

    4. High-Quality, Realistic Video Capture
    Data includes a wide range of visual conditions:
    - 4K, HD, and consumer-grade recordings
    - Static, handheld, and moving cameras
    - Low-light, daylight, and variable lighting
    - Clean vs. noisy audio channels
    - Natural occlusions, motion blur, and complex backgrounds
    This diversity supports training models for real-world reliability and robustness.

    5. AI-Ready Dataset Architecture
    Optimized for modern ML workflows, enabling:
    - Video classification & action recognition
    - Video captioning & summarization
    - Vision-language model (VLM) alignment
    - Multimodal reasoning & grounding
    - Safety, moderation, and risk detection
    - Tracking, segmentation, and object detection
    Compatible with leading ML frameworks and training pipelines.

    6. Licensing & Compliance
    - Fully compliant with global privacy standards
    - Explicit contributor consent for video usage
    - Documented rights and usage permissions
    - Vetted for commercial and research use

    Use Cases
    - Training video classification and action-recognition models
    - Vision-language model pretraining
    - Multimodal AI for enterprise and consumer applications
    - Safety, moderation, and anomaly detection
    - Video captioning, retrieval, and summarization
    - Research in activity analysis, human behavior, and multimodal grounding

  20. Densely Annotated Video Driving Data Set

    • open.bydata.de
    zip
    Updated May 27, 2021
    Cite
    Universitätsbibliothek der Technischen Universität München (2021). Densely Annotated Video Driving Data Set [Dataset]. https://open.bydata.de/datasets/https-mediatum-ub-tum-de-1596437-dataset
    Explore at:
    Available download formats: zip
    Dataset updated
    May 27, 2021
    Dataset authored and provided by
    Universitätsbibliothek der Technischen Universität München
    License

    http://dcat-ap.de/def/licenses/cc-by

    Description

    This dataset consists of 28 driving video sequences recorded in the CARLA simulator, totaling 10,767 frames. Pixel-wise semantic labels are provided for each frame. The scenes are recorded under dynamic weather and traffic conditions.
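
    CARLA-style semantic segmentation is commonly delivered as single-channel class-ID images, one per frame. The sketch below reads one label image and tallies class frequencies; the path and encoding are assumptions, so check the archive's README for the actual layout.

    import numpy as np
    from PIL import Image  # pip install pillow

    label = np.array(Image.open("seq_01/labels/000000.png"))  # hypothetical path
    classes, counts = np.unique(label, return_counts=True)
    for c, n in zip(classes, counts):
        print(f"class {c}: {n} px ({100 * n / label.size:.1f}%)")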
