Dataset Card for LLaVA-Video-178K
Uses
This dataset is used to train the LLaVA-Video model. We permit its use for academic research and educational purposes only. For data generated with OpenAI GPT-4, we recommend that users review the OpenAI Usage Policy.
Data Sources
To train LLaVA-Video, we used video-language data from five primary sources:
LLaVA-Video-178K: This dataset includes 178,510 caption entries, 960,792 open-ended… See the full description on the dataset page: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for LLaVA-Video-small-swift
A small subset of LLaVA-Video-178K, intended as educational material for learning how to fine-tune video models.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
farewellthree/llava-video-json dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
TinyLLaVA-Video
This dataset combines data from multiple sources for pre-training and fine-tuning. Pretrain Data: Four subsets of LLaVA-Video-178K (0_30_s_academic_v0_1, 30_60_s_academic_v0_1, 0_30_s_youtube_v0_1, 30_60_s_youtube_v0_1), supplemented with filtered Video-LLaVA data (https://huggingface.co/datasets/LanguageBind/Video-LLaVA) and data from Valley (https://github.com/RupertLuo/Valley). The video data can be downloaded from the linked datasets, and cleaned annotations are provided… See the full description on the dataset page: https://huggingface.co/datasets/Zhang199/TinyLLaVA-Video-v1-training-data.
weili-0234/llava-video-178k-frames dataset hosted on Hugging Face and contributed by the HF Datasets community
Xiaodong/LLaVA-Video-2_3_m_youtube_mc-qwen_filter_1 dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
TinyLLaVA-Video-R1
We select multiple-choice questions from the NextQA subset of LLaVA-Video-178K as training data. To keep training time manageable with limited computational resources, we choose only the subset with durations of 0 to 30 seconds, which contains 5,496 samples. In addition, we manually annotate 16 samples for cold-start training and provide the annotations.
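The selection step above can be sketched as follows. This is a hypothetical illustration: the field names (`"source"`, `"question_type"`, `"duration"`) are assumptions, since the card does not specify the annotation schema.

```python
# Hypothetical sketch of the data selection described above: keep only
# multiple-choice NextQA samples whose clips run 0-30 seconds.
# Field names ("source", "question_type", "duration") are assumed, not
# taken from the dataset's actual schema.

def select_short_mc_samples(samples, max_duration=30):
    """Return NextQA samples that are multiple choice and at most max_duration seconds long."""
    return [
        s for s in samples
        if s.get("source") == "NextQA"
        and s.get("question_type") == "multi_choice"
        and 0 <= s.get("duration", float("inf")) <= max_duration
    ]

if __name__ == "__main__":
    demo = [
        {"source": "NextQA", "question_type": "multi_choice", "duration": 12},
        {"source": "NextQA", "question_type": "open_ended", "duration": 8},
        {"source": "NextQA", "question_type": "multi_choice", "duration": 95},
    ]
    print(len(select_short_mc_samples(demo)))  # 1
```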
Organize Data
Organize the files and annotation files as follows in path/to/your/dataset: dataset ├──… See the full description on the dataset page: https://huggingface.co/datasets/Zhang199/TinyLLaVA-Video-R1-training-data.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Active lava lakes represent a variety of open-vent volcanism in which a sizable body of lava accumulates at the top of the magma column, constrained by the vent and/or crater geometry. The longevity of lava lakes reflects a balancing of cooling and outgassing occurring at the surface and input of hot and gas-rich magma from below. Due to their longevity and relative accessibility, lava lakes provide a natural laboratory for studying fundamental volcanic processes such as degassing, convection and cooling. This article examines all seven lakes that existed at the time of writing in 2018, located in the Pacific, Antarctica, Africa, and South and Central America. These lakes span all tectonic environments, and a range of magma compositions. We focus on analysis of the lake surface motion using image velocimetry, which reveals both similarities and contrasts in outgassing and lake dynamics when comparing the different lakes. We identify two categories of lake behavior: Organized (Erta'Ale, Nyiragongo, Kīlauea after 2011, and Erebus) and Chaotic (Villarrica, Masaya, Marum). This division does not map directly to lake size, viscosity, gas emission rate, or temperature. Instead, when examined together, we find that the lakes follow a linear relationship between average surface speed and the ratio of total gas flux to lake surface area. This relationship points to the combined importance of both flux and lake size in addition to the total volume of gas emission, and suggests that a shared deep mechanism controls the supply of heat and gas to all lakes. On the other hand, the differences between Chaotic and Organized lakes highlight the important role of the geometry of the conduit-lake transition, which superimposes a shallow signal on that of the deep circulation. 
The spatial patterns of surface motion we document suggest that the release of gas bubbles at Chaotic lakes is more efficient (i.e., bubbles are less likely to be retained and recycled) compared with Organized lakes. In addition, the data presented here indicate that the solidified crust of Organized lakes plays a role in regulating convection and outgassing in lava lakes.
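The linear relationship described above can be written compactly; the notation here is ours, not the article's:

```latex
\bar{v} \;\approx\; k \,\frac{Q_{\mathrm{gas}}}{A_{\mathrm{lake}}}
```

where \(\bar{v}\) is the average lake surface speed, \(Q_{\mathrm{gas}}\) the total gas flux, \(A_{\mathrm{lake}}\) the lake surface area, and \(k\) an empirical constant shared across the lakes studied.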
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This repository contains the data presented in Video-R1: Reinforcing Video Reasoning in MLLMs.
Code: https://github.com/tulerfeng/Video-R1
Video data folders: CLEVRER, LLaVA-Video-178K, NeXT-QA, PerceptionTest, STAR
Image data folders: Chart, General, Knowledge, Math, OCR, Spatial
Video-R1-COT-165k.json is for SFT cold start, and Video-R1-260k.json is for RL training.
Data format in Video-R1-COT-165k: { "problem_id": 2, "problem": "What appears on the screen in Russian during the… See the full description on the dataset page: https://huggingface.co/datasets/Video-R1/Video-R1-data.
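A minimal sketch of consuming the annotation file, assuming Video-R1-COT-165k.json holds a JSON array of records shaped like the excerpt above; only the `problem_id` and `problem` keys are confirmed by the card, and any other keys in the real file are simply ignored here.

```python
import json

def load_problems(path):
    """Map problem_id -> problem text from a Video-R1-style JSON file.

    Assumes the file is a JSON array of objects; only the "problem_id"
    and "problem" keys shown in the card's excerpt are used.
    """
    with open(path, "r", encoding="utf-8") as f:
        records = json.load(f)
    return {r["problem_id"]: r["problem"] for r in records}
```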
VideoEspresso
This dataset is the multi-image version.
Leaderboard
| Model | Params | Frames | Overall | Narrative Analysis | Event Dynamic | Preparation Steps | Causal Analysis | Theme Analysis | Contextual Analysis | Influence Analysis | Role Analysis | Interaction Analysis | Behavior Analysis | Emotion Analysis | Cooking Process | Traffic Analysis | Situation Analysis |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LLaVA-Video | 72B | 64 | 66.3% | 68.4% | 66.2% | 74.5% | 62.7% | 62.3% | 71.6% | 62.5% | 63.5% | 67.7% | 63.2% | 60.0% | 75.5% | 76.7% | 74.0% |
LLaVA-OneVision… See the full description on the dataset page: https://huggingface.co/datasets/hshjerry0315/VideoEspresso_train_multi_image.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
V-NIAH-D Benchmark
A Visual Needle-In-A-Haystack Benchmark with Periodic Distractors. It was presented in VideoRoPE: What Makes for Good Video Rotary Position Embedding?. One can use it by following steps similar to V-NIAH.
VideoRoPE Training Data
To facilitate the reproduction of our experimental results, we have also uploaded the data used by VideoRoPE. We use a subset of the LLaVA-Video-178K dataset to train VideoRoPE. The LLaVA-Video-178K dataset consists of 178K… See the full description on the dataset page: https://huggingface.co/datasets/Wiselnn/VideoRoPE.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
🎞MMTrail: A Multimodal Trailer Video Dataset with Language and Music Descriptions
MMTrail is a large-scale multi-modality video-language dataset with over 20M trailer clips, featuring high-quality multimodal captions that integrate context, visual frames, and background music, aiming to enhance cross-modality studies and fine-grained multimodal-language model training. In short, we provided 2M+ LLaVA Video captions, 2M+ Music captions, and 60M+ Coca frame captions for 27.1khrs of… See the full description on the dataset page: https://huggingface.co/datasets/litwell/MMTrail-20M.
Xiaodong/LaVA-Video-2_3_m_youtube_mc-4o dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution-NonCommercial-NoDerivs 3.0 (CC BY-NC-ND 3.0): https://creativecommons.org/licenses/by-nc-nd/3.0/
License information was derived automatically
🧬 ViDRiP-LLaVA: A Dataset and Benchmark for Diagnostic Reasoning from Pathology Videos
ViDRiP-LLaVA is a vision-language framework designed for instruction-based diagnostic reasoning using both image patches and video clips from pathology slides. It builds on LLaVA and extends it to the medical domain with domain-specific datasets and fine-tuned models. 🧠 Introducing our ViDRiP-LLaVA: the first multimodal model for diagnostic reasoning in pathology through video-based instruction.… See the full description on the dataset page: https://huggingface.co/datasets/trinhvg/ViDRiP_Instruct_Test.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for Dataset Name
YouTube video clips processed for a conversational LLaVA model. This dataset card aims to be a base template for new datasets. It has been generated using this raw template.
Dataset Description
Video data are segmented into 30-second intervals. Each interval is converted into a collage of 3 × 3 uniformly selected frames. The dataset is generated in two stages:
Basic Llava model tasked with describing the 3 x 3 collage. Llama 3 prompted… See the full description on the dataset page: https://huggingface.co/datasets/Ftest/VTdataset.
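The interval-to-collage step described above can be sketched with NumPy. This is an illustration under stated assumptions: the card only says that frames are uniformly selected into a 3 × 3 grid, so the frame rate and the row-major tiling order here are our choices, not the pipeline's confirmed details.

```python
import numpy as np

# Sketch of the collage step: pick 9 uniformly spaced frames from a
# decoded 30-second interval and tile them into a 3 x 3 grid.
# Tiling order (row-major) and frame rate are assumptions.

def make_collage(frames: np.ndarray) -> np.ndarray:
    """frames: (num_frames, H, W, C) array -> (3H, 3W, C) collage."""
    idx = np.linspace(0, len(frames) - 1, num=9).round().astype(int)
    picked = frames[idx]  # (9, H, W, C)
    rows = [np.concatenate(picked[r * 3:(r + 1) * 3], axis=1) for r in range(3)]
    return np.concatenate(rows, axis=0)

if __name__ == "__main__":
    clip = np.zeros((90, 32, 32, 3), dtype=np.uint8)  # fake 30 s clip at 3 fps
    print(make_collage(clip).shape)  # (96, 96, 3)
```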
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Causal2Needles
Overview
Causal2Needles is a benchmark dataset and evaluation toolkit designed to assess the capabilities of vision-language models (e.g., Gemini-1.5-Pro and LLaVA-Next-Video-7B) in long-video understanding and causal reasoning. This repository provides:
Dataset (Videos, Questions, Narration...) Instructions for downloading and setting up the dataset Example scripts for testing models Automated evaluation of model performance across three types… See the full description on the dataset page: https://huggingface.co/datasets/causal2needles/Causal2Needles.
Attribution-NonCommercial-NoDerivs 3.0 (CC BY-NC-ND 3.0): https://creativecommons.org/licenses/by-nc-nd/3.0/
License information was derived automatically
⚠️ Access Required: To access the files in this dataset, you must agree to the CC BY-NC-ND 3.0 license terms. This dataset is for academic research use only and is not intended for commercial or clinical applications.
🧬 ViDRiP-LLaVA: A Dataset and Benchmark for Diagnostic Reasoning from Pathology Videos
ViDRiP-LLaVA is a vision-language framework designed for instruction-based diagnostic reasoning using both image patches and video clips from pathology slides. It builds on LLaVA and… See the full description on the dataset page: https://huggingface.co/datasets/trinhvg/ViDRiP_Instruct_Train.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
M4-IT
This dataset, M4-IT, is a synthetic instruction finetuning dataset used in the development of the M4 framework, designed to enhance real-time interactive reasoning in multi-modal language models. The M4 framework is evaluated on OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts.
Data Description
Building on the LLaVA-NeXT-Data, we crafted a small video-free synthetic instruction finetuning dataset, M4-IT, with the assistance… See the full description on the dataset page: https://huggingface.co/datasets/ColorfulAI/M4-IT.