Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Long-RVOS: A Comprehensive Benchmark for Long-term Referring Video Object Segmentation
GitHub Repository: https://github.com/iSEE-Laboratory/Long_RVOS | Project Page: https://isee-laboratory.github.io/Long-RVOS/ | Paper: arXiv:2505.12702
Dataset Description
Dataset Summary
Long-RVOS is the first large-scale long-term referring video object segmentation benchmark, containing 2,000+ videos with an average duration exceeding 60 seconds. The dataset addresses… See the full description on the dataset page: https://huggingface.co/datasets/iSEE-Laboratory/Long-RVOS.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
License information was derived automatically
MOVE: Motion-Guided Few-Shot Video Object Segmentation
🏠 Homepage | 📄 Paper | 🔗 GitHub
Abstract
This work addresses motion-guided few-shot video object segmentation (FSVOS), which aims to segment dynamic objects in videos based on a few annotated examples with the same motion patterns. Existing FSVOS datasets and methods typically focus on object categories, which are static attributes that ignore the rich temporal dynamics in videos, limiting their application in… See the full description on the dataset page: https://huggingface.co/datasets/FudanCVL/MOVE.
The dataset (Lego_Tracking folder) was created manually by recording 12 videos with a smartphone: 10 designated for training and 2 for testing. The videos show conveyor belts transporting LEGO bricks, captured from various perspectives (top, front, and diagonal) to provide diverse viewpoints. They were recorded in the AI laboratory of Eötvös Loránd University (https://github.com/BahruzHuseynov/Object-Tracking-AI_Lab), and the dataset was used for research on a detection-segmentation-tracking pipeline as part of the AI laboratory work. The dataset includes videos of differing complexity, classified as "Overlapping," "Normal," or "Simple," with durations ranging from short to long shots. Additionally, the LEGO bricks were annotated frame by frame using the RoboFlow web application (https://roboflow.com/).
In addition, a companion dataset, prepared by systematic sampling of frames from the training videos to train and validate YOLOv8 and RT-DETR models from Ultralytics, is available at https://www.kaggle.com/datasets/hbahruz/multiple-lego/data (a sketch of such sampling appears after the tables below).
Another prepared dataset can be used for semantic segmentation: https://www.kaggle.com/datasets/hbahruz/lego-semantic-segmentation/
| Test | View | Complexity | Only LEGO | Frames per second | Approximate duration (seconds) | Num. frames |
|---|---|---|---|---|---|---|
| Video 1 | Top | Normal | + | 20 | 68 | 1401 |
| Video 2 | Diagonal | Normal | + | 25 | 57 | 1444 |

| Training | View | Complexity | Only LEGO | Frames per second | Approximate duration (seconds) | Num. frames |
|---|---|---|---|---|---|---|
| Video 1 | Top | Overlapping | + | 16 | 19 | 300 |
| Video 2 | Front | Overlapping | + | 13 | 16 | 196 |
| Video 3 | Diagonal | Normal | + | 20 | 56 | 1136 |
| Video 4 | Top | Overlapping | + | 20 | 42 | 839 |
| Video 5 | Diagonal | Overlapping | + | 21 | 14 | 839 |
| Video 6 | Top | Normal | - | 20 | 50 | 1000 |
| Video 7 | Top | Simple | + | 15 | 20 | 303 |
| Video 8 | Diagonal | Normal | + | 13 | 13 | 277 |
| Video 9 | Top | Normal | + | 19 | 28 | 537 |
| Video 10 | Front | Normal | - | 20 | 58 | 1162 |
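The systematic sampling mentioned above is not detailed in the dataset description; the sketch below shows one plausible way to extract every k-th frame from a training video with OpenCV. The file names, output directory, and sampling interval are assumptions for illustration.

```python
# Minimal sketch: systematic sampling of frames from a training video (assumed interval).
import cv2
from pathlib import Path

def sample_frames(video_path: str, out_dir: str, interval: int = 20) -> int:
    """Save every `interval`-th frame of `video_path` into `out_dir` as JPEG images."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    saved, index = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % interval == 0:
            cv2.imwrite(str(out / f"frame_{index:05d}.jpg"), frame)
            saved += 1
        index += 1
    cap.release()
    return saved

# Example (hypothetical paths): sample every 20th frame of training video 3.
# n = sample_frames("Lego_Tracking/train/video_3.mp4", "sampled_frames/video_3", interval=20)
```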
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
License information was derived automatically
MOSEv2: A More Challenging Dataset for Video Object Segmentation in Complex Scenes
🔥 Evaluation Server | 🏠 Homepage | 📄 Paper | 🔗 GitHub
Download
We recommend using huggingface-cli to download:
pip install -U "huggingface_hub[cli]"
huggingface-cli download FudanCVL/MOSEv2 --repo-type dataset --local-dir ./MOSEv2 --local-dir-use-symlinks False --max-workers 16
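Equivalently, the dataset can be fetched from Python with huggingface_hub's snapshot_download; a minimal sketch (the local directory path is only an example):

```python
# Minimal sketch: download the MOSEv2 dataset repository via the huggingface_hub Python API.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="FudanCVL/MOSEv2",
    repo_type="dataset",
    local_dir="./MOSEv2",   # example target directory
    max_workers=16,         # parallel download workers
)
print("Dataset downloaded to:", local_path)
```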
Dataset Summary
MOSEv2 is a comprehensive video object segmentation dataset designed to advance… See the full description on the dataset page: https://huggingface.co/datasets/FudanCVL/MOSEv2.
This dataset contains the mmtracking package and its dependencies for an offline installation. Use the mmdetection dataset to install mmdetection (it is also required by mmtracking).
OpenMMLab Video Perception Toolbox. It supports Video Object Detection (VID), Multiple Object Tracking (MOT), Single Object Tracking (SOT), and Video Instance Segmentation (VIS) within a unified framework.
https://github.com/open-mmlab/mmtracking
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Experimental data for the paper "Hierarchical Deep Learning Framework for Automated Marine Vegetation and Fauna Analysis Using ROV Video Data."

This dataset supports the study "Hierarchical Deep Learning Framework for Automated Marine Vegetation and Fauna Analysis Using ROV Video Data" by providing resources essential for reproducing and validating the research findings.

Dataset Contents and Structure:
- Hierarchical Model Weights: .pth files containing trained weights for all alpha regularization values used in the hierarchical classification models.
- MaskRCNN-Segmented Objects: .jpg files representing segmented objects detected by the MaskRCNN model, accompanied by maskrcnn-segmented-objects-dataset.parquet, which includes metadata and classifications with the following columns:
  - masked_image: path to the segmented image file.
  - confidence: confidence score for the prediction.
  - predicted_species: predicted species label.
  - species: true species label.
- MaskRCNN Weights: trained MaskRCNN model weights, including the hierarchical CNN models integrated with MaskRCNN in the processing pipeline.
- Pre-Trained Models: .pt files for all object detectors trained on the Esefjorden Marine Vegetation Segmentation Dataset (EMVSD) in YOLO txt format.
- Segmented Object Outputs: segmentation outputs and datasets for the following models:
  - RT-DETR: segmented objects in rtdetr-segmented-objects/, dataset in rtdetr-segmented-objects-dataset.parquet
  - YOLO-SAG: segmented objects in yolosag-segmented-objects/, dataset in yolosag-segmented-objects-dataset.parquet
  - YOLOv11: segmented objects in yolov11-segmented-objects/, dataset in yolov11-segmented-objects-dataset.parquet
  - YOLOv8: segmented objects in yolov8-segmented-objects/, dataset in yolov8-segmented-objects-dataset.parquet
  - YOLOv9: segmented objects in yolov9-segmented-objects/, dataset in yolov9-segmented-objects-dataset.parquet

Usage Instructions:
1. Download and extract the dataset.
2. Use the Python scripts provided in the associated GitHub repository for evaluation and inference: https://github.com/Ci2Lab/FjordVision

Reproducibility: The dataset includes pre-trained weights, segmentation outputs, and experimental results to facilitate reproducibility. The .parquet files and segmented object directories follow a standardized format to ensure consistency.

Licensing: This dataset is released under the CC-BY 4.0 license, permitting reuse with proper attribution.

Related Materials:
- GitHub Repository: https://github.com/Ci2Lab/FjordVision
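As an illustration of how the .parquet files can be consumed, here is a minimal sketch that loads the MaskRCNN results and scores the predictions using the predicted_species and species columns documented above (pandas with a parquet engine such as pyarrow is assumed to be installed):

```python
# Minimal sketch: inspect the MaskRCNN segmented-objects metadata and score its predictions.
import pandas as pd

df = pd.read_parquet("maskrcnn-segmented-objects-dataset.parquet")

# Columns documented above: masked_image, confidence, predicted_species, species.
accuracy = (df["predicted_species"] == df["species"]).mean()
print(f"Overall accuracy: {accuracy:.3f}")

# Per-species accuracy, sorted from worst to best.
per_species = (
    df.assign(correct=df["predicted_species"] == df["species"])
      .groupby("species")["correct"]
      .mean()
      .sort_values()
)
print(per_species)
```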
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/)
License information was derived automatically
SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction
[📂 GitHub] [📦 Model] [🌐 Homepage] [📄 Paper]
Highlights
🔥 We introduce Segment Concept (SeC), a concept-driven segmentation framework for video object segmentation that integrates Large Vision-Language Models (LVLMs) for robust, object-centric representations.
🔥 SeC dynamically balances semantic reasoning with feature matching, adaptively adjusting computational efforts based on… See the full description on the dataset page: https://huggingface.co/datasets/OpenIXCLab/SeCVOS.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
[Image: EgoHands Dataset example, https://i.imgur.com/eEWi4PT.png]
The EgoHands dataset is a collection of 4800 annotated images of human hands from a first-person view originally collected and labeled by Sven Bambach, Stefan Lee, David Crandall, and Chen Yu of Indiana University.
The dataset was captured as frames extracted from video recorded by head-mounted cameras on a Google Glass headset while participants performed four activities: building a puzzle, playing chess, playing Jenga, and playing cards. There are 100 labeled frames for each of 48 video clips.
The original EgoHands dataset was labeled with polygons for segmentation and released in a Matlab binary format. We converted it to an object detection dataset using a modified version of this script from @molyswu and have archived it in many popular formats for use with your computer vision models.
After converting to bounding boxes for object detection, we noticed that there were several dozen unlabeled hands. We added these by hand and improved several hundred of the other labels that did not fully encompass the hands (usually to include omitted fingertips, knuckles, or thumbs). In total, 344 images' annotations were edited manually.
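The conversion from segmentation polygons to detection boxes amounts to taking the axis-aligned extent of each polygon; a minimal sketch of that step is shown below (the original conversion script is not reproduced here, and the example polygon is hypothetical):

```python
# Minimal sketch: convert a segmentation polygon to an axis-aligned bounding box.
from typing import List, Tuple

def polygon_to_bbox(points: List[Tuple[float, float]]) -> Tuple[float, float, float, float]:
    """Return (x_min, y_min, x_max, y_max) enclosing all polygon vertices."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)

# Example with a hypothetical hand polygon (pixel coordinates).
hand_polygon = [(120.0, 340.0), (180.0, 325.0), (210.0, 390.0), (150.0, 410.0)]
print(polygon_to_bbox(hand_polygon))  # (120.0, 325.0, 210.0, 410.0)
```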
We chose a new random train/test split of 80% training, 10% validation, and 10% testing. Notably, this is not the same split as in the original EgoHands paper.
There are two versions of the converted dataset available:
* specific is labeled with four classes: myleft, myright, yourleft, yourright representing which hand of which person (the viewer or the opponent across the table) is contained in the bounding box.
* generic contains the same boxes but with a single hand class.
The authors have graciously allowed Roboflow to re-host this derivative dataset. It is released under a Creative Commons Attribution 4.0 license. You may use it for academic or commercial purposes but must cite the original paper.
Please use the following BibTeX:
@inproceedings{egohands2015iccv,
title = {Lending A Hand: Detecting Hands and Recognizing Activities in Complex Egocentric Interactions},
author = {Sven Bambach and Stefan Lee and David Crandall and Chen Yu},
booktitle = {IEEE International Conference on Computer Vision (ICCV)},
year = {2015}
}
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
[CVPR 2025] M3-VOS: Multi-Phase, Multi-Transition, and Multi-Scenery Video Object Segmentation
If you like our project, please give us a star ⭐ on GitHub for the latest update.
💡 Description
Venue: CVPR 2025 | Repository: 🛠️ Tool, 🏠 Page | Paper: arxiv.org/html/2412.13803v2 | Point of Contact: Jiaxin Li, Zixuan Chen
📁 Structure
This dataset contains annotated videos and images for object segmentation tasks with phase transition information. The directory… See the full description on the dataset page: https://huggingface.co/datasets/Lijiaxin0111/M3_VOS.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This repository contains a diverse set of features extracted from the V3C1+V3C2 dataset, sourced from the Vimeo Creative Commons Collection. These features were utilized in the VISIONE system [Amato et al. 2023, Amato et al. 2022] during the latest editions of the Video Browser Showdown (VBS) competition (https://www.videobrowsershowdown.org/).
The original V3C1+V3C2 dataset, provided by NIST, can be downloaded using the instructions provided at https://videobrowsershowdown.org/about-vbs/existing-data-and-tools/.
It comprises 7,235 video files, amounting to 2,300 hours of video content and encompassing 2,508,113 predefined video segments.
We subdivided the predefined video segments longer than 10 seconds into multiple segments, with each segment spanning no longer than 16 seconds. As a result, we obtained a total of 2,648,219 segments. For each segment, we extracted one frame, specifically the middle one, and computed several features, which are described in detail below.
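The exact subdivision rule is not spelled out above; the following sketch illustrates one way to split a segment longer than 10 seconds into equal-length pieces of at most 16 seconds and to pick the middle timestamp of each. The thresholds follow the description, while the equal-length choice is an assumption.

```python
# Minimal sketch: split a long video segment into <=16 s pieces and pick each middle timestamp.
import math
from typing import List, Tuple

def split_segment(start: float, end: float,
                  min_split: float = 10.0, max_len: float = 16.0) -> List[Tuple[float, float, float]]:
    """Return (sub_start, sub_end, middle_timestamp) triples covering [start, end]."""
    duration = end - start
    if duration <= min_split:
        return [(start, end, start + duration / 2.0)]
    n_parts = math.ceil(duration / max_len)   # assumed: equal-length sub-segments
    step = duration / n_parts
    pieces = []
    for i in range(n_parts):
        s = start + i * step
        e = start + (i + 1) * step
        pieces.append((s, e, (s + e) / 2.0))
    return pieces

print(split_segment(0.0, 40.0))  # three sub-segments of ~13.3 s each, with their middle timestamps
```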
This repository is released under a Creative Commons Attribution license. If you use it in any form for your work, please cite the following paper:
@inproceedings{amato2023visione,
title={VISIONE at Video Browser Showdown 2023},
author={Amato, Giuseppe and Bolettieri, Paolo and Carrara, Fabio and Falchi, Fabrizio and Gennaro, Claudio and Messina, Nicola and Vadicamo, Lucia and Vairo, Claudio},
booktitle={International Conference on Multimedia Modeling},
pages={615--621},
year={2023},
organization={Springer}
}
This repository comprises the following files:
*Please be sure to use the v2 version of this repository, since v1 feature files may contain inconsistencies that have now been corrected.
*Note on the object annotations: Within an object archive, there is a jsonl file for each video, where each row contains a record of a video segment (the "_id" corresponds to the "id_visione" used in msb.tar.gz). Additionally, there are three arrays representing the detected objects, the corresponding scores, and the bounding boxes. The format of these arrays is as follows:
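A minimal sketch of reading one such per-video jsonl file is given below; the file name and the key names for the three arrays (here objects, scores, boxes) are hypothetical, since the description above only states what the arrays contain:

```python
# Minimal sketch: iterate over the per-segment object annotations of one video (jsonl format).
import json

with open("video_00001.jsonl", "r", encoding="utf-8") as f:   # hypothetical file name
    for line in f:
        record = json.loads(line)
        segment_id = record["_id"]            # matches "id_visione" in msb.tar.gz
        # Hypothetical keys for the three parallel arrays described above:
        objects = record.get("objects", [])   # detected object labels
        scores = record.get("scores", [])     # detection confidence scores
        boxes = record.get("boxes", [])       # bounding boxes
        for label, score, box in zip(objects, scores, boxes):
            print(segment_id, label, score, box)
```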
†Note on the cross-modal features: The extracted multi-modal features (ALADIN, CLIPs, CLIP2Video) enable internal searches within the V3C1+V3C2 dataset using the query-by-image approach (features can be compared with the dot product). However, to perform searches based on free text, the text needs to be transformed into the joint embedding space according to the specific network being used. Please be aware that the service for transforming text into features is not provided within this repository and should be developed independently using the original feature repositories linked above.
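Query-by-image with these features reduces to a dot-product similarity search; a minimal NumPy sketch, assuming the segment features and a query feature have already been loaded (file names and shapes are illustrative):

```python
# Minimal sketch: query-by-image over precomputed cross-modal features via dot product.
import numpy as np

# Hypothetical arrays: one row per video segment, plus a query feature of the same dimension.
segment_features = np.load("clip_features.npy")   # shape (num_segments, dim)
query_feature = np.load("query_feature.npy")      # shape (dim,)

scores = segment_features @ query_feature          # dot-product similarity per segment
top_k = 10
best = np.argsort(-scores)[:top_k]                 # indices of the most similar segments
for rank, idx in enumerate(best, start=1):
    print(f"{rank:2d}. segment {idx}  score={scores[idx]:.4f}")
```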
We have plans to release the code in the future, allowing the reproduction of the VISIONE system, including the instantiation of all the services to transform text into cross-modal features. However, this work is still in progress, and the code is not currently available.
References:
[Amato et al. 2023] Amato, G. et al., 2023, January. VISIONE at Video Browser Showdown 2023. In International Conference on Multimedia Modeling (pp. 615-621). Cham: Springer International Publishing.
[Amato et al. 2022] Amato, G. et al., 2022. VISIONE at Video Browser Showdown 2022. In MultiMedia Modeling. MMM 2022. Lecture Notes in Computer Science, vol 13142. Springer, Cham.
[Fang H. et al. 2021] Fang, H. et al., 2021. CLIP2Video: Mastering video-text retrieval via image CLIP. arXiv preprint arXiv:2106.11097.
[He et al. 2017] He, K., Gkioxari, G., Dollár, P. and Girshick, R., 2017. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2961-2969).
[Kuznetsova et al. 2020] Kuznetsova, A., Rom, H., Alldrin, N., Uijlings, J., Krasin, I., Pont-Tuset, J., Kamali, S., Popov, S., Malloci, M., Kolesnikov, A. and Duerig, T., 2020. The Open Images Dataset V4. International Journal of Computer Vision, 128(7), pp. 1956-1981.
[Lin et al. 2014] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P. and Zitnick, C.L., 2014, September. Microsoft COCO: Common objects in context. In European Conference on Computer Vision (pp. 740-755). Springer, Cham.
[Messina et al. 2022] Messina, N. et al., 2022, September. ALADIN: Distilling fine-grained alignment scores for efficient image-text matching and retrieval. In Proceedings of the 19th International Conference on Content-based Multimedia Indexing (pp. 64-70).
[Radford et al. 2021] Radford, A. et al., 2021, July. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (pp. 8748-8763). PMLR.
[Schuhmann et al. 2022] Schuhmann, C. et al., 2022. LAION-5B: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35, pp. 25278-25294.
[Zhang et al. 2021] Zhang, H., Wang, Y., Dayoub, F. and Sünderhauf, N., 2021. VarifocalNet: An IoU-aware dense object detector. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 8514-8523).
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/)
License information was derived automatically
CamSeq01 is a ground-truth dataset that can be freely used for research on object recognition in video.
This dataset contains 101 image pairs at 960x720 pixels. Every mask is designated by an "_L" suffix in the file name. All images (original and ground truth) are in uncompressed 24-bit color PNG format.
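Pairing originals with their ground-truth masks follows directly from the "_L" naming convention; a minimal sketch, assuming the files have been extracted into a CamSeq01 directory:

```python
# Minimal sketch: pair each original CamSeq01 frame with its "_L" ground-truth mask.
from pathlib import Path

dataset_dir = Path("CamSeq01")   # assumed extraction directory
pairs = []
for img in sorted(dataset_dir.glob("*.png")):
    if img.stem.endswith("_L"):
        continue                                   # skip the masks themselves
    mask = img.with_name(img.stem + "_L" + img.suffix)
    if mask.exists():
        pairs.append((img, mask))

print(f"Found {len(pairs)} image/mask pairs")      # expected: 101
```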
Julien Fauqueur, Gabriel Brostow, Roberto Cipolla, Assisted Video Object Labeling By Joint Tracking of Regions and Keypoints, IEEE International Conference on Computer Vision (ICCV'2007) Interactive Computer Vision Workshop. Rio de Janeiro, Brazil, October 2007
This work has been carried out with the support of Toyota Motor Europe.
The original dataset can be found here: http://mi.eng.cam.ac.uk/research/projects/VideoRec/CamSeq01