Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Long-RVOS: A Comprehensive Benchmark for Long-term Referring Video Object Segmentation
GitHub Repository: https://github.com/iSEE-Laboratory/Long_RVOS | Project Page: https://isee-laboratory.github.io/Long-RVOS/ | Paper: arXiv:2505.12702
Dataset Description
Dataset Summary
Long-RVOS is the first large-scale long-term referring video object segmentation benchmark, containing 2,000+ videos with an average duration exceeding 60 seconds. The dataset addresses… See the full description on the dataset page: https://huggingface.co/datasets/iSEE-Laboratory/Long-RVOS.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
License information was derived automatically
MOVE: Motion-Guided Few-Shot Video Object Segmentation
🏠 Homepage | 📄 Paper | 🔗 GitHub
Abstract
This work addresses motion-guided few-shot video object segmentation (FSVOS), which aims to segment dynamic objects in videos based on a few annotated examples with the same motion patterns. Existing FSVOS datasets and methods typically focus on object categories, which are static attributes that ignore the rich temporal dynamics in videos, limiting their application in… See the full description on the dataset page: https://huggingface.co/datasets/FudanCVL/MOVE.
The dataset (Lego_Tracking folder) was created manually by recording 12 videos with a smartphone: 10 designated for training and 2 for testing. The videos show conveyor belts transporting LEGO bricks, captured from various perspectives (top, front, and diagonal) to provide diverse viewpoints. They were recorded in the AI laboratory of Eötvös Loránd University (https://github.com/BahruzHuseynov/Object-Tracking-AI_Lab), and the dataset was used for research on a detection-segmentation-tracking pipeline as part of the AI laboratory work. The dataset includes videos of differing complexity, classified as "Overlapping," "Normal," or "Simple," with durations ranging from short to long shots. Additionally, the LEGO bricks were annotated frame by frame using the RoboFlow web application (https://roboflow.com/).
In addition, a companion dataset, prepared by systematic sampling of frames from the training videos to train and validate YOLOv8 and RT-DETR models from Ultralytics, is available at https://www.kaggle.com/datasets/hbahruz/multiple-lego/data (a sketch of such sampling appears after the tables below).
Another prepared dataset can be used for semantic segmentation: https://www.kaggle.com/datasets/hbahruz/lego-semantic-segmentation/
| Test | View | Complexity | Only LEGO | Frames per second | Approximate duration (seconds) | Num. frames |
|---|---|---|---|---|---|---|
| Video 1 | Top | Normal | + | 20 | 68 | 1401 |
| Video 2 | Diagonal | Normal | + | 25 | 57 | 1444 |

| Training | View | Complexity | Only LEGO | Frames per second | Approximate duration (seconds) | Num. frames |
|---|---|---|---|---|---|---|
| Video 1 | Top | Overlapping | + | 16 | 19 | 300 |
| Video 2 | Front | Overlapping | + | 13 | 16 | 196 |
| Video 3 | Diagonal | Normal | + | 20 | 56 | 1136 |
| Video 4 | Top | Overlapping | + | 20 | 42 | 839 |
| Video 5 | Diagonal | Overlapping | + | 21 | 14 | 839 |
| Video 6 | Top | Normal | - | 20 | 50 | 1000 |
| Video 7 | Top | Simple | + | 15 | 20 | 303 |
| Video 8 | Diagonal | Normal | + | 13 | 13 | 277 |
| Video 9 | Top | Normal | + | 19 | 28 | 537 |
| Video 10 | Front | Normal | - | 20 | 58 | 1162 |
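The systematic sampling mentioned above is not detailed in the dataset description; the sketch below shows one plausible way to extract every k-th frame from a training video with OpenCV. The file names, output directory, and sampling interval are assumptions for illustration.

```python
# Minimal sketch: systematic sampling of frames from a training video (assumed interval).
import cv2
from pathlib import Path

def sample_frames(video_path: str, out_dir: str, interval: int = 20) -> int:
    """Save every `interval`-th frame of `video_path` into `out_dir` as JPEG images."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    saved, index = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % interval == 0:
            cv2.imwrite(str(out / f"frame_{index:05d}.jpg"), frame)
            saved += 1
        index += 1
    cap.release()
    return saved

# Example (hypothetical paths): sample every 20th frame of training video 3.
# n = sample_frames("Lego_Tracking/train/video_3.mp4", "sampled_frames/video_3", interval=20)
```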
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
License information was derived automatically
MOSEv2: A More Challenging Dataset for Video Object Segmentation in Complex Scenes
🔥 Evaluation Server | 🏠 Homepage | 📄 Paper | 🔗 GitHub
Download
We recommend using huggingface-cli to download:
pip install -U "huggingface_hub[cli]"
huggingface-cli download FudanCVL/MOSEv2 --repo-type dataset --local-dir ./MOSEv2 --local-dir-use-symlinks False --max-workers 16
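Equivalently, the dataset can be fetched from Python with huggingface_hub's snapshot_download; a minimal sketch (the local directory path is only an example):

```python
# Minimal sketch: download the MOSEv2 dataset repository via the huggingface_hub Python API.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="FudanCVL/MOSEv2",
    repo_type="dataset",
    local_dir="./MOSEv2",   # example target directory
    max_workers=16,         # parallel download workers
)
print("Dataset downloaded to:", local_path)
```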
Dataset Summary
MOSEv2 is a comprehensive video object segmentation dataset designed to advance… See the full description on the dataset page: https://huggingface.co/datasets/FudanCVL/MOSEv2.
This dataset contains the mmtracking package and its dependencies for an offline installation. Use the mmdetection dataset to install mmdetection (it is also required by mmtracking).
OpenMMLab Video Perception Toolbox. It supports Video Object Detection (VID), Multiple Object Tracking (MOT), Single Object Tracking (SOT), and Video Instance Segmentation (VIS) within a unified framework.
https://github.com/open-mmlab/mmtracking
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Experimental data for the paper "Hierarchical Deep Learning Framework for Automated Marine Vegetation and Fauna Analysis Using ROV Video Data."

This dataset supports the study "Hierarchical Deep Learning Framework for Automated Marine Vegetation and Fauna Analysis Using ROV Video Data" by providing resources essential for reproducing and validating the research findings.

Dataset Contents and Structure:
- Hierarchical Model Weights: .pth files containing trained weights for all alpha regularization values used in the hierarchical classification models.
- MaskRCNN-Segmented Objects: .jpg files representing segmented objects detected by the MaskRCNN model, accompanied by maskrcnn-segmented-objects-dataset.parquet, which includes metadata and classifications with the following columns:
  - masked_image: path to the segmented image file.
  - confidence: confidence score for the prediction.
  - predicted_species: predicted species label.
  - species: true species label.
- MaskRCNN Weights: trained MaskRCNN model weights, including the hierarchical CNN models integrated with MaskRCNN in the processing pipeline.
- Pre-Trained Models: .pt files for all object detectors trained on the Esefjorden Marine Vegetation Segmentation Dataset (EMVSD) in YOLO txt format.
- Segmented Object Outputs: segmentation outputs and datasets for the following models:
  - RT-DETR: segmented objects in rtdetr-segmented-objects/, dataset in rtdetr-segmented-objects-dataset.parquet
  - YOLO-SAG: segmented objects in yolosag-segmented-objects/, dataset in yolosag-segmented-objects-dataset.parquet
  - YOLOv11: segmented objects in yolov11-segmented-objects/, dataset in yolov11-segmented-objects-dataset.parquet
  - YOLOv8: segmented objects in yolov8-segmented-objects/, dataset in yolov8-segmented-objects-dataset.parquet
  - YOLOv9: segmented objects in yolov9-segmented-objects/, dataset in yolov9-segmented-objects-dataset.parquet

Usage Instructions:
1. Download and extract the dataset.
2. Use the Python scripts provided in the associated GitHub repository for evaluation and inference: https://github.com/Ci2Lab/FjordVision

Reproducibility: The dataset includes pre-trained weights, segmentation outputs, and experimental results to facilitate reproducibility. The .parquet files and segmented object directories follow a standardized format to ensure consistency.

Licensing: This dataset is released under the CC-BY 4.0 license, permitting reuse with proper attribution.

Related Materials:
- GitHub Repository: https://github.com/Ci2Lab/FjordVision
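As an illustration of how the .parquet files can be consumed, here is a minimal sketch that loads the MaskRCNN results and scores the predictions using the predicted_species and species columns documented above (pandas with a parquet engine such as pyarrow is assumed to be installed):

```python
# Minimal sketch: inspect the MaskRCNN segmented-objects metadata and score its predictions.
import pandas as pd

df = pd.read_parquet("maskrcnn-segmented-objects-dataset.parquet")

# Columns documented above: masked_image, confidence, predicted_species, species.
accuracy = (df["predicted_species"] == df["species"]).mean()
print(f"Overall accuracy: {accuracy:.3f}")

# Per-species accuracy, sorted from worst to best.
per_species = (
    df.assign(correct=df["predicted_species"] == df["species"])
      .groupby("species")["correct"]
      .mean()
      .sort_values()
)
print(per_species)
```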
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/)
License information was derived automatically
SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction
[📂 GitHub] [📦 Model] [🌐 Homepage] [📄 Paper]
Highlights
🔥 We introduce Segment Concept (SeC), a concept-driven segmentation framework for video object segmentation that integrates Large Vision-Language Models (LVLMs) for robust, object-centric representations.
🔥 SeC dynamically balances semantic reasoning with feature matching, adaptively adjusting computational efforts based on… See the full description on the dataset page: https://huggingface.co/datasets/OpenIXCLab/SeCVOS.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
[Image: EgoHands Dataset example, https://i.imgur.com/eEWi4PT.png]
The EgoHands dataset is a collection of 4800 annotated images of human hands from a first-person view originally collected and labeled by Sven Bambach, Stefan Lee, David Crandall, and Chen Yu of Indiana University.
The dataset was captured as frames extracted from video recorded by head-mounted cameras on a Google Glass headset while participants performed four activities: building a puzzle, playing chess, playing Jenga, and playing cards. There are 100 labeled frames for each of 48 video clips.
The original EgoHands dataset was labeled with polygons for segmentation and released in a Matlab binary format. We converted it to an object detection dataset using a modified version of this script from @molyswu and have archived it in many popular formats for use with your computer vision models.
After converting to bounding boxes for object detection, we noticed that there were several dozen unlabeled hands. We added these by hand and improved several hundred of the other labels that did not fully encompass the hands (usually to include omitted fingertips, knuckles, or thumbs). In total, 344 images' annotations were edited manually.
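The conversion from segmentation polygons to detection boxes amounts to taking the axis-aligned extent of each polygon; a minimal sketch of that step is shown below (the original conversion script is not reproduced here, and the example polygon is hypothetical):

```python
# Minimal sketch: convert a segmentation polygon to an axis-aligned bounding box.
from typing import List, Tuple

def polygon_to_bbox(points: List[Tuple[float, float]]) -> Tuple[float, float, float, float]:
    """Return (x_min, y_min, x_max, y_max) enclosing all polygon vertices."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)

# Example with a hypothetical hand polygon (pixel coordinates).
hand_polygon = [(120.0, 340.0), (180.0, 325.0), (210.0, 390.0), (150.0, 410.0)]
print(polygon_to_bbox(hand_polygon))  # (120.0, 325.0, 210.0, 410.0)
```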
We chose a new random train/test split of 80% training, 10% validation, and 10% testing. Notably, this is not the same split as in the original EgoHands paper.
There are two versions of the converted dataset available:
* specific is labeled with four classes: myleft, myright, yourleft, yourright representing which hand of which person (the viewer or the opponent across the table) is contained in the bounding box.
* generic contains the same boxes but with a single hand class.
The authors have graciously allowed Roboflow to re-host this derivative dataset. It is released under a Creative Commons Attribution 4.0 license. You may use it for academic or commercial purposes but must cite the original paper.
Please use the following BibTeX:
@inproceedings{egohands2015iccv,
title = {Lending A Hand: Detecting Hands and Recognizing Activities in Complex Egocentric Interactions},
author = {Sven Bambach and Stefan Lee and David Crandall and Chen Yu},
booktitle = {IEEE International Conference on Computer Vision (ICCV)},
year = {2015}
}
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
[CVPR 2025] M3-VOS: Multi-Phase, Multi-Transition, and Multi-Scenery Video Object Segmentation
If you like our project, please give us a star ⭐ on GitHub for the latest update.
💡 Description
Venue: CVPR 2025 | Repository: 🛠️ Tool, 🏠 Page | Paper: arxiv.org/html/2412.13803v2 | Point of Contact: Jiaxin Li, Zixuan Chen
📁 Structure
This dataset contains annotated videos and images for object segmentation tasks with phase transition information. The directory… See the full description on the dataset page: https://huggingface.co/datasets/Lijiaxin0111/M3_VOS.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This repository contains a diverse set of features extracted from the V3C1+V3C2 dataset, sourced from the Vimeo Creative Commons Collection. These features were utilized in the VISIONE system [Amato et al. 2023, Amato et al. 2022] during the latest editions of the Video Browser Showdown (VBS) competition (https://www.videobrowsershowdown.org/).
The original V3C1+V3C2 dataset, provided by NIST, can be downloaded using the instructions provided at https://videobrowsershowdown.org/about-vbs/existing-data-and-tools/.
It comprises 7,235 video files, amounting to 2,300 hours of video content and encompassing 2,508,113 predefined video segments.
We subdivided the predefined video segments longer than 10 seconds into multiple segments, with each segment spanning no longer than 16 seconds. As a result, we obtained a total of 2,648,219 segments. For each segment, we extracted one frame, specifically the middle one, and computed several features, which are described in detail below.
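The exact subdivision rule is not spelled out above; the following sketch illustrates one way to split a segment longer than 10 seconds into equal-length pieces of at most 16 seconds and to pick the middle timestamp of each. The thresholds follow the description, while the equal-length choice is an assumption.

```python
# Minimal sketch: split a long video segment into <=16 s pieces and pick each middle timestamp.
import math
from typing import List, Tuple

def split_segment(start: float, end: float,
                  min_split: float = 10.0, max_len: float = 16.0) -> List[Tuple[float, float, float]]:
    """Return (sub_start, sub_end, middle_timestamp) triples covering [start, end]."""
    duration = end - start
    if duration <= min_split:
        return [(start, end, start + duration / 2.0)]
    n_parts = math.ceil(duration / max_len)   # assumed: equal-length sub-segments
    step = duration / n_parts
    pieces = []
    for i in range(n_parts):
        s = start + i * step
        e = start + (i + 1) * step
        pieces.append((s, e, (s + e) / 2.0))
    return pieces

print(split_segment(0.0, 40.0))  # three sub-segments of ~13.3 s each, with their middle timestamps
```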
This repository is released under a Creative Commons Attribution license. If you use it in any form for your work, please cite the following paper:
@inproceedings{amato2023visione,
title={VISIONE at Video Browser Showdown 2023},
author={Amato, Giuseppe and Bolettieri, Paolo and Carrara, Fabio and Falchi, Fabrizio and Gennaro, Claudio and Messina, Nicola and Vadicamo, Lucia and Vairo, Claudio},
booktitle={International Conference on Multimedia Modeling},
pages={615--621},
year={2023},
organization={Springer}
}
This repository comprises the following files:
*Please be sure to use the v2 version of this repository, since v1 feature files may contain inconsistencies that have now been corrected.
*Note on the object annotations: Within an object archive, there is a jsonl file for each video, where each row contains a record of a video segment (the "_id" corresponds to the "id_visione" used in msb.tar.gz). Additionally, there are three arrays representing the detected objects, the corresponding scores, and the bounding boxes. The format of these arrays is as follows:
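A minimal sketch of reading one such per-video jsonl file is given below; the file name and the key names for the three arrays (here objects, scores, boxes) are hypothetical, since the description above only states what the arrays contain:

```python
# Minimal sketch: iterate over the per-segment object annotations of one video (jsonl format).
import json

with open("video_00001.jsonl", "r", encoding="utf-8") as f:   # hypothetical file name
    for line in f:
        record = json.loads(line)
        segment_id = record["_id"]            # matches "id_visione" in msb.tar.gz
        # Hypothetical keys for the three parallel arrays described above:
        objects = record.get("objects", [])   # detected object labels
        scores = record.get("scores", [])     # detection confidence scores
        boxes = record.get("boxes", [])       # bounding boxes
        for label, score, box in zip(objects, scores, boxes):
            print(segment_id, label, score, box)
```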
†Note on the cross-modal features: The extracted multi-modal features (ALADIN, CLIPs, CLIP2Video) enable internal searches within the V3C1+V3C2 dataset using the query-by-image approach (features can be compared with the dot product). However, to perform searches based on free text, the text needs to be transformed into the joint embedding space according to the specific network being used. Please be aware that the service for transforming text into features is not provided within this repository and should be developed independently using the original feature repositories linked above.
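Query-by-image with these features reduces to a dot-product similarity search; a minimal NumPy sketch, assuming the segment features and a query feature have already been loaded (file names and shapes are illustrative):

```python
# Minimal sketch: query-by-image over precomputed cross-modal features via dot product.
import numpy as np

# Hypothetical arrays: one row per video segment, plus a query feature of the same dimension.
segment_features = np.load("clip_features.npy")   # shape (num_segments, dim)
query_feature = np.load("query_feature.npy")      # shape (dim,)

scores = segment_features @ query_feature          # dot-product similarity per segment
top_k = 10
best = np.argsort(-scores)[:top_k]                 # indices of the most similar segments
for rank, idx in enumerate(best, start=1):
    print(f"{rank:2d}. segment {idx}  score={scores[idx]:.4f}")
```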
We have plans to release the code in the future, allowing the reproduction of the VISIONE system, including the instantiation of all the services to transform text into cross-modal features. However, this work is still in progress, and the code is not currently available.
References:
[Amato et al. 2023] Amato, G. et al., 2023, January. VISIONE at Video Browser Showdown 2023. In International Conference on Multimedia Modeling (pp. 615-621). Cham: Springer International Publishing.
[Amato et al. 2022] Amato, G. et al., 2022. VISIONE at Video Browser Showdown 2022. In MultiMedia Modeling. MMM 2022. Lecture Notes in Computer Science, vol 13142. Springer, Cham.
[Fang H. et al. 2021] Fang, H. et al., 2021. CLIP2Video: Mastering video-text retrieval via image CLIP. arXiv preprint arXiv:2106.11097.
[He et al. 2017] He, K., Gkioxari, G., Dollár, P. and Girshick, R., 2017. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2961-2969).
[Kuznetsova et al. 2020] Kuznetsova, A., Rom, H., Alldrin, N., Uijlings, J., Krasin, I., Pont-Tuset, J., Kamali, S., Popov, S., Malloci, M., Kolesnikov, A. and Duerig, T., 2020. The Open Images Dataset V4. International Journal of Computer Vision, 128(7), pp. 1956-1981.
[Lin et al. 2014] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P. and Zitnick, C.L., 2014, September. Microsoft COCO: Common objects in context. In European Conference on Computer Vision (pp. 740-755). Springer, Cham.
[Messina et al. 2022] Messina, N. et al., 2022, September. ALADIN: Distilling fine-grained alignment scores for efficient image-text matching and retrieval. In Proceedings of the 19th International Conference on Content-based Multimedia Indexing (pp. 64-70).
[Radford et al. 2021] Radford, A. et al., 2021, July. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (pp. 8748-8763). PMLR.
[Schuhmann et al. 2022] Schuhmann, C. et al., 2022. LAION-5B: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35, pp. 25278-25294.
[Zhang et al. 2021] Zhang, H., Wang, Y., Dayoub, F. and Sünderhauf, N., 2021. VarifocalNet: An IoU-aware dense object detector. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 8514-8523).
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/)
License information was derived automatically
CamSeq01 is a ground-truth dataset that can be freely used for research on object recognition in video.
This dataset contains 101 image pairs at 960x720 pixels. Every mask is designated by an "_L" suffix in the file name. All images (original and ground truth) are in uncompressed 24-bit color PNG format.
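Pairing originals with their ground-truth masks follows directly from the "_L" naming convention; a minimal sketch, assuming the files have been extracted into a CamSeq01 directory:

```python
# Minimal sketch: pair each original CamSeq01 frame with its "_L" ground-truth mask.
from pathlib import Path

dataset_dir = Path("CamSeq01")   # assumed extraction directory
pairs = []
for img in sorted(dataset_dir.glob("*.png")):
    if img.stem.endswith("_L"):
        continue                                   # skip the masks themselves
    mask = img.with_name(img.stem + "_L" + img.suffix)
    if mask.exists():
        pairs.append((img, mask))

print(f"Found {len(pairs)} image/mask pairs")      # expected: 101
```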
Julien Fauqueur, Gabriel Brostow, Roberto Cipolla, Assisted Video Object Labeling By Joint Tracking of Regions and Keypoints, IEEE International Conference on Computer Vision (ICCV'2007) Interactive Computer Vision Workshop. Rio de Janeiro, Brazil, October 2007
This work has been carried out with the support of Toyota Motor Europe.
The original dataset can be found here: http://mi.eng.cam.ac.uk/research/projects/VideoRec/CamSeq01