Cityscapes-VPS is a video extension of the Cityscapes validation split. It provides 2,500 frames of panoptic labels that temporally extend the 500 Cityscapes image-panoptic labels. In total there are 3,000 frames of panoptic labels, corresponding to the 5th, 10th, 15th, 20th, 25th, and 30th frames of each of the 500 videos, with all instance IDs associated over time. It not only supports the video panoptic segmentation (VPS) task but also provides super-set annotations for the video semantic segmentation (VSS) and video instance segmentation (VIS) tasks.
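A minimal sketch of that "super-set" relation, assuming each annotated frame is given as a per-pixel semantic-class map plus a temporally consistent per-pixel instance-ID map (this array layout is an illustrative assumption, not the official Cityscapes-VPS file format): VSS labels are the semantic map itself, and VIS labels are the per-instance masks of the thing classes.

```python
import numpy as np

# Illustrative sketch only: `semantic` and `instance` stand in for one annotated
# frame's panoptic label, split into a semantic-class map and an instance-ID map.
def vps_to_vss_and_vis(semantic: np.ndarray, instance: np.ndarray, thing_class_ids):
    """Derive VSS- and VIS-style labels from a VPS label of one frame."""
    vss = semantic.copy()  # video semantic segmentation: the semantic map itself
    vis = {}               # video instance segmentation: {(class_id, instance_id): mask}
    for cls in thing_class_ids:
        cls_mask = semantic == cls
        for inst in np.unique(instance[cls_mask]):
            vis[(int(cls), int(inst))] = cls_mask & (instance == inst)
    return vss, vis
```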
We design an all-day semantic segmentation benchmark, all-day CityScapes. It is the first semantic segmentation benchmark that contains samples from all-day scenarios, i.e., from dawn to night. Our dataset will be made publicly available at https://isis-data.science.uva.nl/cv/1ADcityscape.zip.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically).
Cross-domain object detection is a key problem in the research of intelligent detection models. Unlike many improved algorithms built on two-stage detection models, we take a different route: this paper introduces a simple and efficient one-stage model that jointly considers inference efficiency and detection precision and broadens the scope of cross-domain object detection. We name this gradient reversal layer-based model YOLO-G; it greatly improves object detection precision in cross-domain scenarios. Specifically, we add a feature alignment branch after the backbone, to which a gradient reversal layer and a classifier are attached. With only a small increase in computational cost, performance is substantially enhanced. Experiments on Cityscapes→Foggy Cityscapes, SIM10k→Cityscapes, PASCAL VOC→Clipart, and other benchmarks indicate that, compared with most state-of-the-art (SOTA) algorithms, the proposed model achieves much better mean Average Precision (mAP). Furthermore, ablation experiments were performed on four components to confirm the reliability of the model. The project is available at https://github.com/airy975924806/yolo-G.
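As a rough illustration of the feature-alignment branch described above, here is a minimal PyTorch-style sketch of a gradient reversal layer with a small domain classifier attached to the backbone features. Layer sizes, the weight lambda_, and module names are illustrative assumptions, not the authors' exact YOLO-G implementation.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; reversed (and scaled) gradient on the backward pass."""
    @staticmethod
    def forward(ctx, x, lambda_):
        ctx.lambda_ = lambda_
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambda_ * grad_output, None

class DomainClassifier(nn.Module):
    """Binary source/target classifier attached after the detector backbone."""
    def __init__(self, in_channels=512, lambda_=1.0):
        super().__init__()
        self.lambda_ = lambda_
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(in_channels, 128),
            nn.ReLU(inplace=True),
            nn.Linear(128, 2),  # source vs. target domain
        )

    def forward(self, feats):
        # Reversing the gradient here pushes the backbone toward domain-invariant features.
        feats = GradReverse.apply(feats, self.lambda_)
        return self.head(feats)
```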
The CityPersons dataset is a subset of Cityscapes consisting only of person annotations. There are 2,975 images for training, 500 for validation, and 1,575 for testing. The average number of pedestrians per image is 7. Both visible-region and full-body annotations are provided.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically).
Effect of additional modules on segmentation performance: ablation study results on the Cityscapes dataset.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically).
Light across the full spectrum travels through the physical world, and trichromatic (RGB) human vision captures and interprets it. Machine vision follows this analogy, using RGB cameras for semantic segmentation and scene understanding. We argue that such machine vision suffers from metamerism: different objects may appear the same RGB color while being distinct in spectrum. While learning-based solutions, especially deep learning, have been heavily explored, they do not solve this fundamental physical limitation. In this paper, we propose to use hyperspectral images (HSIs), which capture hundreds of consecutive narrow bands of the real visible world, so that metamerism no longer exists. In short, we aim to "see beyond human vision". In practice, we introduce a novel large-scale, high-quality HSI dataset for semantic segmentation in cityscapes, named the Hyperspectral City dataset. The dataset contains 1,330 HSIs captured in typical urban driving scenes. Each HSI has a spatial resolution of 1889×1422 and 128 spectral channels ranging from 450 nm to 950 nm. The dataset provides pixel-level semantic annotation done manually by professional annotators. We believe this dataset enables a new direction for scene understanding.
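For quick intuition about the spectral sampling, the sketch below approximates each band's center wavelength, assuming the 450-950 nm range is sampled uniformly across the 128 channels (the listing does not state the exact band centers, so this is only an approximation).

```python
# Approximate band centers under a uniform-sampling assumption; the true
# sensor response of the Hyperspectral City camera may differ.
def band_wavelengths(num_bands=128, start_nm=450.0, end_nm=950.0):
    step = (end_nm - start_nm) / (num_bands - 1)  # ~3.94 nm between adjacent bands
    return [start_nm + i * step for i in range(num_bands)]

wavelengths = band_wavelengths()
print(wavelengths[0], wavelengths[64], wavelengths[-1])  # ~450.0, ~702.0, 950.0 nm
```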
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically).
Performance comparison of the proposed CycleGAN with other SOTA deep generative models.
The ADE20K semantic segmentation dataset contains more than 20K scene-centric images exhaustively annotated with pixel-level object and object-part labels. There are 150 semantic categories in total, including stuff classes such as sky, road, and grass, and discrete objects such as person, car, and bed.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically).
Performance comparison of semantic segmentation methods on Cityscapes and DensePASS.
DensePASS is a novel densely annotated dataset for panoramic segmentation under cross-domain conditions, built specifically to study pinhole-to-panoramic transfer and accompanied by pinhole-camera training examples obtained from Cityscapes. DensePASS covers both labelled and unlabelled 360-degree images, with the labelled data comprising 19 classes that explicitly match the categories available in the source-domain (i.e., pinhole) data.
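Since the 19 labelled classes match the Cityscapes source-domain categories, a model trained on the pinhole data can be evaluated on DensePASS with the usual Cityscapes train-ID ordering. The list below is the standard Cityscapes 19-class set; treat it as a convenience sketch and defer to the official DensePASS label files for the authoritative mapping.

```python
# Standard Cityscapes 19-class (train-ID) ordering, shared with DensePASS.
# Convenience sketch; the official label definitions remain authoritative.
CITYSCAPES_TRAIN_CLASSES = [
    "road", "sidewalk", "building", "wall", "fence", "pole",
    "traffic light", "traffic sign", "vegetation", "terrain", "sky",
    "person", "rider", "car", "truck", "bus", "train",
    "motorcycle", "bicycle",
]
assert len(CITYSCAPES_TRAIN_CLASSES) == 19
```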
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically).
This paper presents a novel method for improving semantic segmentation performance in computer vision tasks. Our approach utilizes an enhanced UNet architecture that leverages an improved ResNet50 backbone. We replace the last layer of ResNet50 with deformable convolution to enhance feature representation. Additionally, we incorporate an attention mechanism, specifically ECA-ASPP (Efficient Channel Attention Atrous Spatial Pyramid Pooling), in the encoding path of UNet to capture multi-scale contextual information effectively. In the decoding path of UNet, we explore the use of attention mechanisms after concatenating low-level features with high-level features. Specifically, we investigate two types of attention mechanism: ECA (Efficient Channel Attention) and LKA (Large Kernel Attention). Our experiments demonstrate that incorporating attention after concatenation improves segmentation accuracy. Furthermore, we compare the performance of the ECA and LKA modules in the decoder path; the results indicate that the LKA module outperforms the ECA module, highlighting the importance of exploring different attention mechanisms and their impact on segmentation performance. To evaluate the effectiveness of the proposed method, we conduct experiments on benchmark datasets, including Stanford and Cityscapes, as well as the newly introduced WildPASS and DensePASS datasets. The proposed method achieves state-of-the-art results, including 85.79 mIoU on the Stanford dataset and 82.25 mIoU on the Cityscapes dataset, demonstrating high segmentation accuracy on these benchmarks.
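A minimal sketch of the decoder idea described above: attention applied after concatenating the upsampled high-level features with the low-level skip features. The ECA module follows its standard formulation (a 1-D convolution over the pooled channel descriptor); channel sizes and the surrounding block structure are illustrative assumptions rather than the paper's exact architecture, and an LKA module could be dropped in place of ECA.

```python
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient Channel Attention: channel weights from a 1-D conv over the
    globally pooled channel descriptor (no dimensionality reduction)."""
    def __init__(self, k_size=3):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k_size, padding=k_size // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # x: (B, C, H, W) -> channel descriptor (B, 1, C) -> per-channel weights
        y = self.pool(x).squeeze(-1).transpose(1, 2)
        y = self.sigmoid(self.conv(y)).transpose(1, 2).unsqueeze(-1)
        return x * y  # re-weight channels

class DecoderBlock(nn.Module):
    """Upsample, concatenate the skip connection, then apply attention after
    the concatenation (here ECA; LKA could be substituted)."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.attn = ECA()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch + skip_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x, skip):
        x = torch.cat([self.up(x), skip], dim=1)
        return self.conv(self.attn(x))
```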
RailSem19 offers 8,500 unique images taken from the ego-perspective of rail vehicles (trains and trams). Extensive semantic annotations are provided, both geometry-based (rail-relevant polygons, all rails as polylines) and dense label maps with many Cityscapes-compatible road labels. Many frames show areas where road and rail traffic intersect (railway crossings, trams driving on city streets). RailSem19 is useful for rail and road applications alike.
Image credit: https://wilddash.cc/railsem19
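As a toy illustration of how polyline-style rail annotations like RailSem19's can be rasterized onto a dense label map (the coordinate format, image size, and label value used here are assumptions for illustration, not the RailSem19 annotation schema):

```python
import numpy as np
import cv2

# Rasterize one hypothetical rail polyline onto an empty single-channel label map.
label_map = np.zeros((1080, 1920), dtype=np.uint8)  # H x W label image
rail_polyline = np.array([[100, 1070], [400, 700],
                          [650, 400], [800, 200]], dtype=np.int32)  # (x, y) points
cv2.polylines(label_map, [rail_polyline], isClosed=False, color=255, thickness=3)
```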