11 datasets found
  1. Microsoft COCO 2017 Object Detection Dataset - raw

    • public.roboflow.com
    zip
    Updated Feb 1, 2025
    + more versions
    Cite
    Microsoft (2025). Microsoft COCO 2017 Object Detection Dataset - raw [Dataset]. https://public.roboflow.com/object-detection/microsoft-coco-subset/2
    Explore at:
    Available download formats: zip
    Dataset updated
    Feb 1, 2025
    Dataset authored and provided by
    Microsoft (http://microsoft.com/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Bounding Boxes of coco-objects
    Description

    This is the full 2017 COCO object detection dataset (train and validation splits), which is a subset of the most recent 2020 COCO object detection dataset.

    COCO is a large-scale object detection, segmentation, and captioning dataset of many object types easily recognizable by a 4-year-old. The data was originally collected and published by Microsoft; the dataset was introduced in the paper "Microsoft COCO: Common Objects in Context" [Lin et al. 2014].

  2. Microsoft Coco 2017 Dataset

    • universe.roboflow.com
    zip
    Updated Jan 4, 2022
    Cite
    Roboflow Public (2022). Microsoft Coco 2017 Dataset [Dataset]. https://universe.roboflow.com/roboflow-public/microsoft-coco-2017-dataset
    Explore at:
    Available download formats: zip
    Dataset updated
    Jan 4, 2022
    Dataset provided by
    Roboflow (https://roboflow.com/)
    Authors
    Roboflow Public
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Coco Objects Bounding Boxes
    Description

    Microsoft COCO 2017 Dataset

    ## Overview
    
    Microsoft COCO 2017 Dataset is a dataset for object detection tasks - it contains Coco Objects annotations for 2,245 images.
    
    ## Getting Started
    
    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
    
    ## License

    This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
    
  3. COCO Dataset 2017

    • gts.ai
    json
    Cite
    GTS, COCO Dataset 2017 [Dataset]. https://gts.ai/dataset-download/coco-dataset-2017/
    Explore at:
    Available download formats: json
    Dataset provided by
    GLOBOSE TECHNOLOGY SOLUTIONS PRIVATE LIMITED
    Authors
    GTS
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    The MS COCO (Microsoft Common Objects in Context) dataset is a large-scale object detection, segmentation, key-point detection, and captioning dataset.

  4. COCO 2017

    • opendatalab.com
    • huggingface.co
    zip
    Updated Sep 30, 2017
    + more versions
    Cite
    Microsoft (2017). COCO 2017 [Dataset]. https://opendatalab.com/OpenDataLab/COCO_2017
    Explore at:
    Available download formats: zip (49,105,147,630 bytes)
    Dataset updated
    Sep 30, 2017
    Dataset provided by
    Microsoft
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    COCO is a large-scale object detection, segmentation, and captioning dataset. COCO has several features: object segmentation; recognition in context; superpixel stuff segmentation; 330K images (>200K labeled); 1.5 million object instances; 80 object categories; 91 stuff categories; 5 captions per image; and 250,000 people with keypoints.

  5. MSCOCO

    • huggingface.co
    Updated Dec 27, 2024
    Cite
    Shunsuke Kitada (2024). MSCOCO [Dataset]. https://huggingface.co/datasets/shunk031/MSCOCO
    Explore at:
    Dataset updated
    Dec 27, 2024
    Authors
    Shunsuke Kitada
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Card for MSCOCO

    Dataset Summary

    COCO is a large-scale object detection, segmentation, and captioning dataset. COCO has several features: object segmentation; recognition in context; superpixel stuff segmentation; 330K images (>200K labeled); 1.5 million object instances; 80 object categories; 91 stuff categories; 5 captions per image; and 250,000 people with keypoints.

    Supported Tasks and Leaderboards

    [More Information Needed]

    Languages

    [More Information Needed] See the full description on the dataset page: https://huggingface.co/datasets/shunk031/MSCOCO.

  6. COCO, LVIS, Open Images V4 classes mapping

    • data.niaid.nih.gov
    • zenodo.org
    • +1 more
    Updated Oct 13, 2022
    Cite
    Giuseppe Amato (2022). COCO, LVIS, Open Images V4 classes mapping [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7194299
    Explore at:
    Dataset updated
    Oct 13, 2022
    Dataset provided by
    Giuseppe Amato
    Fabio Carrara
    Nicola Messina
    Claudio Gennaro
    Paolo Bolettieri
    Fabrizio Falchi
    Claudio Vairo
    Lucia Vadicamo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository contains a mapping of the classes of the COCO, LVIS, and Open Images V4 datasets into a single unified set of 1460 classes.

    COCO [Lin et al. 2014] contains 80 classes, LVIS [Gupta et al. 2019] contains 1460 classes, and Open Images V4 [Kuznetsova et al. 2020] contains 601 classes.

    We built a mapping of these classes using a semi-automatic procedure in order to arrive at a unified final list of 1460 classes. We also generated a hierarchy for each class using WordNet.

    This repository contains the following files:

    coco_classes_map.txt contains the mapping for the 80 COCO classes

    lvis_classes_map.txt contains the mapping for the 1460 LVIS classes

    openimages_classes_map.txt contains the mapping for the 601 Open Images V4 classes

    classname_hyperset_definition.csv contains the final set of 1460 classes, with their definitions and hierarchy

    all-classnames.xlsx contains a side-by-side view of all classes considered

    This mapping was used in VISIONE [Amato et al. 2021, Amato et al. 2022], a content-based retrieval system that supports various search functionalities (text search, object/color-based search, semantic and visual similarity search, temporal search). For object detection, VISIONE uses three pre-trained models: VfNet [Zhang et al. 2021], Mask R-CNN [He et al. 2017], and Faster R-CNN+Inception ResNet (trained on Open Images V4).

    This repository is released under a Creative Commons Attribution license; please cite the following paper if you use it in your work in any form:

    @article{amato2021visione,
      title={The VISIONE video search system: exploiting off-the-shelf text search engines for large-scale video retrieval},
      author={Amato, Giuseppe and Bolettieri, Paolo and Carrara, Fabio and Debole, Franca and Falchi, Fabrizio and Gennaro, Claudio and Vadicamo, Lucia and Vairo, Claudio},
      journal={Journal of Imaging},
      volume={7},
      number={5},
      pages={76},
      year={2021},
      publisher={Multidisciplinary Digital Publishing Institute}
    }

    References:

    [Amato et al. 2022] Amato, G. et al. (2022). VISIONE at Video Browser Showdown 2022. In: , et al. MultiMedia Modeling. MMM 2022. Lecture Notes in Computer Science, vol 13142. Springer, Cham. https://doi.org/10.1007/978-3-030-98355-0_52

    [Amato et al. 2021] Amato, G., Bolettieri, P., Carrara, F., Debole, F., Falchi, F., Gennaro, C., Vadicamo, L. and Vairo, C., 2021. The visione video search system: exploiting off-the-shelf text search engines for large-scale video retrieval. Journal of Imaging, 7(5), p.76.

    [Gupta et al. 2019] Gupta, A., Dollar, P. and Girshick, R., 2019. Lvis: A dataset for large vocabulary instance segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 5356-5364).

    [He et al. 2017] He, K., Gkioxari, G., Dollár, P. and Girshick, R., 2017. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision (pp. 2961-2969).

    [Kuznetsova et al. 2020] Kuznetsova, A., Rom, H., Alldrin, N., Uijlings, J., Krasin, I., Pont-Tuset, J., Kamali, S., Popov, S., Malloci, M., Kolesnikov, A. and Duerig, T., 2020. The open images dataset v4. International Journal of Computer Vision, 128(7), pp.1956-1981.

    [Lin et al. 2014] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P. and Zitnick, C.L., 2014, September. Microsoft coco: Common objects in context. In European conference on computer vision (pp. 740-755). Springer, Cham.

    [Zhang et al. 2021] Zhang, H., Wang, Y., Dayoub, F. and Sunderhauf, N., 2021. Varifocalnet: An iou-aware dense object detector. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 8514-8523).

  7. Microsoft COCO 2014 and 2017

    • service.tib.eu
    Updated Dec 16, 2024
    Cite
    Lin, T.-Y., Maire, M., Belongie, S., Bourdev, L., Girshick, R., Hays, J., Perona, P., Ramanan, D., Zitnick, C. L. and Dollár, P. (2024). Dataset: Microsoft COCO 2014 and 2017 [Dataset]. https://doi.org/10.57702/ch13xmig. https://service.tib.eu/ldmservice/dataset/microsoft-coco-2014-and-2017
    Explore at:
    Dataset updated
    Dec 16, 2024
    Description

    Microsoft COCO 2014 and 2017 datasets for object detection, segmentation, and captioning.

  8. COCO

    • huggingface.co
    • datasets.activeloop.ai
    Updated Feb 6, 2023
    + more versions
    Cite
    HuggingFaceM4 (2023). COCO [Dataset]. https://huggingface.co/datasets/HuggingFaceM4/COCO
    Explore at:
    Dataset updated
    Feb 6, 2023
    Dataset authored and provided by
    HuggingFaceM4
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    MS COCO is a large-scale object detection, segmentation, and captioning dataset. COCO has several features: Object segmentation, Recognition in context, Superpixel stuff segmentation, 330K images (>200K labeled), 1.5 million object instances, 80 object categories, 91 stuff categories, 5 captions per image, 250,000 people with keypoints.

  9. VISIONE Feature Repository for VBS: Multi-Modal Features and Detected Objects from MVK Dataset

    • data.niaid.nih.gov
    • data.europa.eu
    Updated Jan 25, 2024
    Cite
    Claudio Vairo (2024). VISIONE Feature Repository for VBS: Multi-Modal Features and Detected Objects from MVK Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8355036
    Explore at:
    Dataset updated
    Jan 25, 2024
    Dataset provided by
    Giuseppe Amato
    Fabio Carrara
    Nicola Messina
    Claudio Gennaro
    Paolo Bolettieri
    Fabrizio Falchi
    Claudio Vairo
    Lucia Vadicamo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository contains a diverse set of features extracted from the marine video (underwater) dataset (MVK). These features were utilized in the VISIONE system [Amato et al. 2023, Amato et al. 2022] during the latest editions of the Video Browser Showdown (VBS) competition (https://www.videobrowsershowdown.org/).

    We used a snapshot of the MVK dataset from 2023, which can be downloaded using the instructions provided at https://download-dbis.dmi.unibas.ch/mvk/. It comprises 1,372 video files. We divided each video into 1-second segments.

    This repository is released under a Creative Commons Attribution license. If you use it in any form for your work, please cite the following paper:

    @inproceedings{amato2023visione,
      title={VISIONE at Video Browser Showdown 2023},
      author={Amato, Giuseppe and Bolettieri, Paolo and Carrara, Fabio and Falchi, Fabrizio and Gennaro, Claudio and Messina, Nicola and Vadicamo, Lucia and Vairo, Claudio},
      booktitle={International Conference on Multimedia Modeling},
      pages={615--621},
      year={2023},
      organization={Springer}
    }

    This repository comprises the following files:

    msb.tar.gz contains tab-separated files (.tsv) for each video. Each tsv file reports, for each video segment, the timestamp and frame number marking the start/end of the video segment, along with the timestamp of the extracted middle frame and the associated identifier ("id_visione").

    extract-keyframes-from-msb.tar.gz contains a Python script designed to extract the middle frame of each video segment from the MSB files. To run the script successfully, please ensure that you have the original MVK videos available.

    features-aladin.tar.gz† contains ALADIN [Messina et al. 2022] features extracted for all the segments' middle frames.

    features-clip-laion.tar.gz† contains CLIP ViT-H/14 - LAION-2B [Schuhmann et al. 2022] features extracted for all the segments' middle frames.

    features-clip-openai.tar.gz† contains CLIP ViT-L/14 [Radford et al. 2021] features extracted for all the segments' middle frames.

    features-clip2video.tar.gz† contains CLIP2Video [Fang et al. 2021] features extracted for all the 1-second video segments.

    objects-frcnn-oiv4.tar.gz* contains the objects detected using Faster R-CNN+Inception ResNet (trained on Open Images V4 [Kuznetsova et al. 2020]).

    objects-mrcnn-lvis.tar.gz* contains the objects detected using Mask R-CNN [He et al. 2017].

    objects-vfnet64-coco.tar.gz* contains the objects detected using VfNet [Zhang et al. 2021].

    *Please be sure to use the v2 version of this repository, since v1 feature files may contain inconsistencies that have now been corrected.

    *Note on the object annotations: Within an object archive, there is a jsonl file for each video, where each row contains a record of a video segment (the "_id" corresponds to the "id_visione" used in msb.tar.gz). Additionally, there are three arrays representing the detected objects, the corresponding scores, and the bounding boxes. The format of these arrays is as follows:

    "object_class_names": vector with the class name of each detected object.

    "object_scores": scores corresponding to each detected object.

    "object_boxes_yxyx": bounding boxes of the detected objects in the format (ymin, xmin, ymax, xmax).

    †Note on the cross-modal features: The extracted multi-modal features (ALADIN, CLIPs, CLIP2Video) enable internal searches within the MVK dataset using the query-by-image approach (features can be compared with the dot product). However, to perform searches based on free text, the text needs to be transformed into the joint embedding space according to the specific network being used (see links above). Please be aware that the service for transforming text into features is not provided within this repository and should be developed independently using the original feature repositories linked above.

    We have plans to release the code in the future, allowing the reproduction of the VISIONE system, including the instantiation of all the services to transform text into cross-modal features. However, this work is still in progress, and the code is not currently available.

    References:

    [Amato et al. 2023] Amato, G. et al., 2023, January. VISIONE at Video Browser Showdown 2023. In International Conference on Multimedia Modeling (pp. 615-621). Cham: Springer International Publishing.

    [Amato et al. 2022] Amato, G. et al. (2022). VISIONE at Video Browser Showdown 2022. In: , et al. MultiMedia Modeling. MMM 2022. Lecture Notes in Computer Science, vol 13142. Springer, Cham.

    [Fang H. et al. 2021] Fang H. et al., 2021. Clip2video: Mastering video-text retrieval via image clip. arXiv preprint arXiv:2106.11097.

    [He et al. 2017] He, K., Gkioxari, G., Dollár, P. and Girshick, R., 2017. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision (pp. 2961-2969).

    [Kuznetsova et al. 2020] Kuznetsova, A., Rom, H., Alldrin, N., Uijlings, J., Krasin, I., Pont-Tuset, J., Kamali, S., Popov, S., Malloci, M., Kolesnikov, A. and Duerig, T., 2020. The open images dataset v4. International Journal of Computer Vision, 128(7), pp.1956-1981.

    [Lin et al. 2014] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P. and Zitnick, C.L., 2014, September. Microsoft coco: Common objects in context. In European conference on computer vision (pp. 740-755). Springer, Cham.

    [Messina et al. 2022] Messina N. et al., 2022, September. Aladin: distilling fine-grained alignment scores for efficient image-text matching and retrieval. In Proceedings of the 19th International Conference on Content-based Multimedia Indexing (pp. 64-70).

    [Radford et al. 2021] Radford A. et al., 2021, July. Learning transferable visual models from natural language supervision. In International conference on machine learning (pp. 8748-8763). PMLR.

    [Schuhmann et al. 2022] Schuhmann C. et al., 2022. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35, pp.25278-25294.

    [Zhang et al. 2021] Zhang, H., Wang, Y., Dayoub, F. and Sunderhauf, N., 2021. Varifocalnet: An iou-aware dense object detector. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 8514-8523).

  10. VISIONE Feature Repository for VBS: Multi-Modal Features and Detected Objects from VBSLHE Dataset

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 25, 2024
    Cite
    Claudio Vairo (2024). VISIONE Feature Repository for VBS: Multi-Modal Features and Detected Objects from VBSLHE Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10013328
    Explore at:
    Dataset updated
    Jan 25, 2024
    Dataset provided by
    Paolo Bolettieri
    Giuseppe Amato
    Fabio Carrara
    Nicola Messina
    Claudio Gennaro
    Fabrizio Falchi
    Claudio Vairo
    Lucia Vadicamo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository contains a diverse set of features extracted from the VBSLHE dataset (laparoscopic gynecology). These features will be utilized in the VISIONE system [Amato et al. 2023, Amato et al. 2022] in the next editions of the Video Browser Showdown (VBS) competition (https://www.videobrowsershowdown.org/).

    We used a snapshot of the dataset provided by the Medical University of Vienna and Toronto, which can be downloaded using the instructions provided at https://download-dbis.dmi.unibas.ch/mvk/. It comprises 75 video files. We divided each video into video shots with a maximum duration of 5 seconds.

    This repository is released under a Creative Commons Attribution license. If you use it in any form for your work, please cite the following paper:

    @inproceedings{amato2023visione,
      title={VISIONE at Video Browser Showdown 2023},
      author={Amato, Giuseppe and Bolettieri, Paolo and Carrara, Fabio and Falchi, Fabrizio and Gennaro, Claudio and Messina, Nicola and Vadicamo, Lucia and Vairo, Claudio},
      booktitle={International Conference on Multimedia Modeling},
      pages={615--621},
      year={2023},
      organization={Springer}
    }

    This repository (v2) comprises the following files:

    msb.tar.gz contains tab-separated files (.tsv) for each video. Each tsv file reports, for each video segment, the timestamp and frame number marking the start/end of the video segment, along with the timestamp of the extracted middle frame and the associated identifier ("id_visione").

    extract-keyframes-from-msb.tar.gz contains a Python script designed to extract the middle frame of each video segment from the MSB files. To run the script successfully, please ensure that you have the original VBSLHE videos available.

    features-aladin.tar.gz† contains ALADIN [Messina et al. 2022] features extracted for all the segments' middle frames.

    features-clip-laion.tar.gz† contains CLIP ViT-H/14 - LAION-2B [Schuhmann et al. 2022] features extracted for all the segments' middle frames.

    features-clip-openai.tar.gz† contains CLIP ViT-L/14 [Radford et al. 2021] features extracted for all the segments' middle frames.

    features-clip2video.tar.gz† contains CLIP2Video [Fang et al. 2021] features extracted for all the video segments.

    objects-frcnn-oiv4.tar.gz* contains the objects detected using Faster R-CNN+Inception ResNet (trained on Open Images V4 [Kuznetsova et al. 2020]).

    objects-mrcnn-lvis.tar.gz* contains the objects detected using Mask R-CNN [He et al. 2017].

    objects-vfnet64-coco.tar.gz* contains the objects detected using VfNet [Zhang et al. 2021].

    *Please be sure to use the v2 version of this repository, since v1 feature files may contain inconsistencies that have now been corrected.

    *Note on the object annotations: Within an object archive, there is a jsonl file for each video, where each row contains a record of a video segment (the "_id" corresponds to the "id_visione" used in msb.tar.gz). Additionally, there are three arrays representing the detected objects, the corresponding scores, and the bounding boxes. The format of these arrays is as follows:

    "object_class_names": vector with the class name of each detected object.

    "object_scores": scores corresponding to each detected object.

    "object_boxes_yxyx": bounding boxes of the detected objects in the format (ymin, xmin, ymax, xmax).

    †Note on the cross-modal features: The extracted multi-modal features (ALADIN, CLIPs, CLIP2Video) enable internal searches within the VBSLHE dataset using the query-by-image approach (features can be compared with the dot product). However, to perform searches based on free text, the text needs to be transformed into the joint embedding space according to the specific network being used (see links above). Please be aware that the service for transforming text into features is not provided within this repository and should be developed independently using the original feature repositories linked above.

    We have plans to release the code in the future, allowing the reproduction of the VISIONE system, including the instantiation of all the services to transform text into cross-modal features. However, this work is still in progress, and the code is not currently available.

    References:

    [Amato et al. 2023] Amato, G. et al., 2023, January. VISIONE at Video Browser Showdown 2023. In International Conference on Multimedia Modeling (pp. 615-621). Cham: Springer International Publishing.

    [Amato et al. 2022] Amato, G. et al. (2022). VISIONE at Video Browser Showdown 2022. In: , et al. MultiMedia Modeling. MMM 2022. Lecture Notes in Computer Science, vol 13142. Springer, Cham.

    [Fang H. et al. 2021] Fang H. et al., 2021. Clip2video: Mastering video-text retrieval via image clip. arXiv preprint arXiv:2106.11097.

    [He et al. 2017] He, K., Gkioxari, G., Dollár, P. and Girshick, R., 2017. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision (pp. 2961-2969).

    [Kuznetsova et al. 2020] Kuznetsova, A., Rom, H., Alldrin, N., Uijlings, J., Krasin, I., Pont-Tuset, J., Kamali, S., Popov, S., Malloci, M., Kolesnikov, A. and Duerig, T., 2020. The open images dataset v4. International Journal of Computer Vision, 128(7), pp.1956-1981.

    [Lin et al. 2014] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P. and Zitnick, C.L., 2014, September. Microsoft coco: Common objects in context. In European conference on computer vision (pp. 740-755). Springer, Cham.

    [Messina et al. 2022] Messina N. et al., 2022, September. Aladin: distilling fine-grained alignment scores for efficient image-text matching and retrieval. In Proceedings of the 19th International Conference on Content-based Multimedia Indexing (pp. 64-70).

    [Radford et al. 2021] Radford A. et al., 2021, July. Learning transferable visual models from natural language supervision. In International conference on machine learning (pp. 8748-8763). PMLR.

    [Schuhmann et al. 2022] Schuhmann C. et al., 2022. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35, pp.25278-25294.

    [Zhang et al. 2021] Zhang, H., Wang, Y., Dayoub, F. and Sunderhauf, N., 2021. Varifocalnet: An iou-aware dense object detector. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 8514-8523).

  11. VISIONE Feature Repository for VBS: Multi-Modal Features and Detected Objects from V3C1+V3C2 Dataset

    • data.niaid.nih.gov
    • data.europa.eu
    Updated Feb 12, 2024
    Cite
    Giuseppe Amato (2024). VISIONE Feature Repository for VBS: Multi-Modal Features and Detected Objects from V3C1+V3C2 Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8188569
    Explore at:
    Dataset updated
    Feb 12, 2024
    Dataset provided by
    Giuseppe Amato
    Fabio Carrara
    Nicola Messina
    Claudio Gennaro
    Paolo Bolettieri
    Fabrizio Falchi
    Claudio Vairo
    Lucia Vadicamo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository contains a diverse set of features extracted from the V3C1+V3C2 dataset, sourced from the Vimeo Creative Commons Collection. These features were utilized in the VISIONE system [Amato et al. 2023, Amato et al. 2022] during the latest editions of the Video Browser Showdown (VBS) competition (https://www.videobrowsershowdown.org/).

    The original V3C1+V3C2 dataset, provided by NIST, can be downloaded using the instructions provided at https://videobrowsershowdown.org/about-vbs/existing-data-and-tools/.

    It comprises 7,235 video files, amounting to 2,300 hours of video content and encompassing 2,508,113 predefined video segments.

    We subdivided the predefined video segments longer than 10 seconds into multiple segments, with each segment spanning no longer than 16 seconds. As a result, we obtained a total of 2,648,219 segments. For each segment, we extracted one frame, specifically the middle one, and computed several features, which are described in detail below.

    This repository is released under a Creative Commons Attribution license. If you use it in any form for your work, please cite the following paper:

    @inproceedings{amato2023visione,
      title={VISIONE at Video Browser Showdown 2023},
      author={Amato, Giuseppe and Bolettieri, Paolo and Carrara, Fabio and Falchi, Fabrizio and Gennaro, Claudio and Messina, Nicola and Vadicamo, Lucia and Vairo, Claudio},
      booktitle={International Conference on Multimedia Modeling},
      pages={615--621},
      year={2023},
      organization={Springer}
    }

    This repository comprises the following files:

    msb.tar.gz contains tab-separated files (.tsv) for each video. Each tsv file reports, for each video segment, the timestamp and frame number marking the start/end of the video segment, along with the timestamp of the extracted middle frame and the associated identifier ("id_visione").

    extract-keyframes-from-msb.tar.gz contains a Python script designed to extract the middle frame of each video segment from the MSB files. To run the script successfully, please ensure that you have the original V3C videos available.

    features-aladin.tar.gz† contains ALADIN [Messina et al. 2022] features extracted for all the segments' middle frames.

    features-clip-laion.tar.gz† contains CLIP ViT-H/14 - LAION-2B [Schuhmann et al. 2022] features extracted for all the segments' middle frames.

    features-clip-openai.tar.gz† contains CLIP ViT-L/14 [Radford et al. 2021] features extracted for all the segments' middle frames.

    features-clip2video.tar.gz† contains CLIP2Video [Fang et al. 2021] features extracted for all the video segments. In particular: 1) we concatenate consecutive short segments to create segments at least 3 seconds long; 2) we downsample the obtained segments to 2.5 fps; 3) we feed the network the first min(36, n) frames, where n is the number of frames in the segment. Note that the minimum processed length is 7 frames, since no segment is shorter than 3 seconds.

    objects-frcnn-oiv4.tar.gz* contains the objects detected using Faster R-CNN+Inception ResNet (trained on Open Images V4 [Kuznetsova et al. 2020]).

    objects-mrcnn-lvis.tar.gz* contains the objects detected using Mask R-CNN [He et al. 2017].

    objects-vfnet64-coco.tar.gz* contains the objects detected using VfNet [Zhang et al. 2021].

    *Please be sure to use the v2 version of this repository, since v1 feature files may contain inconsistencies that have now been corrected.

    *Note on the object annotations: Within an object archive, there is a jsonl file for each video, where each row contains a record of a video segment (the "_id" corresponds to the "id_visione" used in msb.tar.gz). Additionally, there are three arrays representing the detected objects, the corresponding scores, and the bounding boxes. The format of these arrays is as follows:

    "object_class_names": vector with the class name of each detected object.

    "object_scores": scores corresponding to each detected object.

    "object_boxes_yxyx": bounding boxes of the detected objects in the format (ymin, xmin, ymax, xmax).

    †Note on the cross-modal features: The extracted multi-modal features (ALADIN, CLIPs, CLIP2Video) enable internal searches within the V3C1+V3C2 dataset using the query-by-image approach (features can be compared with the dot product). However, to perform searches based on free text, the text needs to be transformed into the joint embedding space according to the specific network being used. Please be aware that the service for transforming text into features is not provided within this repository and should be developed independently using the original feature repositories linked above.

    We have plans to release the code in the future, allowing the reproduction of the VISIONE system, including the instantiation of all the services to transform text into cross-modal features. However, this work is still in progress, and the code is not currently available.

    References:

    [Amato et al. 2023] Amato, G. et al., 2023, January. VISIONE at Video Browser Showdown 2023. In International Conference on Multimedia Modeling (pp. 615-621). Cham: Springer International Publishing.

    [Amato et al. 2022] Amato, G. et al. (2022). VISIONE at Video Browser Showdown 2022. In: , et al. MultiMedia Modeling. MMM 2022. Lecture Notes in Computer Science, vol 13142. Springer, Cham.

    [Fang H. et al. 2021] Fang H. et al., 2021. Clip2video: Mastering video-text retrieval via image clip. arXiv preprint arXiv:2106.11097.

    [He et al. 2017] He, K., Gkioxari, G., Dollár, P. and Girshick, R., 2017. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision (pp. 2961-2969).

    [Kuznetsova et al. 2020] Kuznetsova, A., Rom, H., Alldrin, N., Uijlings, J., Krasin, I., Pont-Tuset, J., Kamali, S., Popov, S., Malloci, M., Kolesnikov, A. and Duerig, T., 2020. The open images dataset v4. International Journal of Computer Vision, 128(7), pp.1956-1981.

    [Lin et al. 2014] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P. and Zitnick, C.L., 2014, September. Microsoft coco: Common objects in context. In European conference on computer vision (pp. 740-755). Springer, Cham.

    [Messina et al. 2022] Messina N. et al., 2022, September. Aladin: distilling fine-grained alignment scores for efficient image-text matching and retrieval. In Proceedings of the 19th International Conference on Content-based Multimedia Indexing (pp. 64-70).

    [Radford et al. 2021] Radford A. et al., 2021, July. Learning transferable visual models from natural language supervision. In International conference on machine learning (pp. 8748-8763). PMLR.

    [Schuhmann et al. 2022] Schuhmann C. et al., 2022. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35, pp.25278-25294.

    [Zhang et al. 2021] Zhang, H., Wang, Y., Dayoub, F. and Sunderhauf, N., 2021. Varifocalnet: An iou-aware dense object detector. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 8514-8523).
