COCO is a large-scale object detection, segmentation, and captioning dataset.
Note:
* Some images from the train and validation sets don't have annotations.
* COCO 2014 and 2017 use the same images but different train/val/test splits.
* The test split doesn't have any annotations (only images).
* COCO defines 91 classes, but the data only uses 80 of them.
* Panoptic annotations define 200 classes but only use 133.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('coco', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/coco-2014-1.1.0.png
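Beyond printing whole examples, the per-image boxes and labels live under the objects feature. A minimal sketch, assuming the standard TFDS COCO feature layout (objects/bbox in normalized [ymin, xmin, ymax, xmax] order, objects/label over the 80 detection classes):

import tensorflow_datasets as tfds

# Inspect per-image object annotations (assumes the standard 'coco/2017' feature layout).
ds, info = tfds.load('coco/2017', split='train', with_info=True)
label_names = info.features['objects']['label'].names  # the 80 detection classes

for ex in ds.take(1):
    boxes = ex['objects']['bbox']    # normalized [ymin, xmin, ymax, xmax]
    labels = ex['objects']['label']
    for box, label in zip(boxes.numpy(), labels.numpy()):
        print(label_names[label], box)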
The MS COCO (Microsoft Common Objects in Context) dataset is a large-scale object detection, segmentation, key-point detection, and captioning dataset. The dataset consists of 328K images.
Splits: The first version of the MS COCO dataset was released in 2014. It contains 164K images split into training (83K), validation (41K) and test (41K) sets. In 2015, an additional test set of 81K images was released, including all the previous test images and 40K new images.
Based on community feedback, in 2017 the training/validation split was changed from 83K/41K to 118K/5K. The new split uses the same images and annotations. The 2017 test set is a subset of 41K images of the 2015 test set. Additionally, the 2017 release contains a new unannotated dataset of 123K images.
Annotations: The dataset has annotations for:
- object detection: bounding boxes and per-instance segmentation masks with 80 object categories,
- captioning: natural language descriptions of the images (see MS COCO Captions),
- keypoint detection: more than 200,000 images and 250,000 person instances labeled with keypoints (17 possible keypoints, such as left eye, nose, right hip, right ankle),
- stuff image segmentation: per-pixel segmentation masks with 91 stuff categories, such as grass, wall, sky (see MS COCO Stuff),
- panoptic: full scene segmentation, with 80 thing categories (such as person, bicycle, elephant) and a subset of 91 stuff categories (grass, sky, road),
- dense pose: more than 39,000 images and 56,000 person instances labeled with DensePose annotations; each labeled person is annotated with an instance id and a mapping between the image pixels that belong to that person's body and a template 3D model.
The annotations are publicly available only for training and validation images.
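For reference, the detection annotations are distributed in the standard COCO JSON layout: an images list, an annotations list with [x, y, width, height] boxes in pixels, and a categories list. A hand-written sketch of that structure (the values below are illustrative, not taken from the dataset):

# Illustrative COCO-style detection record (values are made up).
coco_detection = {
    "images": [{"id": 1, "file_name": "000000000001.jpg", "height": 427, "width": 640}],
    "annotations": [{
        "id": 10,
        "image_id": 1,
        "category_id": 18,                    # index into "categories"
        "bbox": [217.6, 240.5, 38.9, 57.8],   # [x, y, width, height] in pixels
        "area": 2248.4,
        "iscrowd": 0,
        "segmentation": [[224.2, 297.0, 228.3, 240.5, 250.1, 247.9]],  # polygon points
    }],
    "categories": [{"id": 18, "name": "dog", "supercategory": "animal"}],
}
print(coco_detection["annotations"][0]["bbox"])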
The Common Objects in Context (COCO) dataset is a widely recognized collection designed to spur object detection, segmentation, and captioning research. Created by Microsoft, COCO provides annotations including object categories, keypoints, and more, which make it a valuable asset for machine learning practitioners and researchers. Today, many model architectures are benchmarked against COCO, which has enabled a standard system by which architectures can be compared.
While COCO is often touted as comprising over 300k images, note that this figure spans all annotation tasks (keypoints, stuff, and so on). The labeled dataset for object detection specifically stands at 123,272 images.
The full labeled object detection dataset is made available here, ensuring researchers have access to the most comprehensive data for their experiments. That said, COCO has not released its test set annotations, meaning the test data does not come with labels; it is therefore not included in the dataset.
The Roboflow team has worked extensively with COCO. Here are a few links that may be helpful as you get started working with this dataset:
This is an open source object detection model by TensorFlow in TensorFlow Lite format. While it is not recommended to use this model in production surveys, it can be useful for demonstration purposes and to get started with smart assistants in ArcGIS Survey123. You are responsible for the use of this model. When using Survey123, it is your responsibility to review and manually correct outputs.
This object detection model was trained using the Common Objects in Context (COCO) dataset. COCO is a large-scale object detection dataset that is available for use under the Creative Commons Attribution 4.0 License. The dataset contains 80 object categories and 1.5 million object instances that include people, animals, food items, vehicles, and household items. For a complete list of common objects this model can detect, see Classes.
The model can be used in ArcGIS Survey123 to detect common objects in photos that are captured with the Survey123 field app.
Using the model: Follow the guide to use the model. You can use this model to detect or redact common objects in images captured with the Survey123 field app. The model must be configured for a survey in Survey123 Connect.
Fine-tuning the model: This model cannot be fine-tuned using ArcGIS tools.
Input: Camera feed (either low-resolution preview or high-resolution capture).
Output: Image with common object detections written to its EXIF metadata, or an image with detected objects redacted.
Model architecture: This is an open source object detection model by TensorFlow in TensorFlow Lite format with MobileNet architecture. The model is available for use under the Apache License 2.0.
Sample results: Here are a few results from the model.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
## Overview
CoCo Val is a dataset for object detection tasks; it contains bounding box annotations for all COCO dataset classes across 9,419 images.
## Getting Started
You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model.
## License
This dataset is available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
COCO-Stuff augments all 164K images of the popular COCO dataset with pixel-level stuff annotations. These annotations can be used for scene understanding tasks like semantic segmentation, object detection and image captioning.
COCO Captions contains over one and a half million captions describing over 330,000 images. For the training and validation images, five independent human-generated captions are provided for each image.
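The captions follow the same COCO JSON convention, with each record tying a caption string to an image_id. A minimal sketch of grouping captions by image, assuming a local copy of a captions annotation file (the path below is an assumption):

import json
from collections import defaultdict

# Group the human captions by image id; the file path is an assumption, adjust to your copy.
with open("annotations/captions_train2017.json") as f:
    captions = json.load(f)

by_image = defaultdict(list)
for ann in captions["annotations"]:      # each entry: {"image_id", "id", "caption"}
    by_image[ann["image_id"]].append(ann["caption"])

some_id = next(iter(by_image))
print(some_id, by_image[some_id])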
https://choosealicense.com/licenses/cdla-permissive-2.0/
About:
The dataset was collected on the https://www.rapidata.ai platform and contains tens of thousands of human annotations of 70+ different kinds of objects. Rapidata makes it easy to collect manual labels in several data modalities; this repository contains freehand drawings on ~2000 images from the COCO dataset. Users are shown an image and are asked to paint a class of objects with a brush tool. There is always a single such object in the image, so the task is not… See the full description on the dataset page: https://huggingface.co/datasets/Rapidata/coco-human-inpainted-objects.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Esefjorden Marine Vegetation Segmentation Dataset (EMVSD): Comprising 17,000 meticulously labeled images, this dataset is suited for instance segmentation tasks and represents a significant leap forward for marine research in the region. The images are stored in YOLO and COCO formats, ensuring compatibility with widely recognized and adopted object detection frameworks. Our decision to make this dataset publicly accessible underscores our commitment to collaborative research and the advancement of the broader scientific community.
Dataset Structure:
- Images:
  - Organized into three subsets: train, val, and test, located under the images/ directory.
  - Each subset contains high-resolution images optimized for object detection and segmentation tasks.
- Annotations:
  - Available in YOLO txt and COCO formats for compatibility with major object detection frameworks.
  - Organized into three subsets: train, val, and test, located under the labels/ directory.
  - Additional metadata:
    - counts.txt: Summary of label distributions.
    - Cache files (train.cache, val.cache, test.cache) for efficient dataset loading.
- Metadata:
  - classes.txt: Definitions for all annotated classes in the dataset.
  - Detailed COCO-format annotations in:
    - train_annotations.json
    - val_annotations.json
    - test_annotations.json
- Configuration File:
  - EMVSD.yaml: Configuration file for seamless integration with machine learning libraries.
Example Directory Structure:
EMVSD/
├── images/
│   ├── train/
│   ├── val/
│   └── test/
├── labels/
│   ├── train/
│   ├── val/
│   ├── test/
│   ├── counts.txt
│   ├── train.cache
│   ├── val.cache
│   └── test.cache
├── classes.txt
├── train_annotations.json
├── val_annotations.json
├── test_annotations.json
└── EMVSD.yaml
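Given this layout, the COCO-format annotation files can be read with the standard json module; the sketch below (paths assumed relative to the dataset root) counts instances per class:

import json
from collections import Counter

# Count EMVSD instance annotations per class from the COCO-format training file.
# Paths follow the directory structure described above.
with open("EMVSD/train_annotations.json") as f:
    coco = json.load(f)

names = {c["id"]: c["name"] for c in coco["categories"]}
counts = Counter(names[a["category_id"]] for a in coco["annotations"])
for name, n in counts.most_common():
    print(f"{name}: {n}")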
Panoptic segmentation aims to unify instance and semantic segmentation in the same framework. Existing works propose to merge instance and semantic segmentation using post-processing layers. Recent works unify both segmentation tasks by producing binary masks and class scores for both things and stuff classes.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
The SketchyCOCO dataset consists of two parts:
Object-level data: 20,198 (train 18,869 + val 1,329) triplets of {foreground sketch, foreground image, foreground edge map} examples covering 14 classes, and 27,683 (train 22,171 + val 5,512) pairs of {background sketch, background image} examples covering 3 classes.
Scene-level data: 14,081 (train 11,265 + val 2,816) pairs of {foreground image & background sketch, scene image} examples, 14,081 (train 11,265 + val 2,816) pairs of {scene sketch, scene image} examples, and the segmentation ground truth for 14,081 (train 11,265 + val 2,816) scene sketches.
Some val scene images come from the train images of the COCO-Stuff dataset to increase the number of val images in SketchyCOCO.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Dataset
The Man OverBoard Drone (MOBDrone) dataset is a large-scale collection of aerial footage images. It contains 126,170 frames extracted from 66 video clips gathered from one UAV flying at an altitude of 10 to 60 meters above the mean sea level. Images are manually annotated with more than 180K bounding boxes localizing objects belonging to 5 categories: person, boat, lifebuoy, surfboard, and wood. More than 113K of these bounding boxes belong to the person category and localize people in the water who are simulating the need to be rescued.
In this repository, we provide:
66 Full HD video clips (total size: 5.5 GB)
126,170 images extracted from the videos at a rate of 30 FPS (total size: 243 GB)
3 annotation files for the extracted images that follow the MS COCO data format (for more info see https://cocodataset.org/#format-data):
annotations_5_custom_classes.json: this file contains annotations concerning all five categories; please note that class ids do not correspond with the ones provided by the MS COCO standard since we account for two new classes not previously considered in the MS COCO dataset --- lifebuoy and wood
annotations_3_coco_classes.json: this file contains annotations concerning the three classes also accounted by the MS COCO dataset --- person, boat, surfboard. Class ids correspond with the ones provided by the MS COCO standard.
annotations_person_coco_classes.json: this file contains annotations concerning only the 'person' class. Class id corresponds to the one provided by the MS COCO standard.
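Because these files follow the MS COCO format, they can be browsed with the standard pycocotools API; a minimal sketch (assuming the annotation file has been downloaded to the working directory):

from pycocotools.coco import COCO

# Load the 3-class (person, boat, surfboard) MOBDrone annotation file.
coco = COCO("annotations_3_coco_classes.json")

person_id = coco.getCatIds(catNms=["person"])[0]
img_ids = coco.getImgIds(catIds=[person_id])   # images containing people
ann_ids = coco.getAnnIds(catIds=[person_id])   # all 'person' bounding boxes
print(f"{len(img_ids)} images with people, {len(ann_ids)} person annotations")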
The MOBDrone dataset is intended as a test data benchmark. However, for researchers interested in using our data also for training purposes, we provide training and test splits:
More details about data generation and the evaluation protocol can be found at our MOBDrone paper: https://arxiv.org/abs/2203.07973
The code to reproduce our results is available at this GitHub Repository: https://github.com/ciampluca/MOBDrone_eval
See also http://aimh.isti.cnr.it/dataset/MOBDrone
Citing the MOBDrone
The MOBDrone is released under a Creative Commons Attribution license, so please cite the MOBDrone if it is used in your work in any form.
Published academic papers should use the academic paper citation for our MOBDrone paper, in which we evaluated several pre-trained state-of-the-art object detectors, focusing on the detection of people overboard:
@inproceedings{MOBDrone2021, title={MOBDrone: a Drone Video Dataset for Man OverBoard Rescue}, author={Donato Cafarelli and Luca Ciampi and Lucia Vadicamo and Claudio Gennaro and Andrea Berton and Marco Paterni and Chiara Benvenuti and Mirko Passera and Fabrizio Falchi}, booktitle={ICIAP2021: 21st International Conference on Image Analysis and Processing}, year={2021} }
and this Zenodo Dataset
@dataset{donato_cafarelli_2022_5996890, author={Donato Cafarelli and Luca Ciampi and Lucia Vadicamo and Claudio Gennaro and Andrea Berton and Marco Paterni and Chiara Benvenuti and Mirko Passera and Fabrizio Falchi}, title = {{MOBDrone: a large-scale drone-view dataset for man overboard detection}}, month = feb, year = 2022, publisher = {Zenodo}, version = {1.0.0}, doi = {10.5281/zenodo.5996890}, url = {https://doi.org/10.5281/zenodo.5996890} }
Personal works, such as machine learning projects/blog posts, should provide a URL to the MOBDrone Zenodo page (https://doi.org/10.5281/zenodo.5996890), though a reference to our MOBDrone paper would also be appreciated.
Contact Information
If you would like further information about the MOBDrone or if you experience any issues downloading files, please contact us at mobdrone[at]isti.cnr.it
Acknowledgements
This work was partially supported by NAUSICAA - "NAUtical Safety by means of Integrated Computer-Assistance Appliances 4.0" project funded by the Tuscany region (CUP D44E20003410009). The data collection was carried out with the collaboration of the Fly&Sense Service of the CNR of Pisa - for the flight operations of remotely piloted aerial systems - and of the Institute of Clinical Physiology (IFC) of the CNR - for the water immersion operations.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
The goal of this task is to train a model that can localize and classify each instance of Person and Car as accurately as possible.
from IPython.display import Markdown, display
# Render the Roboflow-generated README that ships with the dataset.
display(Markdown(open("../input/Car-Person-v2-Roboflow/README.roboflow.txt").read()))
In this notebook, I have processed the images with Roboflow because the COCO-formatted dataset had images of different dimensions and was not already split and exported in the format YOLOv7 expects. To train a custom YOLOv7 model we need to recognize the objects in the dataset. To do so I have taken the following steps:
Image Credit - jinfagang
!git clone https://github.com/WongKinYiu/yolov7 # Downloading YOLOv7 repository and installing requirements
%cd yolov7
!pip install -qr requirements.txt
!pip install -q roboflow
!wget "https://github.com/WongKinYiu/yolov7/releases/download/v0.1/yolov7.pt"
import os
import glob
import wandb
import torch
from roboflow import Roboflow
from kaggle_secrets import UserSecretsClient
from IPython.display import Image, clear_output, display # to display images
print(f"Setup complete. Using torch {torch._version_} ({torch.cuda.get_device_properties(0).name if torch.cuda.is_available() else 'CPU'})")
https://camo.githubusercontent.com/dd842f7b0be57140e68b2ab9cb007992acd131c48284eaf6b1aca758bfea358b/68747470733a2f2f692e696d6775722e636f6d2f52557469567a482e706e67
I will be integrating W&B for visualizations and logging artifacts and comparisons of different models!
try:
    user_secrets = UserSecretsClient()
    wandb_api_key = user_secrets.get_secret("wandb_api")
    wandb.login(key=wandb_api_key)
    anonymous = None
except Exception:
    wandb.login(anonymous='must')
    print('To use your W&B account, go to Add-ons -> Secrets and provide your W&B access token '
          'under the label WANDB. Get your W&B access token from https://wandb.ai/authorize')
wandb.init(project="YOLOvR", name="7. YOLOv7-Car-Person-Custom-Run-7")
https://uploads-ssl.webflow.com/5f6bc60e665f54545a1e52a5/615627e5824c9c6195abfda9_computer-vision-cycle.png
In order to train our custom model, we need to assemble a dataset of representative images with bounding box annotations around the objects that we want to detect. And we need our dataset to be in YOLOv7 format.
In Roboflow, we can choose between two paths:
https://raw.githubusercontent.com/Owaiskhan9654/Yolo-V7-Custom-Dataset-Train-on-Kaggle/main/Roboflow.PNG
user_secrets = UserSecretsClient()
roboflow_api_key = user_secrets.get_secret("roboflow_api")
rf = Roboflow(api_key=roboflow_api_key)
project = rf.workspace("owais-ahmad").project("custom-yolov7-on-kaggle-on-custom-dataset-rakiq")
dataset = project.version(2).download("yolov7")
Here, I am able to pass a number of arguments:
- img: define input image size
- batch: determine batch size
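Putting these together, a hedged sketch of the training call (flag names follow the YOLOv7 repository's train.py; the epoch count, run name, and device are illustrative, and data.yaml is the file exported by Roboflow above):

# Illustrative YOLOv7 training call; adjust epochs, batch size, and paths as needed.
!python train.py --img-size 640 --batch-size 16 --epochs 55 --data {dataset.location}/data.yaml --weights 'yolov7.pt' --device 0 --name yolov7-car-person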
ConQA is a dataset created using the intersection between Visual Genome and MS-COCO. The goal of this dataset is to provide a new benchmark for text-to-image retrieval using short and less descriptive queries than the commonly used captions from MS-COCO or Flickr. ConQA consists of 80 queries divided into 50 conceptual and 30 descriptive queries. A descriptive query mentions some of the objects in the image, for instance, "people chopping vegetables". A conceptual query, in contrast, does not mention objects or only refers to objects in a general context, e.g., "working class life".
Dataset generation: For the dataset generation, we followed a 3-step workflow: filtering images, generating queries and seeding relevant images, and crowd-sourcing extended annotations.
Filtering images: The first step is focused on filtering images that have meaningful scene graphs and captions. To filter the images, we used the following procedure:
- The image should have one or more captions. Hence, we discarded the YFCC images with no caption, obtaining images from the MS-COCO subset of Visual Genome.
- The image should describe a complex scene with multiple objects. We filtered out all the scene graphs that did not contain any edges; images pass this filter.
- The relationships should be verbs and not contain nouns or pronouns. To detect this, we generated sentences for each edge as a concatenation of the words on the labels of the nodes and the relationship, and applied Part-of-Speech tagging using the en_core_web_sm model provided by spaCy. We filter out all scene graphs that contain an edge not tagged as a verb, unless the tag is in an ad-hoc list of allowed non-verb keywords: top, front, end, side, edge, middle, rear, part, bottom, under, next, left, right, center, background, back, and parallel. We allowed these keywords as they represent positional relationships between objects (a simplified sketch of this check appears below). After filtering, we obtain images.
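A simplified sketch of that relationship check, using spaCy's en_core_web_sm tagger (the helper name and example edges are illustrative, and this only approximates the procedure described above):

import spacy

nlp = spacy.load("en_core_web_sm")

# Positional keywords allowed even though they are not verbs.
ALLOWED = {"top", "front", "end", "side", "edge", "middle", "rear", "part", "bottom",
           "under", "next", "left", "right", "center", "background", "back", "parallel"}

def edge_is_valid(subj: str, relation: str, obj: str) -> bool:
    """POS-tag 'subj relation obj' and require every relationship token to be
    a verb or one of the allowed positional keywords."""
    doc = nlp(f"{subj} {relation} {obj}")
    n_subj, n_rel = len(subj.split()), len(relation.split())
    rel_tokens = doc[n_subj:n_subj + n_rel]   # tokens covering the relationship words
    return all(t.pos_ == "VERB" or t.text.lower() in ALLOWED for t in rel_tokens)

print(edge_is_valid("person", "chopping", "vegetables"))
print(edge_is_valid("cup", "left of", "table"))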
Generating queries: To generate ConQA, the dataset authors worked in three pairs and acted as annotators to manually design the queries, namely 50 conceptual and 30 descriptive queries. After that, we used the "ViT-B/32" model from CLIP to find relevant images. For conceptual queries, it was challenging to find relevant images directly, so alternative proxy queries were used to identify an initial set of relevant images. These images are the seed for finding other relevant images that were annotated through Amazon Mechanical Turk.
Annotation crowdsourcing: Having the initial relevant set defined by the dataset authors, we expanded the relevant candidates by looking at the top-100 visually closest images for each query according to a pre-trained ResNet152 model. As a result, we increase the number of potentially relevant images to analyze without adding human bias to the task.
After selecting the images to annotate, we set up a set of Human Intelligence Tasks (HITs) on Amazon Mechanical Turk. Each task consisted of a query and 5 potentially relevant images. Then, the workers were instructed to determine whether each image is relevant for the given query. If they were not sure, they could alternatively mark the image as “Unsure”. To reduce presentation bias, we randomize the order of images and the options. Additionally, we include validation tasks with control images to ensure a minimum quality in the annotation process, so workers failing 70% or more of validation queries were excluded.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
The COCOPlaces dataset contains foreground objects from the COCO dataset overlaid on backgrounds from the Places dataset. It contains annotations and is suitable for benchmarking disentangling or factor identification. Originally used for the project https://github.com/vitskvara/sgad. There are two versions: non-mixed and mixed. In the non-mixed version (uniform_data_64.npy and uniform_labels_64.npy), a total of 10 classes of images were created, where in a single class, the background and object labels are the same. In the mixed version (mashed_data_64.npy and mashed_labels_64.npy), each image has a random object and background (out of 10 possible classes). The label is then a tuple of two numbers describing the individual (object, background) labels.
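A minimal sketch of loading the arrays described above (file names are those listed; array shapes depend on the release):

import numpy as np

# Non-mixed version: background and object labels coincide.
data = np.load("uniform_data_64.npy")      # images
labels = np.load("uniform_labels_64.npy")  # one class label per image
print(data.shape, labels.shape)

# Mixed version: each label is an (object, background) pair.
mixed_labels = np.load("mashed_labels_64.npy")
print(mixed_labels[:5])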
A collection of 3 referring expression datasets based off images in the COCO dataset. A referring expression is a piece of text that describes a unique object in an image. These datasets are collected by asking human raters to disambiguate objects delineated by bounding boxes in the COCO dataset.
RefCoco and RefCoco+ are from Kazemzadeh et al. 2014. RefCoco+ expressions are strictly appearance-based descriptions, which was enforced by preventing raters from using location-based descriptions (e.g., "person to the right" is not a valid description for RefCoco+). RefCocoG is from Mao et al. 2016, and has richer descriptions of objects than RefCoco due to differences in the annotation process. In particular, RefCoco was collected in an interactive game-based setting, while RefCocoG was collected in a non-interactive setting. On average, RefCocoG has 8.4 words per expression while RefCoco has 3.5 words.
Each dataset has different split allocations that are typically all reported in papers. The "testA" and "testB" sets in RefCoco and RefCoco+ contain only people and only non-people respectively. Images are partitioned into the various splits. In the "google" split, objects, not images, are partitioned between the train and non-train splits. This means that the same image can appear in both the train and validation split, but the objects being referred to in the image will be different between the two sets. In contrast, the "unc" and "umd" splits partition images between the train, validation, and test split. In RefCocoG, the "google" split does not have a canonical test set, and the validation set is typically reported in papers as "val*".
Stats for each dataset and split ("refs" is the number of referring expressions, and "images" is the number of images):
dataset | partition | split | refs | images |
---|---|---|---|---|
refcoco | google | train | 40000 | 19213 |
refcoco | google | val | 5000 | 4559 |
refcoco | google | test | 5000 | 4527 |
refcoco | unc | train | 42404 | 16994 |
refcoco | unc | val | 3811 | 1500 |
refcoco | unc | testA | 1975 | 750 |
refcoco | unc | testB | 1810 | 750 |
refcoco+ | unc | train | 42278 | 16992 |
refcoco+ | unc | val | 3805 | 1500 |
refcoco+ | unc | testA | 1975 | 750 |
refcoco+ | unc | testB | 1798 | 750 |
refcocog | google | train | 44822 | 24698 |
refcocog | google | val | 5000 | 4650 |
refcocog | umd | train | 42226 | 21899 |
refcocog | umd | val | 2573 | 1300 |
refcocog | umd | test | 5023 | 2600 |
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('ref_coco', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/ref_coco-refcoco_unc-1.1.0.png
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Characteristics of the COCO dataset.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Mechanical Parts Dataset
The dataset consists of a total of 2250 images downloaded from various internet platforms. Among the images in the dataset, there are 714 images with bearings, 632 images with bolts, 616 images with gears, and 586 images with nuts. A total of 10597 manual labels were created, including 2099 labels belonging to the bearing class, 2734 labels belonging to the bolt class, 2662 labels belonging to the gear class, and 3102 labels belonging to the nut class.
Folder Content
The dataset is divided into three subsets: 80% train, 10% validation, and 10% test. In the "Mechanical Parts Dataset" folder, there are three separate folders named "train", "test" and "val". Each of these three folders contains folders named "images" and "labels". Images are kept in the "images" folder and label information is kept in the "labels" folder.
Finally, inside the folder there is a yaml file named "mech_parts_data" for the YOLO algorithm. This file contains the number of classes and the class names.
Images and Labels
The dataset was prepared in accordance with the YOLOv5 format.
For example, the label information for the image named "2a0xhkr_jpg.rf.45a11bf63c40ad6e47da384fdf6bb7a1.jpg" is stored in a txt file with the same name. The label information (coordinates) in the txt file is given as "class x_center y_center width height".
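A minimal sketch of reading one such label file and converting the normalized coordinates back to pixel boxes (the folder path and image size below are assumptions for illustration):

def yolo_to_pixel_box(line: str, img_w: int, img_h: int):
    """Convert one 'class x_center y_center width height' label line to pixel coordinates."""
    cls, xc, yc, w, h = line.split()
    xc, yc, w, h = float(xc) * img_w, float(yc) * img_h, float(w) * img_w, float(h) * img_h
    x_min, y_min = xc - w / 2, yc - h / 2
    return int(cls), (x_min, y_min, x_min + w, y_min + h)

# Path assumed to follow the train/val/test layout described above.
with open("train/labels/2a0xhkr_jpg.rf.45a11bf63c40ad6e47da384fdf6bb7a1.txt") as f:
    for line in f:
        print(yolo_to_pixel_box(line, img_w=640, img_h=640))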
Update 05.01.2023
***Pascal VOC and COCO JSON formats have been added.***
Related paper: doi.org/10.5281/zenodo.7496767
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Dictionary files for one-hot encoding of Flickr, ImageNet, and COCO classes.
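A minimal sketch of how such a class dictionary might be used for one-hot encoding (the file name and JSON format below are assumptions, not taken from the dataset):

import json
import numpy as np

# Hypothetical dictionary file mapping class names to integer indices.
with open("coco_classes.json") as f:
    class_to_index = json.load(f)

def one_hot(name: str) -> np.ndarray:
    vec = np.zeros(len(class_to_index), dtype=np.float32)
    vec[class_to_index[name]] = 1.0
    return vec

print(one_hot("dog"))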
This is a subset of the COCO 2017 dataset. It is used for image classification and class-conditional image generation.