COCO Captions contains over one and a half million captions describing over 330,000 images. For the training and validation images, five independent human-generated captions are provided for each image.
The MS COCO (Microsoft Common Objects in Context) dataset is a large-scale object detection, segmentation, key-point detection, and captioning dataset. The dataset consists of 328K images.
Splits: The first version of the MS COCO dataset was released in 2014. It contains 164K images split into training (83K), validation (41K) and test (41K) sets. In 2015, an additional test set of 81K images was released, comprising all the previous test images and 40K new images.
Based on community feedback, in 2017 the training/validation split was changed from 83K/41K to 118K/5K. The new split uses the same images and annotations. The 2017 test set is a subset of 41K images of the 2015 test set. Additionally, the 2017 release contains a new unannotated dataset of 123K images.
Annotations: The dataset has annotations for:
- object detection: bounding boxes and per-instance segmentation masks with 80 object categories;
- captioning: natural language descriptions of the images (see MS COCO Captions);
- keypoint detection: more than 200,000 images and 250,000 person instances labeled with keypoints (17 possible keypoints, such as left eye, nose, right hip, right ankle);
- stuff image segmentation: per-pixel segmentation masks with 91 stuff categories, such as grass, wall, sky (see MS COCO Stuff);
- panoptic segmentation: full scene segmentation, with 80 thing categories (such as person, bicycle, elephant) and a subset of 91 stuff categories (grass, sky, road);
- dense pose: more than 39,000 images and 56,000 person instances labeled with DensePose annotations, where each labeled person is annotated with an instance id and a mapping between the image pixels that belong to that person's body and a template 3D model.

The annotations are publicly available only for training and validation images.
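For orientation, the snippet below is a minimal sketch of reading the instance annotations with the pycocotools API; the annotation file path, the 2017 layout, and the chosen category are assumptions for illustration.

```python
# Minimal sketch using the pycocotools API (pip install pycocotools).
# The annotation path assumes the standard 2017 release layout.
from pycocotools.coco import COCO

coco = COCO("annotations/instances_val2017.json")

# Look up the 'person' category and the images that contain it.
person_id = coco.getCatIds(catNms=["person"])[0]
img_ids = coco.getImgIds(catIds=[person_id])

# Load the bounding boxes and segmentation annotations for one image.
ann_ids = coco.getAnnIds(imgIds=img_ids[0], catIds=[person_id])
for ann in coco.loadAnns(ann_ids):
    print(ann["bbox"], ann["category_id"])
```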
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
Relations in Captions (REC-COCO) is a new dataset that contains associations between caption tokens and bounding boxes in images. REC-COCO is based on the MS-COCO and V-COCO datasets. For each image in V-COCO, we collect its corresponding captions from MS-COCO and automatically align the concept triplet in V-COCO to the tokens in the caption. This requires finding the token for concepts such as PERSON. As a result, REC-COCO contains the captions and the tokens that correspond to each subject and object, as well as the bounding boxes for the subject and object.
This dataset was created by Deucalion_Sash
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
COCO is a large-scale object detection, segmentation, and captioning dataset (http://cocodataset.org). COCO has several features:
- Object segmentation
- Recognition in context
- Superpixel stuff segmentation
- 330K images (>200K labeled)
- 1.5 million object instances
- 80 object categories
- 91 stuff categories
- 5 captions per image
- 250,000 people with keypoints
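The per-image captions use the same annotation API. As a hedged illustration (assuming the standard captions_val2017.json file from the 2017 release):

```python
# Sketch: reading the captions of one image with pycocotools.
from pycocotools.coco import COCO

caps = COCO("annotations/captions_val2017.json")
img_id = caps.getImgIds()[0]                       # any image id
for ann in caps.loadAnns(caps.getAnnIds(imgIds=[img_id])):
    print(ann["caption"])                          # typically five per image
```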
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Synthetically Spoken COCO
Version 1.0
This dataset contains synthetically generated spoken versions of MS COCO [1] captions. This
dataset was created as part of the research reported in [5].
The speech was generated using gTTS [2]. The dataset consists of the following files:
- dataset.json: Captions associated with MS COCO images. This information comes from [3].
- sentid.txt: List of caption IDs. This file can be used to locate MFCC features of the MP3 files
in the numpy array stored in dataset.mfcc.npy.
- mp3.tgz: MP3 files with the audio. Each file name corresponds to a caption ID in dataset.json
and in sentid.txt.
- dataset.mfcc.npy: Numpy array with the Mel Frequency Cepstral Coefficients extracted from
the audio. Each row corresponds to a caption. The order of the captions corresponds to the
ordering in the file sentid.txt. MFCCs were extracted using [4].
[1] http://mscoco.org/dataset/#overview
[2] https://pypi.python.org/pypi/gTTS
[3] https://github.com/karpathy/neuraltalk
[4] https://github.com/jameslyons/python_speech_features
[5] https://arxiv.org/abs/1702.01991
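To illustrate how the files fit together, here is a small sketch that regenerates speech for one caption with gTTS [2] and looks up the precomputed MFCC row for a caption ID; the caption text, the caption ID, and the allow_pickle flag are assumptions for illustration.

```python
# Sketch: synthesize one caption with gTTS and locate its MFCC row via
# sentid.txt / dataset.mfcc.npy. Caption text and ID are placeholders.
import numpy as np
from gtts import gTTS

# Speech is generated the same way the dataset was produced.
gTTS(text="a man riding a horse on a beach", lang="en").save("example.mp3")

# Align caption IDs with rows of the MFCC array.
with open("sentid.txt") as f:
    sent_ids = [line.strip() for line in f]
mfccs = np.load("dataset.mfcc.npy", allow_pickle=True)  # rows may vary in length
row = mfccs[sent_ids.index("391895")]                   # hypothetical caption ID
print(np.asarray(row).shape)
```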
UIT-ViIC contains manually written captions for images from the Microsoft COCO dataset relating to sports played with a ball. UIT-ViIC consists of 19,250 Vietnamese captions for 3,850 images.
ConQA is a dataset created using the intersection between Visual Genome and MS-COCO. The goal of this dataset is to provide a new benchmark for text-to-image retrieval using shorter and less descriptive queries than the commonly used captions from MS-COCO or Flickr. ConQA consists of 80 queries divided into 50 conceptual and 30 descriptive queries. A descriptive query mentions some of the objects in the image, for instance, "people chopping vegetables", while a conceptual query does not mention objects or only refers to objects in a general context, e.g., "working class life".
Dataset generation For the dataset generation, we followed a three-step workflow: filtering images, generating queries and seeding relevant images, and crowd-sourcing extended annotations.
Filtering images The first step is focused on filtering images that have meaningful scene graphs and captions. To filter the images, we used the following procedure:
- The image should have one or more captions. Hence, we discarded the YFCC images with no caption, keeping only images from the MS-COCO subset of Visual Genome.
- The image should describe a complex scene with multiple objects, so we filtered out all the scene graphs that did not contain any edges.
- The relationships should be verbs and should not contain nouns or pronouns. To detect this, we generated a sentence for each edge by concatenating the words on the labels of the nodes and the relationship, and applied part-of-speech (POS) tagging using the en_core_web_sm model provided by SpaCy. We filtered out every scene graph containing an edge whose relation is neither tagged as a verb nor in an ad-hoc list of allowed non-verb keywords: top, front, end, side, edge, middle, rear, part, bottom, under, next, left, right, center, background, back, and parallel. We allowed these keywords as they represent positional relationships between objects (see the sketch below).

After filtering, we obtain the final set of images.
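As a hedged illustration of the third filter, the sketch below POS-tags the sentence formed by a scene-graph edge with SpaCy's en_core_web_sm model and keeps it only if the relation tokens are verbs or allowed positional keywords; the example edges are made up and this is not the authors' exact code.

```python
# Sketch of the relationship filter described above.
# Requires: python -m spacy download en_core_web_sm
import spacy

ALLOWED = {"top", "front", "end", "side", "edge", "middle", "rear", "part",
           "bottom", "under", "next", "left", "right", "center",
           "background", "back", "parallel"}

nlp = spacy.load("en_core_web_sm")

def edge_is_valid(subject, relation, obj):
    """Keep the edge if every relation token is a verb or an allowed keyword."""
    doc = nlp(f"{subject} {relation} {obj}")
    rel_words = set(relation.lower().split())
    rel_tokens = [t for t in doc if t.text.lower() in rel_words]
    return all(t.pos_ == "VERB" or t.text.lower() in ALLOWED for t in rel_tokens)

print(edge_is_valid("man", "riding", "horse"))   # expected: kept
print(edge_is_valid("cup", "glass", "table"))    # expected: filtered out
```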
Generating Queries To generate ConQA, the dataset authors worked in three pairs and acted as annotators to manually design the queries, namely 50 conceptual and 30 descriptive queries. After that, we used the "ViT-B/32" CLIP model to find relevant images. For conceptual queries, it was challenging to find relevant images directly, so alternative proxy queries were used to identify an initial set of relevant images. These images are the seed for finding other relevant images, which were annotated through Amazon Mechanical Turk.
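A minimal sketch of this seeding step, using the open-source CLIP package and a ConQA-style conceptual query; the image path is a placeholder and this is not the authors' exact pipeline.

```python
# Sketch: scoring an image against a conceptual query with CLIP "ViT-B/32".
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

text = clip.tokenize(["working class life"]).to(device)
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    t = model.encode_text(text)
    v = model.encode_image(image)
    score = torch.cosine_similarity(t, v).item()  # rank candidate images by this
print(score)
```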
Annotation crowdsourcing Starting from the initial relevant set defined by the dataset authors, we expanded the candidate set by retrieving, for each query, the top-100 visually closest images according to a pre-trained ResNet152 model. As a result, we increased the number of potentially relevant images to analyze without adding human bias to the task.
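The candidate-expansion step can be sketched as follows; this assumes a recent torchvision and uses placeholder image paths, so it is an illustration rather than the authors' exact code.

```python
# Sketch: embed images with a pre-trained ResNet152 (classifier removed)
# and take the visually closest images to a seed image.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

backbone = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1)
encoder = torch.nn.Sequential(*list(backbone.children())[:-1]).eval()

tf = T.Compose([T.Resize(256), T.CenterCrop(224), T.ToTensor(),
                T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])])

def embed(path):
    with torch.no_grad():
        return encoder(tf(Image.open(path).convert("RGB")).unsqueeze(0)).flatten(1)

seed = embed("seed.jpg")
pool = torch.cat([embed(p) for p in ["a.jpg", "b.jpg", "c.jpg"]])
sims = torch.nn.functional.cosine_similarity(seed, pool)
top = sims.topk(k=min(100, sims.numel())).indices  # indices of the closest images
```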
After selecting the images to annotate, we set up a set of Human Intelligence Tasks (HITs) on Amazon Mechanical Turk. Each task consisted of a query and 5 potentially relevant images. The workers were instructed to determine whether each image was relevant for the given query; if they were not sure, they could mark the image as “Unsure”. To reduce presentation bias, we randomized the order of the images and the options. Additionally, we included validation tasks with control images to ensure a minimum quality in the annotation process; workers failing 70% or more of the validation queries were excluded.
yhshin1020/coco-img-caption-pairs dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Current captioning datasets focus on object-centric captions, describing the visible objects in the image, often ending up stating the obvious (for humans), e.g. "people eating food in a park". Although these datasets are useful to evaluate the ability of Vision & Language models to recognize and describe visual content, they do not support controlled experiments involving model testing or fine-tuning with more high-level captions, which humans find easy and natural to produce. For example, people often describe images based on the type of scene they depict ("people at a holiday resort") and the actions they perform ("people having a picnic"). Such concepts are based on personal experience and contribute to forming common sense assumptions. We present the High-Level Dataset, which extends 14,997 images from the COCO dataset with a new set of 134,973 human-annotated (high-level) captions collected along three axes: scenes, actions and rationales. We further extend this dataset with confidence scores collected from an independent set of readers, as well as a set of narrative captions generated synthetically by combining each of the three axes. We describe this dataset and analyse it extensively. We also present baseline results for the High-Level Captioning task.
ESP dataset (Evaluation for Styled Prompt dataset) is a new benchmark for zero-shot domain-conditional caption generation. The dataset aims to evaluate the capability to generate diverse domain-specific language conditioned on the same image. It comprises 4.8k captions for 1k images from the COCO Captions test set. Using Amazon MTurk, we collected captions in five everyday text domains: blog, social media, instruction, story, and news.
MIT License: https://opensource.org/licenses/MIT
Social Media Captions
Based on the Instagram Influencer Dataset from Seungbae Kim, Jyun-Yu Jiang, and Wei Wang, extended with photo descriptions generated by the ydshieh/vit-gpt2-coco-en model to create a dataset that can be used to fine-tune Llama-2.
- 60k complete dataset: Waterfront/social-media-captions
- 10k smaller subset: Waterfront/social-media-captions-10k
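Assuming the standard Hugging Face `datasets` library, either split can be loaded directly from the Hub (repository names as listed above):

```python
# Sketch: load the full set or the 10k subset from the Hugging Face Hub.
from datasets import load_dataset

full = load_dataset("Waterfront/social-media-captions")
small = load_dataset("Waterfront/social-media-captions-10k")
print(full)
```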
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
The Visual Dialog (VisDial) dataset contains human-annotated questions based on images from the MS COCO dataset. The dataset was developed by pairing two subjects on Amazon Mechanical Turk to chat about an image. One person was assigned the role of ‘questioner’ and the other acted as ‘answerer’. The questioner sees only the text description of an image (i.e., an image caption from the MS COCO dataset); the original image remains hidden to the questioner. Their task is to ask questions about this hidden image to “imagine the scene better”. The answerer sees the image and caption and answers the questions asked by the questioner. The two of them can continue the conversation by asking and answering questions for at most 10 rounds. VisDial v1.0 contains 123K dialogues on MS COCO (2017 training set) for the training split, 2K dialogues on validation images for the validation split, and 8K dialogues on the test set for the test-standard split. The previously released v0.5 and v0.9 versions of VisDial (corresponding to older splits of MS COCO) are considered deprecated.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
The 3D surgical tool dataset (3dStool) has been constructed with the aim of assisting the development of computer vision techniques for the operating room. Note that functions for visualising, processing, and splitting the dataset can be found in the relevant GitHub repository.
Specifically, even though laparoscopic scenes have received a lot of attention in terms of labelled images, surgical tools that are used at the initial stages of an operation, such as scalpels and scissors, have not had such datasets developed for them.
3dStool includes 5370 images, accompanied by manually drawn polygon labels, as well as information on the 3D pose of these tools in operation. The tools were recorded while operating on a cadaveric knee. A RealSense D415 was used for image collection, while an optical tracker was employed for 3D pose recording. Four surgical tools have been collected for now:
A COCO-format annotation JSON file is provided for the images, containing the masks, boxes, and other relevant information. Furthermore, pose information is provided in two different ways.
Firstly, a csv in the following format:
| Column | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|
| Value | X (m) | Y (m) | Z (m) | qi | qj | qk | ql | Class | Image Name |
Position and orientation are both provided in the coordinate frame of the camera used to obtain the data (RealSense D415, Intel, USA). Pose is provided in the form of quaternions; however, it is possible to convert this representation into other notations.
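As a hedged sketch of that conversion (the CSV file name, the absence of a header row, and the scalar-last quaternion ordering are assumptions to verify against the data):

```python
# Sketch: read one pose row and convert its quaternion to a rotation matrix.
import pandas as pd
from scipy.spatial.transform import Rotation

cols = ["x", "y", "z", "qi", "qj", "qk", "ql", "cls", "image"]
poses = pd.read_csv("poses.csv", names=cols)      # column order as in the table

row = poses.iloc[0]
# SciPy expects scalar-last (x, y, z, w) quaternions by default.
R = Rotation.from_quat([row.qi, row.qj, row.qk, row.ql]).as_matrix()
t = [row.x, row.y, row.z]                         # translation in metres, camera frame
print(R, t)
```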
The pose data can also be combined with the masks to obtain a final COCO-format JSON file that includes the object poses. In the data provided, each of the test, train and validation subsets has its own COCO-like JSON file with the poses fused in, although the "orignal_jsons" only provide the image masks.
The files and directories are structured as follows. Note that this example is based on the "train" directory, but a similar structure has been created for the test and val sets:
This dataset, adapted from COCO Caption, is designed for the image captioning task and evaluates multimodal model editing in terms of reliability, stability and generality.