COCO Captions contains over one and a half million captions describing over 330,000 images. For the training and validation images, five independent human-generated captions are provided for each image.
The MS COCO (Microsoft Common Objects in Context) dataset is a large-scale object detection, segmentation, key-point detection, and captioning dataset. The dataset consists of 328K images.
Splits: The first version of the MS COCO dataset was released in 2014. It contains 164K images split into training (83K), validation (41K) and test (41K) sets. In 2015, an additional test set of 81K images was released, comprising all the previous test images and 40K new images.
Based on community feedback, in 2017 the training/validation split was changed from 83K/41K to 118K/5K. The new split uses the same images and annotations. The 2017 test set is a subset of 41K images of the 2015 test set. Additionally, the 2017 release contains a new unannotated dataset of 123K images.
Annotations: The dataset has annotations for:
- object detection: bounding boxes and per-instance segmentation masks with 80 object categories;
- captioning: natural language descriptions of the images (see MS COCO Captions);
- keypoint detection: more than 200,000 images and 250,000 person instances labeled with keypoints (17 possible keypoints, such as left eye, nose, right hip, right ankle);
- stuff image segmentation: per-pixel segmentation masks with 91 stuff categories, such as grass, wall, sky (see MS COCO Stuff);
- panoptic segmentation: full scene segmentation, with 80 thing categories (such as person, bicycle, elephant) and a subset of 91 stuff categories (grass, sky, road);
- dense pose: more than 39,000 images and 56,000 person instances labeled with DensePose annotations, where each labeled person is annotated with an instance id and a mapping between the image pixels that belong to that person's body and a template 3D model.

The annotations are publicly available only for training and validation images.
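For orientation, the snippet below is a minimal sketch of reading the instance annotations with the pycocotools API; the annotation file path, the 2017 layout, and the chosen category are assumptions for illustration.

```python
# Minimal sketch using the pycocotools API (pip install pycocotools).
# The annotation path assumes the standard 2017 release layout.
from pycocotools.coco import COCO

coco = COCO("annotations/instances_val2017.json")

# Look up the 'person' category and the images that contain it.
person_id = coco.getCatIds(catNms=["person"])[0]
img_ids = coco.getImgIds(catIds=[person_id])

# Load the bounding boxes and segmentation annotations for one image.
ann_ids = coco.getAnnIds(imgIds=img_ids[0], catIds=[person_id])
for ann in coco.loadAnns(ann_ids):
    print(ann["bbox"], ann["category_id"])
```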
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
Relations in Captions (REC-COCO) is a new dataset that contains associations between caption tokens and bounding boxes in images. REC-COCO is based on the MS-COCO and V-COCO datasets. For each image in V-COCO, we collect its corresponding captions from MS-COCO and automatically align the concept triplet in V-COCO to the tokens in the caption. This requires finding the token for concepts such as PERSON. As a result, REC-COCO contains the captions and the tokens that correspond to each subject and object, as well as the bounding boxes for the subject and object.
This dataset was created by Deucalion_Sash
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
COCO is a large-scale object detection, segmentation, and captioning dataset (http://cocodataset.org). COCO has several features:
- Object segmentation
- Recognition in context
- Superpixel stuff segmentation
- 330K images (>200K labeled)
- 1.5 million object instances
- 80 object categories
- 91 stuff categories
- 5 captions per image
- 250,000 people with keypoints
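The per-image captions use the same annotation API. As a hedged illustration (assuming the standard captions_val2017.json file from the 2017 release):

```python
# Sketch: reading the captions of one image with pycocotools.
from pycocotools.coco import COCO

caps = COCO("annotations/captions_val2017.json")
img_id = caps.getImgIds()[0]                       # any image id
for ann in caps.loadAnns(caps.getAnnIds(imgIds=[img_id])):
    print(ann["caption"])                          # typically five per image
```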
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Synthetically Spoken COCO
Version 1.0
This dataset contains synthetically generated spoken versions of MS COCO [1] captions. This
dataset was created as part of the research reported in [5].
The speech was generated using gTTS [2]. The dataset consists of the following files:
- dataset.json: Captions associated with MS COCO images. This information comes from [3].
- sentid.txt: List of caption IDs. This file can be used to locate MFCC features of the MP3 files
in the numpy array stored in dataset.mfcc.npy.
- mp3.tgz: MP3 files with the audio. Each file name corresponds to a caption ID in dataset.json
and in sentid.txt.
- dataset.mfcc.npy: Numpy array with the Mel Frequency Cepstral Coefficients extracted from
the audio. Each row corresponds to a caption. The order of the captions corresponds to the
ordering in the file sentid.txt. MFCCs were extracted using [4].
[1] http://mscoco.org/dataset/#overview
[2] https://pypi.python.org/pypi/gTTS
[3] https://github.com/karpathy/neuraltalk
[4] https://github.com/jameslyons/python_speech_features
[5] https://arxiv.org/abs/1702.01991
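To illustrate how the files fit together, here is a small sketch that regenerates speech for one caption with gTTS [2] and looks up the precomputed MFCC row for a caption ID; the caption text, the caption ID, and the allow_pickle flag are assumptions for illustration.

```python
# Sketch: synthesize one caption with gTTS and locate its MFCC row via
# sentid.txt / dataset.mfcc.npy. Caption text and ID are placeholders.
import numpy as np
from gtts import gTTS

# Speech is generated the same way the dataset was produced.
gTTS(text="a man riding a horse on a beach", lang="en").save("example.mp3")

# Align caption IDs with rows of the MFCC array.
with open("sentid.txt") as f:
    sent_ids = [line.strip() for line in f]
mfccs = np.load("dataset.mfcc.npy", allow_pickle=True)  # rows may vary in length
row = mfccs[sent_ids.index("391895")]                   # hypothetical caption ID
print(np.asarray(row).shape)
```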
UIT-ViIC contains manually written captions for images from the Microsoft COCO dataset relating to sports played with a ball. UIT-ViIC consists of 19,250 Vietnamese captions for 3,850 images.
ConQA is a dataset created using the intersection between Visual Genome and MS-COCO. The goal of this dataset is to provide a new benchmark for text-to-image retrieval using shorter and less descriptive queries than the commonly used captions from MS-COCO or Flickr. ConQA consists of 80 queries divided into 50 conceptual and 30 descriptive queries. A descriptive query mentions some of the objects in the image, for instance, "people chopping vegetables", while a conceptual query does not mention objects or only refers to objects in a general context, e.g., "working class life".
Dataset generation For the dataset generation, we followed a three-step workflow: filtering images, generating queries and seeding relevant images, and crowd-sourcing extended annotations.
Filtering images The first step is focused on filtering images that have meaningful scene graphs and captions. To filter the images, we used the following procedure:
- The image should have one or more captions. Hence, we discarded the YFCC images with no caption, keeping only images from the MS-COCO subset of Visual Genome.
- The image should describe a complex scene with multiple objects, so we filtered out all the scene graphs that did not contain any edges.
- The relationships should be verbs and should not contain nouns or pronouns. To detect this, we generated a sentence for each edge by concatenating the words on the labels of the nodes and the relationship, and applied part-of-speech (POS) tagging using the en_core_web_sm model provided by SpaCy. We filtered out every scene graph containing an edge whose relation is neither tagged as a verb nor in an ad-hoc list of allowed non-verb keywords: top, front, end, side, edge, middle, rear, part, bottom, under, next, left, right, center, background, back, and parallel. We allowed these keywords as they represent positional relationships between objects (see the sketch below).

After filtering, we obtain the final set of images.
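As a hedged illustration of the third filter, the sketch below POS-tags the sentence formed by a scene-graph edge with SpaCy's en_core_web_sm model and keeps it only if the relation tokens are verbs or allowed positional keywords; the example edges are made up and this is not the authors' exact code.

```python
# Sketch of the relationship filter described above.
# Requires: python -m spacy download en_core_web_sm
import spacy

ALLOWED = {"top", "front", "end", "side", "edge", "middle", "rear", "part",
           "bottom", "under", "next", "left", "right", "center",
           "background", "back", "parallel"}

nlp = spacy.load("en_core_web_sm")

def edge_is_valid(subject, relation, obj):
    """Keep the edge if every relation token is a verb or an allowed keyword."""
    doc = nlp(f"{subject} {relation} {obj}")
    rel_words = set(relation.lower().split())
    rel_tokens = [t for t in doc if t.text.lower() in rel_words]
    return all(t.pos_ == "VERB" or t.text.lower() in ALLOWED for t in rel_tokens)

print(edge_is_valid("man", "riding", "horse"))   # expected: kept
print(edge_is_valid("cup", "glass", "table"))    # expected: filtered out
```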
Generating Queries To generate ConQA, the dataset authors worked in three pairs and acted as annotators to manually design the queries, namely 50 conceptual and 30 descriptive queries. After that, we used the "ViT-B/32" CLIP model to find relevant images. For conceptual queries, it was challenging to find relevant images directly, so alternative proxy queries were used to identify an initial set of relevant images. These images are the seed for finding other relevant images, which were annotated through Amazon Mechanical Turk.
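A minimal sketch of this seeding step, using the open-source CLIP package and a ConQA-style conceptual query; the image path is a placeholder and this is not the authors' exact pipeline.

```python
# Sketch: scoring an image against a conceptual query with CLIP "ViT-B/32".
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

text = clip.tokenize(["working class life"]).to(device)
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    t = model.encode_text(text)
    v = model.encode_image(image)
    score = torch.cosine_similarity(t, v).item()  # rank candidate images by this
print(score)
```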
Annotation crowdsourcing Starting from the initial relevant set defined by the dataset authors, we expanded the candidate set by retrieving, for each query, the top-100 visually closest images according to a pre-trained ResNet152 model. As a result, we increased the number of potentially relevant images to analyze without adding human bias to the task.
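The candidate-expansion step can be sketched as follows; this assumes a recent torchvision and uses placeholder image paths, so it is an illustration rather than the authors' exact code.

```python
# Sketch: embed images with a pre-trained ResNet152 (classifier removed)
# and take the visually closest images to a seed image.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

backbone = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1)
encoder = torch.nn.Sequential(*list(backbone.children())[:-1]).eval()

tf = T.Compose([T.Resize(256), T.CenterCrop(224), T.ToTensor(),
                T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])])

def embed(path):
    with torch.no_grad():
        return encoder(tf(Image.open(path).convert("RGB")).unsqueeze(0)).flatten(1)

seed = embed("seed.jpg")
pool = torch.cat([embed(p) for p in ["a.jpg", "b.jpg", "c.jpg"]])
sims = torch.nn.functional.cosine_similarity(seed, pool)
top = sims.topk(k=min(100, sims.numel())).indices  # indices of the closest images
```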
After selecting the images to annotate, we set up a set of Human Intelligence Tasks (HITs) on Amazon Mechanical Turk. Each task consisted of a query and 5 potentially relevant images. The workers were instructed to determine whether each image was relevant for the given query; if they were not sure, they could mark the image as “Unsure”. To reduce presentation bias, we randomized the order of the images and the options. Additionally, we included validation tasks with control images to ensure a minimum quality in the annotation process; workers failing 70% or more of the validation queries were excluded.
yhshin1020/coco-img-caption-pairs dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Current captioning datasets focus on object-centric captions, describing the visible objects in the image, often ending up stating the obvious (for humans), e.g. "people eating food in a park". Although these datasets are useful to evaluate the ability of Vision & Language models to recognize and describe visual content, they do not support controlled experiments involving model testing or fine-tuning with more high-level captions, which humans find easy and natural to produce. For example, people often describe images based on the type of scene they depict ("people at a holiday resort") and the actions they perform ("people having a picnic"). Such concepts are based on personal experience and contribute to forming common sense assumptions. We present the High-Level Dataset, which extends 14,997 images from the COCO dataset with a new set of 134,973 human-annotated (high-level) captions collected along three axes: scenes, actions and rationales. We further extend this dataset with confidence scores collected from an independent set of readers, as well as a set of narrative captions generated synthetically by combining each of the three axes. We describe this dataset and analyse it extensively. We also present baseline results for the High-Level Captioning task.
ESP dataset (Evaluation for Styled Prompt dataset) is a new benchmark for zero-shot domain-conditional caption generation. The dataset aims to evaluate the capability to generate diverse domain-specific language conditioned on the same image. It comprises 4.8k captions for 1k images from the COCO Captions test set. Using Amazon MTurk, we collected captions in five everyday text domains: blog, social media, instruction, story, and news.
MIT License: https://opensource.org/licenses/MIT
Social Media Captions
Based on the Instagram Influencer Dataset from Seungbae Kim, Jyun-Yu Jiang, and Wei Wang, extended with photo descriptions generated by the ydshieh/vit-gpt2-coco-en model to create a dataset that can be used to fine-tune Llama-2.
- 60k complete dataset: Waterfront/social-media-captions
- 10k smaller subset: Waterfront/social-media-captions-10k
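Assuming the standard Hugging Face `datasets` library, either split can be loaded directly from the Hub (repository names as listed above):

```python
# Sketch: load the full set or the 10k subset from the Hugging Face Hub.
from datasets import load_dataset

full = load_dataset("Waterfront/social-media-captions")
small = load_dataset("Waterfront/social-media-captions-10k")
print(full)
```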
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
The Visual Dialog (VisDial) dataset contains human-annotated questions based on images from the MS COCO dataset. The dataset was developed by pairing two subjects on Amazon Mechanical Turk to chat about an image. One person was assigned the role of ‘questioner’ and the other acted as ‘answerer’. The questioner sees only the text description of an image (i.e., an image caption from the MS COCO dataset); the original image remains hidden to the questioner. Their task is to ask questions about this hidden image to “imagine the scene better”. The answerer sees the image and caption and answers the questions asked by the questioner. The two of them can continue the conversation by asking and answering questions for at most 10 rounds. VisDial v1.0 contains 123K dialogues on MS COCO (2017 training set) for the training split, 2K dialogues on validation images for the validation split, and 8K dialogues on the test set for the test-standard split. The previously released v0.5 and v0.9 versions of VisDial (corresponding to older splits of MS COCO) are considered deprecated.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
The 3D surgical tool dataset (3dStool) has been constructed with the aim of assisting the development of computer vision techniques for the operating room. Note that functions for visualising, processing, and splitting the dataset can be found in the relevant GitHub repository.
Specifically, even though laparoscopic scenes have received a lot of attention in terms of labelled images, surgical tools that are used at the initial stages of an operation, such as scalpels and scissors, have not had such datasets developed for them.
3dStool includes 5370 images, accompanied by manually drawn polygon labels, as well as information on the 3D pose of these tools in operation. The tools were recorded while operating on a cadaveric knee. A RealSense D415 was used for image collection, while an optical tracker was employed for 3D pose recording. Four surgical tools have been collected for now:
A COCO-format annotation JSON file is provided for the images, containing the masks, boxes, and other relevant information. Furthermore, pose information is provided in two different ways.
Firstly, a csv in the following format:
| Column | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|
| Value | X (m) | Y (m) | Z (m) | qi | qj | qk | ql | Class | Image Name |
Position and orientation are both provided in the coordinate frame of the camera used to obtain the data (RealSense D415, Intel, USA). Pose is provided in the form of quaternions; however, it is possible to convert this representation into other notations.
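As a hedged sketch of that conversion (the CSV file name, the absence of a header row, and the scalar-last quaternion ordering are assumptions to verify against the data):

```python
# Sketch: read one pose row and convert its quaternion to a rotation matrix.
import pandas as pd
from scipy.spatial.transform import Rotation

cols = ["x", "y", "z", "qi", "qj", "qk", "ql", "cls", "image"]
poses = pd.read_csv("poses.csv", names=cols)      # column order as in the table

row = poses.iloc[0]
# SciPy expects scalar-last (x, y, z, w) quaternions by default.
R = Rotation.from_quat([row.qi, row.qj, row.qk, row.ql]).as_matrix()
t = [row.x, row.y, row.z]                         # translation in metres, camera frame
print(R, t)
```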
The pose data can also be combined with the masks to obtain a final COCO-format JSON file that includes the object poses. In the data provided, each of the test, train and validation subsets has its own COCO-like JSON file with the poses fused in, although the "orignal_jsons" only provide the image masks.
The files and directories are structured as follows. Note that this example is based on the "train" directory, but a similar structure has been created for the test and val sets:
This dataset, adapted from COCO Caption, is designed for the image captioning task and evaluates multimodal model editing in terms of reliability, stability and generality.