16 datasets found
  1. COCO Captions Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Sep 13, 2022
    Cite
    Xinlei Chen; Hao Fang; Tsung-Yi Lin; Ramakrishna Vedantam; Saurabh Gupta; Piotr Dollar; C. Lawrence Zitnick (2022). COCO Captions Dataset [Dataset]. https://paperswithcode.com/dataset/coco-captions
    Explore at:
    Dataset updated
    Sep 13, 2022
    Authors
    Xinlei Chen; Hao Fang; Tsung-Yi Lin; Ramakrishna Vedantam; Saurabh Gupta; Piotr Dollar; C. Lawrence Zitnick
    Description

    COCO Captions contains over one and a half million captions describing over 330,000 images. For the training and validation images, five independent human-generated captions are provided for each image.
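
    As a rough, unofficial illustration of how these captions are typically consumed, the sketch below reads them with the pycocotools COCO API; the annotation file name and path are assumptions, not part of this dataset record.

    # Minimal sketch (assumed paths): list the reference captions for one image
    # using the official pycocotools API.
    from pycocotools.coco import COCO

    coco_caps = COCO("annotations/captions_val2017.json")  # assumed local file
    img_id = coco_caps.getImgIds()[0]                       # any image id
    ann_ids = coco_caps.getAnnIds(imgIds=[img_id])
    for ann in coco_caps.loadAnns(ann_ids):                 # typically five captions per image
        print(ann["caption"])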

  2. MS COCO Dataset

    • paperswithcode.com
    Updated Apr 15, 2024
    Cite
    Tsung-Yi Lin; Michael Maire; Serge Belongie; Lubomir Bourdev; Ross Girshick; James Hays; Pietro Perona; Deva Ramanan; C. Lawrence Zitnick; Piotr Dollár, MS COCO Dataset [Dataset]. https://paperswithcode.com/dataset/coco
    Explore at:
    Dataset updated
    Apr 15, 2024
    Authors
    Tsung-Yi Lin; Michael Maire; Serge Belongie; Lubomir Bourdev; Ross Girshick; James Hays; Pietro Perona; Deva Ramanan; C. Lawrence Zitnick; Piotr Dollár
    Description

    The MS COCO (Microsoft Common Objects in Context) dataset is a large-scale object detection, segmentation, key-point detection, and captioning dataset. The dataset consists of 328K images.

    Splits: The first version of the MS COCO dataset was released in 2014. It contains 164K images split into training (83K), validation (41K) and test (41K) sets. In 2015, an additional test set of 81K images was released, including all the previous test images and 40K new images.

    Based on community feedback, in 2017 the training/validation split was changed from 83K/41K to 118K/5K. The new split uses the same images and annotations. The 2017 test set is a subset of 41K images of the 2015 test set. Additionally, the 2017 release contains a new unannotated dataset of 123K images.

    Annotations: The dataset has annotations for

    • Object detection: bounding boxes and per-instance segmentation masks with 80 object categories.
    • Captioning: natural language descriptions of the images (see MS COCO Captions).
    • Keypoint detection: more than 200,000 images and 250,000 person instances labeled with keypoints (17 possible keypoints, such as left eye, nose, right hip, right ankle).
    • Stuff image segmentation: per-pixel segmentation masks with 91 stuff categories, such as grass, wall, sky (see MS COCO Stuff).
    • Panoptic: full scene segmentation, with 80 thing categories (such as person, bicycle, elephant) and a subset of 91 stuff categories (grass, sky, road).
    • Dense pose: more than 39,000 images and 56,000 person instances labeled with DensePose annotations – each labeled person is annotated with an instance id and a mapping between image pixels that belong to that person's body and a template 3D model.

    The annotations are publicly available only for training and validation images.
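
    As a hedged sketch (file names are assumptions), the detection annotations listed above can be browsed with pycocotools:

    # Read bounding boxes for a couple of the 80 object categories.
    from pycocotools.coco import COCO

    coco = COCO("annotations/instances_val2017.json")        # assumed local file
    cat_ids = coco.getCatIds(catNms=["person", "bicycle"])   # 2 of the 80 thing categories
    img_ids = coco.getImgIds(catIds=cat_ids)
    anns = coco.loadAnns(coco.getAnnIds(imgIds=img_ids[:1], catIds=cat_ids))
    for a in anns:
        print(a["category_id"], a["bbox"])                   # bbox is [x, y, width, height]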

  3. COCO Dataset 2017

    • gts.ai
    json
    + more versions
    Cite
    GTS, COCO Dataset 2017 [Dataset]. https://gts.ai/dataset-download/coco-dataset-2017/
    Explore at:
    Available download formats: json
    Dataset provided by
    GLOBOSE TECHNOLOGY SOLUTIONS PRIVATE LIMITED
    Authors
    GTS
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    The MS COCO (Microsoft Common Objects in Context) dataset is a large-scale object detection, segmentation, key-point detection, and captioning dataset.

  4. REC-COCO (Relations in Captions)

    • opendatalab.com
    zip
    Updated Mar 22, 2023
    + more versions
    Cite
    Ikerbasque (2023). REC-COCO (Relations in Captions) [Dataset]. https://opendatalab.com/OpenDataLab/REC-COCO
    Explore at:
    Available download formats: zip (1019467718 bytes)
    Dataset updated
    Mar 22, 2023
    Dataset provided by
    Ikerbasque (Basque Foundation for Science), http://www.ikerbasque.net/
    University of the Basque Country
    Donostia International Physics Center
    Description

    Relations in Captions (REC-COCO) is a new dataset that contains associations between caption tokens and bounding boxes in images. REC-COCO is based on the MS-COCO and V-COCO datasets. For each image in V-COCO, we collect their corresponding captions from MS-COCO and automatically align the concept triplet in V-COCO to the tokens in the caption. This requires finding the token for concepts such as PERSON. As a result, REC-COCO contains the captions and the tokens which correspond to each subject and object, as well as the bounding boxes for the subject and object.
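
    To make the alignment concrete, here is a purely hypothetical record shape; the field names are illustrative only and mirror the description above, not the actual REC-COCO schema.

    # Illustrative only: a caption with its aligned subject/object tokens and boxes.
    example_record = {
        "caption": "a man riding a horse on a beach",
        "subject_token": "man",                       # token aligned to the V-COCO subject
        "object_token": "horse",                      # token aligned to the V-COCO object
        "subject_bbox": [102.0, 55.0, 180.0, 320.0],  # assumed [x, y, width, height]
        "object_bbox": [60.0, 120.0, 400.0, 310.0],
    }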

  5. Ms COCO sketch-caption-image Dataset

    • kaggle.com
    Updated Oct 8, 2024
    Cite
    Deucalion_Sash (2024). Ms COCO sketch-caption-image Dataset [Dataset]. https://www.kaggle.com/datasets/deucalionsash/ms-coco-sketch-caption-image-dataset/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 8, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Deucalion_Sash
    Description

    Dataset

    This dataset was created by Deucalion_Sash


  6. COCO

    • scidm.nchc.org.tw
    Updated Oct 10, 2020
    + more versions
    Cite
    (2020). COCO [Dataset]. https://scidm.nchc.org.tw/dataset/coco
    Explore at:
    Dataset updated
    Oct 10, 2020
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    COCO is a large-scale object detection, segmentation, and captioning dataset (http://cocodataset.org). COCO has several features:

    • Object segmentation
    • Recognition in context
    • Superpixel stuff segmentation
    • 330K images (>200K labeled)
    • 1.5 million object instances
    • 80 object categories
    • 91 stuff categories
    • 5 captions per image
    • 250,000 people with keypoints

  7. Synthetically Spoken COCO

    • zenodo.org
    application/gzip, bin +2
    Updated Jan 24, 2020
    Cite
    Grzegorz Chrupała; Lieke Gelderloos; Afra Alishahi (2020). Synthetically Spoken COCO [Dataset]. http://doi.org/10.5281/zenodo.400926
    Explore at:
    Available download formats: txt, json, bin, application/gzip
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Grzegorz Chrupała; Lieke Gelderloos; Afra Alishahi
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Synthetically Spoken COCO

    Version 1.0

    This dataset contains synthetically generated spoken versions of MS COCO [1] captions. This
    dataset was created as part of the research reported in [5].
    The speech was generated using gTTS [2]. The dataset consists of the following files:

    - dataset.json: Captions associated with MS COCO images. This information comes from [3].
    - sentid.txt: List of caption IDs. This file can be used to locate the MFCC features of the MP3 files
    in the numpy array stored in dataset.mfcc.npy.
    - mp3.tgz: MP3 files with the audio. Each file name corresponds to a caption ID in dataset.json
    and in sentid.txt.
    - dataset.mfcc.npy: Numpy array with the Mel Frequency Cepstral Coefficients extracted from
    the audio. Each row corresponds to a caption. The order of the captions corresponds to the
    ordering in the file sentid.txt. MFCCs were extracted using [4]. (A small loading sketch follows the references below.)

    [1] http://mscoco.org/dataset/#overview
    [2] https://pypi.python.org/pypi/gTTS
    [3] https://github.com/karpathy/neuraltalk
    [4] https://github.com/jameslyons/python_speech_features
    [5] https://arxiv.org/abs/1702.01991
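
    A small loading sketch under assumptions (local file names as listed above; the MFCC array is assumed to hold one variable-length feature matrix per caption):

    # Pair each MFCC entry with its caption ID from sentid.txt.
    import numpy as np

    mfccs = np.load("dataset.mfcc.npy", allow_pickle=True)   # one entry per caption
    with open("sentid.txt") as f:
        sent_ids = [line.strip() for line in f]

    assert len(sent_ids) == len(mfccs)        # rows follow the ordering of sentid.txt
    features_by_id = dict(zip(sent_ids, mfccs))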

  8. UIT-ViIC Dataset

    • paperswithcode.com
    • huggingface.co
    Updated Feb 24, 2021
    + more versions
    Cite
    Quan Hoang Lam; Quang Duy Le; Kiet Van Nguyen; Ngan Luu-Thuy Nguyen (2021). UIT-ViIC Dataset [Dataset]. https://paperswithcode.com/dataset/uit-viic
    Explore at:
    Dataset updated
    Feb 24, 2021
    Authors
    Quan Hoang Lam; Quang Duy Le; Kiet Van Nguyen; Ngan Luu-Thuy Nguyen
    Description

    UIT-ViIC contains manually written captions for images from the Microsoft COCO dataset relating to sports played with a ball. UIT-ViIC consists of 19,250 Vietnamese captions for 3,850 images.

  9. ConQA Dataset

    • paperswithcode.com
    Updated Apr 2, 2024
    Cite
    ConQA Dataset [Dataset]. https://paperswithcode.com/dataset/conqa
    Explore at:
    Dataset updated
    Apr 2, 2024
    Authors
    Juan Manuel Rodriguez; Nima Tavassoli; Eliezer Levy; Gil Lederman; Dima Sivov; Matteo Lissandrini; Davide Mottin
    Description

    ConQA is a dataset created using the intersection between Visual Genome and MS-COCO. The goal of this dataset is to provide a new benchmark for text-to-image retrieval using short and less descriptive queries than the commonly used captions from MS-COCO or Flickr. ConQA consists of 80 queries divided into 50 conceptual and 30 descriptive queries. A descriptive query mentions some of the objects in the image, for instance, "people chopping vegetables", while a conceptual query does not mention objects or only refers to objects in a general context, e.g., "working class life".

    Dataset generation

    For the dataset generation, we followed a three-step workflow: filtering images, generating queries and seeding relevant images, and crowd-sourcing extended annotations.

    Filtering images

    The first step is focused on filtering images that have meaningful scene graphs and captions. To filter the images, we used the following procedure:

    • The image should have one or more captions. Hence, we discarded the YFCC images with no caption, obtaining images from the MS-COCO subset of Visual Genome.
    • The image should describe a complex scene with multiple objects. We filtered all the scene graphs that did not contain any edges. images pass this filter.
    • The relationships should be verbs and not contain nouns or pronouns. To detect this, we generated sentences for each edge as a concatenation of the words on the labels of the nodes and the relationship, and applied Part-of-Speech tagging. We performed the POS tagging using the model en_core_web_sm provided by SpaCy (see the sketch below). We filtered all scene graphs that contain an edge not tagged as a verb or whose tag is not in an ad-hoc list of allowed non-verb keywords. The allowed keywords are top, front, end, side, edge, middle, rear, part, bottom, under, next, left, right, center, background, back, and parallel. We allowed these keywords as they represent positional relationships between objects. After filtering, we obtain images.
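
    A minimal sketch of that POS-based edge check, assuming edges arrive as (subject, relation, object) label strings; this is an approximation of the described filter, not the authors' code.

    # Tag a synthetic "subject relation object" sentence and accept the edge only if
    # every relation token is a verb or one of the allowed positional keywords.
    import spacy

    nlp = spacy.load("en_core_web_sm")
    ALLOWED = {"top", "front", "end", "side", "edge", "middle", "rear", "part", "bottom",
               "under", "next", "left", "right", "center", "background", "back", "parallel"}

    def edge_is_valid(subject, relation, obj):
        doc = nlp(f"{subject} {relation} {obj}")
        rel_tokens = [t for t in doc if t.text in relation.split()]
        return all(t.pos_ == "VERB" or t.text.lower() in ALLOWED for t in rel_tokens)

    # edge_is_valid("person", "riding", "horse")  -> True  (riding is tagged as a verb)
    # edge_is_valid("cup", "coffee", "table")     -> False (coffee is a noun)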

    Generating queries

    To generate ConQA, the dataset authors worked in three pairs and acted as annotators to manually design the queries, namely 50 conceptual and 30 descriptive queries. After that, we used the model "ViT-B/32" from CLIP to find relevant images. For conceptual queries, it was challenging to find relevant images directly, so alternative proxy queries were used to identify an initial set of relevant images. These images are the seed for finding other relevant images that were annotated through Amazon Mechanical Turk.
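
    A hedged sketch of the CLIP step, using the Hugging Face port of ViT-B/32 (the authors may have used the original openai/CLIP code; image paths are placeholders):

    # Score candidate images against one query and rank them by similarity.
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    images = [Image.open(p) for p in ["img1.jpg", "img2.jpg"]]   # placeholder paths
    inputs = processor(text=["working class life"], images=images,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        scores = model(**inputs).logits_per_text.squeeze(0)      # one score per image
    ranking = scores.argsort(descending=True)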

    Annotation crowdsourcing

    Having the initial relevant set defined by the dataset authors, we expanded the relevant candidates by looking at the top-100 visually closest images according to a pre-trained ResNet152 model for each query. As a result, we increased the number of potentially relevant images to analyze without adding human bias to the task.
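
    A sketch of that visual-neighbour expansion under assumptions (images are already preprocessed into normalized 224x224 tensors named seed_images and candidate_images; these names are placeholders, not the authors' code):

    # Embed images with a pre-trained ResNet152 and take the 100 candidates
    # closest (by cosine similarity) to any seed image of a query.
    import torch
    import torch.nn.functional as F
    from torchvision import models

    resnet = models.resnet152(weights=models.ResNet152_Weights.DEFAULT)
    resnet.fc = torch.nn.Identity()              # keep the 2048-d pooled features
    resnet.eval()

    @torch.no_grad()
    def embed(batch):                            # batch: float tensor [N, 3, 224, 224]
        return F.normalize(resnet(batch), dim=1)

    seed_vecs = embed(seed_images)               # placeholder tensor of seed images
    cand_vecs = embed(candidate_images)          # placeholder tensor of the candidate pool
    sims = cand_vecs @ seed_vecs.T               # cosine similarity to each seed
    top100 = sims.max(dim=1).values.topk(100).indices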

    After selecting the images to annotate, we set up a set of Human Intelligence Tasks (HITs) on Amazon Mechanical Turk. Each task consisted of a query and 5 potentially relevant images. Then, the workers were instructed to determine whether each image was relevant for the given query. If they were not sure, they could alternatively mark the image as “Unsure”. To reduce presentation bias, we randomized the order of images and the options. Additionally, we included validation tasks with control images to ensure a minimum quality in the annotation process, so workers failing 70% or more of the validation queries were excluded.

  10. coco-img-caption-pairs

    • huggingface.co
    Updated Sep 26, 2024
    + more versions
    Cite
    coco-img-caption-pairs [Dataset]. https://huggingface.co/datasets/yhshin1020/coco-img-caption-pairs
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 26, 2024
    Authors
    YH
    Description

    yhshin1020/coco-img-caption-pairs is a dataset hosted on Hugging Face and contributed by the HF Datasets community.

  11. Data from: HL Dataset: Visually-grounded Description of Scenes, Actions and Rationales

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Feb 28, 2024
    Cite
    Michele Cafagna; Kees van Deemter; Albert Gatt (2024). HL Dataset: Visually-grounded Description of Scenes, Actions and Rationales [Dataset]. http://doi.org/10.5281/zenodo.10723071
    Explore at:
    Available download formats: zip
    Dataset updated
    Feb 28, 2024
    Dataset provided by
    Association for Computational Linguistics
    Authors
    Michele Cafagna; Kees van Deemter; Albert Gatt
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Current captioning datasets focus on object-centric captions, describing the visible objects in the image, often ending up stating the obvious (for humans), e.g. "people eating food in a park". Although these datasets are useful to evaluate the ability of Vision & Language models to recognize and describe visual content, they do not support controlled experiments involving model testing or fine-tuning, with more high-level captions, which humans find easy and natural to produce. For example, people often describe images based on the type of scene they depict ("people at a holiday resort") and the actions they perform ("people having a picnic"). Such concepts are based on personal experience and contribute to forming common sense assumptions. We present the High-Level Dataset, a dataset extending 14997 images from the COCO dataset, aligned with a new set of 134,973 human-annotated (high-level) captions collected along three axes: scenes, actions and rationales. We further extend this dataset with confidence scores collected from an independent set of readers, as well as a set of narrative captions generated synthetically, by combining each of the three axes. We describe this dataset and analyse it extensively. We also present baseline results for the High-Level Captioning task.

  12. ESP Dataset Dataset

    • paperswithcode.com
    Updated Mar 24, 2023
    + more versions
    Cite
    (2023). ESP Dataset Dataset [Dataset]. https://paperswithcode.com/dataset/esp-dataset
    Explore at:
    Dataset updated
    Mar 24, 2023
    Description

    ESP dataset (Evaluation for Styled Prompt dataset) is a new benchmark for zero-shot domain-conditional caption generation. The dataset aims to evaluate the capability to generate diverse domain-specific language conditioned on the same image. It comprises 4.8k captions for 1k images from the COCO Captions test set, collected with Amazon MTurk across five everyday text domains: blog, social media, instruction, story, and news.

  13. social-media-captions-20k

    • huggingface.co
    + more versions
    Cite
    Johannes, social-media-captions-20k [Dataset]. https://huggingface.co/datasets/Waterfront/social-media-captions-20k
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Authors
    Johannes
    License

    MIT License, https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Social Media Captions

    Based on the Instagram Influencer Dataset from Seungbae Kim, Jyun-Yu Jiang, and Wei Wang, extended with photo descriptions from the ydshieh/vit-gpt2-coco-en model to create a dataset that can be used to fine-tune Llama-2 (a captioning sketch follows the variant links below).

    • 60k complete dataset: Waterfront/social-media-captions
    • 10k smaller subset: Waterfront/social-media-captions-10k
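
    For context, a minimal sketch of generating such a photo description with the ydshieh/vit-gpt2-coco-en model via the transformers image-to-text pipeline; the image path is a placeholder, and the dataset authors' exact pipeline is not documented here.

    # Caption one image with the same model used to extend the dataset.
    from transformers import pipeline

    captioner = pipeline("image-to-text", model="ydshieh/vit-gpt2-coco-en")
    print(captioner("photo.jpg"))   # placeholder path; returns a list of generated_text dicts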

  14. VisDial (Visual Dialog)

    • opendatalab.com
    zip
    Updated Jan 7, 2024
    + more versions
    Cite
    University of California, Berkeley (2024). VisDial (Visual Dialog) [Dataset]. https://opendatalab.com/OpenDataLab/VisDial
    Explore at:
    Available download formats: zip (1100899678 bytes)
    Dataset updated
    Jan 7, 2024
    Dataset provided by
    Georgia Institute of Technology
    Carnegie Mellon University
    Virginia Polytechnic Institute and State University
    University of California, Berkeley
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Visual Dialog (VisDial) dataset contains human-annotated questions based on images from the MS COCO dataset. It was developed by pairing two subjects on Amazon Mechanical Turk to chat about an image. One person was assigned the role of ‘questioner’ and the other acted as ‘answerer’. The questioner sees only the text description of an image (i.e., an image caption from the MS COCO dataset); the original image remains hidden to the questioner. Their task is to ask questions about this hidden image to “imagine the scene better”. The answerer sees the image and caption and answers the questions asked by the questioner. The two of them can continue the conversation by asking and answering questions for at most 10 rounds. VisDial v1.0 contains 123K dialogues on MS COCO (2017 training set) for the training split, 2K dialogues with validation images for the validation split, and 8K dialogues on the test set for the test-standard split. The previously released v0.5 and v0.9 versions of the VisDial dataset (corresponding to older splits of MS COCO) are considered deprecated.
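
    A hedged sketch of walking one dialogue, assuming the commonly documented v1.0 JSON layout (data.questions / data.answers / data.dialogs with index-based turns); verify the keys against the file you download, as they may differ.

    # Print the caption and the question-answer rounds of the first dialogue.
    import json

    with open("visdial_1.0_train.json") as f:       # assumed file name
        data = json.load(f)["data"]

    questions, answers = data["questions"], data["answers"]
    dialog = data["dialogs"][0]                     # one image-grounded conversation
    print(dialog["caption"])
    for turn in dialog["dialog"]:                   # up to 10 rounds
        print("Q:", questions[turn["question"]])
        print("A:", answers[turn["answer"]])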

  15. 3dStool

    • zenodo.org
    • data.niaid.nih.gov
    bin
    Updated Feb 14, 2023
    Cite
    Spyridon Souipas (2023). 3dStool [Dataset]. http://doi.org/10.5281/zenodo.7635563
    Explore at:
    Available download formats: bin
    Dataset updated
    Feb 14, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Spyridon Souipas
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    The 3D surgical tool dataset (3dStool) has been constructed with the aim of assisting the development of computer vision techniques that address the operating room. Note that functions for visualisation, processing, and splitting the dataset can be found in the relevant GitHub repository.

    Specifically, even though laparoscopic scenes have received a lot of attention in terms of labelled images, surgical tools that are used at initial stages of an operation, such as scalpels and scissors, have not had any such datasets developed.

    3dStool includes 5370 images, accompanied by manually drawn polygon labels, as well as information on the 3D pose of these tools in operation. The tools were recorded while operating on a cadaveric knee. A RealSense D415 was used for image collection, while an optical tracker was employed for the purpose of 3D pose recording. Four surgical tools have been collected for now:

    1. Scalpel
    2. Scissors
    3. Forceps
    4. Electric Burr

    An annotation json file (in the format of COCO) exists for the images, containing the masks, boxes, and other relevant information. Furthermore, pose information is provided in two different manners.

    Firstly, a csv in the following format:

    CSV Structure

    Column | 1     | 2     | 3     | 4  | 5  | 6  | 7  | 8     | 9
    Value  | X (m) | Y (m) | Z (m) | qi | qj | qk | ql | Class | Image Name

    Position and orientation are both provided in the coordinate axes of the camera used to obtain the data (RealSense D415, Intel, USA). Pose is provided in the form of quaternions; however, it is possible to convert this format into other available notations.
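
    A conversion sketch with SciPy, under the assumption that (qi, qj, qk, ql) is a scalar-last quaternion as SciPy expects; verify the convention against the dataset documentation before relying on it.

    # Turn one CSV row's pose into a 4x4 homogeneous transform in the camera frame.
    import numpy as np
    from scipy.spatial.transform import Rotation

    x, y, z, qi, qj, qk, ql = 0.12, -0.03, 0.45, 0.0, 0.0, 0.7071, 0.7071  # example row values
    rot = Rotation.from_quat([qi, qj, qk, ql])      # SciPy uses (x, y, z, w) ordering
    T = np.eye(4)
    T[:3, :3] = rot.as_matrix()
    T[:3, 3] = [x, y, z]
    print(rot.as_euler("xyz", degrees=True))        # the same rotation as Euler angles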

    The pose data can also be combined with the masks in the form of a final json file, in order to obtain a final COCO-format json with object poses as well. In the data provided, each of the test, train and validation subsets has its own COCO-like json file with the poses fused within, although the "orignal_jsons" only provide the image masks.

    The files and directories are structured as follows. Note that this example is based on the "train" directory, but a similar structure has been created for the test and val sets:

    • Train
      • manual_json - Contains the json created when manually annotating the images, therefore no pose data is included
      • pose - Contains the CSV file with the poses of the relevant images, explained in the table above
      • pose_json - Contains the fused json that includes both the annotations and the pose data for each image
      • surgical2020 - Contains the images in jpg format

  16. E-IC Dataset

    • paperswithcode.com
    Updated Oct 11, 2023
    Cite
    (2023). E-IC Dataset [Dataset]. https://paperswithcode.com/dataset/e-ic
    Explore at:
    Dataset updated
    Oct 11, 2023
    Description

    This dataset, adapted from COCO Caption, is designed for the image captioning task and evaluates multimodal model editing in terms of reliability, stability and generality.
