65 datasets found

COCO-Text Dataset
paperswithcode.com
Updated Feb 2, 2021
Cite
Andreas Veit; Tomas Matera; Lukas Neumann; Jiri Matas; Serge Belongie (2021). COCO-Text Dataset [Dataset]. https://paperswithcode.com/dataset/coco-text
Explore at:
Dataset updated
Feb 2, 2021
Authors
Andreas Veit; Tomas Matera; Lukas Neumann; Jiri Matas; Serge Belongie
Description
The COCO-Text dataset is a dataset for text detection and recognition. It is based on the MS COCO dataset, which contains images of complex everyday scenes. The COCO-Text dataset contains non-text images, legible text images and illegible text images. In total there are 22184 training images and 7026 validation images with at least one instance of legible text.
COCO (Common Objects in Context) Dataset
paperswithcode.com
Updated Dec 10, 2023
Cite
(2023). COCO (Common Objects in Context) Dataset [Dataset]. https://paperswithcode.com/dataset/coco
Explore at:
Dataset updated
Dec 10, 2023
Description
The COCO (Common Objects in Context) dataset is a large-scale object detection, segmentation, and captioning dataset. It is designed to encourage research on a wide variety of object categories and is commonly used for benchmarking computer vision models. It is an essential dataset for researchers and developers working on object detection, segmentation, and pose estimation tasks.
COCO 2017
opendatalab.com
huggingface.co
zip
Updated Sep 30, 2017
+ more versions
Cite
Microsoft (2017). COCO 2017 [Dataset]. https://opendatalab.com/OpenDataLab/COCO_2017
Explore at:
Available download formats: zip (49105147630 bytes)
Dataset updated
Sep 30, 2017
Dataset provided by
Microsoft
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
COCO is a large-scale object detection, segmentation, and captioning dataset. COCO has several features: Object segmentation, Recognition in context, Superpixel stuff segmentation, 330K images (>200K labeled), 1.5 million object instances, 80 object categories, 91 stuff categories, 5 captions per image, 250,000 people with keypoints.
COCO image-text pair
kaggle.com
Updated Jul 26, 2023
Cite
SHARATH KRISHNA A H 231 (2023). COCO image-text pair [Dataset]. https://www.kaggle.com/datasets/sharathkrishnaah231/coco-image-text-pair/suggestions?status=pending&yourSuggestions=true
Explore at:
Dataset updated
Jul 26, 2023
Dataset provided by
Kaggle: http://kaggle.com/
Authors
SHARATH KRISHNA A H 231
Description
Dataset

This dataset was created by SHARATH KRISHNA A H 231

Contents
Spoken-COCO - Dataset - LDM
service.tib.eu
Updated Dec 2, 2024
Cite
(2024). Spoken-COCO - Dataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/spoken-coco
Explore at:
Dataset updated
Dec 2, 2024
Description
Spoken-COCO is a large-scale dataset of audio and text pairs.
COCO
huggingface.co
datasets.activeloop.ai
Updated Feb 6, 2023
+ more versions
Cite
HuggingFaceM4 (2023). COCO [Dataset]. https://huggingface.co/datasets/HuggingFaceM4/COCO
Explore at:
Dataset updated
Feb 6, 2023
Dataset authored and provided by
HuggingFaceM4
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
MS COCO is a large-scale object detection, segmentation, and captioning dataset. COCO has several features: Object segmentation, Recognition in context, Superpixel stuff segmentation, 330K images (>200K labeled), 1.5 million object instances, 80 object categories, 91 stuff categories, 5 captions per image, 250,000 people with keypoints.
SPEECH-COCO Dataset
paperswithcode.com
Updated Sep 28, 2021
Cite
William Havard; Laurent Besacier; Olivier Rosec (2021). SPEECH-COCO Dataset [Dataset]. https://paperswithcode.com/dataset/speech-coco
Explore at:
Dataset updated
Sep 28, 2021
Authors
William Havard; Laurent Besacier; Olivier Rosec
Description
SPEECH-COCO contains speech captions that are generated using text-to-speech (TTS) synthesis resulting in 616,767 spoken captions (more than 600h) paired with images.
COCO, LVIS, Open Images V4 classes mapping
zenodo.org
bin, csv, txt
Updated Oct 13, 2022
Cite
Giuseppe Amato; Paolo Bolettieri; Fabio Carrara; Fabrizio Falchi; Claudio Gennaro; Nicola Messina; Lucia Vadicamo; Claudio Vairo (2022). COCO, LVIS, Open Images V4 classes mapping [Dataset]. http://doi.org/10.5281/zenodo.7194300
Explore at:
Available download formats: csv, txt, bin
Unique identifier
https://doi.org/10.5281/zenodo.7194300
Dataset updated
Oct 13, 2022
Dataset provided by
Zenodo: http://zenodo.org/
Authors
Giuseppe Amato; Paolo Bolettieri; Fabio Carrara; Fabrizio Falchi; Claudio Gennaro; Nicola Messina; Lucia Vadicamo; Claudio Vairo
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This repository contains a mapping between the classes of COCO, LVIS, and Open Images V4 datasets into a unique set of 1460 classes.

COCO [Lin et al. 2014] contains 80 classes, LVIS [Gupta et al. 2019] contains 1460 classes, Open Images V4 [Kuznetsova et al. 2020] contains 601 classes.

We built a mapping of these classes using a semi-automatic procedure in order to have a unique final list of 1460 classes. We also generated a hierarchy for each class using WordNet.

This repository contains the following files:

coco_classes_map.txt, contains the mapping for the 80 coco classes

lvis_classes_map.txt, contains the mapping for the 1460 LVIS classes

openimages_classes_map.txt, contains the mapping for the 601 Open Images V4 classes

classname_hyperset_definition.csv, contains the final set of 1460 classes, their definition and hierarchy

all-classnames.xlsx, contains a side-by-side view of all classes considered

This mapping was used in VISIONE [Amato et al. 2021, Amato et al. 2022] that is a content-based retrieval system that supports various search functionalities (text search, object/color-based search, semantic and visual similarity search, temporal search). For the object detection VISIONE uses three pre-trained models: VfNet [Zhang et al. 2021] (trained on COCO dataset), Mask R-CNN [He et al. 2017] (trained on LVIS), and a Faster R-CNN+Inception ResNet (trained on the Open Images V4).

This repository is released under a Creative Commons Attribution license. Please cite the following paper if you use it in your work in any form:

@inproceedings{amato2021visione,
  title={The visione video search system: exploiting off-the-shelf text search engines for large-scale video retrieval},
  author={Amato, Giuseppe and Bolettieri, Paolo and Carrara, Fabio and Debole, Franca and Falchi, Fabrizio and Gennaro, Claudio and Vadicamo, Lucia and Vairo, Claudio},
  journal={Journal of Imaging},
  volume={7},
  number={5},
  pages={76},
  year={2021},
  publisher={Multidisciplinary Digital Publishing Institute}
}

References:

[Amato et al. 2022] Amato, G. et al. (2022). VISIONE at Video Browser Showdown 2022. In: , et al. MultiMedia Modeling. MMM 2022. Lecture Notes in Computer Science, vol 13142. Springer, Cham. https://doi.org/10.1007/978-3-030-98355-0_52

[Amato et al. 2021] Amato, G., Bolettieri, P., Carrara, F., Debole, F., Falchi, F., Gennaro, C., Vadicamo, L. and Vairo, C., 2021. The visione video search system: exploiting off-the-shelf text search engines for large-scale video retrieval. Journal of Imaging, 7(5), p.76.

[Gupta et al. 2019] Gupta, A., Dollar, P. and Girshick, R., 2019. Lvis: A dataset for large vocabulary instance segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 5356-5364).

[He et al. 2017] He, K., Gkioxari, G., Dollár, P. and Girshick, R., 2017. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision (pp. 2961-2969).

[Kuznetsova et al. 2020] Kuznetsova, A., Rom, H., Alldrin, N., Uijlings, J., Krasin, I., Pont-Tuset, J., Kamali, S., Popov, S., Malloci, M., Kolesnikov, A. and Duerig, T., 2020. The open images dataset v4. International Journal of Computer Vision, 128(7), pp.1956-1981.

[Lin et al. 2014] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P. and Zitnick, C.L., 2014, September. Microsoft coco: Common objects in context. In European conference on computer vision (pp. 740-755). Springer, Cham.

[Zhang et al. 2021] Zhang, H., Wang, Y., Dayoub, F. and Sunderhauf, N., 2021. Varifocalnet: An iou-aware dense object detector. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 8514-8523).
COCO-Text
huggingface.co
Cite
VLM-Perception, COCO-Text [Dataset]. https://huggingface.co/datasets/VLM-Perception/COCO-Text
Explore at:
Dataset authored and provided by
VLM-Perception
Description
VLM-Perception/COCO-Text dataset hosted on Hugging Face and contributed by the HF Datasets community
coco2017
huggingface.co
Cite
Philipp, coco2017 [Dataset]. https://huggingface.co/datasets/phiyodr/coco2017
Explore at:
Authors
Philipp
Description
coco2017

Image-text pairs from MS COCO2017.

Data origin

Data originates from cocodataset.org. While coco-karpathy uses a dense format (with several sentences and sendids per row), coco-karpathy-long uses a long format with one sentence (aka caption) and sendid per row. coco-karpathy-long uses the first five sentences and is therefore five times as long as coco-karpathy. phiyodr/coco2017: One row corresponds to one image with several sentences. phiyodr/coco2017-long: One row… See the full description on the dataset page: https://huggingface.co/datasets/phiyodr/coco2017.
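The dense-versus-long distinction described above can be sketched in plain Python. This is only an illustration: the key names `image_id`, `caption_ids` and `captions` are hypothetical and are not taken from the dataset card, so check the actual column names on the dataset page before reusing it.

```python
from typing import Dict, List


def dense_to_long(rows: List[Dict]) -> List[Dict]:
    """Expand dense rows (several captions per image) into one row per caption."""
    long_rows = []
    for row in rows:
        # Pair every caption with its id; zip stops at the shorter of the two lists.
        for caption_id, caption in zip(row["caption_ids"], row["captions"]):
            long_rows.append(
                {"image_id": row["image_id"], "caption_id": caption_id, "caption": caption}
            )
    return long_rows


# Invented toy rows in the dense layout (hypothetical field names).
dense = [
    {"image_id": 1, "caption_ids": [10, 11], "captions": ["a cat on a sofa", "a sleeping cat"]},
    {"image_id": 2, "caption_ids": [20], "captions": ["a red bus"]},
]

long_rows = dense_to_long(dense)
print(len(dense), "dense rows ->", len(long_rows), "long rows")
```

With five captions per image, the long layout would have five times as many rows as the dense one, which matches the ratio stated in the description.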
Microsoft COCO (Zhao et al 2017)
kaggle.com
Updated Oct 21, 2019
Cite
Rachael Tatman (2019). Microsoft COCO (Zhao et al 2017) [Dataset]. https://www.kaggle.com/rtatman/ms-coco/metadata
Explore at:
Dataset updated
Oct 21, 2019
Dataset provided by
Kaggle: http://kaggle.com/
Authors
Rachael Tatman
Description
Context

This dataset contains pickled Python objects with data from the annotations of the Microsoft (MS) COCO dataset. COCO is a large-scale object detection, segmentation, and captioning dataset.

Content

Except for the objs file, which is a plain text file containing a list of objects, the data in this dataset is all in the pickle format, a way of storing Python objects as binary data files.

Important: These pickles were pickled using Python 2. Since Kernels use Python 3, you will need to specify the encoding when unpickling these files. The Python utility scripts here have been updated to correctly unpickle these files.

# the correct syntax to read these pickled files into Python 3
import pickle
pickle.load(open('file_path', 'rb'), encoding="latin1")
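A slightly more defensive variant of the same call is sketched below. The helper name `load_py2_pickle` is hypothetical, and the round trip uses a protocol-2 pickle written by Python 3, which only approximates a file produced by Python 2. As with any pickle, only load files you trust.

```python
import os
import pickle
import tempfile


def load_py2_pickle(path):
    """Load a pickle file, decoding Python 2 byte strings as latin1."""
    # The context manager closes the file handle even if unpickling fails.
    with open(path, "rb") as f:
        return pickle.load(f, encoding="latin1")


# Round trip on an invented sample object, written with pickle protocol 2.
sample = {"objs": ["person", "bicycle"], "count": 2}
with tempfile.NamedTemporaryFile(suffix=".pkl", delete=False) as tmp:
    pickle.dump(sample, tmp, protocol=2)
    tmp_path = tmp.name

loaded = load_py2_pickle(tmp_path)
os.remove(tmp_path)
print(loaded)
```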

Acknowledgements

As a derivative of the original COCO dataset, this dataset is distributed under a CC-BY 4.0 license. These files were distributed as part of the supporting materials for Zhao et al 2017. If you use these files in your work, please cite the following paper:

Zhao, J., Wang, T., Yatskar, M., Ordonez, V., & Chang, K. W. (2017). Men Also Like Shopping: Reducing Gender Bias Amplification using Corpus-level Constraints. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (pp. 2979-2989).
FS-COCO - Dataset - LDM
service.tib.eu
Updated Dec 16, 2024
Cite
(2024). FS-COCO - Dataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/fs-coco
Explore at:
Dataset updated
Dec 16, 2024
Description
FS-COCO: A large-scale scene sketch dataset with fine-grained alignment among sketch, text, and photo.
Face Features Test Dataset
universe.roboflow.com
zip
Updated Dec 6, 2021
Cite
Peter Lin (2021). Face Features Test Dataset [Dataset]. https://universe.roboflow.com/peter-lin/face-features-test/dataset/14
Explore at:
Available download formats: zip
Dataset updated
Dec 6, 2021
Dataset authored and provided by
Peter Lin
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Variables measured
Face Features Bounding Boxes
Description
A simple dataset for benchmarking CreateML object detection models. The images are sampled from COCO dataset with eyes and nose bounding boxes added. It’s not meant to be serious or useful in a real application. The purpose is to look at how long it takes to train CreateML models with varying dataset and batch sizes.

Training performance is affected by model configuration, dataset size and batch configuration. Larger models and batches require more memory. I used CreateML object detection project to compare the performance.

Hardware

M1 Macbook Air
* 8 GPU
* 4/4 CPU
* 16G memory
* 512G SSD

M1 Max Macbook Pro
* 24 GPU
* 2/8 CPU
* 32G memory
* 2T SSD

Small Dataset Train: 144 Valid: 16 Test: 8

Results

|batch | M1 ET | M1Max ET | peak mem G |
|--------|:------|:---------|:-----------|
|16 | 16 | 11 | 1.5 |
|32 | 29 | 17 | 2.8 |
|64 | 56 | 30 | 5.4 |
|128 | 170 | 57 | 12 |

Larger Dataset Train: 301 Valid: 29 Test: 18

Results

|batch | M1 ET | M1Max ET | peak mem G |
|--------|:------|:---------|:-----------|
|16 | 21 | 10 | 1.5 |
|32 | 42 | 17 | 3.5 |
|64 | 85 | 30 | 8.4 |
|128 | 281 | 54 | 16.5 |

CreateML Settings

For all tests, training was set to Full Network. I closed CreateML between each run to make sure memory issues didn't cause a slowdown. There is a bug with Monterey as of 11/2021 that leads to a memory leak. I kept an eye on the memory usage. If it looked like there was a memory leak, I restarted macOS.

Observations

In general, more GPU and memory with MBP reduces the training time. Having more memory lets you train with larger datasets. On M1 Macbook Air, the practical limit is 12G before memory pressure impacts performance. On M1 Max MBP, the practical limit is 26G before memory pressure impacts performance. To work around memory pressure, use smaller batch sizes.

On the larger dataset with batch size 128, the M1Max is about 5x faster than the Macbook Air. Keep in mind a real dataset should have thousands of samples like Coco or Pascal. Ideally, you want a dataset with 100K images for experimentation and millions for the real training. The new M1 Max Macbooks are a cost-effective alternative to building a Windows/Linux workstation with RTX 3090 24G. For most of 2021, the price of RTX 3090 with 24G was around $3,000.00. That means an equivalent Windows workstation would cost the same as the M1Max Macbook Pro I used to run the benchmarks.
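The speedup figure can be checked directly against the two results tables. The sketch below copies the elapsed times (ET) from those tables and divides the M1 value by the M1Max value. The dictionary layout and names are my own, and the ratio does not depend on the time unit.

```python
# Elapsed times copied from the results tables: batch size -> (M1 ET, M1Max ET).
elapsed = {
    "small": {16: (16, 11), 32: (29, 17), 64: (56, 30), 128: (170, 57)},
    "larger": {16: (21, 10), 32: (42, 17), 64: (85, 30), 128: (281, 54)},
}


def speedups(table):
    """Return M1 ET divided by M1Max ET for every batch size."""
    return {batch: m1 / m1max for batch, (m1, m1max) in table.items()}


ratios = {name: speedups(table) for name, table in elapsed.items()}

for name, by_batch in ratios.items():
    for batch, ratio in by_batch.items():
        print(f"{name:>6} dataset, batch {batch:>3}: M1Max is {ratio:.1f}x faster")
```

For the larger dataset at batch size 128 this gives 281 / 54, roughly 5.2, while at batch size 16 the gap is closer to 2x.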

Full Network vs Transfer Learning

As of CreateML 3, training with full network doesn't fully utilize the GPU. I don't know why it works that way. You have to select transfer learning to fully use the GPU. The results of transfer learning with the larger dataset are shown below. In general, training is faster and the loss is lower.

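|batch | ET min | Train Acc | Val Acc | Test Acc | Top IU Train | Top IU Valid | Top IU Test | Peak mem G | loss |
|--------|:-------|:----------|:--------|:---------|:-------------|:-------------|:------------|:-----------|:-----|
|16 | 4 | 75 | 19 | 12 | 78 | 23 | 13 | 1.5 | 0.41 |
|32 | 8 | 75 | 21 | 10 | 78 | 26 | 11 | 2.76 | 0.02 |
|64 | 13 | 75 | 23 | 8 | 78 | 24 | 9 | 5.3 | 0.017 |
|128 | 25 | 75 | 22 | 13 | 78 | 25 | 14 | 8.4 | 0.012 |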

Github Project

The source code and full results are up on Github https://github.com/woolfel/createmlbench
COST Dataset
paperswithcode.com
Updated Dec 27, 2023
+ more versions
Cite
Jitesh Jain; Jianwei Yang; Humphrey Shi (2023). COST Dataset [Dataset]. https://paperswithcode.com/dataset/cost
Explore at:
Dataset updated
Dec 27, 2023
Authors
Jitesh Jain; Jianwei Yang; Humphrey Shi
Description
SPEECH-COCO
live.european-language-grid.eu
audio wav
Updated Dec 10, 2023
+ more versions
Cite
(2023). SPEECH-COCO [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7686
Explore at:
Available download formats: audio wav
Dataset updated
Dec 10, 2023
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Introduction: Our corpus is an extension of the MS COCO image recognition and captioning dataset. MS COCO comprises images paired with a set of five captions. Yet, it does not include any speech. Therefore, we used Voxygen's text-to-speech system to synthesise the available captions. The addition of speech as a new modality enables MSCOCO to be used for research in the fields of language acquisition, unsupervised term discovery, keyword spotting, or semantic embedding using speech and vision. Our corpus is licensed under a Creative Commons Attribution 4.0 License.

Data Set: This corpus contains 616,767 spoken captions from MSCOCO's val2014 and train2014 subsets (414,113 for train2014 and 202,654 for val2014). We used 8 different voices. 4 of them have a British accent (Paul, Bronwen, Judith, and Elizabeth) and the 4 others have an American accent (Phil, Bruce, Amanda, Jenny). In order to make the captions sound more natural, we used the SoX tempo command, enabling us to change the speed without changing the pitch. 1/3 of the captions are 10% slower than the original pace, 1/3 are 10% faster. The last third of the captions was kept untouched. We also modified approximately 30% of the original captions and added disfluencies such as "um", "uh", "er" so that the captions would sound more natural.

Each WAV file is paired with a JSON file containing various information: timecode of each word in the caption, name of the speaker, name of the WAV file, etc. The JSON files have the following data structure: {"duration": float, "speaker": string, "synthesisedCaption": string, "timecode": list, "speed": float, "wavFilename": string, "captionID": int, "imgID": int, "disfluency": list}. On average, each caption comprises 10.79 tokens, disfluencies included. The WAV files are on average 3.52 seconds long.
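A minimal sketch of reading one of these JSON sidecar files into a typed record follows. The field names and types come from the structure quoted above. The class name, the helper and every value in the sample record are invented for illustration, and the list fields are left empty because their inner layout is not described here. The last lines check the subset counts and estimate the total audio duration from the stated average length.

```python
import json
from dataclasses import dataclass


@dataclass
class SpokenCaption:
    # Field names follow the JSON structure given in the description.
    duration: float
    speaker: str
    synthesisedCaption: str
    timecode: list
    speed: float
    wavFilename: str
    captionID: int
    imgID: int
    disfluency: list


def parse_caption(text: str) -> SpokenCaption:
    """Parse one JSON sidecar document; fails if a key is missing or unexpected."""
    return SpokenCaption(**json.loads(text))


# Invented sample record; real files will contain different values.
example = json.dumps({
    "duration": 3.5, "speaker": "Paul", "synthesisedCaption": "a cat on a sofa",
    "timecode": [], "speed": 1.0, "wavFilename": "example.wav",
    "captionID": 1, "imgID": 2, "disfluency": [],
})
caption = parse_caption(example)

# Consistency check on the figures quoted in the description.
total_captions = 414_113 + 202_654
estimated_hours = total_captions * 3.52 / 3600
print(caption.speaker, total_captions, round(estimated_hours))
```

The two subset counts add up to the stated 616,767 captions, and at 3.52 seconds per file the estimate lands a little above 600 hours.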
COCO14-CC12M
kaggle.com
Updated Apr 6, 2025
Cite
Reem Junaid (2025). COCO14-CC12M [Dataset]. https://www.kaggle.com/datasets/reemjunaid/coco14-cc12m
Explore at:
Dataset updated
Apr 6, 2025
Dataset provided by
Kaggle: http://kaggle.com/
Authors
Reem Junaid
Description
Mixed Image-Caption Dataset (COCO2014 + CC12M)

This dataset contains a collection of 32,000 image-caption pairs sourced from:

COCO 2014: https://cocodataset.org/#home

CC12M (Conceptual Captions 12M): https://github.com/google-research-datasets/conceptual-12m

Each entry is included in the JSON file train_mix_32000.json, with the following fields:
- "filename": Image filename (relative to dataset structure)
- "caption": Image description
- "data": Source dataset ("coco" or "cc12m")

📦 Included

train_mix_32000.json: Metadata file with image paths and captions.

images/: Folder structure containing all 32,000 actual image files referenced in the JSON.

💡 Image paths in the JSON have been adjusted to reflect the folder structure inside this Kaggle dataset.
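As a rough sketch of how the metadata could be inspected, the snippet below counts entries per source dataset. It assumes the JSON file holds a top-level list of objects with the three fields listed above, which the description does not state explicitly. The sample entries and paths are invented.

```python
import json
from collections import Counter


def count_by_source(entries):
    """Count entries per source dataset using the "data" field."""
    return Counter(entry["data"] for entry in entries)


# Invented stand-in for the contents of train_mix_32000.json (assumed list layout).
sample_json = """
[
  {"filename": "images/a.jpg", "caption": "a dog on a beach", "data": "coco"},
  {"filename": "images/b.jpg", "caption": "a plate of food", "data": "coco"},
  {"filename": "images/c.jpg", "caption": "city skyline at night", "data": "cc12m"}
]
"""

entries = json.loads(sample_json)
counts = count_by_source(entries)
print(dict(counts))
```

For the real file, replace `json.loads(sample_json)` with `json.load` on an open handle to train_mix_32000.json.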

📄 License

This dataset includes images from:

COCO 2014
Licensed under Creative Commons Attribution 4.0.

CC12M
Provided by Google LLC under a permissive license:

The dataset may be freely used for any purpose, although acknowledgment of Google LLC as the data source would be appreciated.
The dataset is provided "AS IS" without any warranty, express or implied.
View License

🧠 Use Cases

Vision-language pretraining

Knowledge-enhanced captioning

Image-text retrieval tasks

Multi-task learning in vision-language models

🙏 Acknowledgements

COCO Dataset

Google Conceptual Captions (CC12M)


ref_coco

tensorflow.org
opendatalab.com

Updated May 31, 2024

Cite

(2024). ref_coco [Dataset]. https://www.tensorflow.org/datasets/catalog/ref_coco

Explore at:

Dataset updated

May 31, 2024

Description

A collection of 3 referring expression datasets based off images in the COCO dataset. A referring expression is a piece of text that describes a unique object in an image. These datasets are collected by asking human raters to disambiguate objects delineated by bounding boxes in the COCO dataset.

RefCoco and RefCoco+ are from Kazemzadeh et al. 2014. RefCoco+ expressions are strictly appearance-based descriptions, which they enforced by preventing raters from using location-based descriptions (e.g., "person to the right" is not a valid description for RefCoco+). RefCocoG is from Mao et al. 2016, and has richer descriptions of objects compared to RefCoco due to differences in the annotation process. In particular, RefCoco was collected in an interactive game-based setting, while RefCocoG was collected in a non-interactive setting. On average, RefCocoG has 8.4 words per expression while RefCoco has 3.5 words.

Each dataset has different split allocations that are typically all reported in papers. The "testA" and "testB" sets in RefCoco and RefCoco+ contain only people and only non-people respectively. Images are partitioned into the various splits. In the "google" split, objects, not images, are partitioned between the train and non-train splits. This means that the same image can appear in both the train and validation split, but the objects being referred to in the image will be different between the two sets. In contrast, the "unc" and "umd" splits partition images between the train, validation, and test split. In RefCocoG, the "google" split does not have a canonical test set, and the validation set is typically reported in papers as "val*".
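The difference between partitioning objects and partitioning images can be illustrated with a toy example. Everything below is invented for illustration, including the data, the helper names and the choice of which ids go to training. It is not the procedure used to build the actual splits.

```python
# Toy referring expressions, each identified by (image_id, object_id).
refs = [(1, "a"), (1, "b"), (2, "c"), (2, "d"), (3, "e"), (3, "f")]


def split_by_object(refs, train_objects):
    """Partition on object ids: one image may end up on both sides."""
    train = [r for r in refs if r[1] in train_objects]
    held_out = [r for r in refs if r[1] not in train_objects]
    return train, held_out


def split_by_image(refs, train_images):
    """Partition on image ids: every image lands on exactly one side."""
    train = [r for r in refs if r[0] in train_images]
    held_out = [r for r in refs if r[0] not in train_images]
    return train, held_out


def shared_images(train, held_out):
    """Return the image ids that appear in both partitions."""
    return {img for img, _ in train} & {img for img, _ in held_out}


obj_train, obj_val = split_by_object(refs, {"a", "c", "e"})
img_train, img_val = split_by_image(refs, {1, 2})

print("object-level split shares images:", sorted(shared_images(obj_train, obj_val)))
print("image-level split shares images:", sorted(shared_images(img_train, img_val)))
```

In the object-level split every image shows up on both sides while the referred objects stay disjoint, which is the behaviour described for the "google" partition. The image-level split shares no images, as with "unc" and "umd".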

Stats for each dataset and split ("refs" is the number of referring expressions, and "images" is the number of images):

dataset	partition	split	refs	images
refcoco	google	train	40000	19213
refcoco	google	val	5000	4559
refcoco	google	test	5000	4527
refcoco	unc	train	42404	16994
refcoco	unc	val	3811	1500
refcoco	unc	testA	1975	750
refcoco	unc	testB	1810	750
refcoco+	unc	train	42278	16992
refcoco+	unc	val	3805	1500
refcoco+	unc	testA	1975	750
refcoco+	unc	testB	1798	750
refcocog	google	train	44822	24698
refcocog	google	val	5000	4650
refcocog	umd	train	42226	21899
refcocog	umd	val	2573	1300
refcocog	umd	test	5023	2600
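As a small worked example, the average number of referring expressions per image can be derived from the table. The rows below are copied from it, and the variable names are my own.

```python
# (dataset, partition, split, refs, images), copied from the stats table.
rows = [
    ("refcoco", "google", "train", 40000, 19213),
    ("refcoco", "google", "val", 5000, 4559),
    ("refcoco", "google", "test", 5000, 4527),
    ("refcoco", "unc", "train", 42404, 16994),
    ("refcoco", "unc", "val", 3811, 1500),
    ("refcoco", "unc", "testA", 1975, 750),
    ("refcoco", "unc", "testB", 1810, 750),
    ("refcoco+", "unc", "train", 42278, 16992),
    ("refcoco+", "unc", "val", 3805, 1500),
    ("refcoco+", "unc", "testA", 1975, 750),
    ("refcoco+", "unc", "testB", 1798, 750),
    ("refcocog", "google", "train", 44822, 24698),
    ("refcocog", "google", "val", 5000, 4650),
    ("refcocog", "umd", "train", 42226, 21899),
    ("refcocog", "umd", "val", 2573, 1300),
    ("refcocog", "umd", "test", 5023, 2600),
]

# Average referring expressions per image for every dataset/partition/split.
refs_per_image = {
    (dataset, partition, split): refs / images
    for dataset, partition, split, refs, images in rows
}

for key, value in refs_per_image.items():
    print(*key, f"{value:.2f}", sep="\t")
```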

To use this dataset:

import tensorflow_datasets as tfds

ds = tfds.load('ref_coco', split='train')
for ex in ds.take(4):
    print(ex)

See the guide for more information on tensorflow_datasets.

Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/ref_coco-refcoco_unc-1.1.0.png

COCO Captions Dataset
paperswithcode.com
opendatalab.com
Updated Feb 2, 2021
Cite
Xinlei Chen; Hao Fang; Tsung-Yi Lin; Ramakrishna Vedantam; Saurabh Gupta; Piotr Dollar; C. Lawrence Zitnick (2021). COCO Captions Dataset [Dataset]. https://paperswithcode.com/dataset/coco-captions
Explore at:
Dataset updated
Feb 2, 2021
Authors
Xinlei Chen; Hao Fang; Tsung-Yi Lin; Ramakrishna Vedantam; Saurabh Gupta; Piotr Dollar; C. Lawrence Zitnick
Description
COCO Captions contains over one and a half million captions describing over 330,000 images. For the training and validation images, five independent human-generated captions are provided for each image.
DrawBench
opendatalab.com
zip
Updated Jan 1, 2022
Cite
Google Research (2022). DrawBench [Dataset]. https://opendatalab.com/OpenDataLab/DrawBench
Explore at:
Available download formats: zip
Dataset updated
Jan 1, 2022
Dataset provided by
Google Research
Description
We present Imagen, a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding. Imagen builds on the power of large transformer language models in understanding text and hinges on the strength of diffusion models in high-fidelity image generation. Our key discovery is that generic large language models (e.g. T5), pretrained on text-only corpora, are surprisingly effective at encoding text for image synthesis: increasing the size of the language model in Imagen boosts both sample fidelity and image-text alignment much more than increasing the size of the image diffusion model. Imagen achieves a new state-of-the-art FID score of 7.27 on the COCO dataset, without ever training on COCO, and human raters find Imagen samples to be on par with the COCO data itself in image-text alignment. To assess text-to-image models in greater depth, we introduce DrawBench, a comprehensive and challenging benchmark for text-to-image models. With DrawBench, we compare Imagen with recent methods including VQ-GAN+CLIP, Latent Diffusion Models, GLIDE and DALL-E 2, and find that human raters prefer Imagen over other models in side-by-side comparisons, both in terms of sample quality and image-text alignment.
Semantic Image-Text-Classes
data.uni-hannover.de
jsonl, partaa, partab +48
Updated Jan 20, 2022
+ more versions
Cite
TIB (2022). Semantic Image-Text-Classes [Dataset]. https://data.uni-hannover.de/dataset/image-text-classes
Explore at:
Available download formats: partar(1000000000), partbb(1000000000), partae(1000000000), partaq(1000000000), partau(1000000000), partam(1000000000), partbl(1000000000), partbo(1000000000), partab(1000000000), partai(1000000000), partbk(1000000000), partbw(532254720), partbu(1000000000), partbf(1000000000), partbn(1000000000), partas(1000000000), partad(1000000000), partbr(1000000000), partao(1000000000), partbv(1000000000), partaa(1000000000), partav(1000000000), partbe(1000000000), partbq(1000000000), partay(1000000000), jsonl(145621225), partax(1000000000), partap(1000000000), partaj(1000000000), partbd(1000000000), partbs(1000000000), partaz(1000000000), partbp(1000000000), partaw(1000000000), partah(1000000000), partbh(1000000000), tar(163174400), partaf(1000000000), partan(1000000000), partbi(1000000000), partbt(1000000000), partba(1000000000), partbm(1000000000), partbc(1000000000), partbj(1000000000), partat(1000000000), jsonl(1161897), partbg(1000000000), partal(1000000000), partac(1000000000), partag(1000000000), partak(1000000000)
Dataset updated
Jan 20, 2022
Dataset authored and provided by
TIB
License
Attribution-NonCommercial 3.0 (CC BY-NC 3.0): https://creativecommons.org/licenses/by-nc/3.0/
License information was derived automatically
Description
This dataset is introduced by the paper "Understanding, Categorizing and Predicting Semantic Image-Text Relations".

If you are using this dataset in your work, please cite:

@inproceedings{otto2019understanding,
  title={Understanding, Categorizing and Predicting Semantic Image-Text Relations},
  author={Otto, Christian and Springstein, Matthias and Anand, Avishek and Ewerth, Ralph},
  booktitle={In Proceedings of ACM International Conference on Multimedia Retrieval (ICMR 2019)},
  year={2019}
}

To create the full tar use the following command in the command line:

cat train.tar.part* > train_concat.tar

Then simply untar it via

tar -xf train_concat.tar
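For environments without `cat` and `tar`, the same two steps can be sketched with the Python standard library. The helper name `concat_parts` is hypothetical, and the demo builds a tiny archive in a temporary directory instead of touching the real part files. For real data, the pattern would be `train.tar.part*` as in the command above.

```python
import glob
import os
import shutil
import tarfile
import tempfile


def concat_parts(pattern, out_path):
    """Join all files matching pattern, in sorted name order, into out_path."""
    parts = sorted(glob.glob(pattern))
    with open(out_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                # Stream in chunks so large parts never sit fully in memory.
                shutil.copyfileobj(src, out)
    return parts


with tempfile.TemporaryDirectory() as tmp:
    # Build a tiny tar archive and split it into two parts for the demo.
    payload = os.path.join(tmp, "hello.txt")
    with open(payload, "w") as f:
        f.write("hello")
    full = os.path.join(tmp, "full.tar")
    with tarfile.open(full, "w") as tar:
        tar.add(payload, arcname="hello.txt")
    with open(full, "rb") as f:
        data = f.read()
    half = len(data) // 2
    for suffix, chunk in (("aa", data[:half]), ("ab", data[half:])):
        with open(os.path.join(tmp, "train.tar.part" + suffix), "wb") as f:
            f.write(chunk)

    # Equivalent of: cat train.tar.part* > train_concat.tar
    out_path = os.path.join(tmp, "train_concat.tar")
    parts = concat_parts(os.path.join(tmp, "train.tar.part*"), out_path)

    # Equivalent of listing what tar -xf would extract.
    with tarfile.open(out_path) as tar:
        names = tar.getnames()

print(len(parts), "parts joined; archive contains", names)
```

When extracting archives from untrusted sources, inspect member names first rather than unpacking blindly.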

The jsonl files contain metadata of the following format:

id, origin, CMI, SC, STAT, ITClass, text, tagged text, image_path

License Information:

This dataset is composed of various open access sources as described in the paper. We thank all the original authors for their work.

Pitt Image Ads Dataset: http://people.cs.pitt.edu/~kovashka/ads/

Image-Net challenge: http://image-net.org/

Visual Storytelling Dataset (VIST): http://visionandlanguage.net/VIST/

Wikipedia: https://www.wikipedia.org/

Microsoft COCO: http://cocodataset.org/#home
