Large-scale Multi-modality Models Evaluation Suite
Accelerating the development of large-scale multi-modality models (LMMs) with lmms-eval
🏠 Homepage | 📚 Documentation | 🤗 Huggingface Datasets
This Dataset
This is a formatted version of RefCOCO. It is used in our lmms-eval pipeline to allow for one-click evaluations of large multi-modality models. @inproceedings{kazemzadeh-etal-2014-referitgame, title = "{R}efer{I}t{G}ame: Referring to Objects in… See the full description on the dataset page: https://huggingface.co/datasets/lmms-lab/RefCOCO.
A collection of three referring expression datasets based on images in the COCO dataset. A referring expression is a piece of text that describes a unique object in an image. These datasets were collected by asking human raters to disambiguate objects delineated by bounding boxes in the COCO dataset.
RefCoco and RefCoco+ are from Kazemzadeh et al. 2014. RefCoco+ expressions are strictly appearance-based descriptions, enforced by preventing raters from using location-based descriptions (e.g., "person to the right" is not a valid description for RefCoco+). RefCocoG is from Mao et al. 2016 and has richer descriptions of objects than RefCoco due to differences in the annotation process: RefCoco was collected in an interactive game-based setting, while RefCocoG was collected in a non-interactive setting. On average, RefCocoG expressions are 8.4 words long, while RefCoco expressions are 3.5 words long.
Each dataset has different split allocations that are typically all reported in papers. The "testA" and "testB" sets in RefCoco and RefCoco+ contain only people and only non-people respectively. Images are partitioned into the various splits. In the "google" split, objects, not images, are partitioned between the train and non-train splits. This means that the same image can appear in both the train and validation split, but the objects being referred to in the image will be different between the two sets. In contrast, the "unc" and "umd" splits partition images between the train, validation, and test split. In RefCocoG, the "google" split does not have a canonical test set, and the validation set is typically reported in papers as "val*".
Stats for each dataset and split ("refs" is the number of referring expressions, and "images" is the number of images):
| dataset | partition | split | refs | images |
|---|---|---|---|---|
| refcoco | google | train | 40000 | 19213 |
| refcoco | google | val | 5000 | 4559 |
| refcoco | google | test | 5000 | 4527 |
| refcoco | unc | train | 42404 | 16994 |
| refcoco | unc | val | 3811 | 1500 |
| refcoco | unc | testA | 1975 | 750 |
| refcoco | unc | testB | 1810 | 750 |
| refcoco+ | unc | train | 42278 | 16992 |
| refcoco+ | unc | val | 3805 | 1500 |
| refcoco+ | unc | testA | 1975 | 750 |
| refcoco+ | unc | testB | 1798 | 750 |
| refcocog | google | train | 44822 | 24698 |
| refcocog | google | val | 5000 | 4650 |
| refcocog | umd | train | 42226 | 21899 |
| refcocog | umd | val | 2573 | 1300 |
| refcocog | umd | test | 5023 | 2600 |
To use this dataset:
```python
import tensorflow_datasets as tfds

# Load the default config of the TFDS ref_coco dataset.
ds = tfds.load('ref_coco', split='train')
for ex in ds.take(4):
    print(ex)
```
See the guide for more information on tensorflow_datasets.
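The call above loads the builder's default config. To evaluate a specific dataset/partition pair from the table, the config can be selected explicitly. A minimal sketch, assuming the TFDS builder exposes configs named after the dataset and partition (e.g. `refcoco_unc`, `refcocog_umd`) as listed in the TFDS catalog:

```python
import tensorflow_datasets as tfds

# Select the UNC partition of RefCoco explicitly via its builder config name.
ds, info = tfds.load('ref_coco/refcoco_unc', split='train', with_info=True)

# info.splits reports the available splits and their sizes, which should
# line up with the refs/images stats table above.
print(info.splits)

for ex in ds.take(1):
    # Print the feature keys rather than assuming a particular schema.
    print(list(ex.keys()))
```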
Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/ref_coco-refcoco_unc-1.1.0.png
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
RefCOCO-M: Refined Referring Expression Segmentation
RefCOCO has long been a standard benchmark for referring expression segmentation, but it has two major issues: poor mask quality and harmful referring expressions. Modern models now produce masks that are more accurate than the ground-truth annotations, which makes RefCOCO an imprecise measure of segmentation quality. RefCOCO-M is a cleaned version of the RefCOCO (UNC) validation split. We replace the original instance masks with… See the full description on the dataset page: https://huggingface.co/datasets/moondream/refcoco-m.
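For local experiments, the refined annotations can be pulled with the Hugging Face `datasets` library. A minimal sketch that assumes only the repository id from the link above and inspects whatever splits and features the hub reports, rather than assuming a schema:

```python
from datasets import load_dataset

# Download all available splits of the refined RefCOCO-M annotations.
ds = load_dataset("moondream/refcoco-m")

# Inspect split names, sizes, and feature columns as reported by the hub.
print(ds)
```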
The dataset used in the paper is a benchmark for referring expression grounding, containing 142,210 referring expressions for 50,000 referents in 19,994 images.
Kangheng/refcoco dataset hosted on Hugging Face and contributed by the HF Datasets community
Dataset Card for RefCOCO-M
This is a FiftyOne dataset with 1190 samples.
Installation
If you haven't already, install FiftyOne: pip install -U fiftyone
Usage
```python
import fiftyone as fo
from fiftyone.utils.huggingface import load_from_hub

dataset = load_from_hub("Voxel51/RefCOCO-M")
session = fo.launch_app(dataset)
```
Dataset Details
Dataset… See the full description on the dataset page: https://huggingface.co/datasets/Voxel51/RefCOCO-M.
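For a quick spot-check without browsing every sample, a small random view can be launched instead. A minimal sketch using only core FiftyOne view stages (`shuffle` and `take`); nothing here assumes this dataset's specific fields:

```python
import fiftyone as fo
from fiftyone.utils.huggingface import load_from_hub

# Load the dataset from the Hub as in the usage example above.
dataset = load_from_hub("Voxel51/RefCOCO-M")

# Build a small random view; view stages never modify the underlying dataset.
view = dataset.shuffle(seed=51).take(25)

# Launch the App on just that view.
session = fo.launch_app(view)
```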
The authors used the RefCOCO dataset, a large-scale referring expression dataset built on COCO images, to train and evaluate their models.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Patch-as-Decodable-Token: Towards Unified Multi-Modal Vision Tasks in MLLMs
[🔗 Released Code] [🤗 Datasets] [🤗 Checkpoints] [📄 Tech Report] [🤗 Paper]
Figure A. PaDT pipeline.
🌟 Introduction
We are pleased to introduce Patch-as-Decodable Token (PaDT), a unified paradigm that enables multimodal large language models (MLLMs) to directly generate both textual and visual outputs. At the core of PaDT are Visual Reference Tokens (VRTs). Unlike conventional MLLMs that represent… See the full description on the dataset page: https://huggingface.co/datasets/PaDT-MLLM/RefCOCO.
Visual Grounding is a task that aims to locate a target object according to a natural language expression. The datasets used in this paper are RefCOCO, RefCOCO+, and RefCOCOg.
VDebugger/refcoco dataset hosted on Hugging Face and contributed by the HF Datasets community
RefCOCO, RefCOCO+, Flickr30k
lhoestq/refcoco-m-metadata dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
VoyageWang/refcoco dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Vision-language models aim to seamlessly integrate visual and linguistic information for multi-modal tasks, demanding precise semantic alignments between image-text pairs while minimizing the influence of irrelevant data. While existing methods leverage intra-modal and cross-modal knowledge to enhance alignments, they often fall short in sufficiently reducing interference, which ultimately constrains model performance. To address this gap, we propose a novel vision-language model, the threshold-based knowledge integration network (TBKIN), designed to effectively capture intra-modal and cross-modal knowledge while systematically mitigating the impact of extraneous information. TBKIN employs unified scene graph structures and advanced masking strategies to strengthen semantic alignments and introduces a fine-tuning strategy based on threshold selection to eliminate noise. Comprehensive experimental evaluations demonstrate the efficacy of TBKIN, with our best model achieving state-of-the-art accuracy of 73.90% on the VQA 2.0 dataset and 84.60% on the RefCOCO dataset. Attention visualization and detailed result analysis further validate the robustness of TBKIN in tackling vision-language tasks. The model’s ability to reduce interference while enhancing semantic alignments underscores its potential for advancing multi-modal learning. Extensive experiments across four widely-used benchmark datasets confirm its superior performance on two typical vision-language tasks, offering a practical and effective solution for real-world applications.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Card for RefCOCO Triplets
This dataset contains annotations derived from using ChatGPT to decompose the referring expressions (captions) of the RefCOCO/+/g dataset into triples (subject, predicate, object).
Dataset Details
Dataset Description
Curated by: Zeyu Han
Language(s) (NLP): English
License: cc-by-4.0
Dataset Sources
Repository: https://github.com/Show-han/Zeroshot_REC
Paper: Zero-shot Referring Expression Comprehension via… See the full description on the dataset page: https://huggingface.co/datasets/CresCat01/RefCOCO-Triplets.
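To make the decomposition concrete, the sketch below shows the kind of record it produces for a typical RefCOCO-style expression; the expression and the field names are hypothetical illustrations, not the dataset's actual schema:

```python
# Hypothetical decomposition of a referring expression into
# (subject, predicate, object) triples; names are illustrative only.
expression = "the man in the red shirt holding a frisbee"

triples = [
    {"subject": "man", "predicate": "wearing", "object": "red shirt"},
    {"subject": "man", "predicate": "holding", "object": "frisbee"},
]

for t in triples:
    print(f"({t['subject']}, {t['predicate']}, {t['object']})  <-  {expression}")
```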
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Shengcao1006/RAS-refcoco dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Comparisons with state-of-the-art models on VQA and REC.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset was created by kenji0901
Released under MIT
refcoco with COCO 2017 Image Paths
This dataset is a version of the original refcoco dataset that uses COCO 2017 image paths instead of COCO 2014.
Changes from Original
Image paths updated from the COCO 2014 naming format to the COCO 2017 format (see the path-mapping sketch below)
Images loaded from the COCO 2017 directory structure
All other annotations remain unchanged
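The path change itself is mechanical: COCO 2014 file names embed a split prefix (COCO_train2014_000000123456.jpg), while COCO 2017 keeps only the zero-padded 12-digit image id (000000123456.jpg) under train2017/ or val2017/. A sketch of that mapping, based only on the standard COCO naming conventions; the helper below is illustrative and not taken from this dataset's own tooling:

```python
import os
import re

def coco2014_to_coco2017_path(path_2014: str, images_root: str = "coco") -> str:
    """Map a COCO 2014 image file name to a COCO 2017-style path.

    Illustrative helper: COCO 2014 names embed the split prefix, COCO 2017
    uses the bare 12-digit image id. Note that the 2014 and 2017 train/val
    splits differ, so in practice the 2017 directory should be resolved per
    image id; reusing the 2014 split name here is a simplification.
    """
    name = os.path.basename(path_2014)
    match = re.match(r"COCO_(train|val)2014_(\d{12})\.jpg$", name)
    if match is None:
        raise ValueError(f"Unrecognized COCO 2014 file name: {name}")
    split, image_id = match.groups()
    return os.path.join(images_root, f"{split}2017", f"{image_id}.jpg")

print(coco2014_to_coco2017_path("train2014/COCO_train2014_000000123456.jpg"))
# -> coco/train2017/000000123456.jpg
```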
Usage
```python
from datasets import load_dataset

dataset = load_dataset("jhkwak-bp/refcoco-coco2017")
```
Citation
Please cite the original… See the full description on the dataset page: https://huggingface.co/datasets/jhkwak-bp/refcoco-coco2017.