Large-scale Multi-modality Models Evaluation Suite
Accelerating the development of large-scale multi-modality models (LMMs) with lmms-eval
🏠 Homepage | 📚 Documentation | 🤗 Huggingface Datasets
This Dataset
This is a formatted version of RefCOCO. It is used in our lmms-eval pipeline to allow for one-click evaluations of large multi-modality models. @inproceedings{kazemzadeh-etal-2014-referitgame, title = "{R}efer{I}t{G}ame: Referring to Objects in… See the full description on the dataset page: https://huggingface.co/datasets/lmms-lab/RefCOCO.
A collection of three referring expression datasets based on images in the COCO dataset. A referring expression is a piece of text that describes a unique object in an image. These datasets were collected by asking human raters to disambiguate objects delineated by bounding boxes in the COCO dataset.
RefCoco and RefCoco+ are from Kazemzadeh et al. 2014. RefCoco+ expressions are strictly appearance-based descriptions, enforced by preventing raters from using location-based descriptions (e.g., "person to the right" is not a valid description for RefCoco+). RefCocoG is from Mao et al. 2016 and has richer descriptions of objects than RefCoco due to differences in the annotation process: RefCoco was collected in an interactive game-based setting, while RefCocoG was collected in a non-interactive setting. On average, RefCocoG expressions are 8.4 words long, while RefCoco expressions are 3.5 words long.
Each dataset has different split allocations that are typically all reported in papers. The "testA" and "testB" sets in RefCoco and RefCoco+ contain only people and only non-people respectively. Images are partitioned into the various splits. In the "google" split, objects, not images, are partitioned between the train and non-train splits. This means that the same image can appear in both the train and validation split, but the objects being referred to in the image will be different between the two sets. In contrast, the "unc" and "umd" splits partition images between the train, validation, and test split. In RefCocoG, the "google" split does not have a canonical test set, and the validation set is typically reported in papers as "val*".
Stats for each dataset and split ("refs" is the number of referring expressions, and "images" is the number of images):
| dataset | partition | split | refs | images |
|---|---|---|---|---|
| refcoco | google | train | 40000 | 19213 |
| refcoco | google | val | 5000 | 4559 |
| refcoco | google | test | 5000 | 4527 |
| refcoco | unc | train | 42404 | 16994 |
| refcoco | unc | val | 3811 | 1500 |
| refcoco | unc | testA | 1975 | 750 |
| refcoco | unc | testB | 1810 | 750 |
| refcoco+ | unc | train | 42278 | 16992 |
| refcoco+ | unc | val | 3805 | 1500 |
| refcoco+ | unc | testA | 1975 | 750 |
| refcoco+ | unc | testB | 1798 | 750 |
| refcocog | google | train | 44822 | 24698 |
| refcocog | google | val | 5000 | 4650 |
| refcocog | umd | train | 42226 | 21899 |
| refcocog | umd | val | 2573 | 1300 |
| refcocog | umd | test | 5023 | 2600 |
To use this dataset:
```python
import tensorflow_datasets as tfds

# Load the default config of the TFDS ref_coco dataset.
ds = tfds.load('ref_coco', split='train')
for ex in ds.take(4):
    print(ex)
```
See the guide for more information on tensorflow_datasets.
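The call above loads the builder's default config. To evaluate a specific dataset/partition pair from the table, the config can be selected explicitly. A minimal sketch, assuming the TFDS builder exposes configs named after the dataset and partition (e.g. `refcoco_unc`, `refcocog_umd`) as listed in the TFDS catalog:

```python
import tensorflow_datasets as tfds

# Select the UNC partition of RefCoco explicitly via its builder config name.
ds, info = tfds.load('ref_coco/refcoco_unc', split='train', with_info=True)

# info.splits reports the available splits and their sizes, which should
# line up with the refs/images stats table above.
print(info.splits)

for ex in ds.take(1):
    # Print the feature keys rather than assuming a particular schema.
    print(list(ex.keys()))
```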
Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/ref_coco-refcoco_unc-1.1.0.png
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
RefCOCO-M: Refined Referring Expression Segmentation
RefCOCO has long been a standard benchmark for referring expression segmentation, but it has two major issues: poor mask quality and harmful referring expressions. Modern models now produce masks that are more accurate than the ground-truth annotations, which makes RefCOCO an imprecise measure of segmentation quality. RefCOCO-M is a cleaned version of the RefCOCO (UNC) validation split. We replace the original instance masks with… See the full description on the dataset page: https://huggingface.co/datasets/moondream/refcoco-m.
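For local experiments, the refined annotations can be pulled with the Hugging Face `datasets` library. A minimal sketch that assumes only the repository id from the link above and inspects whatever splits and features the hub reports, rather than assuming a schema:

```python
from datasets import load_dataset

# Download all available splits of the refined RefCOCO-M annotations.
ds = load_dataset("moondream/refcoco-m")

# Inspect split names, sizes, and feature columns as reported by the hub.
print(ds)
```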
The dataset used in the paper is a benchmark for referring expression grounding, containing 142,210 referring expressions for 50,000 referents in 19,994 images.
Kangheng/refcoco dataset hosted on Hugging Face and contributed by the HF Datasets community
Dataset Card for RefCOCO-M
This is a FiftyOne dataset with 1190 samples.
Installation
If you haven't already, install FiftyOne: pip install -U fiftyone
Usage
```python
import fiftyone as fo
from fiftyone.utils.huggingface import load_from_hub

dataset = load_from_hub("Voxel51/RefCOCO-M")
session = fo.launch_app(dataset)
```
Dataset Details
Dataset… See the full description on the dataset page: https://huggingface.co/datasets/Voxel51/RefCOCO-M.
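For a quick spot-check without browsing every sample, a small random view can be launched instead. A minimal sketch using only core FiftyOne view stages (`shuffle` and `take`); nothing here assumes this dataset's specific fields:

```python
import fiftyone as fo
from fiftyone.utils.huggingface import load_from_hub

# Load the dataset from the Hub as in the usage example above.
dataset = load_from_hub("Voxel51/RefCOCO-M")

# Build a small random view; view stages never modify the underlying dataset.
view = dataset.shuffle(seed=51).take(25)

# Launch the App on just that view.
session = fo.launch_app(view)
```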
The authors used the RefCOCO dataset, a large-scale referring expression dataset built on COCO images, to train and evaluate their models.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Patch-as-Decodable-Token: Towards Unified Multi-Modal Vision Tasks in MLLMs
[🔗 Released Code] [🤗 Datasets] [🤗 Checkpoints] [📄 Tech Report] [🤗 Paper]
Figure A. PaDT pipeline.
🌟 Introduction
We are pleased to introduce Patch-as-Decodable Token (PaDT), a unified paradigm that enables multimodal large language models (MLLMs) to directly generate both textual and visual outputs. At the core of PaDT are Visual Reference Tokens (VRTs). Unlike conventional MLLMs that represent… See the full description on the dataset page: https://huggingface.co/datasets/PaDT-MLLM/RefCOCO.
Visual Grounding is a task that aims to locate a target object according to a natural language expression. The datasets used in this paper are RefCOCO, RefCOCO+, and RefCOCOg.
VDebugger/refcoco dataset hosted on Hugging Face and contributed by the HF Datasets community
RefCOCO, RefCOCO+, Flickr30k
lhoestq/refcoco-m-metadata dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
VoyageWang/refcoco dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Vision-language models aim to seamlessly integrate visual and linguistic information for multi-modal tasks, demanding precise semantic alignments between image-text pairs while minimizing the influence of irrelevant data. While existing methods leverage intra-modal and cross-modal knowledge to enhance alignments, they often fall short in sufficiently reducing interference, which ultimately constrains model performance. To address this gap, we propose a novel vision-language model, the threshold-based knowledge integration network (TBKIN), designed to effectively capture intra-modal and cross-modal knowledge while systematically mitigating the impact of extraneous information. TBKIN employs unified scene graph structures and advanced masking strategies to strengthen semantic alignments and introduces a fine-tuning strategy based on threshold selection to eliminate noise. Comprehensive experimental evaluations demonstrate the efficacy of TBKIN, with our best model achieving state-of-the-art accuracy of 73.90% on the VQA 2.0 dataset and 84.60% on the RefCOCO dataset. Attention visualization and detailed result analysis further validate the robustness of TBKIN in tackling vision-language tasks. The model’s ability to reduce interference while enhancing semantic alignments underscores its potential for advancing multi-modal learning. Extensive experiments across four widely-used benchmark datasets confirm its superior performance on two typical vision-language tasks, offering a practical and effective solution for real-world applications.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Card for RefCOCO Triplets
This dataset contains annotations derived from using ChatGPT to decompose the referring expressions (captions) of the RefCOCO/+/g dataset into triples (subject, predicate, object).
Dataset Details
Dataset Description
Curated by: Zeyu Han
Language(s) (NLP): English
License: cc-by-4.0
Dataset Sources
Repository: https://github.com/Show-han/Zeroshot_REC
Paper: Zero-shot Referring Expression Comprehension via… See the full description on the dataset page: https://huggingface.co/datasets/CresCat01/RefCOCO-Triplets.
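To make the decomposition concrete, the sketch below shows the kind of record it produces for a typical RefCOCO-style expression; the expression and the field names are hypothetical illustrations, not the dataset's actual schema:

```python
# Hypothetical decomposition of a referring expression into
# (subject, predicate, object) triples; names are illustrative only.
expression = "the man in the red shirt holding a frisbee"

triples = [
    {"subject": "man", "predicate": "wearing", "object": "red shirt"},
    {"subject": "man", "predicate": "holding", "object": "frisbee"},
]

for t in triples:
    print(f"({t['subject']}, {t['predicate']}, {t['object']})  <-  {expression}")
```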
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Shengcao1006/RAS-refcoco dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Comparisons with state-of-the-art models on VQA and REC.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset was created by kenji0901
Released under MIT
refcoco with COCO 2017 Image Paths
This dataset is a version of the original refcoco dataset that uses COCO 2017 image paths instead of COCO 2014.
Changes from Original
Image paths updated from the COCO 2014 naming format to the COCO 2017 format (see the path-mapping sketch below)
Images loaded from the COCO 2017 directory structure
All other annotations remain unchanged
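The path change itself is mechanical: COCO 2014 file names embed a split prefix (COCO_train2014_000000123456.jpg), while COCO 2017 keeps only the zero-padded 12-digit image id (000000123456.jpg) under train2017/ or val2017/. A sketch of that mapping, based only on the standard COCO naming conventions; the helper below is illustrative and not taken from this dataset's own tooling:

```python
import os
import re

def coco2014_to_coco2017_path(path_2014: str, images_root: str = "coco") -> str:
    """Map a COCO 2014 image file name to a COCO 2017-style path.

    Illustrative helper: COCO 2014 names embed the split prefix, COCO 2017
    uses the bare 12-digit image id. Note that the 2014 and 2017 train/val
    splits differ, so in practice the 2017 directory should be resolved per
    image id; reusing the 2014 split name here is a simplification.
    """
    name = os.path.basename(path_2014)
    match = re.match(r"COCO_(train|val)2014_(\d{12})\.jpg$", name)
    if match is None:
        raise ValueError(f"Unrecognized COCO 2014 file name: {name}")
    split, image_id = match.groups()
    return os.path.join(images_root, f"{split}2017", f"{image_id}.jpg")

print(coco2014_to_coco2017_path("train2014/COCO_train2014_000000123456.jpg"))
# -> coco/train2017/000000123456.jpg
```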
Usage
```python
from datasets import load_dataset

dataset = load_dataset("jhkwak-bp/refcoco-coco2017")
```
Citation
Please cite the original… See the full description on the dataset page: https://huggingface.co/datasets/jhkwak-bp/refcoco-coco2017.