100+ datasets found
  1. Textual-Image-Caption-Dataset

    • huggingface.co
    Updated Sep 21, 2022
    Cite
    Ahmed Sabir (2022). Textual-Image-Caption-Dataset [Dataset]. https://huggingface.co/datasets/AhmedSSabir/Textual-Image-Caption-Dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 21, 2022
    Authors
    Ahmed Sabir
    Description

    Update: OCT-2023

    Added v2 with a recent SoTA SwinV2 classifier for both soft- and hard-label visual_caption_cosine_score_v2, with person label thresholds (0.2, 0.3, and 0.4)

      Introduction
    

    Modern image captioning relies heavily on extracting knowledge from images, such as objects, to capture the concept of a static story in the image. In this paper, we propose a textual visual context dataset for captioning, where the publicly available dataset COCO caption (Lin et al., 2014) has been… See the full description on the dataset page: https://huggingface.co/datasets/AhmedSSabir/Textual-Image-Caption-Dataset.
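
    Since this dataset is hosted on the Hugging Face Hub, it can typically be loaded with the `datasets` library. The following is a minimal sketch, assuming the repo id shown in the URL above loads with default settings; inspect the printed structure rather than trusting any assumed split or column names.

    ```python
    from datasets import load_dataset

    # Pull the dataset from the Hugging Face Hub by its repo id.
    ds = load_dataset("AhmedSSabir/Textual-Image-Caption-Dataset")

    # Inspect what the dataset actually provides: split names, columns, sizes.
    print(ds)
    split = next(iter(ds.values()))
    print(split[0])  # first example of the first split
    ```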

  2. RSICD Image Caption Dataset

    • kaggle.com
    Updated Dec 6, 2023
    Cite
    The Devastator (2023). RSICD Image Caption Dataset [Dataset]. https://www.kaggle.com/datasets/thedevastator/rsicd-image-caption-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 6, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    RSICD Image Caption Dataset

    By Arto (from Hugging Face) [source]

    About this dataset

    The train.csv file contains a list of image filenames, captions, and the actual images used for training the image captioning models. Similarly, the test.csv file includes a separate set of image filenames, captions, and images specifically designated for testing the accuracy and performance of the trained models.

    Furthermore, the valid.csv file contains a unique collection of image filenames with their respective captions and images that serve as an independent validation set to evaluate the models' capabilities accurately.

    Each entry in these CSV files includes a filename string that identifies an image file stored in another location or directory, along with a list (or multiple rows) of strings containing the written descriptions or captions for that image.

    Given this structure, the dataset can be immensely valuable to researchers, developers, and enthusiasts working on innovative computer vision algorithms such as automatic text generation based on visual content analysis, whether for training machine learning models to generate relevant captions for new, unseen images or for evaluating existing systems' performance against diverse criteria.

    Stay updated with cutting-edge research trends by leveraging this comprehensive dataset, which contains not only captions but also corresponding images across different sets specifically designed to cater to varied purposes within computer vision tasks.

    How to use the dataset

    Overview of the Dataset

    The dataset consists of three primary files: train.csv, test.csv, and valid.csv. These files contain information about image filenames and their respective captions. Each file includes multiple captions for each image to support diverse training techniques.

    Understanding the Files

    • train.csv: This file contains filenames (filename column) and their corresponding captions (captions column) for training your image captioning model.
    • test.csv: This file contains the test set, which follows the same structure as train.csv. Its purpose is to evaluate your trained models on unseen data.
    • valid.csv: This validation set provides images with their respective filenames (filename) and captions (captions). It allows you to fine-tune your models based on performance during evaluation.

    Getting Started

    To begin utilizing this dataset effectively, follow these steps (a short loading sketch follows this list):

    • Extract the zip file containing all relevant data files onto your local machine or cloud environment.
    • Familiarize yourself with each CSV file's structure: train.csv, test.csv, and valid.csv. Understand how information like filename(s) (filename) corresponds with its respective caption(s) (captions).
    • Depending on your specific use case or research goals, determine which portion(s) of the dataset you wish to work with (e.g., only train or train+validation).
    • Load the dataset into your preferred programming environment or machine learning framework, ensuring you have the necessary dependencies installed.
    • Preprocess the dataset as needed, such as resizing images to a specific dimension or encoding captions for model training purposes.
    • Split the data into training, validation, and test sets according to your experimental design requirements.
    • Use appropriate algorithms and techniques to train your image captioning models on the provided data.
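
    As a concrete sketch of steps 2 and 4 above, the snippet below loads the CSV splits with pandas and groups captions by image. The `filename` and `captions` column names come from the file descriptions above; the exact paths and parsing are assumptions to adjust against the actual files.

    ```python
    import pandas as pd

    # Load the three splits (paths assume the extracted archive layout).
    train = pd.read_csv("train.csv")
    test = pd.read_csv("test.csv")
    valid = pd.read_csv("valid.csv")

    # Each image may have multiple caption rows; collect them per filename.
    captions_by_image = train.groupby("filename")["captions"].apply(list)
    print(captions_by_image.head())
    ```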

    Enhancing Model Performance

    To optimize model performance using this dataset, consider these tips:

    • Explore different architectures and pre-trained models specifically designed for image captioning tasks.
    • Experiment with various natural language

    Research Ideas

    • Image Captioning: This dataset can be used to train and evaluate image captioning models. The captions can be used as target labels for training, and the images can be paired with the captions to generate descriptive captions for test images.
    • Image Retrieval: The dataset can be used for image retrieval tasks where given a query caption, the model needs to retrieve the images that best match the description. This can be useful in applications such as content-based image search.
    • Natural Language Processing: The dataset can also be used for natural language processing tasks such as text generation or machine translation. The captions in this dataset are descriptive ...
  3. my-image-captioning-dataset

    • huggingface.co
    Updated Apr 11, 2023
    + more versions
    Cite
    Park (2023). my-image-captioning-dataset [Dataset]. https://huggingface.co/datasets/SKyu/my-image-captioning-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 11, 2023
    Authors
    Park
    Description

    Dataset Card for "my-image-captioning-dataset"

    More Information needed

  4. Structural-Damage Image Captioning Dataset

    • jstagedata.jst.go.jp
    txt
    Updated Dec 15, 2023
    Cite
    OSAMA ABBAS; Ji DANG (2023). Structural-Damage Image Captioning Dataset [Dataset]. http://doi.org/10.60336/data.jsceiiai.24736914.v2
    Explore at:
    Available download formats: txt
    Dataset updated
    Dec 15, 2023
    Dataset provided by
    Japan Society of Civil Engineers
    Authors
    OSAMA ABBAS; Ji DANG
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    This dataset consists of 2,000 images taken from the PEER Hub ImageNet dataset and a caption file in text format.

    The captions draw on a very limited vocabulary set, designed to yield better results while still describing plenty of information, with a caption length of 18 tokens.

  5. Gujarati Image Captioning Dataset

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Cite
    FutureBee AI (2022). Gujarati Image Captioning Dataset [Dataset]. https://www.futurebeeai.com/dataset/multi-modal-dataset/gujarati-image-caption-dataset
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Welcome to the Gujarati Language Image Captioning Dataset! This is a collection of images with associated text captions, intended to facilitate the development of AI models capable of generating high-quality captions for images. The dataset is meticulously crafted to support research and innovation in computer vision and natural language processing.

    Image Data

    This dataset features over 5,000 high-resolution images sourced from diverse categories and scenes. Each image is meticulously selected to encompass a wide array of contexts, objects, and environments, ensuring comprehensive coverage for training robust image captioning models.

    Sources: Images are sourced from public databases and proprietary collections.
    Clarity and Relevance: Each image is vetted for visual clarity and relevance, ensuring it accurately represents real-world scenarios.
    Copyright: All selected images are free from copyright restrictions, allowing for unrestricted use in research and development.
    Format: Images in the dataset are available in various formats like JPEG, PNG, and HEIC.
    Image Categories: The dataset spans a wide range of image categories to ensure thorough training, fine-tuning, and testing of image captioning models. Categories include:
    Daily Life: Images about household objects, activities, and daily routines.
    Nature and Environment: Images related to natural scenes, plants, animals, and weather.
    Technology and Gadgets: Images about electronic devices, tools, and machinery.
    Human Activities: Images about people, their actions, professions, and interactions.
    Geography and Landmarks: Images related to specific locations, landmarks, and geographic features.
    Food and Dining: Images about different foods, meals, and dining settings.
    Education: Images related to educational settings, materials, and activities.
    Sports and Recreation: Images about various sports, games, and recreational activities.
    Transportation: Images about vehicles, travel methods, and transportation infrastructure.
    Cultural and Historical: Images about cultural artifacts, historical events, and traditions.

    Caption Data

    Each image in the dataset is paired with a high-quality descriptive caption. These captions are carefully crafted to provide detailed and contextually rich descriptions of the images, enhancing the dataset's utility for training sophisticated image captioning algorithms.

    Caption Details:
    Human Generated: Each caption is written by native Gujarati speakers.
    Quality Assurance: Captions are meticulously reviewed for linguistic accuracy, coherence, and relevance to the corresponding images.
    Contextual Relevance: Captions are written with attention to visual details such as objects, scenes, actions, and settings depicted in the images.

    Metadata

    Each image-caption pair is accompanied by comprehensive metadata to facilitate informed decision-making in model development:

    Image File Name
    Category
    Caption

    Usage and Applications

    The Image Captioning Dataset serves various applications across different domains:

    Training Image Captioning Models: Provides high-quality data for training and fine-tuning Generative AI models to generate accurate and

  6. Image Data (Object Detection and Captioning)

    • kaggle.com
    Updated Apr 15, 2024
    Cite
    Arunesh (2024). Image Data (Object Detection and Captioning) [Dataset]. https://www.kaggle.com/datasets/aruneshhh/object-detection-images
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 15, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Arunesh
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    🌟 Unlock the potential of advanced computer vision tasks with our comprehensive dataset comprising 15,000 high-quality images. Whether you're delving into segmentation, object detection, or image captioning, our dataset offers a diverse array of visual data to fuel your machine learning models.

    🔍 Our dataset is meticulously curated to encompass a wide range of streams, ensuring versatility and applicability across various domains. From natural landscapes to urban environments, from wildlife to everyday objects, our collection captures the richness and diversity of visual content.

    📊 Dataset Overview:

    Total Images: 15,000
    Training Set (70%): 10,500
    Testing Set (30%): 4,500
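
    The split in this table matches a plain random 70/30 partition. Below is a hypothetical sketch with scikit-learn; the image directory and random seed are assumptions, not part of the dataset.

    ```python
    from pathlib import Path
    from sklearn.model_selection import train_test_split

    # Collect the JPG images (directory layout is an assumption).
    images = sorted(Path("images").glob("*.jpg"))

    # 15,000 images -> 10,500 train / 4,500 test, as in the table above.
    train_files, test_files = train_test_split(images, test_size=0.3, random_state=42)
    print(len(train_files), len(test_files))
    ```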

    🔢 Image Details:

    • Format: JPG
    • Size Range: Approximately 150 to 300 KB per image

    Embark on your computer vision journey and leverage our dataset to develop cutting-edge algorithms, advance research, and push the boundaries of what's possible in visual recognition tasks. Join us in shaping the future of AI-powered image analysis.

  7. i

    Dataset for Image Captioning System (in Bangla)

    • ieee-dataport.org
    Updated Aug 2, 2021
    Cite
    Toshiba Kamruzzaman (2021). Dataset for Image Captioning System (in Bangla) [Dataset]. https://ieee-dataport.org/documents/dataset-image-captioning-system-bangla
    Explore at:
    Dataset updated
    Aug 2, 2021
    Authors
    Toshiba Kamruzzaman
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    etc.

  8. my-image-captioning-dataset

    • huggingface.co
    Updated Oct 31, 2024
    + more versions
    Cite
    Lyllah (2024). my-image-captioning-dataset [Dataset]. https://huggingface.co/datasets/tuy20212521/my-image-captioning-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 31, 2024
    Authors
    Lyllah
    Description

    tuy20212521/my-image-captioning-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community

  9. Danish Image Captioning Dataset

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Cite
    FutureBee AI (2022). Danish Image Captioning Dataset [Dataset]. https://www.futurebeeai.com/dataset/multi-modal-dataset/danish-image-caption-dataset
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Welcome to the Danish Language Image Captioning Dataset! This is a collection of images with associated text captions, intended to facilitate the development of AI models capable of generating high-quality captions for images. The dataset is meticulously crafted to support research and innovation in computer vision and natural language processing.

    Image Data

    This dataset features over 5,000 high-resolution images sourced from diverse categories and scenes. Each image is meticulously selected to encompass a wide array of contexts, objects, and environments, ensuring comprehensive coverage for training robust image captioning models.

    Sources: Images are sourced from public databases and proprietary collections.
    Clarity and Relevance: Each image is vetted for visual clarity and relevance, ensuring it accurately represents real-world scenarios.
    Copyright: All selected images are free from copyright restrictions, allowing for unrestricted use in research and development.
    Format: Images in the dataset are available in various formats like JPEG, PNG, and HEIC.
    Image Categories: The dataset spans a wide range of image categories to ensure thorough training, fine-tuning, and testing of image captioning models. Categories include:
    Daily Life: Images about household objects, activities, and daily routines.
    Nature and Environment: Images related to natural scenes, plants, animals, and weather.
    Technology and Gadgets: Images about electronic devices, tools, and machinery.
    Human Activities: Images about people, their actions, professions, and interactions.
    Geography and Landmarks: Images related to specific locations, landmarks, and geographic features.
    Food and Dining: Images about different foods, meals, and dining settings.
    Education: Images related to educational settings, materials, and activities.
    Sports and Recreation: Images about various sports, games, and recreational activities.
    Transportation: Images about vehicles, travel methods, and transportation infrastructure.
    Cultural and Historical: Images about cultural artifacts, historical events, and traditions.

    Caption Data

    Each image in the dataset is paired with a high-quality descriptive caption. These captions are carefully crafted to provide detailed and contextually rich descriptions of the images, enhancing the dataset's utility for training sophisticated image captioning algorithms.

    Caption Details:
    Human Generated: Each caption is written by native Danish speakers.
    Quality Assurance: Captions are meticulously reviewed for linguistic accuracy, coherence, and relevance to the corresponding images.
    Contextual Relevance: Captions are written with attention to visual details such as objects, scenes, actions, and settings depicted in the images.

    Metadata

    Each image-caption pair is accompanied by comprehensive metadata to facilitate informed decision-making in model development:

    Image File Name
    Category
    Caption

    Usage and Applications

    The Image Captioning Dataset serves various applications across different domains:

    Training Image Captioning Models: Provides high-quality data for training and fine-tuning Generative AI models to generate accurate and

  10. GTA-Image-Captioning-Dataset

    • huggingface.co
    Updated Aug 9, 2025
    Cite
    Vipul Maheshwari (2025). GTA-Image-Captioning-Dataset [Dataset]. https://huggingface.co/datasets/vipulmaheshwari/GTA-Image-Captioning-Dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 9, 2025
    Authors
    Vipul Maheshwari
    Description

    vipulmaheshwari/GTA-Image-Captioning-Dataset dataset hosted on Hugging Face and contributed by the HF Datasets community

  11. French Image Captioning Dataset

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). French Image Captioning Dataset [Dataset]. https://www.futurebeeai.com/dataset/multi-modal-dataset/french-image-caption-dataset
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Area covered
    French
    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Welcome to the French Language Image Captioning Dataset! This is a collection of images with associated text captions, intended to facilitate the development of AI models capable of generating high-quality captions for images. The dataset is meticulously crafted to support research and innovation in computer vision and natural language processing.

    Image Data

    This dataset features over 5,000 high-resolution images sourced from diverse categories and scenes. Each image is meticulously selected to encompass a wide array of contexts, objects, and environments, ensuring comprehensive coverage for training robust image captioning models.

    Sources: Images are sourced from public databases and proprietary collections.
    Clarity and Relevance: Each image is vetted for visual clarity and relevance, ensuring it accurately represents real-world scenarios.
    Copyright: All selected images are free from copyright restrictions, allowing for unrestricted use in research and development.
    Format: Images in the dataset are available in various formats like JPEG, PNG, and HEIC.
    Image Categories: The dataset spans a wide range of image categories to ensure thorough training, fine-tuning, and testing of image captioning models. Categories include:
    Daily Life: Images about household objects, activities, and daily routines.
    Nature and Environment: Images related to natural scenes, plants, animals, and weather.
    Technology and Gadgets: Images about electronic devices, tools, and machinery.
    Human Activities: Images about people, their actions, professions, and interactions.
    Geography and Landmarks: Images related to specific locations, landmarks, and geographic features.
    Food and Dining: Images about different foods, meals, and dining settings.
    Education: Images related to educational settings, materials, and activities.
    Sports and Recreation: Images about various sports, games, and recreational activities.
    Transportation: Images about vehicles, travel methods, and transportation infrastructure.
    Cultural and Historical: Images about cultural artifacts, historical events, and traditions.

    Caption Data

    Each image in the dataset is paired with a high-quality descriptive caption. These captions are carefully crafted to provide detailed and contextually rich descriptions of the images, enhancing the dataset's utility for training sophisticated image captioning algorithms.

    Caption Details:
    Human Generated: Each caption is written by native French speakers.
    Quality Assurance: Captions are meticulously reviewed for linguistic accuracy, coherence, and relevance to the corresponding images.
    Contextual Relevance: Captions are written with attention to visual details such as objects, scenes, actions, and settings depicted in the images.

    Metadata

    Each image-caption pair is accompanied by comprehensive metadata to facilitate informed decision-making in model development:

    Image File Name
    Category
    Caption

    Usage and Applications

    The Image Captioning Dataset serves various applications across different domains:

    Training Image Captioning Models: Provides high-quality data for training and fine-tuning Generative AI models to generate accurate and

  12. Swahili Image Captioning Dataset

    • zenodo.org
    csv
    Updated Jun 29, 2023
    Cite
    Grevians Onula; Shikali Casper; Omala Andrew (2023). Swahili Image Captioning Dataset [Dataset]. http://doi.org/10.5281/zenodo.8087852
    Explore at:
    Available download formats: csv
    Dataset updated
    Jun 29, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Grevians Onula; Shikali Casper; Omala Andrew
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The SwaFlickr8k dataset is an extension of the well-known Flickr8k dataset, specifically designed for image captioning tasks. It includes a collection of images and corresponding captions written in Swahili. With 8,091 unique images and 40,455 captions, this dataset provides a valuable resource for research and development in image understanding and language processing, particularly in the context of the Swahili language.
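
    The counts imply five captions per image (40,455 / 8,091), matching the Flickr8k convention. Below is a minimal sketch for checking this once the CSV is downloaded; the file and column names are assumptions to verify against the actual Zenodo release.

    ```python
    import pandas as pd

    # File and column names are assumptions; inspect the Zenodo CSV first.
    df = pd.read_csv("swaflickr8k_captions.csv")
    per_image = df.groupby("image")["caption"].count()
    print(per_image.describe())  # expect a mean of 5 captions per image
    ```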

  13. Data and codes to reproduce the study, titled: "Understanding remote sensing imagery like reading a text document: What can remote sensing image captioning offer?"

    • figshare.com
    zip
    Updated Nov 30, 2023
    Cite
    Anonymous Contributor (2023). Data and codes to reproduce the study, titled:" Understanding remote sensing imagery like reading a text document: What can remote sensing image captioning offer?" [Dataset]. http://doi.org/10.6084/m9.figshare.24665631.v2
    Explore at:
    Available download formats: zip
    Dataset updated
    Nov 30, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Anonymous Contributor
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Remote sensing imagery offers intricate and nuanced data, emphasizing the need for a profound understanding of the relationships among varied geographical elements and events. In this study, we explore the transitions from the image domain to the text domain by employing four state-of-the-art image captioning algorithms, i.e., BLIP, mPLUG, OFA, and X-VLM. Specifically, we investigate 1) the stability of these image captioning algorithms for remote sensing image captioning, 2) the preservation of similarity between images and their corresponding captions, and 3) the characteristics of their caption embedding spaces. The results suggest a moderate consistency across generated captions from different image captioning models, with observable variations contingent upon the urban entities presented. In addition, a dynamic relationship emerges between image space and the corresponding caption space, evidenced by their fluctuating correlation coefficient. Most importantly, patterns within the caption embedding space align with the observed land cover and land use in the image patches, reaffirming the potential of our pilot work as an impactful analytical approach in future remote sensing analytics. We advocate that integrating image captioning techniques with remote sensing imagery paves the way for an innovative data extraction and interpretation approach with diverse applications. This dataset contains the data and code to reproduce this study.

  14. Image captioning dataset for human activities

    • zenodo.org
    jpeg, txt
    Updated Jan 21, 2021
    Cite
    TU (2021). Image captioning dataset for human activities [Dataset]. http://doi.org/10.5281/zenodo.4453525
    Explore at:
    Available download formats: jpeg, txt
    Dataset updated
    Jan 21, 2021
    Dataset provided by
    Zenodo
    Authors
    TU
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    An image captioning dataset of humans performing various activities. The images cover the following activities: walking, running, sleeping, swimming, sitting, jumping, riding, climbing, drinking, and reading.

  15. Image Captioning Dataset(without links)

    • kaggle.com
    zip
    Updated Sep 7, 2020
    + more versions
    Cite
    Nabi Nabiyev (2020). Image Captioning Dataset(without links) [Dataset]. https://www.kaggle.com/nabiyevnabi/image-captioning-datasetwithout-links
    Explore at:
    Available download formats: zip (33,855,315 bytes)
    Dataset updated
    Sep 7, 2020
    Authors
    Nabi Nabiyev
    Description

    Dataset

    This dataset was created by Nabi Nabiyev

    Contents

    It contains the following files:

  16. Riset Image Captioning Dataset

    • universe.roboflow.com
    zip
    Updated Jan 16, 2024
    Cite
    Riset Image (2024). Riset Image Captioning Dataset [Dataset]. https://universe.roboflow.com/riset-image/riset-image-captioning/model/4
    Explore at:
    Available download formats: zip
    Dataset updated
    Jan 16, 2024
    Dataset authored and provided by
    Riset Image
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Trotoar Bounding Boxes
    Description

    Here are a few use cases for this project:

    1. Urban Planning and Development: This computer vision model could be used by urban planning professionals or architects to understand and analyze sidewalk activity. With its ability to detect various classes such as pedestrians, bicycles, cars, and trees, the software can provide insights into how urban spaces are used and help in designing more efficient and safe environments.

    2. Smart City Applications: The model can be used in Smart City initiatives to automatically analyze and monitor public spaces. For example, monitoring the usage pattern of benches, bikes, or bus stops for intelligent management or detecting any unusual activities on streets, like obstructions due to fallen trees or incorrectly parked vehicles.

    3. Traffic Management and Control: Traffic control systems can use this model to monitor and control traffic flow based on real-time data related to cars, motorbikes, bicycles, buses, and pedestrian movements detected on the zebra crossing and sidewalks.

    4. Accessibility Assessment: NGOs or government agencies focusing on public accessibility and pedestrian safety can use this model to analyze cities' sidewalks. The model can detect elements like benches, trash cans, plant pots, posts, and bollards, which are essential for assessing sidewalk accessibility, especially for disabled or elderly citizens.

    5. Augmented Reality (AR) Apps: AR applications developers can use this computer vision model to create more immersive and realistic AR experiences within urban environments. Recognizing real-world objects like trees, people, vehicles, benches, and more could help anchor digital enhancements in physical spaces.

  17. Polish Image Captioning Dataset

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Polish Image Captioning Dataset [Dataset]. https://www.futurebeeai.com/dataset/multi-modal-dataset/polish-image-caption-dataset
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Welcome to the Polish Language Image Captioning Dataset! This is a collection of images with associated text captions, intended to facilitate the development of AI models capable of generating high-quality captions for images. The dataset is meticulously crafted to support research and innovation in computer vision and natural language processing.

    Image Data

    This dataset features over 5,000 high-resolution images sourced from diverse categories and scenes. Each image is meticulously selected to encompass a wide array of contexts, objects, and environments, ensuring comprehensive coverage for training robust image captioning models.

    Sources: Images are sourced from public databases and proprietary collections.
    Clarity and Relevance: Each image is vetted for visual clarity and relevance, ensuring it accurately represents real-world scenarios.
    Copyright: All selected images are free from copyright restrictions, allowing for unrestricted use in research and development.
    Format: Images in the dataset are available in various formats like JPEG, PNG, and HEIC.
    Image Categories: The dataset spans a wide range of image categories to ensure thorough training, fine-tuning, and testing of image captioning models. Categories include:
    Daily Life: Images about household objects, activities, and daily routines.
    Nature and Environment: Images related to natural scenes, plants, animals, and weather.
    Technology and Gadgets: Images about electronic devices, tools, and machinery.
    Human Activities: Images about people, their actions, professions, and interactions.
    Geography and Landmarks: Images related to specific locations, landmarks, and geographic features.
    Food and Dining: Images about different foods, meals, and dining settings.
    Education: Images related to educational settings, materials, and activities.
    Sports and Recreation: Images about various sports, games, and recreational activities.
    Transportation: Images about vehicles, travel methods, and transportation infrastructure.
    Cultural and Historical: Images about cultural artifacts, historical events, and traditions.

    Caption Data

    Each image in the dataset is paired with a high-quality descriptive caption. These captions are carefully crafted to provide detailed and contextually rich descriptions of the images, enhancing the dataset's utility for training sophisticated image captioning algorithms.

    Caption Details:
    Human Generated: Each caption is written by native Polish speakers.
    Quality Assurance: Captions are meticulously reviewed for linguistic accuracy, coherence, and relevance to the corresponding images.
    Contextual Relevance: Captions are written with attention to visual details such as objects, scenes, actions, and settings depicted in the images.

    Metadata

    Each image-caption pair is accompanied by comprehensive metadata to facilitate informed decision-making in model development:

    Image File Name
    Category
    Caption

    Usage and Applications

    The Image Captioning Dataset serves various applications across different domains:

    Training Image Captioning Models: Provides high-quality data for training and fine-tuning Generative AI models to generate accurate and

  18. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning

    • service.tib.eu
    Updated Dec 3, 2024
    Cite
    (2024). Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning - Dataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/conceptual-captions--a-cleaned--hypernymed--image-alt-text-dataset-for-automatic-image-captioning
    Explore at:
    Dataset updated
    Dec 3, 2024
    Description

    Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning.

  19. Safevision:image Captioning Dataset

    • universe.roboflow.com
    zip
    Updated May 8, 2025
    Cite
    VLMBLIP (2025). Safevision:image Captioning Dataset [Dataset]. https://universe.roboflow.com/vlmblip/safevision-image-captioning/dataset/1
    Explore at:
    Available download formats: zip
    Dataset updated
    May 8, 2025
    Dataset authored and provided by
    VLMBLIP
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Safety Descriptions
    Description

    SafeVision:IMAGE CAPTIONING

    Overview

    SafeVision:IMAGE CAPTIONING is a dataset for vision-language (multimodal) tasks; it contains safety annotations for 314 images.

    Getting Started

    You can download this dataset for use within your own projects, or fork it into a workspace on Roboflow to create your own model. A download sketch follows the license note below.

    License

    This dataset is available under the CC BY 4.0 license (https://creativecommons.org/licenses/by/4.0/).
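
    Downloading from Roboflow Universe is usually done with the `roboflow` Python package. The sketch below is an assumption-based example: the workspace, project, and version are read off the dataset URL in the citation, and the export format is a guess to verify against the project page.

    ```python
    from roboflow import Roboflow

    rf = Roboflow(api_key="YOUR_API_KEY")  # requires a Roboflow API key

    # Workspace/project/version taken from the dataset URL; export format is an assumption.
    project = rf.workspace("vlmblip").project("safevision-image-captioning")
    dataset = project.version(1).download("coco")
    print(dataset.location)  # local folder containing the downloaded dataset
    ```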
    
  20. Flickr30k

    • datasets.activeloop.ai
    • opendatalab.com
    • +1more
    deeplake
    Updated Mar 30, 2022
    Cite
    Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier (2022). Flickr30k [Dataset]. https://datasets.activeloop.ai/docs/ml/datasets/flickr30k-dataset/
    Explore at:
    Available download formats: deeplake
    Dataset updated
    Mar 30, 2022
    Authors
    Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier
    License

    Attribution 2.0 (CC BY 2.0): https://creativecommons.org/licenses/by/2.0/
    License information was derived automatically

    Description

    A dataset of 30,000 images with 5 captions per image. The dataset was created by researchers at the University of Illinois at Urbana-Champaign and is used for research in machine learning and natural language processing tasks such as image captioning and visual question answering.
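
    Given the deeplake download format listed above, the dataset can presumably be streamed with Activeloop's Deep Lake client. The hub path below is an assumption based on Activeloop's naming conventions; check the linked documentation page for the exact path.

    ```python
    import deeplake

    # Hub path is an assumption; verify against the Activeloop docs page above.
    ds = deeplake.load("hub://activeloop/flickr30k")
    print(ds.tensors)  # inspect available tensors (images, captions, ...)
    ```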
