29 datasets found
  1. Image Pre-processing for Model Training

    • kaggle.com
    Updated Apr 23, 2021
    Cite
    Habib Mrad (2021). Image Pre-processing for Model Training [Dataset]. https://www.kaggle.com/datasets/habibmrad1983/image-preprocessing-for-model-training
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 23, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Habib Mrad
    Description

    Dataset

    This dataset was created by Habib Mrad


  2. bert baseline pre and post process

    • kaggle.com
    Updated Feb 5, 2020
    Cite
    prvi (2020). bert baseline pre and post process [Dataset]. https://www.kaggle.com/datasets/prokaj/bert-baseline-pre-and-post-process
    Dataset updated
    Feb 5, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    prvi
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset

    This dataset was created by prvi

    Released under CC0: Public Domain


  3. The CloudCast Dataset (small)

    • kaggle.com
    Updated Oct 22, 2021
    + more versions
    Cite
    Christian Lillelund (2021). The CloudCast Dataset (small) [Dataset]. https://www.kaggle.com/datasets/christianlillelund/the-cloudcast-dataset-small
    Dataset updated
    Oct 22, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Christian Lillelund
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Example observations image: https://vision.eng.au.dk/wp-content/uploads/2020/07/example_obs-1024x206-1024x206.jpg

    CloudCast: A large-scale dataset and baseline for forecasting clouds

    The CloudCast dataset contains 70,080 cloud-labeled satellite images with 10 different cloud types corresponding to multiple layers of the atmosphere. The raw satellite images come from a satellite constellation in geostationary orbit centred at zero degrees longitude and arrive in 15-minute intervals from the European Organisation for the Exploitation of Meteorological Satellites (EUMETSAT). The resolution of these images is 3712 × 3712 pixels for the full disk of Earth, which implies that every pixel corresponds to an area of 3 × 3 km. This is the highest possible resolution from European geostationary satellites when including infrared channels. EUMETSAT also performs some pre- and post-processing of the raw satellite images before they are exposed to the public, such as removing airplanes. We collect all the raw multispectral satellite images and annotate them individually on a pixel level using a segmentation algorithm. The full dataset then has a spatial resolution of 928 × 1530 pixels recorded at 15-minute intervals for the period 2017-2018, where each pixel represents an area of 3 × 3 km. To enable standardized datasets for benchmarking computer vision methods, this release includes both the full-resolution grayscale dataset and a centered, projected dataset over Europe (128 × 128).

    License

    This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

    Citation

    If you use this dataset in your research or elsewhere, please cite/reference the following paper: CloudCast: A Satellite-Based Dataset and Baseline for Forecasting Clouds

    Data dictionary

    There are 24 folders in the dataset containing the following information:

    | File | Definition | Note |
    | --- | --- | --- |
    | X.npy | Numpy encoded array containing the actual 128x128 image with pixel values as labels, see below. | |
    | GEO.npz | Numpy array containing geo coordinates where the image was taken (latitude and longitude). | |
    | TIMESTAMPS.npy | Numpy array containing timestamps for each captured image. | Images are captured in 15-minute intervals. |
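    Given these file conventions, one folder's arrays load directly with NumPy. The snippet below is a self-contained sketch: it writes a stand-in folder to a temp directory first, since the real dataset path is not known here.

```python
import os
import tempfile

import numpy as np

# Hypothetical stand-in for one of the 24 dataset folders: the real
# X.npy holds 128x128 images whose pixel values are the cloud-type
# labels 0-10, and TIMESTAMPS.npy holds the 15-minute capture times.
folder = tempfile.mkdtemp()
np.save(os.path.join(folder, "X.npy"),
        np.random.randint(0, 11, (128, 128)).astype(np.uint8))
np.save(os.path.join(folder, "TIMESTAMPS.npy"),
        np.array(["2017-01-01T00:00", "2017-01-01T00:15"], dtype="datetime64[m]"))

X = np.load(os.path.join(folder, "X.npy"))
timestamps = np.load(os.path.join(folder, "TIMESTAMPS.npy"))

print(X.shape)                        # (128, 128)
print(timestamps[1] - timestamps[0])  # 15 minutes
```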

    Cloud types

    0 = No clouds or missing data
    1 = Very low clouds
    2 = Low clouds
    3 = Mid-level clouds
    4 = High opaque clouds
    5 = Very high opaque clouds
    6 = Fractional clouds
    7 = High semitransparent thin clouds
    8 = High semitransparent moderately thick clouds
    9 = High semitransparent thick clouds
    10 = High semitransparent above low or medium clouds

    Examples

    https://i.ibb.co/NFv55QW/cloudcast4.png
    https://i.ibb.co/3FhHzMT/cloudcast3.png
    https://i.ibb.co/9wCsJhR/cloudcast2.png
    https://i.ibb.co/9T5dbSH/cloudcast1.png

  4. Pistachio Dataset

    • kaggle.com
    Updated Apr 3, 2022
    + more versions
    Cite
    Murat KOKLU (2022). Pistachio Dataset [Dataset]. https://www.kaggle.com/datasets/muratkokludataset/pistachio-dataset/code
    Dataset updated
    Apr 3, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Murat KOKLU
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Pistachio Image Dataset https://www.kaggle.com/datasets/muratkokludataset/pistachio-image-dataset

    DATASET: https://www.muratkoklu.com/datasets/

    Citation Request :

    1. OZKAN IA., KOKLU M. and SARACOGLU R. (2021). Classification of Pistachio Species Using Improved K-NN Classifier. Progress in Nutrition, Vol. 23, N. 2. DOI:10.23751/pn.v23i2.9686. (Open Access) https://www.mattioli1885journals.com/index.php/progressinnutrition/article/view/9686/9178

    2. SINGH D, TASPINAR YS, KURSUN R, CINAR I, KOKLU M, OZKAN IA, LEE H-N., (2022). Classification and Analysis of Pistachio Species with Pre-Trained Deep Learning Models, Electronics, 11 (7), 981. https://doi.org/10.3390/electronics11070981. (Open Access)

    Article Download (PDF):
    1: https://www.mattioli1885journals.com/index.php/progressinnutrition/article/view/9686/9178
    2: https://doi.org/10.3390/electronics11070981

    ABSTRACT: In order to keep the economic value of pistachio nuts, which have an important place in the agricultural economy, the efficiency of post-harvest industrial processes is very important. To provide this efficiency, new methods and technologies are needed for the separation and classification of pistachios. Different pistachio species address different markets, which increases the need for the classification of pistachio species. In this study, the aim is to develop a classification model different from traditional separation methods, based on image processing and artificial intelligence, capable of providing the required classification. A computer vision system has been developed to distinguish two different species of pistachios with different characteristics that address different market types. 2148 sample images of these two kinds of pistachios were taken with a high-resolution camera. Image processing techniques, segmentation, and feature extraction were applied to the obtained images of the pistachio samples. A pistachio dataset with sixteen attributes was created. An advanced classifier based on the k-NN method, a simple and successful classifier, and principal component analysis was designed on the obtained dataset. In this study, a multi-level system including feature extraction, dimension reduction, and dimension weighting stages has been proposed. Experimental results showed that the proposed approach achieved a classification success of 94.18%. The presented high-performance classification model serves an important need for the separation of pistachio species and increases the economic value of the species. In addition, the developed model is important in terms of its application to similar studies. Keywords: Classification, Image processing, k nearest neighbor classifier, Pistachio species

    2. SINGH D, TASPINAR YS, KURSUN R, CINAR I, KOKLU M, OZKAN IA, LEE H-N., (2022). Classification and Analysis of Pistachio Species with Pre-Trained Deep Learning Models, Electronics, 11 (7), 981. https://doi.org/10.3390/electronics11070981. (Open Access)

    ABSTRACT: Pistachio is a shelled fruit from the anacardiaceae family. The homeland of pistachio is the Middle East. The Kirmizi pistachios and Siirt pistachios are the major types grown and exported in Turkey. Since the prices, tastes, and nutritional values of these types differ, the type of pistachio becomes important when it comes to trade. This study aims to identify these two types of pistachios, which are frequently grown in Turkey, by classifying them via convolutional neural networks. Within the scope of the study, images of Kirmizi and Siirt pistachio types were obtained through the computer vision system. The pre-trained dataset includes a total of 2148 images, 1232 of Kirmizi type and 916 of Siirt type. Three different convolutional neural network models were used to classify these images. Models were trained using the transfer learning method, with AlexNet and the pre-trained models VGG16 and VGG19. The dataset is divided into 80% training and 20% test. As a result of the performed classifications, the success rates obtained from the AlexNet, VGG16, and VGG19 models are 94.42%, 98.84%, and 98.14%, respectively. Models' performances were evaluated through sensitivity, specificity, precision, and F-1 score metrics. In addition, ROC curves and AUC values were used in the performance evaluation. The highest classification success was achieved with the VGG16 model. The obtained results reveal that these methods can be used successfully in the determination of pistachio types. Keywords: pistachio; genetic varieties; machine learning; deep learning; food recognition

    https://www.muratkoklu.com/datasets/

  5. Alpaca Dataset Image Classification Dataset

    • paperswithcode.com
    • gts.ai
    Updated Jun 26, 2025
    + more versions
    Cite
    (2025). Alpaca Dataset Image Classification Dataset [Dataset]. https://paperswithcode.com/dataset/alpaca-dataset-image-classification
    Dataset updated
    Jun 26, 2025
    Description


    The Alpaca Dataset is a collection of JPEG images designed for binary image classification tasks, specifically classifying images as “Alpaca” or “Not Alpaca”. This dataset is ideal for training and fine-tuning machine learning models using transfer learning techniques.


    Context

    This small dataset is perfect for educational purposes, initial model testing, and developing proof-of-concept applications in image classification. Due to its limited size, it is most beneficial when used in conjunction with transfer learning to leverage pre-trained models for improved accuracy.
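    As a toy illustration of the transfer-learning idea described above (a frozen pre-trained feature extractor plus a small trainable head), here is a self-contained NumPy sketch. The random-projection "extractor" and the synthetic alpaca/not-alpaca images are stand-ins for a real pre-trained CNN and the JPEG data; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pre-trained backbone: its weights stay frozen, and
# only the small logistic-regression head below is trained.
W_frozen = rng.normal(0.0, 1.0 / 16, (256, 32))

def extract_features(images):
    """images: (n, 256) flattened 16x16 'pixels'."""
    return np.tanh(images @ W_frozen)

# Synthetic binary data: 'alpaca' images are darker on average.
X = np.vstack([rng.normal(0.2, 0.1, (50, 256)),
               rng.normal(0.8, 0.1, (50, 256))])
y = np.array([0] * 50 + [1] * 50)

F = extract_features(X)
w, b = np.zeros(F.shape[1]), 0.0
for _ in range(300):                  # gradient steps on the head only
    p = 1.0 / (1.0 + np.exp(-(F @ w + b)))
    grad = p - y
    w -= 0.1 * F.T @ grad / len(y)
    b -= 0.1 * grad.mean()

pred = (1.0 / (1.0 + np.exp(-(F @ w + b))) > 0.5).astype(int)
accuracy = (pred == y).mean()
print(f"training accuracy: {accuracy:.2f}")
```

    With the real images you would swap the random projection for, e.g., the penultimate layer of a pre-trained CNN, which is exactly why a small dataset like this one still trains well.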

    Content

    The dataset is organized into two primary directories:

    Alpaca: Contains images that include alpacas.

    Not Alpaca: Contains images without alpacas, featuring subjects that may resemble alpacas but are not.

    Additional Information

    Format: All images are in JPEG format, ensuring compatibility with a wide range of image processing libraries and tools.

    Usage: This dataset can be utilized in various machine learning frameworks such as TensorFlow, PyTorch, and Keras for building and testing classification models.

    Applications: Potential applications include animal recognition systems, educational tools, and development of AI-driven content moderation systems.

    Data Statistics

    Total Images: X (Number of images in the dataset)

    Alpaca Images: Y (Number of images in the Alpaca directory)

    Not Alpaca Images: Z (Number of images in the Not Alpaca directory)

    Image Resolution: Varies, with most images having a resolution suitable for quick model training and evaluation.

    This dataset is sourced from Kaggle.

  6. Bangla Natural Language Image to Text (BNLIT)

    • data.mendeley.com
    • dataverse.harvard.edu
    • +2more
    Updated Feb 15, 2020
    + more versions
    Cite
    Md. Asifuzzaman Jishan (2020). Bangla Natural Language Image to Text (BNLIT) [Dataset]. http://doi.org/10.17632/ws3r82gnm8.4
    Dataset updated
    Feb 15, 2020
    Authors
    Md. Asifuzzaman Jishan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We present a new Bangla dataset together with a Hybrid Recurrent Neural Network model that generates Bangla natural-language descriptions of images. The dataset consists of a large number of classified images paired with natural-language descriptions. We conducted experiments on our self-made Bangla Natural Language Image to Text (BNLIT) dataset, which contains 8,743 images. We made this dataset using Bangladesh-perspective images, with one annotation per image. In our repository, we added two types of pre-processed data, at 224 × 224 and 500 × 375 resolution respectively, alongside annotations for the full dataset. We also added a CNN features file for the whole dataset, features.pkl.
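    The features.pkl file can presumably be read with Python's pickle module. The snippet below round-trips a stand-in object in memory; the file's internal layout (assumed here to be a mapping from image names to feature vectors) is an assumption, not documented by the source.

```python
import io
import pickle

import numpy as np

# Stand-in for features.pkl: assumed to map image names to CNN
# feature vectors (the real file's layout may differ).
features = {"img_0001.jpg": np.zeros(4096, dtype=np.float32)}

buf = io.BytesIO()
pickle.dump(features, buf)           # how such a file would be written
buf.seek(0)
loaded = pickle.load(buf)            # loading it back

print(len(loaded), loaded["img_0001.jpg"].shape)  # 1 (4096,)
```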

  7. Fruit Image Dataset: 22 Classes

    • kaggle.com
    Updated Oct 8, 2023
    Cite
    Md. Sagor Ahmed (2023). Fruit Image Dataset: 22 Classes [Dataset]. https://www.kaggle.com/datasets/mdsagorahmed/fruit-image-dataset-22-classes/versions/1
    Dataset updated
    Oct 8, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Md. Sagor Ahmed
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Welcome to the Fruit Image Dataset on Kaggle! This dataset contains over 8,700 uncleaned images belonging to **22 different classes**: 11 ripe and 11 unripe fruits. This diverse collection of images is a valuable resource for anyone interested in image processing and computer vision tasks, particularly image classification projects.

    Whether you're a beginner looking to start your journey in computer vision or an experienced data scientist working on a low-configuration PC, this dataset offers a wide range of possibilities. You can use these images for:

    • Image Classification: Train machine learning models to accurately classify fruits as ripe or unripe.
    • Object Detection: Build object detection models to identify and locate fruits in images.
    • Image Enhancement: Apply image preprocessing techniques to clean and enhance the dataset for improved model training.
    • Transfer Learning: Leverage pre-trained models to fine-tune and optimize fruit classification tasks.

    Feel free to download this dataset from my Kaggle account and explore the world of fruit image analysis. Don't forget to share your findings and contributions with the Kaggle community. Happy coding!

  8. SoyNet: Indian Soybean Image dataset with quality images captured from the agriculture field (healthy and disease Images)

    • data.mendeley.com
    Updated Jun 2, 2023
    + more versions
    Cite
    Arpan Singh Rajput (2023). SoyNet: Indian Soybean Image dataset with quality images captured from the agriculture field ( healthy and disease Images) [Dataset]. http://doi.org/10.17632/w2r855hpx8.2
    Dataset updated
    Jun 2, 2023
    Authors
    Arpan Singh Rajput
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    High-quality images of soybean leaves are required to solve soybean disease and healthy-leaf classification and recognition problems. A neat and clean dataset is an elementary requirement for building machine learning and deep learning models in research. With this objective, this dataset, named “SoyNet”, was created; it consists of healthy and disease-quality images of soybean. It contains 9000+ high-quality images of soybeans (healthy and diseased) taken from different angles and captured directly in the soybean agriculture field, so that real research problems can be analyzed. The images are divided into 2 sub-folders: 1) Raw SoyNet Data and 2) Pre-processing SoyNet Data. The raw folder contains 1) Digital Camera Click, with healthy and disease image folders, and 2) Mobile Phone Click, with disease images. The Pre-processing SoyNet Data folder contains 256×256 resized images and grayscale images, organized into disease and healthy data in the same manner. A digital camera and a mobile phone with a high-resolution camera were used to capture the images, which were taken in the soybean cultivation field under different lighting conditions and backgrounds. The proposed dataset can be used for training, testing, and validation of soybean classification or recognition models.
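    The two pre-processing steps the folders reflect (256×256 resizing and grayscale conversion) can be sketched in plain NumPy; a synthetic RGB array stands in for a real soybean-leaf photo, and nearest-neighbour resizing is used only to keep the sketch dependency-free.

```python
import numpy as np

rng = np.random.default_rng(1)
img = rng.integers(0, 256, (512, 1024, 3), dtype=np.uint8)  # stand-in H x W x RGB photo

def resize_nearest(a, h, w):
    """Nearest-neighbour resize; enough for a preprocessing sketch."""
    rows = np.arange(h) * a.shape[0] // h
    cols = np.arange(w) * a.shape[1] // w
    return a[rows][:, cols]

def to_grayscale(a):
    """Weighted RGB-to-luminance conversion (ITU-R BT.601 weights)."""
    return (a @ np.array([0.299, 0.587, 0.114])).astype(np.uint8)

resized = resize_nearest(img, 256, 256)
gray = to_grayscale(resized)
print(resized.shape, gray.shape)   # (256, 256, 3) (256, 256)
```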

  9. Data from: MIMIC-CXR-JPG - chest radiographs with structured labels

    • physionet.org
    Updated Mar 12, 2024
    Cite
    Alistair Johnson; Matthew Lungren; Yifan Peng; Zhiyong Lu; Roger Mark; Seth Berkowitz; Steven Horng (2024). MIMIC-CXR-JPG - chest radiographs with structured labels [Dataset]. http://doi.org/10.13026/jsn5-t979
    Dataset updated
    Mar 12, 2024
    Authors
    Alistair Johnson; Matthew Lungren; Yifan Peng; Zhiyong Lu; Roger Mark; Seth Berkowitz; Steven Horng
    License

    https://github.com/MIT-LCP/license-and-dua/tree/master/drafts

    Description

    The MIMIC Chest X-ray JPG (MIMIC-CXR-JPG) Database v2.0.0 is a large publicly available dataset of chest radiographs in JPG format with structured labels derived from free-text radiology reports. The MIMIC-CXR-JPG dataset is wholly derived from MIMIC-CXR, providing JPG format files derived from the DICOM images and structured labels derived from the free-text reports. The aim of MIMIC-CXR-JPG is to provide a convenient processed version of MIMIC-CXR, as well as to provide a standard reference for data splits and image labels. The dataset contains 377,110 JPG format images and structured labels derived from the 227,827 free-text radiology reports associated with these images. The dataset is de-identified to satisfy the US Health Insurance Portability and Accountability Act of 1996 (HIPAA) Safe Harbor requirements. Protected health information (PHI) has been removed. The dataset is intended to support a wide body of research in medicine including image understanding, natural language processing, and decision support.

  10. Raisin

    • kaggle.com
    Updated Sep 17, 2022
    Cite
    basharath ali (2022). Raisin [Dataset]. https://www.kaggle.com/datasets/basharath123/raisin
    Dataset updated
    Sep 17, 2022
    Dataset provided by
    Kaggle
    Authors
    basharath ali
    Description

    The Government of India is studying two varieties of raisins. These varieties are of great value and are thus important. Research is being done, and images of both varieties are obtained with a computer vision system (CVS). The images are subjected to various stages of pre-processing, and 7 morphological features are extracted.

  11. Sign Language Gesture Images Dataset

    • kaggle.com
    zip
    Updated Sep 10, 2019
    Cite
    Ahmed Khan (2019). Sign Language Gesture Images Dataset [Dataset]. https://www.kaggle.com/datasets/ahmedkhanak1995/sign-language-gesture-images-dataset
    Explore at:
    zip (199984313 bytes)
    Dataset updated
    Sep 10, 2019
    Authors
    Ahmed Khan
    License

    https://ec.europa.eu/info/legal-notice_en

    Description

    Context

    Sign language is a communication language, just like any other language, used among the deaf community. This dataset is a complete set of the gestures used in sign language; it can also help hearing people better understand sign language gestures.

    Content

    The dataset consists of 37 different hand-sign gestures: the A-Z alphabet gestures, the 0-9 number gestures, and a gesture for space, i.e., how deaf people represent the space between two letters or two words while communicating. The dataset has two parts, i.e., two folders. (1) Gesture Image Data consists of colored images of the hands for the different gestures. Each gesture image is of size 50×50 and sits in its specified folder: the A-Z folders contain the A-Z gesture images, the 0-9 folders contain the 0-9 gesture images, and the '_' folder contains images of the gesture for space. Each gesture has 1,500 images, so with 37 gestures altogether there are 55,500 images in the first folder. (2) Gesture Image Pre-Processed Data has the same folder structure and the same number of images, 55,500; the difference is that these images are threshold binary converted images for training and testing purposes. A Convolutional Neural Network is well suited to this dataset for model training and gesture prediction.
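    The "threshold binary converted" images in the second folder can be reproduced from grayscale versions of the first folder's images with a fixed threshold. A minimal NumPy sketch (the 50×50 size follows the description; the threshold value is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
gesture = rng.integers(0, 256, (50, 50), dtype=np.uint8)  # stand-in 50x50 grayscale gesture

threshold = 127                                           # illustrative cut-off
binary = np.where(gesture > threshold, 255, 0).astype(np.uint8)

print(binary.shape, np.unique(binary))                    # only 0 and 255 survive
```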

    Acknowledgements

    I wouldn't be here without the help of others. This dataset was created with the help of references from work done on sign language in data science and from work done on image processing.

  12. Job Vacancy Tweets

    • kaggle.com
    Updated Apr 10, 2023
    Cite
    Prasad Patil (2023). Job Vacancy Tweets [Dataset]. https://www.kaggle.com/datasets/prasad22/job-vacancy-tweets/code
    Dataset updated
    Apr 10, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Prasad Patil
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset contains 50,000 tweets related to job vacancies and hiring, extracted using the keywords 'Job Vacancy', 'We are Hiring', and 'We're Hiring'. The tweets were collected between January 1, 2019, and April 10, 2023, with the help of the Python snscrape library, and are provided in CSV format.

    The purpose behind this dataset

    • To explore text pre-processing and test NLP skills
    • To draw interesting insights on the job market from job postings
    • To analyse company/role requirements where possible

    The dataset includes the following information for each tweet:

    • ID: The unique identifier for the tweet.
    • Timestamp: The date and time when the tweet was posted.
    • User: The Twitter handle of the user who posted the tweet.
    • Text: The content of the tweet.
    • Hashtag: The hashtags included in the tweet, if any.
    • Retweets: The number of times the tweet has been retweeted as of the time it was scraped.
    • Likes: The number of likes the tweet has received as of the time it was scraped.
    • Replies: The number of replies to the tweet as of the time it was scraped.
    • Source: The source application or device used to post the tweet.
    • Location: The location listed on the user's Twitter profile, if any.
    • Verified_Account: A Boolean value indicating whether the user's Twitter account has been verified.
    • Followers: The number of followers the user has as of the time the tweet was scraped.
    • Following: The number of accounts the user is following as of the time the tweet was scraped.
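    A first text pre-processing pass over the Text column (the stated purpose of the dataset) can be sketched with the standard library alone; the two CSV rows and their values below are invented for illustration.

```python
import csv
import io
import re

# Two toy rows with a subset of the columns listed above (values invented).
raw = io.StringIO(
    "ID,Timestamp,User,Text,Hashtag,Retweets,Likes\n"
    '1,2021-05-01 10:00,@acme_jobs,"We are Hiring! Apply at https://example.com #JobVacancy",#JobVacancy,3,10\n'
    '2,2022-11-12 09:30,@dev_hr,"Job Vacancy: backend engineer #hiring",#hiring,1,4\n'
)

def preprocess(text):
    """Typical first pass: lowercase, drop URLs and hashtags, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"https?://\S+", "", text)   # remove links
    text = re.sub(r"#\w+", "", text)           # remove hashtags
    return re.sub(r"\s+", " ", text).strip()

rows = list(csv.DictReader(raw))
cleaned = [preprocess(r["Text"]) for r in rows]
print(cleaned[0])  # we are hiring! apply at
```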

  13. A Curated List of Image Deblurring Datasets

    • kaggle.com
    Updated Mar 28, 2023
    Cite
    Jishnu Parayil Shibu (2023). A Curated List of Image Deblurring Datasets [Dataset]. https://www.kaggle.com/datasets/jishnuparayilshibu/a-curated-list-of-image-deblurring-datasets
    Dataset updated
    Mar 28, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Jishnu Parayil Shibu
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Given a blurred image, image deblurring aims to produce a clear, high-quality image that accurately represents the original scene. Blurring can be caused by various factors such as camera shake, fast motion, or out-of-focus objects, making it a particularly challenging computer vision problem. This has led to the recent development of a large spectrum of deblurring models and unique datasets.

    Despite the rapid advancement in image deblurring, finding and pre-processing a number of datasets for training and testing purposes has been both time-consuming and unnecessarily complicated for experts and non-experts alike. Moreover, there is a serious lack of ready-to-use domain-specific datasets, such as face and text deblurring datasets.

    To this end, the following card contains a curated list of ready-to-use image deblurring datasets for training and testing various deblurring models. Additionally, we have created an extensive, highly customizable Python package for single image deblurring called DBlur that can be used to train and test various SOTA models on the given datasets with just 2-3 lines of code.

    Following is a list of the datasets that are currently provided:
    - GoPro: The GoPro dataset for deblurring consists of 3,214 blurred images with a size of 1,280×720, divided into 2,103 training images and 1,111 test images.
    - HIDE: HIDE is a motion-blurred dataset that includes 2,025 blurred images for testing. It mainly focuses on pedestrians and street scenes.
    - RealBlur: The RealBlur testing dataset consists of two subsets. The first is RealBlur-J, consisting of 1,900 camera JPEG outputs. The second is RealBlur-R, consisting of 1,900 RAW images. The RAW images are generated by using white balance, demosaicking, and denoising operations.
    - CelebA: A face deblurring dataset created using the CelebA dataset, which consists of 2,000,000 training images, 1,299 validation images, and 1,300 testing images. The blurred images were created using the blur kernels provided by Shen et al. 2018.
    - Helen: A face deblurring dataset created using the Helen dataset, which consists of 2,000 training images, 155 validation images, and 155 testing images. The blurred images were created using the blur kernels provided by Shen et al. 2018.
    - Wider-Face: A face deblurring dataset created using the Wider-Face dataset, which consists of 4,080 training images, 567 validation images, and 567 testing images. The blurred images were created using the blur kernels provided by Shen et al. 2018.
    - TextOCR: A text deblurring dataset created using the TextOCR dataset, which consists of 5,000 training images, 500 validation images, and 500 testing images. The blurred images were created using the blur kernels provided by Shen et al. 2018.

  14. Eye Image Dataset

    • kaggle.com
    Updated Apr 1, 2025
    Cite
    Sumit R Washimkar (2025). Eye Image Dataset [Dataset]. https://www.kaggle.com/datasets/sumit17125/eye-image-dataset
    Dataset updated
    Apr 1, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Sumit R Washimkar
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Right Eye Disease Classification Dataset

    Introduction

    This dataset consists of right eye images along with a CSV file containing image names and corresponding disease labels. It is designed for disease classification tasks using deep learning and computer vision techniques.

    Dataset Information

    • The dataset contains right eye images captured from various individuals.
    • The accompanying CSV file includes the image filename and the disease label.
    • Additional columns provide relevant metadata or medical attributes.

    CSV File Columns

    • Image Name: The filename of the corresponding right eye image.
    • Disease Labels:
      • N: Normal (No Disease)
      • D: Diabetic Retinopathy
      • G: Glaucoma
      • C: Cataract
      • A: Age-Related Macular Degeneration
      • H: Hypertensive Retinopathy
      • M: Myopia
      • O: Other Eye Diseases
    • Additional columns may include patient details (if available), image capture conditions, or severity levels.
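    Given the one-letter label codes above, mapping them to readable disease names is a one-dictionary job. The toy CSV rows and exact column names below are illustrative; the real file's headers may differ.

```python
import csv
import io

# Label codes exactly as listed in the dataset description.
LABELS = {
    "N": "Normal", "D": "Diabetic Retinopathy", "G": "Glaucoma",
    "C": "Cataract", "A": "Age-Related Macular Degeneration",
    "H": "Hypertensive Retinopathy", "M": "Myopia", "O": "Other Eye Diseases",
}

# Toy in-memory CSV; real column names are an assumption here.
raw = io.StringIO("Image Name,Disease Label\nimg_0001.jpg,D\nimg_0002.jpg,N\n")
rows = list(csv.DictReader(raw))
named = [(r["Image Name"], LABELS[r["Disease Label"]]) for r in rows]
print(named)  # [('img_0001.jpg', 'Diabetic Retinopathy'), ('img_0002.jpg', 'Normal')]
```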

    Possible Use Cases

    • Deep Learning for Medical Imaging: Training CNN models for automated disease classification.
    • Image Processing & Feature Extraction: Analyzing retinal features for disease detection.
    • Transfer Learning & Fine-Tuning: Using pre-trained models (e.g., ResNet, VGG) for improving classification performance.
    • Medical AI Research: Developing AI-driven solutions for ophthalmology.

    Acknowledgments

    This dataset is designed for medical AI research and educational purposes. Proper handling of medical data is advised.

  15. Bengali Digit Recognition in the Wild (BDRW)

    • kaggle.com
    Updated Aug 18, 2016
    Cite
    DebdootSheet (2016). Bengali Digit Recognition in the Wild (BDRW) [Dataset]. https://www.kaggle.com/debdoot/bdrw/discussion
    Dataset updated
    Aug 18, 2016
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    DebdootSheet
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Context: BDRW is a real-world image dataset for developing machine learning and vision algorithms, with minimal requirements for data pre-processing and formatting, to identify decimal digits written in Bengali script. It is similar in flavor to SVHN (e.g., the images are small cropped digits) but incorporates higher visual heterogeneity and comes from a significantly harder, unsolved, real-world problem: recognizing digits and numbers in natural scene images. BDRW is obtained from numbers appearing in photographs, printed materials, sign boards, wall writings, calendar or book pages, etc.

    File: BDRW_train.zip (contains BDRW_train_1.zip, BDRW_train_2.zip)

    The two zip files are to be used together; they contain a set of .jpg images of different sizes, cropped from photographs, magazine prints, wall-writing images, etc. Each image represents a digit of the decimal number system written in Bengali (https://en.wikipedia.org/wiki/Bengali_numerals). The file labels.xls contains the number represented in each image, which can be used as the ground-truth labels for training a learning-based system to recognize Bengali numbers.
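    A minimal sketch of pairing the cropped images with their ground-truth digits, assuming labels.xls has been exported to CSV with hypothetical columns filename and digit (the actual sheet layout may differ):

```python
import csv
import io

# Inline stand-in for an exported labels file; on the real data, open the
# CSV file instead of this StringIO buffer.
SAMPLE = "filename,digit\ndigit_1.jpg,7\ndigit_2.jpg,0\n"

# Build {image filename: digit} for use as training labels.
labels = {row["filename"]: int(row["digit"])
          for row in csv.DictReader(io.StringIO(SAMPLE))}
print(labels["digit_1.jpg"])  # 7
```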

    Inspiration: This dataset is released for a machine vision challenge hosted at IEEE TechSym 2016. The challenge will also include a testing set, containing samples not present in the training set released here, which will be released after the challenge closes.

  16. Twitter Tweets Sentiment Dataset

    • kaggle.com
    Updated Apr 8, 2022
    Cite
    M Yasser H (2022). Twitter Tweets Sentiment Dataset [Dataset]. https://www.kaggle.com/datasets/yasserh/twitter-tweets-sentiment-dataset
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Apr 8, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    M Yasser H
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description


    Description:

    Twitter is an online social media platform where people share their thoughts as tweets. Some users misuse it to post hateful content. Twitter is trying to tackle this problem, and we can help by building a strong NLP-based classifier model that distinguishes negative tweets so they can be blocked. Can you build a strong classifier model to predict the same?

    Each row contains the text of a tweet and a sentiment label. In the training set you are provided with a word or phrase drawn from the tweet (selected_text) that encapsulates the provided sentiment.

    Make sure, when parsing the CSV, to remove the beginning / ending quotes from the text field, to ensure that you don't include them in your training.

    You're attempting to predict the word or phrase from the tweet that exemplifies the provided sentiment. The word or phrase should include all characters within that span (i.e. including commas, spaces, etc.)

    Columns:

    1. textID - unique ID for each piece of text
    2. text - the text of the tweet
    3. sentiment - the general sentiment of the tweet
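    A small sketch of the quote-handling note above, using Python's csv module on a made-up row; any stray quotes still surrounding the text field after parsing are stripped:

```python
import csv
import io

# Inline stand-in for train.csv; the row content is made up. The text field
# carries literal quotes even after CSV unescaping.
SAMPLE = 'textID,text,sentiment\nabc123,"""I love this!""",positive\n'

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
# Strip the leftover beginning/ending quotes, per the note above.
clean = [r["text"].strip('"') for r in rows]
print(clean)  # ['I love this!']
```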

    Acknowledgement:

    The dataset is downloaded from the Kaggle competition:
    https://www.kaggle.com/c/tweet-sentiment-extraction/data?select=train.csv

    Objective:

    • Understand the Dataset & cleanup (if required).
    • Build classification models to predict Twitter sentiments.
    • Compare the evaluation metrics of various classification algorithms.
  17. Data from: Mushroom classification

    • kaggle.com
    Updated Feb 4, 2024
    Cite
    Mathieu DUVERNE (2024). Mushroom classification [Dataset]. https://www.kaggle.com/datasets/mathieuduverne/mushroom-classification
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Feb 4, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Mathieu DUVERNE
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset includes 8,857 images. Mushrooms are annotated in COCO format.

    The following pre-processing was applied to each image:
    • Auto-orientation of pixel data (with EXIF-orientation stripping)
    • Resize to 640x640 (Stretch)

    The following augmentation was applied to create 3 versions of each source image:
    • 50% probability of horizontal flip
    • 50% probability of vertical flip

    The structure:

    dataset-directory/
    ├─ README.dataset.txt
    ├─ README.roboflow.txt
    ├─ train
    │ ├─ train-image-1.jpg
    │ ├─ train-image-2.jpg
    │ ├─ ...
    │ └─ _annotations.coco.json
    ├─ test
    │ ├─ test-image-1.jpg
    │ ├─ test-image-2.jpg
    │ ├─ ...
    │ └─ _annotations.coco.json
    └─ valid
      ├─ valid-image-1.jpg
      ├─ valid-image-2.jpg
      ├─ ...
      └─ _annotations.coco.json
    

    To convert the annotations to YOLO format, use Roboflow.
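    For a quick look at the annotations without conversion tools, the COCO JSON can also be read directly. The sketch below uses an inline stand-in for _annotations.coco.json (on the real file, use json.load; the category name "mushroom" is an assumption) and shows the arithmetic from a COCO box [x, y, w, h] to a normalized YOLO box:

```python
# Inline stand-in for _annotations.coco.json.
coco = {
    "images": [{"id": 1, "file_name": "train-image-1.jpg",
                "width": 640, "height": 640}],
    "annotations": [{"image_id": 1, "category_id": 1,
                     "bbox": [100, 120, 50, 60]}],  # COCO: [x, y, w, h]
    "categories": [{"id": 1, "name": "mushroom"}],
}

# YOLO uses normalized (center-x, center-y, width, height).
images = {im["id"]: im for im in coco["images"]}
ann = coco["annotations"][0]
x, y, w, h = ann["bbox"]
W = images[ann["image_id"]]["width"]
H = images[ann["image_id"]]["height"]
yolo_box = ((x + w / 2) / W, (y + h / 2) / H, w / W, h / H)
print(yolo_box)  # (0.1953125, 0.234375, 0.078125, 0.09375)
```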

  18. UFO sightings since 1906

    • kaggle.com
    Updated Feb 24, 2025
    Cite
    Hassan-sv (2025). UFO sightings since 1906 [Dataset]. https://www.kaggle.com/datasets/hassansv/ufo-sightings-since-1906
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Feb 24, 2025
    Dataset provided by
    Kaggle
    Authors
    Hassan-sv
    License

    ODC Public Domain Dedication and Licence (PDDL) v1.0: http://www.opendatacommons.org/licenses/pddl/1.0/
    License information was derived automatically

    Description

    Overview of the Dataset

    The UFO sightings dataset contains records of UFO sightings reported globally since 1906. The dataset includes the following columns:

    datetime: The date and time of the sighting.
    
    day: The day of the week when the sighting occurred.
    
    city: The city where the sighting was reported.
    
    state: The state or region where the sighting occurred.
    
    country: The country where the sighting was reported.
    
    shape: The shape or form of the UFO observed.
    
    duration (seconds): The duration of the sighting in seconds.
    
    duration (hours/min): The duration of the sighting in hours and minutes.
    
    comments: Additional comments or descriptions provided by the witness.
    
    day_posted: The day the sighting was reported or posted.
    
    date posted: The date the sighting was reported or posted.
    
    latitude: The latitude coordinate of the sighting location.
    
    longitude: The longitude coordinate of the sighting location.
    
    days_count: The number of days between the sighting and when it was posted.
    
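    The days_count column can be recomputed from the sighting date and posting date; a minimal sketch with made-up values:

```python
from datetime import date

# Made-up example: a sighting on July 4, 2004, posted on August 1, 2004.
sighting = date(2004, 7, 4)
posted = date(2004, 8, 1)

# days_count = days between the sighting and when it was posted.
days_count = (posted - sighting).days
print(days_count)  # 28
```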

    Analysis Process

    Data Cleaning and Preparation (Excel):
    
      Removed duplicate entries and handled missing values.
    
      Standardized formats for dates, times, and categorical variables (e.g., shapes, countries).
    
      Calculated additional metrics such as days_count (time between sighting and posting).
    
    Exploratory Data Analysis (SQL):
    
      Aggregated data to analyze trends, such as the number of sightings per country, state, or city.
    
      Calculated average durations of sightings by UFO shape.
    
      Identified the most common UFO shapes and their distribution across countries.
    
      Analyzed temporal trends, such as sightings per day or over time.
    
    Visualization (Tableau):
    
      Created interactive dashboards to visualize key insights.
    
      Developed charts such as:
    
        Average Duration of Sightings by Shape: Highlighting which UFO shapes were observed for the longest durations.
    
        UFO Shapes by Country: Showing the distribution of UFO shapes across different countries.
    
        UFO Shapes Total: A global overview of the most commonly reported UFO shapes.
    
        UFO Sightings in All Countries: A map or bar chart showing the number of sightings per country.
    
        UFO Sightings per Day: A time series analysis of sightings over days.
    
        UFO Sightings in the USA: A focused analysis of sightings in the United States, broken down by state or city.
    

    Key Insights and Conclusions

    Most Common UFO Shapes:
    
      The most frequently reported UFO shapes include lights, circles, and triangles.
    
      These shapes are consistent across multiple countries, suggesting common patterns in UFO sightings.
    
    Geographical Distribution:
    
      The United States has the highest number of reported UFO sightings, followed by Canada and the United Kingdom.
    
      Within the U.S., states like California, Florida, and Texas report the most sightings.
    
    Temporal Trends:
    
      Sightings have increased significantly since the mid-20th century, with a peak in the 2000s.
    
      Certain days of the week (e.g., weekends) show higher reporting rates, possibly due to increased outdoor activity.
    
    Duration of Sightings:
    
      The average duration of sightings varies by shape. For example, cigar-shaped UFOs tend to be observed for longer periods compared to light or disk shapes.
    
      Most sightings last less than a minute, but some reports describe durations of several hours.
    
    Reporting Delays:
    
      The days_count column reveals that many sightings are reported weeks or even months after they occur, indicating potential delays in witness reporting or data collection.
    
    Global Patterns:
    
      While the U.S. dominates the dataset, other countries show unique patterns in terms of UFO shapes and sighting frequencies.
    
      For example, Australia and Germany report a higher proportion of triangular UFOs compared to other shapes.
    

    Recommendations for Further Analysis

    Geospatial Analysis: Use latitude and longitude data to create heatmaps of sightings and identify potential hotspots.
    
    Text Analysis: Analyze the comments column using natural language processing (NLP) to extract common themes or keywords.
    
    Correlation with External Data: Investigate whether UFO sightings correlate with astronomical events, military activity, or other phenomena.
    
    Machine Learning: Build predictive models to identify patterns or classify sightings based on shape, duration, or location.
    
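    The geospatial recommendation above can be started with nothing more than 1-degree binning of the latitude/longitude columns; a sketch with made-up coordinates:

```python
import math
from collections import Counter

# Made-up (latitude, longitude) pairs standing in for the dataset's columns.
sightings = [(34.05, -118.24), (34.9, -118.7), (40.71, -74.01)]

# Bin into 1-degree cells; math.floor keeps negative longitudes in the
# correct cell (int() would truncate toward zero).
cells = Counter((math.floor(lat), math.floor(lon)) for lat, lon in sightings)
print(cells.most_common(1))  # [((34, -119), 2)]
```

    The per-cell counts feed directly into a heatmap layer in Tableau or any plotting library.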

    Conclusion

    The UFO sightings dataset provides a fascinating glimpse into global reports of unidentified flying objects. Through careful analysis, I identified key trends in UFO shapes, durations, and geographical distribution. The United States emerges as the epicenter of UFO sightings, with lights and ...

  19. Discursos presidentes civis do Brasil - 1985-2022

    • kaggle.com
    Updated Aug 20, 2024
    Cite
    Pascoal Gonçalves (2024). Discursos presidentes civis do Brasil - 1985-2022 [Dataset]. http://doi.org/10.34740/kaggle/ds/5566939
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Aug 20, 2024
    Dataset provided by
    Kaggle
    Authors
    Pascoal Gonçalves
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Essa base de dados foi organizada a partir do repositório da presidência da república e compreende a totalidade dos discursos dos presidentes brasileiros após a redemocratização do país, em 1985, até o fim de 2022. Todos os discursos estão disponibilizados em formato .txt e sua nomeação corresponde a sua datação em formato americano. Também podem ser encontrados os códigos que foram usados nas etapas de pré e pós processamento, disponíveis em formato .py e documentações que desmembram os códigos. O projeto contou com a apoio financeiro do Programa de Iniciação Teológica da UFPB.

    This database was organized from the repository of the Presidency of the Republic and includes all the speeches made by Brazilian presidents after the country's re-democratization in 1985 until the end of 2022. All the speeches are available in .txt format and their naming corresponds to their dating in American format. You can also find the codes that were used in the pre- and post-processing stages, available in .py format, and documentation that breaks down the codes. The project received financial support from the UFPB Theological Initiation Program.

    Autoria (authorship):

    Pascoal Teófilo Carvalho Gonçalves https://orcid.org/0000-0002-1336-3148 http://lattes.cnpq.br/8913105744643795

    Carla Suzana Gomes Meira https://orcid.org/0009-0003-9924-1401

    Romberg de Sá Gondim https://orcid.org/0000-0002-8857-9795

  20. NIH Chest X ray 14 (224x224 resized)

    • kaggle.com
    zip
    Updated Jul 8, 2020
    Cite
    Khan Fashee Monowar (Sawrup) (2020). NIH Chest X ray 14 (224x224 resized) [Dataset]. https://www.kaggle.com/khanfashee/nih-chest-x-ray-14-224x224-resized
    Explore at:
    zip (2,468,882,507 bytes)
    Available download formats
    Dataset updated
    Jul 8, 2020
    Authors
    Khan Fashee Monowar (Sawrup)
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    National Institutes of Health Chest X-Ray Dataset

    Chest X-ray exams are one of the most frequent and cost-effective medical imaging examinations available. However, clinical diagnosis of a chest X-ray can be challenging and sometimes more difficult than diagnosis via chest CT imaging. The lack of large publicly available datasets with annotations means it is still very difficult, if not impossible, to achieve clinically relevant computer-aided detection and diagnosis (CAD) in real-world medical sites with chest X-rays. One major hurdle in creating large X-ray image datasets is the lack of resources for labeling so many images. Prior to the release of this dataset, Open-i was the largest publicly available source of chest X-ray images, with 4,143 images available.

    This NIH Chest X-ray Dataset is comprised of 112,120 X-ray images with disease labels from 30,805 unique patients. To create these labels, the authors used Natural Language Processing to text-mine disease classifications from the associated radiological reports. The labels are expected to be >90% accurate and suitable for weakly-supervised learning. The original radiology reports are not publicly available but you can find more details on the labeling process in this Open Access paper: "ChestX-ray8: Hospital-scale Chest X-ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases." (Wang et al.)

    Data limitations:

    The image labels are NLP extracted so there could be some erroneous labels but the NLP labeling accuracy is estimated to be >90%.
    Very limited numbers of disease region bounding boxes (See BBoxlist2017.csv)
    Chest X-ray radiology reports are not anticipated to be publicly shared. Parties who use this public dataset are encouraged to share their "updated" image labels and/or new bounding boxes in their own studies later, perhaps through manual annotation.
    

    File contents

    Image format: 112,120 total images with size 1024 x 1024
    
    images_001.zip: Contains 4999 images
    
    images_002.zip: Contains 10,000 images
    
    images_003.zip: Contains 10,000 images
    
    images_004.zip: Contains 10,000 images
    
    images_005.zip: Contains 10,000 images
    
    images_006.zip: Contains 10,000 images
    
    images_007.zip: Contains 10,000 images
    
    images_008.zip: Contains 10,000 images
    
    images_009.zip: Contains 10,000 images
    
    images_010.zip: Contains 10,000 images
    
    images_011.zip: Contains 10,000 images
    
    images_012.zip: Contains 7,121 images
    
    README_ChestXray.pdf: Original README file
    
    BBoxlist2017.csv: Bounding box coordinates. Note: Start at x,y, extend horizontally w pixels, and vertically h pixels
      Image Index: File name
      Finding Label: Disease type (Class label)
      Bbox x
      Bbox y
      Bbox w
      Bbox h
    
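    The bounding-box convention above (start at x,y, extend w pixels horizontally and h pixels vertically) translates to corner coordinates as follows; the values are made up:

```python
# Made-up BBoxlist2017.csv values: x, y, w, h as described above.
x, y, w, h = 225.0, 547.0, 86.0, 79.0

# Top-left corner is (x, y); bottom-right corner is (x + w, y + h).
corners = (x, y, x + w, y + h)  # (x_min, y_min, x_max, y_max)
print(corners)  # (225.0, 547.0, 311.0, 626.0)
```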
    Dataentry2017.csv: Class labels and patient data for the entire dataset
      Image Index: File name
      Finding Labels: Disease type (Class label)
      Follow-up #
      Patient ID
      Patient Age
      Patient Gender
      View Position: X-ray orientation
      OriginalImageWidth
      OriginalImageHeight
      OriginalImagePixelSpacing_x
      OriginalImagePixelSpacing_y
    
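    In the original NIH metadata, the Finding Labels field holds one or more classes separated by "|" (e.g. "Cardiomegaly|Effusion"). A minimal sketch of splitting it into a label list; the sample row is illustrative:

```python
# Illustrative row from Dataentry2017.csv; images with multiple diseases
# carry all labels joined by "|".
row = {"Image Index": "00000001_000.png",
       "Finding Labels": "Cardiomegaly|Effusion"}

findings = row["Finding Labels"].split("|")
is_healthy = findings == ["No Finding"]
print(findings, is_healthy)  # ['Cardiomegaly', 'Effusion'] False
```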

    Class descriptions

    There are 15 classes (14 diseases, and one for "No findings"). Images can be classified as "No findings" or one or more disease classes:

    Atelectasis
    Consolidation
    Infiltration
    Pneumothorax
    Edema
    Emphysema
    Fibrosis
    Effusion
    Pneumonia
    Pleural_thickening
    Cardiomegaly
    Nodule
    Mass
    Hernia
    

    Full Dataset Content

    There are 12 zip files in total, ranging from ~2 GB to 4 GB in size. Additionally, we randomly sampled 5% of these images and created a smaller dataset for use in Kernels. The random sample contains 5,606 X-ray images and class labels.

    Sample: sample.zip
    

    Modifications to original data

    Original TAR archives were converted to ZIP archives to be compatible with the Kaggle platform
    
    CSV headers slightly modified to be more explicit in comma separation and also to allow fields to be self-explanatory
    

    Citations

    Wang X, Peng Y, Lu L, Lu Z, Bagheri M, Summers RM. ChestX-ray8: Hospital-scale Chest X-ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases. IEEE CVPR 2017.
    
    NIH News release: NIH Clinical Center provides one of the largest publicly available chest x-ray datasets to scientific community
    
    Original source files and documents: https://nihcc.app.box.com/v/ChestXray-NIHCC/folder/36938765345
    