29 datasets found
  1. Image Pre-processing for Model Training

    • kaggle.com
    Updated Apr 23, 2021
    Cite
    Habib Mrad (2021). Image Pre-processing for Model Training [Dataset]. https://www.kaggle.com/datasets/habibmrad1983/image-preprocessing-for-model-training
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 23, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Habib Mrad
    Description

    Dataset

    This dataset was created by Habib Mrad


  2. bert baseline pre and post process

    • kaggle.com
    Updated Feb 5, 2020
    Cite
    prvi (2020). bert baseline pre and post process [Dataset]. https://www.kaggle.com/datasets/prokaj/bert-baseline-pre-and-post-process
    Dataset updated
    Feb 5, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    prvi
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset

    This dataset was created by prvi

    Released under CC0: Public Domain


  3. The CloudCast Dataset (small)

    • kaggle.com
    Updated Oct 22, 2021
    + more versions
    Cite
    Christian Lillelund (2021). The CloudCast Dataset (small) [Dataset]. https://www.kaggle.com/datasets/christianlillelund/the-cloudcast-dataset-small
    Dataset updated
    Oct 22, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Christian Lillelund
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Example observations image: https://vision.eng.au.dk/wp-content/uploads/2020/07/example_obs-1024x206-1024x206.jpg

    CloudCast: A large-scale dataset and baseline for forecasting clouds

    The CloudCast dataset contains 70,080 cloud-labeled satellite images with 10 different cloud types corresponding to multiple layers of the atmosphere. The raw satellite images come from a satellite constellation in geostationary orbit centred at zero degrees longitude and arrive in 15-minute intervals from the European Organisation for the Exploitation of Meteorological Satellites (EUMETSAT). The resolution of these images is 3712 × 3712 pixels for the full disk of Earth, which implies that every pixel corresponds to an area of 3 × 3 km. This is the highest possible resolution from European geostationary satellites when including infrared channels. EUMETSAT also performs some pre- and post-processing of the raw satellite images before they are exposed to the public, such as removing airplanes. We collect all the raw multispectral satellite images and annotate them individually on a pixel level using a segmentation algorithm. The full dataset then has a spatial resolution of 928 × 1530 pixels recorded at 15-minute intervals for the period 2017-2018, where each pixel represents an area of 3 × 3 km. To enable standardized datasets for benchmarking computer vision methods, this release includes both the full-resolution grayscale dataset and a centered, projected dataset over Europe (128 × 128).

    License

    This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

    Citation

    If you use this dataset in your research or elsewhere, please cite/reference the following paper: CloudCast: A Satellite-Based Dataset and Baseline for Forecasting Clouds

    Data dictionary

    There are 24 folders in the dataset containing the following information:

    | File | Definition | Note |
    | --- | --- | --- |
    | X.npy | Numpy encoded array containing the actual 128x128 image with pixel values as labels, see below. | |
    | GEO.npz | Numpy array containing geo coordinates where the image was taken (latitude and longitude). | |
    | TIMESTAMPS.npy | Numpy array containing timestamps for each captured image. | Images are captured in 15-minute intervals. |
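    Given these file conventions, one folder's arrays load directly with NumPy. The snippet below is a self-contained sketch: it writes a stand-in folder to a temp directory first, since the real dataset path is not known here.

```python
import os
import tempfile

import numpy as np

# Hypothetical stand-in for one of the 24 dataset folders: the real
# X.npy holds 128x128 images whose pixel values are the cloud-type
# labels 0-10, and TIMESTAMPS.npy holds the 15-minute capture times.
folder = tempfile.mkdtemp()
np.save(os.path.join(folder, "X.npy"),
        np.random.randint(0, 11, (128, 128)).astype(np.uint8))
np.save(os.path.join(folder, "TIMESTAMPS.npy"),
        np.array(["2017-01-01T00:00", "2017-01-01T00:15"], dtype="datetime64[m]"))

X = np.load(os.path.join(folder, "X.npy"))
timestamps = np.load(os.path.join(folder, "TIMESTAMPS.npy"))

print(X.shape)                        # (128, 128)
print(timestamps[1] - timestamps[0])  # 15 minutes
```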

    Cloud types

    0 = No clouds or missing data
    1 = Very low clouds
    2 = Low clouds
    3 = Mid-level clouds
    4 = High opaque clouds
    5 = Very high opaque clouds
    6 = Fractional clouds
    7 = High semitransparent thin clouds
    8 = High semitransparent moderately thick clouds
    9 = High semitransparent thick clouds
    10 = High semitransparent above low or medium clouds

    Examples

    https://i.ibb.co/NFv55QW/cloudcast4.png
    https://i.ibb.co/3FhHzMT/cloudcast3.png
    https://i.ibb.co/9wCsJhR/cloudcast2.png
    https://i.ibb.co/9T5dbSH/cloudcast1.png

  4. Pistachio Dataset

    • kaggle.com
    Updated Apr 3, 2022
    + more versions
    Cite
    Murat KOKLU (2022). Pistachio Dataset [Dataset]. https://www.kaggle.com/datasets/muratkokludataset/pistachio-dataset/code
    Dataset updated
    Apr 3, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Murat KOKLU
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Pistachio Image Dataset https://www.kaggle.com/datasets/muratkokludataset/pistachio-image-dataset

    DATASET: https://www.muratkoklu.com/datasets/

    Citation Request :

    1. OZKAN IA., KOKLU M. and SARACOGLU R. (2021). Classification of Pistachio Species Using Improved K-NN Classifier. Progress in Nutrition, Vol. 23, N. 2. DOI:10.23751/pn.v23i2.9686. (Open Access) https://www.mattioli1885journals.com/index.php/progressinnutrition/article/view/9686/9178

    2. SINGH D, TASPINAR YS, KURSUN R, CINAR I, KOKLU M, OZKAN IA, LEE H-N., (2022). Classification and Analysis of Pistachio Species with Pre-Trained Deep Learning Models, Electronics, 11 (7), 981. https://doi.org/10.3390/electronics11070981. (Open Access)

    Article Download (PDF):
    1: https://www.mattioli1885journals.com/index.php/progressinnutrition/article/view/9686/9178
    2: https://doi.org/10.3390/electronics11070981

    ABSTRACT: In order to keep the economic value of pistachio nuts, which have an important place in the agricultural economy, the efficiency of post-harvest industrial processes is very important. To provide this efficiency, new methods and technologies are needed for the separation and classification of pistachios. Different pistachio species address different markets, which increases the need for the classification of pistachio species. In this study, the aim is to develop a classification model different from traditional separation methods, based on image processing and artificial intelligence, capable of providing the required classification. A computer vision system has been developed to distinguish two different species of pistachios with different characteristics that address different market types. 2148 sample images of these two kinds of pistachios were taken with a high-resolution camera. Image processing techniques, segmentation, and feature extraction were applied to the obtained images of the pistachio samples. A pistachio dataset with sixteen attributes was created. An advanced classifier based on the k-NN method, a simple and successful classifier, and principal component analysis was designed on the obtained dataset. In this study, a multi-level system including feature extraction, dimension reduction, and dimension weighting stages has been proposed. Experimental results showed that the proposed approach achieved a classification success of 94.18%. The presented high-performance classification model serves an important need for the separation of pistachio species and increases the economic value of the species. In addition, the developed model is important in terms of its application to similar studies. Keywords: Classification, Image processing, k nearest neighbor classifier, Pistachio species

    2. SINGH D, TASPINAR YS, KURSUN R, CINAR I, KOKLU M, OZKAN IA, LEE H-N., (2022). Classification and Analysis of Pistachio Species with Pre-Trained Deep Learning Models, Electronics, 11 (7), 981. https://doi.org/10.3390/electronics11070981. (Open Access)

    ABSTRACT: Pistachio is a shelled fruit from the anacardiaceae family. The homeland of pistachio is the Middle East. The Kirmizi pistachios and Siirt pistachios are the major types grown and exported in Turkey. Since the prices, tastes, and nutritional values of these types differ, the type of pistachio becomes important when it comes to trade. This study aims to identify these two types of pistachios, which are frequently grown in Turkey, by classifying them via convolutional neural networks. Within the scope of the study, images of Kirmizi and Siirt pistachio types were obtained through the computer vision system. The pre-trained dataset includes a total of 2148 images, 1232 of Kirmizi type and 916 of Siirt type. Three different convolutional neural network models were used to classify these images. Models were trained using the transfer learning method, with AlexNet and the pre-trained models VGG16 and VGG19. The dataset is divided into 80% training and 20% test. As a result of the performed classifications, the success rates obtained from the AlexNet, VGG16, and VGG19 models are 94.42%, 98.84%, and 98.14%, respectively. Models' performances were evaluated through sensitivity, specificity, precision, and F-1 score metrics. In addition, ROC curves and AUC values were used in the performance evaluation. The highest classification success was achieved with the VGG16 model. The obtained results reveal that these methods can be used successfully in the determination of pistachio types. Keywords: pistachio; genetic varieties; machine learning; deep learning; food recognition

    https://www.muratkoklu.com/datasets/

  5. Alpaca Dataset Image Classification Dataset

    • paperswithcode.com
    • gts.ai
    Updated Jun 26, 2025
    + more versions
    Cite
    (2025). Alpaca Dataset Image Classification Dataset [Dataset]. https://paperswithcode.com/dataset/alpaca-dataset-image-classification
    Dataset updated
    Jun 26, 2025
    Description


    The Alpaca Dataset is a collection of JPEG images designed for binary image classification tasks, specifically classifying images as “Alpaca” or “Not Alpaca”. This dataset is ideal for training and fine-tuning machine learning models using transfer learning techniques.


    Context

    This small dataset is perfect for educational purposes, initial model testing, and developing proof-of-concept applications in image classification. Due to its limited size, it is most beneficial when used in conjunction with transfer learning to leverage pre-trained models for improved accuracy.
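    As a toy illustration of the transfer-learning idea described above (a frozen pre-trained feature extractor plus a small trainable head), here is a self-contained NumPy sketch. The random-projection "extractor" and the synthetic alpaca/not-alpaca images are stand-ins for a real pre-trained CNN and the JPEG data; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pre-trained backbone: its weights stay frozen, and
# only the small logistic-regression head below is trained.
W_frozen = rng.normal(0.0, 1.0 / 16, (256, 32))

def extract_features(images):
    """images: (n, 256) flattened 16x16 'pixels'."""
    return np.tanh(images @ W_frozen)

# Synthetic binary data: 'alpaca' images are darker on average.
X = np.vstack([rng.normal(0.2, 0.1, (50, 256)),
               rng.normal(0.8, 0.1, (50, 256))])
y = np.array([0] * 50 + [1] * 50)

F = extract_features(X)
w, b = np.zeros(F.shape[1]), 0.0
for _ in range(300):                  # gradient steps on the head only
    p = 1.0 / (1.0 + np.exp(-(F @ w + b)))
    grad = p - y
    w -= 0.1 * F.T @ grad / len(y)
    b -= 0.1 * grad.mean()

pred = (1.0 / (1.0 + np.exp(-(F @ w + b))) > 0.5).astype(int)
accuracy = (pred == y).mean()
print(f"training accuracy: {accuracy:.2f}")
```

    With the real images you would swap the random projection for, e.g., the penultimate layer of a pre-trained CNN, which is exactly why a small dataset like this one still trains well.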

    Content

    The dataset is organized into two primary directories:

    Alpaca: Contains images that include alpacas.

    Not Alpaca: Contains images without alpacas, featuring subjects that may resemble alpacas but are not.

    Additional Information

    Format: All images are in JPEG format, ensuring compatibility with a wide range of image processing libraries and tools.

    Usage: This dataset can be utilized in various machine learning frameworks such as TensorFlow, PyTorch, and Keras for building and testing classification models.

    Applications: Potential applications include animal recognition systems, educational tools, and development of AI-driven content moderation systems.

    Data Statistics

    Total Images: X (Number of images in the dataset)

    Alpaca Images: Y (Number of images in the Alpaca directory)

    Not Alpaca Images: Z (Number of images in the Not Alpaca directory)

    Image Resolution: Varies, with most images having a resolution suitable for quick model training and evaluation.

    This dataset is sourced from Kaggle.

  6. Bangla Natural Language Image to Text (BNLIT)

    • data.mendeley.com
    • dataverse.harvard.edu
    • +2more
    Updated Feb 15, 2020
    + more versions
    Cite
    Md. Asifuzzaman Jishan (2020). Bangla Natural Language Image to Text (BNLIT) [Dataset]. http://doi.org/10.17632/ws3r82gnm8.4
    Dataset updated
    Feb 15, 2020
    Authors
    Md. Asifuzzaman Jishan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We present a new Bangla dataset together with a Hybrid Recurrent Neural Network model that generates Bangla natural-language descriptions of images. The dataset consists of a large number of classified images paired with natural-language descriptions. We conducted experiments on our self-made Bangla Natural Language Image to Text (BNLIT) dataset, which contains 8,743 images. We made this dataset using Bangladesh-perspective images, with one annotation per image. In our repository, we added two types of pre-processed data, at 224 × 224 and 500 × 375 resolution respectively, alongside annotations for the full dataset. We also added a CNN features file for the whole dataset, features.pkl.
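    The features.pkl file can presumably be read with Python's pickle module. The snippet below round-trips a stand-in object in memory; the file's internal layout (assumed here to be a mapping from image names to feature vectors) is an assumption, not documented by the source.

```python
import io
import pickle

import numpy as np

# Stand-in for features.pkl: assumed to map image names to CNN
# feature vectors (the real file's layout may differ).
features = {"img_0001.jpg": np.zeros(4096, dtype=np.float32)}

buf = io.BytesIO()
pickle.dump(features, buf)           # how such a file would be written
buf.seek(0)
loaded = pickle.load(buf)            # loading it back

print(len(loaded), loaded["img_0001.jpg"].shape)  # 1 (4096,)
```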

  7. Fruit Image Dataset: 22 Classes

    • kaggle.com
    Updated Oct 8, 2023
    Cite
    Md. Sagor Ahmed (2023). Fruit Image Dataset: 22 Classes [Dataset]. https://www.kaggle.com/datasets/mdsagorahmed/fruit-image-dataset-22-classes/versions/1
    Dataset updated
    Oct 8, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Md. Sagor Ahmed
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Welcome to the Fruit Image Dataset on Kaggle! This dataset contains over 8,700 uncleaned images belonging to **22 different classes**: 11 ripe and 11 unripe fruits. This diverse collection of images is a valuable resource for anyone interested in image processing and computer vision tasks, particularly image classification projects.

    Whether you're a beginner looking to start your journey in computer vision or an experienced data scientist working on a low-configuration PC, this dataset offers a wide range of possibilities. You can use these images for:

    • Image Classification: Train machine learning models to accurately classify fruits as ripe or unripe.
    • Object Detection: Build object detection models to identify and locate fruits in images.
    • Image Enhancement: Apply image preprocessing techniques to clean and enhance the dataset for improved model training.
    • Transfer Learning: Leverage pre-trained models to fine-tune and optimize fruit classification tasks.

    Feel free to download this dataset from my Kaggle account and explore the world of fruit image analysis. Don't forget to share your findings and contributions with the Kaggle community. Happy coding!

  8. SoyNet: Indian Soybean Image dataset with quality images captured from the agriculture field (healthy and disease Images)

    • data.mendeley.com
    Updated Jun 2, 2023
    + more versions
    Cite
    Arpan Singh Rajput (2023). SoyNet: Indian Soybean Image dataset with quality images captured from the agriculture field ( healthy and disease Images) [Dataset]. http://doi.org/10.17632/w2r855hpx8.2
    Dataset updated
    Jun 2, 2023
    Authors
    Arpan Singh Rajput
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    High-quality images of soybean leaves are required to solve soybean disease and healthy-leaf classification and recognition problems. A neat and clean dataset is an elementary requirement for building machine learning and deep learning models in research. With this objective, this dataset, named “SoyNet”, was created; it consists of healthy and disease-quality images of soybean. It contains 9000+ high-quality images of soybeans (healthy and diseased) taken from different angles and captured directly in the soybean agriculture field, so that real research problems can be analyzed. The images are divided into 2 sub-folders: 1) Raw SoyNet Data and 2) Pre-processing SoyNet Data. The raw folder contains 1) Digital Camera Click, with healthy and disease image folders, and 2) Mobile Phone Click, with disease images. The Pre-processing SoyNet Data folder contains 256×256 resized images and grayscale images, organized into disease and healthy data in the same manner. A digital camera and a mobile phone with a high-resolution camera were used to capture the images, which were taken in the soybean cultivation field under different lighting conditions and backgrounds. The proposed dataset can be used for training, testing, and validation of soybean classification or recognition models.
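    The two pre-processing steps the folders reflect (256×256 resizing and grayscale conversion) can be sketched in plain NumPy; a synthetic RGB array stands in for a real soybean-leaf photo, and nearest-neighbour resizing is used only to keep the sketch dependency-free.

```python
import numpy as np

rng = np.random.default_rng(1)
img = rng.integers(0, 256, (512, 1024, 3), dtype=np.uint8)  # stand-in H x W x RGB photo

def resize_nearest(a, h, w):
    """Nearest-neighbour resize; enough for a preprocessing sketch."""
    rows = np.arange(h) * a.shape[0] // h
    cols = np.arange(w) * a.shape[1] // w
    return a[rows][:, cols]

def to_grayscale(a):
    """Weighted RGB-to-luminance conversion (ITU-R BT.601 weights)."""
    return (a @ np.array([0.299, 0.587, 0.114])).astype(np.uint8)

resized = resize_nearest(img, 256, 256)
gray = to_grayscale(resized)
print(resized.shape, gray.shape)   # (256, 256, 3) (256, 256)
```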

  9. Data from: MIMIC-CXR-JPG - chest radiographs with structured labels

    • physionet.org
    Updated Mar 12, 2024
    Cite
    Alistair Johnson; Matthew Lungren; Yifan Peng; Zhiyong Lu; Roger Mark; Seth Berkowitz; Steven Horng (2024). MIMIC-CXR-JPG - chest radiographs with structured labels [Dataset]. http://doi.org/10.13026/jsn5-t979
    Dataset updated
    Mar 12, 2024
    Authors
    Alistair Johnson; Matthew Lungren; Yifan Peng; Zhiyong Lu; Roger Mark; Seth Berkowitz; Steven Horng
    License

    https://github.com/MIT-LCP/license-and-dua/tree/master/drafts

    Description

    The MIMIC Chest X-ray JPG (MIMIC-CXR-JPG) Database v2.0.0 is a large publicly available dataset of chest radiographs in JPG format with structured labels derived from free-text radiology reports. The MIMIC-CXR-JPG dataset is wholly derived from MIMIC-CXR, providing JPG format files derived from the DICOM images and structured labels derived from the free-text reports. The aim of MIMIC-CXR-JPG is to provide a convenient processed version of MIMIC-CXR, as well as to provide a standard reference for data splits and image labels. The dataset contains 377,110 JPG format images and structured labels derived from the 227,827 free-text radiology reports associated with these images. The dataset is de-identified to satisfy the US Health Insurance Portability and Accountability Act of 1996 (HIPAA) Safe Harbor requirements. Protected health information (PHI) has been removed. The dataset is intended to support a wide body of research in medicine including image understanding, natural language processing, and decision support.

  10. Raisin

    • kaggle.com
    Updated Sep 17, 2022
    Cite
    basharath ali (2022). Raisin [Dataset]. https://www.kaggle.com/datasets/basharath123/raisin
    Dataset updated
    Sep 17, 2022
    Dataset provided by
    Kaggle
    Authors
    basharath ali
    Description

    The Government of India is studying two varieties of raisins. These varieties are of great value and are thus important. Research is being done, and images of both varieties are obtained with a computer vision system (CVS). The images are subjected to various stages of pre-processing, and 7 morphological features are extracted.

  11. Sign Language Gesture Images Dataset

    • kaggle.com
    zip
    Updated Sep 10, 2019
    Cite
    Ahmed Khan (2019). Sign Language Gesture Images Dataset [Dataset]. https://www.kaggle.com/datasets/ahmedkhanak1995/sign-language-gesture-images-dataset
    Explore at:
    zip (199984313 bytes)
    Dataset updated
    Sep 10, 2019
    Authors
    Ahmed Khan
    License

    https://ec.europa.eu/info/legal-notice_en

    Description

    Context

    Sign language is a communication language, just like any other language, used among the deaf community. This dataset is a complete set of the gestures used in sign language; it can also help hearing people better understand sign language gestures.

    Content

    The dataset consists of 37 different hand-sign gestures: the A-Z alphabet gestures, the 0-9 number gestures, and a gesture for space, i.e., how deaf people represent the space between two letters or two words while communicating. The dataset has two parts, i.e., two folders. (1) Gesture Image Data consists of colored images of the hands for the different gestures. Each gesture image is of size 50×50 and sits in its specified folder: the A-Z folders contain the A-Z gesture images, the 0-9 folders contain the 0-9 gesture images, and the '_' folder contains images of the gesture for space. Each gesture has 1,500 images, so with 37 gestures altogether there are 55,500 images in the first folder. (2) Gesture Image Pre-Processed Data has the same folder structure and the same number of images, 55,500; the difference is that these images are threshold binary converted images for training and testing purposes. A Convolutional Neural Network is well suited to this dataset for model training and gesture prediction.
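    The "threshold binary converted" images in the second folder can be reproduced from grayscale versions of the first folder's images with a fixed threshold. A minimal NumPy sketch (the 50×50 size follows the description; the threshold value is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
gesture = rng.integers(0, 256, (50, 50), dtype=np.uint8)  # stand-in 50x50 grayscale gesture

threshold = 127                                           # illustrative cut-off
binary = np.where(gesture > threshold, 255, 0).astype(np.uint8)

print(binary.shape, np.unique(binary))                    # only 0 and 255 survive
```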

    Acknowledgements

    I wouldn't be here without the help of others. This dataset was created with the help of references from work done on sign language in data science and from work done on image processing.

  12. Job Vacancy Tweets

    • kaggle.com
    Updated Apr 10, 2023
    Cite
    Prasad Patil (2023). Job Vacancy Tweets [Dataset]. https://www.kaggle.com/datasets/prasad22/job-vacancy-tweets/code
    Dataset updated
    Apr 10, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Prasad Patil
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset contains 50,000 tweets related to job vacancies and hiring, extracted using the keywords 'Job Vacancy', 'We are Hiring', and 'We're Hiring'. The tweets were collected between January 1, 2019, and April 10, 2023, with the help of the Python snscrape library, and are provided in CSV format.

    The purpose behind this dataset

    • To explore text pre-processing and test NLP skills
    • To draw interesting insights on the job market from job postings
    • To analyse company/role requirements where possible

    The dataset includes the following information for each tweet:

    • ID: The unique identifier for the tweet.
    • Timestamp: The date and time when the tweet was posted.
    • User: The Twitter handle of the user who posted the tweet.
    • Text: The content of the tweet.
    • Hashtag: The hashtags included in the tweet, if any.
    • Retweets: The number of times the tweet has been retweeted as of the time it was scraped.
    • Likes: The number of likes the tweet has received as of the time it was scraped.
    • Replies: The number of replies to the tweet as of the time it was scraped.
    • Source: The source application or device used to post the tweet.
    • Location: The location listed on the user's Twitter profile, if any.
    • Verified_Account: A Boolean value indicating whether the user's Twitter account has been verified.
    • Followers: The number of followers the user has as of the time the tweet was scraped.
    • Following: The number of accounts the user is following as of the time the tweet was scraped.
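    A first text pre-processing pass over the Text column (the stated purpose of the dataset) can be sketched with the standard library alone; the two CSV rows and their values below are invented for illustration.

```python
import csv
import io
import re

# Two toy rows with a subset of the columns listed above (values invented).
raw = io.StringIO(
    "ID,Timestamp,User,Text,Hashtag,Retweets,Likes\n"
    '1,2021-05-01 10:00,@acme_jobs,"We are Hiring! Apply at https://example.com #JobVacancy",#JobVacancy,3,10\n'
    '2,2022-11-12 09:30,@dev_hr,"Job Vacancy: backend engineer #hiring",#hiring,1,4\n'
)

def preprocess(text):
    """Typical first pass: lowercase, drop URLs and hashtags, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"https?://\S+", "", text)   # remove links
    text = re.sub(r"#\w+", "", text)           # remove hashtags
    return re.sub(r"\s+", " ", text).strip()

rows = list(csv.DictReader(raw))
cleaned = [preprocess(r["Text"]) for r in rows]
print(cleaned[0])  # we are hiring! apply at
```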

  13. A Curated List of Image Deblurring Datasets

    • kaggle.com
    Updated Mar 28, 2023
    Cite
    Jishnu Parayil Shibu (2023). A Curated List of Image Deblurring Datasets [Dataset]. https://www.kaggle.com/datasets/jishnuparayilshibu/a-curated-list-of-image-deblurring-datasets
    Dataset updated
    Mar 28, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Jishnu Parayil Shibu
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Given a blurred image, image deblurring aims to produce a clear, high-quality image that accurately represents the original scene. Blurring can be caused by various factors such as camera shake, fast motion, or out-of-focus objects, making it a particularly challenging computer vision problem. This has led to the recent development of a large spectrum of deblurring models and unique datasets.

    Despite the rapid advancement in image deblurring, finding and pre-processing a number of datasets for training and testing purposes has been both time-consuming and unnecessarily complicated for experts and non-experts alike. Moreover, there is a serious lack of ready-to-use domain-specific datasets, such as face and text deblurring datasets.

    To this end, the following card contains a curated list of ready-to-use image deblurring datasets for training and testing various deblurring models. Additionally, we have created an extensive, highly customizable Python package for single image deblurring called DBlur that can be used to train and test various SOTA models on the given datasets with just 2-3 lines of code.

    Following is a list of the datasets that are currently provided:
    - GoPro: The GoPro dataset for deblurring consists of 3,214 blurred images with a size of 1,280×720, divided into 2,103 training images and 1,111 test images.
    - HIDE: HIDE is a motion-blurred dataset that includes 2,025 blurred images for testing. It mainly focuses on pedestrians and street scenes.
    - RealBlur: The RealBlur testing dataset consists of two subsets. The first is RealBlur-J, consisting of 1,900 camera JPEG outputs. The second is RealBlur-R, consisting of 1,900 RAW images. The RAW images are generated by using white balance, demosaicking, and denoising operations.
    - CelebA: A face deblurring dataset created using the CelebA dataset, which consists of 2,000,000 training images, 1,299 validation images, and 1,300 testing images. The blurred images were created using the blur kernels provided by Shen et al. 2018.
    - Helen: A face deblurring dataset created using the Helen dataset, which consists of 2,000 training images, 155 validation images, and 155 testing images. The blurred images were created using the blur kernels provided by Shen et al. 2018.
    - Wider-Face: A face deblurring dataset created using the Wider-Face dataset, which consists of 4,080 training images, 567 validation images, and 567 testing images. The blurred images were created using the blur kernels provided by Shen et al. 2018.
    - TextOCR: A text deblurring dataset created using the TextOCR dataset, which consists of 5,000 training images, 500 validation images, and 500 testing images. The blurred images were created using the blur kernels provided by Shen et al. 2018.

  14. Eye Image Dataset

    • kaggle.com
    Updated Apr 1, 2025
    Cite
    Sumit R Washimkar (2025). Eye Image Dataset [Dataset]. https://www.kaggle.com/datasets/sumit17125/eye-image-dataset
    Dataset updated
    Apr 1, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Sumit R Washimkar
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Right Eye Disease Classification Dataset

    Introduction

    This dataset consists of right eye images along with a CSV file containing image names and corresponding disease labels. It is designed for disease classification tasks using deep learning and computer vision techniques.

    Dataset Information

    • The dataset contains right eye images captured from various individuals.
    • The accompanying CSV file includes the image filename and the disease label.
    • Additional columns provide relevant metadata or medical attributes.

    CSV File Columns

    • Image Name: The filename of the corresponding right eye image.
    • Disease Labels:
      • N: Normal (No Disease)
      • D: Diabetic Retinopathy
      • G: Glaucoma
      • C: Cataract
      • A: Age-Related Macular Degeneration
      • H: Hypertensive Retinopathy
      • M: Myopia
      • O: Other Eye Diseases
    • Additional columns may include patient details (if available), image capture conditions, or severity levels.
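    Given the one-letter label codes above, mapping them to readable disease names is a one-dictionary job. The toy CSV rows and exact column names below are illustrative; the real file's headers may differ.

```python
import csv
import io

# Label codes exactly as listed in the dataset description.
LABELS = {
    "N": "Normal", "D": "Diabetic Retinopathy", "G": "Glaucoma",
    "C": "Cataract", "A": "Age-Related Macular Degeneration",
    "H": "Hypertensive Retinopathy", "M": "Myopia", "O": "Other Eye Diseases",
}

# Toy in-memory CSV; real column names are an assumption here.
raw = io.StringIO("Image Name,Disease Label\nimg_0001.jpg,D\nimg_0002.jpg,N\n")
rows = list(csv.DictReader(raw))
named = [(r["Image Name"], LABELS[r["Disease Label"]]) for r in rows]
print(named)  # [('img_0001.jpg', 'Diabetic Retinopathy'), ('img_0002.jpg', 'Normal')]
```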

    Possible Use Cases

    • Deep Learning for Medical Imaging: Training CNN models for automated disease classification.
    • Image Processing & Feature Extraction: Analyzing retinal features for disease detection.
    • Transfer Learning & Fine-Tuning: Using pre-trained models (e.g., ResNet, VGG) for improving classification performance.
    • Medical AI Research: Developing AI-driven solutions for ophthalmology.

    Acknowledgments

    This dataset is designed for medical AI research and educational purposes. Proper handling of medical data is advised.

  15. Bengali Digit Recognition in the Wild (BDRW)

    • kaggle.com
    Updated Aug 18, 2016
    Cite
    DebdootSheet (2016). Bengali Digit Recognition in the Wild (BDRW) [Dataset]. https://www.kaggle.com/debdoot/bdrw/discussion
    Dataset updated
    Aug 18, 2016
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    DebdootSheet
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Context: BDRW is a real-world image dataset for developing machine learning and vision algorithms, with minimal requirements for data pre-processing and formatting, to identify decimal digits written in Bengali script. It is similar in flavor to SVHN (e.g., the images are small cropped digits) but incorporates higher visual heterogeneity and comes from a significantly harder, unsolved, real-world problem: recognizing digits and numbers in natural scene images. BDRW is obtained from numbers appearing in photographs, printed materials, sign boards, wall writings, calendar or book pages, etc.

    File: BDRW_train.zip (contains BDRW_train_1.zip, BDRW_train_2.zip)

    The two zip files are to be used together; they contain a set of .jpg images of different sizes, cropped from photographs, magazine prints, wall-writing images, etc. Each image represents a digit of the decimal number system written in Bengali (https://en.wikipedia.org/wiki/Bengali_numerals). The file labels.xls contains the number represented in each image, which can be used as the ground-truth labels for training a learning-based system to recognize Bengali numbers.
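    A minimal sketch of pairing the cropped images with their ground-truth digits, assuming labels.xls has been exported to CSV with hypothetical columns filename and digit (the actual sheet layout may differ):

```python
import csv
import io

# Inline stand-in for an exported labels file; on the real data, open the
# CSV file instead of this StringIO buffer.
SAMPLE = "filename,digit\ndigit_1.jpg,7\ndigit_2.jpg,0\n"

# Build {image filename: digit} for use as training labels.
labels = {row["filename"]: int(row["digit"])
          for row in csv.DictReader(io.StringIO(SAMPLE))}
print(labels["digit_1.jpg"])  # 7
```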

    Inspiration: This dataset is released for a machine vision challenge hosted at IEEE TechSym 2016. The challenge will also include a testing set, containing samples not present in the training set released here, which will be released after the challenge closes.

  16. Twitter Tweets Sentiment Dataset

    • kaggle.com
    Updated Apr 8, 2022
    Cite
    M Yasser H (2022). Twitter Tweets Sentiment Dataset [Dataset]. https://www.kaggle.com/datasets/yasserh/twitter-tweets-sentiment-dataset
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Apr 8, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    M Yasser H
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description


    Description:

    Twitter is an online social media platform where people share their thoughts as tweets. Some users misuse it to post hateful content. Twitter is trying to tackle this problem, and we can help by building a strong NLP-based classifier model that distinguishes negative tweets so they can be blocked. Can you build a strong classifier model to predict the same?

    Each row contains the text of a tweet and a sentiment label. In the training set you are provided with a word or phrase drawn from the tweet (selected_text) that encapsulates the provided sentiment.

    Make sure, when parsing the CSV, to remove the beginning / ending quotes from the text field, to ensure that you don't include them in your training.

    You're attempting to predict the word or phrase from the tweet that exemplifies the provided sentiment. The word or phrase should include all characters within that span (i.e. including commas, spaces, etc.)

    Columns:

    1. textID - unique ID for each piece of text
    2. text - the text of the tweet
    3. sentiment - the general sentiment of the tweet
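    A small sketch of the quote-handling note above, using Python's csv module on a made-up row; any stray quotes still surrounding the text field after parsing are stripped:

```python
import csv
import io

# Inline stand-in for train.csv; the row content is made up. The text field
# carries literal quotes even after CSV unescaping.
SAMPLE = 'textID,text,sentiment\nabc123,"""I love this!""",positive\n'

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
# Strip the leftover beginning/ending quotes, per the note above.
clean = [r["text"].strip('"') for r in rows]
print(clean)  # ['I love this!']
```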

    Acknowledgement:

    The dataset is downloaded from the Kaggle competition:
    https://www.kaggle.com/c/tweet-sentiment-extraction/data?select=train.csv

    Objective:

    • Understand the Dataset & cleanup (if required).
    • Build classification models to predict Twitter sentiments.
    • Compare the evaluation metrics of various classification algorithms.
  17. Data from: Mushroom classification

    • kaggle.com
    Updated Feb 4, 2024
    Cite
    Mathieu DUVERNE (2024). Mushroom classification [Dataset]. https://www.kaggle.com/datasets/mathieuduverne/mushroom-classification
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Feb 4, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Mathieu DUVERNE
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset includes 8,857 images. Mushrooms are annotated in COCO format.

    The following pre-processing was applied to each image:
    • Auto-orientation of pixel data (with EXIF-orientation stripping)
    • Resize to 640x640 (Stretch)

    The following augmentation was applied to create 3 versions of each source image:
    • 50% probability of horizontal flip
    • 50% probability of vertical flip

    The structure:

    dataset-directory/
    ├─ README.dataset.txt
    ├─ README.roboflow.txt
    ├─ train
    │ ├─ train-image-1.jpg
    │ ├─ train-image-2.jpg
    │ ├─ ...
    │ └─ _annotations.coco.json
    ├─ test
    │ ├─ test-image-1.jpg
    │ ├─ test-image-2.jpg
    │ ├─ ...
    │ └─ _annotations.coco.json
    └─ valid
      ├─ valid-image-1.jpg
      ├─ valid-image-2.jpg
      ├─ ...
      └─ _annotations.coco.json
    

    To convert the annotations to YOLO format, use Roboflow.
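    For a quick look at the annotations without conversion tools, the COCO JSON can also be read directly. The sketch below uses an inline stand-in for _annotations.coco.json (on the real file, use json.load; the category name "mushroom" is an assumption) and shows the arithmetic from a COCO box [x, y, w, h] to a normalized YOLO box:

```python
# Inline stand-in for _annotations.coco.json.
coco = {
    "images": [{"id": 1, "file_name": "train-image-1.jpg",
                "width": 640, "height": 640}],
    "annotations": [{"image_id": 1, "category_id": 1,
                     "bbox": [100, 120, 50, 60]}],  # COCO: [x, y, w, h]
    "categories": [{"id": 1, "name": "mushroom"}],
}

# YOLO uses normalized (center-x, center-y, width, height).
images = {im["id"]: im for im in coco["images"]}
ann = coco["annotations"][0]
x, y, w, h = ann["bbox"]
W = images[ann["image_id"]]["width"]
H = images[ann["image_id"]]["height"]
yolo_box = ((x + w / 2) / W, (y + h / 2) / H, w / W, h / H)
print(yolo_box)  # (0.1953125, 0.234375, 0.078125, 0.09375)
```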

  18. UFO sightings since 1906

    • kaggle.com
    Updated Feb 24, 2025
    Cite
    Hassan-sv (2025). UFO sightings since 1906 [Dataset]. https://www.kaggle.com/datasets/hassansv/ufo-sightings-since-1906
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Feb 24, 2025
    Dataset provided by
    Kaggle
    Authors
    Hassan-sv
    License

    ODC Public Domain Dedication and Licence (PDDL) v1.0: http://www.opendatacommons.org/licenses/pddl/1.0/
    License information was derived automatically

    Description

    Overview of the Dataset

    The UFO sightings dataset contains records of UFO sightings reported globally since 1906. The dataset includes the following columns:

    datetime: The date and time of the sighting.
    
    day: The day of the week when the sighting occurred.
    
    city: The city where the sighting was reported.
    
    state: The state or region where the sighting occurred.
    
    country: The country where the sighting was reported.
    
    shape: The shape or form of the UFO observed.
    
    duration (seconds): The duration of the sighting in seconds.
    
    duration (hours/min): The duration of the sighting in hours and minutes.
    
    comments: Additional comments or descriptions provided by the witness.
    
    day_posted: The day the sighting was reported or posted.
    
    date posted: The date the sighting was reported or posted.
    
    latitude: The latitude coordinate of the sighting location.
    
    longitude: The longitude coordinate of the sighting location.
    
    days_count: The number of days between the sighting and when it was posted.
    
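    The days_count column can be recomputed from the sighting date and posting date; a minimal sketch with made-up values:

```python
from datetime import date

# Made-up example: a sighting on July 4, 2004, posted on August 1, 2004.
sighting = date(2004, 7, 4)
posted = date(2004, 8, 1)

# days_count = days between the sighting and when it was posted.
days_count = (posted - sighting).days
print(days_count)  # 28
```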

    Analysis Process

    Data Cleaning and Preparation (Excel):
    
      Removed duplicate entries and handled missing values.
    
      Standardized formats for dates, times, and categorical variables (e.g., shapes, countries).
    
      Calculated additional metrics such as days_count (time between sighting and posting).
    
    Exploratory Data Analysis (SQL):
    
      Aggregated data to analyze trends, such as the number of sightings per country, state, or city.
    
      Calculated average durations of sightings by UFO shape.
    
      Identified the most common UFO shapes and their distribution across countries.
    
      Analyzed temporal trends, such as sightings per day or over time.
    
    Visualization (Tableau):
    
      Created interactive dashboards to visualize key insights.
    
      Developed charts such as:
    
        Average Duration of Sightings by Shape: Highlighting which UFO shapes were observed for the longest durations.
    
        UFO Shapes by Country: Showing the distribution of UFO shapes across different countries.
    
        UFO Shapes Total: A global overview of the most commonly reported UFO shapes.
    
        UFO Sightings in All Countries: A map or bar chart showing the number of sightings per country.
    
        UFO Sightings per Day: A time series analysis of sightings over days.
    
        UFO Sightings in the USA: A focused analysis of sightings in the United States, broken down by state or city.
    

    Key Insights and Conclusions

    Most Common UFO Shapes:
    
      The most frequently reported UFO shapes include lights, circles, and triangles.
    
      These shapes are consistent across multiple countries, suggesting common patterns in UFO sightings.
    
    Geographical Distribution:
    
      The United States has the highest number of reported UFO sightings, followed by Canada and the United Kingdom.
    
      Within the U.S., states like California, Florida, and Texas report the most sightings.
    
    Temporal Trends:
    
      Sightings have increased significantly since the mid-20th century, with a peak in the 2000s.
    
      Certain days of the week (e.g., weekends) show higher reporting rates, possibly due to increased outdoor activity.
    
    Duration of Sightings:
    
      The average duration of sightings varies by shape. For example, cigar-shaped UFOs tend to be observed for longer periods compared to light or disk shapes.
    
      Most sightings last less than a minute, but some reports describe durations of several hours.
    
    Reporting Delays:
    
      The days_count column reveals that many sightings are reported weeks or even months after they occur, indicating potential delays in witness reporting or data collection.
    
    Global Patterns:
    
      While the U.S. dominates the dataset, other countries show unique patterns in terms of UFO shapes and sighting frequencies.
    
      For example, Australia and Germany report a higher proportion of triangular UFOs compared to other shapes.
    

    Recommendations for Further Analysis

    Geospatial Analysis: Use latitude and longitude data to create heatmaps of sightings and identify potential hotspots.
    
    Text Analysis: Analyze the comments column using natural language processing (NLP) to extract common themes or keywords.
    
    Correlation with External Data: Investigate whether UFO sightings correlate with astronomical events, military activity, or other phenomena.
    
    Machine Learning: Build predictive models to identify patterns or classify sightings based on shape, duration, or location.
    
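    The geospatial recommendation above can be started with nothing more than 1-degree binning of the latitude/longitude columns; a sketch with made-up coordinates:

```python
import math
from collections import Counter

# Made-up (latitude, longitude) pairs standing in for the dataset's columns.
sightings = [(34.05, -118.24), (34.9, -118.7), (40.71, -74.01)]

# Bin into 1-degree cells; math.floor keeps negative longitudes in the
# correct cell (int() would truncate toward zero).
cells = Counter((math.floor(lat), math.floor(lon)) for lat, lon in sightings)
print(cells.most_common(1))  # [((34, -119), 2)]
```

    The per-cell counts feed directly into a heatmap layer in Tableau or any plotting library.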

    Conclusion

    The UFO sightings dataset provides a fascinating glimpse into global reports of unidentified flying objects. Through careful analysis, I identified key trends in UFO shapes, durations, and geographical distribution. The United States emerges as the epicenter of UFO sightings, with lights and ...

  19. Discursos presidentes civis do Brasil - 1985-2022

    • kaggle.com
    Updated Aug 20, 2024
    Cite
    Pascoal Gonçalves (2024). Discursos presidentes civis do Brasil - 1985-2022 [Dataset]. http://doi.org/10.34740/kaggle/ds/5566939
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Aug 20, 2024
    Dataset provided by
    Kaggle
    Authors
    Pascoal Gonçalves
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Essa base de dados foi organizada a partir do repositório da presidência da república e compreende a totalidade dos discursos dos presidentes brasileiros após a redemocratização do país, em 1985, até o fim de 2022. Todos os discursos estão disponibilizados em formato .txt e sua nomeação corresponde a sua datação em formato americano. Também podem ser encontrados os códigos que foram usados nas etapas de pré e pós processamento, disponíveis em formato .py e documentações que desmembram os códigos. O projeto contou com a apoio financeiro do Programa de Iniciação Teológica da UFPB.

    This database was organized from the repository of the Presidency of the Republic and includes all the speeches made by Brazilian presidents after the country's re-democratization in 1985 until the end of 2022. All the speeches are available in .txt format and their naming corresponds to their dating in American format. You can also find the codes that were used in the pre- and post-processing stages, available in .py format, and documentation that breaks down the codes. The project received financial support from the UFPB Theological Initiation Program.

    Autoria (authorship):

    Pascoal Teófilo Carvalho Gonçalves https://orcid.org/0000-0002-1336-3148 http://lattes.cnpq.br/8913105744643795

    Carla Suzana Gomes Meira https://orcid.org/0009-0003-9924-1401

    Romberg de Sá Gondim https://orcid.org/0000-0002-8857-9795

  20. NIH Chest X ray 14 (224x224 resized)

    • kaggle.com
    zip
    Updated Jul 8, 2020
    Cite
    Khan Fashee Monowar (Sawrup) (2020). NIH Chest X ray 14 (224x224 resized) [Dataset]. https://www.kaggle.com/khanfashee/nih-chest-x-ray-14-224x224-resized
    Explore at:
    zip (2,468,882,507 bytes)
    Available download formats
    Dataset updated
    Jul 8, 2020
    Authors
    Khan Fashee Monowar (Sawrup)
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    National Institutes of Health Chest X-Ray Dataset

    Chest X-ray exams are one of the most frequent and cost-effective medical imaging examinations available. However, clinical diagnosis of a chest X-ray can be challenging and sometimes more difficult than diagnosis via chest CT imaging. The lack of large publicly available datasets with annotations means it is still very difficult, if not impossible, to achieve clinically relevant computer-aided detection and diagnosis (CAD) in real-world medical sites with chest X-rays. One major hurdle in creating large X-ray image datasets is the lack of resources for labeling so many images. Prior to the release of this dataset, Open-i was the largest publicly available source of chest X-ray images, with 4,143 images available.

    This NIH Chest X-ray Dataset is comprised of 112,120 X-ray images with disease labels from 30,805 unique patients. To create these labels, the authors used Natural Language Processing to text-mine disease classifications from the associated radiological reports. The labels are expected to be >90% accurate and suitable for weakly-supervised learning. The original radiology reports are not publicly available but you can find more details on the labeling process in this Open Access paper: "ChestX-ray8: Hospital-scale Chest X-ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases." (Wang et al.)

    Data limitations:

    The image labels are NLP extracted so there could be some erroneous labels but the NLP labeling accuracy is estimated to be >90%.
    Very limited numbers of disease region bounding boxes (See BBoxlist2017.csv)
    Chest X-ray radiology reports are not anticipated to be publicly shared. Parties who use this public dataset are encouraged to share their "updated" image labels and/or new bounding boxes in their own studies later, perhaps through manual annotation.
    

    File contents

    Image format: 112,120 total images with size 1024 x 1024
    
    images_001.zip: Contains 4999 images
    
    images_002.zip: Contains 10,000 images
    
    images_003.zip: Contains 10,000 images
    
    images_004.zip: Contains 10,000 images
    
    images_005.zip: Contains 10,000 images
    
    images_006.zip: Contains 10,000 images
    
    images_007.zip: Contains 10,000 images
    
    images_008.zip: Contains 10,000 images
    
    images_009.zip: Contains 10,000 images
    
    images_010.zip: Contains 10,000 images
    
    images_011.zip: Contains 10,000 images
    
    images_012.zip: Contains 7,121 images
    
    README_ChestXray.pdf: Original README file
    
    BBoxlist2017.csv: Bounding box coordinates. Note: Start at x,y, extend horizontally w pixels, and vertically h pixels
      Image Index: File name
      Finding Label: Disease type (Class label)
      Bbox x
      Bbox y
      Bbox w
      Bbox h
    
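    The bounding-box convention above (start at x,y, extend w pixels horizontally and h pixels vertically) translates to corner coordinates as follows; the values are made up:

```python
# Made-up BBoxlist2017.csv values: x, y, w, h as described above.
x, y, w, h = 225.0, 547.0, 86.0, 79.0

# Top-left corner is (x, y); bottom-right corner is (x + w, y + h).
corners = (x, y, x + w, y + h)  # (x_min, y_min, x_max, y_max)
print(corners)  # (225.0, 547.0, 311.0, 626.0)
```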
    Dataentry2017.csv: Class labels and patient data for the entire dataset
      Image Index: File name
      Finding Labels: Disease type (Class label)
      Follow-up #
      Patient ID
      Patient Age
      Patient Gender
      View Position: X-ray orientation
      OriginalImageWidth
      OriginalImageHeight
      OriginalImagePixelSpacing_x
      OriginalImagePixelSpacing_y
    
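    In the original NIH metadata, the Finding Labels field holds one or more classes separated by "|" (e.g. "Cardiomegaly|Effusion"). A minimal sketch of splitting it into a label list; the sample row is illustrative:

```python
# Illustrative row from Dataentry2017.csv; images with multiple diseases
# carry all labels joined by "|".
row = {"Image Index": "00000001_000.png",
       "Finding Labels": "Cardiomegaly|Effusion"}

findings = row["Finding Labels"].split("|")
is_healthy = findings == ["No Finding"]
print(findings, is_healthy)  # ['Cardiomegaly', 'Effusion'] False
```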

    Class descriptions

    There are 15 classes (14 diseases, and one for "No findings"). Images can be classified as "No findings" or one or more disease classes:

    Atelectasis
    Consolidation
    Infiltration
    Pneumothorax
    Edema
    Emphysema
    Fibrosis
    Effusion
    Pneumonia
    Pleural_thickening
    Cardiomegaly
    Nodule
    Mass
    Hernia
    

    Full Dataset Content

    There are 12 zip files in total, ranging from ~2 GB to 4 GB in size. Additionally, we randomly sampled 5% of these images and created a smaller dataset for use in Kernels. The random sample contains 5,606 X-ray images and class labels.

    Sample: sample.zip
    

    Modifications to original data

    Original TAR archives were converted to ZIP archives to be compatible with the Kaggle platform
    
    CSV headers slightly modified to be more explicit in comma separation and also to allow fields to be self-explanatory
    

    Citations

    Wang X, Peng Y, Lu L, Lu Z, Bagheri M, Summers RM. ChestX-ray8: Hospital-scale Chest X-ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases. IEEE CVPR 2017.
    
    NIH News release: NIH Clinical Center provides one of the largest publicly available chest x-ray datasets to scientific community
    
    Original source files and documents: https://nihcc.app.box.com/v/ChestXray-NIHCC/folder/36938765345
    