This dataset is a modified version of the classic CIFAR-10, deliberately designed to be imbalanced across its classes. CIFAR-10 consists of 60,000 32x32 color images in 10 classes, with 5,000 images per class in the training set. This dataset, however, skews those distributions to create a more challenging environment for developing and testing machine learning algorithms. The class distribution is illustrated below:
[Figure: class distribution of the imbalanced CIFAR-10 training set]
The primary purpose of this dataset is to offer researchers and practitioners a platform to develop, test, and enhance algorithms' robustness when faced with class imbalances. It is especially suited for those interested in binary and multi-class imbalance learning, anomaly detection, and other relevant fields.
The imbalance was created synthetically, maintaining the same quality and diversity of the original CIFAR-10 dataset, but with varying degrees of representation for each class. Details of the class distributions are included in the dataset's metadata.
This dataset is beneficial for:
- Developing and testing strategies for handling imbalanced datasets.
- Investigating the effects of class imbalance on model performance.
- Comparing the performance of different machine learning algorithms under class imbalance.
Usage Information:
The dataset maintains the same format as the original CIFAR-10 dataset, making it easy to incorporate into existing projects. It is organized so that it can be loaded directly with PyTorch's ImageFolder. You can load the dataset in Python using popular libraries such as NumPy and PyTorch.
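Once the images are loaded (for example with torchvision's ImageFolder), a common first step for an imbalanced dataset is to derive inverse-frequency class weights for a loss function or sampler. A minimal sketch in plain Python; the toy label list here is a hypothetical stand-in for the dataset's real labels:

```python
from collections import Counter

def inverse_frequency_weights(labels):
    # Weight each class by n_samples / (n_classes * class_count),
    # so under-represented classes receive larger weights.
    counts = Counter(labels)
    n = len(labels)
    return {c: n / (len(counts) * k) for c, k in counts.items()}

# Toy imbalanced label list: class 0 dominates class 1
weights = inverse_frequency_weights([0, 0, 0, 0, 0, 0, 1, 1])
```

Such weights can then be passed, for example, to PyTorch's CrossEntropyLoss(weight=...) or used to build a WeightedRandomSampler.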
License: This dataset follows the same license terms as the original CIFAR-10 dataset. Please refer to the official CIFAR-10 website for details.
Acknowledgments: We want to acknowledge the creators of the CIFAR-10 dataset. Without their work and willingness to share data, this synthetic imbalanced dataset wouldn't be possible.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ABSTRACT: Most classification tools assume that the data distribution is balanced or that misclassification costs are similar. Nevertheless, in practice, datasets with unbalanced classes are commonplace, such as in the diagnosis of diseases, where confirmed cases are usually rare compared with the healthy population. Other examples are the detection of fraudulent calls and the detection of system intruders. In these cases, misclassifying the minority class (for instance, diagnosing a person with cancer as healthy) may have more serious consequences than misclassifying the majority class. It is therefore important to treat datasets in which unbalanced classes occur. This paper presents the SMOTE_Easy algorithm, which can classify data even when there is a high level of imbalance between the classes. To demonstrate its efficiency, it was compared with the main algorithms for classifying unbalanced data, and it was successful on nearly all tested databases.
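SMOTE_Easy itself is the paper's contribution; the baseline SMOTE idea it builds on, interpolating between a minority sample and one of its nearest minority neighbours, can be sketched as follows (a simplified illustration, not the paper's algorithm):

```python
import numpy as np

def smote_oversample(X_min, n_new, k=1, rng=None):
    # Minimal SMOTE-style interpolation: each synthetic point lies on
    # the segment between a minority sample and one of its k nearest
    # minority neighbours.
    rng = np.random.default_rng(rng)
    X_min = np.asarray(X_min, dtype=float)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        nn = np.argsort(d)[1:k + 1]          # skip the point itself
        j = rng.choice(nn)
        gap = rng.random()                   # position along the segment
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.vstack(synthetic)
```

In practice, libraries such as imbalanced-learn provide production-quality SMOTE variants.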
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Depression presents a significant challenge to global mental health, often intertwined with factors including oxidative stress. Although the precise relationship with mitochondrial pathways remains elusive, recent advances in machine learning present an avenue for further investigation. This study employed advanced machine learning techniques to classify major depressive disorder based on clinical indicators and mitochondrial oxidative stress markers. Six machine learning algorithms, including Random Forest (RF), were applied and their performance was investigated on balanced and unbalanced data sets in binary and multiclass classification scenarios. Results indicate promising accuracy and precision, particularly with Random Forest on balanced data. RF achieved an average accuracy of 92.7% and an F1 score of 83.95% for binary classification, 90.36% and 90.1%, respectively, for the classification of three classes of depression severity, and 89.76% and 88.26%, respectively, for the classification of five classes. Including only oxidative stress markers resulted in an accuracy and an F1 score of 79.52% and 80.56%, respectively. Notably, including mitochondrial peptides alongside clinical factors significantly enhances predictive capability, shedding light on the interplay between depression severity and mitochondrial oxidative stress pathways. These findings underscore the potential for machine learning models to aid clinical assessment, particularly in individuals with comorbid conditions such as hypertension, diabetes mellitus, and cardiovascular disease.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
The Multi-class Weather Dataset (MWD) for image classification is a valuable dataset used in the research paper entitled “Multi-class weather recognition from the still image using heterogeneous ensemble method”.
The dataset provides a platform for outdoor weather analysis by extracting various features for recognizing different weather conditions.
Please note that we have updated the folder structure of the dataset folder to facilitate the data loading procedure.
| Class | # of Images |
|---|---|
| Sunrise | 357 |
| Shine | 253 |
| Rain | 215 |
| Cloudy | 300 |
The dataset was published on Mendeley Data.
Cite: Ajayi, Gbeminiyi (2018), "Multi-class Weather Dataset for Image Classification", Mendeley Data, v1, 2018-09-13, University of South Africa - Science Campus.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
This synthetic dataset was generated from Monte Carlo simulations of lightning flashovers on medium voltage (MV) distribution lines. It is suitable for training machine learning models for classifying lightning flashovers on distribution lines. The dataset is hierarchical in nature (see below for more information) and class imbalanced.
The following five types of lightning interaction with the MV distribution line have been simulated: (1) direct strike to a phase conductor (when there is no shield wire present on the line), (2) direct strike to a phase conductor with shield wire(s) present on the line (i.e. shielding failure), (3) direct strike to a shield wire with a backflashover event, (4) indirect nearby lightning strike to ground where a shield wire is not present, and (5) indirect nearby lightning strike to ground where a shield wire is present on the line. The last two types of lightning interaction induce overvoltages on the phase conductors through EM fields radiated from the strike channel that couple to the line conductors. Three different methods of indirect strike analysis have been implemented: Rusck's model, the Chowdhuri-Gross model, and the Liew-Mar model. Shield wire(s) provide shielding effects against direct, as well as screening effects against indirect, lightning strikes.
The dataset covers two independent distribution lines, with heights of 12 m and 15 m, each with a flat configuration of phase conductors. Twin shield wires, if present, are 1.5 m above the phase conductors and 3 m apart [2]. The CFO level of the 12 m distribution line is 150 kV and that of the 15 m line is 160 kV. The dataset consists of 10,000 simulations for each of the distribution lines.
The dataset contains the following variables (features):
'dist': perpendicular distance of the lightning strike location from the distribution line axis (m), generated from the Uniform distribution [0, 500] m,
'ampl': lightning current amplitude of the strike (kA), generated from the Log-Normal distribution (see IEC 60071 for additional information),
'front': lightning current wave-front time (us), generated from the Log-Normal distribution; it needs to be emphasized that amplitudes (ampl) and wave-front times (front), as random variables, have been generated from the appropriate bivariate probability distribution which includes statistical correlation between these variates,
'veloc': velocity of the lightning return-stroke current defined indirectly through the parameter "w" that is generated from the Uniform distribution [50, 500] m/us, which is then used for computing the velocity from the following relation: v = c/sqrt(1+w/I), where "c" is the speed of light in free space (300 m/us) and "I" is the lightning-current amplitude,
'shield': binary indicator that signals presence or absence of the shield wire(s) on the line (0/1), generated from the Bernoulli distribution with a 50% probability,
'Ri': average value of the impulse impedance of the tower's grounding (Ohm), generated from the Normal distribution (clipped at zero on the left side) with a median value of 50 Ohm and a standard deviation of 12.5 Ohm; it should be mentioned that the impulse impedance is often much larger than the associated grounding resistance value, which is why a rather high value of 50 Ohm has been used here,
'EGM': electrogeometric model used for analyzing striking distances of the distribution line's tower; following options are available: 'Wagner', 'Young', 'AW', 'BW', 'Love', and 'Anderson', where 'AW' stands for Armstrong & Whitehead, while 'BW' means Brown & Whitehead model; statistical distribution of EGM models follows a user-defined discrete categorical distribution with respective probabilities: p = [0.1, 0.2, 0.1, 0.1, 0.3, 0.2],
'ind': indirect stroke model used for analyzing near-by indirect lightning strikes; following options were implemented: 'rusk' for the Rusck's model, 'chow' for the Chowdhuri-Gross model (with Jakubowski modification) and 'liew' for the Liew-Mar model; statistical distribution of these three models follows a user-defined discrete categorical distribution with respective probabilities: p = [0.6, 0.2, 0.2],
'CFO': critical flashover voltage level of the distribution line's insulation (kV),
'height': height of the phase conductors of the distribution line (m),
'flash': binary indicator that signals if the flashover has been recorded (1) or not (0). This variable is the outcome/label (i.e. binary class).
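As an illustration of how sampling these features might look, here is a sketch of drawing the amplitude, the "w" parameter, and the resulting return-stroke velocity. The log-normal parameters (median 31.1 kA, sigma 0.48) are assumed typical first-stroke values, not necessarily those used for this dataset, and the amplitude/wave-front correlation is ignored here:

```python
import numpy as np

rng = np.random.default_rng(42)
C = 300.0  # speed of light in free space (m/us)

# Assumed log-normal amplitude parameters (median 31.1 kA, sigma 0.48)
ampl = rng.lognormal(mean=np.log(31.1), sigma=0.48, size=1000)   # kA

# 'w' parameter from the Uniform [50, 500] m/us distribution
w = rng.uniform(50.0, 500.0, size=1000)

# Return-stroke velocity from the relation given above: v = c / sqrt(1 + w/I)
veloc = C / np.sqrt(1.0 + w / ampl)
```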
The mathematical background used for the analysis of lightning interaction with the MV distribution line can be found in the references cited below.
References:
A. R. Hileman, "Insulation Coordination for Power Systems", CRC Press, Boca Raton, FL, 1999.
J. A. Martinez and F. Gonzalez-Molina, "Statistical evaluation of lightning overvoltages on overhead distribution lines using neural networks," in IEEE Transactions on Power Delivery, vol. 20, no. 3, pp. 2219-2226, July 2005.
A. Borghetti, C. A. Nucci and M. Paolone, "An Improved Procedure for the Assessment of Overhead Line Indirect Lightning Performance and Its Comparison with the IEEE Std. 1410 Method," IEEE Transactions on Power Delivery, vol. 22, no. 1, pp. 684-692, 2007.
License: Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
Fruit Quality Detection Dataset
This dataset is meticulously curated to facilitate the training of machine learning models, such as YOLOv8, for fruit quality detection. It includes labeled images of fruits classified into categories such as 'bad apple', 'bad banana', 'bad orange', 'bad pomegranate', 'good apple', 'good banana', 'good orange', and 'good pomegranate'.
Dataset Versions and Updates:
The data.yaml file was adjusted by shifting the row indexes of the names matrix down (the first index was deleted), and labels were updated accordingly. The dataset comprises 3,078 training images (70%), 878 validation images (20%), and 442 test images (10%). This version faced challenges with unbalanced class distribution, as illustrated in the distribution graph below:
[Figure: class distribution before augmentation]
Version 4: Data Augmentation To address the imbalance, several augmentation techniques were applied:
These modifications improved the balance slightly, reflected in the revised counts of 8,318 training images (85%), 924 validation images (10%), and 438 test images (5%), and in the updated distribution graph:
[Figures: updated class distributions after augmentation]
Overview: 142: Areas used for sports, leisure and recreation purposes.
Traceability (lineage): This dataset was produced with a machine learning framework with several input datasets, specified in detail in Witjes et al., 2022 (in review, preprint available at https://doi.org/10.21203/rs.3.rs-561383/v3).
Scientific methodology: The single-class probability layers were generated with a spatiotemporal ensemble machine learning framework detailed in Witjes et al., 2022 (in review, preprint available at https://doi.org/10.21203/rs.3.rs-561383/v3). The single-class uncertainty layers were calculated by taking the standard deviation of the three single-class probabilities predicted by the three components of the ensemble. The HCL (hard class) layers represent the class with the highest probability as predicted by the ensemble.
Usability: The HCL layers have a decreasing average accuracy (weighted F1-score) at each subsequent level in the CLC hierarchy. These metrics are 0.83 at level 1 (5 classes), 0.63 at level 2 (14 classes), and 0.49 at level 3 (43 classes). This means that the hard-class maps are more reliable when aggregating classes to a higher level in the hierarchy (e.g. 'Discontinuous Urban Fabric' and 'Continuous Urban Fabric' to 'Urban Fabric'). Some single-class probabilities may more closely represent actual patterns for some classes that were overshadowed by unequal sample point distributions. Users are encouraged to set their own thresholds when postprocessing these datasets to optimize the accuracy for their specific use case.
Uncertainty quantification: Uncertainty is quantified by taking the standard deviation of the probabilities predicted by the three components of the spatiotemporal ensemble model.
Data validation approaches: The LULC classification was validated through spatial 5-fold cross-validation as detailed in the accompanying publication.
Completeness: The dataset has chunks of empty predictions in regions with complex coastlines (e.g. the Zeeland province in the Netherlands and the Mar da Palha bay area in Portugal). These are artifacts that will be avoided in subsequent versions of the LULC product.
Consistency: The accuracy of the predictions was compared per year and per 30 km x 30 km tile across Europe to derive temporal and spatial consistency by calculating the standard deviation. The standard deviation of the annual weighted F1-score was 0.135, while the standard deviation of the weighted F1-score per tile was 0.150. This means the dataset is more consistent through time than through space: predictions are notably less accurate along the Mediterranean coast. The accompanying publication contains additional information and visualisations.
Positional accuracy: The raster layers have a resolution of 30 m, identical to that of the Landsat data cube used as input features for the machine learning framework that predicted it.
Temporal accuracy: The dataset contains predictions and uncertainty layers for each year between 2000 and 2019.
Thematic accuracy: The maps reproduce the Corine Land Cover classification system, a hierarchical legend that consists of 5 classes at the highest level, 14 classes at the second level, and 44 classes at the third level. Class 523: Oceans was omitted due to computational constraints.
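The per-pixel uncertainty measure used here (standard deviation of the three ensemble components' probabilities) takes only a few lines; the probability values below are hypothetical stand-ins:

```python
import numpy as np

# Hypothetical probabilities for one pixel and one class, as predicted
# by the three components of the spatiotemporal ensemble
p = np.array([0.62, 0.55, 0.70])

ensemble_prob = p.mean()   # value in the single-class probability layer
uncertainty = p.std()      # value in the single-class uncertainty layer
```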
Introduction: The main objective of this study is to evaluate working memory and determine EEG biomarkers that can assist in the field of health neuroscience. Our ultimate goal is to utilize this approach to predict the early signs of mild cognitive impairment (MCI) in healthy elderly individuals, which could potentially lead to dementia. The advancements in health neuroscience research have revealed that affective reminiscence stimulation is an effective method for developing EEG-based neuro-biomarkers that can detect the signs of MCI.
Methods: We use topological data analysis (TDA) on multivariate EEG data to extract features that can be used for unsupervised clustering, subsequent machine learning-based classification, and cognitive score regression. We perform EEG experiments to evaluate conscious awareness in affective reminiscent photography settings.
Results: We use EEG and interior photography to distinguish between healthy cognitive aging and MCI. Our UMAP clustering and random forest application accurately predict MCI stage and MoCA scores.
Discussion: Our team has successfully implemented TDA feature extraction, MCI classification, and an initial regression of MoCA scores. However, our study has certain limitations due to a small sample size of only 23 participants and an unbalanced class distribution. To enhance the accuracy and validity of our results, future research should focus on expanding the sample size, ensuring gender balance, and extending the study to a cross-cultural context.
License: Community Data License Agreement - Permissive 1.0, https://cdla.io/permissive-1-0/
This dataset contains a comprehensive collection of waste images designed for training machine learning models to classify different types of waste materials, with a strong focus on electronic waste (e-waste) and mixed materials. The dataset includes 7 electronic device categories alongside traditional recyclable materials, making it ideal for modern waste management challenges where electronic devices constitute a significant portion of waste streams. The dataset has been carefully curated and balanced to ensure optimal performance for multi-category waste classification tasks using deep learning approaches.
The dataset includes 17 distinct waste categories covering various types of materials commonly found in waste management scenarios:
balanced_waste_images/
├── category_1/
│ ├── image_001.jpg
│ ├── image_002.jpg
│ └── ... (400 images)
├── category_2/
│ ├── image_001.jpg
│ └── ... (400 images)
└── ... (17 categories total)
Note: Dataset is not pre-split. Users need to create train/validation/test splits as needed.
Since the dataset is not pre-split, you'll need to create train/validation/test splits:
import splitfolders

# Split dataset: 80% train, 10% val, 10% test
splitfolders.ratio(
    input='balanced_waste_images',
    output='split_data',
    seed=42,
    ratio=(.8, .1, .1),
    group_prefix=None,
    move=False
)
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Data generators with preprocessing
train_datagen = ImageDataGenerator(rescale=1./255)
val_datagen = ImageDataGenerator(rescale=1./255)

train_generator = train_datagen.flow_from_directory(
    'split_data/train/',
    target_size=(224, 224),
    batch_size=32,
    class_mode='categorical'
)

val_generator = val_datagen.flow_from_directory(
    'split_data/val/',
    target_size=(224, 224),
    batch_size=32,
    class_mode='categorical'
)
License: MIT License, https://opensource.org/licenses/MIT
Image classification is one of the fundamental tasks in computer vision and machine learning. High-quality datasets are crucial for training robust models that can accurately identify different species. This dataset focuses on three distinct species commonly found in mountainous regions, providing a balanced collection of images for both training and evaluation purposes.
This dataset contains 4,550 high-quality images distributed across three categories: - Training set: 3,500 images (approximately 1,167 images per class) - Test set: 1,050 images (350 images per class)
The dataset is organized in a structured format with separate directories for: 1. Anaphalis Javanica 2. Leontopodium Alpinum 3. Leucogenes Grandiceps
Each image in the dataset has been carefully prepared to ensure consistency and quality for machine learning applications. The balanced distribution between classes helps prevent bias during model training.
The dataset's clean split between training and test sets makes it ideal for developing and evaluating classification models while following machine learning best practices.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
The dataset is a collection of presence and absence points for forest tree species for Europe. Each unique combination of longitude, latitude and year was considered as an independent sample. Presence data was obtained from the harmonized tree species occurrence dataset by Heising and Hengl (2020) and absence data from the LUCAS (in-situ source) dataset.
A set of 50 different forest tree species was selected from the harmonized tree species dataset and data lacking a temporal observation was overlaid with yearly forest masks derived from land cover maps produced by Parente et al. (2021). We overlaid the points with the probability maps for the classes:
Points were included in the dataset only if the probability value extracted for at least one of the above classes was ≥ 50% for all the years considered. An additional quality flag was added to distinguish points coming from this operation and the points with original year of observation coming from source datasets.
The final dataset contains 4,359,999 observations and a total of 630 columns.
The first 8 columns of the dataset contain metadata information used to uniquely identify the points:
The remaining columns contain the extracted values of a series of predictor variables (temperature, precipitation, elevation, topographical information, spectral reflectance) useful for species distribution modeling applications. These points were used to model the potential and realized distribution of a series of 16 target species for the period 2000 - 2020. The approach involved training three ML models to predict probability of presence (i.e. Random Forest, XGBoost, GLM), which served as input to train a linear meta-model (i.e. Logistic regression classifier), responsible for predicting the final probability of presence for each species.
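The stacking setup described above can be sketched with scikit-learn. This is an illustrative approximation, not the authors' pipeline: GradientBoostingClassifier stands in for XGBoost, and a synthetic table replaces the real predictor variables:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real predictor table (temperature, elevation, ...)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Three base models predict probability of presence; a logistic-regression
# meta-model combines their probabilities into the final prediction.
base = [("rf", RandomForestClassifier(random_state=0)),
        ("gb", GradientBoostingClassifier(random_state=0)),  # XGBoost stand-in
        ("glm", LogisticRegression(max_iter=1000))]
meta = StackingClassifier(estimators=base,
                          final_estimator=LogisticRegression(),
                          stack_method="predict_proba")
meta.fit(X_tr, y_tr)
score = meta.score(X_te, y_te)
```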
The 10 most important variables used by each of the three base models are available in the "variable importance" plots for both potential and realized distribution in a PDF format.
The RDS file is created from a data.table object and is suitable for fast reading in the R programming environment. The CSV.GZ file contains records as a table with easting and northing in the Coordinate Reference System ETRS89 / LAEA Europe (EPSG code 3035) and can be fed into a GIS after being unzipped.
To access the predictions of the meta-model (probabilities and uncertainties) produced for these species access:
If you would like to know more about the creation of this dataset and the modeling, watch the talk at Open Data Science Workshop 2021 (TIB AV-PORTAL)
A publication describing, in detail, all processing steps, accuracy assessment and general analysis of species distribution maps is under preparation. To suggest any improvement/fix use https://gitlab.com/geoharmonizer_inea/spatial-layers/-/issues.
License: Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
Lemon Leaf Classification Dataset: A carefully curated dataset consisting of five distinct classes of lemon leaves, designed for robust image classification tasks. Each class represents unique variations in leaf characteristics, including shape, texture, and disease conditions. This dataset is ideal for developing and testing machine learning and deep learning models, contributing to the advancement of agricultural research. The balanced class distribution ensures a reliable foundation for classification models, enhancing precision in identifying different types of lemon leaves and promoting disease detection.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Outline of class distribution in the dataset.
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
The goal of this work is to generate large statistically representative datasets to train machine learning models for disruption prediction provided by data from few existing discharges. Such a comprehensive training database is important to achieve satisfying and reliable prediction results in artificial neural network classifiers. Here, we aim for a robust augmentation of the training database for multivariate time series data using Student-t process regression. We apply Student-t process regression in a state space formulation via Bayesian filtering to tackle challenges imposed by outliers and noise in the training data set and to reduce the computational complexity. Thus, the method can also be used if the time resolution is high. We use an uncorrelated model for each dimension and impose correlations afterwards via coloring transformations. We demonstrate the efficacy of our approach on plasma diagnostics data of three different disruption classes from the DIII-D tokamak. To evaluate if the distribution of the generated data is similar to the training data, we additionally perform statistical analyses using methods from time series analysis, descriptive statistics, and classic machine learning clustering algorithms.
License: Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
This is a collection of tweets related to financial institutions such as banks, credit unions, and other financial service providers. Each tweet has been manually labeled with its corresponding sentiment, either positive, negative, or neutral. The dataset contains a mix of complaints and praises directed towards these financial institutions, providing a balanced perspective on public opinion.
Data Quality: The dataset has been carefully curated to ensure high-quality data. Tweets with incomplete or ambiguous information were excluded, and special attention was given to ensuring that the labels accurately reflect the sentiment expressed in the corresponding tweet.
Use Cases: This dataset can be used for various applications, such as:
- Sentiment analysis research: The dataset provides a rich resource for studying the opinions and perceptions people hold towards financial institutions. Researchers can use this dataset to investigate factors influencing sentiment, compare sentiments across different demographics or institutions, and analyze trends over time.
- Machine learning model development: With its balanced class distribution, the dataset offers an excellent opportunity to train and evaluate machine learning models for sentiment classification tasks. Models developed using this dataset can potentially achieve high accuracy and generalize well to new, unseen data.
- Business intelligence: Financial institutions can leverage insights gained from the dataset to identify areas where they excel or struggle in terms of customer satisfaction. By analyzing the feedback expressed in the tweets, institutions can improve their services, address common concerns, and enhance overall customer experience.
Overall, this dataset represents a valuable asset for anyone interested in exploring the complex dynamics of public sentiment towards financial institutions. Its diverse range of opinions and topics provides a fertile ground for research, model development, and practical applications in the finance industry.
Aim: In neuroscience research, data are quite often characterized by an imbalanced distribution between the majority and minority classes, an issue that can limit or even worsen the prediction performance of machine learning methods. Different resampling procedures have been developed to face this problem and a lot of work has been done in comparing their effectiveness in different scenarios. Notably, the robustness of such techniques has been tested among a wide variety of different datasets, without considering the performance of each specific dataset. In this study, we compare the performances of different resampling procedures for the imbalanced domain in stereo-electroencephalography (SEEG) recordings of patients with focal epilepsies who underwent surgery.
Methods: We considered data obtained by network analysis of interictal SEEG recorded from 10 patients with drug-resistant focal epilepsies, for a supervised classification problem aimed at distinguishing between the epileptogenic and non-epileptogenic brain regions in interictal conditions. We investigated the effectiveness of five oversampling and five undersampling procedures, using 10 different machine learning classifiers. Moreover, six specific ensemble methods for the imbalanced domain were also tested. To compare the performances, the Area under the ROC curve (AUC), F-measure, Geometric Mean, and Balanced Accuracy were considered.
Results: Both resampling procedures showed improved performances with respect to the original dataset. The oversampling procedures were found to be more sensitive to the type of classification method employed, with Adaptive Synthetic Sampling (ADASYN) exhibiting the best performances.
All the undersampling approaches were more robust than the oversampling ones across the different classifiers, with Random Undersampling (RUS) exhibiting the best performance despite being the simplest and most basic resampling method.
Conclusions: The application of machine learning techniques that take into consideration the balance of features by resampling is beneficial and leads to more accurate localization of the epileptogenic zone from interictal periods. In addition, our results highlight the importance of the type of classification method that must be used together with the resampling to maximize the benefit to the outcome.
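Random undersampling, the best-performing undersampling approach in this study, is simple enough to sketch directly (in practice, libraries such as imbalanced-learn provide RUS, ADASYN, and the other procedures compared here):

```python
import numpy as np

def random_undersample(X, y, rng=None):
    # Random undersampling (RUS): downsample every class to the
    # size of the smallest class.
    rng = np.random.default_rng(rng)
    X, y = np.asarray(X), np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), n_min, replace=False)
        for c in classes
    ])
    return X[keep], y[keep]
```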
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Please note that the file msl-labeled-data-set-v2.1.zip below contains the latest images and labels associated with this data set.
Data Set Description
The data set consists of 6,820 images that were collected by the Mars Science Laboratory (MSL) Curiosity Rover by three instruments: (1) the Mast Camera (Mastcam) Left Eye; (2) the Mast Camera Right Eye; (3) the Mars Hand Lens Imager (MAHLI). With help from Dr. Raymond Francis, a member of the MSL operations team, we identified 19 classes with science and engineering interests (see the "Classes" section for more information), and each image is assigned 1 class label. We split the data set into training, validation, and test sets in order to train and evaluate machine learning algorithms. The training set contains 5,920 images (including augmented images; see the "Image Augmentation" section for more information); the validation set contains 300 images; the test set contains 600 images. The training set images were randomly sampled from sol (Martian day) range 1 - 948; validation set images were randomly sampled from sol range 949 - 1920; test set images were randomly sampled from sol range 1921 - 2224. All images are resized to 227 x 227 pixels without preserving the original height/width aspect ratio.
Directory Contents
The label files are formatted as below:
"Image-file-name class_in_integer_representation"
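Given that format, a label file such as train-set.txt can be parsed with a few lines of Python (a sketch based on the format string above; splitting from the right keeps any file name intact even if it were to contain spaces):

```python
def load_labels(path):
    """Parse a label file whose lines look like:
    'image-file-name class_in_integer_representation'."""
    labels = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            name, cls = line.rsplit(maxsplit=1)
            labels[name] = int(cls)
    return labels
```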
Labeling Process
Each image was labeled with help from three different volunteers (see Contributor list). The final labels are determined using the following processes:
Classes
There are 19 classes identified in this data set. In order to simplify our training and evaluation algorithms, we mapped the class names from string to integer representations. The names of the classes, the string-integer mappings, and the distributions are shown below:
Class name, counts (training set), counts (validation set), counts (test set), integer representation
Arm cover, 10, 1, 4, 0
Other rover part, 190, 11, 10, 1
Artifact, 680, 62, 132, 2
Nearby surface, 1554, 74, 187, 3
Close-up rock, 1422, 50, 84, 4
DRT, 8, 4, 6, 5
DRT spot, 214, 1, 7, 6
Distant landscape, 342, 14, 34, 7
Drill hole, 252, 5, 12, 8
Night sky, 40, 3, 4, 9
Float, 190, 5, 1, 10
Layers, 182, 21, 17, 11
Light-toned veins, 42, 4, 27, 12
Mastcam cal target, 122, 12, 29, 13
Sand, 228, 19, 16, 14
Sun, 182, 5, 19, 15
Wheel, 212, 5, 5, 16
Wheel joint, 62, 1, 5, 17
Wheel tracks, 26, 3, 1, 18
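For convenience, the string-to-integer mapping in the table above can be written directly as a Python dictionary (names and integers taken verbatim from the table):

```python
# Class-name -> integer mapping, as listed in the table above.
CLASS_TO_INT = {
    "Arm cover": 0, "Other rover part": 1, "Artifact": 2, "Nearby surface": 3,
    "Close-up rock": 4, "DRT": 5, "DRT spot": 6, "Distant landscape": 7,
    "Drill hole": 8, "Night sky": 9, "Float": 10, "Layers": 11,
    "Light-toned veins": 12, "Mastcam cal target": 13, "Sand": 14,
    "Sun": 15, "Wheel": 16, "Wheel joint": 17, "Wheel tracks": 18,
}

# Inverse mapping, useful for turning predictions back into readable names.
INT_TO_CLASS = {v: k for k, v in CLASS_TO_INT.items()}
```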
Image Augmentation
Only the training set contains augmented images: 3,920 of the 5,920 images in the training set are augmented versions of the remaining 2,000 original training images. Images taken by different instruments were augmented differently. We employed 5 different methods to augment images: images taken by the Mastcam left and right eye cameras were augmented using only a horizontal flipping method, while images taken by the MAHLI camera were augmented using all 5 methods. Note that one can filter based on the file names listed in the train-set.txt file to obtain a set of non-augmented images.
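The aspect-ratio-ignoring resize and the horizontal-flip augmentation named above can be illustrated with plain NumPy array indexing (a sketch on a dummy array, not the authors' pipeline; nearest-neighbour interpolation is an assumption made for brevity, and the other four augmentation methods are not specified here, so only flipping is shown):

```python
import numpy as np

# Dummy 480x640 RGB array standing in for an original camera frame.
img = np.zeros((480, 640, 3), dtype=np.uint8)

# Nearest-neighbour resize to 227 x 227, deliberately ignoring the
# original aspect ratio, as the data set description specifies.
rows = np.linspace(0, img.shape[0] - 1, 227).astype(int)
cols = np.linspace(0, img.shape[1] - 1, 227).astype(int)
resized = img[rows][:, cols]

# Horizontal flip: the one augmentation named for Mastcam left/right-eye images.
flipped = resized[:, ::-1]
```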
Acknowledgment
The authors would like to thank the volunteers (as in the Contributor list) who provided annotations for this data set. We would also like to thank the PDS Imaging Node for the continuous support of this work.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Type of data: 256x256 px Banana images. Data format: JPEG Contents of the dataset: Banana cultivars and ripeness stages.
Number of classes: (1) Four Most Popular Banana cultivars in Bangladesh - Bangla Kola, Chompa Kola, Sabri Kola, and Sagor Kola, and (2) Four Ripeness Stages - Green, Semi-ripe, Ripe, and Overripe
Number of images: (1) Total original (raw) images of banana cultivars = 2512, Augmented to 7536 images, and (2) Total original (raw) images of ripeness stages = 825, Augmented to 2460 images.
Distribution of instances: (1) Original (raw) images in each class of banana cultivars: Bangla Kola = 444, Chompa Kola = 1035, Sabri Kola = 509, and Sagor Kola = 524. (2) Augmented images in each class of banana cultivars: Bangla Kola = 1332, Chompa Kola = 3105, Sabri Kola = 1527, Sagor Kola = 1572. (3) Original (raw) images in each class of ripeness stages: Green = 213, Semi-ripe = 205, Ripe = 204, and Overripe = 203. (4) Augmented images in each class of ripeness stages: Green = 639, Semi-ripe = 612, Ripe = 600, and Overripe = 609.
Dataset Size: (1) Total size of the original (raw) banana cultivars dataset = 17.5 MB. (2) Total size of the augmented banana cultivars dataset = 80.1 MB. (3) Total size of the original (raw) ripeness stages dataset = 5.58 MB, and (4) Total size of the augmented ripeness stages dataset = 25.4 MB.
Data Acquisition Process: Images of bananas were captured using mobile phone cameras. Data Source Location: Local banana wholesale markets and retail fruit shops in different parts of Bangladesh. Where applicable: Training machine learning and deep learning models to distinguish popular banana cultivars of Bangladesh and the ripeness stages of bananas.
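As a quick consistency check, the cultivar counts quoted above correspond to a uniform 3x augmentation factor per class (a sketch using only the numbers from this description):

```python
# Per-class counts for the banana cultivars, as quoted in the description.
raw = {"Bangla Kola": 444, "Chompa Kola": 1035,
       "Sabri Kola": 509, "Sagor Kola": 524}
augmented = {"Bangla Kola": 1332, "Chompa Kola": 3105,
             "Sabri Kola": 1527, "Sagor Kola": 1572}

# Totals match the stated dataset sizes (2512 raw, 7536 augmented).
assert sum(raw.values()) == 2512
assert sum(augmented.values()) == 7536

# Every cultivar class was augmented by exactly a factor of 3.
for name, n in raw.items():
    assert augmented[name] == 3 * n
```

The ripeness-stage counts sum correctly as well (825 raw, 2460 augmented), though the per-class factors there are not exactly 3x as quoted.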
The timely diagnosis of Alzheimer’s disease (AD) and its prodromal stages is critically important for patients, who manifest different neurodegenerative severity and progression risks, to receive intervention and early symptomatic treatment before the brain damage takes shape. As one of the promising techniques, functional near-infrared spectroscopy (fNIRS) has been widely employed to support early-stage AD diagnosis. This study aims to validate the capability of fNIRS coupled with Deep Learning (DL) models for AD multi-class classification. First, a comprehensive experimental design, including resting, cognitive, memory, and verbal tasks, was conducted. Second, to precisely evaluate AD progression, we thoroughly examined the change in hemodynamic responses measured in the prefrontal cortex among four subject groups and between genders. Then, we adopted a set of DL architectures on an extremely imbalanced fNIRS dataset. The results indicated that statistical differences between subject groups did exist during memory and verbal tasks, suggesting a correlation between the level of hemoglobin activation and the degree of AD severity. There was also a gender effect on the hemoglobin changes due to the functional stimulation in our study. Moreover, we demonstrated the potential of the selected DL models, which boosted the multi-class classification performance. The highest accuracy was achieved by a Convolutional Neural Network-Long Short-Term Memory (CNN-LSTM) model using the original dataset of three hemoglobin types (0.909 ± 0.012 on average). Compared to conventional machine learning algorithms, DL models produced better classification performance. These findings demonstrate the capability of DL frameworks for imbalanced class distribution analysis and validate the great potential of fNIRS-based approaches to contribute further to the development of AD diagnosis systems.
Overview: 412: Wetlands with accumulation of a considerable amount of decomposed moss (mostly Sphagnum) and vegetation matter; both natural and exploited peat bogs. Traceability (lineage): This dataset was produced with a machine learning framework with several input datasets, specified in detail in Witjes et al., 2022 (in review, preprint available at https://doi.org/10.21203/rs.3.rs-561383/v3 ). Scientific methodology: The single-class probability layers were generated with a spatiotemporal ensemble machine learning framework detailed in Witjes et al., 2022 (in review, preprint available at https://doi.org/10.21203/rs.3.rs-561383/v3 ). The single-class uncertainty layers were calculated by taking the standard deviation of the three single-class probabilities predicted by the three components of the ensemble. The HCL (hard class) layers represent the class with the highest probability as predicted by the ensemble. Usability: The HCL layers have a decreasing average accuracy (weighted F1-score) at each subsequent level in the CLC hierarchy. These metrics are 0.83 at level 1 (5 classes), 0.63 at level 2 (14 classes), and 0.49 at level 3 (43 classes). This means that the hard-class maps are more reliable when aggregating classes to a higher level in the hierarchy (e.g. 'Discontinuous Urban Fabric' and 'Continuous Urban Fabric' to 'Urban Fabric'). Some single-class probabilities may more closely represent actual patterns for classes that were overshadowed by unequal sample point distributions. Users are encouraged to set their own thresholds when postprocessing these datasets to optimize the accuracy for their specific use case. Uncertainty quantification: Uncertainty is quantified by taking the standard deviation of the probabilities predicted by the three components of the spatiotemporal ensemble model. Data validation approaches: The LULC classification was validated through spatial 5-fold cross-validation as detailed in the accompanying publication.
Completeness: The dataset has chunks of empty predictions in regions with complex coastlines (e.g. the Zeeland province in the Netherlands and the Mar da Palha bay area in Portugal). These are artifacts that will be avoided in subsequent versions of the LULC product. Consistency: The accuracy of the predictions was compared per year and per 30 km x 30 km tile across Europe to derive temporal and spatial consistency by calculating the standard deviation. The standard deviation of the annual weighted F1-score was 0.135, while the standard deviation of the weighted F1-score per tile was 0.150. This means the dataset is more consistent through time than through space: predictions are notably less accurate along the Mediterranean coast. The accompanying publication contains additional information and visualisations. Positional accuracy: The raster layers have a resolution of 30 m, identical to that of the Landsat data cube used as input features for the machine learning framework that predicted them. Temporal accuracy: The dataset contains predictions and uncertainty layers for each year between 2000 and 2019. Thematic accuracy: The maps reproduce the Corine Land Cover classification system, a hierarchical legend that consists of 5 classes at the highest level, 14 classes at the second level, and 44 classes at the third level. Class 523: Oceans was omitted due to computational constraints.