91 datasets found
  1. Imbalanced Cifar-10

    • kaggle.com
    zip
    Updated Jun 17, 2023
    Cite
    Akhil Theerthala (2023). Imbalanced Cifar-10 [Dataset]. https://www.kaggle.com/datasets/akhiltheerthala/imbalanced-cifar-10
    Explore at:
    zip (807146485 bytes)
    Dataset updated
    Jun 17, 2023
    Authors
    Akhil Theerthala
    Description

    This dataset is a modified version of the classic CIFAR 10, deliberately designed to be imbalanced across its classes. The original CIFAR 10 consists of 60,000 32x32 color images in 10 classes, with 5,000 images per class in the training set. However, this dataset skews these distributions to create a more challenging environment for developing and testing machine learning algorithms. The distribution can be visualized as follows:

    Class distribution figure: https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F7862887%2Fae7643fe0e58a489901ce121dc2e8262%2FCifar_Imbalanced_data.png?generation=1686732867580792&alt=media

    The primary purpose of this dataset is to offer researchers and practitioners a platform to develop, test, and enhance algorithms' robustness when faced with class imbalances. It is especially suited for those interested in binary and multi-class imbalance learning, anomaly detection, and other relevant fields.

    The imbalance was created synthetically, maintaining the same quality and diversity of the original CIFAR 10 dataset, but with varying degrees of representation for each class. Details of the class distributions are included in the dataset's metadata.

    This dataset is beneficial for:

    • Developing and testing strategies for handling imbalanced datasets.
    • Investigating the effects of class imbalance on model performance.
    • Comparing different machine learning algorithms' performance under class imbalance.

    Usage Information:

    The dataset maintains the same format as the original CIFAR 10 dataset, making it easy to incorporate into existing projects. It is organised so that it can be loaded directly with PyTorch's ImageFolder, and it can be read in Python using popular libraries such as NumPy and PyTorch, as sketched below.
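
    For example, a minimal loading sketch (the root path below is an assumption; point it at the extracted training folder, which contains one sub-directory per class as ImageFolder expects):

    from collections import Counter

    from torchvision import datasets, transforms

    # Root path is an assumption; adjust to wherever the archive was extracted.
    train_set = datasets.ImageFolder(
      root='imbalanced-cifar-10/train',
      transform=transforms.ToTensor()
    )

    # Inspect the skewed class distribution before training.
    print(Counter(train_set.classes[i] for i in train_set.targets))

    From here the dataset can be wrapped in a torch.utils.data.DataLoader, optionally with a WeightedRandomSampler to counteract the imbalance.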

    License: This dataset follows the same license terms as the original CIFAR 10 dataset. Please refer to the official CIFAR 10 website for details.

    Acknowledgments: We want to acknowledge the creators of the CIFAR 10 dataset. Without their work and willingness to share data, this synthetic imbalanced dataset wouldn't be possible.

  2. Data from: SOMOTE_EASY: AN ALGORITHM TO TREAT THE CLASSIFICATION ISSUE IN...

    • scielo.figshare.com
    jpeg
    Updated Jun 11, 2023
    Cite
    Hugo Leonardo Pereira Rufino; Antônio Cláudio Paschoarelli Veiga; Paula Teixeira Nakamoto (2023). SOMOTE_EASY: AN ALGORITHM TO TREAT THE CLASSIFICATION ISSUE IN REAL DATABASES [Dataset]. http://doi.org/10.6084/m9.figshare.14287861.v1
    Explore at:
    jpeg
    Dataset updated
    Jun 11, 2023
    Dataset provided by
    SciELO (http://www.scielo.org/)
    Authors
    Hugo Leonardo Pereira Rufino; Antônio Cláudio Paschoarelli Veiga; Paula Teixeira Nakamoto
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ABSTRACT Most classification tools assume that the data distribution is balanced, or that misclassification costs are similar across classes. In practice, however, databases with unbalanced classes are commonplace, such as in the diagnosis of diseases, where confirmed cases are usually rare compared with the healthy population. Other examples are the detection of fraudulent calls and the detection of system intruders. In these cases, misclassifying a minority-class instance (for instance, diagnosing a person with cancer as healthy) may have more serious consequences than incorrectly classifying a majority-class instance. It is therefore important to treat databases in which unbalanced classes occur. This paper presents the SMOTE_Easy algorithm, which can classify data even when there is a high level of imbalance between the classes. To demonstrate its efficiency, it was compared with the main algorithms for treating classification problems with unbalanced data, and it was successful in nearly all tested databases.
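
    As a generic illustration of the family of techniques the paper builds on (this sketch uses standard SMOTE from the imbalanced-learn package on synthetic data, not the authors' SMOTE_Easy algorithm):

    from collections import Counter

    from imblearn.over_sampling import SMOTE
    from sklearn.datasets import make_classification

    # Synthetic stand-in for an unbalanced database: ~95% majority, ~5% minority.
    X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
    print('before:', Counter(y))

    # Oversample the minority class by interpolating between its neighbours.
    X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
    print('after: ', Counter(y_res))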

  3. Class distribution for 5-class classification.

    • plos.figshare.com
    xls
    Updated May 15, 2025
    + more versions
    Cite
    Toheeb Salahudeen; Maher Maalouf; Ibrahim (Abe) M. Elfadel; Herbert F. Jelinek (2025). Class distribution for 5-class classification. [Dataset]. http://doi.org/10.1371/journal.pone.0320955.t002
    Explore at:
    xls
    Dataset updated
    May 15, 2025
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Toheeb Salahudeen; Maher Maalouf; Ibrahim (Abe) M. Elfadel; Herbert F. Jelinek
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Depression presents a significant challenge to global mental health, often intertwined with factors including oxidative stress. Although the precise relationship with mitochondrial pathways remains elusive, recent advances in machine learning present an avenue for further investigation. This study employed advanced machine learning techniques to classify major depressive disorders based on clinical indicators and mitochondrial oxidative stress markers. Six machine learning algorithms, including Random Forest, were applied and their performance was investigated in balanced and unbalanced data sets with respect to binary and multiclass classification scenarios. Results indicate promising accuracy and precision, particularly with Random Forest on balanced data. RF achieved an average accuracy of 92.7% and an F1 score of 83.95% for binary classification, 90.36% and 90.1%, respectively, for the classification of three classes of severity of depression and 89.76% and 88.26%, respectively, for the classification of five classes. Including only oxidative stress markers resulted in accuracy and an F1 score of 79.52% and 80.56%, respectively. Notably, including mitochondrial peptides alongside clinical factors significantly enhances predictive capability, shedding light on the interplay between depression severity and mitochondrial oxidative stress pathways. These findings underscore the potential for machine learning models to aid clinical assessment, particularly in individuals with comorbid conditions such as hypertension, diabetes mellitus, and cardiovascular disease.

  4. Multi-class Weather Dataset

    • kaggle.com
    zip
    Updated Jul 26, 2020
    + more versions
    Cite
    Prateek Srivastava (2020). Multi-class Weather Dataset [Dataset]. https://www.kaggle.com/pratik2901/multiclass-weather-dataset
    Explore at:
    zip (95798762 bytes)
    Dataset updated
    Jul 26, 2020
    Authors
    Prateek Srivastava
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Multi-class Weather Dataset for Image Classification

    The multi-class weather dataset (MWD) for image classification is a valuable dataset used in the research paper entitled “Multi-class weather recognition from the still image using heterogeneous ensemble method”.

    Inspiration behind the dataset

    The dataset provides a platform for outdoor weather analysis by extracting various features for recognizing different weather conditions.

    Please note that the folder structure of the dataset has been updated to facilitate the data loading procedure.

    Class Distribution

    • Sunrise: 357 images
    • Shine: 253 images
    • Rain: 215 images
    • Cloudy: 300 images

    Data Publication

    The dataset was published on Mendeley Data

    Cite: Ajayi, Gbeminiyi (2018), Multi-class Weather Dataset for Image Classification, Mendeley Data, v1

    DOI

    http://dx.doi.org/10.17632/4drtyfjtfy.1

    Published:

    2018-09-13

    Institutions:

    University of South Africa - Science Campus

    Tags:

    • Earth and Nature
    • Multiclass Classification
    • Image Processing
    • Deep Learning
    • Machine Learning
    • Computer Vision

    Licence

    CC BY 4.0

  5. Data from: Dataset of lightning flashovers on medium voltage distribution...

    • data.niaid.nih.gov
    Updated Jan 31, 2023
    Cite
    Sarajcev, Petar (2023). Dataset of lightning flashovers on medium voltage distribution lines [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7382547
    Explore at:
    Dataset updated
    Jan 31, 2023
    Dataset provided by
    University of Split, FESB
    Authors
    Sarajcev, Petar
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This synthetic dataset was generated from Monte Carlo simulations of lightning flashovers on medium voltage (MV) distribution lines. It is suitable for training machine learning models for classifying lightning flashovers on distribution lines. The dataset is hierarchical in nature (see below for more information) and class imbalanced.

    The following five types of lightning interaction with the MV distribution line have been simulated: (1) direct strike to a phase conductor (when there is no shield wire present on the line), (2) direct strike to a phase conductor with shield wire(s) present on the line (i.e. shielding failure), (3) direct strike to a shield wire with a backflashover event, (4) indirect near-by lightning strike to ground where a shield wire is not present, and (5) indirect near-by lightning strike to ground where a shield wire is present on the line. The last two types of lightning interaction induce overvoltages on the phase conductors through the EM fields radiated by the strike channel, which couple to the line conductors. Three different methods of indirect strike analysis have been implemented: Rusck's model, the Chowdhuri-Gross model and the Liew-Mar model. Shield wire(s) provide shielding effects against direct lightning strikes, as well as screening effects against indirect ones.

    The dataset covers two independent distribution lines, with heights of 12 m and 15 m, each with a flat configuration of phase conductors. Twin shield wires, if present, are 1.5 m above the phase conductors and 3 m apart [2]. The CFO level of the 12 m distribution line is 150 kV and that of the 15 m distribution line is 160 kV. The dataset comprises 10,000 simulations for each of the distribution lines.

    The dataset contains the following variables (features):

    'dist': perpendicular distance of the lightning strike location from the distribution line axis (m), generated from the Uniform distribution [0, 500] m,

    'ampl': lightning current amplitude of the strike (kA), generated from the Log-Normal distribution (see IEC 60071 for additional information),

    'front': lightning current wave-front time (us), generated from the Log-Normal distribution; it needs to be emphasized that amplitudes (ampl) and wave-front times (front), as random variables, have been generated from the appropriate bivariate probability distribution which includes statistical correlation between these variates,

    'veloc': velocity of the lightning return-stroke current defined indirectly through the parameter "w" that is generated from the Uniform distribution [50, 500] m/us, which is then used for computing the velocity from the following relation: v = c/sqrt(1+w/I), where "c" is the speed of light in free space (300 m/us) and "I" is the lightning-current amplitude,

    'shield': binary indicator that signals presence or absence of the shield wire(s) on the line (0/1), generated from the Bernoulli distribution with a 50% probability,

    'Ri': average value of the impulse impedance of the tower's grounding (Ohm), generated from the Normal distribution (clipped at zero on the left side) with a median value of 50 Ohm and a standard deviation of 12.5 Ohm; it should be mentioned that the impulse impedance is often much larger than the associated grounding resistance value, which is why a rather high value of 50 Ohm has been used here,

    'EGM': electrogeometric model used for analyzing striking distances of the distribution line's tower; the following options are available: 'Wagner', 'Young', 'AW', 'BW', 'Love', and 'Anderson', where 'AW' stands for Armstrong & Whitehead and 'BW' for Brown & Whitehead; the statistical distribution of EGM models follows a user-defined discrete categorical distribution with respective probabilities p = [0.1, 0.2, 0.1, 0.1, 0.3, 0.2],

    'ind': indirect stroke model used for analyzing near-by indirect lightning strikes; the following options were implemented: 'rusk' for the Rusck model, 'chow' for the Chowdhuri-Gross model (with Jakubowski modification) and 'liew' for the Liew-Mar model; the statistical distribution of these three models follows a user-defined discrete categorical distribution with respective probabilities p = [0.6, 0.2, 0.2],

    'CFO': critical flashover voltage level of the distribution line's insulation (kV),

    'height': height of the phase conductors of the distribution line (m),

    'flash': binary indicator that signals if the flashover has been recorded (1) or not (0). This variable is the outcome/label (i.e. binary class).
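
    A minimal inspection sketch for this dataset (the CSV file name is an assumption; the velocity relation is the one quoted in the feature list above):

    import numpy as np
    import pandas as pd

    # Assumed export of one simulation table, one row per Monte Carlo run.
    df = pd.read_csv('mv_line_simulations.csv')

    # 'flash' is the binary outcome label; check how imbalanced it is.
    print(df['flash'].value_counts(normalize=True))

    # Return-stroke velocity as generated above: v = c / sqrt(1 + w / I),
    # with c = 300 m/us, w ~ Uniform(50, 500) and I the amplitude 'ampl'.
    w = np.random.default_rng(0).uniform(50, 500, size=len(df))
    veloc = 300.0 / np.sqrt(1.0 + w / df['ampl'].to_numpy())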

    Mathematical background used for the analysis of lightning interaction with the MV distribution line can be found in the references cited below.

    References:

    A. R. Hileman, "Insulation Coordination for Power Systems", CRC Press, Boca Raton, FL, 1999.

    J. A. Martinez and F. Gonzalez-Molina, "Statistical evaluation of lightning overvoltages on overhead distribution lines using neural networks," in IEEE Transactions on Power Delivery, vol. 20, no. 3, pp. 2219-2226, July 2005.

    A. Borghetti, C. A. Nucci and M. Paolone, "An Improved Procedure for the Assessment of Overhead Line Indirect Lightning Performance and Its Comparison with the IEEE Std. 1410 Method," IEEE Transactions on Power Delivery, vol. 22, no. 1, pp. 684-692, 2007.

  6. fruit quality dataset

    • kaggle.com
    zip
    Updated Jan 25, 2024
    Cite
    Mohamad Alhammoud (2024). fruit quality dataset [Dataset]. https://www.kaggle.com/datasets/mohamadalhammoud/fruit-quality-dataset
    Explore at:
    zip (938456722 bytes)
    Dataset updated
    Jan 25, 2024
    Authors
    Mohamad Alhammoud
    License

    Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Fruit Quality Detection Dataset

    This dataset is meticulously curated to facilitate the training of machine learning models, such as YOLOv8, for fruit quality detection. It includes labeled images of fruits classified into categories such as 'bad apple', 'bad banana', 'bad orange', 'bad pomegranate', 'good apple', 'good banana', 'good orange', and 'good pomegranate'.

    Dataset Versions and Updates:

    • Version 1: Initial Release. Sourced from Roboflow under the title "Rotten Fruit Detector ver 2 Computer Vision Project", this initial version required minimal modifications. The key change was an update to the data.yaml file, in which the names list was shifted down by one index (the first entry was deleted) and the labels were updated accordingly; a small consistency-check sketch is given after this version list. The dataset comprises 3,078 training images (70%), 878 validation images (20%), and 442 test images (10%). This version suffered from an unbalanced class distribution, as illustrated in the distribution graph below:

    https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F14431819%2F4503fc6ecba32d263eb72d6471dcbeb4%2Fversion%201.png?generation=1712755938967609&alt=media

    • Version 4: Data Augmentation. To address the imbalance, several augmentation techniques were applied:

      • 90-degree rotations (none, clockwise, counter-clockwise, upside-down).
      • Random rotations between -5 and +5 degrees.
      • Exposure adjustments from -10 to +10 percent.
      • Gaussian blur ranging from 0 to 1 pixel.

      These modifications improved the balance slightly, reflected in the revised counts of 8,318 training images (85%), 924 validation images (10%), and 438 test images (5%), and in the updated distribution graph:

    https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F14431819%2Fb472a450991aba5f5ef8d715f3fa0831%2Fversion%202.png?generation=1712756217334823&alt=media

    • Version 5: Further Balancing. Further enhancements were made to improve data balance. This latest version consists of 6,570 training images (85%), 730 validation images (10%), and 438 test images (5%). The distribution of these images has been optimized for a more balanced dataset, as shown in the graph below:

    https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F14431819%2F88bd8e95f9ba87894985a969b216f3aa%2Fversion%203.png?generation=1712756424447707&alt=media
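
    A small consistency check related to the data.yaml fix described under Version 1 (the paths and the names key follow the usual YOLO data-config layout and are assumptions here):

    from pathlib import Path

    import yaml  # PyYAML

    with open('data.yaml') as f:      # path assumed relative to the dataset root
      cfg = yaml.safe_load(f)
    names = cfg['names']

    # YOLO label files store the class id as the first field of every row.
    used_ids = {
      int(line.split()[0])
      for label_file in Path('train/labels').glob('*.txt')  # layout assumed
      for line in label_file.read_text().splitlines()
      if line.strip()
    }
    assert used_ids <= set(range(len(names))), 'label ids outside the names list'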

  7. Sport and leisure facilities

    • data.opendatascience.eu
    Updated Jan 2, 2021
    Cite
    (2021). Sport and leisure facilities [Dataset]. https://data.opendatascience.eu/geonetwork/srv/search?type=dataset
    Explore at:
    Dataset updated
    Jan 2, 2021
    Description

    Overview: 142: Areas used for sports, leisure and recreation purposes.
    Traceability (lineage): This dataset was produced with a machine learning framework with several input datasets, specified in detail in Witjes et al., 2022 (in review, preprint available at https://doi.org/10.21203/rs.3.rs-561383/v3 ).
    Scientific methodology: The single-class probability layers were generated with a spatiotemporal ensemble machine learning framework detailed in Witjes et al., 2022 (in review, preprint available at https://doi.org/10.21203/rs.3.rs-561383/v3 ). The single-class uncertainty layers were calculated by taking the standard deviation of the three single-class probabilities predicted by the three components of the ensemble. The HCL (hard class) layers represent the class with the highest probability as predicted by the ensemble.
    Usability: The HCL layers have a decreasing average accuracy (weighted F1-score) at each subsequent level in the CLC hierarchy. These metrics are 0.83 at level 1 (5 classes), 0.63 at level 2 (14 classes), and 0.49 at level 3 (43 classes). This means that the hard-class maps are more reliable when aggregating classes to a higher level in the hierarchy (e.g. 'Discontinuous Urban Fabric' and 'Continuous Urban Fabric' to 'Urban Fabric'). Some single-class probabilities may more closely represent actual patterns for some classes that were overshadowed by unequal sample point distributions. Users are encouraged to set their own thresholds when postprocessing these datasets to optimize the accuracy for their specific use case.
    Uncertainty quantification: Uncertainty is quantified by taking the standard deviation of the probabilities predicted by the three components of the spatiotemporal ensemble model.
    Data validation approaches: The LULC classification was validated through spatial 5-fold cross-validation as detailed in the accompanying publication.
    Completeness: The dataset has chunks of empty predictions in regions with complex coastlines (e.g. the Zeeland province in the Netherlands and the Mar da Palha bay area in Portugal). These are artifacts that will be avoided in subsequent versions of the LULC product.
    Consistency: The accuracy of the predictions was compared per year and per 30 km * 30 km tile across Europe to derive temporal and spatial consistency by calculating the standard deviation. The standard deviation of the annual weighted F1-score was 0.135, while the standard deviation of the weighted F1-score per tile was 0.150. This means the dataset is more consistent through time than through space: predictions are notably less accurate along the Mediterranean coast. The accompanying publication contains additional information and visualisations.
    Positional accuracy: The raster layers have a resolution of 30 m, identical to that of the Landsat data cube used as input features for the machine learning framework that predicted it.
    Temporal accuracy: The dataset contains predictions and uncertainty layers for each year between 2000 and 2019.
    Thematic accuracy: The maps reproduce the Corine Land Cover classification system, a hierarchical legend that consists of 5 classes at the highest level, 14 classes at the second level, and 44 classes at the third level. Class 523: Oceans was omitted due to computational constraints.
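
    A minimal numpy sketch of the hard-class and uncertainty derivation described above (the array shape is an assumption; probs stacks the per-class probability rasters predicted by the three ensemble members):

    import numpy as np

    # probs: (3 ensemble members, n_classes, height, width), values in [0, 1].
    probs = np.random.default_rng(0).random((3, 43, 256, 256))

    uncertainty = probs.std(axis=0)          # per-class uncertainty layers
    hcl = probs.mean(axis=0).argmax(axis=0)  # hard class = most probable class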

  8. Data_Sheet_1_Mild cognitive impairment prediction and cognitive score...

    • datasetcatalog.nlm.nih.gov
    Updated Jan 4, 2024
    Cite
    Rutkowski, Tomasz M.; Otake-Matsuura, Mihoko; Komendziński, Tomasz (2024). Data_Sheet_1_Mild cognitive impairment prediction and cognitive score regression in the elderly using EEG topological data analysis and machine learning with awareness assessed in affective reminiscent paradigm.PDF [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001273208
    Explore at:
    Dataset updated
    Jan 4, 2024
    Authors
    Rutkowski, Tomasz M.; Otake-Matsuura, Mihoko; Komendziński, Tomasz
    Description

    Introduction: The main objective of this study is to evaluate working memory and determine EEG biomarkers that can assist in the field of health neuroscience. Our ultimate goal is to utilize this approach to predict the early signs of mild cognitive impairment (MCI) in healthy elderly individuals, which could potentially lead to dementia. The advancements in health neuroscience research have revealed that affective reminiscence stimulation is an effective method for developing EEG-based neuro-biomarkers that can detect the signs of MCI. Methods: We use topological data analysis (TDA) on multivariate EEG data to extract features that can be used for unsupervised clustering, subsequent machine learning-based classification, and cognitive score regression. We perform EEG experiments to evaluate conscious awareness in affective reminiscent photography settings. Results: We use EEG and interior photography to distinguish between healthy cognitive aging and MCI. Our clustering UMAP and random forest application accurately predict MCI stage and MoCA scores. Discussion: Our team has successfully implemented TDA feature extraction, MCI classification, and an initial regression of MoCA scores. However, our study has certain limitations due to a small sample size of only 23 participants and an unbalanced class distribution. To enhance the accuracy and validity of our results, future research should focus on expanding the sample size, ensuring gender balance, and extending the study to a cross-cultural context.

  9. Waste Classfication Dataset

    • kaggle.com
    Updated Jun 15, 2025
    Cite
    Kaan Çerkez (2025). Waste Classfication Dataset [Dataset]. https://www.kaggle.com/datasets/kaanerkez/waste-classfication-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Jun 15, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Kaan Çerkez
    License

    CDLA Permissive 1.0 (https://cdla.io/permissive-1-0/)

    Description

    Balanced Waste Classification Dataset - E-Waste & Mixed Materials

    🎯 Dataset Overview

    This dataset contains a comprehensive collection of waste images designed for training machine learning models to classify different types of waste materials, with a strong focus on electronic waste (e-waste) and mixed materials. The dataset includes 7 electronic device categories alongside traditional recyclable materials, making it ideal for modern waste management challenges where electronic devices constitute a significant portion of waste streams. The dataset has been carefully curated and balanced to ensure optimal performance for multi-category waste classification tasks using deep learning approaches.

    📊 Dataset Statistics

    • Total Classes: 17 different waste categories
    • Images per Class: 400 (balanced)
    • Total Images: 6,800
    • Image Format: RGB (3 channels)
    • Recommended Input Size: 224×224 pixels
    • Data Structure: Single balanced dataset (not pre-split)

    🗂️ Waste Categories

    The dataset includes 17 distinct waste categories covering various types of materials commonly found in waste management scenarios:

    1. Battery - Various types of batteries
    2. Cardboard - Cardboard packaging and boxes
    3. Glass - Glass containers and bottles
    4. Keyboard - Computer keyboards and input devices
    5. Metal - Metal cans and metallic waste
    6. Microwave - Microwave ovens and similar appliances
    7. Mobile - Mobile phones and smartphones
    8. Mouse - Computer mice and peripherals
    9. Organic - Biodegradable organic waste
    10. Paper - Paper products and documents
    11. PCB - Printed Circuit Boards (electronic components)
    12. Plastic - Plastic containers and packaging
    13. Player - Media players and entertainment devices
    14. Printer - Printers and printing equipment
    15. Television - TV sets and display devices
    16. Trash - General mixed waste
    17. Washing Machine - Washing machines and large appliances

    🛠️ Data Processing Pipeline

    1. Data Balancing

    • Undersampling: Applied to classes with >400 images
    • Data Augmentation: Applied to classes with <400 images
    • Target: Exactly 400 images per class for balanced training

    2. Data Augmentation Techniques

    • Rotation: ±20 degrees
    • Width/Height Shift: ±20%
    • Shear Range: 20%
    • Zoom Range: 20%
    • Horizontal Flip: Enabled
    • Fill Mode: Nearest neighbor

    3. Quality Assurance

    • Consistent image dimensions
    • Proper file format validation
    • Balanced class distribution
    • Clean data structure

    🎯 Recommended Use Cases

    Primary Applications

    • E-Waste Classification: Specialized in electronic devices (Mobile, Keyboard, Mouse, PCB, etc.)
    • Mixed Waste Sorting: Traditional recyclables (Paper, Plastic, Glass, Metal, Cardboard)
    • Smart Recycling Systems: Automated waste sorting for both organic and electronic materials
    • Environmental Monitoring: Multi-category waste identification
    • Appliance Recycling: Large appliance classification (Microwave, TV, Washing Machine)

    Special Features

    • Electronic Waste Focus: Strong representation of e-waste categories (7 out of 17 classes)
    • Diverse Material Types: From organic waste to complex electronic devices
    • Real-world Categories: Practical classification for actual waste management scenarios
    • Appliance Recognition: Specialized in identifying large household appliances

    Model Architectures

    • Convolutional Neural Networks (CNN)
    • Transfer Learning with MobileNetV2, ResNet, EfficientNet (a MobileNetV2 sketch follows the Getting Started code below)
    • Vision Transformers (ViT)
    • Custom architectures for waste classification

    📁 Dataset Structure

    balanced_waste_images/
    ├── category_1/
    │  ├── image_001.jpg
    │  ├── image_002.jpg
    │  └── ... (400 images)
    ├── category_2/
    │  ├── image_001.jpg
    │  └── ... (400 images)
    └── ... (17 categories total)
    

    Note: Dataset is not pre-split. Users need to create train/validation/test splits as needed.

    🚀 Getting Started

    Step 1: Data Splitting

    Since the dataset is not pre-split, you'll need to create train/validation/test splits:

    import splitfolders
    
    # Split dataset: 80% train, 10% val, 10% test
    splitfolders.ratio(
      input='balanced_waste_images', 
      output='split_data',
      seed=42, 
      ratio=(.8, .1, .1),
      group_prefix=None,
      move=False
    )
    

    Step 2: Data Loading & Preprocessing

    from tensorflow.keras.preprocessing.image import ImageDataGenerator
    
    # Data generators with preprocessing
    train_datagen = ImageDataGenerator(rescale=1./255)
    val_datagen = ImageDataGenerator(rescale=1./255)
    
    train_generator = train_datagen.flow_from_directory(
      'split_data/train/',
      target_size=(224, 224),
      batch_size=32,
      class_mode='categorical'
    )
    
    val_generator = val_datagen.flow_from_directory(
      'split_data/val/',
      target_size=(224, 224),
      batch_size=32,
      class_mode='categorical'
    )
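
    As a hedged follow-on sketch of the transfer-learning option listed under Model Architectures (MobileNetV2 backbone; the hyperparameters are illustrative assumptions, not part of the dataset description), reusing the generators from Step 2:

    import tensorflow as tf

    base = tf.keras.applications.MobileNetV2(
      input_shape=(224, 224, 3), include_top=False, weights='imagenet'
    )
    base.trainable = False  # freeze the pretrained backbone

    model = tf.keras.Sequential([
      base,
      tf.keras.layers.GlobalAveragePooling2D(),
      tf.keras.layers.Dense(17, activation='softmax')  # 17 waste categories
    ])
    model.compile(optimizer='adam',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])

    # Train with the generators created in Step 2.
    model.fit(train_generator, validation_data=val_generator, epochs=5)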
    
  10. Edelweiss Image Dataset

    • kaggle.com
    zip
    Updated Jun 19, 2022
    Cite
    Fransiscus Rolanda Malau (2022). Edelweiss Image Dataset [Dataset]. https://www.kaggle.com/datasets/ndomalau/edelweis-flower
    Explore at:
    zip (12912266177 bytes)
    Dataset updated
    Jun 19, 2022
    Authors
    Fransiscus Rolanda Malau
    License

    MIT License, https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Context

    Image classification is one of the fundamental tasks in computer vision and machine learning. High-quality datasets are crucial for training robust models that can accurately identify different species. This dataset focuses on three distinct species commonly found in mountainous regions, providing a balanced collection of images for both training and evaluation purposes.

    Content

    This dataset contains 4,550 high-quality images distributed across three categories:

    • Training set: 3,500 images (approximately 1,167 images per class)
    • Test set: 1,050 images (350 images per class)

    The dataset is organized in a structured format with separate directories for:

    1. Anaphalis Javanica
    2. Leontopodium Alpinum
    3. Leucogenes Grandiceps

    Each image in the dataset has been carefully prepared to ensure consistency and quality for machine learning applications. The balanced distribution between classes helps prevent bias during model training.

    Applications

    • Species classification and identification
    • Computer vision model development
    • Educational purposes in botany and biodiversity studies
    • Benchmarking machine learning algorithms

    The dataset's clean split between training and test sets makes it ideal for developing and evaluating classification models while following machine learning best practices.

  11. Presence-Absence Points for Tree Species Distribution Modelling for Europe

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip, bin +2
    Updated Jul 17, 2024
    Cite
    Carmelo Bonannella; Tomislav Hengl; Johannes Heisig; Leandro Leal Parente; Marvin Wright; Martin Herold; Sytze de Bruin (2024). Presence-Absence Points for Tree Species Distribution Modelling for Europe [Dataset]. http://doi.org/10.5281/zenodo.5818022
    Explore at:
    bin, pdf, png, application/gzip
    Dataset updated
    Jul 17, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Carmelo Bonannella; Tomislav Hengl; Johannes Heisig; Leandro Leal Parente; Marvin Wright; Martin Herold; Sytze de Bruin
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Europe
    Description

    The dataset is a collection of presence and absence points for forest tree species for Europe. Each unique combination of longitude, latitude and year was considered as an independent sample. Presence data was obtained from the harmonized tree species occurrence dataset by Heising and Hengl (2020) and absence data from the LUCAS (in-situ source) dataset.

    A set of 50 different forest tree species was selected from the harmonized tree species dataset and data lacking a temporal observation was overlaid with yearly forest masks derived from land cover maps produced by Parente et al. (2021). We overlaid the points with the probability maps for the classes:

    • 311: Broad-leaved forest,
    • 312: Coniferous forest,
    • 313: Mixed forest,
    • 323: Sclerophyllous forest,
    • 324: Transitional woodland-shrub,
    • 333: Sparsely vegetated area.

    Points were included in the dataset only if the probability value extracted for at least one of the above classes was ≥ 50% for all the years considered. An additional quality flag was added to distinguish points coming from this operation and the points with original year of observation coming from source datasets.

    The final dataset contains 4,359,999 observations and a total of 630 columns.

    The first 8 columns of the dataset contain metadata information used to uniquely identify the points:

    • id: unique point identifier,
    • year: year of observation,
    • postprocess: quality flag to identify if the temporal reference of an observation comes from the original dataset or is the result of spatiotemporal overlay with forest masks,
    • Tile_ID: contains the tile id from a 30 km grid,
    • easting: longitude coordinates in Coordinate Reference System ETRS89 / LAEA Europe (= EPSG code 3035),
    • northing: latitude coordinates in Coordinate Reference System ETRS89 / LAEA Europe (= EPSG code 3035),
    • Atlas_class: name of the tree species according to the European Atlas of Forest Tree Species or NULL in case of absence point,
    • lc1: contains original LUCAS land cover class or NULL if it's a presence point.

    The remaining columns contain the extracted values of a series of predictor variables (temperature, precipitation, elevation, topographical information, spectral reflectance) useful for species distribution modeling applications. These points were used to model the potential and realized distribution of a series of 16 target species for the period 2000 - 2020. The approach involved training three ML models to predict probability of presence (i.e. Random Forest, XGBoost, GLM), which served as input to train a linear meta-model (i.e. Logistic regression classifier), responsible for predicting the final probability of presence for each species.
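
    A minimal scikit-learn sketch of the stacking setup described above (GradientBoostingClassifier stands in for XGBoost and plain LogisticRegression plays the GLM role; the data here is synthetic, not the actual predictor matrix):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                                  StackingClassifier)
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Synthetic stand-in for the predictor matrix and presence/absence labels.
    X, y = make_classification(n_samples=5000, n_features=20, weights=[0.8, 0.2], random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    stack = StackingClassifier(
      estimators=[
        ('rf', RandomForestClassifier(n_estimators=200, random_state=0)),
        ('gbm', GradientBoostingClassifier(random_state=0)),  # stand-in for XGBoost
        ('glm', LogisticRegression(max_iter=1000)),
      ],
      final_estimator=LogisticRegression(max_iter=1000),      # linear meta-model
    )
    stack.fit(X_train, y_train)
    presence_probability = stack.predict_proba(X_test)[:, 1]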

    The 10 most important variables used by each of the three base models are available in the "variable importance" plots for both potential and realized distribution in a PDF format.

    The RDS file is created from a data.table object and suitable for fast reading in the R-programming environment. The CSV.GZ file contains records as a table with easting and northing in Coordinate Reference System ETRS89 / LAEA Europe (= EPSG code 3035) and can be fed in a GIS after being unzipped.

    To access the predictions of the meta-model (probabilities and uncertainties) produced for these species access:

    If you would like to know more about the creation of this dataset and the modeling, watch the talk at Open Data Science Workshop 2021 (TIB AV-PORTAL)

    A publication describing, in detail, all processing steps, accuracy assessment and general analysis of species distribution maps is under preparation. To suggest any improvement/fix use https://gitlab.com/geoharmonizer_inea/spatial-layers/-/issues.

  12. Lemon Leaf Classification Data Set.

    • kaggle.com
    zip
    Updated Sep 14, 2024
    + more versions
    Cite
    Ankur Ray Chayan (2024). Lemon Leaf Classification Data Set. [Dataset]. https://www.kaggle.com/datasets/ankurray00/lemon-leaf-class-classification-data-set
    Explore at:
    zip (8901539069 bytes)
    Dataset updated
    Sep 14, 2024
    Authors
    Ankur Ray Chayan
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Lemon Leaf Classification Dataset: A carefully curated dataset consisting of five distinct classes of lemon leaves, designed for robust image classification tasks. Each class represents unique variations in leaf characteristics, including shape, texture, and disease conditions. This dataset is ideal for developing and testing machine learning and deep learning models, contributing to the advancement of agricultural research. The balanced class distribution ensures a reliable foundation for classification models, enhancing precision in identifying different types of lemon leaves and promoting disease detection.

  13. Outline of class distribution in the dataset.

    • plos.figshare.com
    xls
    Updated Jun 14, 2023
    Cite
    Lucas Teoh; Achintha Avin Ihalage; Srooley Harp; Zahra F. Al-Khateeb; Adina T. Michael-Titus; Jordi L. Tremoleda; Yang Hao (2023). Outline of class distribution in the dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0268962.t001
    Explore at:
    xls
    Dataset updated
    Jun 14, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Lucas Teoh; Achintha Avin Ihalage; Srooley Harp; Zahra F. Al-Khateeb; Adina T. Michael-Titus; Jordi L. Tremoleda; Yang Hao
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Outline of class distribution in the dataset.

  14. Data from: Data augmentation for disruption prediction via robust surrogate...

    • dataverse.harvard.edu
    • osti.gov
    Updated Aug 31, 2024
    Cite
    Katharina Rath, David Rügamer, Bernd Bischl, Udo von Toussaint, Cristina Rea, Andrew Maris, Robert Granetz, Christopher G. Albert (2024). Data augmentation for disruption prediction via robust surrogate models [Dataset]. http://doi.org/10.7910/DVN/FMJCAD
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Aug 31, 2024
    Dataset provided by
    Harvard Dataverse
    Authors
    Katharina Rath, David Rügamer, Bernd Bischl, Udo von Toussaint, Cristina Rea, Andrew Maris, Robert Granetz, Christopher G. Albert
    License

    CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    The goal of this work is to generate large statistically representative datasets to train machine learning models for disruption prediction provided by data from few existing discharges. Such a comprehensive training database is important to achieve satisfying and reliable prediction results in artificial neural network classifiers. Here, we aim for a robust augmentation of the training database for multivariate time series data using Student-t process regression. We apply Student-t process regression in a state space formulation via Bayesian filtering to tackle challenges imposed by outliers and noise in the training data set and to reduce the computational complexity. Thus, the method can also be used if the time resolution is high. We use an uncorrelated model for each dimension and impose correlations afterwards via coloring transformations. We demonstrate the efficacy of our approach on plasma diagnostics data of three different disruption classes from the DIII-D tokamak. To evaluate if the distribution of the generated data is similar to the training data, we additionally perform statistical analyses using methods from time series analysis, descriptive statistics, and classic machine learning clustering algorithms.

  15. Finance-sensitivity-LLM-fintuning

    • kaggle.com
    zip
    Updated Nov 14, 2023
    Cite
    Aditya Mac (2023). Finance-sensitivity-LLM-fintuning [Dataset]. https://www.kaggle.com/datasets/adityamac/finance-sensitivity-llm-fintuning/code
    Explore at:
    zip (554864 bytes)
    Dataset updated
    Nov 14, 2023
    Authors
    Aditya Mac
    License

    Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This is a collection of tweets related to financial institutions such as banks, credit unions, and other financial service providers. Each tweet has been manually labeled with its corresponding sentiment, either positive, negative, or neutral. The dataset contains a mix of complaints and praises directed towards these financial institutions, providing a balanced perspective on public opinion.

    Data Quality: The dataset has been carefully curated to ensure high-quality data. Tweets with incomplete or ambiguous information were excluded, and special attention was given to ensuring that the labels accurately reflect the sentiment expressed in the corresponding tweet.

    Use Cases: This dataset can be used for various applications, such as:

    • Sentiment analysis research: The dataset provides a rich resource for studying the opinions and perceptions people hold towards financial institutions. Researchers can use it to investigate factors influencing sentiment, compare sentiments across different demographics or institutions, and analyze trends over time.
    • Machine learning model development: With its balanced class distribution, the dataset offers an excellent opportunity to train and evaluate machine learning models for sentiment classification tasks. Models developed using this dataset can potentially achieve high accuracy and generalize well to new, unseen data.
    • Business intelligence: Financial institutions can leverage insights gained from the dataset to identify areas where they excel or struggle in terms of customer satisfaction. By analyzing the feedback expressed in the tweets, institutions can improve their services, address common concerns, and enhance overall customer experience.

    Overall, this dataset represents a valuable asset for anyone interested in exploring the complex dynamics of public sentiment towards financial institutions. Its diverse range of opinions and topics provides a fertile ground for research, model development, and practical applications in the finance industry.

  16. Table_1_Comparison of Resampling Techniques for Imbalanced Datasets in...

    • datasetcatalog.nlm.nih.gov
    • frontiersin.figshare.com
    Updated Nov 19, 2021
    Cite
    Varotto, Giulia; Franceschetti, Silvana; Susi, Gianluca; Tassi, Laura; Panzica, Ferruccio; Gozzo, Francesca (2021). Table_1_Comparison of Resampling Techniques for Imbalanced Datasets in Machine Learning: Application to Epileptogenic Zone Localization From Interictal Intracranial EEG Recordings in Patients With Focal Epilepsy.DOCX [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000861498
    Explore at:
    Dataset updated
    Nov 19, 2021
    Authors
    Varotto, Giulia; Franceschetti, Silvana; Susi, Gianluca; Tassi, Laura; Panzica, Ferruccio; Gozzo, Francesca
    Description

    Aim: In neuroscience research, data are quite often characterized by an imbalanced distribution between the majority and minority classes, an issue that can limit or even worsen the prediction performance of machine learning methods. Different resampling procedures have been developed to face this problem and a lot of work has been done in comparing their effectiveness in different scenarios. Notably, the robustness of such techniques has been tested among a wide variety of different datasets, without considering the performance of each specific dataset. In this study, we compare the performances of different resampling procedures for the imbalanced domain in stereo-electroencephalography (SEEG) recordings of the patients with focal epilepsies who underwent surgery. Methods: We considered data obtained by network analysis of interictal SEEG recorded from 10 patients with drug-resistant focal epilepsies, for a supervised classification problem aimed at distinguishing between the epileptogenic and non-epileptogenic brain regions in interictal conditions. We investigated the effectiveness of five oversampling and five undersampling procedures, using 10 different machine learning classifiers. Moreover, six specific ensemble methods for the imbalanced domain were also tested. To compare the performances, Area under the ROC curve (AUC), F-measure, Geometric Mean, and Balanced Accuracy were considered. Results: Both the resampling procedures showed improved performances with respect to the original dataset. The oversampling procedure was found to be more sensitive to the type of classification method employed, with Adaptive Synthetic Sampling (ADASYN) exhibiting the best performances. All the undersampling approaches were more robust than the oversampling among the different classifiers, with Random Undersampling (RUS) exhibiting the best performance despite being the simplest and most basic classification method. Conclusions: The application of machine learning techniques that take into consideration the balance of features by resampling is beneficial and leads to more accurate localization of the epileptogenic zone from interictal periods. In addition, our results highlight the importance of the type of classification method that must be used together with the resampling to maximize the benefit to the outcome.
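
    As a generic illustration of the two procedures highlighted above (ADASYN oversampling and random undersampling), using the imbalanced-learn package on synthetic data rather than the SEEG features:

    from imblearn.over_sampling import ADASYN
    from imblearn.under_sampling import RandomUnderSampler
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import balanced_accuracy_score
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=3000, weights=[0.9, 0.1], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

    for name, sampler in [('ADASYN', ADASYN(random_state=0)),
                          ('RUS', RandomUnderSampler(random_state=0))]:
      X_res, y_res = sampler.fit_resample(X_tr, y_tr)  # resample the training fold only
      clf = RandomForestClassifier(random_state=0).fit(X_res, y_res)
      print(name, balanced_accuracy_score(y_te, clf.predict(X_te)))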

  17. MSL Curiosity Rover Images with Science and Engineering Classes

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Sep 17, 2020
    Cite
    Steven Lu; Kiri L. Wagstaff (2020). MSL Curiosity Rover Images with Science and Engineering Classes [Dataset]. http://doi.org/10.5281/zenodo.4033453
    Explore at:
    zip
    Dataset updated
    Sep 17, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Steven Lu; Kiri L. Wagstaff
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Please note that the file msl-labeled-data-set-v2.1.zip below contains the latest images and labels associated with this data set.

    Data Set Description

    The data set consists of 6,820 images that were collected by the Mars Science Laboratory (MSL) Curiosity Rover by three instruments: (1) the Mast Camera (Mastcam) Left Eye; (2) the Mast Camera Right Eye; (3) the Mars Hand Lens Imager (MAHLI). With the help from Dr. Raymond Francis, a member of the MSL operations team, we identified 19 classes with science and engineering interests (see the "Classes" section for more information), and each image is assigned with 1 class label. We split the data set into training, validation, and test sets in order to train and evaluate machine learning algorithms. The training set contains 5,920 images (including augmented images; see the "Image Augmentation" section for more information); the validation set contains 300 images; the test set contains 600 images. The training set images were randomly sampled from sol (Martian day) range 1 - 948; validation set images were randomly sampled from sol range 949 - 1920; test set images were randomly sampled from sol range 1921 - 2224. All images are resized to 227 x 227 pixels without preserving the original height/width aspect ratio.

    Directory Contents

    • images - contains all 6,820 images
    • class_map.csv - string-integer class mappings
    • train-set-v2.1.txt - label file for the training set
    • val-set-v2.1.txt - label file for the validation set
    • test-set-v2.1.txt - label file for the test set

    The label files are formatted as below:

    "Image-file-name class_in_integer_representation"

    Labeling Process

    Each image was labeled with help from three different volunteers (see Contributor list). The final labels are determined using the following processes:

    • If all three labels agree with each other, then use the label as the final label.
    • If the three labels do not agree with each other, then we manually review the labels and decide the final label.
    • We also performed error analysis to correct labels as a post-processing step in order to remove noisy/incorrect labels in the data set.

    Classes

    There are 19 classes identified in this data set. In order to simplify our training and evaluation algorithms, we mapped the class names from string to integer representations. The names of classes, string-integer mappings, distributions are shown below:

    Class name, counts (training set), counts (validation set), counts (test set), integer representation

    Arm cover, 10, 1, 4, 0

    Other rover part, 190, 11, 10, 1

    Artifact, 680, 62, 132, 2

    Nearby surface, 1554, 74, 187, 3

    Close-up rock, 1422, 50, 84, 4

    DRT, 8, 4, 6, 5

    DRT spot, 214, 1, 7, 6

    Distant landscape, 342, 14, 34, 7

    Drill hole, 252, 5, 12, 8

    Night sky, 40, 3, 4, 9

    Float, 190, 5, 1, 10

    Layers, 182, 21, 17, 11

    Light-toned veins, 42, 4, 27, 12

    Mastcam cal target, 122, 12, 29, 13

    Sand, 228, 19, 16, 14

    Sun, 182, 5, 19, 15

    Wheel, 212, 5, 5, 16

    Wheel joint, 62, 1, 5, 17

    Wheel tracks, 26, 3, 1, 18

    Image Augmentation

    Only the training set contains augmented images. 3,920 of the 5,920 images in the training set are augmented versions of the remaining 2000 original training images. Images taken by different instruments were augmented differently. As shown below, we employed 5 different methods to augment images. Images taken by the Mastcam left and right eye cameras were augmented using a horizontal flipping method, and images taken by the MAHLI camera were augmented using all 5 methods. Note that one can filter based on the file names listed in the train-set.txt file to obtain a set of non-augmented images.

    • 90 degrees clockwise rotation (file name ends with -r90.jpg)
    • 180 degrees clockwise rotation (file name ends with -r180.jpg)
    • 270 degrees clockwise rotation (file name ends with -r270.jpg)
    • Horizontal flip (file name ends with -fh.jpg)
    • Vertical flip (file name ends with -fv.jpg)

    Acknowledgment

    The authors would like to thank the volunteers (as in the Contributor list) who provided annotations for this data set. We would also like to thank the PDS Imaging Node for its continuous support of this work.

  18. BananaImageBD: An Extensive Image Dataset of Common Bangladeshi Banana...

    • data.mendeley.com
    Updated Sep 4, 2024
    + more versions
    Cite
    Md Hasanul Ferdaus (2024). BananaImageBD: An Extensive Image Dataset of Common Bangladeshi Banana Varieties with Different Ripeness Levels [Dataset]. http://doi.org/10.17632/ptfscwtnyz.1
    Explore at:
    Dataset updated
    Sep 4, 2024
    Authors
    Md Hasanul Ferdaus
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Bangladesh
    Description

    Type of data: 256x256 px banana images.
    Data format: JPEG.
    Contents of the dataset: Banana cultivars and ripeness stages.

    Number of classes: (1) Four Most Popular Banana cultivars in Bangladesh - Bangla Kola, Chompa Kola, Sabri Kola, and Sagor Kola, and (2) Four Ripeness Stages - Green, Semi-ripe, Ripe, and Overripe

    Number of images: (1) Total original (raw) images of banana cultivars = 2512, Augmented to 7536 images, and (2) Total original (raw) images of ripeness stages = 825, Augmented to 2460 images.

    Distribution of instances: (1) Original (raw) images in each class of banana cultivars: Bangla Kola = 444, Champa Kola = 1035, Sabri Kola = 509, and Sagor Kola = 524. (2) Augmented images in each class of banana cultivars: Bangla Kola = 1332, Chompa Kola = 3105, Sabri Kola = 1527, Sagor Kola = 1572. (3) Original (raw) images in each class of Ripeness stages: Green = 213, Semi-ripe = 205, Ripe = 204, and Overripe = 203. (4) Augmented images in each class of Ripeness stages: Green = 639, Semi-ripe = 612, Ripe = 600, and Overripe = 609.

    Dataset Size: (1) Total size of the original (raw) banana cultivars dataset = 17.5 MB. (2) Total size of the augmented banana cultivars dataset = 80.1 MB. (3) Total size of the original (raw) ripeness stages dataset = 5.58 MB, and (4) Total size of the augmented ripeness stages dataset = 25.4 MB.

    Data Acquisition Process: Images of bananas were captured using mobile phone cameras.
    Data Source Location: Local banana wholesale markets and retail fruit shops from different places in Bangladesh.
    Where applicable: Training machine learning and deep learning models to distinguish popular banana cultivars of Bangladesh and the ripeness stages of bananas.

  19. Table_1_Deep Learning-Based Multilevel Classification of Alzheimer’s Disease...

    • datasetcatalog.nlm.nih.gov
    • frontiersin.figshare.com
    Updated Apr 26, 2022
    Cite
    Gwak, Jeonghwan; Song, Jong-In; Ho, Thi Kieu Khanh; Jeon, Younghun; Lee, Kun Ho; Kim, Minhee; Kim, Byeong C.; Kim, Jae Gwan (2022). Table_1_Deep Learning-Based Multilevel Classification of Alzheimer’s Disease Using Non-invasive Functional Near-Infrared Spectroscopy.docx [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000231964
    Explore at:
    Dataset updated
    Apr 26, 2022
    Authors
    Gwak, Jeonghwan; Song, Jong-In; Ho, Thi Kieu Khanh; Jeon, Younghun; Lee, Kun Ho; Kim, Minhee; Kim, Byeong C.; Kim, Jae Gwan
    Description

    The timely diagnosis of Alzheimer’s disease (AD) and its prodromal stages is critically important for the patients, who manifest different neurodegenerative severity and progression risks, to take intervention and early symptomatic treatments before the brain damage is shaped. As one of the promising techniques, functional near-infrared spectroscopy (fNIRS) has been widely employed to support early-stage AD diagnosis. This study aims to validate the capability of fNIRS coupled with Deep Learning (DL) models for AD multi-class classification. First, a comprehensive experimental design, including the resting, cognitive, memory, and verbal tasks was conducted. Second, to precisely evaluate the AD progression, we thoroughly examined the change of hemodynamic responses measured in the prefrontal cortex among four subject groups and among genders. Then, we adopted a set of DL architectures on an extremely imbalanced fNIRS dataset. The results indicated that the statistical difference between subject groups did exist during memory and verbal tasks. This presented the correlation of the level of hemoglobin activation and the degree of AD severity. There was also a gender effect on the hemoglobin changes due to the functional stimulation in our study. Moreover, we demonstrated the potential of distinguished DL models, which boosted the multi-class classification performance. The highest accuracy was achieved by Convolutional Neural Network-Long Short-Term Memory (CNN-LSTM) using the original dataset of three hemoglobin types (0.909 ± 0.012 on average). Compared to conventional machine learning algorithms, DL models produced a better classification performance. These findings demonstrated the capability of DL frameworks on the imbalanced class distribution analysis and validated the great potential of fNIRS-based approaches to be further contributed to the development of AD diagnosis systems.

  20. Data from: Peat bogs

    • data.opendatascience.eu
    • data.europa.eu
    Updated Jan 2, 2021
    Cite
    (2021). Peat bogs [Dataset]. https://data.opendatascience.eu/geonetwork/srv/search?keyword=Environment
    Explore at:
    Dataset updated
    Jan 2, 2021
    Description

    Overview: 412: Wetlands with accumulation of a considerable amount of decomposed moss (mostly Sphagnum) and vegetation matter. Both natural and exploited peat bogs.
    Traceability (lineage): This dataset was produced with a machine learning framework using several input datasets, specified in detail in Witjes et al., 2022 (in review, preprint available at https://doi.org/10.21203/rs.3.rs-561383/v3).
    Scientific methodology: The single-class probability layers were generated with a spatiotemporal ensemble machine learning framework detailed in Witjes et al., 2022 (in review, preprint available at https://doi.org/10.21203/rs.3.rs-561383/v3). The single-class uncertainty layers were calculated by taking the standard deviation of the three single-class probabilities predicted by the three components of the ensemble. The HCL (hard class) layers represent the class with the highest probability as predicted by the ensemble.
    Usability: The HCL layers have a decreasing average accuracy (weighted F1-score) at each subsequent level in the CLC hierarchy: 0.83 at level 1 (5 classes), 0.63 at level 2 (14 classes), and 0.49 at level 3 (43 classes). This means the hard-class maps are more reliable when classes are aggregated to a higher level in the hierarchy (e.g. 'Discontinuous Urban Fabric' and 'Continuous Urban Fabric' to 'Urban Fabric'). Some single-class probabilities may more closely represent actual patterns for classes that were overshadowed by unequal sample point distributions. Users are encouraged to set their own thresholds when postprocessing these datasets to optimize the accuracy for their specific use case.
    Uncertainty quantification: Uncertainty is quantified by taking the standard deviation of the probabilities predicted by the three components of the spatiotemporal ensemble model.
    Data validation approaches: The LULC classification was validated through spatial 5-fold cross-validation, as detailed in the accompanying publication.
    Completeness: The dataset has chunks of empty predictions in regions with complex coastlines (e.g. the Zeeland province in the Netherlands and the Mar da Palha bay area in Portugal). These are artifacts that will be avoided in subsequent versions of the LULC product.
    Consistency: The accuracy of the predictions was compared per year and per 30 km × 30 km tile across Europe to derive temporal and spatial consistency by calculating the standard deviation. The standard deviation of the annual weighted F1-score was 0.135, while the standard deviation of the weighted F1-score per tile was 0.150. This means the dataset is more consistent through time than through space: predictions are notably less accurate along the Mediterranean coast. The accompanying publication contains additional information and visualisations.
    Positional accuracy: The raster layers have a resolution of 30 m, identical to that of the Landsat data cube used as input features for the machine learning framework that predicted them.
    Temporal accuracy: The dataset contains predictions and uncertainty layers for each year between 2000 and 2019.
    Thematic accuracy: The maps reproduce the Corine Land Cover classification system, a hierarchical legend that consists of 5 classes at the highest level, 14 classes at the second level, and 44 classes at the third level. Class 523: Oceans was omitted due to computational constraints.
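
    As a rough illustration of the derivation described above, the following NumPy sketch combines per-class probabilities from three ensemble members into mean single-class probability layers, per-class uncertainty (standard deviation) layers, and an argmax hard-class (HCL) layer; the array names, shapes, and random placeholder inputs are illustrative assumptions.

      import numpy as np

      n_members, n_classes, height, width = 3, 5, 100, 100

      # probs[m, c, y, x]: probability of class c at pixel (y, x) from ensemble member m
      # (random placeholder values here; in practice these are the predicted probability rasters).
      probs = np.random.dirichlet(np.ones(n_classes), size=(n_members, height, width))
      probs = np.moveaxis(probs, -1, 1)            # -> (members, classes, y, x)

      mean_probs = probs.mean(axis=0)              # single-class probability layers
      uncertainty = probs.std(axis=0)              # per-class uncertainty layers
      hard_class = mean_probs.argmax(axis=0)       # HCL: most probable class per pixel

      print(hard_class.shape, uncertainty.shape)   # (100, 100) (5, 100, 100)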
