80 datasets found

The definition of a confusion matrix.
plos.figshare.com
xls
Updated Feb 10, 2025
Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari (2025). The definition of a confusion matrix. [Dataset]. http://doi.org/10.1371/journal.pone.0317396.t002
Available download formats
xls
Unique identifier
https://doi.org/10.1371/journal.pone.0317396.t002
Dataset updated
Feb 10, 2025
Dataset provided by
PLOS ONE
Authors
Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
In recent years, the challenge of imbalanced data has become increasingly prominent in machine learning, affecting the performance of classification algorithms. This study proposes a novel data-level oversampling method called Cluster-Based Reduced Noise SMOTE (CRN-SMOTE) to address this issue. CRN-SMOTE combines SMOTE for oversampling minority classes with a novel cluster-based noise reduction technique. In this cluster-based noise reduction approach, it is crucial that samples from each category form one or two clusters, a feature that conventional noise reduction methods do not achieve. The proposed method is evaluated on four imbalanced datasets (ILPD, QSAR, Blood, and Maternal Health Risk) using five metrics: Cohen’s kappa, the Matthews correlation coefficient (MCC), F1-score, precision, and recall. Results demonstrate that CRN-SMOTE consistently outperformed the state-of-the-art Reduced Noise SMOTE (RN-SMOTE), SMOTE-Tomek Link, and SMOTE-ENN methods across all datasets, with particularly notable improvements observed in the QSAR and Maternal Health Risk datasets, indicating its effectiveness in enhancing imbalanced classification performance. Overall, the experimental findings indicate that CRN-SMOTE outperformed RN-SMOTE in 100% of the cases, achieving average improvements of 6.6% in kappa, 4.01% in MCC, 1.87% in F1-score, 1.7% in precision, and 2.05% in recall, with SMOTE’s number of neighbors set to 5.
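As a rough illustration of how the five metrics named in this description relate to a binary confusion matrix, the sketch below computes them from the four cell counts. The function name `binary_metrics` and the counts are hypothetical and chosen only for illustration; they are not taken from the dataset or the paper, and the sketch assumes no denominator is zero.

```python
import math


def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute precision, recall, F1, MCC and Cohen's kappa from a 2x2 confusion matrix."""
    n = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)

    # Matthews correlation coefficient.
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_denominator

    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_observed = (tp + tn) / n
    p_expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n**2
    kappa = (p_observed - p_expected) / (1 - p_expected)

    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
        "kappa": kappa,
    }


# Hypothetical counts, for illustration only.
metrics = binary_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Kappa and MCC use all four cells of the matrix, which is one reason they are often reported alongside precision and recall on imbalanced data.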
Adaptive Swarm Balancing Algorithms for rare-event prediction in imbalanced...
plos.figshare.com
txt
Updated Jun 18, 2023
Jinyan Li; Lian-sheng Liu; Simon Fong; Raymond K. Wong; Sabah Mohammed; Jinan Fiaidhi; Yunsick Sung; Kelvin K. L. Wong (2023). Adaptive Swarm Balancing Algorithms for rare-event prediction in imbalanced healthcare data [Dataset]. http://doi.org/10.1371/journal.pone.0180830
Available download formats
txt
Unique identifier
https://doi.org/10.1371/journal.pone.0180830
Dataset updated
Jun 18, 2023
Dataset provided by
PLOS ONE
Authors
Jinyan Li; Lian-sheng Liu; Simon Fong; Raymond K. Wong; Sabah Mohammed; Jinan Fiaidhi; Yunsick Sung; Kelvin K. L. Wong
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Clinical data analysis and forecasting have made substantial contributions to disease control, prevention and detection. However, such data usually suffer from highly imbalanced class distributions. In this paper, we aim to formulate effective methods to rebalance binary imbalanced datasets in which the positive samples are the minority. We investigate two different meta-heuristic algorithms, particle swarm optimization and the bat algorithm, and apply them to strengthen the effects of the synthetic minority over-sampling technique (SMOTE) for pre-processing the datasets. One approach is to process the full dataset as a whole. The other is to split up the dataset and adaptively process it one segment at a time. The experimental results reported in this paper reveal that the performance improvements obtained by the former approach do not scale to larger datasets. The latter methods, which we call Adaptive Swarm Balancing Algorithms, lead to significant efficiency and effectiveness improvements on large datasets, whereas the first method is invalid. We also find them more consistent with the practice of typical large imbalanced medical datasets. We further use the meta-heuristic algorithms to optimize two key parameters of SMOTE. The proposed methods lead to more credible classifier performance and shorter run times compared to the brute-force method.
Number of instances increased by SMOTE technique.
plos.figshare.com
xls
Updated Jun 2, 2023
Manal Alghamdi; Mouaz Al-Mallah; Steven Keteyian; Clinton Brawner; Jonathan Ehrman; Sherif Sakr (2023). Number of instances increased by SMOTE technique. [Dataset]. http://doi.org/10.1371/journal.pone.0179805.t003
Available download formats
xls
Unique identifier
https://doi.org/10.1371/journal.pone.0179805.t003
Dataset updated
Jun 2, 2023
Dataset provided by
PLOS ONE
Authors
Manal Alghamdi; Mouaz Al-Mallah; Steven Keteyian; Clinton Brawner; Jonathan Ehrman; Sherif Sakr
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Number of instances increased by SMOTE technique.
Synthetic oversampling for credit card default prediction
data.mendeley.com
Updated Mar 8, 2023
Fransiscus Pratikto (2023). Synthetic oversampling for credit card default prediction [Dataset]. http://doi.org/10.17632/jrss9jdjz9.1
Unique identifier
https://doi.org/10.17632/jrss9jdjz9.1
Dataset updated
Mar 8, 2023
Authors
Fransiscus Pratikto
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset contains more than 17,000 records of credit card holders, with 20 predictor variables and 1 binary target variable. The corresponding R code for comparing several proposed (density-based) and existing (SMOTE-based) synthetic oversampling methods is also provided.
Data from: High impact bug report identification with imbalanced learning...
researchdata.smu.edu.sg
zip
Updated Jun 1, 2023
YANG Xinli; David LO; Xin XIA; Qiao HUANG; Jianling SUN (2023). Data from: High impact bug report identification with imbalanced learning strategies [Dataset]. http://doi.org/10.25440/smu.12062763.v1
Available download formats
zip
Unique identifier
https://doi.org/10.25440/smu.12062763.v1
Dataset updated
Jun 1, 2023
Dataset provided by
SMU Research Data Repository (RDR)
Authors
YANG Xinli; David LO; Xin XIA; Qiao HUANG; Jianling SUN
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This record contains the underlying research data for the publication "High impact bug report identification with imbalanced learning strategies". The full text is available from: https://ink.library.smu.edu.sg/sis_research/3702

In practice, some bugs have more impact than others and thus deserve more immediate attention. Due to tight schedules and limited human resources, developers may not have enough time to inspect all bugs. Thus, they often concentrate on bugs that are highly impactful. In the literature, the term high-impact bugs refers to bugs that appear at unexpected times or locations and bring more unexpected effects (i.e., surprise bugs), or that break pre-existing functionality and destroy the user experience (i.e., breakage bugs). Unfortunately, identifying high-impact bugs from thousands of bug reports in a bug tracking system is not an easy feat. Thus, an automated technique that can identify high-impact bug reports can help developers to be aware of them early, rectify them quickly, and minimize the damage they cause. Considering that only a small proportion of bugs are high-impact bugs, the identification of high-impact bug reports is a difficult task. In this paper, we propose an approach to identify high-impact bug reports by leveraging imbalanced learning strategies. We investigate the effectiveness of various variants, each of which combines one particular imbalanced learning strategy and one particular classification algorithm. In particular, we choose four widely used strategies for dealing with imbalanced data and four state-of-the-art text classification algorithms to conduct experiments on four datasets from four different open source projects. We mainly perform an analytical study on two types of high-impact bugs, i.e., surprise bugs and breakage bugs.

The results show that different variants perform differently, and that the best-performing variants, SMOTE (synthetic minority over-sampling technique) + KNN (K-nearest neighbours) for surprise bug identification and RUS (random under-sampling) + NB (naive Bayes) for breakage bug identification, outperform the F1-scores of the two state-of-the-art approaches by Thung et al. and Garcia and Shihab.

Supplementary code and data available from GitHub:
Nitesh V Chawla, Kevin W Bowyer, Lawrence O Hall, W Philip Kegelmeyer...
service.tib.eu
Updated Dec 3, 2024
Nitesh V Chawla, Kevin W Bowyer, Lawrence O Hall, W Philip Kegelmeyer (2024). Dataset: SMOTE: Synthetic Minority Over-Sampling Technique. https://doi.org/10.57702/tq0zp0i3 [Dataset]. https://service.tib.eu/ldmservice/dataset/smote--synthetic-minority-over-sampling-technique
Dataset updated
Dec 3, 2024
Description
SMOTE: synthetic minority over-sampling technique.
Data from: Prediction of 35 Target Per- and Polyfluoroalkyl Substances...
acs.figshare.com
txt
Updated Aug 18, 2023
Jialin Dong; Gabriel Tsai; Christopher I. Olivares (2023). Prediction of 35 Target Per- and Polyfluoroalkyl Substances (PFASs) in California Groundwater Using Multilabel Semisupervised Machine Learning [Dataset]. http://doi.org/10.1021/acsestwater.3c00134.s002
Available download formats
txt
Unique identifier
https://doi.org/10.1021/acsestwater.3c00134.s002
Dataset updated
Aug 18, 2023
Dataset provided by
ACS Publications
Authors
Jialin Dong; Gabriel Tsai; Christopher I. Olivares
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Comprehensive monitoring of perfluoroalkyl and polyfluoroalkyl substances (PFASs) is challenging because of the high analytical cost and an increasing number of analytes. We developed a machine learning pipeline to understand environmental features influencing PFAS profiles in groundwater. By examining 23 public data sets (2016–2022) in California, we built a state-wide groundwater database (25,000 observations across 4200 wells) encompassing contamination sources, weather, air quality, soil, hydrology, and groundwater quality (PFASs and cocontaminants). We used supervised learning to prescreen total PFAS concentrations above 70 ng/L and multilabel semisupervised learning to predict 35 individual PFAS concentrations above 2 ng/L. Random forest with ADASYN oversampling performed the best for total PFASs (AUROC 99%). XGBoost with SMOTE oversampling achieved an AUROC of 73–100% for individual PFAS prediction. Contamination sources and soil variables contributed the most to accuracy. Individual PFASs were strongly correlated within each PFAS’s subfamily (i.e., short- vs long-chain PFCAs, sulfonamides). These associations improved prediction performance using classifier chains, which predict a PFAS based on previously predicted species. We applied the model to reconstruct PFAS profiles in groundwater wells with missing data in previous years. Our approach can complement monitoring programs of environmental agencies to validate previous investigation results and prioritize sites for future PFAS sampling.
Data from: Image-based automated species identification: Can virtual data...
zenodo.org
data.niaid.nih.gov
zip
Updated Jun 4, 2022
Morris Klasen; Jonas Eberle; Dirk Ahrens; Volker Steinhage (2022). Image-based automated species identification: Can virtual data augmentation overcome problems of insufficient sampling? [Dataset]. http://doi.org/10.5061/dryad.f1vhhmgx9
Available download formats
zip
Unique identifier
https://doi.org/10.5061/dryad.f1vhhmgx9
Dataset updated
Jun 4, 2022
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Morris Klasen; Jonas Eberle; Dirk Ahrens; Volker Steinhage
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Automated species identification and delimitation is challenging, particularly in rare and thus often scarcely sampled species, which do not allow sufficient discrimination of infraspecific versus interspecific variation. Typical problems arising from either low or exaggerated interspecific morphological differentiation are best met by automated methods of machine learning that learn efficient and effective species identification from training samples. However, limited infraspecific sampling remains a key challenge also in machine learning.

In this study, we assessed whether a data augmentation approach may help to overcome the problem of scarce training data in automated visual species identification. The stepwise augmentation of data comprised image rotation as well as visual and virtual augmentation. The visual data augmentation applies classic approaches of data augmentation and generation of artificial images using a Generative Adversarial Network (GAN) approach. Descriptive feature vectors are derived from bottleneck features of a VGG-16 convolutional neural network (CNN) that are then stepwise reduced in dimensionality using Global Average Pooling and PCA to prevent overfitting. Finally, data augmentation employs synthetic additional sampling in feature space by an oversampling algorithm in vector space (SMOTE). Applied to four different image datasets, which include scarab beetle genitalia (Pleophylla, Schizonycha) as well as wing patterns of bees (Osmia) and cattleheart butterflies (Parides), our augmentation approach outperformed, in terms of identification accuracy, both a deep learning baseline approach with non-augmented data and a traditional 2D morphometric approach (Procrustes analysis of scarab beetle genitalia).
Wireless Sensor Network Dataset
kaggle.com
Updated Jun 19, 2024
Rehan Adil Abbasi (2024). Wireless Sensor Network Dataset [Dataset]. https://www.kaggle.com/datasets/rehanadilabbasi/wireless-sensor-network-dataset/data
Dataset updated
Jun 19, 2024
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Rehan Adil Abbasi
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Basic Information:

Number of entries: 374,661
Number of features: 19

Data Types:

15 integer columns
3 float columns
1 object column (label)

Column Names:

id, Time, Is_CH, who CH, Dist_To_CH, ADV_S, ADV_R, JOIN_S, JOIN_R, SCH_S, SCH_R, Rank, DATA_S, DATA_R, Data_Sent_To_BS, dist_CH_To_BS, send_code, Consumed Energy, label

Explore the Dataset

First Five Rows (row index first, then the 19 columns in the order listed above):

0, 101000, 50, 1, 101000, 0.00000, 1, 0, 0, 25, 1, 0, 0, 0, 1200, 48, 0.00000, 1, 0.00000, Attack
1, 101001, 50, 0, 101044, 75.32345, 0, 4, 1, 0, 0, 1, 2, 38, 0, 0, 0.00000, 1, 0.09797, Normal
2, 101002, 50, 0, 101010, 46.95453, 0, 4, 1, 0, 0, 1, 19, 41, 0, 0, 0.00000, 1, 0.09797, Normal
3, 101003, 50, 0, 101044, 64.85231, 0, 4, 1, 0, 0, 1, 16, 38, 0, 0, 0.00000, 1, 0.09797, Normal
4, 101004, 50, 0, 101010, 4.83341, 0, 4, 1, 0, 0, 1, 0, 41, 0, 0, 0.00000, 1, 0.09797, Normal

Missing Values: No missing values detected in the dataset.

Statistical Summary:

The dataset includes various features related to network operations such as Time, Is_CH, who CH, Dist_To_CH, and more. The target variable label contains categorical values such as "Attack" and "Normal".

Analyze Class Distribution

Let's analyze the distribution of the classes within the dataset.

    # dataset is assumed to be a pandas DataFrame that has already been loaded
    class_distribution = dataset['label'].value_counts()
    class_distribution

Handle Class Imbalance

If a class imbalance is identified, appropriate techniques such as undersampling, oversampling, or SMOTE (Synthetic Minority Over-sampling Technique) will be applied to balance the dataset.

Next Steps:

Identify the class distribution.
Apply balancing techniques if necessary.
Continue with data preprocessing and feature engineering.

We will perform the class distribution analysis and balancing in the subsequent step.

Duplicate Handling

Some duplicate rows were found and dropped:

    dataset.duplicated().sum()
    dataset.drop_duplicates(inplace=True)

Duplicates found: 8,873
Action taken: all duplicate entries were dropped.
Duplicates after cleaning: 0

The dataset is now free of duplicate entries. We can proceed with the analysis of the class distribution and any necessary steps to handle class imbalance.

Class Distribution Analysis

The distribution of the classes in the label column is as follows:

Normal: 332,040
Grayhole: 13,909
Blackhole: 10,049
TDMA: 6,633
Flooding: 3,157

Observations

There is a significant class imbalance, with the majority class being "Normal" (332,040 instances). Minority classes include "Grayhole", "Blackhole", "TDMA", and "Flooding".
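To make the reported imbalance concrete, the following sketch derives each class's share and its ratio to the majority class from the counts listed above. The `samples_to_match_majority` figure assumes full balancing against the "Normal" class, which is an assumption of this sketch; the description does not state a target ratio.

```python
# Class counts as reported in the description above.
class_counts = {
    "Normal": 332_040,
    "Grayhole": 13_909,
    "Blackhole": 10_049,
    "TDMA": 6_633,
    "Flooding": 3_157,
}

total = sum(class_counts.values())
majority = max(class_counts.values())

summary = {}
for label, count in class_counts.items():
    summary[label] = {
        "share": count / total,                 # fraction of all rows
        "imbalance_ratio": majority / count,    # majority rows per row of this class
        # Assumption: oversample every class up to the majority count.
        "samples_to_match_majority": majority - count,
    }

for label, stats in summary.items():
    print(
        f"{label:<10} share={stats['share']:.4f} "
        f"ratio={stats['imbalance_ratio']:.1f} "
        f"to_add={stats['samples_to_match_majority']}"
    )
```

The counts sum to 365,788, which equals the 374,661 reported entries minus the 8,873 dropped duplicates.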
Confusion matrix.
plos.figshare.com
xls
Updated May 31, 2024
Ankit Vijayvargiya; Aparna Sinha; Naveen Gehlot; Ashutosh Jena; Rajesh Kumar; Kieran Moran (2024). Confusion matrix. [Dataset]. http://doi.org/10.1371/journal.pone.0301263.t001
Available download formats
xls
Unique identifier
https://doi.org/10.1371/journal.pone.0301263.t001
Dataset updated
May 31, 2024
Dataset provided by
PLOS ONE
Authors
Ankit Vijayvargiya; Aparna Sinha; Naveen Gehlot; Ashutosh Jena; Rajesh Kumar; Kieran Moran
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The diagnosis of human knee abnormalities using the surface electromyography (sEMG) signal obtained from lower limb muscles with machine learning is a major problem due to the noisy nature of the sEMG signal and the imbalance in data between healthy subjects and subjects with knee abnormalities. To address this challenge, a combination of wavelet decomposition (WD) with ensemble empirical mode decomposition (EEMD) and the Synthetic Minority Oversampling Technique (S-WD-EEMD) is proposed. In this study, a hybrid WD-EEMD is used to minimize the noise introduced into the sEMG signal during collection, while the Synthetic Minority Oversampling Technique (SMOTE) is used to balance the data by increasing the minority class samples during the training of machine learning techniques. The findings indicate that the hybrid WD-EEMD with SMOTE oversampling enhances the efficacy of the examined classifiers when employed on the imbalanced sEMG data. The F-Score of the Extra Tree Classifier, when utilizing WD-EEMD signal processing with SMOTE oversampling, is 98.4%, whereas, without SMOTE oversampling, it is 95.1%.
Data from: Dataset for classification of signaling proteins based on...
figshare.com
portalcientifico.sergas.es
txt
Updated Jan 19, 2016
Carlos Fernandez-Lozano; Cristian Robert Munteanu (2016). Dataset for classification of signaling proteins based on molecular star graph descriptors using machine-learning models [Dataset]. http://doi.org/10.6084/m9.figshare.1330132.v1
Available download formats
txt
Unique identifier
https://doi.org/10.6084/m9.figshare.1330132.v1
Dataset updated
Jan 19, 2016
Dataset provided by
figshare
Authors
Carlos Fernandez-Lozano; Cristian Robert Munteanu
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The positive group of 608 signaling protein sequences was downloaded in FASTA format from the Protein Data Bank (Berman et al., 2000) using the “Molecular Function Browser” in the “Advanced Search Interface” (“Signaling (GO ID23052)”, protein identity cut-off = 30%). The negative group of 2077 non-signaling proteins was downloaded as the PISCES CulledPDB (http://dunbrack.fccc.edu/PISCES.php) (Wang & R. L. Dunbrack, 2003) (November 19th, 2012) using identity (degree of correspondence between two sequences) less than 20%, resolution of 1.6 Å and R-factor 0.25. The full dataset contains 2685 FASTA sequences of protein chains from the PDB: 608 are signaling proteins and 2077 are non-signaling peptides. This kind of unbalanced data is not well suited as input for learning algorithms because the results would present a high sensitivity and low specificity; learning algorithms would tend to classify most samples as part of the most common group. To avoid this situation, a pre-processing stage is needed in order to get a more balanced dataset, in this case by means of the synthetic minority oversampling technique (SMOTE). In short, SMOTE provides a more balanced dataset by expanding the smaller class with new samples created by interpolating between existing minority-class samples. After this pre-processing, the final dataset is composed of 1824 positive samples (signaling protein chains) and 2432 negative cases (non-signaling protein chains). The paper is available at: http://dx.doi.org/10.1016/j.jtbi.2015.07.038
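The interpolation step described above can be sketched in a few lines of plain Python. This is a minimal illustration of the basic SMOTE idea, not the authors' pipeline: the function name `smote`, the parameters `n_new`, `k` and `seed`, and the toy points are all hypothetical, and the sketch assumes at least two minority samples with numeric features.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority-class neighbours of the base point (itself excluded).
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # Pick a random position on the segment from base to the chosen neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (q - b) for b, q in zip(base, neighbour)])
    return synthetic


# Toy minority class with two features, for illustration only.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote(minority, n_new=8, k=2, seed=0)
print(len(minority) + len(new_points), "minority samples after oversampling")
```

In the description above the positive class grows from 608 to 1824 samples, which corresponds to 1216 added samples; in this sketch that would be `n_new=1216`.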

Please cite: Carlos Fernandez-Lozano, Rubén F. Cuiñas, José A. Seoane, Enrique Fernández-Blanco, Julian Dorado, Cristian R. Munteanu, Classification of signaling proteins based on molecular star graph descriptors using Machine Learning models, Journal of Theoretical Biology, Volume 384, 7 November 2015, Pages 50-58, ISSN 0022-5193, http://dx.doi.org/10.1016/j.jtbi.2015.07.038 (http://www.sciencedirect.com/science/article/pii/S0022519315003999)
Top 10 performing oversamplers for DTS1 versus baseline (no oversampling and...
plos.figshare.com
figshare.com
xls
Updated May 31, 2023
Kevin Teh; Paul Armitage; Solomon Tesfaye; Dinesh Selvarajah; Iain D. Wilkinson (2023). Top 10 performing oversamplers for DTS1 versus baseline (no oversampling and SMOTE) averaged across four classifiers. [Dataset]. http://doi.org/10.1371/journal.pone.0243907.t002
Available download formats
xls
Unique identifier
https://doi.org/10.1371/journal.pone.0243907.t002
Dataset updated
May 31, 2023
Dataset provided by
PLOS ONE
Authors
Kevin Teh; Paul Armitage; Solomon Tesfaye; Dinesh Selvarajah; Iain D. Wilkinson
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Top 10 performing oversamplers for DTS1 versus baseline (no oversampling and SMOTE) averaged across four classifiers.
Data from: Signature Informed Sampling for Transcriptomic Data
zenodo.org
zip
Updated Dec 4, 2023
Nikita Janakarajan; Mara Graziani; María Rodríguez Martínez (2023). Signature Informed Sampling for Transcriptomic Data [Dataset]. http://doi.org/10.5281/zenodo.8383203
Available download formats
zip
Unique identifier
https://doi.org/10.5281/zenodo.8383203
Dataset updated
Dec 4, 2023
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Nikita Janakarajan; Mara Graziani; María Rodríguez Martínez
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This repository contains the data and associated results of all experiments conducted in our work "Signature Informed Sampling for Transcriptomic Data". In this work we propose a simple, novel, non-parametric method for augmenting data inspired by the concept of chromosomal crossover. We benchmark our proposed methods against random oversampling, SMOTE, modified versions of gamma-Poisson and Poisson sampling, and the unbalanced data.

The compressed file data_5x5stratified.zip contains all the data used for our experiments. This includes the original count data based off of which augmentation was performed, the cross validation split indices as a json file, the training and validation data (TCGA) augmented by the various augmentation methods mentioned in our study, a test set (containing only real samples from TCGA) and an external test set (CPTAC) standardised accordingly with respect to each augmentation method and training data per cv split.

The compressed file 5x5_Results.zip contains all the results from all the experiments. This includes the parameter files used to train the various models, the metrics computed, the latent space of train, validation and test (if the model is a VAE), and the trained model itself for all 25 (5x5) splits.
DoH Attack and Malware Detection using ML/DL
kaggle.com
Updated May 21, 2024
BCCC Datasets (2024). DoH Attack and Malware Detection using ML/DL [Dataset]. https://www.kaggle.com/datasets/bcccdatasets/bccc-cira-cic-dohbrw-2020/data
Dataset updated
May 21, 2024
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
BCCC Datasets
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
The 'BCCC-CIRA-CIC-DoHBrw-2020' dataset was created to address the imbalance in the 'CIRA-CIC-DoHBrw-2020' dataset. Unlike the 'CIRA-CIC-DoHBrw-2020' dataset, which is skewed with about 90% malicious and only 10% benign DNS over HTTPS (DoH) network traffic, the 'BCCC-CIRA-CIC-DoHBrw-2020' dataset offers a more balanced composition. It includes equal numbers of malicious and benign DoH network traffic instances, with 249,836 instances in each category. This balance was achieved using the Synthetic Minority Over-sampling Technique (SMOTE). The 'BCCC-CIRA-CIC-DoHBrw-2020' dataset comprises three CSV files: one for malicious DoH traffic, one for benign DoH traffic, and a third that combines both types.

The full research paper outlining the details of the dataset and its underlying principles: “Unveiling DoH Tunnel: Toward Generating a Balanced DoH Encrypted Traffic Dataset and Profiling malicious Behaviour using Inherently Interpretable Machine Learning”, Sepideh Niktabe, Arash Habibi Lashkari, Arousha Haghighian Roudsari, Peer-to-Peer Networking and Applications, Vol. 17, 2023
Systematic analysis and modeling of the FLASH sparing effect as a function...
scidb.cn
Updated Jun 29, 2024
Qibin FU; Tuchen HUANG (2024). Systematic analysis and modeling of the FLASH sparing effect as a function of dose rate and dose [Dataset]. http://doi.org/10.57760/sciencedb.j00186.00150
Unique identifier
https://doi.org/10.57760/sciencedb.j00186.00150
Dataset updated
Jun 29, 2024
Dataset provided by
Science Data Bank
Authors
Qibin FU; Tuchen HUANG
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Online searches through Web of Science and PubMed were conducted on 15 September 2023 for articles published after 1950 using the following terms: TS = (ultra high dose rate OR ultra-high dose rate OR ultrahigh dose rate) AND TS = (in vivo OR animal model OR mice OR preclinical). The queries produced 980 results in total, with 564 results left after removing duplicate entries.

The titles and abstracts were reviewed manually by two authors, and the full text of suitable manuscripts was further screened considering factors such as topics, experimental conditions and methods, research objects, endpoints, etc. The detailed record identification and screening flows, based on Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA), are summarized in Figure 1. Finally, forty articles were included in our analysis.

The FLASH effect was confirmed if there were significant differences in experimental phenomena and data under the two radiation conditions. Within the same article, research items with different endpoints but otherwise identical conditions were regarded as one item. As summarized in Table 1, a total of 131 items were extracted from the 40 articles included in the analysis. For each item, the FLASH effect (1 represents a significant sparing effect and 0 represents no sparing effect) and detailed parameters were recorded, including type and energy of the radiation, dose, dose rate, experimental object, pulse characteristics (if provided), etc.

Emulating the quantitative analyses of normal tissue effects in the clinic (QUANTEC), the probability of triggering the FLASH effect as a function of mean dose rate or dose was analyzed with a binary logistic regression model. The analysis was done using the SPSS software. Among the data items, there are large imbalances in the number of entries with and without the FLASH effect (people are more inclined to report research with positive results). Therefore, a more balanced dataset was obtained by oversampling using the K-Means SMOTE algorithm (Figure S1), which was implemented in Python based on the imblearn library.

The ROC curve (receiver operating characteristic curve) was plotted as TPR (True Positive Rate) against FPR (False Positive Rate) at different threshold values. The classification model was validated using the AUC (area under the ROC curve) value, which is threshold and scale invariant.
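The ROC and AUC evaluation described above can be illustrated with a short plain-Python sketch. The labels and scores below are hypothetical and only serve to show the mechanics; the function names `roc_points` and `auc` are likewise illustrative and are not taken from the study's code.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points obtained by lowering the decision threshold step by step."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])

    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Admit every sample that shares the current score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Hypothetical labels (1 = effect observed) and predicted probabilities.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

curve = roc_points(labels, scores)
area = auc(curve)
print(curve)
print(area)

# Rescaling the scores keeps their ranking, so the AUC is unchanged.
rescaled_area = auc(roc_points(labels, [10 * s + 3 for s in scores]))
```

Because the curve depends only on the ordering of the scores, any monotonically increasing rescaling gives the same area, which is the sense in which the AUC is scale invariant.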
Number of instances decreased by Random Under-Sampling technique.
plos.figshare.com
xls
Updated Jun 18, 2023
Manal Alghamdi; Mouaz Al-Mallah; Steven Keteyian; Clinton Brawner; Jonathan Ehrman; Sherif Sakr (2023). Number of instances decreased by Random Under-Sampling technique. [Dataset]. http://doi.org/10.1371/journal.pone.0179805.t002
Available download formats
xls
Unique identifier
https://doi.org/10.1371/journal.pone.0179805.t002
Dataset updated
Jun 18, 2023
Dataset provided by
PLOS ONE
Authors
Manal Alghamdi; Mouaz Al-Mallah; Steven Keteyian; Clinton Brawner; Jonathan Ehrman; Sherif Sakr
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Number of instances decreased by Random Under-Sampling technique.
Evaluation of the performance of classification models on imbalance dataset...
figshare.com
xls
Updated Jun 6, 2023
Manal Alghamdi; Mouaz Al-Mallah; Steven Keteyian; Clinton Brawner; Jonathan Ehrman; Sherif Sakr (2023). Evaluation of the performance of classification models on imbalance dataset using the G2 attributes. [Dataset]. http://doi.org/10.1371/journal.pone.0179805.t005
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0179805.t005
Dataset updated
Jun 6, 2023
Dataset provided by
PLOS ONE
Authors
Manal Alghamdi; Mouaz Al-Mallah; Steven Keteyian; Clinton Brawner; Jonathan Ehrman; Sherif Sakr
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Evaluation of the performance of classification models on imbalance dataset using the G2 attributes.
Ranking of the dataset attributes based on their Information Gain (IG).
plos.figshare.com
xls
Updated Jun 3, 2023
Manal Alghamdi; Mouaz Al-Mallah; Steven Keteyian; Clinton Brawner; Jonathan Ehrman; Sherif Sakr (2023). Ranking of the dataset attributes based on their Information Gain (IG). [Dataset]. http://doi.org/10.1371/journal.pone.0179805.t001
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0179805.t001
Dataset updated
Jun 3, 2023
Dataset provided by
PLOS ONE
Authors
Manal Alghamdi; Mouaz Al-Mallah; Steven Keteyian; Clinton Brawner; Jonathan Ehrman; Sherif Sakr
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Ranking of the dataset attributes based on their Information Gain (IG).
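For readers unfamiliar with the ranking criterion, the sketch below computes Information Gain as the reduction in label entropy after splitting on an attribute, then sorts attributes by that value. It assumes categorical attributes; the attribute names and toy rows are hypothetical, and the source does not state how the authors computed or discretized their attributes.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """IG = H(labels) - sum over values v of p(v) * H(labels | feature = v)."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def rank_attributes(columns, labels):
    """Return attribute names sorted from highest to lowest Information Gain."""
    gains = {name: information_gain(vals, labels) for name, vals in columns.items()}
    return sorted(gains, key=gains.get, reverse=True)


# Hypothetical toy data: "a" separates the classes perfectly, "b" carries no signal.
toy_labels = [1, 1, 0, 0]
toy_columns = {"a": ["x", "x", "y", "y"], "b": ["u", "v", "u", "v"]}
print(rank_attributes(toy_columns, toy_labels))  # ['a', 'b']
```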
Mortality statistics, censored for explant and transplant.
plos.figshare.com
xls
Updated May 30, 2023
Natasha A. Loghmanpour; Marek J. Druzdzel; James F. Antaki (2023). Mortality statistics, censored for explant and transplant. [Dataset]. http://doi.org/10.1371/journal.pone.0111264.t001
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0111264.t001
Dataset updated
May 30, 2023
Dataset provided by
PLOS ONE
Authors
Natasha A. Loghmanpour; Marek J. Druzdzel; James F. Antaki
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
SMOTE: synthetic minority oversampling technique. Mortality statistics, censored for explant and transplant.
Classification of rare land cover types: Distinguishing annual and perennial...
plos.figshare.com
pdf
Updated Jun 1, 2023
Christina Bogner; Bumsuk Seo; Dorian Rohner; Björn Reineking (2023). Classification of rare land cover types: Distinguishing annual and perennial crops in an agricultural catchment in South Korea [Dataset]. http://doi.org/10.1371/journal.pone.0190476
Available download formats: pdf
Unique identifier
https://doi.org/10.1371/journal.pone.0190476
Dataset updated
Jun 1, 2023
Dataset provided by
PLOS ONE
Authors
Christina Bogner; Bumsuk Seo; Dorian Rohner; Björn Reineking
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
South Korea
Description
Many environmental data are inherently imbalanced, with some majority land use and land cover types dominating over rare ones. In cultivated ecosystems, minority classes are often the target because they may indicate the beginning of a land use change. Most standard classifiers perform best on a balanced distribution of classes and fail to detect minority classes. We used the synthetic minority oversampling technique (SMOTE) with Random Forest to classify land cover classes in a small agricultural catchment in South Korea using MODIS time series. This area faces a major soil erosion problem, and policy measures encourage farmers to replace annual with perennial crops to mitigate this issue. Our major goal was therefore to improve the classification performance on annual and perennial crops. We compared four different classification scenarios on original imbalanced and synthetically oversampled balanced data to quantify the effect of SMOTE on classification performance. SMOTE substantially increased the true positive rate of all oversampled minority classes. However, the performance on minority classes remained lower than on the majority class. We attribute this result to a class overlap already present in the original data set that is not resolved by SMOTE. Our results show that resampling algorithms could help to derive more accurate land use and land cover maps from freely available data. These maps can be used to provide information on the distribution of land use classes in heterogeneous agricultural areas and could potentially benefit decision making.
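The core interpolation step of SMOTE can be sketched in plain Python as follows. This is a minimal sketch of the basic technique, not the implementation used in the study; the function name, the default of k = 5 neighbors, and the toy points are assumptions for illustration.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority samples.

    minority: list of equal-length numeric tuples from the minority class.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # The k nearest minority neighbours of the base point, excluding itself.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # Place the new point at a random position on the connecting segment.
        u = rng.random()
        synthetic.append(tuple(b + u * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Hypothetical minority samples with two features each.
toy_minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_sketch(toy_minority, n_new=6, k=2)
print(len(new_points))  # 6
```

Every synthetic point lies on a segment between two existing minority samples, so where the original minority samples overlap with another class, the synthetic ones will too. This is consistent with the observation above that SMOTE does not resolve class overlap.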

Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari (2025). The definition of a confusion matrix. [Dataset]. http://doi.org/10.1371/journal.pone.0317396.t002

The definition of a confusion matrix.

31 scholarly articles cite this dataset.

Available download formats: xls

Unique identifier

https://doi.org/10.1371/journal.pone.0317396.t002

Dataset updated

Feb 10, 2025

Dataset provided by

PLOS ONE

Authors

Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari

License

Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

In recent years, the challenge of imbalanced data has become increasingly prominent in machine learning, affecting the performance of classification algorithms. This study proposes a novel data-level oversampling method called Cluster-Based Reduced Noise SMOTE (CRN-SMOTE) to address this issue. CRN-SMOTE combines SMOTE for oversampling minority classes with a novel cluster-based noise reduction technique. In this cluster-based noise reduction approach, it is crucial that samples from each category form one or two clusters, a feature that conventional noise reduction methods do not achieve. The proposed method is evaluated on four imbalanced datasets (ILPD, QSAR, Blood, and Maternal Health Risk) using five metrics: Cohen’s kappa, Matthews correlation coefficient (MCC), F1-score, precision, and recall. Results demonstrate that CRN-SMOTE consistently outperformed the state-of-the-art Reduced Noise SMOTE (RN-SMOTE), SMOTE-Tomek Link, and SMOTE-ENN methods across all datasets, with particularly notable improvements observed in the QSAR and Maternal Health Risk datasets, indicating its effectiveness in enhancing imbalanced classification performance. Overall, the experimental findings indicate that CRN-SMOTE outperformed RN-SMOTE in 100% of the cases, achieving average improvements of 6.6% in Kappa, 4.01% in MCC, 1.87% in F1-score, 1.7% in precision, and 2.05% in recall, with SMOTE’s number of neighbors set to 5.
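The five metrics named above can all be derived from the four cells of a binary confusion matrix. The sketch below shows the standard formulas; the function name, the zero-denominator convention (returning 0.0), and the example counts are assumptions for illustration, and the source does not state whether the authors used binary or multiclass variants.

```python
import math


def confusion_metrics(tp, fp, fn, tn):
    """Precision, recall, F1, MCC and Cohen's kappa from binary confusion counts."""

    def ratio(num, den):
        # Assumed convention: report 0.0 when a metric is undefined.
        return num / den if den else 0.0

    n = tp + fp + fn + tn
    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    f1 = ratio(2 * precision * recall, precision + recall)
    mcc = ratio(
        tp * tn - fp * fn,
        math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    )
    observed = ratio(tp + tn, n)
    # Agreement expected by chance from the row and column totals.
    expected = ratio((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn), n * n)
    kappa = ratio(observed - expected, 1 - expected)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
        "kappa": kappa,
    }


# Hypothetical counts, not taken from the dataset.
result = confusion_metrics(tp=40, fp=10, fn=20, tn=30)
print({name: round(value, 3) for name, value in result.items()})
```

For these example counts, precision is 0.8, recall is about 0.667, F1 is about 0.727, MCC is about 0.408, and kappa is 0.4.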

The definition of a confusion matrix.

Adaptive Swarm Balancing Algorithms for rare-event prediction in imbalanced...

Number of instances increased by SMOTE technique.

Synthetic oversampling for credit card default prediction

Data from: High impact bug report identification with imbalanced learning...

Nitesh V Chawla, Kevin W Bowyer, Lawrence O Hall, W Philip Kegelmeyer...

Data from: Prediction of 35 Target Per- and Polyfluoroalkyl Substances...

Data from: Image-based automated species identification: Can virtual data...

Wireless Sensor Network Dataset

Confusion matrix.

Data from: Dataset for classification of signaling proteins based on...

Top 10 performing oversamplers for DTS1 versus baseline (no oversampling and...

Data from: Signature Informed Sampling for Transcriptomic Data

DoH Attack and Malware Detection using ML/DL

Systematic analysis and modeling of the FLASH sparing effect as a function...

Number of instances decreased by Random Under-Sampling technique.

Evaluation of the performance of classification models on imbalance dataset...

Ranking of the dataset attributes based on their Information Gain (IG).

Mortality statistics, censored for explant and transplant.

Classification of rare land cover types: Distinguishing annual and perennial...
