Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
While the cost of road traffic fatalities in the U.S. surpasses $240 billion a year, the availability of high-resolution datasets allows meticulous investigation of the factors contributing to crash severity. In this paper, the Trucks Involved in Fatal Accidents 2010 (TIFA 2010) dataset is used to classify truck-involved crash severity, a task complicated by missing values, imbalanced classes, and high dimensionality. First, a decision tree-based algorithm, the Synthetic Minority Oversampling Technique (SMOTE), and the Random Forest (RF) feature importance approach are employed for missing value imputation, minority class oversampling, and dimensionality reduction, respectively. Afterward, a variety of classification algorithms, including RF, K-Nearest Neighbors (KNN), Multi-Layer Perceptron (MLP), Gradient-Boosted Decision Trees (GBDT), and Support Vector Machine (SVM), are developed to reveal the influence of the introduced preprocessing framework on the output quality of machine learning (ML) classifiers. The results show that the GBDT model outperforms all competing algorithms on the non-preprocessed crash data according to the G-mean performance measure, but RF makes the most accurate predictions on the treated dataset. This finding indicates that once feature selection has been conducted to reduce the computational cost of the ML algorithms, bagging (bootstrap aggregating) decision trees in RF yields a better model than boosting them via GBDT. In addition, the adopted feature importance approach decreases overall accuracy by at most 5% in most of the estimated models. Moreover, the worst class recall of the RF algorithm without prior oversampling is only 34.4%, compared with 90.3% in the up-sampled model, which validates the proposed multi-step preprocessing scheme. This study also identifies temporal and spatial (roadway) attributes, as well as crash characteristics and Emergency Medical Service (EMS), as the most critical factors in truck crash severity.
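For readers who want to see the shape of such a three-step pipeline, the following is a minimal Python sketch using scikit-learn and imbalanced-learn. The paper's exact decision tree-based imputer is not reproduced here; IterativeImputer with a decision tree estimator, the synthetic data, and the top-10 feature cut-off are all illustrative assumptions.

```python
# Sketch: tree-based imputation -> SMOTE -> RF feature-importance reduction.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
X[rng.random(X.shape) < 0.05] = np.nan          # inject missing values
y = (rng.random(500) < 0.15).astype(int)        # imbalanced binary target

# Step 1: tree-based missing value imputation (stand-in for the paper's method)
X_imp = IterativeImputer(estimator=DecisionTreeRegressor(max_depth=5),
                         random_state=0).fit_transform(X)

# Step 2: oversample the minority class with SMOTE
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_imp, y)

# Step 3: RF feature importance for dimensionality reduction (top-10 assumed)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_bal, y_bal)
keep = np.argsort(rf.feature_importances_)[::-1][:10]
X_red = X_bal[:, keep]
```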
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In recent years, the challenge of imbalanced data has become increasingly prominent in machine learning, affecting the performance of classification algorithms. This study proposes a novel data-level oversampling method called Cluster-Based Reduced Noise SMOTE (CRN-SMOTE) to address this issue. CRN-SMOTE combines SMOTE for oversampling minority classes with a novel cluster-based noise reduction technique. In this cluster-based noise reduction approach, it is crucial that samples from each category form one or two clusters, a property that conventional noise reduction methods do not achieve. The proposed method is evaluated on four imbalanced datasets (ILPD, QSAR, Blood, and Maternal Health Risk) using five metrics: Cohen's kappa, Matthews correlation coefficient (MCC), F1-score, precision, and recall. Results demonstrate that CRN-SMOTE consistently outperformed the state-of-the-art Reduced Noise SMOTE (RN-SMOTE), SMOTE-Tomek Link, and SMOTE-ENN methods across all datasets, with particularly notable improvements observed on the QSAR and Maternal Health Risk datasets, indicating its effectiveness in enhancing imbalanced classification performance. Overall, the experimental findings indicate that CRN-SMOTE outperformed RN-SMOTE in 100% of the cases, achieving average improvements of 6.6% in Kappa, 4.01% in MCC, 1.87% in F1-score, 1.7% in precision, and 2.05% in recall, with SMOTE's number of neighbors set to 5.
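As a rough illustration of the general pattern the abstract describes (cluster each class, discard samples far from their centroid as noise, then apply SMOTE with 5 neighbors), the following Python sketch is a hedged approximation, not the authors' exact CRN-SMOTE algorithm; the cluster count and distance quantile are assumptions.

```python
# Hedged CRN-SMOTE-like sketch: per-class k-means noise filtering, then SMOTE.
import numpy as np
from sklearn.cluster import KMeans
from imblearn.over_sampling import SMOTE

def crn_smote_like(X, y, n_clusters=2, keep_quantile=0.9, random_state=0):
    """Cluster each class, drop points far from their centroid, then SMOTE.
    Assumes the minority class keeps at least 6 samples after filtering
    (required by SMOTE with k_neighbors=5)."""
    keep_idx = []
    for label in np.unique(y):
        idx = np.where(y == label)[0]
        km = KMeans(n_clusters=min(n_clusters, len(idx)), n_init=10,
                    random_state=random_state).fit(X[idx])
        # distance of each sample to its assigned cluster centroid
        d = np.linalg.norm(X[idx] - km.cluster_centers_[km.labels_], axis=1)
        keep_idx.append(idx[d <= np.quantile(d, keep_quantile)])
    keep_idx = np.concatenate(keep_idx)
    return SMOTE(k_neighbors=5,
                 random_state=random_state).fit_resample(X[keep_idx],
                                                         y[keep_idx])
```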
All values represent the mean of five experimental trials.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A comparison of the RN-SMOTE, SMOTE-Tomek Link, SMOTE-ENN, and the proposed CRN-SMOTE methods on the Blood and Health-risk datasets is presented, based on various classification metrics using the Random Forest classifier.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A comparison of the RN-SMOTE, SMOTE-Tomek Link, SMOTE-ENN, and the proposed CRN-SMOTE methods on the ILPD and QSAR datasets is presented, based on various classification metrics using the Random Forest classifier.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Clinical data analysis and forecasting have made substantial contributions to disease control, prevention, and detection. However, such data usually suffer from highly imbalanced class distributions. In this paper, we aim to formulate effective methods to rebalance binary imbalanced datasets, where the positive samples make up only the minority. We investigate two different meta-heuristic algorithms, particle swarm optimization and the bat algorithm, and apply them to empower the synthetic minority over-sampling technique (SMOTE) for pre-processing the datasets. One approach processes the full dataset as a whole; the other splits up the dataset and adaptively processes it one segment at a time. The experimental results reported in this paper reveal that the performance improvements obtained by the former approach do not scale to larger datasets. The latter methods, which we call Adaptive Swarm Balancing Algorithms, lead to significant efficiency and effectiveness improvements on large datasets where the former approach fails, and we find them better suited to the practice of typical large imbalanced medical datasets. We further use the meta-heuristic algorithms to optimize two key parameters of SMOTE. The proposed methods lead to more credible classifier performance and shorter run times than a brute-force method.
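To make the idea concrete, here is an illustrative Python sketch (not the paper's implementation) of a small particle swarm tuning SMOTE's two key parameters, the number of neighbors k and the minority sampling ratio, against cross-validated F1. The swarm size, coefficients, and parameter bounds are arbitrary assumptions, and the ratio bounds assume the original minority share is below 30%.

```python
# Illustrative PSO over SMOTE's (k_neighbors, sampling ratio), not the
# paper's exact algorithm.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from imblearn.over_sampling import SMOTE

def fitness(params, X, y):
    k = int(round(np.clip(params[0], 1, 10)))
    ratio = float(np.clip(params[1], 0.3, 1.0))  # assumes minority share < 0.3
    Xb, yb = SMOTE(k_neighbors=k, sampling_strategy=ratio,
                   random_state=0).fit_resample(X, y)
    return cross_val_score(DecisionTreeClassifier(random_state=0),
                           Xb, yb, cv=5, scoring="f1").mean()

def pso_smote(X, y, n_particles=8, n_iter=15, seed=0):
    rng = np.random.default_rng(seed)
    pos = rng.uniform([1, 0.3], [10, 1.0], size=(n_particles, 2))
    vel = np.zeros_like(pos)
    pbest = pos.copy()
    pbest_val = np.array([fitness(p, X, y) for p in pos])
    gbest = pbest[pbest_val.argmax()].copy()
    for _ in range(n_iter):
        r1, r2 = rng.random((2, n_particles, 1))
        vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
        pos = pos + vel
        vals = np.array([fitness(p, X, y) for p in pos])
        improved = vals > pbest_val
        pbest[improved], pbest_val[improved] = pos[improved], vals[improved]
        gbest = pbest[pbest_val.argmax()].copy()
    return gbest  # best (k, ratio) found

if __name__ == "__main__":
    from sklearn.datasets import make_classification
    X, y = make_classification(n_samples=800, weights=[0.9], random_state=0)
    print("best (k, ratio):", pso_smote(X, y))
```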
Landslides are frequent and hazardous geological disasters, posing significant risks to human safety and infrastructure. Accurate assessments of landslide susceptibility are crucial for risk management and mitigation. However, geological surveys of landslide areas are typically conducted at the township level, have low sample sizes, and rely on experience. This study proposes a framework for assessing landslide susceptibility in Taiping Township, Zhejiang Province, China, using data balancing, machine learning, and data from 1,325 slope units with nine slope characteristics. The dataset was balanced using the Synthetic Minority Oversampling Technique and the Tomek link undersampling method (SMOTE-Tomek). A comparative analysis of six machine learning models was performed, and the SHapley Additive exPlanation (SHAP) method was used to assess the influencing factors. The results indicate that the machine learning algorithms provide high accuracy, and the random forest (RF) algorithm achieves the optimum model accuracy (0.791, F1 = 0.723). The very low, low, medium, and high sensitivity zones account for 92.27%, 5.12%, 1.78%, and 0.83% of the area, respectively. The height of cut slopes has the most significant impact on landslide sensitivity, whereas the altitude has a minor impact. The proposed model accurately assesses landslide susceptibility at the township scale, providing valuable insights for risk management and mitigation.
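A minimal Python sketch of this kind of workflow, using imbalanced-learn's SMOTETomek, a random forest, and SHAP, is shown below; the synthetic stand-in data (mirroring the 1,325 slope units and nine characteristics) and the model settings are assumptions, not the study's configuration.

```python
# Sketch: SMOTE-Tomek balancing, RF classification, SHAP factor analysis.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from imblearn.combine import SMOTETomek
import shap

# synthetic stand-in for 1,325 slope units with nine slope characteristics
X, y = make_classification(n_samples=1325, n_features=9, n_informative=6,
                           weights=[0.9], random_state=0)
X_bal, y_bal = SMOTETomek(random_state=0).fit_resample(X, y)
X_tr, X_te, y_tr, y_te = train_test_split(X_bal, y_bal, test_size=0.3,
                                          stratify=y_bal, random_state=0)
rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
pred = rf.predict(X_te)
print("accuracy:", accuracy_score(y_te, pred), "F1:", f1_score(y_te, pred))

# SHAP values for factor influence (plotting APIs vary across shap versions)
explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X_te)
```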
Area ratio and historical landslide numbers in different susceptibility categories for different models using the SMOTE sampling method.
Area ratio and historical landslide numbers in different susceptibility categories for different models using the SMOTE-Tomek sampling method.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A comparison of the CRN-SMOTE and RN-SMOTE methods on the health risk dataset based on different classification metrics using the Random Forest classifier.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
To address the low accuracy and poor robustness of modeling methods on imbalanced pig behavior identification and classification datasets, three commonly used re-sampling methods (under-sampling, SMOTE, and Borderline-SMOTE) are compared, and an adaptive boundary data augmentation algorithm, AD-BL-SMOTE, is proposed. The activity of the pigs was measured using triaxial accelerometers fixed on the backs of the pigs. A multilayer feed-forward neural network was trained and validated with 21 input features to classify four pig activities: lying, standing, walking, and exploring. The results showed that re-sampling methods are an effective way to improve the performance of pig behavior identification and classification. Moreover, AD-BL-SMOTE yielded greater improvements in classification performance than the other three methods for balancing the training data set. The overall mean accuracy across lying, standing, walking, and exploring for pigs A, B, and C was significantly improved by AD-BL-SMOTE, reaching 91.8%, 93.0%, and 96.0%, respectively.
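Since AD-BL-SMOTE itself is not reproduced here, the following Python sketch shows only the standard Borderline-SMOTE baseline the study compares against, applied before training a multilayer feed-forward network on 21 synthetic input features; all data and hyperparameters are illustrative.

```python
# Baseline sketch: Borderline-SMOTE balancing, then an MLP on 21 features.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report
from imblearn.over_sampling import BorderlineSMOTE

# synthetic stand-in: 4 behavior classes, 21 accelerometer-derived features
X, y = make_classification(n_samples=4000, n_features=21, n_informative=10,
                           n_classes=4, weights=[0.7, 0.15, 0.1, 0.05],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# balance only the training split, then train the feed-forward network
X_bal, y_bal = BorderlineSMOTE(random_state=0).fit_resample(X_tr, y_tr)
mlp = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500,
                    random_state=0).fit(X_bal, y_bal)
print(classification_report(y_te, mlp.predict(X_te)))
```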
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Online searches of Web of Science and PubMed were conducted on 15 September 2023 for articles published after 1950 using the following terms: TS = (ultra high dose rate OR ultra-high dose rate OR ultrahigh dose rate) AND TS = (in vivo OR animal model OR mice OR preclinical). The queries produced 980 results in total, with 564 results left after removing duplicate entries. The titles and abstracts were reviewed manually by two authors, and the full text of suitable manuscripts was further screened considering factors such as topic, experimental conditions and methods, research objects, and endpoints. The detailed record identification and screening flows, based on the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA), are summarized in Figure 1. Finally, forty articles were included in our analysis.

The FLASH effect was considered confirmed if there were significant differences in experimental phenomena and data under the two radiation conditions. Within the same article, research items with different endpoints but otherwise identical conditions were regarded as one item. As summarized in Table 1, a total of 131 items were extracted from the 40 included articles. For each item, the FLASH effect (1 for a significant sparing effect, 0 for no sparing effect) and detailed parameters were recorded, including the type and energy of the radiation, dose, dose rate, experimental object, and pulse characteristics (if provided).

To emulate the quantitative analyses of normal tissue effects in the clinic (QUANTEC), the probability of triggering the FLASH effect as a function of mean dose rate or dose was analyzed with a binary logistic regression model, using SPSS. For the statistical data items, there is a large imbalance between the numbers of entries with and without the FLASH effect (researchers are more inclined to report positive results). Therefore, a more balanced dataset was obtained by oversampling with the K-Means SMOTE algorithm (Figure S1), implemented in Python with the imblearn library.

The ROC curve (receiver operating characteristic curve) was plotted as the false positive rate (FPR) against the true positive rate (TPR) at different threshold values. The classification model was validated using the AUC (area under the ROC curve), which is threshold- and scale-invariant.
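A hedged Python sketch of the oversampling and logistic regression steps (using the imblearn K-Means SMOTE the text names, with logistic regression standing in for the SPSS analysis) might look as follows; the synthetic data and the cluster balance threshold are assumptions.

```python
# Sketch: K-Means SMOTE oversampling, then binary logistic regression + AUC.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from imblearn.over_sampling import KMeansSMOTE

# synthetic stand-in for the extracted items (dose, dose rate, etc.)
X, y = make_classification(n_samples=400, n_features=4, n_informative=3,
                           n_redundant=0, weights=[0.75], random_state=0)

# the balance threshold is an assumption; K-Means SMOTE can need tuning
# on small or awkwardly clustered data
Xb, yb = KMeansSMOTE(random_state=0,
                     cluster_balance_threshold=0.1).fit_resample(X, y)
X_tr, X_te, y_tr, y_te = train_test_split(Xb, yb, stratify=yb, random_state=0)
clf = LogisticRegression().fit(X_tr, y_tr)
print("AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```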
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Data sources: the training set of the Give Me Some Credit data on the Kaggle platform, plus new data obtained after a series of preprocessing steps. Data format: three CSV copies. Data description: the dataset includes the age, income, family, and loan situation of borrowers, with 11 variables in total. SeriousDlqin2yrs is the category label, where 1 represents default and 0 represents non-default; the other 10 variables are predictive features. Smote_standardized_data is the result of basic processing of the training data: KNN imputation of missing values, outlier handling, standardization, and balancing with the SMOTE algorithm.
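A minimal sketch of that preprocessing chain in Python might look as follows, assuming the standard Kaggle file name cs-training.csv; the 1%/99% winsorization rule for outlier handling is an assumption, as the exact outlier treatment is not specified.

```python
# Sketch: KNN imputation, outlier winsorization, standardization, SMOTE.
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE

df = pd.read_csv("cs-training.csv")                      # Kaggle training set
df = df.loc[:, ~df.columns.str.startswith("Unnamed")]    # drop index column
y = df["SeriousDlqin2yrs"]
X = df.drop(columns=["SeriousDlqin2yrs"])

X = KNNImputer(n_neighbors=5).fit_transform(X)           # fill missing values
lo, hi = np.quantile(X, [0.01, 0.99], axis=0)
X = np.clip(X, lo, hi)                                   # winsorize outliers
X = StandardScaler().fit_transform(X)                    # standardize
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X, y)  # balance classes
```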
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The attached file contains R code covering the process of loading data, cleaning data, selecting variables, imputing missing values, creating training and test sets, and building and evaluating models. Additionally, the code produces the graphs and tables used for data and model evaluation.
The goal was to build a logistic regression model to predict outcomes after surgery for colon cancer and to compare its performance with machine learning algorithms. An XGBoost model, a Random Forest model, and an XGBoost model trained on SMOTE-oversampled data were built and compared with logistic regression. Overall, the machine learning algorithms achieved improved AUC.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background: Burnout is usually defined as a state of emotional, physical, and mental exhaustion that affects people in various professions (e.g. physicians, nurses, teachers). The consequences of burnout involve decreased motivation, productivity, and overall diminished well-being. The machine learning-based prediction of burnout has therefore become the focus of recent research. In this study, the aim was to detect burnout using machine learning and to identify its most important predictors in a sample of Hungarian high-school teachers. Methods: The final sample consisted of 1,576 high-school teachers (522 male), who completed a survey including various sociodemographic and health-related questions and psychological questionnaires. Specifically, depression, insomnia, internet habits (e.g. when and why one uses the internet) and problematic internet usage were among the most important predictors tested in this study. Supervised classification algorithms were trained to detect burnout assessed by two well-known burnout questionnaires. Feature selection was conducted using recursive feature elimination. Hyperparameters were tuned via grid search with 5-fold cross-validation. Due to class imbalance, class weights (i.e. cost-sensitive learning), downsampling and a hybrid method (SMOTE-ENN) were applied in separate analyses. The final model evaluation was carried out on a previously unseen holdout test sample. Results: Burnout was detected in 19.7% of the teachers included in the final dataset. The best predictive performance on the holdout test sample was achieved by a support vector machine with SMOTE-ENN (AUC = .942; balanced accuracy = .868, sensitivity = .898; specificity = .837). The best predictors of burnout were Beck Depression Inventory scores, Athens Insomnia Scale scores, subscales of the Problematic Internet Use Questionnaire and self-reported current health status. Conclusions: The performances of the algorithms were comparable with previous studies; however, it is important to note that we tested our models on previously unseen holdout samples, suggesting higher levels of generalizability. Another remarkable finding is that besides depression and insomnia, other variables such as problematic internet use and time spent online also turned out to be important predictors of burnout.
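As a rough illustration (not the study's code), the best-performing setup named above, an SVM with SMOTE-ENN tuned by 5-fold grid search and evaluated on a holdout set, can be sketched in Python as follows; the synthetic data and the small parameter grid are assumptions.

```python
# Sketch: SMOTE-ENN + SVM in an imblearn Pipeline, tuned by 5-fold grid
# search, evaluated on a held-out split (resampling happens only during fit,
# so the holdout set stays untouched).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score
from imblearn.combine import SMOTEENN
from imblearn.pipeline import Pipeline

X, y = make_classification(n_samples=1576, weights=[0.8], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=0)
pipe = Pipeline([("resample", SMOTEENN(random_state=0)),
                 ("svm", SVC(probability=True, random_state=0))])
grid = GridSearchCV(pipe, {"svm__C": [0.1, 1, 10],
                           "svm__gamma": ["scale", 0.01]},
                    cv=5, scoring="roc_auc").fit(X_tr, y_tr)
print("holdout AUC:", roc_auc_score(y_te, grid.predict_proba(X_te)[:, 1]))
```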
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
One of the most challenging issues in machine learning is imbalanced data analysis. Usually, in this type of research, correctly predicting minority labels is more critical than correctly predicting majority labels. However, traditional machine learning techniques are prone to learning bias: classifiers tend to place all subjects in the majority group, resulting in biased predictions. Machine learning studies are typically conducted from one of two perspectives: a data-based perspective or a model-based perspective. Oversampling and undersampling are examples of data-based approaches, while the addition of costs, penalties, or weights to optimize the algorithm is typical of a model-based approach. Some ensemble methods have been studied recently. These methods cause various problems, such as overfitting, the omission of some information, and long computation times, and they do not apply to all kinds of datasets. To address these problems, a virtual labels (ViLa) approach for the majority label is proposed to solve the imbalance problem. A new multiclass classification approach with the equal K-means clustering method is demonstrated in the study. The proposed method is compared with commonly used imbalance problem methods, such as sampling methods (oversampling, undersampling, and SMOTE) and classifier methods (SVM and one-class SVM). The results show that the proposed method performs better as the degree of data imbalance increases, gradually outperforming the other methods.
Background: While previous studies identified risk factors for diverse pregnancy outcomes, traditional statistical methods had limited ability to quantify their impacts on birth outcomes precisely. We aimed to use a novel approach that applied different machine learning models not only to predict birth outcomes but to systematically quantify the impacts of pre- and post-conception serum thyroid-stimulating hormone (TSH) levels and other predictive characteristics on birth outcomes.
Methods: We used data from women who gave birth in Shanghai First Maternal and Infant Hospital from 2014 to 2015. We included 14,110 women with a preconception TSH measurement in the first analysis and 3,428 of these 14,110 women with both pre- and post-conception TSH measurements in the second analysis. The Synthetic Minority Over-sampling Technique (SMOTE) was applied to adjust for the imbalance of outcomes. We randomly split the data (7:3) into a training set and a test set in both analyses. We compared the Area Under the Curve (AUC) for dichotomous outcomes and the macro F1 score for categorical outcomes among four machine learning models (logistic regression, random forest, XGBoost, and multilayer neural network) to assess model performance. The model with the highest AUC or macro F1 score was used to quantify the importance of predictive features for adverse birth outcomes with a loss function algorithm.
Results: The XGBoost model provided prominent advantages in terms of improved performance and prediction of polytomous variables. Predictive models with abnormal preconception TSH, or with not-well-controlled TSH (a novel indicator combining pre- and post-conception TSH levels), provided similarly robust predictions of birth outcomes. The highest AUC, 98.7%, was achieved by the XGBoost model predicting low Apgar score with not-well-controlled TSH adjusted. Using the loss function algorithm, we found that not-well-controlled TSH ranked 4th, 6th, and 7th among 14 features in predicting birthweight, induction, and preterm birth, respectively, and 3rd among 19 features in predicting low Apgar score.
Conclusions: Our four machine learning models offered valid predictions of birth outcomes in women during pre- and post-conception. The predictive features panel suggested that the combined TSH indicator (not-well-controlled TSH) could be a potentially competitive biomarker for predicting adverse birth outcomes.
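A hedged Python sketch of the described comparison, SMOTE followed by a 7:3 split and AUC scoring, is given below; the synthetic stand-in features are an assumption, and only two of the four models (logistic regression and XGBoost) are shown for brevity.

```python
# Sketch: SMOTE, 7:3 split, AUC comparison of logistic regression vs XGBoost.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from imblearn.over_sampling import SMOTE
from xgboost import XGBClassifier  # assumes the xgboost package is installed

# synthetic stand-in for the 14,110 records with 14 predictive features
X, y = make_classification(n_samples=14110, n_features=14, weights=[0.9],
                           random_state=0)
Xb, yb = SMOTE(random_state=0).fit_resample(X, y)
X_tr, X_te, y_tr, y_te = train_test_split(Xb, yb, test_size=0.3,
                                          stratify=yb, random_state=0)
for name, clf in [("logistic", LogisticRegression(max_iter=1000)),
                  ("xgboost", XGBClassifier(eval_metric="logloss"))]:
    clf.fit(X_tr, y_tr)
    print(name, "AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```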
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Credit Card Fraud Detection

Introduction
Credit card fraud detection is a critical challenge in the financial sector. This project aims to build a machine learning model to identify fraudulent credit card transactions using a comprehensive dataset.

Dataset Overview
The dataset contains transactions made by credit cards in September 2013 by European cardholders. It presents a significant class imbalance, with the majority of transactions being non-fraudulent.

Features:
- Time: seconds elapsed between this transaction and the first transaction in the dataset.
- V1 to V28: anonymized features resulting from a PCA transformation.
- Amount: transaction amount.
- Class: target variable (1 for fraud, 0 for non-fraud).

Steps Taken
1. Data Preprocessing
- Standardization: standardized numeric features to improve model performance.
- Handling Imbalance: applied SMOTE (Synthetic Minority Over-sampling Technique) to balance the dataset and ensure the model is well trained on both classes.
2. Exploratory Data Analysis
- Correlation Analysis: examined correlations between features to understand relationships and their potential impact on the model.
3. Model Building
- Algorithm Used: Random Forest Classifier, chosen for its robustness and high performance.
- Hyperparameter Tuning: employed RandomizedSearchCV to find the best hyperparameters and enhance model accuracy.
4. Model Evaluation
- Confusion Matrix & Classification Report: evaluated the model's performance using key metrics such as precision, recall, F1-score, and overall accuracy.
- Feature Importance: analyzed feature importances to identify which features contribute most to detecting fraud.

Results
The model achieved near-perfect performance metrics on the balanced test set:
- Accuracy: 99.9%
- Precision, Recall, F1-score: 1.00 (rounded) for both classes
- Confusion Matrix: true negatives (TN) 9,906; false positives (FP) 8; false negatives (FN) 9; true positives (TP) 9,757

Conclusion
This project demonstrates the effectiveness of machine learning in detecting fraudulent credit card transactions. The key steps, including data preprocessing, handling class imbalance, and hyperparameter tuning, were crucial in achieving high model performance. The feature importance analysis provided valuable insights into the key indicators of fraudulent activity.
Check out the full code and detailed analysis in the GitHub Repository.
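As a rough Python sketch of the steps listed above (not the repository's actual code), the pipeline could look as follows; the file name creditcard.csv and the search grid are assumptions.

```python
# Sketch: standardize, SMOTE, random forest tuned with RandomizedSearchCV.
import pandas as pd
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from imblearn.over_sampling import SMOTE

df = pd.read_csv("creditcard.csv")                      # assumed file name
X = StandardScaler().fit_transform(df.drop(columns=["Class"]))
y = df["Class"]

# balance the classes, then split (the README evaluates on this balanced data)
X_bal, y_bal = SMOTE(random_state=42).fit_resample(X, y)
X_tr, X_te, y_tr, y_te = train_test_split(X_bal, y_bal, test_size=0.2,
                                          stratify=y_bal, random_state=42)
search = RandomizedSearchCV(RandomForestClassifier(random_state=42),
                            {"n_estimators": [100, 200, 400],
                             "max_depth": [None, 10, 20]},
                            n_iter=5, cv=3, random_state=42).fit(X_tr, y_tr)
pred = search.predict(X_te)
print(confusion_matrix(y_te, pred))
print(classification_report(y_te, pred))
```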
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Model comparison using multiple metrics before balancing with SMOTE (80%-20% train-test split).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Classification of imbalanced datasets of animal behavior has been one of the top challenges in the field of animal science. An imbalanced dataset renders many classification algorithms less effective and results in a higher misclassification rate for the minority classes. The aim of this study was to assess a method for addressing the problem of imbalanced pig behavior datasets by using an over-sampling method, namely Borderline-SMOTE. The pigs' activity was measured using a triaxial accelerometer mounted on the back of the pigs. Wavelet filtering and Borderline-SMOTE were both applied to pre-process the dataset. A multilayer feed-forward neural network was trained and validated with 21 input features to classify four pig activities: lying, standing, walking, and exploring. The results showed that wavelet filtering and Borderline-SMOTE both lead to improved performance. Furthermore, Borderline-SMOTE yielded greater improvements in classification performance than an alternative method for balancing the training data, namely random under-sampling, which is commonly used in animal science research. However, the overall performance was not adequate to satisfy the research needs in this field or to address the common but urgent problem of imbalanced behavior datasets.
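For the wavelet filtering step, a hedged PyWavelets sketch of soft-threshold denoising on a single accelerometer axis is shown below; the wavelet ('db4'), decomposition level, and universal-threshold rule are illustrative assumptions rather than the study's exact settings.

```python
# Sketch: wavelet soft-threshold denoising of a noisy accelerometer signal.
import numpy as np
import pywt

rng = np.random.default_rng(0)
signal = np.sin(np.linspace(0, 20, 1024)) + 0.3 * rng.normal(size=1024)

coeffs = pywt.wavedec(signal, "db4", level=4)          # decompose
sigma = np.median(np.abs(coeffs[-1])) / 0.6745         # noise estimate (MAD)
thr = sigma * np.sqrt(2 * np.log(len(signal)))         # universal threshold
coeffs[1:] = [pywt.threshold(c, thr, mode="soft") for c in coeffs[1:]]
denoised = pywt.waverec(coeffs, "db4")                 # reconstruct
```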