98 datasets found
  1. The selected explanatory variables.

    • plos.figshare.com
    xls
    Updated Jun 21, 2023
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Seyed Iman Mohammadpour; Majid Khedmati; Mohammad Javad Hassan Zada (2023). The selected explanatory variables. [Dataset]. http://doi.org/10.1371/journal.pone.0281901.t002
    Explore at:
    xlsAvailable download formats
    Dataset updated
    Jun 21, 2023
    Dataset provided by
    PLOShttp://plos.org/
    Authors
    Seyed Iman Mohammadpour; Majid Khedmati; Mohammad Javad Hassan Zada
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    While the cost of road traffic fatalities in the U.S. surpasses $240 billion a year, the availability of high-resolution datasets allows meticulous investigation of the contributing factors to crash severity. In this paper, the dataset for Trucks Involved in Fatal Accidents in 2010 (TIFA 2010) is utilized to classify the truck-involved crash severity where there exist different issues including missing values, imbalanced classes, and high dimensionality. First, a decision tree-based algorithm, the Synthetic Minority Oversampling Technique (SMOTE), and the Random Forest (RF) feature importance approach are employed for missing value imputation, minority class oversampling, and dimensionality reduction, respectively. Afterward, a variety of classification algorithms, including RF, K-Nearest Neighbors (KNN), Multi-Layer Perceptron (MLP), Gradient-Boosted Decision Trees (GBDT), and Support Vector Machine (SVM) are developed to reveal the influence of the introduced data preprocessing framework on the output quality of ML classifiers. The results show that the GBDT model outperforms all the other competing algorithms for the non-preprocessed crash data based on the G-mean performance measure, but the RF makes the most accurate prediction for the treated dataset. This finding indicates that after the feature selection is conducted to alleviate the computational cost of the machine learning algorithms, bagging (bootstrap aggregating) of decision trees in RF leads to a better model rather than boosting them via GBDT. Besides, the adopted feature importance approach decreases the overall accuracy by only up to 5% in most of the estimated models. Moreover, the worst class recall value of the RF algorithm without prior oversampling is only 34.4% compared to the corresponding value of 90.3% in the up-sampled model which validates the proposed multi-step preprocessing scheme. This study also identifies the temporal and spatial (roadway) attributes, as well as crash characteristics, and Emergency Medical Service (EMS) as the most critical factors in truck crash severity.

  2. The definition of a confusion matrix.

    • plos.figshare.com
    xls
    Updated Feb 10, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari (2025). The definition of a confusion matrix. [Dataset]. http://doi.org/10.1371/journal.pone.0317396.t002
    Explore at:
    xlsAvailable download formats
    Dataset updated
    Feb 10, 2025
    Dataset provided by
    PLOShttp://plos.org/
    Authors
    Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In recent years, the challenge of imbalanced data has become increasingly prominent in machine learning, affecting the performance of classification algorithms. This study proposes a novel data-level oversampling method called Cluster-Based Reduced Noise SMOTE (CRN-SMOTE) to address this issue. CRN-SMOTE combines SMOTE for oversampling minority classes with a novel cluster-based noise reduction technique. In this cluster-based noise reduction approach, it is crucial that samples from each category form one or two clusters, a feature that conventional noise reduction methods do not achieve. The proposed method is evaluated on four imbalanced datasets (ILPD, QSAR, Blood, and Maternal Health Risk) using five metrics: Cohen’s kappa, Matthew’s correlation coefficient (MCC), F1-score, precision, and recall. Results demonstrate that CRN-SMOTE consistently outperformed the state-of-the-art Reduced Noise SMOTE (RN-SMOTE), SMOTE-Tomek Link, and SMOTE-ENN methods across all datasets, with particularly notable improvements observed in the QSAR and Maternal Health Risk datasets, indicating its effectiveness in enhancing imbalanced classification performance. Overall, the experimental findings indicate that CRN-SMOTE outperformed RN-SMOTE in 100% of the cases, achieving average improvements of 6.6% in Kappa, 4.01% in MCC, 1.87% in F1-score, 1.7% in precision, and 2.05% in recall, with setting SMOTE’s neighbors’ number to 5.

  3. f

    Classification results using ML algorithms after applying SMOTE and feature...

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Sep 3, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Abu Marjan,; Al Mamun, Abdulla; Islam, Rashedul; Uddin, Palash; Nitu, Adiba Mahjabin; Ibn Afjal, Masud (2024). Classification results using ML algorithms after applying SMOTE and feature engineering, training, and testing ratios is 50:50. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001300526
    Explore at:
    Dataset updated
    Sep 3, 2024
    Authors
    Abu Marjan,; Al Mamun, Abdulla; Islam, Rashedul; Uddin, Palash; Nitu, Adiba Mahjabin; Ibn Afjal, Masud
    Description

    All values represent the mean value of 5 trials of experiments.

  4. A comparison of the RN-SMOTE, SMOTE-Tomek Link, SMOTE-ENN, and the proposed...

    • plos.figshare.com
    xls
    Updated Feb 10, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari (2025). A comparison of the RN-SMOTE, SMOTE-Tomek Link, SMOTE-ENN, and the proposed 1CRN-SMOTE methods on the Blood and Health-risk datasets is presented, based on various classification metrics using the Random Forest classifier. [Dataset]. http://doi.org/10.1371/journal.pone.0317396.t008
    Explore at:
    xlsAvailable download formats
    Dataset updated
    Feb 10, 2025
    Dataset provided by
    PLOShttp://plos.org/
    Authors
    Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A comparison of the RN-SMOTE, SMOTE-Tomek Link, SMOTE-ENN, and the proposed 1CRN-SMOTE methods on the Blood and Health-risk datasets is presented, based on various classification metrics using the Random Forest classifier.

  5. A comparison of the RN-SMOTE, SMOTE-Tomek Link, SMOTE-ENN, and the proposed...

    • plos.figshare.com
    xls
    Updated Feb 10, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari (2025). A comparison of the RN-SMOTE, SMOTE-Tomek Link, SMOTE-ENN, and the proposed 1CRN-SMOTE methods on the ILPD and QSAR datasets is presented, based on various classification metrics using the Random Forest classifier. [Dataset]. http://doi.org/10.1371/journal.pone.0317396.t007
    Explore at:
    xlsAvailable download formats
    Dataset updated
    Feb 10, 2025
    Dataset provided by
    PLOShttp://plos.org/
    Authors
    Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A comparison of the RN-SMOTE, SMOTE-Tomek Link, SMOTE-ENN, and the proposed 1CRN-SMOTE methods on the ILPD and QSAR datasets is presented, based on various classification metrics using the Random Forest classifier.

  6. Adaptive Swarm Balancing Algorithms for rare-event prediction in imbalanced...

    • plos.figshare.com
    txt
    Updated Jun 18, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Jinyan Li; Lian-sheng Liu; Simon Fong; Raymond K. Wong; Sabah Mohammed; Jinan Fiaidhi; Yunsick Sung; Kelvin K. L. Wong (2023). Adaptive Swarm Balancing Algorithms for rare-event prediction in imbalanced healthcare data [Dataset]. http://doi.org/10.1371/journal.pone.0180830
    Explore at:
    txtAvailable download formats
    Dataset updated
    Jun 18, 2023
    Dataset provided by
    PLOShttp://plos.org/
    Authors
    Jinyan Li; Lian-sheng Liu; Simon Fong; Raymond K. Wong; Sabah Mohammed; Jinan Fiaidhi; Yunsick Sung; Kelvin K. L. Wong
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Clinical data analysis and forecasting have made substantial contributions to disease control, prevention and detection. However, such data usually suffer from highly imbalanced samples in class distributions. In this paper, we aim to formulate effective methods to rebalance binary imbalanced dataset, where the positive samples take up only the minority. We investigate two different meta-heuristic algorithms, particle swarm optimization and bat algorithm, and apply them to empower the effects of synthetic minority over-sampling technique (SMOTE) for pre-processing the datasets. One approach is to process the full dataset as a whole. The other is to split up the dataset and adaptively process it one segment at a time. The experimental results reported in this paper reveal that the performance improvements obtained by the former methods are not scalable to larger data scales. The latter methods, which we call Adaptive Swarm Balancing Algorithms, lead to significant efficiency and effectiveness improvements on large datasets while the first method is invalid. We also find it more consistent with the practice of the typical large imbalanced medical datasets. We further use the meta-heuristic algorithms to optimize two key parameters of SMOTE. The proposed methods lead to more credible performances of the classifier, and shortening the run time compared to brute-force method.

  7. f

    Number of samples after dataset optimization.

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated May 21, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Gao, Jia-jun; Li, Kun-lun; Lv, Ming-zhou; Cai, Jia-zeng; Mao, Jun; Xu, Hui (2025). Number of samples after dataset optimization. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0002090994
    Explore at:
    Dataset updated
    May 21, 2025
    Authors
    Gao, Jia-jun; Li, Kun-lun; Lv, Ming-zhou; Cai, Jia-zeng; Mao, Jun; Xu, Hui
    Description

    Landslides are frequent and hazardous geological disasters, posing significant risks to human safety and infrastructure. Accurate assessments of landslide susceptibility are crucial for risk management and mitigation. However, geological surveys of landslide areas are typically conducted at the township level, have lowsample sizes, and rely on experience. This study proposes a framework for assessing landslide susceptibility in Taiping Township, Zhejiang Province, China, using data balancing, machine learning, and data from 1,325 slope units with nine slope characteristics. The dataset was balanced using the Synthetic Minority Oversampling Technique and the Tomek link undersampling method (SMOTE-Tomek). A comparative analysis of six machine learning models was performed, and the SHapley Additive exPlanation (SHAP) method was used to assess the influencing factors. The results indicate that the machine learning algorithms provide high accuracy, and the random forest (RF) algorithm achieves the optimum model accuracy (0.791, F1 = 0.723). The very low, low, medium, and high sensitivity zones account for 92.27%, 5.12%, 1.78%, and 0.83% of the area, respectively. The height of cut slopes has the most significant impact on landslide sensitivity, whereas the altitude has a minor impact. The proposed model accurately assesses landslide susceptibility at the township scale, providing valuable insights for risk management and mitigation.

  8. f

    Area ratio and historical landslide numbers in different susceptibility...

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated May 21, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Cai, Jia-zeng; Li, Kun-lun; Gao, Jia-jun; Mao, Jun; Lv, Ming-zhou; Xu, Hui (2025). Area ratio and historical landslide numbers in different susceptibility categories for different models using the SMOTE sampling method. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0002091023
    Explore at:
    Dataset updated
    May 21, 2025
    Authors
    Cai, Jia-zeng; Li, Kun-lun; Gao, Jia-jun; Mao, Jun; Lv, Ming-zhou; Xu, Hui
    Description

    Area ratio and historical landslide numbers in different susceptibility categories for different models using the SMOTE sampling method.

  9. f

    Area ratio and historical landslide numbers in differentsusceptibility...

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated May 21, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Gao, Jia-jun; Xu, Hui; Mao, Jun; Li, Kun-lun; Lv, Ming-zhou; Cai, Jia-zeng (2025). Area ratio and historical landslide numbers in differentsusceptibility categories for different models using the SMOTE-Tomek sampling method. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0002091040
    Explore at:
    Dataset updated
    May 21, 2025
    Authors
    Gao, Jia-jun; Xu, Hui; Mao, Jun; Li, Kun-lun; Lv, Ming-zhou; Cai, Jia-zeng
    Description

    Area ratio and historical landslide numbers in differentsusceptibility categories for different models using the SMOTE-Tomek sampling method.

  10. A comparison of the CRN-SMOTE and RN-SMOTE methods on the health risk...

    • plos.figshare.com
    xls
    Updated Feb 10, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari (2025). A comparison of the CRN-SMOTE and RN-SMOTE methods on the health risk dataset based on different classification metrics using the Random Forest classifier. [Dataset]. http://doi.org/10.1371/journal.pone.0317396.t009
    Explore at:
    xlsAvailable download formats
    Dataset updated
    Feb 10, 2025
    Dataset provided by
    PLOShttp://plos.org/
    Authors
    Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A comparison of the CRN-SMOTE and RN-SMOTE methods on the health risk dataset based on different classification metrics using the Random Forest classifier.

  11. Data from: RESEARCH ON IDENTIFICATION AND CLASSIFICATION METHOD OF...

    • scielo.figshare.com
    tiff
    Updated Feb 14, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Min Jin; Bowen Yang; Chunguang Wang (2024). RESEARCH ON IDENTIFICATION AND CLASSIFICATION METHOD OF IMBALANCED DATA SET OF PIG BEHAVIOR [Dataset]. http://doi.org/10.6084/m9.figshare.23290691.v1
    Explore at:
    tiffAvailable download formats
    Dataset updated
    Feb 14, 2024
    Dataset provided by
    SciELOhttp://www.scielo.org/
    Authors
    Min Jin; Bowen Yang; Chunguang Wang
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ABSTRACT To address the problem of the low accuracy and poor robustness of modeling methods for imbalanced data sets of pig behavior identification and classification, the three commonly used re-sampling methods of under-sampling, SMOTE and Borderline-SMOTE are compared, and an adaptive boundary data augmentation algorithm AD-BL-SMOTE is proposed. The activity of the pigs was measured using triaxial accelerometers, which were fixed on the backs of the pigs. A multilayer feed-forward neural network was trained and validated with 21 input features to classify four pig activities: lying, standing, walking, and exploring. The results showed that re-sampling methods are an effective way to improve the performance of pig behavior identification and classification. Moreover, AD-BL-SMOTE could yield greater improvements in classification performance than the other three methods for balancing the training data set. The overall major mean accuracy of lying, standing, walking, and exploring by pigs A, B and C was significantly improved by using AD-BL-SMOTE, reaching 91.8%, 93.0% and 96.0%, respectively.

  12. S

    Systematic analysis and modeling of the FLASH sparing effect as a function...

    • scidb.cn
    Updated Jun 29, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Qibin FU; Tuchen HUANG (2024). Systematic analysis and modeling of the FLASH sparing effect as a function of dose rate and dose [Dataset]. http://doi.org/10.57760/sciencedb.j00186.00150
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jun 29, 2024
    Dataset provided by
    Science Data Bank
    Authors
    Qibin FU; Tuchen HUANG
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Online searches through Web of Science and PubMed were conducted on 15 September, 2023 for articles published after 1950 using the following terms: TS = (ultra high dose rate OR ultra-high dose rate OR ultrahigh dose rate) AND TS = (in vivo OR animal model OR mice OR preclinical). The queries produced 980 results in total, with 564 results left after removing duplicate entries.The titles and abstracts were reviewed manually by two authors and the full-text of suitable manuscripts was further screened considering the factors such as topics, experiment condition and methods, research objects, endpoints, etc. The detailed record identification and screening flows based on Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) are summarized in Figure 1. Finally, forty articles were included in our analysis.The FLASH effect was confirmed if there were significant differences in experimental phenomena and data under the two radiation conditions. In the same article, the research items with different endpoints but otherwise identical conditions were regarded as one item. As summarized in Table 1, a total of 131 items were extracted from the 40 articles included in the analysis. For each item, the FLASH effect (1 represents significant sparing effect and 0 represents no sparing effect) and detailed parameters were recorded, including type and energy of the radiation, dose, dose rate, experimental object, pulse characteristics (if provided), etc.According to emulate the quantitative analyses of normal tissue effect in the clinic (QUANTEC), the probability of triggering the FLASH effect as a function of mean dose rate or dose was analyzed with the binary logistic regression model. The analysis was done using the SPSS software. For the statistical data items, there are large imbalances in the number of data entries with and without FLASH effect (people are more inclined to report the research with positive results). Therefore, a more balanced dataset was obtained by oversampling using the K-Means SMOTE algorithm (Figure S1), which was implemented using Python based on the imblearn library.The ROC curve (receiver operating characteristic curve) was plotted as FPR (False Positive Rate) against TPR (True Positive Rate) at different threshold values. The classification model was validated using the AUC (area under ROC curve) value, which was threshold and scale invariant.

  13. P

    Replication Data for: default prediction of P2P loan based on improved...

    • opendata.pku.edu.cn
    doc, xls, zip
    Updated Aug 28, 2019
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Peking University Open Research Data Platform (2019). Replication Data for: default prediction of P2P loan based on improved stacking model [Dataset]. http://doi.org/10.18170/DVN/LONMK9
    Explore at:
    zip(23476), xls(9035463), xls(8510407), doc(16384), xls(7564965), zip(4050)Available download formats
    Dataset updated
    Aug 28, 2019
    Dataset provided by
    Peking University Open Research Data Platform
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Data Sources: Training Set of Give me Some Credit Data in Kaggle Platform and a series of new data obtained after a series of pre-processing; Data format: three CSV copies; Data Description: The data set includes the age, income, family and loan situation of borrowers and there are 11 variables in total in which Serious Dlqin2yrs is lategory label,.1 represents default, 0 represents non-default, and another 10. Three variables are predictive characteristics. Smote_standardized_data is the result of traindata's basic processing, KNN filling missing values, outlier processing, standardization and balancing with SMOTE algorithm.

  14. d

    Predictive Models on the 2013 NCDB Colon Cancer Data

    • elsevier.digitalcommonsdata.com
    Updated May 4, 2021
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Grey Leonard (2021). Predictive Models on the 2013 NCDB Colon Cancer Data [Dataset]. http://doi.org/10.17632/jg44fgspzk.1
    Explore at:
    Dataset updated
    May 4, 2021
    Authors
    Grey Leonard
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The attached file contains R code which encompasses and describes the process of loading data, cleaning data, selecting variables, imputing missing values, creating training and test sets, model building and evaluation. Additionally, the code contains the process to create graphs and tables for data and model evaluation.

    The goal was to build a logistic regression model to predict outcomes after surgery for colon cancer and to compare its performance with machine learning algorithms. An XGBgoost model, a Random Forest model and an XGBoost model from oversampled data using SMOTE were built and compared with logistic regression. Overall, the machine learning algorithms had improved AUC.

  15. m

    Data from: Mental issues, internet addiction and quality of life predict...

    • data.mendeley.com
    Updated Jul 12, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Andras Matuz (2024). Mental issues, internet addiction and quality of life predict burnout among Hungarian teachers: a machine learning analysis [Dataset]. http://doi.org/10.17632/2yy4j7rgvg.1
    Explore at:
    Dataset updated
    Jul 12, 2024
    Authors
    Andras Matuz
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: Burnout is usually defined as a state of emotional, physical, and mental exhaustion that affects people in various professions (e.g. physicians, nurses, teachers). The consequences of burnout involve decreased motivation, productivity, and overall diminished well-being. The machine learning-based prediction of burnout has therefore become the focus of recent research. In this study, the aim was to detect burnout using machine learning and to identify its most important predictors in a sample of Hungarian high-school teachers. Methods: The final sample consisted of 1,576 high-school teachers (522 male), who completed a survey including various sociodemographic and health-related questions and psychological questionnaires. Specifically, depression, insomnia, internet habits (e.g. when and why one uses the internet) and problematic internet usage were among the most important predictors tested in this study. Supervised classification algorithms were trained to detect burnout assessed by two well-known burnout questionnaires. Feature selection was conducted using recursive feature elimination. Hyperparameters were tuned via grid search with 5-fold cross-validation. Due to class imbalance, class weights (i.e. cost-sensitive learning), downsampling and a hybrid method (SMOTE-ENN) were applied in separate analyses. The final model evaluation was carried out on a previously unseen holdout test sample. Results: Burnout was detected in 19.7% of the teachers included in the final dataset. The best predictive performance on the holdout test sample was achieved by support vector machine with SMOTE-ENN (AUC = .942; balanced accuracy = .868, sensitivity = .898; specificity = .837). The best predictors of burnout were Beck’s Depression Inventory scores, Athen’s Insomnia Scale scores, subscales of the Problematic Internet Use Questionnaire and self-reported current health status. Conclusions: The performances of the algorithms were comparable with previous studies; however, it is important to note that we tested our models on previously unseen holdout samples suggesting higher levels of generalizability. Another remarkable finding is that besides depression and insomnia, other variables such as problematic internet use and time spent online also turned out to be important predictors of burnout.

  16. Data from: A virtual multi-label approach to imbalanced data classification

    • tandf.figshare.com
    text/x-tex
    Updated Feb 28, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Elizabeth P. Chou; Shan-Ping Yang (2024). A virtual multi-label approach to imbalanced data classification [Dataset]. http://doi.org/10.6084/m9.figshare.19390561.v1
    Explore at:
    text/x-texAvailable download formats
    Dataset updated
    Feb 28, 2024
    Dataset provided by
    Taylor & Francishttps://taylorandfrancis.com/
    Authors
    Elizabeth P. Chou; Shan-Ping Yang
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    One of the most challenging issues in machine learning is imbalanced data analysis. Usually, in this type of research, correctly predicting minority labels is more critical than correctly predicting majority labels. However, traditional machine learning techniques easily lead to learning bias. Traditional classifiers tend to place all subjects in the majority group, resulting in biased predictions. Machine learning studies are typically conducted from one of two perspectives: a data-based perspective or a model-based perspective. Oversampling and undersampling are examples of data-based approaches, while the addition of costs, penalties, or weights to optimize the algorithm is typical of a model-based approach. Some ensemble methods have been studied recently. These methods cause various problems, such as overfitting, the omission of some information, and long computation times. In addition, these methods do not apply to all kinds of datasets. Based on this problem, the virtual labels (ViLa) approach for the majority label is proposed to solve the imbalanced problem. A new multiclass classification approach with the equal K-means clustering method is demonstrated in the study. The proposed method is compared with commonly used imbalance problem methods, such as sampling methods (oversampling, undersampling, and SMOTE) and classifier methods (SVM and one-class SVM). The results show that the proposed method performs better when the degree of data imbalance increases and will gradually outperform other methods.

  17. f

    Table_4_Quantifying the Impacts of Pre- and Post-Conception TSH Levels on...

    • datasetcatalog.nlm.nih.gov
    • frontiersin.figshare.com
    Updated Oct 29, 2021
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Zhang, Ling; Jin, Liping; Yu, Kun-Hsing; Zheng, Weiwei; Hua, Jing; Tian, Dajun; Zhang, Chao; Zhao, Huijuan; Sun, Yuantong; Li, Xun; Ma, Wuren; Xiao, Shuo (2021). Table_4_Quantifying the Impacts of Pre- and Post-Conception TSH Levels on Birth Outcomes: An Examination of Different Machine Learning Models.docx [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000832048
    Explore at:
    Dataset updated
    Oct 29, 2021
    Authors
    Zhang, Ling; Jin, Liping; Yu, Kun-Hsing; Zheng, Weiwei; Hua, Jing; Tian, Dajun; Zhang, Chao; Zhao, Huijuan; Sun, Yuantong; Li, Xun; Ma, Wuren; Xiao, Shuo
    Description

    BackgroundWhile previous studies identified risk factors for diverse pregnancy outcomes, traditional statistical methods had limited ability to quantify their impacts on birth outcomes precisely. We aimed to use a novel approach that applied different machine learning models to not only predict birth outcomes but systematically quantify the impacts of pre- and post-conception serum thyroid-stimulating hormone (TSH) levels and other predictive characteristics on birth outcomes.MethodsWe used data from women who gave birth in Shanghai First Maternal and Infant Hospital from 2014 to 2015. We included 14,110 women with the measurement of preconception TSH in the first analysis and 3,428 out of 14,110 women with both pre- and post-conception TSH measurement in the second analysis. Synthetic Minority Over-sampling Technique (SMOTE) was applied to adjust the imbalance of outcomes. We randomly split (7:3) the data into a training set and a test set in both analyses. We compared Area Under Curve (AUC) for dichotomous outcomes and macro F1 score for categorical outcomes among four machine learning models, including logistic model, random forest model, XGBoost model, and multilayer neural network models to assess model performance. The model with the highest AUC or macro F1 score was used to quantify the importance of predictive features for adverse birth outcomes with the loss function algorithm.ResultsThe XGBoost model provided prominent advantages in terms of improved performance and prediction of polytomous variables. Predictive models with abnormal preconception TSH or not-well-controlled TSH, a novel indicator with pre- and post-conception TSH levels combined, provided the similar robust prediction for birth outcomes. The highest AUC of 98.7% happened in XGBoost model for predicting low Apgar score with not-well-controlled TSH adjusted. By loss function algorithm, we found that not-well-controlled TSH ranked 4th, 6th, and 7th among 14 features, respectively, in predicting birthweight, induction, and preterm birth, and 3rd among 19 features in predicting low Apgar score.ConclusionsOur four machine learning models offered valid predictions of birth outcomes in women during pre- and post-conception. The predictive features panel suggested the combined TSH indicator (not-well-controlled TSH) could be a potentially competitive biomarker to predict adverse birth outcomes.

  18. Credit_Card_Fraud

    • kaggle.com
    zip
    Updated May 18, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Oscar Yáñez Feijóo (2024). Credit_Card_Fraud [Dataset]. https://www.kaggle.com/datasets/oscaryezfeijo/credit-card-fraud
    Explore at:
    zip(8067891 bytes)Available download formats
    Dataset updated
    May 18, 2024
    Authors
    Oscar Yáñez Feijóo
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Credit Card Fraud Detection Introduction Credit card fraud detection is a critical challenge in the financial sector. This project aims to build a machine learning model to identify fraudulent credit card transactions using a comprehensive dataset.

    Dataset Overview The dataset contains transactions made by credit cards in September 2013 by European cardholders. It presents a significant class imbalance, with the majority of transactions being non-fraudulent.

    Features:

    Time: Seconds elapsed between this transaction and the first transaction in the dataset. V1 to V28: Anonymized features resulting from a PCA transformation. Amount: Transaction amount. Class: Target variable (1 for fraud, 0 for non-fraud). Steps Taken 1. Data Preprocessing Standardization: Standardized numeric features to improve model performance. Handling Imbalance: Applied SMOTE (Synthetic Minority Over-sampling Technique) to balance the dataset and ensure the model is well-trained on both classes. 2. Exploratory Data Analysis Correlation Analysis: Examined correlations between features to understand relationships and their potential impact on the model. 3. Model Building Algorithm Used: Random Forest Classifier, chosen for its robustness and high performance. Hyperparameter Tuning: Employed RandomizedSearchCV to find the best hyperparameters and enhance model accuracy. 4. Model Evaluation Confusion Matrix & Classification Report: Evaluated the model’s performance using key metrics such as precision, recall, F1-score, and overall accuracy. Feature Importance: Analyzed feature importances to identify which features contribute most to detecting fraud. Results The model achieved outstanding performance metrics:

    Accuracy: 100% Precision, Recall, F1-score: 1.00 for both classes Confusion Matrix: True Negatives (TN): 9906 False Positives (FP): 8 False Negatives (FN): 9 True Positives (TP): 9757 Conclusion This project demonstrates the effectiveness of machine learning in detecting fraudulent credit card transactions. The key steps, including data preprocessing, handling class imbalance, and hyperparameter tuning, were crucial in achieving high model performance. The feature importance analysis provided valuable insights into the key indicators of fraudulent activity.

    Check out the full code and detailed analysis in the GitHub Repository.

  19. f

    Model comparison using multiple metrics before balancing by SMOTE(Train-Test...

    • plos.figshare.com
    xls
    Updated Jan 9, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Daniel Niguse Mamo; Agmasie Damtew Walle; Eden Ketema Woldekidan; Jibril Bashir Adem; Yosef Haile Gebremariam; Meron Asmamaw Alemayehu; Ermias Bekele Enyew; Shimels Derso Kebede (2025). Model comparison using multiple metrics before balancing by SMOTE(Train-Test Split (80%-20%)). [Dataset]. http://doi.org/10.1371/journal.pdig.0000707.t003
    Explore at:
    xlsAvailable download formats
    Dataset updated
    Jan 9, 2025
    Dataset provided by
    PLOS Digital Health
    Authors
    Daniel Niguse Mamo; Agmasie Damtew Walle; Eden Ketema Woldekidan; Jibril Bashir Adem; Yosef Haile Gebremariam; Meron Asmamaw Alemayehu; Ermias Bekele Enyew; Shimels Derso Kebede
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Model comparison using multiple metrics before balancing by SMOTE(Train-Test Split (80%-20%)).

  20. f

    Data_Sheet_1_Effect of De-noising by Wavelet Filtering and Data Augmentation...

    • frontiersin.figshare.com
    pdf
    Updated May 30, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Min Jin; Chunguang Wang; Dan Børge Jensen (2023). Data_Sheet_1_Effect of De-noising by Wavelet Filtering and Data Augmentation by Borderline SMOTE on the Classification of Imbalanced Datasets of Pig Behavior.pdf [Dataset]. http://doi.org/10.3389/fanim.2021.666855.s001
    Explore at:
    pdfAvailable download formats
    Dataset updated
    May 30, 2023
    Dataset provided by
    Frontiers
    Authors
    Min Jin; Chunguang Wang; Dan Børge Jensen
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Classification of imbalanced datasets of animal behavior has been one of the top challenges in the field of animal science. An imbalanced dataset will lead many classification algorithms to being less effective and result in a higher misclassification rate for the minority classes. The aim of this study was to assess a method for addressing the problem of imbalanced datasets of pigs' behavior by using an over-sampling method, namely Borderline-SMOTE. The pigs' activity was measured using a triaxial accelerometer, which was mounted on the back of the pigs. Wavelet filtering and Borderline-SMOTE were both applied as methods to pre-process the dataset. A multilayer feed-forward neural network was trained and validated with 21 input features to classify four pig activities: lying, standing, walking, and exploring. The results showed that wavelet filtering and Borderline-SMOTE both lead to improved performance. Furthermore, Borderline-SMOTE yielded greater improvements in classification performance than an alternative method for balancing the training data, namely random under-sampling, which is commonly used in animal science research. However, the overall performance was not adequate to satisfy the research needs in this field and to address the common but urgent problem of imbalanced behavior dataset.

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
Seyed Iman Mohammadpour; Majid Khedmati; Mohammad Javad Hassan Zada (2023). The selected explanatory variables. [Dataset]. http://doi.org/10.1371/journal.pone.0281901.t002
Organization logo

The selected explanatory variables.

Related Article
Explore at:
xlsAvailable download formats
Dataset updated
Jun 21, 2023
Dataset provided by
PLOShttp://plos.org/
Authors
Seyed Iman Mohammadpour; Majid Khedmati; Mohammad Javad Hassan Zada
License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

While the cost of road traffic fatalities in the U.S. surpasses $240 billion a year, the availability of high-resolution datasets allows meticulous investigation of the contributing factors to crash severity. In this paper, the dataset for Trucks Involved in Fatal Accidents in 2010 (TIFA 2010) is utilized to classify the truck-involved crash severity where there exist different issues including missing values, imbalanced classes, and high dimensionality. First, a decision tree-based algorithm, the Synthetic Minority Oversampling Technique (SMOTE), and the Random Forest (RF) feature importance approach are employed for missing value imputation, minority class oversampling, and dimensionality reduction, respectively. Afterward, a variety of classification algorithms, including RF, K-Nearest Neighbors (KNN), Multi-Layer Perceptron (MLP), Gradient-Boosted Decision Trees (GBDT), and Support Vector Machine (SVM) are developed to reveal the influence of the introduced data preprocessing framework on the output quality of ML classifiers. The results show that the GBDT model outperforms all the other competing algorithms for the non-preprocessed crash data based on the G-mean performance measure, but the RF makes the most accurate prediction for the treated dataset. This finding indicates that after the feature selection is conducted to alleviate the computational cost of the machine learning algorithms, bagging (bootstrap aggregating) of decision trees in RF leads to a better model rather than boosting them via GBDT. Besides, the adopted feature importance approach decreases the overall accuracy by only up to 5% in most of the estimated models. Moreover, the worst class recall value of the RF algorithm without prior oversampling is only 34.4% compared to the corresponding value of 90.3% in the up-sampled model which validates the proposed multi-step preprocessing scheme. This study also identifies the temporal and spatial (roadway) attributes, as well as crash characteristics, and Emergency Medical Service (EMS) as the most critical factors in truck crash severity.

Search
Clear search
Close search
Google apps
Main menu