Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Classification results of classifiers using TF-IDF with SMOTE.
In recent years, the challenge of imbalanced data has become increasingly prominent in machine learning, affecting the performance of classification algorithms. This study proposes a novel data-level oversampling method, Cluster-Based Reduced Noise SMOTE (CRN-SMOTE), to address this issue. CRN-SMOTE combines SMOTE for oversampling minority classes with a novel cluster-based noise reduction technique. In this cluster-based approach, it is crucial that the samples of each class form one or two clusters, a property that conventional noise reduction methods do not achieve. The proposed method is evaluated on four imbalanced datasets (ILPD, QSAR, Blood, and Maternal Health Risk) using five metrics: Cohen’s kappa, Matthews correlation coefficient (MCC), F1-score, precision, and recall. Results demonstrate that CRN-SMOTE consistently outperformed the state-of-the-art Reduced Noise SMOTE (RN-SMOTE), SMOTE-Tomek Link, and SMOTE-ENN methods across all datasets, with particularly notable improvements on the QSAR and Maternal Health Risk datasets, indicating its effectiveness in enhancing imbalanced classification performance. Overall, the experimental findings indicate that CRN-SMOTE outperformed RN-SMOTE in 100% of the cases, achieving average improvements of 6.6% in kappa, 4.01% in MCC, 1.87% in F1-score, 1.7% in precision, and 2.05% in recall, with SMOTE’s number of nearest neighbors set to 5.
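The interpolation step at the core of all the SMOTE variants above can be sketched in a few lines of plain NumPy. This is a minimal illustration of vanilla SMOTE with k = 5 neighbors (as in the abstract), not the authors' CRN-SMOTE implementation; the toy minority points are invented:

```python
import numpy as np

def smote(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic minority samples by interpolating between
    each randomly chosen seed point and one of its k nearest minority
    neighbours (the basic SMOTE step)."""
    rng = np.random.default_rng(rng)
    n = len(X_min)
    # pairwise distances within the minority class; exclude self-matches
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nn = np.argsort(d, axis=1)[:, :k]          # k nearest neighbours per point
    seeds = rng.integers(0, n, n_new)          # seed point per synthetic sample
    nbrs = nn[seeds, rng.integers(0, k, n_new)]  # one neighbour per seed
    gap = rng.random((n_new, 1))               # interpolation factor in [0, 1)
    return X_min[seeds] + gap * (X_min[nbrs] - X_min[seeds])

# Hypothetical 2-D minority-class samples
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                  [1.0, 1.0], [0.5, 0.5], [0.2, 0.8]])
X_syn = smote(X_min, n_new=4, k=5, rng=0)
print(X_syn.shape)  # (4, 2)
```

Each synthetic point is a convex combination of two real minority points, so it always lies on a segment between existing samples rather than being a duplicate.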
Recent studies have shown that predictive models can supplement or provide alternatives to E. coli testing for assessing the potential presence of food safety hazards in water used for produce production. However, these studies used balanced training data and focused on enteric pathogens. As such, research is needed to determine 1) whether predictive models can be used to assess Listeria contamination of agricultural water, and 2) how resampling (to deal with imbalanced data) affects the performance of these models. To address these knowledge gaps, this study developed models that predict nonpathogenic Listeria spp. (excluding L. monocytogenes) and L. monocytogenes presence in agricultural water using various combinations of learner (e.g., random forest, regression), feature type, and resampling method (none, oversampling, SMOTE). Four feature types were used in model training: microbial, physicochemical, spatial, and weather. “Full models” were trained using all four feature types, while “nested models” used between one and three types. In total, 45 full (15 learners*3 resampling approaches) and 108 nested (5 learners*9 feature sets*3 resampling approaches) models were trained per outcome. Model performance was compared against baseline models in which E. coli concentration was the sole predictor. Overall, the machine learning models outperformed the baseline E. coli models, with random forests outperforming models built using other learners (e.g., rule-based learners). Resampling produced more accurate models than not resampling, with SMOTE models outperforming, on average, oversampling models. Regardless of resampling method, spatial and physicochemical water quality features drove accurate predictions for the nonpathogenic Listeria spp. and L. monocytogenes models, respectively. Overall, these findings 1) illustrate the need for alternatives to existing E. coli-based monitoring programs for assessing agricultural water for the presence of potential food safety hazards, and 2) suggest that predictive models may be one such alternative. Moreover, these findings provide a conceptual framework for how such models can be developed in the future, with the ultimate aim of developing models that can be integrated into on-farm risk management programs. For example, future studies should consider using random forest learners, SMOTE resampling, and spatial features to develop models to predict the presence of foodborne pathogens, such as L. monocytogenes, in agricultural water when the training data are imbalanced.
Class imbalance is a common challenge in classification tasks that aim to detect medical events that are particularly infrequent. Apnoea is an example of such events. This challenge can, however, be mitigated using class rebalancing algorithms. This work investigated 10 widely used data-level class imbalance mitigation methods with the aim of building a random forest (RF) model that detects apnoea events from photoplethysmography (PPG) signals acquired from the neck. Those methods are random undersampling (RandUS), random oversampling (RandOS), condensed nearest-neighbors (CNNUS), edited nearest-neighbors (ENNUS), Tomek’s links (TomekUS), synthetic minority oversampling technique (SMOTE), Borderline-SMOTE (BLSMOTE), adaptive synthetic oversampling (ADASYN), SMOTE with TomekUS (SMOTETomek) and SMOTE with ENNUS (SMOTEENN). Feature-space transformation using PCA and KernelPCA was also examined as a potential way of providing better representations of the data for the class rebalancing methods to operate on. This work showed that RandUS is the best option for improving the sensitivity score (by up to 11%). However, it could hinder the overall accuracy due to the reduced amount of training data. On the other hand, augmenting the data with new artificial data points was shown to be a non-trivial task that needs further development, especially in the presence of subject dependencies, as was the case in this work.
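Random undersampling (RandUS), the best performer for sensitivity in the study above, is the simplest of the ten methods: it discards majority-class samples at random until the classes are balanced. A minimal plain-NumPy sketch (studies like this typically use the imbalanced-learn implementation instead); the toy labels are invented:

```python
import numpy as np

def random_undersample(X, y, rng=None):
    """Drop samples at random so every class is reduced to the size of
    the smallest class (random undersampling)."""
    rng = np.random.default_rng(rng)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    return X[keep], y[keep]

# Hypothetical 8:2 imbalanced toy data
X = np.arange(20).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)
Xb, yb = random_undersample(X, y, rng=0)
print(np.bincount(yb))  # [2 2]
```

The trade-off noted in the abstract is visible here: balancing by deletion throws away three-quarters of the majority class, which is exactly why overall accuracy can suffer.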
Background: Burnout is usually defined as a state of emotional, physical, and mental exhaustion that affects people in various professions (e.g. physicians, nurses, teachers). The consequences of burnout include decreased motivation, reduced productivity, and diminished overall well-being. The machine learning-based prediction of burnout has therefore become the focus of recent research. In this study, the aim was to detect burnout using machine learning and to identify its most important predictors in a sample of Hungarian high-school teachers. Methods: The final sample consisted of 1,576 high-school teachers (522 male), who completed a survey including various sociodemographic and health-related questions and psychological questionnaires. Specifically, depression, insomnia, internet habits (e.g. when and why one uses the internet) and problematic internet usage were among the most important predictors tested in this study. Supervised classification algorithms were trained to detect burnout assessed by two well-known burnout questionnaires. Feature selection was conducted using recursive feature elimination. Hyperparameters were tuned via grid search with 5-fold cross-validation. Due to class imbalance, class weights (i.e. cost-sensitive learning), downsampling and a hybrid method (SMOTE-ENN) were applied in separate analyses. The final model evaluation was carried out on a previously unseen holdout test sample. Results: Burnout was detected in 19.7% of the teachers included in the final dataset. The best predictive performance on the holdout test sample was achieved by random forest with class weights (AUC = .811; balanced accuracy = .745; sensitivity = .765; specificity = .726). The best predictors of burnout were Beck’s Depression Inventory scores, Athens Insomnia Scale scores, subscales of the Problematic Internet Use Questionnaire and self-reported current health status. Conclusions: The performances of the algorithms were comparable with those of previous studies; however, it is important to note that we tested our models on previously unseen holdout samples, suggesting a higher level of generalizability. Another remarkable finding is that besides depression and insomnia, other variables such as problematic internet use and time spent online also turned out to be important predictors of burnout.
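The cost-sensitive learning used in the winning model reweights classes inversely to their frequency rather than resampling the data. The common "balanced" heuristic (as implemented in scikit-learn's class_weight='balanced') can be computed by hand; the 19.7% positive rate below mirrors the burnout prevalence reported above, with otherwise hypothetical labels:

```python
import numpy as np

def balanced_class_weights(y):
    """'Balanced' class-weight heuristic: w_c = n_samples / (n_classes * n_c),
    so rarer classes receive proportionally larger weights."""
    classes, counts = np.unique(y, return_counts=True)
    w = len(y) / (len(classes) * counts)
    return dict(zip(classes.tolist(), w.tolist()))

# Hypothetical labels with a 19.7% positive (burnout) rate
y = np.array([1] * 197 + [0] * 803)
weights = balanced_class_weights(y)
print(weights)  # minority class 1 gets the larger weight
```

Each training error on a minority sample then costs roughly four times as much as one on a majority sample, pushing the classifier toward higher sensitivity without discarding or synthesizing any data.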
Diagnosing human knee abnormalities from the surface electromyography (sEMG) signal obtained from lower limb muscles with machine learning is challenging due to the noisy nature of the sEMG signal and the imbalance between data from healthy and knee-abnormal subjects. To address this challenge, a combination of wavelet decomposition (WD) with ensemble empirical mode decomposition (EEMD) and the Synthetic Minority Oversampling Technique (S-WD-EEMD) is proposed. In this study, a hybrid WD-EEMD is used to minimize the noise introduced into the sEMG signal during collection, while the Synthetic Minority Oversampling Technique (SMOTE) is used to balance the data by increasing the minority-class samples during the training of machine learning techniques. The findings indicate that the hybrid WD-EEMD with SMOTE oversampling enhances the efficacy of the examined classifiers on the imbalanced sEMG data. The F-score of the Extra Trees classifier, when utilizing WD-EEMD signal processing with SMOTE oversampling, is 98.4%, whereas without SMOTE oversampling it is 95.1%.
The increase in the number of new chemicals synthesized in recent decades has resulted in constant growth in the development and application of computational models for predicting the activity and safety profiles of chemicals. Much of the time, such computational models and their applications must deal with imbalanced chemical data. It is indeed a challenge to construct a classifier using an imbalanced data set. In this study, we analyzed and validated the importance of different sampling methods over the non-sampling approach for achieving a well-balanced sensitivity and specificity in a machine learning model trained on imbalanced chemical data. Additionally, this study achieved an accuracy of 93.00%, an AUC of 0.94, an F1 measure of 0.90, a sensitivity of 96.00% and a specificity of 91.00% using SMOTE sampling and a Random Forest classifier for the prediction of Drug-Induced Liver Injury (DILI). Our results suggest that, irrespective of the data set used, sampling methods can have a major influence on reducing the gap between the sensitivity and specificity of a model. This study demonstrates the efficacy of different sampling methods for the class imbalance problem using binary chemical data sets.
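Sensitivity and specificity, whose gap this study aims to close, come straight off the confusion matrix. The counts below are hypothetical, chosen only so that the rates reproduce the 96%/91% figures reported above:

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical counts on a 200-sample test set (not from the study)
sens, spec = sensitivity_specificity(tp=96, fn=4, tn=91, fp=9)
print(sens, spec)  # 0.96 0.91
```

A model trained on heavily imbalanced data without resampling tends to inflate one of these rates at the expense of the other, which is the gap the sampling methods are shown to narrow.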
Background: The Department of Rehabilitation Medicine is key to improving patients’ quality of life. Driven by chronic diseases and an aging population, there is a need to enhance the efficiency and resource allocation of outpatient facilities. This study aims to analyze the treatment preferences of outpatient rehabilitation patients by using data and a grading tool to establish predictive models. The goal is to improve patient visit efficiency and optimize resource allocation through these predictive models. Methods: Data were collected from 38 Chinese institutions, including 4,244 patients visiting outpatient rehabilitation clinics. Data processing was conducted using Python software. The pandas library was used for data cleaning and preprocessing, involving 68 categorical and 12 continuous variables. The steps included handling missing values, data normalization, and encoding conversion. The data were divided into 80% training and 20% test sets using the Scikit-learn library to ensure model independence and prevent overfitting. Performance comparisons among XGBoost, random forest, and logistic regression were conducted using metrics including accuracy and receiver operating characteristic (ROC) curves. The imbalanced-learn library’s SMOTE technique was used to address the sample imbalance during model training. The model was optimized using a confusion matrix and feature importance analysis, and partial dependence plots (PDP) were used to analyze the key influencing factors. Results: XGBoost achieved the highest overall accuracy of 80.21%, with high precision and recall in Category 1. Random forest showed a similar overall accuracy. Logistic regression had a significantly lower accuracy, indicating difficulties with nonlinear data. The key influencing factors identified include distance to medical institutions, arrival time, length of hospital stay, and specific diseases, such as cardiovascular, pulmonary, oncological, and orthopedic conditions. The tiered diagnosis and treatment tool effectively helped doctors assess patients’ conditions and recommend suitable medical institutions based on rehabilitation grading. Conclusion: This study confirmed that ensemble learning methods, particularly XGBoost, outperform single models in classification tasks involving complex datasets. Addressing class imbalance and enhancing feature engineering can further improve model performance. Understanding patient preferences and the factors influencing medical institution selection can guide healthcare policies to optimize resource allocation, improve service quality, and enhance patient satisfaction. Tiered diagnosis and treatment tools play a crucial role in helping doctors evaluate patient conditions and make informed recommendations for appropriate medical care.
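An 80/20 split like the one described is usually made stratified, so both sets preserve the class ratio. A minimal NumPy sketch (the study used scikit-learn's utilities); one practical point worth making explicit is that SMOTE should be fitted on the training indices only, never on the held-out test set:

```python
import numpy as np

def stratified_split(y, test_frac=0.2, rng=None):
    """Return train/test index arrays with per-class proportions preserved."""
    rng = np.random.default_rng(rng)
    train, test = [], []
    for c in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == c))
        n_test = int(round(test_frac * len(idx)))
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return np.array(train), np.array(test)

# Hypothetical labels with an 80:20 class ratio
y = np.array([0] * 80 + [1] * 20)
tr, te = stratified_split(y, test_frac=0.2, rng=0)
print(len(tr), len(te))  # 80 20
```

Here the 20-sample test set keeps exactly 4 minority cases, mirroring the 20% minority rate of the full data; resampling only the training indices afterward keeps the test set an honest estimate of real-world performance.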
In Uganda, the absence of a unified dataset for constructing machine learning models to predict Foot and Mouth Disease outbreaks hinders preparedness. Although machine learning models exhibit excellent predictive performance for Foot and Mouth Disease outbreaks under stationary conditions, they are susceptible to performance degradation in non-stationary environments. Rainfall and temperature are key factors influencing these outbreaks, and their variability due to climate change can significantly impact predictive performance. This study created a unified Foot and Mouth Disease dataset by integrating disparate sources and pre-processing data using mean imputation, duplicate removal, visualization, and merging techniques. To evaluate performance degradation, seven machine learning models were trained and assessed using metrics including accuracy, area under the receiver operating characteristic curve, recall, precision and F1-score. The dataset showed a significant class imbalance with more non-outbreaks than outbreaks, requiring data augmentation methods. Variability in rainfall and temperature impacted predictive performance, causing notable degradation. Random Forest with borderline SMOTE was the top-performing model in a stationary environment, achieving 92% accuracy, 0.97 area under the receiver operating characteristic curve, 0.94 recall, 0.90 precision, and 0.92 F1-score. However, under varying distributions, all models exhibited significant performance degradation, with random forest accuracy dropping to 46%, area under the receiver operating characteristic curve to 0.58, recall to 0.03, precision to 0.24, and F1-score to 0.06. This study underscores the creation of a unified Foot and Mouth Disease dataset for Uganda and reveals significant performance degradation in seven machine learning models under varying distributions. These findings highlight the need for new methods to address the impact of distribution variability on predictive performance.
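The F1-scores quoted above are the harmonic mean of precision and recall, which is why the collapse in recall under distribution shift drags F1 down so sharply even when precision falls less. A quick check against the stationary-environment figures:

```python
def f1(precision, recall):
    """F1-score: the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Stationary-environment figures from the abstract
print(round(f1(0.90, 0.94), 2))  # 0.92
```

With the degraded figures (precision 0.24, recall 0.03) the same formula yields roughly 0.05, illustrating how the harmonic mean is dominated by the weaker of the two components.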
Classification results of machine learning models using CNN features.
The Synthetic Minority Over-sampling Technique (SMOTE) is a machine learning approach to addressing class imbalance in datasets, and it is beneficial for identifying antimicrobial resistance (AMR) patterns. In AMR studies, datasets often contain more susceptible isolates than resistant ones, leading to biased model performance. SMOTE overcomes this issue by generating synthetic samples of the minority class (resistant isolates) through interpolation rather than simple duplication, thereby improving model generalization. When applied to AMR prediction, SMOTE enhances the ability of classification models to accurately identify resistant Escherichia coli strains by balancing the dataset, ensuring that machine learning algorithms do not overlook rare resistance patterns. It is commonly used with classifiers such as decision trees, support vector machines (SVM), and deep learning models to improve predictive accuracy. By mitigating class imbalance, SMOTE enables robust AMR detection, aiding the early identification of drug-resistant bacteria and informing antibiotic stewardship efforts. Supervised machine learning is widely used in Escherichia coli genomic analysis to predict antimicrobial resistance, virulence factors, and strain classification. By training models on labeled genomic data (e.g., the presence or absence of resistance genes, SNP profiles, or MLST types), these classifiers help identify patterns and make accurate predictions. Ten supervised machine learning classifiers for E. coli genome analysis:
Logistic regression (LR): a simple yet effective statistical model for binary classification, such as predicting antibiotic resistance or susceptibility in E. coli.
Linear support vector machine (Linear SVM): finds the optimal hyperplane to separate E. coli strains based on genomic features such as gene presence or sequence variations.
Radial basis function kernel support vector machine (RBF-SVM): a more flexible version of SVM that uses a non-linear kernel to capture complex relationships in genomic data, improving classification accuracy.
Extra trees classifier: a tree-based ensemble method that enhances classification by randomly selecting features and thresholds, improving robustness in E. coli strain differentiation.
Random forest (RF): an ensemble learning method that constructs multiple decision trees, reducing overfitting and improving prediction accuracy for resistance genes and virulence factors.
AdaBoost: a boosting algorithm that combines weak classifiers iteratively, refining predictions and improving the identification of antimicrobial resistance patterns.
XGBoost: an optimized gradient boosting algorithm that efficiently handles large genomic datasets, commonly used for high-accuracy predictions in E. coli classification.
Naïve Bayes (NB): a probabilistic classifier based on Bayes’ theorem, suitable for predicting resistance phenotypes from genomic features.
Linear discriminant analysis (LDA): a statistical approach that maximizes class separability, helping distinguish between resistant and susceptible E. coli strains.
Quadratic discriminant analysis (QDA): a variation of LDA that allows for non-linear decision boundaries, improving classification in datasets with complex genomic structures.
When applied to E. coli genomes, these classifiers help predict antibiotic resistance, track outbreak strains, and understand genomic adaptations. Combining them with feature selection and optimization techniques enhances accuracy, making them valuable tools in bacterial genomics and clinical research.
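As a concrete instance of one of the classifiers listed, a minimal naïve Bayes over binary gene presence/absence features can be written directly from Bayes' theorem. This is an illustrative NumPy sketch, not a genomics pipeline; the two-gene toy data is invented:

```python
import numpy as np

class TinyBernoulliNB:
    """Naive Bayes for binary feature matrices (e.g. gene presence/absence)."""

    def fit(self, X, y):
        self.classes = np.unique(y)
        # Laplace-smoothed estimate of P(gene present | class)
        self.theta = np.array([(X[y == c].sum(0) + 1) / ((y == c).sum() + 2)
                               for c in self.classes])
        self.log_prior = np.log([(y == c).mean() for c in self.classes])
        return self

    def predict(self, X):
        # log P(x | c) for Bernoulli features, plus the class log prior
        ll = X @ np.log(self.theta).T + (1 - X) @ np.log(1 - self.theta).T
        return self.classes[np.argmax(ll + self.log_prior, axis=1)]

# Toy data: gene 0 marks "resistant" (label 1); gene 1 is uninformative
X = np.array([[1, 0], [1, 1], [0, 0], [0, 1]])
y = np.array([1, 1, 0, 0])
model = TinyBernoulliNB().fit(X, y)
print(model.predict(np.array([[1, 0], [0, 1]])))  # [1 0]
```

The model correctly attributes predictive weight to gene 0 and ignores gene 1, which is the independence assumption that makes naïve Bayes cheap on high-dimensional presence/absence data.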
Hyperparameter details of all machine learning models.
Classification results of classifiers using fastText.
Example of different sentiments from the citation sentiment corpus.
Values are presented as means with the standard deviations in parentheses.
Background: High-risk chest pain is a critical presentation in emergency departments, frequently indicative of life-threatening cardiopulmonary conditions. Rapid and accurate diagnosis is pivotal for improving patient survival rates. Methods: We developed a machine learning prediction model using the MIMIC-IV database (n = 14,716 patients, including 1,302 high-risk cases). To address class imbalance, we implemented feature engineering with SMOTE and under-sampling techniques. Model optimization was performed via Bayesian hyperparameter tuning. Seven algorithms were evaluated: logistic regression, random forest, SVM, XGBoost, LightGBM, TabTransformer, and TabNet. Results: The LightGBM model demonstrated superior performance, with accuracy = 0.95, precision = 0.95, recall = 0.95, and F1-score = 0.94. SHAP analysis revealed maximum troponin and creatine kinase-MB levels as the top predictive features. Conclusion: Our optimized LightGBM model provides clinically significant predictive capability for high-risk chest pain, offering emergency physicians a decision-support tool to enhance diagnostic accuracy and patient outcomes.
Strengths and weaknesses of feature representation techniques.
Background: Laparoscopic total mesorectal excision (LaTME) is a standard surgical method for rectal cancer, and the LaTME operation is a challenging procedure. This study aimed to use machine learning to develop and validate prediction models for the surgical difficulty of LaTME in patients with rectal cancer and to compare the models’ performance. Methods: We retrospectively collected the preoperative clinical and MRI pelvimetry parameters of rectal cancer patients who underwent laparoscopic total mesorectal excision from 2017 to 2022. The difficulty of LaTME was defined according to the scoring criteria reported by Escal. Patients were randomly divided into a training group (80%) and a test group (20%). We selected independent influencing features using the least absolute shrinkage and selection operator (LASSO) and multivariate logistic regression. The synthetic minority oversampling technique (SMOTE) was adopted to alleviate the class imbalance problem. Six machine learning models were developed: light gradient boosting machine (LGBM), categorical boosting (CatBoost), extreme gradient boosting (XGBoost), logistic regression (LR), random forests (RF), and multilayer perceptron (MLP). The area under the receiver operating characteristic curve (AUROC), accuracy, sensitivity, specificity and F1 score were used to evaluate model performance. Shapley Additive Explanations (SHAP) analysis provided interpretation for the best machine learning model, and decision curve analysis (DCA) was used to evaluate the clinical utility of the model. Results: A total of 626 patients were included. LASSO regression analysis showed that tumor height, prognostic nutrition index (PNI), pelvic inlet, pelvic outlet, sacrococcygeal distance, mesorectal fat area and angle 5 (the angle between the apex of the sacral angle and the lower edge of the pubic bone) were the predictor variables of the machine learning model. In addition, the correlation heatmap showed no significant correlation among these seven variables. When predicting the difficulty of LaTME surgery, the XGBoost model performed best among the six machine learning models (AUROC = 0.855). Based on the decision curve analysis (DCA) results, the XGBoost model was also superior, and feature importance analysis showed that tumor height was the most important variable among the seven factors. Conclusions: This study developed an XGBoost model to predict the difficulty of LaTME surgery. This model can help clinicians quickly and accurately predict the difficulty of surgery and adopt individualized surgical methods.
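AUROC, the headline metric here, has a simple probabilistic reading: it is the chance that a randomly chosen positive case receives a higher score than a randomly chosen negative one (the Mann-Whitney statistic). A small sketch with invented scores:

```python
def auroc(scores_pos, scores_neg):
    """AUROC = P(score_pos > score_neg), counting ties as half a win
    (the Mann-Whitney U formulation)."""
    wins = sum((p > n) + 0.5 * (p == n)
               for p in scores_pos for n in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))

# Hypothetical model scores, not values from the study
auc = auroc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])
print(auc)  # 8/9, i.e. about 0.889
```

Because it depends only on the ranking of scores, AUROC is insensitive to the decision threshold, which is why studies like this pair it with threshold-dependent metrics such as sensitivity and specificity.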
Summary of the evaluation metrics (AUROC, Accuracy, Precision, Recall, F-Score) for the prediction models on the test set.
Landslides are among the most destructive geological hazards worldwide, and susceptibility assessment is a critical component of regional landslide risk management. To address the limitations of conventional methods in spatial representation, class imbalance handling, and temporal feature extraction, this study proposes a comprehensive optimization framework, Buffer-SMOTE-Transformer (BST). The framework integrates geospatial buffer sampling to refine negative sample selection, employs the SMOTE algorithm to effectively resolve class imbalance, and incorporates a weighted hybrid Transformer network to enhance the modeling of complex geographical features. An empirical analysis conducted in China's Guangdong Province demonstrates that the BST model reveals the varying impacts of sample selection, dataset construction, and model choice on assessment outcomes. The framework achieves significant superiority over conventional machine learning methods (Random Forest, LGB) on key metrics, with AUC reaching 0.964 and recall attaining 0.953. These findings not only elucidate the cascading amplification effects of comprehensive optimization in susceptibility modeling but also establish a novel technical pathway for large-regional-scale geological hazard risk assessment.