89 datasets found
  1. A comparison of the RN-SMOTE, SMOTE-Tomek Link, SMOTE-ENN, and the proposed CRN-SMOTE methods on the Blood and Health-risk datasets is presented, based on various classification metrics using the Random Forest classifier.

    • plos.figshare.com
    xls
    Updated Feb 10, 2025
    Cite
    Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari (2025). A comparison of the RN-SMOTE, SMOTE-Tomek Link, SMOTE-ENN, and the proposed CRN-SMOTE methods on the Blood and Health-risk datasets is presented, based on various classification metrics using the Random Forest classifier. [Dataset]. http://doi.org/10.1371/journal.pone.0317396.t008
    Available download formats: xls
    Dataset updated
    Feb 10, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A comparison of the RN-SMOTE, SMOTE-Tomek Link, SMOTE-ENN, and the proposed CRN-SMOTE methods on the Blood and Health-risk datasets is presented, based on various classification metrics using the Random Forest classifier.

  2. The used datasets with their details.

    • plos.figshare.com
    xls
    Updated Feb 10, 2025
    + more versions
    Cite
    Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari (2025). The used datasets with their details. [Dataset]. http://doi.org/10.1371/journal.pone.0317396.t001
    Available download formats: xls
    Dataset updated
    Feb 10, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In recent years, the challenge of imbalanced data has become increasingly prominent in machine learning, affecting the performance of classification algorithms. This study proposes a novel data-level oversampling method called Cluster-Based Reduced Noise SMOTE (CRN-SMOTE) to address this issue. CRN-SMOTE combines SMOTE for oversampling minority classes with a novel cluster-based noise reduction technique. In this cluster-based noise reduction approach, it is crucial that samples from each category form one or two clusters, a feature that conventional noise reduction methods do not achieve. The proposed method is evaluated on four imbalanced datasets (ILPD, QSAR, Blood, and Maternal Health Risk) using five metrics: Cohen's kappa, Matthews correlation coefficient (MCC), F1-score, precision, and recall. Results demonstrate that CRN-SMOTE consistently outperformed the state-of-the-art Reduced Noise SMOTE (RN-SMOTE), SMOTE-Tomek Link, and SMOTE-ENN methods across all datasets, with particularly notable improvements observed in the QSAR and Maternal Health Risk datasets, indicating its effectiveness in enhancing imbalanced classification performance. Overall, the experimental findings indicate that CRN-SMOTE outperformed RN-SMOTE in 100% of the cases, achieving average improvements of 6.6% in Kappa, 4.01% in MCC, 1.87% in F1-score, 1.7% in precision, and 2.05% in recall, with the number of SMOTE neighbors set to 5.
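
    The evaluation protocol summarized in this abstract (SMOTE-style oversampling with 5 neighbors, a Random Forest classifier, and Cohen's kappa, MCC, F1, precision, and recall) can be approximated with the short Python sketch below. It uses imbalanced-learn's stock SMOTE as a stand-in; it is not the authors' CRN-SMOTE implementation, and the feature matrix X and labels y are assumed to come from one of the paper's imbalanced datasets.

    ```python
    # Minimal sketch: plain SMOTE (k_neighbors=5) + Random Forest, scored with the
    # metrics named in the abstract. NOT the authors' CRN-SMOTE code.
    from imblearn.over_sampling import SMOTE
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import (cohen_kappa_score, matthews_corrcoef,
                                 f1_score, precision_score, recall_score)

    def evaluate_smote_rf(X, y, random_state=42):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.2, stratify=y, random_state=random_state)

        # Oversample only the training split so the test set stays untouched.
        smote = SMOTE(k_neighbors=5, random_state=random_state)
        X_res, y_res = smote.fit_resample(X_tr, y_tr)

        clf = RandomForestClassifier(n_estimators=200, random_state=random_state)
        clf.fit(X_res, y_res)
        y_pred = clf.predict(X_te)

        return {
            "kappa": cohen_kappa_score(y_te, y_pred),
            "mcc": matthews_corrcoef(y_te, y_pred),
            "f1": f1_score(y_te, y_pred, average="macro"),
            "precision": precision_score(y_te, y_pred, average="macro"),
            "recall": recall_score(y_te, y_pred, average="macro"),
        }
    ```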

  3. Performance of machine learning models on test set using the SMOTE-adjusted balanced training set.

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated Dec 7, 2023
    Cite
    Lee, Carl; Bashyal, Suraj; Bhandari, Ramesh; Budhathoki, Nirajan (2023). Performance of machine learning models on test set using the SMOTE-adjusted balanced training set. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001031532
    Dataset updated
    Dec 7, 2023
    Authors
    Lee, Carl; Bashyal, Suraj; Bhandari, Ramesh; Budhathoki, Nirajan
    Description

    Performance of machine learning models on test set using the SMOTE-adjusted balanced training set.

  4. Performance of machine learning models using SMOTE-balanced dataset.

    • plos.figshare.com
    • datasetcatalog.nlm.nih.gov
    xls
    Updated Nov 8, 2023
    Cite
    Nihal Abuzinadah; Muhammad Umer; Abid Ishaq; Abdullah Al Hejaili; Shtwai Alsubai; Ala’ Abdulmajid Eshmawi; Abdullah Mohamed; Imran Ashraf (2023). Performance of machine learning models using SMOTE-balanced dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0293061.t004
    Available download formats: xls
    Dataset updated
    Nov 8, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Nihal Abuzinadah; Muhammad Umer; Abid Ishaq; Abdullah Al Hejaili; Shtwai Alsubai; Ala’ Abdulmajid Eshmawi; Abdullah Mohamed; Imran Ashraf
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Performance of machine learning models using SMOTE-balanced dataset.

  5. Classification results of machine learning models using TF-IDF with SMOTE.

    • plos.figshare.com
    • datasetcatalog.nlm.nih.gov
    xls
    Updated Jun 14, 2023
    Cite
    Eysha Saad; Saima Sadiq; Ramish Jamil; Furqan Rustam; Arif Mehmood; Gyu Sang Choi; Imran Ashraf (2023). Classification results of machine learning models using TF-IDF with SMOTE. [Dataset]. http://doi.org/10.1371/journal.pone.0270327.t006
    Available download formats: xls
    Dataset updated
    Jun 14, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Eysha Saad; Saima Sadiq; Ramish Jamil; Furqan Rustam; Arif Mehmood; Gyu Sang Choi; Imran Ashraf
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Classification results of machine learning models using TF-IDF with SMOTE.

  6. Summary table: Oversampling techniques using SMOTE, ADASYN, and weighted rare classes.

    • plos.figshare.com
    xls
    Updated Nov 16, 2023
    Cite
    Alaa Alomari; Hossam Faris; Pedro A. Castillo (2023). Summary table: Oversampling techniques using SMOTE, ADASYN, and weighted rare classes. [Dataset]. http://doi.org/10.1371/journal.pone.0290581.t007
    Available download formats: xls
    Dataset updated
    Nov 16, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Alaa Alomari; Hossam Faris; Pedro A. Castillo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Summary table: Oversampling techniques using SMOTE, ADASYN, and weighted rare classes.

  7. A comparison of the CRN-SMOTE and RN-SMOTE methods on the health risk dataset based on different classification metrics using the Random Forest classifier.

    • plos.figshare.com
    xls
    Updated Feb 10, 2025
    Cite
    Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari (2025). A comparison of the CRN-SMOTE and RN-SMOTE methods on the health risk dataset based on different classification metrics using the Random Forest classifier. [Dataset]. http://doi.org/10.1371/journal.pone.0317396.t009
    Available download formats: xls
    Dataset updated
    Feb 10, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Javad Hemmatian; Rassoul Hajizadeh; Fakhroddin Nazari
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A comparison of the CRN-SMOTE and RN-SMOTE methods on the health risk dataset based on different classification metrics using the Random Forest classifier.

  8. The dataset used in this study.

    • plos.figshare.com
    zip
    Updated Jun 21, 2023
    + more versions
    Cite
    Seyed Iman Mohammadpour; Majid Khedmati; Mohammad Javad Hassan Zada (2023). The dataset used in this study. [Dataset]. http://doi.org/10.1371/journal.pone.0281901.s001
    Available download formats: zip
    Dataset updated
    Jun 21, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Seyed Iman Mohammadpour; Majid Khedmati; Mohammad Javad Hassan Zada
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    While the cost of road traffic fatalities in the U.S. surpasses $240 billion a year, the availability of high-resolution datasets allows meticulous investigation of the contributing factors to crash severity. In this paper, the dataset for Trucks Involved in Fatal Accidents in 2010 (TIFA 2010) is utilized to classify the truck-involved crash severity where there exist different issues including missing values, imbalanced classes, and high dimensionality. First, a decision tree-based algorithm, the Synthetic Minority Oversampling Technique (SMOTE), and the Random Forest (RF) feature importance approach are employed for missing value imputation, minority class oversampling, and dimensionality reduction, respectively. Afterward, a variety of classification algorithms, including RF, K-Nearest Neighbors (KNN), Multi-Layer Perceptron (MLP), Gradient-Boosted Decision Trees (GBDT), and Support Vector Machine (SVM) are developed to reveal the influence of the introduced data preprocessing framework on the output quality of ML classifiers. The results show that the GBDT model outperforms all the other competing algorithms for the non-preprocessed crash data based on the G-mean performance measure, but the RF makes the most accurate prediction for the treated dataset. This finding indicates that after the feature selection is conducted to alleviate the computational cost of the machine learning algorithms, bagging (bootstrap aggregating) of decision trees in RF leads to a better model rather than boosting them via GBDT. Besides, the adopted feature importance approach decreases the overall accuracy by only up to 5% in most of the estimated models. Moreover, the worst class recall value of the RF algorithm without prior oversampling is only 34.4% compared to the corresponding value of 90.3% in the up-sampled model which validates the proposed multi-step preprocessing scheme. This study also identifies the temporal and spatial (roadway) attributes, as well as crash characteristics, and Emergency Medical Service (EMS) as the most critical factors in truck crash severity.
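
    A minimal sketch of the preprocessing chain summarized above (SMOTE oversampling of the minority severity class followed by Random Forest feature-importance selection), assuming scikit-learn and imbalanced-learn; the TIFA 2010 table, the tree-based missing-value imputation, and the tuned hyperparameters are not included here.

    ```python
    # Sketch: balance crash-severity classes with SMOTE, then reduce dimensionality
    # by keeping the features the Random Forest ranks as most important.
    from imblearn.over_sampling import SMOTE
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import SelectFromModel

    def preprocess_crash_data(X, y, random_state=0):
        # Step 1: oversample the minority class (missing-value imputation is
        # assumed to have been done beforehand, e.g. with a tree-based imputer).
        X_res, y_res = SMOTE(random_state=random_state).fit_resample(X, y)

        # Step 2: dimensionality reduction via RF feature importances.
        rf = RandomForestClassifier(n_estimators=300, random_state=random_state)
        selector = SelectFromModel(rf, threshold="median").fit(X_res, y_res)
        X_sel = selector.transform(X_res)
        return X_sel, y_res, selector.get_support()
    ```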

  9. Classification result classifiers using TF-IDF with SMOTE.

    • plos.figshare.com
    • datasetcatalog.nlm.nih.gov
    xls
    Updated May 28, 2024
    + more versions
    Cite
    Khaled Alnowaiser (2024). Classification result classifiers using TF-IDF with SMOTE. [Dataset]. http://doi.org/10.1371/journal.pone.0302304.t007
    Available download formats: xls
    Dataset updated
    May 28, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Khaled Alnowaiser
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Classification result classifiers using TF-IDF with SMOTE.

  10. Comparison of model evaluation indicators.

    • plos.figshare.com
    • datasetcatalog.nlm.nih.gov
    xls
    Updated May 21, 2025
    + more versions
    Cite
    Ming-zhou Lv; Kun-lun Li; Jia-zeng Cai; Jun Mao; Jia-jun Gao; Hui Xu (2025). Comparison of model evaluation indicators. [Dataset]. http://doi.org/10.1371/journal.pone.0323487.t005
    Available download formats: xls
    Dataset updated
    May 21, 2025
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Ming-zhou Lv; Kun-lun Li; Jia-zeng Cai; Jun Mao; Jia-jun Gao; Hui Xu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Landslides are frequent and hazardous geological disasters, posing significant risks to human safety and infrastructure. Accurate assessments of landslide susceptibility are crucial for risk management and mitigation. However, geological surveys of landslide areas are typically conducted at the township level, have low sample sizes, and rely on experience. This study proposes a framework for assessing landslide susceptibility in Taiping Township, Zhejiang Province, China, using data balancing, machine learning, and data from 1,325 slope units with nine slope characteristics. The dataset was balanced using the Synthetic Minority Oversampling Technique and the Tomek link undersampling method (SMOTE-Tomek). A comparative analysis of six machine learning models was performed, and the SHapley Additive exPlanation (SHAP) method was used to assess the influencing factors. The results indicate that the machine learning algorithms provide high accuracy, and the random forest (RF) algorithm achieves the optimum model accuracy (0.791, F1 = 0.723). The very low, low, medium, and high sensitivity zones account for 92.27%, 5.12%, 1.78%, and 0.83% of the area, respectively. The height of cut slopes has the most significant impact on landslide sensitivity, whereas the altitude has a minor impact. The proposed model accurately assesses landslide susceptibility at the township scale, providing valuable insights for risk management and mitigation.
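
    A minimal sketch of the balancing and classification step described above, assuming imbalanced-learn's SMOTETomek and a scikit-learn Random Forest; the slope-unit feature table and the SHAP analysis are omitted, and the reported accuracy/F1 values come from the authors' own data.

    ```python
    # Sketch: SMOTE oversampling combined with Tomek-link cleaning, then RF.
    from imblearn.combine import SMOTETomek
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score, f1_score

    def landslide_rf(X, y, random_state=0):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.3, stratify=y, random_state=random_state)
        # Balance only the training split.
        X_bal, y_bal = SMOTETomek(random_state=random_state).fit_resample(X_tr, y_tr)
        rf = RandomForestClassifier(n_estimators=500, random_state=random_state)
        rf.fit(X_bal, y_bal)
        y_pred = rf.predict(X_te)
        return accuracy_score(y_te, y_pred), f1_score(y_te, y_pred)
    ```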

  11. Classifier in terms of different performance metrics with different pre-processing techniques with SMOTE.

    • figshare.com
    xls
    Updated May 31, 2024
    Cite
    Ankit Vijayvargiya; Aparna Sinha; Naveen Gehlot; Ashutosh Jena; Rajesh Kumar; Kieran Moran (2024). Classifier in terms of different performance metrics with different pre-processing techniques with SMOTE. [Dataset]. http://doi.org/10.1371/journal.pone.0301263.t003
    Available download formats: xls
    Dataset updated
    May 31, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Ankit Vijayvargiya; Aparna Sinha; Naveen Gehlot; Ashutosh Jena; Rajesh Kumar; Kieran Moran
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Classifier in terms of different performance metrics with different pre-processing techniques with SMOTE.

  12. Data from: Dataset for classification of signaling proteins based on molecular star graph descriptors using machine-learning models

    • figshare.com
    • portalcientifico.sergas.es
    • +1 more
    txt
    Updated Jan 19, 2016
    Cite
    Carlos Fernandez-Lozano; Cristian Robert Munteanu (2016). Dataset for classification of signaling proteins based on molecular star graph descriptors using machine-learning models [Dataset]. http://doi.org/10.6084/m9.figshare.1330132.v1
    Available download formats: txt
    Dataset updated
    Jan 19, 2016
    Dataset provided by
    figshare
    Authors
    Carlos Fernandez-Lozano; Cristian Robert Munteanu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The positive group of 608 signaling protein sequences was downloaded in FASTA format from the Protein Databank (Berman et al., 2000) by using the “Molecular Function Browser” in the “Advanced Search Interface” (“Signaling (GO ID23052)”, protein identity cut-off = 30%). The negative group of 2077 non-signaling proteins was downloaded from the PISCES CulledPDB (http://dunbrack.fccc.edu/PISCES.php) (Wang & R. L. Dunbrack, 2003) (November 19th, 2012) using identity (degree of correspondence between two sequences) less than 20%, resolution of 1.6 Å and R-factor 0.25. The full dataset contains 2685 FASTA sequences of protein chains from the PDB databank: 608 signaling proteins and 2077 non-signaling peptides. This kind of unbalanced data is not well suited as input for learning algorithms because the results would show high sensitivity and low specificity; learning algorithms would tend to classify most samples as part of the most common group. To avoid this situation, a pre-processing stage is needed in order to get a more balanced dataset, in this case by means of the synthetic minority oversampling technique (SMOTE). In short, SMOTE balances the dataset by expanding the minority class with new synthetic samples interpolated between existing minority-class samples. After this pre-processing, the final dataset is composed of 1824 positive samples (signaling protein chains) and 2432 negative cases (non-signaling protein chains). Paper is available at: http://dx.doi.org/10.1016/j.jtbi.2015.07.038

    Please cite: Carlos Fernandez-Lozano, Rubén F. Cuiñas, José A. Seoane, Enrique Fernández-Blanco, Julian Dorado, Cristian R. Munteanu, Classification of signaling proteins based on molecular star graph descriptors using Machine Learning models, Journal of Theoretical Biology, Volume 384, 7 November 2015, Pages 50-58, ISSN 0022-5193, http://dx.doi.org/10.1016/j.jtbi.2015.07.038.(http://www.sciencedirect.com/science/article/pii/S0022519315003999)
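
    A minimal sketch of the SMOTE balancing step described in this record, using imbalanced-learn as a stand-in for whatever tool the original authors used; the sampling_strategy dict reproduces the final class counts quoted above (1824 signaling, 2432 non-signaling chains), and the label encoding (1 = signaling) is an assumption.

    ```python
    # Sketch: resample both classes to the counts quoted in the description.
    from collections import Counter
    from imblearn.over_sampling import SMOTE

    def balance_protein_dataset(X, y, random_state=0):
        # y is assumed to use 1 for signaling and 0 for non-signaling chains;
        # target counts must be >= the original counts for an oversampler.
        smote = SMOTE(sampling_strategy={1: 1824, 0: 2432}, random_state=random_state)
        X_bal, y_bal = smote.fit_resample(X, y)
        print("class counts after SMOTE:", Counter(y_bal))
        return X_bal, y_bal
    ```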

  13. Parameters of machine learning models.

    • plos.figshare.com
    xls
    Updated May 31, 2024
    + more versions
    Cite
    Ankit Vijayvargiya; Aparna Sinha; Naveen Gehlot; Ashutosh Jena; Rajesh Kumar; Kieran Moran (2024). Parameters of machine learning models. [Dataset]. http://doi.org/10.1371/journal.pone.0301263.t002
    Available download formats: xls
    Dataset updated
    May 31, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Ankit Vijayvargiya; Aparna Sinha; Naveen Gehlot; Ashutosh Jena; Rajesh Kumar; Kieran Moran
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The diagnosis of human knee abnormalities using the surface electromyography (sEMG) signal obtained from lower limb muscles with machine learning is a major problem due to the noisy nature of the sEMG signal and the imbalance in data corresponding to healthy and knee abnormal subjects. To address this challenge, a combination of wavelet decomposition (WD) with ensemble empirical mode decomposition (EEMD) and the Synthetic Minority Oversampling Technique (S-WD-EEMD) is proposed. In this study, a hybrid WD-EEMD is considered for the minimization of noises produced in the sEMG signal during the collection, while the Synthetic Minority Oversampling Technique (SMOTE) is considered to balance the data by increasing the minority class samples during the training of machine learning techniques. The findings indicate that the hybrid WD-EEMD with SMOTE oversampling technique enhances the efficacy of the examined classifiers when employed on the imbalanced sEMG data. The F-Score of the Extra Tree Classifier, when utilizing WD-EEMD signal processing with SMOTE oversampling, is 98.4%, whereas, without the SMOTE oversampling technique, it is 95.1%.
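
    A minimal sketch of the class-balancing and classification stage described above, assuming the WD-EEMD-denoised sEMG features are already available as a matrix X; scikit-learn's ExtraTreesClassifier stands in for the Extra Tree Classifier mentioned, and the hyperparameters are illustrative rather than the authors' settings.

    ```python
    # Sketch: SMOTE on the training split only, then an Extra Trees classifier
    # scored with F1 (the metric quoted in the abstract).
    from imblearn.over_sampling import SMOTE
    from sklearn.ensemble import ExtraTreesClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import f1_score

    def semg_extra_trees(X, y, random_state=0):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.2, stratify=y, random_state=random_state)
        # Balance healthy vs. knee-abnormal classes on the training data only.
        X_bal, y_bal = SMOTE(random_state=random_state).fit_resample(X_tr, y_tr)
        clf = ExtraTreesClassifier(n_estimators=200, random_state=random_state)
        clf.fit(X_bal, y_bal)
        return f1_score(y_te, clf.predict(X_te))
    ```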

  14. Table_2_Interpretable machine learning model to predict surgical difficulty in laparoscopic resection for rectal cancer.docx

    • frontiersin.figshare.com
    docx
    Updated Feb 6, 2024
    + more versions
    Cite
    Miao Yu; Zihan Yuan; Ruijie Li; Bo Shi; Daiwei Wan; Xiaoqiang Dong (2024). Table_2_Interpretable machine learning model to predict surgical difficulty in laparoscopic resection for rectal cancer.docx [Dataset]. http://doi.org/10.3389/fonc.2024.1337219.s002
    Available download formats: docx
    Dataset updated
    Feb 6, 2024
    Dataset provided by
    Frontiers
    Authors
    Miao Yu; Zihan Yuan; Ruijie Li; Bo Shi; Daiwei Wan; Xiaoqiang Dong
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: Laparoscopic total mesorectal excision (LaTME) is the standard surgical method for rectal cancer, and the LaTME operation is a challenging procedure. This study is intended to use machine learning to develop and validate prediction models for the surgical difficulty of LaTME in patients with rectal cancer and to compare these models' performance. Methods: We retrospectively collected the preoperative clinical and MRI pelvimetry parameters of rectal cancer patients who underwent laparoscopic total mesorectal resection from 2017 to 2022. The difficulty of LaTME was defined according to the scoring criteria reported by Escal. Patients were randomly divided into a training group (80%) and a test group (20%). We selected independent influencing features using the least absolute shrinkage and selection operator (LASSO) and multivariate logistic regression. The synthetic minority oversampling technique (SMOTE) was adopted to alleviate the class imbalance problem. Six machine learning models were developed: light gradient boosting machine (LGBM), categorical boosting (CatBoost), extreme gradient boosting (XGBoost), logistic regression (LR), random forest (RF), and multilayer perceptron (MLP). The area under the receiver operating characteristic curve (AUROC), accuracy, sensitivity, specificity and F1 score were used to evaluate the performance of the models. Shapley Additive Explanations (SHAP) analysis provided interpretation for the best machine learning model, and decision curve analysis (DCA) was further used to evaluate its clinical utility. Results: A total of 626 patients were included. LASSO regression analysis shows that tumor height, prognostic nutrition index (PNI), pelvic inlet, pelvic outlet, sacrococcygeal distance, mesorectal fat area and angle 5 (the angle between the apex of the sacral angle and the lower edge of the pubic bone) are the predictor variables of the machine learning model. In addition, the correlation heatmap shows that there is no significant correlation between these seven variables. When predicting the difficulty of LaTME surgery, the XGBoost model performed best among the six machine learning models (AUROC = 0.855). Based on the decision curve analysis (DCA) results, the XGBoost model is also superior, and feature importance analysis shows that tumor height is the most important variable among the seven factors. Conclusions: This study developed an XGBoost model to predict the difficulty of LaTME surgery. This model can help clinicians quickly and accurately predict the difficulty of surgery and adopt individualized surgical methods.
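
    A minimal sketch of the SMOTE + XGBoost step that this abstract reports as the best model, assuming the LASSO-selected predictors are already collected in X; the parameters are illustrative, not the tuned values from the study, and the AUROC of 0.855 refers to the authors' own data.

    ```python
    # Sketch: oversample the training split with SMOTE, fit XGBoost, report AUROC.
    from imblearn.over_sampling import SMOTE
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import roc_auc_score
    from xgboost import XGBClassifier

    def latme_difficulty_model(X, y, random_state=0):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.2, stratify=y, random_state=random_state)
        X_bal, y_bal = SMOTE(random_state=random_state).fit_resample(X_tr, y_tr)
        model = XGBClassifier(n_estimators=300, max_depth=4,
                              learning_rate=0.1, eval_metric="logloss")
        model.fit(X_bal, y_bal)
        proba = model.predict_proba(X_te)[:, 1]
        return roc_auc_score(y_te, proba)
    ```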

  15. Machine learning methods with Fermi-LAT catalogs - Dataset - B2FIND

    • b2find.eudat.eu
    Updated Apr 28, 2023
    Cite
    (2023). Machine learning methods with Fermi-LAT catalogs - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/8cede587-6165-5165-a19d-9f1729893aad
    Dataset updated
    Apr 28, 2023
    Description

    Classification of sources is one of the most important tasks in astronomy. Sources detected in one wavelength band, for example using gamma rays, may have several possible associations in other wavebands, or there may be no plausible association candidates. In this work we aim to determine the probabilistic classification of unassociated sources in the third Fermi Large Area Telescope (LAT) point source catalog (3FGL) and the fourth Fermi LAT data release 2 point source catalog (4FGL-DR2) using two classes - pulsars and active galactic nuclei (AGNs) - or three classes - pulsars, AGNs, and "OTHER" sources. We use several machine learning (ML) methods to determine a probabilistic classification of Fermi-LAT sources. We evaluate the dependence of results on the meta-parameters of the ML methods, such as the maximal depth of the trees in tree-based classification methods and the number of neurons in neural networks. We determine a probabilistic classification of both associated and unassociated sources in the 3FGL and 4FGL-DR2 catalogs. We cross-check the accuracy by comparing the predicted classes of unassociated sources in 3FGL with their associations in 4FGL-DR2 for cases where such associations exist. We find that in the two-class case it is important to correct for the presence of OTHER sources among the unassociated ones in order to realistically estimate the number of pulsars and AGNs. We find that the three-class classification, despite different types of sources in the OTHER class, has a similar performance as the two-class classification in terms of reliability diagrams and, at the same time, it does not require adjustment due to the presence of the OTHER sources among the unassociated sources. We show an example of the use of the probabilistic catalogs for population studies, which include associated and unassociated sources. Cone search capability for table J/A+A/660/A87/cat1 (PSR candidates using both catalogs). Cone search capability for table J/A+A/660/A87/cat2 (3FGL 2-class classification). Cone search capability for table J/A+A/660/A87/cat3 (3FGL 2-class using SMOTE). Cone search capability for table J/A+A/660/A87/cat4 (3FGL 3-class classification). Cone search capability for table J/A+A/660/A87/cat5 (3FGL 3-class using SMOTE). Cone search capability for table J/A+A/660/A87/cat6 (OTHER candidates using 4FGL-DR2).
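
    A minimal sketch of one workflow of the kind described above: a tree-based classifier trained on SMOTE-balanced associated sources and used to assign class probabilities to unassociated sources. Feature construction from the 3FGL / 4FGL-DR2 catalog columns is out of scope here, and the max_depth value merely illustrates the meta-parameters the study varies.

    ```python
    # Sketch: probabilistic pulsar-vs-AGN classification with a random forest
    # trained on SMOTE-balanced associated sources.
    from imblearn.over_sampling import SMOTE
    from sklearn.ensemble import RandomForestClassifier

    def classify_unassociated(X_assoc, y_assoc, X_unassoc, random_state=0):
        X_bal, y_bal = SMOTE(random_state=random_state).fit_resample(X_assoc, y_assoc)
        clf = RandomForestClassifier(n_estimators=400, max_depth=10,
                                     random_state=random_state)
        clf.fit(X_bal, y_bal)
        # Probabilistic classification: one probability per class for every source.
        return clf.predict_proba(X_unassoc)
    ```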

  16. Data from: Mental issues, internet addiction and quality of life predict burnout among Hungarian teachers: a machine learning analysis

    • data.mendeley.com
    Updated Jul 12, 2024
    + more versions
    Cite
    Andras Matuz (2024). Mental issues, internet addiction and quality of life predict burnout among Hungarian teachers: a machine learning analysis [Dataset]. http://doi.org/10.17632/2yy4j7rgvg.1
    Dataset updated
    Jul 12, 2024
    Authors
    Andras Matuz
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: Burnout is usually defined as a state of emotional, physical, and mental exhaustion that affects people in various professions (e.g. physicians, nurses, teachers). The consequences of burnout involve decreased motivation, productivity, and overall diminished well-being. The machine learning-based prediction of burnout has therefore become the focus of recent research. In this study, the aim was to detect burnout using machine learning and to identify its most important predictors in a sample of Hungarian high-school teachers. Methods: The final sample consisted of 1,576 high-school teachers (522 male), who completed a survey including various sociodemographic and health-related questions and psychological questionnaires. Specifically, depression, insomnia, internet habits (e.g. when and why one uses the internet) and problematic internet usage were among the most important predictors tested in this study. Supervised classification algorithms were trained to detect burnout assessed by two well-known burnout questionnaires. Feature selection was conducted using recursive feature elimination. Hyperparameters were tuned via grid search with 5-fold cross-validation. Due to class imbalance, class weights (i.e. cost-sensitive learning), downsampling and a hybrid method (SMOTE-ENN) were applied in separate analyses. The final model evaluation was carried out on a previously unseen holdout test sample. Results: Burnout was detected in 19.7% of the teachers included in the final dataset. The best predictive performance on the holdout test sample was achieved by a support vector machine with SMOTE-ENN (AUC = .942; balanced accuracy = .868, sensitivity = .898; specificity = .837). The best predictors of burnout were Beck's Depression Inventory scores, Athens Insomnia Scale scores, subscales of the Problematic Internet Use Questionnaire and self-reported current health status. Conclusions: The performances of the algorithms were comparable with previous studies; however, it is important to note that we tested our models on previously unseen holdout samples, suggesting higher levels of generalizability. Another remarkable finding is that besides depression and insomnia, other variables such as problematic internet use and time spent online also turned out to be important predictors of burnout.
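
    A minimal sketch of the best-performing setup reported above (SMOTE-ENN resampling plus a support vector machine tuned by 5-fold grid search), assuming imbalanced-learn's pipeline so resampling is applied only to the training folds; the questionnaire-derived feature matrix, the recursive-feature-elimination step, and the parameter grid are placeholders, not the study's actual configuration.

    ```python
    # Sketch: SMOTE-ENN + SVM inside a cross-validated grid search.
    from imblearn.combine import SMOTEENN
    from imblearn.pipeline import Pipeline
    from sklearn.model_selection import GridSearchCV
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    def tune_burnout_svm(X, y):
        pipe = Pipeline([
            ("scale", StandardScaler()),
            ("resample", SMOTEENN(random_state=0)),   # applied to training folds only
            ("svm", SVC(probability=True)),
        ])
        grid = GridSearchCV(
            pipe,
            param_grid={"svm__C": [0.1, 1, 10], "svm__gamma": ["scale", 0.01]},
            scoring="roc_auc",
            cv=5,
        )
        grid.fit(X, y)
        return grid.best_estimator_, grid.best_score_
    ```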

  17. Results of Kruskal-Wallis test.

    • datasetcatalog.nlm.nih.gov
    • figshare.com
    Updated May 21, 2025
    Cite
    Mao, Jun; Cai, Jia-zeng; Xu, Hui; Gao, Jia-jun; Li, Kun-lun; Lv, Ming-zhou (2025). Results of Kruskal-Wallis test. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0002090985
    Dataset updated
    May 21, 2025
    Authors
    Mao, Jun; Cai, Jia-zeng; Xu, Hui; Gao, Jia-jun; Li, Kun-lun; Lv, Ming-zhou
    Description

    Landslides are frequent and hazardous geological disasters, posing significant risks to human safety and infrastructure. Accurate assessments of landslide susceptibility are crucial for risk management and mitigation. However, geological surveys of landslide areas are typically conducted at the township level, have low sample sizes, and rely on experience. This study proposes a framework for assessing landslide susceptibility in Taiping Township, Zhejiang Province, China, using data balancing, machine learning, and data from 1,325 slope units with nine slope characteristics. The dataset was balanced using the Synthetic Minority Oversampling Technique and the Tomek link undersampling method (SMOTE-Tomek). A comparative analysis of six machine learning models was performed, and the SHapley Additive exPlanation (SHAP) method was used to assess the influencing factors. The results indicate that the machine learning algorithms provide high accuracy, and the random forest (RF) algorithm achieves the optimum model accuracy (0.791, F1 = 0.723). The very low, low, medium, and high sensitivity zones account for 92.27%, 5.12%, 1.78%, and 0.83% of the area, respectively. The height of cut slopes has the most significant impact on landslide sensitivity, whereas the altitude has a minor impact. The proposed model accurately assesses landslide susceptibility at the township scale, providing valuable insights for risk management and mitigation.

  18. Supplementary tables. A hybrid resampling algorithms SMOTE and ENN based deep learning models for identification of Marburg virus inhibitors

    • tandf.figshare.com
    docx
    Updated May 16, 2024
    Cite
    Madhulata Kumari; Naidu Subbarao (2024). Supplementary tables. A hybrid resampling algorithms SMOTE and ENN based deep learning models for identification of Marburg virus inhibitors [Dataset]. http://doi.org/10.25402/FMC.19550878.v1
    Available download formats: docx
    Dataset updated
    May 16, 2024
    Dataset provided by
    Taylor & Francis
    Authors
    Madhulata Kumari; Naidu Subbarao
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    Supplementary Table 1: The lead molecules of anti-MARV from ChemDiv antiviral library.
    Supplementary Table 2: The lead molecules of anti-MARV from ChEMBL antiviral library.
    Supplementary Table 3: The lead molecules of anti-MARV from phytochemical database.
    Supplementary Table 4: The lead molecules of anti-MARV from natural product NCI diversity set IV.

  19. Landslide evaluation factors and value range.

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated May 21, 2025
    Cite
    Lv, Ming-zhou; Mao, Jun; Xu, Hui; Cai, Jia-zeng; Li, Kun-lun; Gao, Jia-jun (2025). Landslide evaluation factors and value range. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0002091003
    Dataset updated
    May 21, 2025
    Authors
    Lv, Ming-zhou; Mao, Jun; Xu, Hui; Cai, Jia-zeng; Li, Kun-lun; Gao, Jia-jun
    Description

    Landslides are frequent and hazardous geological disasters, posing significant risks to human safety and infrastructure. Accurate assessments of landslide susceptibility are crucial for risk management and mitigation. However, geological surveys of landslide areas are typically conducted at the township level, have low sample sizes, and rely on experience. This study proposes a framework for assessing landslide susceptibility in Taiping Township, Zhejiang Province, China, using data balancing, machine learning, and data from 1,325 slope units with nine slope characteristics. The dataset was balanced using the Synthetic Minority Oversampling Technique and the Tomek link undersampling method (SMOTE-Tomek). A comparative analysis of six machine learning models was performed, and the SHapley Additive exPlanation (SHAP) method was used to assess the influencing factors. The results indicate that the machine learning algorithms provide high accuracy, and the random forest (RF) algorithm achieves the optimum model accuracy (0.791, F1 = 0.723). The very low, low, medium, and high sensitivity zones account for 92.27%, 5.12%, 1.78%, and 0.83% of the area, respectively. The height of cut slopes has the most significant impact on landslide sensitivity, whereas the altitude has a minor impact. The proposed model accurately assesses landslide susceptibility at the township scale, providing valuable insights for risk management and mitigation.

  20. Data from: Prediction of 35 Target Per- and Polyfluoroalkyl Substances (PFASs) in California Groundwater Using Multilabel Semisupervised Machine Learning

    • figshare.com
    • acs.figshare.com
    txt
    Updated Aug 18, 2023
    Cite
    Jialin Dong; Gabriel Tsai; Christopher I. Olivares (2023). Prediction of 35 Target Per- and Polyfluoroalkyl Substances (PFASs) in California Groundwater Using Multilabel Semisupervised Machine Learning [Dataset]. http://doi.org/10.1021/acsestwater.3c00134.s002
    Available download formats: txt
    Dataset updated
    Aug 18, 2023
    Dataset provided by
    ACS Publications
    Authors
    Jialin Dong; Gabriel Tsai; Christopher I. Olivares
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Comprehensive monitoring of perfluoroalkyl and polyfluoroalkyl substances (PFASs) is challenging because of the high analytical cost and an increasing number of analytes. We developed a machine learning pipeline to understand environmental features influencing PFAS profiles in groundwater. By examining 23 public data sets (2016–2022) in California, we built a state-wide groundwater database (25,000 observations across 4200 wells) encompassing contamination sources, weather, air quality, soil, hydrology, and groundwater quality (PFASs and cocontaminants). We used supervised learning to prescreen total PFAS concentrations above 70 ng/L and multilabel semisupervised learning to predict 35 individual PFAS concentrations above 2 ng/L. Random forest with ADASYN oversampling performed the best for total PFASs (AUROC 99%). XGBoost with SMOTE oversampling achieved AUROCs of 73–100% for individual PFAS prediction. Contamination sources and soil variables contributed the most to accuracy. Individual PFASs were strongly correlated within each PFAS’s subfamily (i.e., short- vs long-chain PFCAs, sulfonamides). These associations improved prediction performance using classifier chains, which predict each PFAS based on previously predicted species. We applied the model to reconstruct PFAS profiles in groundwater wells with missing data in previous years. Our approach can complement monitoring programs of environmental agencies to validate previous investigation results and prioritize sites for future PFAS sampling.
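
    A minimal sketch of the classifier-chain idea mentioned above, with XGBoost as the base learner: each PFAS label is predicted from the input features plus the chain's earlier predictions, so correlations within a PFAS subfamily can inform later labels. The per-label SMOTE/ADASYN resampling and the semisupervised labeling steps are omitted, and all parameters are illustrative.

    ```python
    # Sketch: multilabel prediction of individual PFAS exceedances via a
    # classifier chain over XGBoost base learners.
    from sklearn.multioutput import ClassifierChain
    from xgboost import XGBClassifier

    def fit_pfas_chain(X, Y, random_state=0):
        """X: well/site features; Y: binary matrix, one column per PFAS (> 2 ng/L)."""
        base = XGBClassifier(n_estimators=300, max_depth=5, eval_metric="logloss")
        chain = ClassifierChain(base, order="random", random_state=random_state)
        chain.fit(X, Y)
        return chain

    # chain.predict_proba(X_new) then returns one probability per well and per analyte.
    ```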
