34 datasets found
  1.

    Data from: ESPDHot: An Effective Machine Learning-Based Approach for...

    • acs.figshare.com
    txt
    Updated Apr 8, 2024
    Cite
    Lianci Tao; Tong Zhou; Zhixiang Wu; Fangrui Hu; Shuang Yang; Xiaotian Kong; Chunhua Li (2024). ESPDHot: An Effective Machine Learning-Based Approach for Predicting Protein–DNA Interaction Hotspots [Dataset]. http://doi.org/10.1021/acs.jcim.3c02011.s002
    Available download formats: txt
    Dataset updated
    Apr 8, 2024
    Dataset provided by
    ACS Publications
    Authors
    Lianci Tao; Tong Zhou; Zhixiang Wu; Fangrui Hu; Shuang Yang; Xiaotian Kong; Chunhua Li
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Protein–DNA interactions are pivotal to various cellular processes. Precise identification of the hotspot residues for protein–DNA interactions holds great significance for revealing the intricate mechanisms in protein–DNA recognition and for providing essential guidance for protein engineering. Aiming at protein–DNA interaction hotspots, this work introduces an effective prediction method, ESPDHot, based on a stacked ensemble machine learning framework. Here, the interface residue whose mutation leads to a binding free energy change (ΔΔG) exceeding 2 kcal/mol is defined as a hotspot. To tackle the imbalanced data set issue, adaptive synthetic sampling (ADASYN), an oversampling technique, is adopted to synthetically generate new minority samples, thereby rectifying data imbalance. As for molecular characteristics, besides traditional features, we introduce three new characteristic types including residue interface preference proposed by us, residue fluctuation dynamics characteristics, and coevolutionary features. Combining the Boruta method with our previously developed Random Grouping strategy, we obtained an optimal set of features. Finally, a stacking classifier is constructed to output prediction results, which integrates three classical predictors, Support Vector Machine (SVM), XGBoost, and Artificial Neural Network (ANN), as the first layer, and the Logistic Regression (LR) algorithm as the second one. Notably, ESPDHot outperforms the current state-of-the-art predictors, achieving superior performance on the independent test data set, with F1, MCC, and AUC reaching 0.571, 0.516, and 0.870, respectively.
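The hotspot definition and two of the reported metrics (F1 and MCC) can be illustrated with a minimal sketch. The ΔΔG values and predictions below are made up for illustration and are not taken from the ESPDHot data; only the 2 kcal/mol threshold comes from the description above.

```python
import math

HOTSPOT_THRESHOLD = 2.0  # kcal/mol, as stated in the description


def is_hotspot(ddg):
    """A residue is a hotspot when its mutation's ddG exceeds the threshold."""
    return ddg > HOTSPOT_THRESHOLD


def f1_and_mcc(y_true, y_pred):
    """Compute F1 and Matthews correlation coefficient for boolean labels."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    tn = sum((not t) and (not p) for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    f1_denom = 2 * tp + fp + fn
    f1 = 2 * tp / f1_denom if f1_denom else 0.0
    mcc_denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_denom if mcc_denom else 0.0
    return f1, mcc


# Hypothetical ddG values (kcal/mol) and hypothetical model predictions.
ddg_values = [2.5, 0.3, 1.9, 3.1, 2.0, 0.8]
labels = [is_hotspot(v) for v in ddg_values]
predictions = [True, False, True, False, False, False]
f1, mcc = f1_and_mcc(labels, predictions)
```

Note that a ΔΔG of exactly 2.0 is not labeled a hotspot here, because the description says "exceeding".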

  2.

    Hyperparameter search space for LR.

    • plos.figshare.com
    • datasetcatalog.nlm.nih.gov
    xls
    Updated May 16, 2024
    Cite
    Md Abdus Sahid; Mozaddid Ul Hoque Babar; Md Palash Uddin (2024). Hyperparameter search space for LR. [Dataset]. http://doi.org/10.1371/journal.pone.0300785.t005
    Available download formats: xls
    Dataset updated
    May 16, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Md Abdus Sahid; Mozaddid Ul Hoque Babar; Md Palash Uddin
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Diabetes is a persistent metabolic disorder linked to elevated levels of blood glucose, commonly referred to as blood sugar. This condition can have detrimental effects on the heart, blood vessels, eyes, kidneys, and nerves as time passes. It is a chronic ailment that arises when the body fails to produce enough insulin or is unable to effectively use the insulin it produces. When diabetes is not properly managed, it often leads to hyperglycemia, a condition characterized by elevated blood sugar levels or impaired glucose tolerance. This can result in significant harm to various body systems, including the nerves and blood vessels. In this paper, we propose a multiclass diabetes mellitus detection and classification approach using an extremely imbalanced Laboratory of Medical City Hospital data dynamics. We also formulate a new dataset that is moderately imbalanced based on the Laboratory of Medical City Hospital data dynamics. To correctly identify the multiclass diabetes mellitus, we employ three machine learning classifiers, namely support vector machine, logistic regression, and k-nearest neighbor. We also focus on dimensionality reduction (feature selection: filter, wrapper, and embedded methods) to prune the unnecessary features and to scale up the classification performance. To optimize the classification performance of classifiers, we tune the model by hyperparameter optimization with 10-fold grid search cross-validation. In the case of the original extremely imbalanced dataset with 70:30 partition and support vector machine classifier, we achieved maximum accuracy of 0.964, precision of 0.968, recall of 0.964, F1-score of 0.962, Cohen kappa of 0.835, and AUC of 0.99 by using the top 4 features according to the filter method. By using the top 9 features according to wrapper-based sequential feature selection, the k-nearest neighbor provides an accuracy of 0.935 and 1.0 for the other performance metrics.
    For our created moderately imbalanced dataset with an 80:20 partition, the SVM classifier achieves a maximum accuracy of 0.938, and 1.0 for other performance metrics. For the multiclass diabetes mellitus detection and classification, our experiments outperformed previously conducted research based on the Laboratory of Medical City Hospital data dynamics.

  3.

    Hyperparameter settings of classification model.

    • plos.figshare.com
    xls
    Updated Oct 18, 2024
    Cite
    Lu Xiao; Qiaoxing Li; Qian Ma; Jiasheng Shen; Yong Yang; Danyang Li (2024). Hyperparameter settings of classification model. [Dataset]. http://doi.org/10.1371/journal.pone.0305095.t006
    Available download formats: xls
    Dataset updated
    Oct 18, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Lu Xiao; Qiaoxing Li; Qian Ma; Jiasheng Shen; Yong Yang; Danyang Li
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Text classification, as an important research area of text mining, can quickly and effectively extract valuable information to address the challenges of organizing and managing large-scale text data in the era of big data. Currently, the related research on text classification tends to focus on the application in fields such as information filtering, information retrieval, public opinion monitoring, and library and information, with few studies applying text classification methods to the field of tourist attractions. In light of this, a corpus of tourist attraction description texts is constructed using web crawler technology in this paper. We propose a novel text representation method that combines Word2Vec word embeddings with TF-IDF-CRF-POS weighting, optimizing traditional TF-IDF by incorporating total relative term frequency, category discriminability, and part-of-speech information. Subsequently, the proposed algorithm is combined with each of seven commonly used classifiers (DT, SVM, LR, NB, MLP, RF, and KNN), known for their good performance, to achieve multi-class text classification for six subcategories of national A-level tourist attractions. The effectiveness and superiority of this algorithm are validated by comparing the overall performance, specific category performance, and model stability against several commonly used text representation methods. The results demonstrate that the newly proposed algorithm achieves higher accuracy and F1-measure on this type of professional dataset, and even outperforms the high-performance BERT classification model currently favored by the industry. The Acc, macro-F1, and micro-F1 values are 2.29%, 5.55%, and 2.90% higher, respectively. Moreover, the algorithm can identify rare categories in the imbalanced dataset and exhibit better stability across datasets of different sizes. Overall, the algorithm presented in this paper exhibits superior classification performance and robustness.
    In addition, the conclusions obtained by the predicted value and the true value are consistent, indicating that this algorithm is practical. The professional domain text dataset used in this paper poses higher challenges due to its complexity (uneven text length, relatively imbalanced categories), and a high degree of similarity between categories. However, this proposed algorithm can efficiently implement the classification of multiple subcategories of this type of text set, which is a beneficial exploration of the application research of complex Chinese text datasets in specific fields, and provides a useful reference for the vector expression and classification of text datasets with similar content.
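The description reports both macro-F1 and micro-F1. As a rough sketch of how the two averages differ on an imbalanced multi-class problem, the example below uses invented labels ("a", "b", "c"); it is not the paper's data or code.

```python
def per_class_counts(y_true, y_pred, labels):
    """Return {label: (tp, fp, fn)} using a one-vs-rest view of each class."""
    counts = {}
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        counts[c] = (tp, fp, fn)
    return counts


def f1_from_counts(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1: rare classes count as much as common ones."""
    counts = per_class_counts(y_true, y_pred, labels)
    return sum(f1_from_counts(*counts[c]) for c in labels) / len(labels)


def micro_f1(y_true, y_pred, labels):
    """F1 over pooled counts: dominated by the frequent classes."""
    counts = per_class_counts(y_true, y_pred, labels)
    tp = sum(v[0] for v in counts.values())
    fp = sum(v[1] for v in counts.values())
    fn = sum(v[2] for v in counts.values())
    return f1_from_counts(tp, fp, fn)


# Hypothetical labels: class "c" is rare and is missed entirely.
y_true = ["a", "a", "a", "a", "b", "c"]
y_pred = ["a", "a", "a", "b", "b", "a"]
labels = ["a", "b", "c"]
macro = macro_f1(y_true, y_pred, labels)
micro = micro_f1(y_true, y_pred, labels)
```

In this toy case the missed rare class pulls macro-F1 well below micro-F1, which is why macro-F1 is often the more informative figure for imbalanced categories.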

  4.

    Hyperparameter search space for SVM.

    • plos.figshare.com
    xls
    Updated May 16, 2024
    Cite
    Md Abdus Sahid; Mozaddid Ul Hoque Babar; Md Palash Uddin (2024). Hyperparameter search space for SVM. [Dataset]. http://doi.org/10.1371/journal.pone.0300785.t004
    Available download formats: xls
    Dataset updated
    May 16, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Md Abdus Sahid; Mozaddid Ul Hoque Babar; Md Palash Uddin
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Diabetes is a persistent metabolic disorder linked to elevated levels of blood glucose, commonly referred to as blood sugar. This condition can have detrimental effects on the heart, blood vessels, eyes, kidneys, and nerves as time passes. It is a chronic ailment that arises when the body fails to produce enough insulin or is unable to effectively use the insulin it produces. When diabetes is not properly managed, it often leads to hyperglycemia, a condition characterized by elevated blood sugar levels or impaired glucose tolerance. This can result in significant harm to various body systems, including the nerves and blood vessels. In this paper, we propose a multiclass diabetes mellitus detection and classification approach using an extremely imbalanced Laboratory of Medical City Hospital data dynamics. We also formulate a new dataset that is moderately imbalanced based on the Laboratory of Medical City Hospital data dynamics. To correctly identify the multiclass diabetes mellitus, we employ three machine learning classifiers, namely support vector machine, logistic regression, and k-nearest neighbor. We also focus on dimensionality reduction (feature selection: filter, wrapper, and embedded methods) to prune the unnecessary features and to scale up the classification performance. To optimize the classification performance of classifiers, we tune the model by hyperparameter optimization with 10-fold grid search cross-validation. In the case of the original extremely imbalanced dataset with 70:30 partition and support vector machine classifier, we achieved maximum accuracy of 0.964, precision of 0.968, recall of 0.964, F1-score of 0.962, Cohen kappa of 0.835, and AUC of 0.99 by using the top 4 features according to the filter method. By using the top 9 features according to wrapper-based sequential feature selection, the k-nearest neighbor provides an accuracy of 0.935 and 1.0 for the other performance metrics.
    For our created moderately imbalanced dataset with an 80:20 partition, the SVM classifier achieves a maximum accuracy of 0.938, and 1.0 for other performance metrics. For the multiclass diabetes mellitus detection and classification, our experiments outperformed previously conducted research based on the Laboratory of Medical City Hospital data dynamics.

  5.

    Hyperparameter optimization with 10-fold grid search CV for the original...

    • plos.figshare.com
    xls
    Updated May 16, 2024
    Cite
    Md Abdus Sahid; Mozaddid Ul Hoque Babar; Md Palash Uddin (2024). Hyperparameter optimization with 10-fold grid search CV for the original extremely imbalanced data dynamics using a 70:30 partition with and without feature selection. [Dataset]. http://doi.org/10.1371/journal.pone.0300785.t012
    Available download formats: xls
    Dataset updated
    May 16, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Md Abdus Sahid; Mozaddid Ul Hoque Babar; Md Palash Uddin
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Hyperparameter optimization with 10-fold grid search CV for the original extremely imbalanced data dynamics using a 70:30 partition with and without feature selection.
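For readers unfamiliar with the procedure named in this record, the following is a minimal, library-free sketch of grid search with 10-fold cross-validation. Everything concrete in it is an assumption for illustration: the toy threshold "model", the `threshold` hyperparameter, and the contiguous (unshuffled, unstratified) folds. It does not reproduce the authors' search space or tooling.

```python
from itertools import product


def k_fold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds of near-equal size."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def grid_search_cv(param_grid, fit, score, X, y, k=10):
    """Return the parameter combination with the best mean validation score."""
    names = sorted(param_grid)
    folds = k_fold_indices(len(X), k)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        fold_scores = []
        for held_out in folds:
            held = set(held_out)
            train = [i for i in range(len(X)) if i not in held]
            model = fit([X[i] for i in train], [y[i] for i in train], **params)
            fold_scores.append(
                score(model, [X[i] for i in held_out], [y[i] for i in held_out])
            )
        mean_score = sum(fold_scores) / len(fold_scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score


# Toy model (hypothetical): predict class 1 when x >= threshold.
def fit_threshold(X_train, y_train, threshold):
    return threshold  # nothing is learned; the hyperparameter is the model


def accuracy(model, X_val, y_val):
    hits = sum(int(x >= model) == label for x, label in zip(X_val, y_val))
    return hits / len(X_val)


X = list(range(20))
y = [int(x >= 10) for x in X]
best_params, best_score = grid_search_cv(
    {"threshold": [5, 10, 15]}, fit_threshold, accuracy, X, y, k=10
)
```

With strongly imbalanced classes, stratified folds are usually preferable so that every fold contains minority samples; the contiguous split here is only the simplest choice.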

  6.

    Confusion matrix.

    • plos.figshare.com
    xls
    Updated May 16, 2024
    Cite
    Md Abdus Sahid; Mozaddid Ul Hoque Babar; Md Palash Uddin (2024). Confusion matrix. [Dataset]. http://doi.org/10.1371/journal.pone.0300785.t003
    Available download formats: xls
    Dataset updated
    May 16, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Md Abdus Sahid; Mozaddid Ul Hoque Babar; Md Palash Uddin
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Diabetes is a persistent metabolic disorder linked to elevated levels of blood glucose, commonly referred to as blood sugar. This condition can have detrimental effects on the heart, blood vessels, eyes, kidneys, and nerves as time passes. It is a chronic ailment that arises when the body fails to produce enough insulin or is unable to effectively use the insulin it produces. When diabetes is not properly managed, it often leads to hyperglycemia, a condition characterized by elevated blood sugar levels or impaired glucose tolerance. This can result in significant harm to various body systems, including the nerves and blood vessels. In this paper, we propose a multiclass diabetes mellitus detection and classification approach using an extremely imbalanced Laboratory of Medical City Hospital data dynamics. We also formulate a new dataset that is moderately imbalanced based on the Laboratory of Medical City Hospital data dynamics. To correctly identify the multiclass diabetes mellitus, we employ three machine learning classifiers, namely support vector machine, logistic regression, and k-nearest neighbor. We also focus on dimensionality reduction (feature selection: filter, wrapper, and embedded methods) to prune the unnecessary features and to scale up the classification performance. To optimize the classification performance of classifiers, we tune the model by hyperparameter optimization with 10-fold grid search cross-validation. In the case of the original extremely imbalanced dataset with 70:30 partition and support vector machine classifier, we achieved maximum accuracy of 0.964, precision of 0.968, recall of 0.964, F1-score of 0.962, Cohen kappa of 0.835, and AUC of 0.99 by using the top 4 features according to the filter method. By using the top 9 features according to wrapper-based sequential feature selection, the k-nearest neighbor provides an accuracy of 0.935 and 1.0 for the other performance metrics.
    For our created moderately imbalanced dataset with an 80:20 partition, the SVM classifier achieves a maximum accuracy of 0.938, and 1.0 for other performance metrics. For the multiclass diabetes mellitus detection and classification, our experiments outperformed previously conducted research based on the Laboratory of Medical City Hospital data dynamics.
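Since this record is a confusion matrix and the description reports both accuracy and Cohen's kappa, a short sketch of how those two figures are derived from a multiclass confusion matrix may help. The 3x3 matrix below is invented for illustration and is not the published table.

```python
def accuracy_and_kappa(cm):
    """cm[i][j] = number of samples of true class i predicted as class j."""
    n = sum(sum(row) for row in cm)
    observed = sum(cm[i][i] for i in range(len(cm))) / n
    row_totals = [sum(row) for row in cm]
    col_totals = [sum(col) for col in zip(*cm)]
    # Agreement expected by chance from the marginal totals alone.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (n * n)
    if expected == 1.0:
        return observed, 0.0  # degenerate case: kappa is undefined
    kappa = (observed - expected) / (1.0 - expected)
    return observed, kappa


# Hypothetical three-class confusion matrix.
cm = [
    [8, 1, 1],
    [1, 8, 1],
    [1, 1, 8],
]
acc, kappa = accuracy_and_kappa(cm)
```

Kappa discounts the agreement a classifier would reach by chance, so on imbalanced data it can be noticeably lower than accuracy.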

  7.

    Comparative analysis over various datasets.

    • plos.figshare.com
    xls
    Updated Jan 10, 2025
    Cite
    Tao Yu; Wei Huang; Xin Tang; Duosi Zheng (2025). Comparative analysis over various datasets. [Dataset]. http://doi.org/10.1371/journal.pone.0316557.t003
    Available download formats: xls
    Dataset updated
    Jan 10, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Tao Yu; Wei Huang; Xin Tang; Duosi Zheng
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In credit risk assessment, unsupervised classification techniques can be introduced to reduce human resource expenses and expedite decision-making. Despite the efficacy of unsupervised learning methods in handling unlabeled datasets, their performance remains limited owing to challenges such as imbalanced data, local optima, and parameter adjustment complexities. Thus, this paper introduces a novel hybrid unsupervised classification method, named the two-stage hybrid system with spectral clustering and semi-supervised support vector machine (TSC-SVM), which effectively addresses the unsupervised imbalance problem in credit risk assessment by targeting global optimal solutions. Furthermore, a multi-view combined unsupervised method is designed to thoroughly mine data and enhance the robustness of label predictions. This method mitigates discrepancies in prediction outcomes from three distinct perspectives. The effectiveness, efficiency, and robustness of the proposed TSC-SVM model are demonstrated through various real-world applications. The proposed algorithm is anticipated to expand the customer base for financial institutions while reducing economic losses.

  8.

    Data from: Large-Scale Learning of Structure−Activity Relationships Using a...

    • acs.figshare.com
    zip
    Updated May 30, 2023
    Cite
    Georg Hinselmann; Lars Rosenbaum; Andreas Jahn; Nikolas Fechner; Claude Ostermann; Andreas Zell (2023). Large-Scale Learning of Structure−Activity Relationships Using a Linear Support Vector Machine and Problem-Specific Metrics [Dataset]. http://doi.org/10.1021/ci100073w.s001
    Available download formats: zip
    Dataset updated
    May 30, 2023
    Dataset provided by
    ACS Publications
    Authors
    Georg Hinselmann; Lars Rosenbaum; Andreas Jahn; Nikolas Fechner; Claude Ostermann; Andreas Zell
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    The goal of this study was to adapt a recently proposed linear large-scale support vector machine to large-scale binary cheminformatics classification problems and to assess its performance on various benchmarks using virtual screening performance measures. We extended the large-scale linear support vector machine library LIBLINEAR with state-of-the-art virtual high-throughput screening metrics to train classifiers on whole large and unbalanced data sets. The formulation of this linear support vector machine performs very well when applied to high-dimensional sparse feature vectors. An additional advantage is the average linear complexity in the number of non-zero features of a prediction. Nevertheless, the approach assumes that a problem is linearly separable. Therefore, we conducted an extensive benchmarking to evaluate the performance on large-scale problems up to a size of 175000 samples. To examine the virtual screening performance, we determined the chemotype clusters using Feature Trees and integrated this information to compute weighted AUC-based performance measures and a leave-cluster-out cross-validation. We also considered the BEDROC score, a metric that was suggested to tackle the early enrichment problem. The performance on each problem was evaluated by a nested cross-validation and a nested leave-cluster-out cross-validation. We compared LIBLINEAR against a Naïve Bayes classifier, a random decision forest classifier, and a maximum similarity ranking approach. These reference approaches were outperformed in a direct comparison by LIBLINEAR. A comparison to literature results showed that the LIBLINEAR performance is competitive but without achieving results as good as the top-ranked nonlinear machines on these benchmarks. However, considering the overall convincing performance and computation time of the large-scale support vector machine, the approach provides an excellent alternative to established large-scale classification approaches.
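The leave-cluster-out cross-validation mentioned above can be sketched as follows: each chemotype cluster is held out in turn, so no cluster contributes to both training and testing. The cluster labels here are invented placeholders; the study derives its clusters with Feature Trees, which this sketch does not attempt to reproduce.

```python
def leave_cluster_out_splits(cluster_ids):
    """Yield (held_out_cluster, train_indices, test_indices) for each cluster."""
    for held_out in sorted(set(cluster_ids)):
        test = [i for i, c in enumerate(cluster_ids) if c == held_out]
        train = [i for i, c in enumerate(cluster_ids) if c != held_out]
        yield held_out, train, test


# Hypothetical cluster assignment for five compounds.
cluster_ids = ["A", "A", "B", "C", "B"]
splits = list(leave_cluster_out_splits(cluster_ids))
```

Compared with random folds, this tends to give a more conservative estimate, because the model is always evaluated on a chemotype it has not seen during training.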

  9.

    Hyperparameter Search Space for KNN.

    • plos.figshare.com
    xls
    Updated May 16, 2024
    Cite
    Md Abdus Sahid; Mozaddid Ul Hoque Babar; Md Palash Uddin (2024). Hyperparameter Search Space for KNN. [Dataset]. http://doi.org/10.1371/journal.pone.0300785.t006
    Available download formats: xls
    Dataset updated
    May 16, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Md Abdus Sahid; Mozaddid Ul Hoque Babar; Md Palash Uddin
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Diabetes is a persistent metabolic disorder linked to elevated levels of blood glucose, commonly referred to as blood sugar. This condition can have detrimental effects on the heart, blood vessels, eyes, kidneys, and nerves as time passes. It is a chronic ailment that arises when the body fails to produce enough insulin or is unable to effectively use the insulin it produces. When diabetes is not properly managed, it often leads to hyperglycemia, a condition characterized by elevated blood sugar levels or impaired glucose tolerance. This can result in significant harm to various body systems, including the nerves and blood vessels. In this paper, we propose a multiclass diabetes mellitus detection and classification approach using an extremely imbalanced Laboratory of Medical City Hospital data dynamics. We also formulate a new dataset that is moderately imbalanced based on the Laboratory of Medical City Hospital data dynamics. To correctly identify the multiclass diabetes mellitus, we employ three machine learning classifiers, namely support vector machine, logistic regression, and k-nearest neighbor. We also focus on dimensionality reduction (feature selection: filter, wrapper, and embedded methods) to prune the unnecessary features and to scale up the classification performance. To optimize the classification performance of classifiers, we tune the model by hyperparameter optimization with 10-fold grid search cross-validation. In the case of the original extremely imbalanced dataset with 70:30 partition and support vector machine classifier, we achieved maximum accuracy of 0.964, precision of 0.968, recall of 0.964, F1-score of 0.962, Cohen kappa of 0.835, and AUC of 0.99 by using the top 4 features according to the filter method. By using the top 9 features according to wrapper-based sequential feature selection, the k-nearest neighbor provides an accuracy of 0.935 and 1.0 for the other performance metrics.
    For our created moderately imbalanced dataset with an 80:20 partition, the SVM classifier achieves a maximum accuracy of 0.938, and 1.0 for other performance metrics. For the multiclass diabetes mellitus detection and classification, our experiments outperformed previously conducted research based on the Laboratory of Medical City Hospital data dynamics.

  10.

    Data from: Comparison of experimental results (%).

    • figshare.com
    bin
    Updated Aug 17, 2023
    Cite
    Zhaotian Li; Edward Fox (2023). Comparison of experimental results (%). [Dataset]. http://doi.org/10.1371/journal.pone.0290086.t002
    Available download formats: bin
    Dataset updated
    Aug 17, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Zhaotian Li; Edward Fox
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The sudden resignation of core employees often brings losses to companies in various aspects. Traditional employee turnover theory cannot analyze the unbalanced data of employees comprehensively, which leads the company to make wrong decisions. When classifying unbalanced data, the traditional Support Vector Machine (SVM) suffers from insufficient decision plane offset and unbalanced support vector distribution, for which the Synthetic Minority Oversampling Technique (SMOTE) is introduced to improve the balance of generated data. Further, the Fuzzy C-means (FCM) clustering is improved and combined with SMOTE (IFCM-SMOTE-SVM) to synthesize new samples with higher accuracy, solving the drawback that the separation data synthesized by SMOTE is too random and easy to generate noisy data. The kernel function is combined with IFCM-SMOTE-SVM and transformed to a high-dimensional space for clustering sampling and classification, and the kernel space-based classification algorithm (KS-IFCM-SMOTE-SVM) is proposed, which improves the effectiveness of the generated data on SVM classification results. Finally, the generalization ability of KS-IFCM-SMOTE-SVM for different types of enterprise data is experimentally demonstrated, and it is verified that the proposed algorithm has stable and accurate performance. This study introduces the SMOTE and FCM clustering, and improves the SVM by combining the data transformation in the kernel space to achieve accurate classification of unbalanced data of employees, which helps enterprises to predict whether employees have the tendency to leave in advance.
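As background for the description above, here is a minimal sketch of the plain SMOTE interpolation step: a synthetic minority sample is placed at a random position on the segment between a minority point and one of its nearest minority neighbours. This is the basic technique only, not the IFCM or kernel-space variants the paper proposes, and the sample points are invented.

```python
import random


def smote_sample(minority, k=2, rng=None):
    """Create one synthetic point between a minority sample and a near neighbour."""
    if rng is None:
        rng = random.Random(0)
    base_index = rng.randrange(len(minority))
    base = minority[base_index]
    # Rank the other minority points by squared Euclidean distance to the base.
    others = [p for i, p in enumerate(minority) if i != base_index]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
    neighbour = rng.choice(others[:k])
    gap = rng.random()  # interpolation factor in [0, 1)
    return tuple(a + gap * (b - a) for a, b in zip(base, neighbour))


# Hypothetical two-feature minority class.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
rng = random.Random(42)
synthetic = [smote_sample(minority, k=2, rng=rng) for _ in range(5)]
```

Because the neighbour and the gap are chosen at random, synthetic points can land in regions that overlap the majority class, which is the noise problem the description says the clustering-based variant is meant to reduce.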

  11.

    Comparison of methods for balancing data in SVM.

    • plos.figshare.com
    xls
    Updated Aug 7, 2025
    Cite
    Xinyi Wei; Boyu Shi (2025). Comparison of methods for balancing data in SVM. [Dataset]. http://doi.org/10.1371/journal.pone.0327569.t010
    Available download formats: xls
    Dataset updated
    Aug 7, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Xinyi Wei; Boyu Shi
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Coronary heart disease (CHD) is a major cardiovascular disorder that poses significant threats to global health and is increasingly affecting younger populations. Its treatment and prevention face challenges such as high costs, prolonged recovery periods, and limited efficacy of traditional methods. Additionally, the complexity of diagnostic indicators and the global shortage of medical professionals further complicate accurate diagnosis. This study employs machine learning techniques to analyze CHD-related pathogenic factors and proposes an efficient diagnostic and predictive framework. To address the data imbalance issue, SMOTE-ENN is utilized, and five machine learning algorithms (Decision Trees, KNN, SVM, XGBoost, and Random Forest) are applied for classification tasks. Principal Component Analysis (PCA) and Grid Search are used to optimize the models, with evaluation metrics including accuracy, precision, recall, F1-score, and AUC. According to the random forest model’s optimization experiment, the initial unbalanced data’s accuracy was 85.26%, and the F1-score was 12.58%. The accuracy increased to 92.16% and the F1-score reached 93.85% after using SMOTE-ENN for data balancing, which is an increase of 6.90 and 81.27 percentage points, respectively; the model accuracy increased to 97.91% and the F1-score increased to 97.88% after adding PCA feature dimensionality reduction processing, which is an increase of 5.75 and 4.03 percentage points, respectively, compared with the SMOTE-ENN stage. This indicates that combining data balancing and feature dimensionality reduction techniques significantly improves model accuracy and makes the random forest model the best model. This study provides an efficient diagnostic tool for CHD, alleviates the challenges posed by limited medical resources, and offers a scientific foundation for precise prevention and intervention strategies.
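The stage-to-stage gains quoted in this description are differences between percentages, so they are best read as percentage points. The short check below recomputes them from the accuracy and F1 figures given above; the stage names are shorthand labels chosen for this sketch.

```python
# (stage label, accuracy %, F1-score %) as reported in the description.
stages = [
    ("unbalanced", 85.26, 12.58),
    ("SMOTE-ENN", 92.16, 93.85),
    ("SMOTE-ENN + PCA", 97.91, 97.88),
]


def stage_gains(stages):
    """Return (stage, accuracy gain, F1 gain) versus the previous stage."""
    gains = []
    for (_, prev_acc, prev_f1), (name, acc, f1) in zip(stages, stages[1:]):
        gains.append((name, round(acc - prev_acc, 2), round(f1 - prev_f1, 2)))
    return gains


gains = stage_gains(stages)
```

The result matches the figures in the text: 6.90 and 81.27 points after SMOTE-ENN, then 5.75 and 4.03 points after adding PCA.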

  12.

    Data Sheet 1_Using preprocessed datasets to construct and interpret...

    • frontiersin.figshare.com
    docx
    Updated Aug 20, 2025
    Cite
    Cong Wang; Yufeng Fu; Ran Wan; Le Zhao; Hongbo Wang; Junwei Guo; Qiang Liu; Shan Li; Shengtao Ma; Zhicai Wang; Wei Huang; Huimin Liu; Song Yang; Cong Nie (2025). Data Sheet 1_Using preprocessed datasets to construct and interpret multiclass identification models.docx [Dataset]. http://doi.org/10.3389/fpls.2025.1597673.s001
    Available download formats: docx
    Dataset updated
    Aug 20, 2025
    Dataset provided by
    Frontiers
    Authors
    Cong Wang; Yufeng Fu; Ran Wan; Le Zhao; Hongbo Wang; Junwei Guo; Qiang Liu; Shan Li; Shengtao Ma; Zhicai Wang; Wei Huang; Huimin Liu; Song Yang; Cong Nie
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction: Image and near-infrared (NIR) spectroscopic data are widely used for constructing analytical models in precision agriculture. While model interpretation can provide valuable insights for quality control and improvement, the inherent ambiguity of individual image pixels or spectral data points often hinders practical interpretability when using raw data directly. Furthermore, the presence of imbalanced datasets can lead to model overfitting and consequently, poor robustness. Therefore, developing alternative approaches for constructing interpretable and robust models using these data types is crucial. Methods: This study proposes using preprocessed data (specifically, morphological features extracted from images and chemical component concentrations predicted from NIR spectra) to build multiclass identification models. Combined kernel SVM based models were proposed to identify the rice variety and cultivation region of tobacco. The determination of kernel parameters and percentage of different types of kernel functions were accomplished by PSO, which makes the approach self-adaptive. Feature importance and contribution analyses were conducted using Shapley additive explanations (SHAP). Results: The resulting models demonstrated high robustness and accuracy, achieving classification success rates of 97.9% and 97.4% via n-fold cross validation on rice and tobacco datasets, respectively, and 97.7% on an independent test set (tobacco dataset 2). This analysis identified key variables and elucidated their specific contributions to the model predictions. Discussion: This study expands the applicability of image and NIR spectroscopic data, offering researchers an effective methodology for investigating factors crucial to the quality control and improvement of agricultural products.
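One common way to build a "combined kernel" is a weighted sum of two base kernels, where the weight plays the role of the "percentage" of each kernel type. The sketch below assumes an RBF kernel and a polynomial kernel as the two components; the description does not say which kernels the authors combine, and in the study the weight and kernel parameters are chosen by PSO rather than fixed by hand as they are here.

```python
import math


def rbf_kernel(x, z, gamma):
    """Gaussian (RBF) kernel: exp(-gamma * ||x - z||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def poly_kernel(x, z, degree, coef0=1.0):
    """Polynomial kernel: (<x, z> + coef0) ** degree."""
    return (sum(a * b for a, b in zip(x, z)) + coef0) ** degree


def combined_kernel(x, z, weight, gamma, degree):
    """Convex combination: weight * RBF + (1 - weight) * polynomial."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return weight * rbf_kernel(x, z, gamma) + (1.0 - weight) * poly_kernel(
        x, z, degree
    )


# Hypothetical feature vectors and hand-picked parameters.
x, z = (1.0, 0.0), (0.0, 1.0)
value = combined_kernel(x, z, weight=0.5, gamma=0.5, degree=2)
```

A convex combination of valid kernels is itself a valid kernel, so the result can be passed to any SVM implementation that accepts a custom or precomputed kernel.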

  13.

    The selected explanatory variables.

    • plos.figshare.com
    xls
    Updated Jun 21, 2023
    Seyed Iman Mohammadpour; Majid Khedmati; Mohammad Javad Hassan Zada (2023). The selected explanatory variables. [Dataset]. http://doi.org/10.1371/journal.pone.0281901.t002
    Available download formats: xls
    Dataset updated
    Jun 21, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Seyed Iman Mohammadpour; Majid Khedmati; Mohammad Javad Hassan Zada
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    While the cost of road traffic fatalities in the U.S. surpasses $240 billion a year, the availability of high-resolution datasets allows meticulous investigation of the factors contributing to crash severity. In this paper, the dataset for Trucks Involved in Fatal Accidents in 2010 (TIFA 2010) is used to classify truck-involved crash severity in the presence of several issues, including missing values, imbalanced classes, and high dimensionality. First, a decision tree-based algorithm, the Synthetic Minority Oversampling Technique (SMOTE), and the Random Forest (RF) feature importance approach are employed for missing value imputation, minority class oversampling, and dimensionality reduction, respectively. Afterward, a variety of classification algorithms, including RF, K-Nearest Neighbors (KNN), Multi-Layer Perceptron (MLP), Gradient-Boosted Decision Trees (GBDT), and Support Vector Machine (SVM), are developed to reveal the influence of the introduced data preprocessing framework on the output quality of ML classifiers. The results show that the GBDT model outperforms all the other competing algorithms for the non-preprocessed crash data based on the G-mean performance measure, but the RF makes the most accurate prediction for the treated dataset. This finding indicates that, after feature selection is conducted to reduce the computational cost of the machine learning algorithms, bagging (bootstrap aggregating) of decision trees in RF leads to a better model than boosting them via GBDT. Besides, the adopted feature importance approach decreases the overall accuracy by only up to 5% in most of the estimated models. Moreover, the worst class recall value of the RF algorithm without prior oversampling is only 34.4%, compared to the corresponding value of 90.3% in the up-sampled model, which validates the proposed multi-step preprocessing scheme. This study also identifies the temporal and spatial (roadway) attributes, as well as crash characteristics and Emergency Medical Service (EMS), as the most critical factors in truck crash severity.
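
To show why a measure such as G-mean is preferred over plain accuracy on imbalanced classes, here is a minimal sketch. It assumes the common definition of G-mean as the geometric mean of the per-class recalls; the description above does not give the exact formula used in the paper, and the function names are illustrative.

```python
import math
from collections import Counter

def per_class_recall(y_true, y_pred):
    """Recall for every class that appears in y_true."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}

def g_mean(y_true, y_pred):
    """Geometric mean of the per-class recalls (assumed definition)."""
    recalls = per_class_recall(y_true, y_pred)
    return math.prod(recalls.values()) ** (1.0 / len(recalls))

# Toy labels: 8 majority-class samples (0) and 2 minority-class samples (1).
y_true = [0] * 8 + [1] * 2

# A classifier that always predicts the majority class.
always_majority = [0] * 10
accuracy = sum(t == p for t, p in zip(y_true, always_majority)) / len(y_true)
score = g_mean(y_true, always_majority)
```

In this toy case the always-majority classifier reaches an accuracy of 0.8 but a G-mean of 0, because the minority-class recall is 0. This is the same effect the worst-class recall figures in the description point to.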

  14. f

    Statistical characteristics of the original extremely imbalance LMCH and...

    • figshare.com
    xls
    Updated May 16, 2024
    Md Abdus Sahid; Mozaddid Ul Hoque Babar; Md Palash Uddin (2024). Statistical characteristics of the original extremely imbalance LMCH and filtered moderately imbalance LMCH diabetes data dynamics. [Dataset]. http://doi.org/10.1371/journal.pone.0300785.t002
    Available download formats: xls
    Dataset updated
    May 16, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Md Abdus Sahid; Mozaddid Ul Hoque Babar; Md Palash Uddin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Statistical characteristics of the original extremely imbalance LMCH and filtered moderately imbalance LMCH diabetes data dynamics.

  15. f

    Hyperparameter optimization using 10-fold grid search CV for the filtered...

    • plos.figshare.com
    xls
    Updated May 16, 2024
    Md Abdus Sahid; Mozaddid Ul Hoque Babar; Md Palash Uddin (2024). Hyperparameter optimization using 10-fold grid search CV for the filtered LMCH data dynamics with 80:20 partition with feature selection. [Dataset]. http://doi.org/10.1371/journal.pone.0300785.t015
    Available download formats: xls
    Dataset updated
    May 16, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Md Abdus Sahid; Mozaddid Ul Hoque Babar; Md Palash Uddin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Hyperparameter optimization using 10-fold grid search CV for the filtered LMCH data dynamics with 80:20 partition with feature selection.

  16. f

    Performance comparison of hyper-parameterized classifiers on filtered LMCH...

    • plos.figshare.com
    • datasetcatalog.nlm.nih.gov
    xls
    Updated May 16, 2024
    Md Abdus Sahid; Mozaddid Ul Hoque Babar; Md Palash Uddin (2024). Performance comparison of hyper-parameterized classifiers on filtered LMCH data dynamics using an 80:20 partition with and without feature selection. [Dataset]. http://doi.org/10.1371/journal.pone.0300785.t011
    Available download formats: xls
    Dataset updated
    May 16, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Md Abdus Sahid; Mozaddid Ul Hoque Babar; Md Palash Uddin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Performance comparison of hyper-parameterized classifiers on filtered LMCH data dynamics using an 80:20 partition with and without feature selection.

  17. f

    Performance comparison of hyper-parameterized classifiers on original LMCH...

    • plos.figshare.com
    xls
    Updated May 16, 2024
    Md Abdus Sahid; Mozaddid Ul Hoque Babar; Md Palash Uddin (2024). Performance comparison of hyper-parameterized classifiers on original LMCH data dynamics using a 70:30 partition with and without feature selection. [Dataset]. http://doi.org/10.1371/journal.pone.0300785.t008
    Available download formats: xls
    Dataset updated
    May 16, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Md Abdus Sahid; Mozaddid Ul Hoque Babar; Md Palash Uddin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Performance comparison of hyper-parameterized classifiers on original LMCH data dynamics using a 70:30 partition with and without feature selection.

  18. f

    Literature analysis summarized by three main categories and six...

    • plos.figshare.com
    xls
    Updated Jun 4, 2023
    Takaya Saito; Marc Rehmsmeier (2023). Literature analysis summarized by three main categories and six subcategories. [Dataset]. http://doi.org/10.1371/journal.pone.0118432.t004
    Available download formats: xls
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Takaya Saito; Marc Rehmsmeier
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    (a) SVM: type of SVM; Data: data type; Eval: evaluation method. (b) The total number of articles is 58. (c) Filtered by SVM binary (BS) AND Imbalanced (IB1 or IB2) AND NOT Small sample size (SS). The total number of these articles is 33. Literature analysis summarized by three main categories and six subcategories.

  19. f

    Analysis before utilizing hyperparameter optimization and feature selection...

    • plos.figshare.com
    xls
    Updated May 16, 2024
    Md Abdus Sahid; Mozaddid Ul Hoque Babar; Md Palash Uddin (2024). Analysis before utilizing hyperparameter optimization and feature selection techniques. [Dataset]. http://doi.org/10.1371/journal.pone.0300785.t007
    Available download formats: xls
    Dataset updated
    May 16, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Md Abdus Sahid; Mozaddid Ul Hoque Babar; Md Palash Uddin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis before utilizing hyperparameter optimization and feature selection techniques.

  20. f

    Hyperparameter settings for Word2Vec and Doc2vec.

    • plos.figshare.com
    xls
    Updated Oct 18, 2024
    Lu Xiao; Qiaoxing Li; Qian Ma; Jiasheng Shen; Yong Yang; Danyang Li (2024). Hyperparameter settings for Word2Vec and Doc2vec. [Dataset]. http://doi.org/10.1371/journal.pone.0305095.t005
    Available download formats: xls
    Dataset updated
    Oct 18, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Lu Xiao; Qiaoxing Li; Qian Ma; Jiasheng Shen; Yong Yang; Danyang Li
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Text classification, as an important research area of text mining, can quickly and effectively extract valuable information to address the challenges of organizing and managing large-scale text data in the era of big data. Currently, research on text classification tends to focus on applications in fields such as information filtering, information retrieval, public opinion monitoring, and library and information science, with few studies applying text classification methods to the field of tourist attractions. In light of this, a corpus of tourist attraction description texts is constructed using web crawler technology in this paper. We propose a novel text representation method that combines Word2Vec word embeddings with TF-IDF-CRF-POS weighting, optimizing traditional TF-IDF by incorporating total relative term frequency, category discriminability, and part-of-speech information. Subsequently, the proposed algorithm is combined with each of seven commonly used classifiers (DT, SVM, LR, NB, MLP, RF, and KNN), known for their good performance, to achieve multi-class text classification for six subcategories of national A-level tourist attractions. The effectiveness and superiority of this algorithm are validated by comparing its overall performance, specific category performance, and model stability against several commonly used text representation methods. The results demonstrate that the newly proposed algorithm achieves higher accuracy and F1-measure on this type of professional dataset, and even outperforms the high-performance BERT classification model currently favored by the industry: its accuracy (Acc), macro-F1, and micro-F1 values are 2.29%, 5.55%, and 2.90% higher, respectively. Moreover, the algorithm can identify rare categories in the imbalanced dataset and exhibits better stability across datasets of different sizes. Overall, the algorithm presented in this paper exhibits superior classification performance and robustness. In addition, the conclusions drawn from the predicted values and the true values are consistent, indicating that this algorithm is practical. The professional domain text dataset used in this paper poses greater challenges due to its complexity (uneven text length, relatively imbalanced categories) and a high degree of similarity between categories. However, the proposed algorithm can efficiently classify multiple subcategories of this type of text set, which is a useful exploration of complex Chinese text datasets in specific fields and provides a useful reference for the vector representation and classification of text datasets with similar content.
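
The general idea of weighting word embeddings by term importance can be sketched as follows. This is not the TF-IDF-CRF-POS scheme from the paper, whose formula is not given in the description; it uses plain TF-IDF with a "+1" smoothed IDF (an assumption), toy two-dimensional vectors in place of trained Word2Vec embeddings, and illustrative function names.

```python
import math
from collections import Counter

def idf_table(docs):
    """Inverse document frequency per token, using log(N / df) + 1."""
    n_docs = len(docs)
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    return {tok: math.log(n_docs / count) + 1.0 for tok, count in doc_freq.items()}

def weighted_doc_vector(doc, idf, vectors):
    """TF-IDF-weighted average of the word vectors of one tokenized document."""
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for tok, count in Counter(doc).items():
        if tok not in vectors or tok not in idf:
            continue  # skip tokens without an embedding or an IDF value
        weight = (count / len(doc)) * idf[tok]
        weight_sum += weight
        for i, component in enumerate(vectors[tok]):
            total[i] += weight * component
    if weight_sum == 0.0:
        return total  # no known tokens: return the zero vector
    return [value / weight_sum for value in total]

# Toy corpus and toy embeddings (hypothetical values, not trained vectors).
docs = [["lake", "park"], ["lake", "temple"]]
vectors = {"lake": [1.0, 0.0], "park": [0.0, 1.0], "temple": [0.0, 1.0]}
idf = idf_table(docs)
doc_vector = weighted_doc_vector(docs[0], idf, vectors)
```

Because "lake" occurs in both toy documents, it receives a lower IDF than "park", so the document vector leans toward the rarer, more discriminative word. The resulting vectors can then be passed to any of the classifiers listed in the description.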

