License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/ (license information was derived automatically)
Protein–DNA interactions are pivotal to various cellular processes. Precise identification of the hotspot residues in protein–DNA interactions is of great significance for revealing the intricate mechanisms of protein–DNA recognition and for providing essential guidance for protein engineering. Targeting protein–DNA interaction hotspots, this work introduces an effective prediction method, ESPDHot, based on a stacked ensemble machine learning framework. Here, an interface residue whose mutation leads to a binding free energy change (ΔΔG) exceeding 2 kcal/mol is defined as a hotspot. To tackle the imbalanced data set issue, adaptive synthetic sampling (ADASYN), an oversampling technique, is adopted to synthetically generate new minority samples, thereby rectifying the data imbalance. Beyond traditional molecular features, we introduce three new feature types: residue interface preference (proposed by us), residue fluctuation dynamics characteristics, and coevolutionary features. Combining the Boruta method with our previously developed Random Grouping strategy, we obtained an optimal feature set. Finally, a stacking classifier is constructed to output the prediction results; it integrates three classical predictors, Support Vector Machine (SVM), XGBoost, and Artificial Neural Network (ANN), as the first layer, and a Logistic Regression (LR) model as the second. Notably, ESPDHot outperforms current state-of-the-art predictors, achieving superior performance on the independent test data set, with F1, MCC, and AUC reaching 0.571, 0.516, and 0.870, respectively.
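The general recipe described above (ADASYN oversampling followed by a two-layer stacking classifier) can be illustrated with a minimal sketch. This is not the authors' ESPDHot pipeline: the synthetic data, feature counts, and hyperparameters below are placeholders, and scikit-learn, imbalanced-learn, and xgboost are assumed to be installed.

```python
# Sketch only: ADASYN oversampling + stacking (SVM, XGBoost, ANN -> LR) on synthetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import StackingClassifier
from sklearn.metrics import f1_score, matthews_corrcoef, roc_auc_score
from imblearn.over_sampling import ADASYN
from xgboost import XGBClassifier

# Synthetic stand-in for hotspot vs. non-hotspot interface residues (~1:9 imbalance)
X, y = make_classification(n_samples=600, n_features=40, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, test_size=0.3, random_state=0)

# Oversample only the training split so the test set stays untouched
X_res, y_res = ADASYN(random_state=0).fit_resample(X_tr, y_tr)

stack = StackingClassifier(
    estimators=[
        ("svm", SVC(probability=True, random_state=0)),
        ("xgb", XGBClassifier(n_estimators=200, eval_metric="logloss", random_state=0)),
        ("ann", MLPClassifier(hidden_layer_sizes=(64,), max_iter=1000, random_state=0)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),  # second-layer meta-learner
    cv=5,
)
stack.fit(X_res, y_res)

proba = stack.predict_proba(X_te)[:, 1]
pred = stack.predict(X_te)
print(f1_score(y_te, pred), matthews_corrcoef(y_te, pred), roc_auc_score(y_te, proba))
```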
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically)
Diabetes is a persistent metabolic disorder linked to elevated levels of blood glucose, commonly referred to as blood sugar. Over time, this condition can have detrimental effects on the heart, blood vessels, eyes, kidneys, and nerves. It is a chronic ailment that arises when the body fails to produce enough insulin or is unable to use the insulin it produces effectively. When diabetes is not properly managed, it often leads to hyperglycemia, a condition characterized by elevated blood sugar levels or impaired glucose tolerance, which can result in significant harm to various body systems, including the nerves and blood vessels. In this paper, we propose a multiclass diabetes mellitus detection and classification approach using the extremely imbalanced Laboratory of Medical City Hospital (LMCH) data dynamics. We also formulate a new, moderately imbalanced dataset based on the LMCH data dynamics. To correctly identify multiclass diabetes mellitus, we employ three machine learning classifiers, namely support vector machine, logistic regression, and k-nearest neighbor. We also focus on dimensionality reduction (feature selection via filter, wrapper, and embedded methods) to prune unnecessary features and improve classification performance. To optimize the classifiers, we tune the models by hyperparameter optimization with 10-fold grid search cross-validation. On the original extremely imbalanced dataset with a 70:30 partition and the support vector machine classifier, we achieved a maximum accuracy of 0.964, precision of 0.968, recall of 0.964, F1-score of 0.962, Cohen's kappa of 0.835, and AUC of 0.99 using the top 4 features according to the filter method. Using the top 9 features according to wrapper-based sequential feature selection, the k-nearest neighbor classifier provides an accuracy of 0.935 and 1.0 for the other performance metrics. For our created moderately imbalanced dataset with an 80:20 partition, the SVM classifier achieves a maximum accuracy of 0.938 and 1.0 for the other performance metrics. For multiclass diabetes mellitus detection and classification, our experiments outperformed previously conducted research based on the LMCH data dynamics.
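A minimal sketch of the general recipe (filter-based feature selection, an SVM classifier, and 10-fold grid-search cross-validation) is shown below. The LMCH data are not reproduced here, so a synthetic imbalanced multiclass dataset and an illustrative parameter grid stand in; only scikit-learn is assumed.

```python
# Sketch: top-k filter feature selection + SVM tuned by 10-fold grid-search CV.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# Synthetic stand-in for an imbalanced multiclass diabetes dataset
X, y = make_classification(n_samples=1000, n_features=12, n_informative=8,
                           n_classes=3, weights=[0.8, 0.15, 0.05], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, stratify=y, random_state=1)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif, k=4)),  # filter method, top-4 features
    ("svm", SVC()),
])
grid = GridSearchCV(
    pipe,
    param_grid={"svm__C": [0.1, 1, 10, 100], "svm__gamma": ["scale", 0.01, 0.001]},
    cv=10, scoring="f1_weighted", n_jobs=-1,
)
grid.fit(X_tr, y_tr)
print(grid.best_params_)
print(classification_report(y_te, grid.predict(X_te)))
```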
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically)
Text classification, as an important research area of text mining, can quickly and effectively extract valuable information to address the challenges of organizing and managing large-scale text data in the era of big data. Current research on text classification tends to focus on applications in fields such as information filtering, information retrieval, public opinion monitoring, and library and information science, with few studies applying text classification methods to the field of tourist attractions. In light of this, a corpus of tourist attraction description texts is constructed using web crawler technology in this paper. We propose a novel text representation method that combines Word2Vec word embeddings with TF-IDF-CRF-POS weighting, optimizing traditional TF-IDF by incorporating total relative term frequency, category discriminability, and part-of-speech information. The proposed algorithm is then combined with each of seven commonly used classifiers known for their good performance (DT, SVM, LR, NB, MLP, RF, and KNN) to achieve multi-class text classification for six subcategories of national A-level tourist attractions. The effectiveness and superiority of this algorithm are validated by comparing the overall performance, specific category performance, and model stability against several commonly used text representation methods. The results demonstrate that the newly proposed algorithm achieves higher accuracy and F1-measure on this type of professional dataset, and even outperforms the high-performance BERT classification model currently favored by the industry: accuracy, macro-F1, and micro-F1 values are 2.29%, 5.55%, and 2.90% higher, respectively. Moreover, the algorithm can identify rare categories in the imbalanced dataset and exhibits better stability across datasets of different sizes. Overall, the algorithm presented in this paper exhibits superior classification performance and robustness. In addition, the conclusions drawn from the predicted values are consistent with those drawn from the true values, indicating that the algorithm is practical. The professional domain text dataset used in this paper poses higher challenges due to its complexity (uneven text lengths, relatively imbalanced categories) and the high degree of similarity between categories. Nevertheless, the proposed algorithm can efficiently classify multiple subcategories of this type of text set, which is a beneficial exploration of applying complex Chinese text datasets in specific fields and provides a useful reference for the vector representation and classification of text datasets with similar content.
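The core representation idea, averaging Word2Vec word vectors weighted by per-word importance, can be sketched as follows. The paper's full TF-IDF-CRF-POS weighting adds category discriminability and part-of-speech terms that are omitted here; plain TF-IDF weights are used instead, gensim and scikit-learn are assumed, and the toy English corpus is made up for illustration.

```python
# Simplified sketch: TF-IDF-weighted Word2Vec document vectors.
import numpy as np
from gensim.models import Word2Vec
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["mountain scenic area with temple", "coastal beach resort area",
        "museum of local history", "temple and ancient garden"]
tokens = [d.split() for d in docs]

w2v = Word2Vec(sentences=tokens, vector_size=50, window=3, min_count=1, seed=0)
tfidf = TfidfVectorizer(tokenizer=str.split, lowercase=False)
tfidf_matrix = tfidf.fit_transform(docs)
vocab = tfidf.vocabulary_  # word -> column index in the TF-IDF matrix

def doc_vector(doc_idx, words):
    """Average the Word2Vec vectors of a document, weighted by each word's TF-IDF."""
    weights, vectors = [], []
    for w in words:
        if w in vocab and w in w2v.wv:
            weights.append(tfidf_matrix[doc_idx, vocab[w]])
            vectors.append(w2v.wv[w])
    if not vectors:
        return np.zeros(w2v.vector_size)
    return np.average(vectors, axis=0, weights=weights)

doc_vecs = np.vstack([doc_vector(i, t) for i, t in enumerate(tokens)])
print(doc_vecs.shape)  # (4, 50); these document vectors feed the downstream classifiers
```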
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically)
Hyperparameter optimization with 10-fold grid search CV for the original extremely imbalanced data dynamics using a 70:30 partition with and without feature selection.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically)
In credit risk assessment, unsupervised classification techniques can be introduced to reduce human resource expenses and expedite decision-making. Despite the efficacy of unsupervised learning methods in handling unlabeled datasets, their performance remains limited owing to challenges such as imbalanced data, local optima, and parameter adjustment complexities. Thus, this paper introduces a novel hybrid unsupervised classification method, named the two-stage hybrid system with spectral clustering and semi-supervised support vector machine (TSC-SVM), which effectively addresses the unsupervised imbalance problem in credit risk assessment by targeting global optimal solutions. Furthermore, a multi-view combined unsupervised method is designed to thoroughly mine data and enhance the robustness of label predictions. This method mitigates discrepancies in prediction outcomes from three distinct perspectives. The effectiveness, efficiency, and robustness of the proposed TSC-SVM model are demonstrated through various real-world applications. The proposed algorithm is anticipated to expand the customer base for financial institutions while reducing economic losses.
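The two-stage idea, an unsupervised clustering stage that produces pseudo-labels followed by a semi-supervised SVM stage, can be loosely sketched as below. This is not the authors' TSC-SVM: the confidence rule, the multi-view combination, and the credit data are all placeholders, and scikit-learn's SelfTrainingClassifier stands in for the semi-supervised SVM.

```python
# Loose sketch: spectral clustering pseudo-labels + semi-supervised SVM refinement.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.cluster import SpectralClustering
from sklearn.svm import SVC
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.metrics import balanced_accuracy_score

X, y_true = make_classification(n_samples=400, n_features=10, weights=[0.85, 0.15],
                                random_state=0)

# Stage 1: spectral clustering as an unsupervised labeller
clusters = SpectralClustering(n_clusters=2, affinity="rbf", random_state=0).fit_predict(X)

# Keep pseudo-labels only for samples close to their cluster centroid; mark the rest unlabeled (-1)
pseudo = np.full(len(X), -1)
for c in (0, 1):
    idx = np.where(clusters == c)[0]
    centroid = X[idx].mean(axis=0)
    dists = np.linalg.norm(X[idx] - centroid, axis=1)
    confident = idx[dists <= np.percentile(dists, 30)]  # closest 30% per cluster
    pseudo[confident] = c

# Stage 2: a semi-supervised SVM wrapper propagates labels to the remaining samples
s3vm = SelfTrainingClassifier(SVC(probability=True, random_state=0))
s3vm.fit(X, pseudo)
pred = s3vm.predict(X)

# Cluster ids are arbitrary, so evaluate both possible label alignments
score = max(balanced_accuracy_score(y_true, pred), balanced_accuracy_score(y_true, 1 - pred))
print(round(score, 3))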
License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/ (license information was derived automatically)
The goal of this study was to adapt a recently proposed linear large-scale support vector machine to large-scale binary cheminformatics classification problems and to assess its performance on various benchmarks using virtual screening performance measures. We extended the large-scale linear support vector machine library LIBLINEAR with state-of-the-art virtual high-throughput screening metrics to train classifiers on whole large and unbalanced data sets. This linear support vector machine formulation performs excellently when applied to high-dimensional sparse feature vectors. An additional advantage is that prediction has, on average, linear complexity in the number of non-zero features. Nevertheless, the approach assumes that a problem is linearly separable. Therefore, we conducted extensive benchmarking to evaluate the performance on large-scale problems of up to 175,000 samples. To examine the virtual screening performance, we determined chemotype clusters using Feature Trees and integrated this information to compute weighted AUC-based performance measures and a leave-cluster-out cross-validation. We also considered the BEDROC score, a metric that was suggested to tackle the early enrichment problem. The performance on each problem was evaluated by nested cross-validation and nested leave-cluster-out cross-validation. We compared LIBLINEAR against a Naïve Bayes classifier, a random decision forest classifier, and a maximum similarity ranking approach. These reference approaches were outperformed in a direct comparison by LIBLINEAR. A comparison to literature results showed that the LIBLINEAR performance is competitive, although it does not reach the results of the top-ranked nonlinear machines on these benchmarks. However, considering its overall convincing performance and computation time, the large-scale support vector machine provides an excellent alternative to established large-scale classification approaches.
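A rough sketch of the core setting, a LIBLINEAR-style linear SVM trained on sparse, high-dimensional fingerprints with a heavily unbalanced actives/inactives ratio, is given below. scikit-learn's LinearSVC (which wraps LIBLINEAR) is used, and the fingerprints, latent activity model, and 2% active rate are arbitrary stand-ins, not the benchmark data from the study.

```python
# Sketch: linear SVM on sparse binary fingerprints with class weighting, scored by ROC AUC.
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n_samples, n_bits = 20000, 4096
X = sparse_random(n_samples, n_bits, density=0.01, format="csr", random_state=0)
X.data[:] = 1.0  # binary fingerprint bits

# Simulate activity from a latent linear model so the ranking task is non-trivial
w_true = rng.normal(size=n_bits)
margin = X @ w_true
y = (margin > np.quantile(margin, 0.98)).astype(int)  # ~2% actives

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, test_size=0.3, random_state=0)

clf = LinearSVC(C=1.0, class_weight="balanced", max_iter=2000)  # LIBLINEAR under the hood
clf.fit(X_tr, y_tr)
scores = clf.decision_function(X_te)  # ranking scores for AUC-style virtual screening metrics
print(round(roc_auc_score(y_te, scores), 3))
```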
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically)
The sudden resignation of core employees often brings losses to companies in various respects. Traditional employee turnover theory cannot comprehensively analyze imbalanced employee data, which leads companies to make wrong decisions. When classifying imbalanced data, the traditional Support Vector Machine (SVM) suffers from insufficient decision-plane offset and an imbalanced support vector distribution, so the Synthetic Minority Oversampling Technique (SMOTE) is introduced to improve the balance of the generated data. Further, Fuzzy C-means (FCM) clustering is improved and combined with SMOTE (IFCM-SMOTE-SVM) to synthesize new samples with higher accuracy, overcoming the drawback that the samples synthesized by SMOTE are too random and prone to producing noisy data. A kernel function is then combined with IFCM-SMOTE-SVM to map the data into a high-dimensional space for cluster sampling and classification, yielding the kernel-space-based classification algorithm (KS-IFCM-SMOTE-SVM), which improves the effectiveness of the generated data for SVM classification. Finally, the generalization ability of KS-IFCM-SMOTE-SVM on different types of enterprise data is experimentally demonstrated, and the proposed algorithm is verified to have stable and accurate performance. This study introduces SMOTE and FCM clustering, and improves the SVM by incorporating data transformation in the kernel space, to achieve accurate classification of imbalanced employee data and help enterprises predict in advance whether employees are likely to leave.
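The baseline that this work builds on, plain SMOTE oversampling in front of an RBF-kernel SVM, is sketched below. The paper's IFCM-SMOTE-SVM and KS-IFCM-SMOTE-SVM variants add improved fuzzy C-means cluster-guided sampling and kernel-space sampling on top of this baseline; those steps are not reproduced here, scikit-learn and imbalanced-learn are assumed, and the turnover-style data are synthetic.

```python
# Baseline sketch: SMOTE + RBF-kernel SVM on imbalanced data (not the paper's full method).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import f1_score
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline

X, y = make_classification(n_samples=2000, n_features=15, weights=[0.92, 0.08], random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, test_size=0.25, random_state=7)

# imblearn's Pipeline applies SMOTE only during fit, so the test split is never resampled
model = ImbPipeline([
    ("scale", StandardScaler()),
    ("smote", SMOTE(random_state=7)),
    ("svm", SVC(kernel="rbf", C=10, gamma="scale")),
])
model.fit(X_tr, y_tr)
print(round(f1_score(y_te, model.predict(X_te)), 3))
```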
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically)
Coronary heart disease (CHD) is a major cardiovascular disorder that poses significant threats to global health and is increasingly affecting younger populations. Its treatment and prevention face challenges such as high costs, prolonged recovery periods, and the limited efficacy of traditional methods. Additionally, the complexity of diagnostic indicators and the global shortage of medical professionals further complicate accurate diagnosis. This study employs machine learning techniques to analyze CHD-related pathogenic factors and proposes an efficient diagnostic and predictive framework. To address the data imbalance issue, SMOTE-ENN is utilized, and five machine learning algorithms, Decision Trees, KNN, SVM, XGBoost, and Random Forest, are applied to the classification task. Principal Component Analysis (PCA) and grid search are used to optimize the models, with evaluation metrics including accuracy, precision, recall, F1-score, and AUC. In the random forest optimization experiment, the accuracy on the initial unbalanced data was 85.26% and the F1-score was 12.58%. After balancing the data with SMOTE-ENN, the accuracy increased to 92.16% and the F1-score reached 93.85%, increases of 6.90% and 81.27%, respectively; after adding PCA-based feature dimensionality reduction, the accuracy increased to 97.91% and the F1-score to 97.88%, increases of 5.75% and 4.03%, respectively, over the SMOTE-ENN stage. This indicates that combining data balancing and feature dimensionality reduction significantly improves accuracy and makes the random forest the best-performing model. This study provides an efficient diagnostic tool for CHD, alleviates the challenges posed by limited medical resources, and offers a scientific foundation for precise prevention and intervention strategies.
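The reported recipe (SMOTE-ENN balancing, PCA, and a grid-searched random forest) can be sketched as below. The original CHD dataset and exact hyperparameter grids are not reproduced; synthetic imbalanced data and a small illustrative grid stand in, and scikit-learn plus imbalanced-learn are assumed.

```python
# Sketch: SMOTE-ENN balancing + PCA + grid-searched random forest on synthetic CHD-style data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, accuracy_score
from imblearn.combine import SMOTEENN
from imblearn.pipeline import Pipeline as ImbPipeline

X, y = make_classification(n_samples=3000, n_features=20, n_informative=10,
                           weights=[0.9, 0.1], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)

pipe = ImbPipeline([
    ("scale", StandardScaler()),
    ("balance", SMOTEENN(random_state=42)),   # oversample minority, then clean with ENN
    ("pca", PCA(n_components=0.95)),          # keep components explaining 95% of variance
    ("rf", RandomForestClassifier(random_state=42)),
])
grid = GridSearchCV(pipe, {"rf__n_estimators": [100, 300], "rf__max_depth": [None, 10]},
                    cv=5, scoring="f1", n_jobs=-1)
grid.fit(X_tr, y_tr)
pred = grid.predict(X_te)
print(round(accuracy_score(y_te, pred), 3), round(f1_score(y_te, pred), 3))
```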
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically)
Introduction: Image and near-infrared (NIR) spectroscopic data are widely used for constructing analytical models in precision agriculture. While model interpretation can provide valuable insights for quality control and improvement, the inherent ambiguity of individual image pixels or spectral data points often hinders practical interpretability when raw data are used directly. Furthermore, imbalanced datasets can lead to model overfitting and, consequently, poor robustness. Therefore, developing alternative approaches for constructing interpretable and robust models from these data types is crucial.
Methods: This study proposes using preprocessed data, specifically morphological features extracted from images and chemical component concentrations predicted from NIR spectra, to build multiclass identification models. Combined-kernel SVM-based models are proposed to identify rice variety and the cultivation region of tobacco. The kernel parameters and the mixing percentages of the different kernel functions are determined by particle swarm optimization (PSO), which makes the approach self-adaptive. Feature importance and contribution analyses were conducted using Shapley additive explanations (SHAP).
Results: The resulting models demonstrated high robustness and accuracy, achieving classification success rates of 97.9% and 97.4% via n-fold cross-validation on the rice and tobacco datasets, respectively, and 97.7% on an independent test set (tobacco dataset 2). The analysis identified key variables and elucidated their specific contributions to the model predictions.
Discussion: This study expands the applicability of image and NIR spectroscopic data, offering researchers an effective methodology for investigating factors crucial to the quality control and improvement of agricultural products.
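The combined-kernel idea, mixing an RBF and a linear kernel into a single decision kernel, is sketched below with scikit-learn's precomputed-kernel SVM. In the study the mixing weight and kernel parameters are tuned by PSO; here the weight and gamma are simply fixed for illustration, and the features are synthetic stand-ins for the morphological and chemical descriptors.

```python
# Sketch: SVM with a precomputed convex combination of RBF and linear kernels.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics.pairwise import rbf_kernel, linear_kernel
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=500, n_features=12, n_classes=3, n_informative=8,
                           random_state=3)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, test_size=0.3, random_state=3)
scaler = StandardScaler().fit(X_tr)
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

def combined_kernel(A, B, weight=0.6, gamma=0.1):
    """Convex combination K = weight*K_rbf + (1-weight)*K_lin between row sets A and B."""
    return weight * rbf_kernel(A, B, gamma=gamma) + (1 - weight) * linear_kernel(A, B)

K_train = combined_kernel(X_tr, X_tr)
K_test = combined_kernel(X_te, X_tr)   # rows: test samples, columns: training samples

svm = SVC(kernel="precomputed", C=10).fit(K_train, y_tr)
print(round(accuracy_score(y_te, svm.predict(K_test)), 3))
```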
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically)
While the cost of road traffic fatalities in the U.S. surpasses $240 billion a year, the availability of high-resolution datasets allows meticulous investigation of the contributing factors to crash severity. In this paper, the dataset for Trucks Involved in Fatal Accidents in 2010 (TIFA 2010) is utilized to classify the truck-involved crash severity where there exist different issues including missing values, imbalanced classes, and high dimensionality. First, a decision tree-based algorithm, the Synthetic Minority Oversampling Technique (SMOTE), and the Random Forest (RF) feature importance approach are employed for missing value imputation, minority class oversampling, and dimensionality reduction, respectively. Afterward, a variety of classification algorithms, including RF, K-Nearest Neighbors (KNN), Multi-Layer Perceptron (MLP), Gradient-Boosted Decision Trees (GBDT), and Support Vector Machine (SVM) are developed to reveal the influence of the introduced data preprocessing framework on the output quality of ML classifiers. The results show that the GBDT model outperforms all the other competing algorithms for the non-preprocessed crash data based on the G-mean performance measure, but the RF makes the most accurate prediction for the treated dataset. This finding indicates that after the feature selection is conducted to alleviate the computational cost of the machine learning algorithms, bagging (bootstrap aggregating) of decision trees in RF leads to a better model rather than boosting them via GBDT. Besides, the adopted feature importance approach decreases the overall accuracy by only up to 5% in most of the estimated models. Moreover, the worst class recall value of the RF algorithm without prior oversampling is only 34.4% compared to the corresponding value of 90.3% in the up-sampled model which validates the proposed multi-step preprocessing scheme. This study also identifies the temporal and spatial (roadway) attributes, as well as crash characteristics, and Emergency Medical Service (EMS) as the most critical factors in truck crash severity.
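The preprocessing-plus-comparison idea (RF feature-importance selection, SMOTE oversampling, then competing classifiers scored by G-mean) is sketched below. The TIFA 2010 records and the tree-based imputation step are not reproduced, so synthetic multiclass "severity" data stand in; scikit-learn and imbalanced-learn are assumed, and only RF and GBDT are compared for brevity.

```python
# Sketch: RF importance-based feature selection, SMOTE, and G-mean comparison of RF vs. GBDT.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from imblearn.over_sampling import SMOTE
from imblearn.metrics import geometric_mean_score

X, y = make_classification(n_samples=4000, n_features=30, n_informative=12, n_classes=4,
                           weights=[0.55, 0.25, 0.15, 0.05], random_state=5)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, test_size=0.3, random_state=5)

# Dimensionality reduction: keep the features ranked highest by RF importance
ranker = RandomForestClassifier(n_estimators=300, random_state=5).fit(X_tr, y_tr)
top = np.argsort(ranker.feature_importances_)[::-1][:10]
X_tr_sel, X_te_sel = X_tr[:, top], X_te[:, top]

# Oversample minority severity classes on the training split only
X_res, y_res = SMOTE(random_state=5).fit_resample(X_tr_sel, y_tr)

for name, clf in [("RF", RandomForestClassifier(n_estimators=300, random_state=5)),
                  ("GBDT", GradientBoostingClassifier(random_state=5))]:
    clf.fit(X_res, y_res)
    print(name, round(geometric_mean_score(y_te, clf.predict(X_te_sel), average="multiclass"), 3))
```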
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically)
Statistical characteristics of the original extremely imbalanced LMCH and the filtered moderately imbalanced LMCH diabetes data dynamics.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically)
Hyperparameter optimization using 10-fold grid search CV for the filtered LMCH data dynamics with an 80:20 partition and feature selection.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically)
Performance comparison of hyper-parameterized classifiers on filtered LMCH data dynamics using an 80:20 partition with and without feature selection.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically)
Performance comparison of hyper-parameterized classifiers on original LMCH data dynamics using a 70:30 partition with and without feature selection.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically)
Literature analysis summarized by three main categories and six subcategories.
a SVM: type of SVM; Data: data type; Eval: evaluation method.
b The total number of articles is 58.
c Filtered by SVM binary (BS) AND Imbalanced (IB1 or IB2) AND NOT Small sample size (SS); the total number of these articles is 33.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically)
Analysis before utilizing hyperparameter optimization and feature selection techniques.