GNU General Public License v3.0 (GPL-3.0): https://www.gnu.org/licenses/gpl-3.0.html
Feature selection is an important technique for data mining before a machine learning algorithm is applied. Despite its importance, most studies of feature selection are restricted to batch learning. Unlike traditional batch learning methods, online learning represents a promising family of efficient and scalable machine learning algorithms for large-scale applications. Most existing studies of online learning require access to all the attributes/features of training instances. Such a classical setting is not always appropriate for real-world applications where data instances are of high dimensionality or where acquiring the full set of attributes/features is expensive. To address this limitation, we investigate the problem of Online Feature Selection (OFS), in which an online learner is only allowed to maintain a classifier involving a small, fixed number of features. The key challenge of OFS is how to make accurate predictions using a small, fixed number of active features, in contrast to the classical setup of online learning where all features can be used for prediction. We tackle this challenge by studying sparsity regularization and truncation techniques. Specifically, this article addresses two different tasks of online feature selection: (1) learning with full inputs, where the learner is allowed to access all features to decide the subset of active features, and (2) learning with partial inputs, where the learner may access only a limited number of features for each instance. We present novel algorithms for each of the two problems and give their performance analysis. We evaluate the proposed algorithms for online feature selection on several public datasets, and demonstrate their applications to real-world problems including image classification in computer vision and microarray gene expression analysis in bioinformatics. The encouraging results of our experiments validate the efficacy and efficiency of the proposed techniques.
Related Publications:
Hoi, S. C., Wang, J., Zhao, P., & Jin, R. (2012). Online feature selection for mining big data. In Proceedings of the 1st International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications (pp. 93-100). ACM. http://dx.doi.org/10.1145/2351316.2351329 Full text available in InK: http://ink.library.smu.edu.sg/sis_research/2402/
Wang, J., Zhao, P., Hoi, S. C., & Jin, R. (2014). Online feature selection and its applications. IEEE Transactions on Knowledge and Data Engineering, 26(3), 698-710. http://dx.doi.org/10.1109/TKDE.2013.32 Full text available in InK: http://ink.library.smu.edu.sg/sis_research/2277/
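To make the truncation idea concrete, the sketch below implements a linear classifier updated by online (sub)gradient descent on the hinge loss, with the weight vector truncated to its B largest-magnitude entries after each round. This is a minimal illustration in R under our own naming (ofs_truncate, B, eta, lambda), not the paper's reference implementation.

```r
# Minimal sketch of truncation-based online feature selection for a
# linear classifier with hinge loss; labels y must be in {-1, +1}.
# Names (ofs_truncate, B, eta, lambda) are illustrative, not the
# paper's reference implementation.
ofs_truncate <- function(X, y, B = 10, eta = 0.2, lambda = 0.01) {
  d <- ncol(X)
  w <- numeric(d)
  for (t in seq_len(nrow(X))) {
    x <- X[t, ]
    if (y[t] * sum(w * x) < 1) {
      # mistake round: subgradient step with L2 shrinkage
      w <- (1 - eta * lambda) * w + eta * y[t] * x
    } else {
      # no loss: shrink only (sparsity regularization)
      w <- (1 - eta * lambda) * w
    }
    # truncation: keep only the B largest-magnitude weights
    if (sum(w != 0) > B) {
      keep <- order(abs(w), decreasing = TRUE)[seq_len(B)]
      w_new <- numeric(d)
      w_new[keep] <- w[keep]
      w <- w_new
    }
  }
  w  # at most B non-zero entries: the active feature set
}
```

The returned weight vector has at most B non-zero entries; those entries are the active features the learner is allowed to keep.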
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Feature preparation

Preprocessing was applied to the data, such as creating dummy variables and performing transformations (centering, scaling, Yeo-Johnson) using the preProcess() function from the "caret" package in R. The correlation among the variables was examined and no serious multicollinearity problems were found. A stepwise variable selection was performed using a logistic regression model. The final set of variables included:
- Demographic: age, body mass index, sex, ethnicity, smoking
- History of disease: heart disease, migraine, insomnia, gastrointestinal disease
- COVID-19 history: COVID vaccination, rashes, conjunctivitis, shortness of breath, chest pain, cough, runny nose, dysgeusia, muscle and joint pain, fatigue, fever, COVID-19 reinfection, and ICU admission

These variables were used to train and test various machine learning models.

Model selection and training

The data was randomly split into 80% training and 20% testing subsets. The "h2o" package in R version 4.3.1 was employed to implement different algorithms. AutoML was first used, which automatically explored a range of models with different configurations. Gradient Boosting Machines (GBM), Random Forest (RF), and Regularized Generalized Linear Model (GLM) were identified as the best-performing models on our data, and their parameters were fine-tuned. An ensemble method that stacked different models together was also used, as it can sometimes improve accuracy. The models were evaluated using the area under the curve (AUC) and C-statistics as diagnostic measures. The model with the highest AUC was selected for further analysis using the confusion matrix, accuracy, sensitivity, specificity, and F1 and F2 scores. The optimal prediction threshold was determined by plotting the sensitivity, specificity, and accuracy and choosing their point of intersection, as it balanced the trade-off between the three metrics. The model's predictions were also plotted and classified by quartile range: below the 1st quartile (very low), between the 1st and 2nd quartiles (low), between the 2nd and 3rd quartiles (moderate), and above the 3rd quartile (high).

Metric formulas:
- C-statistic: (TPR + TNR - 1) / 2
- Sensitivity/Recall: TP / (TP + FN)
- Specificity: TN / (TN + FP)
- Accuracy: (TP + TN) / (TP + TN + FP + FN)
- F1 score: 2 * (precision * recall) / (precision + recall)

Model interpretation

We used the variable importance plot, which is a measure of how much each variable contributes to the predictive power of a machine learning model. In the h2o package, variable importance for GBM and RF is calculated by measuring the decrease in the model's error when a variable is split on. The more a variable's splits decrease the error, the more important that variable is considered to be. The error is calculated as SE = MSE * N = VAR * N, and it is then scaled between 0 and 1 and plotted. We also used the SHAP summary plot, a graphical tool to visualize the impact of input features on the predictions of a machine learning model. SHAP stands for SHapley Additive exPlanations, a method that calculates the contribution of each feature to the prediction by averaging over all possible subsets of features [28]. The SHAP summary plot shows the distribution of the SHAP values for each feature across the data instances. We use the h2o.shap_summary_plot() function in R to generate the SHAP summary plot for our GBM model, passing the model object and the test data as arguments, and optionally specifying the columns (features) to include in the plot. The plot shows the SHAP values for each feature on the x-axis and the features on the y-axis. The color indicates whether the feature value is low (blue) or high (red). The plot also shows the distribution of the feature values as a density plot on the right.
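As a rough end-to-end illustration of this pipeline, the sketch below assumes a data frame dat with a binary column outcome; the object names, AutoML budget, and split seed are our own placeholders, not the authors' actual code.

```r
# Minimal sketch of the pipeline above; 'dat' and its 'outcome' column
# are illustrative placeholders, not the authors' actual objects.
library(caret)
library(h2o)

# Preprocessing: center, scale, and Yeo-Johnson transform the predictors
pp <- preProcess(dat[, setdiff(names(dat), "outcome")],
                 method = c("center", "scale", "YeoJohnson"))
dat_pp <- predict(pp, dat)

h2o.init()
hf <- as.h2o(dat_pp)
hf$outcome <- as.factor(hf$outcome)  # binary classification target

# 80/20 train-test split
splits <- h2o.splitFrame(hf, ratios = 0.8, seed = 1)
train <- splits[[1]]
test  <- splits[[2]]

# AutoML explores GBMs, RFs, GLMs, stacked ensembles, etc.
aml  <- h2o.automl(y = "outcome", training_frame = train,
                   max_models = 20, seed = 1)
best <- aml@leader

# Evaluate on the held-out test set, then inspect the model
h2o.auc(h2o.performance(best, newdata = test))
h2o.varimp_plot(best)              # variable importance (GBM/RF/GLM)
h2o.shap_summary_plot(best, test)  # SHAP summary (tree-based models)
```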
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the reproducibility package for the paper "On the Anatomy of Real-World R Code for Static Analysis", accepted at MSR 2024.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Improving the flame retardancy of polymeric materials used in engineering applications is an increasingly important strategy for limiting fire hazards. However, the wide variety of flame retardant polymeric nanocomposite compositions prevents quick identification of the optimal design for a specific application. In this study, we built a flame retardancy database of more than 800 polymeric nanocomposites, including information on polymer flammability, thermal stability, and nanofiller properties. We then applied five machine learning algorithms to predict the flame retardancy index (FRI) for different types of flame retardant polymeric nanocomposites. Among them, extreme gradient boosting regression gives the best prediction, with a coefficient of determination (R2) of 0.94 and a root-mean-square error of 0.17. In addition, we studied how the physical features of polymeric nanocomposites affect flame retardancy using the correlation matrix and feature importance plot, which in turn were used to guide the design of polymeric nanocomposites for flame retardant applications. Following these guidelines, a high-performance flame retardant polymeric nanocomposite was designed and synthesized, and the experimental FRI was compared with the machine learning prediction (6% prediction error). This result demonstrates fast identification of the flame retardancy of polymeric nanocomposites without large-scale fire tests, which could accelerate the design of functional polymeric nanocomposites in the flame retardant field.
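As a rough sketch of how such a gradient boosting regression could be set up, the snippet below fits an XGBoost regressor to a descriptor matrix X and an FRI vector y and draws a feature importance plot. The split, hyperparameters, and names are illustrative assumptions, not the study's actual configuration.

```r
# Rough sketch of an XGBoost regression for a flame retardancy index;
# X (descriptor matrix), y (FRI values), and all hyperparameters are
# illustrative assumptions, not the study's actual configuration.
library(xgboost)

set.seed(1)
idx    <- sample(nrow(X), floor(0.8 * nrow(X)))
dtrain <- xgb.DMatrix(X[idx, , drop = FALSE],  label = y[idx])
dtest  <- xgb.DMatrix(X[-idx, , drop = FALSE], label = y[-idx])

model <- xgb.train(params = list(objective = "reg:squarederror",
                                 eta = 0.1, max_depth = 4),
                   data = dtrain, nrounds = 300)

pred <- predict(model, dtest)
rmse <- sqrt(mean((pred - y[-idx])^2))
r2   <- 1 - sum((pred - y[-idx])^2) / sum((y[-idx] - mean(y[-idx]))^2)

# Feature importance plot, analogous to the one used to guide design
xgb.plot.importance(xgb.importance(model = model))
```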
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Metal–organic frameworks (MOFs), an emerging class of nanoporous materials, have drawn considerable attention as promising adsorbents for gas separations. Among various separation applications, CO2/CO separation is of particular interest owing to its industrial relevance. While searching for promising MOFs from tens of thousands of candidates represents a great challenge, this study conducts large-scale molecular simulations to identify top-performing CO2 adsorbents, followed by investigating structure–property relationships for their design. Optimal MOFs are found to possess features such as metal nodes of greater metallic charges and dipole moments with a relatively confined pore structure. With the large-scale data at our disposal, machine learning models capable of predicting the CO2-to-CO selectivity and adsorption uptakes are also established. Specifically, three algorithms including support vector regression (SVR), extreme gradient boosting (XGBoost), and random forest (RF) models are employed. The results show that the RF algorithm demonstrates the best accuracy, and the r value for the predicted CO2-to-CO selectivity (S) can be as large as ∼0.88. The relative importance of the adopted features is also investigated with results suggesting that the adsorption of CO2 initiates more preferentially than that of CO due to the stronger van der Waals interaction and electrostatic contribution between CO2 and the metal sites. Finally, a design rule is proposed for the optimal design of CO2-selective materials. Overall, this work demonstrates a successful hybrid approach combining molecular simulations and machine learning for screening highly CO2/CO selective MOFs and offering insights into the design of optimal adsorbents.
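For the machine learning side, a random forest regression of the kind reported could look roughly like the following, assuming a data frame mofs of structural and chemical descriptors with a numeric selectivity column; all names are illustrative placeholders, not the study's data.

```r
# Rough sketch of a random forest model for CO2-to-CO selectivity,
# assuming a data frame 'mofs' of MOF descriptors with a numeric
# 'selectivity' column; all names are illustrative placeholders.
library(randomForest)

set.seed(1)
idx <- sample(nrow(mofs), floor(0.8 * nrow(mofs)))
rf  <- randomForest(selectivity ~ ., data = mofs[idx, ],
                    ntree = 500, importance = TRUE)

# Pearson r between predicted and simulated selectivity (~0.88 reported)
pred <- predict(rf, mofs[-idx, ])
cor(pred, mofs$selectivity[-idx])

# Relative importance of the adopted features
varImpPlot(rf)
```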
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Parameter setting table for training feature extraction network.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The performance of electrochemical sensors is influenced by various factors. To enhance the effectiveness of these sensors, it is crucial to find the right balance among these factors. Researchers and engineers continually explore innovative approaches to enhance sensitivity, selectivity, and reliability. Machine learning (ML) techniques facilitate the analysis and predictive modeling of sensor performance by establishing quantitative relationships between parameters and their effects. This work presents a case study on developing a molecularly imprinted polymer (MIP)-based sensor for detecting doxorubicin (Dox), emphasizing the use of ML-based ensemble models to improve performance and reliability. Four ML models, including Decision Tree (DT), eXtreme Gradient Boosting (XGBoost), Random Forest (RF), and K-Nearest Neighbors (KNN), are used to evaluate the effect of each parameter on prediction performance, using the SHapley Additive exPlanations (SHAP) method to determine feature importance. Based on this analysis, removing a less influential feature and introducing a new feature significantly improved the model's predictive capabilities. Applying the min-max scaling technique ensures that all features contribute proportionally to the model learning process. Additionally, multiple ML models, including Linear Regression (LR), KNN, DT, RF, Adaptive Boosting (AdaBoost), Gradient Boosting (GB), Support Vector Regression (SVR), XGBoost, Bagging, Partial Least Squares (PLS), and Ridge Regression, are applied to the data set, and their performance in predicting the sensor output current is compared. To further enhance prediction performance, a novel ensemble model is proposed that integrates DT, RF, GB, XGBoost, and Bagging regressors, leveraging their combined strengths to offset individual weaknesses. The main benefit of this work lies in its ability to enhance MIP-based sensor performance by developing a novel stacking regressor ensemble model, which improves prediction performance and reliability. This methodology is broadly applicable to the development of other sensors with different transducers and sensing elements. Through extensive simulation results, the proposed stacking regressor ensemble model demonstrated superior predictive performance compared to individual ML models. The model achieved an R-squared (R2) of 0.993, significantly reducing the root-mean-square error (RMSE) to 0.436 and the mean absolute error (MAE) to 0.244. These improvements enhanced the sensitivity and reliability of the MIP-based electrochemical sensor, demonstrating a substantial performance gain over individual ML models.
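A stacking regressor of this kind can be sketched as follows: base regressors are trained on a training split, their predictions on a holdout split become meta-features, and a simple meta-learner combines them. For brevity the sketch uses only DT, RF, and GB base learners (the paper's ensemble also stacks XGBoost and Bagging), with min-max scaling applied first; dat and the response name current are illustrative assumptions.

```r
# Sketch of a stacking regressor: base learners' holdout predictions
# become meta-features for a linear meta-learner. Only DT, RF, and GB
# are shown for brevity; 'dat' and its response column 'current' are
# illustrative assumptions, not the paper's data.
library(rpart)
library(randomForest)
library(gbm)

set.seed(1)
idx   <- sample(nrow(dat), floor(0.7 * nrow(dat)))
train <- dat[idx, ]
hold  <- dat[-idx, ]

# Min-max scaling (fit on train) so features contribute proportionally
rng <- lapply(train[, names(train) != "current"], range)
scale_mm <- function(df) {
  for (v in names(rng))
    df[[v]] <- (df[[v]] - rng[[v]][1]) / (rng[[v]][2] - rng[[v]][1])
  df
}
train <- scale_mm(train)
hold  <- scale_mm(hold)

# Base regressors
dt <- rpart(current ~ ., data = train)
rf <- randomForest(current ~ ., data = train, ntree = 500)
gb <- gbm(current ~ ., data = train, distribution = "gaussian",
          n.trees = 500)

# Meta-features: each base learner's prediction on the holdout split
meta <- data.frame(dt = predict(dt, hold),
                   rf = predict(rf, hold),
                   gb = predict(gb, hold, n.trees = 500),
                   current = hold$current)

stack <- lm(current ~ ., data = meta)  # simple linear meta-learner
```

In practice the meta-features are often built from out-of-fold predictions rather than a single holdout split, which reduces the risk of the meta-learner overfitting.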
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Hydrogen energy holds promise for controlling emissions but is limited by production cost and method. Chemical looping hydrogen production (CLHP) provides a more efficient and environmentally sustainable route to high-purity hydrogen than conventional methods. Yet CLHP involves a series of operational variables, and optimizing the operating conditions is the critical issue for large-scale hydrogen production. In this study, support vector machine (SVM), decision tree (DT), random forest (RF), artificial neural network (ANN), and physics-informed neural network (PINN) models are developed to predict hydrogen production rates by analyzing multiple process variables. Through analysis of the database and experiments, we integrated physical consistency as prior physical knowledge into the PINN to reduce its dependence on data. All models are tuned through hyperparameter optimization for best performance. A comparison of the five machine learning models reveals that the DT and RF models exhibit a characteristic step-like pattern in their predictions, while the SVM and ANN models produce outputs that often diverge from the expected trend. The PINN model performs well, with R2, mean squared error, and mean absolute percentage error scores of 0.882, 1.228, and 18.1%, respectively, and its results are highly interpretable owing to the physics-informed structure. The CLHP process is then studied, and relationships between hydrogen yield and operating temperature, gas flow rate, and mass fraction of iron oxide are established. This work shows the differences in the prediction curves between the models. By training various general models and comparing their predictive performance on chemical looping data, we can gain valuable insights to guide subsequent predictions for CLHP, which will benefit the design of oxygen carriers and the optimization of the CLHP process.
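The reported scores follow their standard definitions; for reference, a minimal sketch in R, with y the observed and yhat the predicted hydrogen production rate (illustrative names):

```r
# Standard definitions of the reported scores, assuming vectors of
# observed (y) and predicted (yhat) hydrogen production rates.
r2   <- function(y, yhat) 1 - sum((y - yhat)^2) / sum((y - mean(y))^2)
mse  <- function(y, yhat) mean((y - yhat)^2)
mape <- function(y, yhat) mean(abs((y - yhat) / y)) * 100  # percent
```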
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Model Performance Metrics After Feature Fusion and Normalization.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
mAP comparison of three kinds of industrial equipment detection by the ROMS R-CNN algorithm with different structures and functions.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Comparison of parameters and complexity of different network structures in feature extraction.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Performance Evaluation of Feature Fusion and Normalization, AFR, and Hybrid Methods.