Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
In the field of data-driven material development, an imbalance in data sets where data points are concentrated in certain regions often causes difficulties in building regression models when machine learning methods are applied. One example of inorganic functional materials facing such difficulties is photocatalysts. Therefore, advanced data-driven approaches are expected to help efficiently develop novel photocatalytic materials even if an imbalance exists in data sets. We propose a two-stage machine learning model aimed at handling imbalanced data sets without data thinning. In this study, we used two types of data sets that exhibit the imbalance: the Materials Project data set (openly shared due to its public domain data) and the in-house metal-sulfide photocatalyst data set (not openly shared due to the confidentiality of experimental data). This two-stage machine learning model consists of the following two parts: the first regression model, which predicts the target quantitatively, and the second classification model, which determines the reliability of the values predicted by the first regression model. We also propose a search scheme for variables related to the experimental conditions based on the proposed two-stage machine learning model. This scheme is designed for photocatalyst exploration, taking experimental conditions into account as the optimal set of variables for these conditions is unknown. The proposed two-stage machine learning model improves the prediction accuracy of the target compared with that of the one-stage model.
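The two-stage idea described above can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the study: the toy data, the choice of ordinary least squares for stage 1, the nearest-neighbour vote for stage 2, and the tolerance `TOL` that defines a "reliable" prediction.

```python
import statistics

def fit_line(xs, ys):
    """Stage 1 (assumed): ordinary least squares fit of y = intercept + slope * x."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

def knn_vote(x, train_xs, labels, k=3):
    """Stage 2 (assumed): majority vote among the k nearest training points."""
    nearest = sorted(range(len(train_xs)), key=lambda i: abs(train_xs[i] - x))[:k]
    return sum(labels[i] for i in nearest) * 2 > k

# Hypothetical imbalanced data: 100 clean points in [0, 1) and
# 5 noisy points in a sparsely sampled region around x = 4 to 5.
dense_x = [i / 100 for i in range(100)]
sparse_x = [4.0 + 0.2 * j for j in range(5)]
xs = dense_x + sparse_x
ys = [2 * x for x in dense_x] + [
    2 * x + (3 if j % 2 == 0 else -3) for j, x in enumerate(sparse_x)
]

intercept, slope = fit_line(xs, ys)

def predict(x):
    return intercept + slope * x

# Label each training point by whether the stage-1 error is within TOL.
TOL = 0.5  # hypothetical reliability threshold
reliable = [abs(y - predict(x)) <= TOL for x, y in zip(xs, ys)]

def predict_with_flag(x):
    """Return the stage-1 prediction and the stage-2 reliability flag."""
    return predict(x), knn_vote(x, xs, reliable)

print(predict_with_flag(0.5))  # query inside the dense region
print(predict_with_flag(4.5))  # query inside the sparse region
```

In this sketch no data points are discarded: the sparse region still contributes to the stage-1 fit, and the stage-2 flag only tells the user where the fitted values should be treated with caution.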
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Regression of articles’ imbalance (square-root-transformed) on relevant predictors and their interactions with the dummy-coded direction of article polarity (estimates in parentheses result if the dummy variable gets a value of zero for conventional perspectives).
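As a reading aid, the kind of model this caption describes can be sketched for a single predictor. The symbols are illustrative assumptions, not taken from the source: x_i is one predictor and D_i is the dummy for the direction of article polarity.

```latex
\sqrt{\text{imbalance}_i} = \beta_0 + \beta_1 x_i + \beta_2 D_i + \beta_3 \, (x_i \cdot D_i) + \varepsilon_i
```

Under this coding, \beta_1 is the slope of x_i for articles with D_i = 0, and \beta_1 + \beta_3 is the slope for articles with D_i = 1. Swapping which group is coded as zero therefore changes the reported main-effect estimates, which is presumably why a second set of estimates appears in parentheses.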
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Sample size (n) of the full dataset generated under each class-imbalance ratio (IR) to achieve a target balanced sample size (nb).
Terms of service: https://cubig.ai/store/terms-of-service
1) Data Introduction
• The Student Performance Dataset is a tabular survey of secondary school mathematics students. It covers student demographics, family environment, parents' education and occupation, health, family relationships, and grades.
2) Data Utilization
(1) Characteristics of the Student Performance Dataset:
• Each row contains 33 attributes, including school ID, gender, age, family size, parents' educational level and occupation, family relationships, health status, and grades.
• It suits a range of data analysis and prediction exercises, such as regression analysis and categorical-variable imbalance analysis, with Grade as the target variable.
(2) The Student Performance Dataset can be used for:
• Predicting academic achievement and analyzing influencing factors: analyzing how a student's background, family environment, and parental characteristics affect grades, and developing a grade prediction model.
• Establishing educational policies and customized support strategies: using student-specific characteristics and grade data to inform policies such as closing educational gaps, supporting vulnerable student groups, and providing customized learning guidance.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Regression model input variables and resulting regression coefficients by cluster for the calibration dataset.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Prior work applied hierarchical clustering, coarsened exact matching (CEM), time series regressions with lagged variables as inputs, and microsimulation to data from three randomized clinical trials (RCTs) and a large German observational study (OS) to predict pregabalin pain reduction outcomes for patients with painful diabetic peripheral neuropathy. Here, data were added from six RCTs to reduce covariate bias of the same OS and improve accuracy and/or increase the variety of patients for pain response prediction. Using hierarchical cluster analysis and CEM, a matched dataset was created from the OS (N = 2642) and nine total RCTs (N = 1320). Using a maximum likelihood method, we estimated weekly pain scores for pregabalin-treated patients for each cluster (matched dataset); the models were validated with RCT data that did not match with OS data. We predicted novel ‘virtual’ patient pain scores over time using simulations including instance-based machine learning techniques to assign novel patients to a cluster, then applying cluster-specific regressions to predict pain response trajectories. Six clusters were identified according to baseline variables (gender, age, insulin use, body mass index, depression history, pregabalin monotherapy, prior gabapentin, pain score, and pain-related sleep interference score). CEM yielded 1766 patients (matched dataset) having lower covariate imbalances. Regression models for pain performed well (adjusted R-squared 0.90–0.93; root mean square errors 0.41–0.48). Simulations showed positive predictive values for achieving >50% and >30% change-from-baseline pain score improvements (range 68.6–83.8% and 86.5–93.9%, respectively). Using more RCTs (nine vs. the earlier three) enabled matching of 46.7% more patients in the OS dataset, with substantially reduced global imbalance vs. not matching. This larger RCT pool covered 66.8% of possible patient characteristic combinations (vs. 25.0% with three original RCTs) and made prediction possible for a broader spectrum of patients.
Trial Registration: www.clinicaltrials.gov (as applicable): NCT00156078, NCT00159679, NCT00143156, NCT00553475.
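The matching step mentioned above can be illustrated with a minimal sketch of coarsened exact matching. It is not the study's pipeline: the field names, bin widths, and the three covariates used here are hypothetical, whereas the abstract lists nine baseline variables. A full CEM analysis would typically also compute stratum weights, which this sketch omits.

```python
from collections import defaultdict

def coarsen(patient, age_width=10, pain_width=2):
    """Map a patient to a stratum key by binning continuous covariates.

    The bin widths are illustrative assumptions.
    """
    return (
        patient["sex"],
        patient["age"] // age_width,
        patient["pain"] // pain_width,
    )

def cem_match(observational, trial):
    """Keep only the strata that contain patients from both data sources."""
    strata = defaultdict(lambda: {"os": [], "rct": []})
    for p in observational:
        strata[coarsen(p)]["os"].append(p["id"])
    for p in trial:
        strata[coarsen(p)]["rct"].append(p["id"])
    return {key: grp for key, grp in strata.items() if grp["os"] and grp["rct"]}

# Hypothetical toy records.
os_patients = [
    {"id": "os1", "sex": "F", "age": 54, "pain": 6},
    {"id": "os2", "sex": "M", "age": 61, "pain": 7},
    {"id": "os3", "sex": "F", "age": 72, "pain": 4},
]
rct_patients = [
    {"id": "rct1", "sex": "F", "age": 58, "pain": 7},
    {"id": "rct2", "sex": "M", "age": 45, "pain": 7},
]

matched = cem_match(os_patients, rct_patients)
for key, group in matched.items():
    print(key, group)
```

Coarser bins match more patients at the cost of larger within-stratum differences, and finer bins do the opposite.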
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Statistical comparison of clusters within the matched dataset, within the validation dataset, and between the matched and validation datasets.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Summary of patients from RCTs included in virtual Lab 2.0 by maintenance dose.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Precision medicine knowledge regression model results.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Trust regression models.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
GMM regression of Eq (38) (dependent variable: lnrpgdp).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Two-stage least squares (2SLS) instrumental variable method regression results.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Variables and data resources in the study.