57 datasets found
  1. Data from: Using Lasso for Predictor Selection and to Assuage Overfitting: A Method Long Overlooked in Behavioral Sciences

    • tandf.figshare.com
    • search.datacite.org
    docx
    Updated Jun 3, 2023
    Cite
    Using Lasso for Predictor Selection and to Assuage Overfitting: A Method Long Overlooked in Behavioral Sciences [Dataset]. https://tandf.figshare.com/articles/dataset/Using_Lasso_for_Predictor_Selection_and_to_Assuage_Overfitting_A_Method_Long_Overlooked_in_Behavioral_Sciences/1573029
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    Taylor & Francis
    Authors
    Daniel M. McNeish
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Ordinary least squares and stepwise selection are widespread in behavioral science research; however, these methods are well known to encounter overfitting problems such that R2 and regression coefficients may be inflated while standard errors and p values may be deflated, ultimately reducing both the parsimony of the model and the generalizability of conclusions. More optimal methods for selecting predictors and estimating regression coefficients, such as regularization methods (e.g., the Lasso), have existed for decades, are widely implemented in other disciplines, and are available in mainstream software; yet these methods are essentially invisible in the behavioral science literature while the use of suboptimal methods continues to proliferate. This paper discusses potential issues with standard statistical models, provides an introduction to regularization with specific details on both the Lasso and its predecessor, ridge regression, provides an example analysis and code for running a Lasso analysis in R and SAS, and discusses limitations and related methods.
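
The contrast the abstract draws between OLS and the lasso comes down to the soft-thresholding inside the lasso's coordinate updates, which sets weak coefficients exactly to zero. A minimal numpy sketch on synthetic data (not the paper's R/SAS example; the tuning value `lam=0.3` is an arbitrary illustration):

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator, the building block of the lasso."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Minimize (1/2n)||y - Xb||^2 + lam*||b||_1 by cyclic coordinate descent.
    Assumes the columns of X are standardized (mean 0, unit variance)."""
    n, p = X.shape
    b = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ b + X[:, j] * b[j]          # partial residual excluding j
            b[j] = soft_threshold(X[:, j] @ r / n, lam)
    return b

rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.standard_normal((n, p))
X = (X - X.mean(0)) / X.std(0)                       # standardize columns
beta_true = np.array([3.0, -2.0] + [0.0] * 8)        # only 2 real predictors
y = X @ beta_true + rng.standard_normal(n)

b_ols = np.linalg.lstsq(X, y, rcond=None)[0]         # dense: no exact zeros
b_lasso = lasso_cd(X, y, lam=0.3)                    # sparse: noise terms zeroed
```

The OLS fit keeps all ten coefficients nonzero, while the lasso discards most of the pure-noise predictors, which is the overfitting/parsimony trade-off the abstract describes.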

  2. Data from: Sample-wise Combined Missing Effect Model with Penalization

    • tandf.figshare.com
    bin
    Updated Feb 14, 2024
    Cite
    Jialu Li; Guan Yu; Qizhai Li; Yufeng Liu (2024). Sample-wise Combined Missing Effect Model with Penalization [Dataset]. http://doi.org/10.6084/m9.figshare.19651419.v1
    Dataset updated
    Feb 14, 2024
    Dataset provided by
    Taylor & Francis
    Authors
    Jialu Li; Guan Yu; Qizhai Li; Yufeng Liu
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Modern high-dimensional statistical inference often faces the problem of missing data. In recent decades, many studies have focused on this topic and provided strategies including complete-sample analysis and imputation procedures. However, complete-sample analysis discards the information in incomplete samples, while imputation procedures accumulate errors from each single imputation. In this paper, we propose a new method, the Sample-wise COmbined missing effect Model with penalization (SCOM), to deal with missing data occurring in predictors. Instead of imputing the predictors, SCOM estimates the combined effect caused by all missing data for each incomplete sample. SCOM makes full use of all available data and is robust to various missing mechanisms. Theoretical studies show the oracle inequality for the proposed estimator, and the consistency of variable selection and combined missing effect selection. Simulation studies and an application to the Residential Building Data also illustrate the effectiveness of the proposed SCOM.

  3. Table_1_Total muscle-to-fat ratio influences urinary incontinence in United States adult women: a population-based study.docx

    • frontiersin.figshare.com
    docx
    Updated Mar 28, 2024
    Cite
    Dongmei Hong; Hui Zhang; Yong Yu; Huijie Qian; Xiya Yu; Lize Xiong (2024). Table_1_Total muscle-to-fat ratio influences urinary incontinence in United States adult women: a population-based study.docx [Dataset]. http://doi.org/10.3389/fendo.2024.1309082.s003
    Dataset updated
    Mar 28, 2024
    Dataset provided by
    Frontiers
    Authors
    Dongmei Hong; Hui Zhang; Yong Yu; Huijie Qian; Xiya Yu; Lize Xiong
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Purpose: This study aims to investigate the relationship between the total muscle-to-fat ratio (tMFR) and female urinary incontinence (UI), determine whether tMFR can serve as a useful index for predicting UI, and identify factors that may influence this relationship.

    Methods: We retrospectively analyzed data from 4391 adult women participating in the National Health and Nutrition Examination Survey (NHANES) conducted between 2011 and 2018. The correlation between tMFR and UI was examined using a dose-response curve generated through a restricted cubic spline (RCS) function, LASSO, and multivariate logistic regression. Furthermore, predictive models were constructed incorporating factors such as age, race, hypertension, diabetes, cotinine levels, and tMFR. The performance of these models was evaluated on training and test datasets using calibration curves, receiver operating characteristic curves, and clinical decision curves. Mediation analysis was also performed to explore potential pathways between tMFR and female UI.

    Results: In a sample of 4391 adult women, 1073 (24.4%) self-reported experiencing UI, while 3318 (75.6%) reported not having UI. LASSO and multivariate logistic regression analyses found that tMFR was negatively associated with UI (OR = 0.599, 95% CI: 0.497-0.719, P < 0.001). The restricted cubic spline chart indicated a decreasing risk of UI as tMFR increased. The logistic-regression-based model demonstrated a certain level of accuracy (area under the curve (AUC) = 0.663 in the training dataset; AUC = 0.662 in the test dataset) and clinical applicability. Mediation analysis suggested that the influence of tMFR on the occurrence of UI might operate partly through lymphocyte count (P = 0.040).

    Conclusion: A high tMFR serves as a protective factor against UI in women. Furthermore, lymphocyte count might be involved in the relationship between tMFR and female UI.

  4. Data from: A Fast Solution to the Lasso Problem with Equality Constraints

    • tandf.figshare.com
    zip
    Updated Feb 6, 2024
    Cite
    Lam Tran; Gen Li; Lan Luo; Hui Jiang (2024). A Fast Solution to the Lasso Problem with Equality Constraints [Dataset]. http://doi.org/10.6084/m9.figshare.24496603.v2
    Dataset updated
    Feb 6, 2024
    Dataset provided by
    Taylor & Francis
    Authors
    Lam Tran; Gen Li; Lan Luo; Hui Jiang
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The equality-constrained lasso problem augments the standard lasso by imposing additional structure on the regression coefficients. Despite the broad utility of the equality-constrained lasso, existing algorithms are typically computationally inefficient and only applicable to linear and logistic models. In this article, we devise a fast solution to the equality-constrained lasso problem with a two-stage algorithm: first obtaining candidate covariate subsets of increasing size from unconstrained lasso problems and then leveraging an efficient combined alternating direction method of multipliers/Newton-Raphson algorithm. Our proposed algorithm leads to substantial speedups in getting the solution path of the constrained lasso and can be easily adapted to generalized linear models and Cox proportional hazards models. We conduct extensive simulation studies to demonstrate the computational advantage of the proposed method over existing solvers. To further show the unique utility of our method, we consider two real-world data examples: a microbiome regression analysis and a myeloma survival analysis; neither example could be solved by naively fitting the constrained lasso problem on the full predictor set. Supplementary materials for this article are available online.

  5. Data from: Genetic assignment of individuals to source populations using network estimation tools

    • zenodo.org
    • data.niaid.nih.gov
    • +1more
    zip
    Updated Jun 2, 2022
    Cite
    Markku Kuismin; Dilan Saatoglu; Alina Niskanen; Henrik Jensen; Mikko Sillanpää (2022). Genetic assignment of individuals to source populations using network estimation tools [Dataset]. http://doi.org/10.5061/dryad.gqnk98sh8
    Dataset updated
    Jun 2, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Markku Kuismin; Dilan Saatoglu; Alina Niskanen; Henrik Jensen; Mikko Sillanpää
    License

    CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Dispersal, the movement of individuals between populations, is crucial in many ecological and genetic processes. However, direct identification of dispersing individuals is difficult or impossible in natural populations. By using genetic assignment methods, individuals with unknown genetic origin can be assigned to source populations. This knowledge is necessary in studying many key questions in ecology, evolution and conservation.

    We introduce a network-based tool, BONE (Baseline Oriented Network Estimation), for genetic population assignment, which borrows concepts from undirected graph inference. In particular, we use sparse multinomial Least Absolute Shrinkage and Selection Operator (LASSO) regression to estimate the probability of origin for all mixture individuals and their mixture proportions without tedious selection of the LASSO tuning parameter. We compare BONE with three genetic assignment methods implemented in the R packages radmixture, assignPOP and RUBIAS.

    Probability of the origin and mixture proportion estimates of both simulated and real data (an insular house sparrow metapopulation and Chinook salmon populations) given by BONE are competitive or superior compared to other assignment methods. Our examples illustrate how the network estimation method adapts to population assignment, combining the efficiency and attractive properties of sparse network representation and model selection properties of the L1 regularization. As far as we know, this is the first approach showing how one can use network tools for genetic identification of individuals' source populations.

    BONE is aimed at any researcher performing genetic assignment and trying to infer the genetic population structure. Compared to other methods, our approach also identifies outlying mixture individuals that could originate outside of the baseline populations. BONE is a freely available R package under the GPL license and can be downloaded at GitHub. In addition to the R package, a tutorial for BONE is available at https://github.com/markkukuismin/BONE/.

  6. Data from: Prediction model of in-hospital mortality in intensive care unit patients with heart failure: machine learning-based, retrospective analysis of the MIMIC-III database

    • search.dataone.org
    • zenodo.org
    • +1more
    Updated May 4, 2025
    Cite
    Jingmin Zhou; Fuhai Li; Yu Song; Mingqiang Fu; Xueting Han; Junbo Ge (2025). Prediction model of in-hospital mortality in intensive care unit patients with heart failure: machine learning-based, retrospective analysis of the MIMIC-III database [Dataset]. http://doi.org/10.5061/dryad.0p2ngf1zd
    Dataset updated
    May 4, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Jingmin Zhou; Fuhai Li; Yu Song; Mingqiang Fu; Xueting Han; Junbo Ge
    Time period covered
    Jun 25, 2021
    Description

    Objective: The predictors of in-hospital mortality for intensive care unit (ICU)-admitted heart failure (HF) patients remain poorly characterized. We aimed to develop and validate a prediction model for all-cause in-hospital mortality among ICU-admitted HF patients.

    Design: A retrospective cohort study.

    Setting and Participants: Data were extracted from the MIMIC-III database. Data on 1,177 heart failure patients were analysed.

    Methods: Patients meeting the inclusion criteria were identified from the MIMIC-III database and randomly divided into derivation and validation groups. Independent risk factors for in-hospital mortality were screened using XGBoost and LASSO regression models in the derivation sample. Multivariable logistic regression analysis was used to build prediction models. Discrimination, calibration, and clinical usefulness of the predicting model were assessed using the C-index, calibration plot, and decision curve analysis. After pairwise comparison, the best performing model ...

  7. Simulation results for the location level of contamination in the model (1 − δ)N(0, 1) + δN(−10, 1) in a high-dimensional data set.

    • plos.figshare.com
    xls
    Updated Jun 10, 2023
    Cite
    Abdul Wahid; Dost Muhammad Khan; Ijaz Hussain (2023). Simulation results for the location level of contamination in the model(1 − δ)N(0, 1) + δN(−10, 1) in high-dimensional data set. [Dataset]. http://doi.org/10.1371/journal.pone.0183518.t003
    Dataset updated
    Jun 10, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Abdul Wahid; Dost Muhammad Khan; Ijaz Hussain
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Simulation results for the location level of contamination in the model (1 − δ)N(0, 1) + δN(−10, 1) in a high-dimensional data set.
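
The contamination scheme in the title draws each observation from N(0, 1) with probability 1 − δ and from the location-shifted outlier component N(−10, 1) with probability δ. It can be simulated directly (δ = 0.1 is an assumed value for illustration, not taken from the dataset):

```python
import numpy as np

rng = np.random.default_rng(42)
n, delta = 10_000, 0.10                      # contamination fraction (illustrative)
is_outlier = rng.random(n) < delta           # Bernoulli(delta) mixture indicator
clean = rng.normal(0.0, 1.0, n)              # N(0, 1) component
contam = rng.normal(-10.0, 1.0, n)           # N(-10, 1) outlier component
x = np.where(is_outlier, contam, clean)      # (1 - delta)N(0,1) + delta*N(-10,1)

# The sample mean is dragged toward -10 by the contamination,
# while the median stays near the center of the clean component.
mean_x, median_x = x.mean(), np.median(x)
```

This is exactly the kind of data-generating process used to stress-test robust estimators: non-robust summaries (the mean) break down, robust ones (the median) do not.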

  8. Lasso.

    • figshare.com
    xls
    Updated Jun 4, 2023
    Cite
    Elias Chaibub Neto; J. Christopher Bare; Adam A. Margolin (2023). Lasso. [Dataset]. http://doi.org/10.1371/journal.pone.0107957.t002
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Elias Chaibub Neto; J. Christopher Bare; Adam A. Margolin
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Permutation tests for equality of the group distributions using distance components analysis (lines 2 to 6), and permutation F-tests for the presence of 2-by-2 interactions (lines 7 to 16), for the Lasso. Results based on 999 permutations.
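
For readers unfamiliar with the permutation logic behind these tests, a generic two-sample permutation test with 999 permutations looks like the sketch below (a difference in means on synthetic data, not the paper's distance-components or F statistics):

```python
import numpy as np

def perm_test(a, b, n_perm=999, seed=0):
    """Two-sample permutation test for a difference in means."""
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([a, b])
    observed = abs(a.mean() - b.mean())
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)                   # random relabelling of the groups
        stat = abs(pooled[:a.size].mean() - pooled[a.size:].mean())
        if stat >= observed:
            exceed += 1
    # +1 in numerator and denominator so the observed labelling
    # counts as one of the permutations.
    return (exceed + 1) / (n_perm + 1)

rng = np.random.default_rng(1)
p_null = perm_test(rng.normal(0, 1, 50), rng.normal(0, 1, 50))   # no group effect
p_shift = perm_test(rng.normal(0, 1, 50), rng.normal(1, 1, 50))  # mean shifted by 1
```

The same machinery applies to any statistic: replace the difference in means with an F statistic or a distance-components statistic and the p-value is still the fraction of relabellings that reach the observed value.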

  9. Data from: Robust Lasso Regression Using Tukey's Biweight Criterion

    • tandf.figshare.com
    txt
    Updated Jun 1, 2023
    Cite
    Le Chang; Steven Roberts; Alan Welsh (2023). Robust Lasso Regression Using Tukey's Biweight Criterion [Dataset]. http://doi.org/10.6084/m9.figshare.4758391.v1
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Taylor & Francis
    Authors
    Le Chang; Steven Roberts; Alan Welsh
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The adaptive lasso is a method for performing simultaneous parameter estimation and variable selection. The adaptive weights used in its penalty term mean that the adaptive lasso achieves the oracle property. In this work, we propose an extension of the adaptive lasso named the Tukey-lasso. By using Tukey's biweight criterion, instead of squared loss, the Tukey-lasso is resistant to outliers in both the response and covariates. Importantly, we demonstrate that the Tukey-lasso also enjoys the oracle property. A fast accelerated proximal gradient (APG) algorithm is proposed and implemented for computing the Tukey-lasso. Our extensive simulations show that the Tukey-lasso, implemented with the APG algorithm, achieves very reliable results, including for high-dimensional data where p > n. In the presence of outliers, the Tukey-lasso is shown to offer substantial improvements in performance compared to the adaptive lasso and other robust implementations of the lasso. Real-data examples further demonstrate the utility of the Tukey-lasso. Supplementary materials for this article are available online.
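
Tukey's biweight criterion replaces squared loss with a bounded loss whose influence function redescends to zero, so gross outliers stop contributing entirely. A direct transcription of the standard formulas (the constant c = 4.685 is the conventional 95%-efficiency tuning value, an assumption here rather than a value quoted by this dataset):

```python
import numpy as np

def tukey_biweight_loss(r, c=4.685):
    """Tukey's biweight loss: quadratic-like near 0, constant for |r| > c,
    so large residuals add a fixed cost instead of a growing one."""
    r = np.asarray(r, dtype=float)
    inside = np.abs(r) <= c
    loss = np.full_like(r, c**2 / 6.0)                       # flat part, |r| > c
    loss[inside] = (c**2 / 6.0) * (1.0 - (1.0 - (r[inside] / c) ** 2) ** 3)
    return loss

def tukey_psi(r, c=4.685):
    """Influence function (derivative of the loss): redescends to 0 for |r| > c."""
    r = np.asarray(r, dtype=float)
    return np.where(np.abs(r) <= c, r * (1.0 - (r / c) ** 2) ** 2, 0.0)

# A residual of 100 is a gross outlier: its influence is exactly zero,
# whereas under squared loss it would dominate the gradient.
psi_vals = tukey_psi(np.array([0.5, 100.0]))
```

Swapping this loss into the adaptive-lasso objective is what makes the Tukey-lasso resistant to outliers while the L1 penalty still performs selection.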

  10. dataverse_files.zip from A data-driven approach shows that individuals' characteristics are more important than their networks in predicting fertility preferences

    • rs.figshare.com
    zip
    Updated Feb 12, 2024
    Cite
    Gert Stulp; Lars Top; Xiao Xu; Elizaveta Sivak (2024). dataverse_files.zip from A data-driven approach shows that individuals' characteristics are more important than their networks in predicting fertility preferences [Dataset]. http://doi.org/10.6084/m9.figshare.24792555.v1
    Dataset updated
    Feb 12, 2024
    Dataset provided by
    The Royal Society
    Authors
    Gert Stulp; Lars Top; Xiao Xu; Elizaveta Sivak
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    People's networks are considered key in explaining fertility outcomes: whether people want and have children. Existing research on social influences on fertility is limited because data often come from small networks or highly selective samples, only a few network variables are considered, and the strength of network effects is not properly assessed. We use data from a representative sample of Dutch women reporting on over 18 000 relationships. A data-driven approach including many network characteristics accounted for 0 to 40% of the out-of-sample variation in different outcomes related to fertility preferences. Individual characteristics were more important for all outcomes than network variables. Network composition was also important, particularly the presence of network members who desire children or who choose to be childfree. Structural network characteristics, which feature prominently in social influence theories and are based on the relations between people in the networks, hardly mattered. We discuss to what extent our results provide support for different mechanisms of social influence, and the advantages and disadvantages of our data-driven approach in comparison to traditional approaches.

  11. Data from: Benchmarking Machine Learning Models for Polymer Informatics: An Example of Glass Transition Temperature

    • acs.figshare.com
    xlsx
    Updated Jun 4, 2023
    Cite
    Lei Tao; Vikas Varshney; Ying Li (2023). Benchmarking Machine Learning Models for Polymer Informatics: An Example of Glass Transition Temperature [Dataset]. http://doi.org/10.1021/acs.jcim.1c01031.s002
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    ACS Publications
    Authors
    Lei Tao; Vikas Varshney; Ying Li
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    In the field of polymer informatics, utilizing machine learning (ML) techniques to evaluate the glass transition temperature Tg and other properties of polymers has attracted extensive attention. This data-centric approach is much more efficient and practical than laborious experimental measurements when confronting a daunting number of polymer structures. Various ML models have been demonstrated to perform well for Tg prediction. Nevertheless, they are trained on different data sets, use different structure representations, and are based on different feature engineering methods. Thus, the critical question is how to select a proper ML model that handles Tg prediction with good generalization ability. To provide a fair comparison of different ML techniques and examine the key factors that affect model performance, we carry out a systematic benchmark study by compiling 79 different ML models and training them on a large and diverse data set. The three major components in setting up an ML model are the structure representation, the feature representation, and the ML algorithm. In terms of polymer structure representation, we consider the polymer monomer, repeat unit, and oligomer with longer chain structure. Based on these, feature representations are calculated, including Morgan fingerprinting with or without substructure frequency, RDKit descriptors, molecular embedding, molecular graph, etc. Afterward, the obtained feature input is trained using different ML algorithms, such as deep neural networks, convolutional neural networks, random forest, support vector machine, LASSO regression, and Gaussian process regression. We evaluate the performance of these ML models using a holdout test set and an extra unlabeled data set from high-throughput molecular dynamics simulation. We focus especially on the models' generalization ability on the unlabeled data set, and also consider their sensitivity to the topology and molecular weight of polymers. This benchmark study provides not only a guideline for the Tg prediction task but also a useful reference for other polymer informatics tasks.

  12. Panel Data Models With Interactive Fixed Effects and Multiple Structural Breaks

    • tandf.figshare.com
    text/x-tex
    Updated May 30, 2023
    Cite
    Degui Li; Junhui Qian; Liangjun Su (2023). Panel Data Models With Interactive Fixed Effects and Multiple Structural Breaks [Dataset]. http://doi.org/10.6084/m9.figshare.1627951
    Dataset updated
    May 30, 2023
    Dataset provided by
    Taylor & Francis
    Authors
    Degui Li; Junhui Qian; Liangjun Su
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In this article, we consider estimation of common structural breaks in panel data models with unobservable interactive fixed effects. We introduce a penalized principal component (PPC) estimation procedure with an adaptive group fused LASSO to detect the multiple structural breaks in the models. Under some mild conditions, we show that with probability approaching one the proposed method can correctly determine the unknown number of breaks and consistently estimate the common break dates. Furthermore, we estimate the regression coefficients through the post-LASSO method and establish the asymptotic distribution theory for the resulting estimators. The developed methodology and theory are applicable to the case of dynamic panel data models. Simulation results demonstrate that the proposed method works well in finite samples with low false detection probability when there is no structural break and high probability of correctly estimating the break numbers when the structural breaks exist. We finally apply our method to study the environmental Kuznets curve for 74 countries over 40 years and detect two breaks in the data. Supplementary materials for this article are available online.

  13. Data from: Learning Coefficient Heterogeneity over Networks: A Distributed Spanning-Tree-Based Fused-Lasso Regression

    • tandf.figshare.com
    zip
    Updated Feb 15, 2024
    Cite
    Xin Zhang; Jia Liu; Zhengyuan Zhu (2024). Learning Coefficient Heterogeneity over Networks: A Distributed Spanning-Tree-Based Fused-Lasso Regression [Dataset]. http://doi.org/10.6084/m9.figshare.21235586.v2
    Dataset updated
    Feb 15, 2024
    Dataset provided by
    Taylor & Francis
    Authors
    Xin Zhang; Jia Liu; Zhengyuan Zhu
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Identifying the latent cluster structure based on model heterogeneity is a fundamental but challenging task that arises in many machine learning applications. In this article, we study the clustered coefficient regression problem in distributed network systems, where the data are locally collected and held by nodes. Our work aims to improve the regression estimation efficiency by aggregating the neighbors' information while also identifying the cluster membership of each node. To achieve efficient estimation and clustering, we develop a distributed spanning-tree-based fused-lasso regression (DTFLR) approach. In particular, we propose an adaptive spanning-tree-based fusion penalty for low-complexity clustered coefficient regression. We show that our proposed estimator satisfies statistical oracle properties. Additionally, to solve the problem in parallel, we design a distributed generalized alternating direction method of multipliers algorithm, which has a simple node-based implementation scheme and enjoys a linear convergence rate. Collectively, our results contribute to the theories of low-complexity clustered coefficient regression and distributed optimization over networks. Thorough numerical experiments and real-world data analysis are conducted to verify our theoretical results, showing that our approach outperforms existing works in terms of estimation accuracy, computation speed, and communication costs. Supplementary materials for this article are available online.

  14. Ridge-regression vs lasso.

    • figshare.com
    xls
    Updated Jun 5, 2023
    Cite
    Elias Chaibub Neto; J. Christopher Bare; Adam A. Margolin (2023). Ridge-regression vs lasso. [Dataset]. http://doi.org/10.1371/journal.pone.0107957.t004
    Dataset updated
    Jun 5, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Elias Chaibub Neto; J. Christopher Bare; Adam A. Margolin
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Permutation tests for equality of the group distributions using distance components analysis (lines 2 to 6), and permutation F-tests for the presence of 2-by-2 interactions (lines 7 to 16), in the comparison of ridge-regression vs lasso. Results based on 999 permutations.
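
The qualitative difference being compared, ridge's proportional shrinkage versus the lasso's exact zeros, is easiest to see with an orthonormal design, where both estimators have closed forms acting coordinate-wise on the OLS estimate (a textbook identity under the usual penalty-scaling conventions, not this paper's setup):

```python
import numpy as np

# With orthonormal columns (X^T X = I), both penalized estimators act
# coordinate-wise on the OLS estimate z = X^T y.
z = np.array([3.0, 1.5, 0.2, -0.1])   # hypothetical OLS coefficients
lam = 0.5                              # illustrative penalty level

ridge = z / (1.0 + lam)                                   # proportional shrinkage
lasso = np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)     # soft-thresholding
```

Every ridge coefficient is shrunk but stays nonzero, while the lasso sets the two small coefficients exactly to zero; that is why the lasso selects variables and ridge does not.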

  15. Additional file 9 of Systematic evaluation of supervised machine learning for sample origin prediction using metagenomic sequencing data

    • figshare.com
    xlsx
    Updated Jun 4, 2023
    Cite
    Julie Chih-yu Chen; Andrea D. Tyler (2023). Additional file 9 of Systematic evaluation of supervised machine learning for sample origin prediction using metagenomic sequencing data [Dataset]. http://doi.org/10.6084/m9.figshare.13364282.v1
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Julie Chih-yu Chen; Andrea D. Tyler
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Additional file 9: Table S3. Prediction performance of all reported models on mystery data.

  16. Data_Sheet_1_Determining Predictors of Weight Loss in a Behavioral Intervention: A Case Study in the Use of Lasso Regression.docx

    • frontiersin.figshare.com
    docx
    Updated Jun 6, 2023
    Cite
    Carly Lupton-Smith; Elizabeth A. Stuart; Emma E. McGinty; Arlene T. Dalcin; Gerald J. Jerome; Nae-Yuh Wang; Gail L. Daumit (2023). Data_Sheet_1_Determining Predictors of Weight Loss in a Behavioral Intervention: A Case Study in the Use of Lasso Regression.docx [Dataset]. http://doi.org/10.3389/fpsyt.2021.707707.s001
    Dataset updated
    Jun 6, 2023
    Dataset provided by
    Frontiers
    Authors
    Carly Lupton-Smith; Elizabeth A. Stuart; Emma E. McGinty; Arlene T. Dalcin; Gerald J. Jerome; Nae-Yuh Wang; Gail L. Daumit
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Objective: This study investigates predictors of weight loss among individuals with serious mental illness participating in an 18-month behavioral weight loss intervention, using Lasso regression to select the most powerful predictors.

    Methods: Data were analyzed from the intervention group of the ACHIEVE trial, an 18-month behavioral weight loss intervention in adults with serious mental illness. Lasso regression was employed to identify predictors of at least five-pound weight loss across the intervention time span. Once predictors were identified, classification trees were created to show examples of how to classify participants by likely outcome based on characteristics at baseline and during the intervention.

    Results: The analyzed sample contained 137 participants. Seventy-one (51.8%) individuals had a net weight loss of at least five pounds from baseline to 18 months. The Lasso regression selected weight loss from baseline to 6 months as a primary predictor of at least five-pound weight loss at 18 months, with a standardized coefficient of 0.51 (95% CI: −0.37, 1.40). Three other variables were also selected in the regression but added minimal predictive ability.

    Conclusions: The analyses in this paper demonstrate the importance of tracking weight loss incrementally during an intervention as an indicator of overall weight loss, as well as the challenges in predicting long-term weight loss with other variables commonly available in clinical trials. The methods used in this paper also exemplify how to effectively analyze a clinical trial dataset containing many variables and identify factors related to desired outcomes.

  17. Additional file 7 of Systematic evaluation of supervised machine learning for sample origin prediction using metagenomic sequencing data

    • springernature.figshare.com
    xlsx
    Updated Jun 4, 2023
    Cite
    Julie Chih-yu Chen; Andrea D. Tyler (2023). Additional file 7 of Systematic evaluation of supervised machine learning for sample origin prediction using metagenomic sequencing data [Dataset]. http://doi.org/10.6084/m9.figshare.13364276.v1
    Explore at:
    xlsxAvailable download formats
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Julie Chih-yu Chen; Andrea D. Tyler
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Additional file 7: Table S1. Sequencing data information.

  18. Small Tuning Parameter Selection for the Debiased Lasso

    • tandf.figshare.com
    zip
    Updated Jan 16, 2025
    Akira Shinkyu; Naoya Sueishi (2025). Small Tuning Parameter Selection for the Debiased Lasso [Dataset]. http://doi.org/10.6084/m9.figshare.27992813.v1
    Explore at:
    zipAvailable download formats
    Dataset updated
    Jan 16, 2025
    Dataset provided by
    Taylor & Francis
    Authors
    Akira Shinkyu; Naoya Sueishi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In this study, we investigate the bias and variance properties of the debiased Lasso in linear regression when the tuning parameter of the node-wise Lasso is selected to be smaller than in previous studies. We consider the case where the number of covariates p is bounded by a constant multiple of the sample size n. First, we show that the bias of the debiased Lasso can be reduced, without inflating the asymptotic variance, by setting the order of the tuning parameter to 1/n. This implies that the debiased Lasso has asymptotic normality provided that the number of nonzero coefficients s0 satisfies s0 = o(n/log p), whereas previous studies require s0 = o(√n/log p) if no sparsity assumption is imposed on the inverse of the second moment matrix of covariates. Second, we propose a data-driven tuning parameter selection procedure for the node-wise Lasso that is consistent with our theoretical results. Simulation studies show that our procedure yields confidence intervals with good coverage properties in various settings. We also present a real economic data example to demonstrate the efficacy of our selection procedure.
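For reference, the debiased (desparsified) Lasso estimator studied in this line of work takes the standard form (conventional notation, not quoted verbatim from this abstract):

```latex
\hat{b} \;=\; \hat{\beta}_{\text{lasso}} \;+\; \frac{1}{n}\,\hat{\Theta}\, X^{\top}\bigl(y - X\hat{\beta}_{\text{lasso}}\bigr)
```

where \(\hat{\Theta}\) is a relaxed inverse of \(\hat{\Sigma} = X^{\top}X/n\) constructed by the node-wise Lasso; the tuning parameter discussed in the abstract governs how \(\hat{\Theta}\) is estimated, and hence the bias of \(\hat{b}\).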

  19. Simulation Studies as Designed Experiments: The Comparison of Penalized...

    • figshare.com
    ai
    Updated May 31, 2023
    Elias Chaibub Neto; J. Christopher Bare; Adam A. Margolin (2023). Simulation Studies as Designed Experiments: The Comparison of Penalized Regression Models in the “Large p, Small n” Setting [Dataset]. http://doi.org/10.1371/journal.pone.0107957
    Explore at:
    aiAvailable download formats
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Elias Chaibub Neto; J. Christopher Bare; Adam A. Margolin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    New algorithms are continuously proposed in computational biology. Performance evaluation of novel methods is important in practice. Nonetheless, the field experiences a lack of rigorous methodology aimed at systematically and objectively evaluating competing approaches. Simulation studies are frequently used to show that a particular method outperforms another. Oftentimes, however, simulation studies are not well designed, and it is hard to characterize the particular conditions under which different methods perform better. In this paper we propose the adoption of well-established techniques in the design of computer and physical experiments for developing effective simulation studies. By following best practices in the planning of experiments, we are better able to understand the strengths and weaknesses of competing algorithms, leading to more informed decisions about which method to use for a particular task. We illustrate the application of our proposed simulation framework with a detailed comparison of the ridge-regression, lasso and elastic-net algorithms in a large-scale study investigating the effects on predictive performance of sample size, number of features, true model sparsity, signal-to-noise ratio, and feature correlation, in situations where the number of covariates is usually much larger than sample size. Analysis of data sets containing tens of thousands of features but only a few hundred samples is nowadays routine in computational biology, where “omics” features such as gene expression, copy number variation and sequence data are frequently used in the predictive modeling of complex phenotypes such as anticancer drug response. The penalized regression approaches investigated in this study are popular choices in this setting, and our simulations corroborate well-established results concerning the conditions under which each one of these methods is expected to perform best, while providing several novel insights.
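The three penalized objectives compared in this study differ only in their penalty terms; in one common parameterization (e.g., the one used by the glmnet package), they are:

```latex
\hat{\beta}_{\text{ridge}} = \arg\min_{\beta}\ \|y - X\beta\|_2^2 + \lambda\|\beta\|_2^2,
\qquad
\hat{\beta}_{\text{lasso}} = \arg\min_{\beta}\ \|y - X\beta\|_2^2 + \lambda\|\beta\|_1,
\qquad
\hat{\beta}_{\text{enet}} = \arg\min_{\beta}\ \|y - X\beta\|_2^2 + \lambda\Bigl(\alpha\|\beta\|_1 + \tfrac{1-\alpha}{2}\|\beta\|_2^2\Bigr).
```

The \(\ell_1\) penalty produces exact zeros (feature selection), the \(\ell_2\) penalty shrinks correlated features together, and the elastic net interpolates between the two via \(\alpha \in [0, 1]\).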

  20. Estimated coefficients and model error results, for various regularization...

    • figshare.com
    xls
    Updated Jun 2, 2023
    Abdul Wahid; Dost Muhammad Khan; Ijaz Hussain (2023). Estimated coefficients and model error results, for various regularization procedures applied to the prostate data. [Dataset]. http://doi.org/10.1371/journal.pone.0183518.t006
    Explore at:
    xlsAvailable download formats
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Abdul Wahid; Dost Muhammad Khan; Ijaz Hussain
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dashed entries correspond to predictors whose coefficients are estimated to be exactly 0.

