100+ datasets found
  1. Dataset for: Robust versus consistent variance estimators in marginal structural Cox models

    • wiley.figshare.com
    pdf
    Updated May 30, 2023
    Cite
    Dirk Enders; Susanne Engel; Roland Linder; Iris Pigeot (2023). Dataset for: Robust versus consistent variance estimators in marginal structural Cox models [Dataset]. http://doi.org/10.6084/m9.figshare.6203456.v1
    Available download formats: pdf
    Dataset updated
    May 30, 2023
    Dataset provided by
    Wiley (https://www.wiley.com/)
    Authors
    Dirk Enders; Susanne Engel; Roland Linder; Iris Pigeot
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    In survival analyses, inverse-probability-of-treatment (IPT) and inverse-probability-of-censoring (IPC) weighted estimators of parameters in marginal structural Cox models (Cox MSMs) are often used to estimate treatment effects in the presence of time-dependent confounding and censoring. In most applications, a robust variance estimator of the IPT and IPC weighted estimator is calculated leading to conservative confidence intervals. This estimator assumes that the weights are known rather than estimated from the data. Although a consistent estimator of the asymptotic variance of the IPT and IPC weighted estimator is generally available, applications and thus information on the performance of the consistent estimator are lacking. Reasons might be a cumbersome implementation in statistical software, which is further complicated by missing details on the variance formula. In this paper, we therefore provide a detailed derivation of the variance of the asymptotic distribution of the IPT and IPC weighted estimator and explicitly state the necessary terms to calculate a consistent estimator of this variance. We compare the performance of the robust and the consistent variance estimator in an application based on routine health care data and in a simulation study. The simulation reveals no substantial differences between the two estimators in medium and large data sets with no unmeasured confounding, but the consistent variance estimator performs poorly in small samples or under unmeasured confounding, if the number of confounders is large. We thus conclude that the robust estimator is more appropriate for all practical purposes.
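
    The robust ("sandwich") variance discussed in this abstract is what most software reports when fitting a weighted Cox model. Below is a minimal sketch using the lifelines package with hypothetical column names (time, event, treated, ipw); it illustrates the robust estimator the authors recommend, not their consistent-variance derivation.

    ```python
    # Sketch: IPT/IPC-weighted Cox model with a robust (sandwich) variance.
    # Column names and values are hypothetical placeholders.
    import pandas as pd
    from lifelines import CoxPHFitter

    df = pd.DataFrame({
        "time":    [5.0, 8.2, 3.1, 9.4, 2.7, 7.7],
        "event":   [1, 0, 1, 0, 1, 1],
        "treated": [1, 1, 0, 0, 1, 0],
        "ipw":     [1.2, 0.8, 1.1, 0.9, 1.3, 0.7],  # stabilized IPT*IPC weights
    })

    cph = CoxPHFitter()
    # robust=True requests the robust variance, which treats the weights as known.
    cph.fit(df, duration_col="time", event_col="event", weights_col="ipw", robust=True)
    cph.print_summary()
    ```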

  2. Low Variance Dataset

    • kaggle.com
    zip
    Updated Jun 17, 2021
    Cite
    Ezgi Turalı (2021). Low Variance Dataset [Dataset]. https://www.kaggle.com/ezgitural/low-variance-dataset
    Available download formats: zip (422949 bytes)
    Dataset updated
    Jun 17, 2021
    Authors
    Ezgi Turalı
    Description

    Context

    I needed a low-variance dataset for my project to make a point. I could not find one here, so I obtained one elsewhere, and here it is!

  3. Script for calculate variance partition method

    • dataverse.harvard.edu
    • dataone.org
    Updated Sep 21, 2022
    Cite
    Gabriela Alves-Ferreira (2022). Script for calculate variance partition method [Dataset]. http://doi.org/10.7910/DVN/SDXKGF
    Available formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 21, 2022
    Dataset provided by
    Harvard Dataverse
    Authors
    Gabriela Alves-Ferreira
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    Script for calculating the variance partitioning method and the hierarchical partitioning method at regional and local scales.
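
    For readers unfamiliar with the approach, variance partitioning typically splits the explained variance (R²) into fractions unique to each predictor set (e.g., regional versus local) plus a shared fraction. A minimal sketch with synthetic data, not the deposited script:

    ```python
    # Sketch: partition explained variance (R^2) between two predictor sets,
    # e.g. "regional" vs. "local" variables. Synthetic data, illustrative only.
    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    n = 200
    regional = rng.normal(size=(n, 2))                    # hypothetical regional predictors
    local = rng.normal(size=(n, 2))                       # hypothetical local predictors
    y = regional @ [1.0, 0.5] + local @ [0.8, -0.3] + rng.normal(size=n)

    def r2(X, y):
        return LinearRegression().fit(X, y).score(X, y)

    r2_both = r2(np.hstack([regional, local]), y)
    r2_reg, r2_loc = r2(regional, y), r2(local, y)

    unique_regional = r2_both - r2_loc                    # explained only by the regional set
    unique_local = r2_both - r2_reg                       # explained only by the local set
    shared = r2_reg + r2_loc - r2_both                    # jointly explained fraction
    print(unique_regional, unique_local, shared, 1 - r2_both)  # last term: unexplained
    ```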

  4. Subfunctions for calculating variance.

    • datasetcatalog.nlm.nih.gov
    • plos.figshare.com
    Updated May 16, 2024
    + more versions
    Cite
    Jia, Xiaoyan; Zhang, Qinghui; Zhang, Meilin; Ding, Yang; LI, Junqiu; Jin, Yiting (2024). Subfunctions for calculating variance. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001274866
    Dataset updated
    May 16, 2024
    Authors
    Jia, Xiaoyan; Zhang, Qinghui; Zhang, Meilin; Ding, Yang; LI, Junqiu; Jin, Yiting
    Description

    The analysis of critical states during fracture of wood materials is crucial for wood building safety monitoring, wood processing, etc. In this paper, beech and camphor pine are selected as the research objects, and the acoustic emission signals during the fracture process of the specimens are analyzed by three-point bending load experiments. On the one hand, the critical state interval of a complex acoustic emission signal system is determined by selecting characteristic parameters in the natural time domain. On the other hand, an improved method of b_value analysis in the natural time domain is proposed based on the characteristics of the acoustic emission signal. The K-value, which represents the beginning of the critical state of a complex acoustic emission signal system, is further defined by the improved method of b_value in the natural time domain. For beech, the analysis of critical state time based on characteristic parameters can predict the “collapse” time 8.01 s in advance, while for camphor pines, 3.74 s in advance. K-value can be analyzed at least 3 s in advance of the system “crash” time for beech and 4 s in advance of the system “crash” time for camphor pine. The results show that compared with traditional time-domain acoustic emission signal analysis, natural time-domain acoustic emission signal analysis can discover more available feature information to characterize the state of the signal. Both the characteristic parameters and Natural_Time_b_value analysis in the natural time domain can effectively characterize the time when the complex acoustic emission signal system enters the critical state. Critical state analysis can provide new ideas for wood health monitoring and complex signal processing, etc.
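
    As background, conventional b-value analysis of acoustic-emission amplitudes relies on the Gutenberg-Richter relation; the maximum-likelihood (Aki) estimator below is a generic sketch with made-up magnitudes, not the improved natural-time b-value method proposed in the paper.

    ```python
    # Sketch: conventional maximum-likelihood (Aki) b-value estimate from
    # acoustic-emission magnitudes. Generic background only; not the
    # natural-time-domain b-value improvement described in the abstract.
    import numpy as np

    magnitudes = np.array([1.2, 1.5, 1.1, 2.0, 1.7, 1.3, 2.4, 1.6])  # hypothetical AE magnitudes
    m_c = 1.0                                                        # assumed completeness magnitude

    b_value = np.log10(np.e) / (magnitudes.mean() - m_c)
    print(f"b-value ~ {b_value:.2f}")
    ```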

  5. Dataset for: A noniterative sample size procedure for tests based on t distributions

    • wiley.figshare.com
    txt
    Updated Jun 1, 2023
    Cite
    Yongqiang Tang (2023). Dataset for: A noniterative sample size procedure for tests based on t distributions [Dataset]. http://doi.org/10.6084/m9.figshare.6151220.v1
    Available download formats: txt
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Wiley (https://www.wiley.com/)
    Authors
    Yongqiang Tang
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    A noniterative sample size procedure is proposed for a general hypothesis test based on the t distribution by modifying and extending Guenther’s (1981) approach for the one sample and two sample t tests. The generalized procedure is employed to determine the sample size for treatment comparisons using the analysis of covariance (ANCOVA) and the mixed effects model for repeated measures (MMRM) in randomized clinical trials. The sample size is calculated by adding a few simple correction terms to the sample size from the normal approximation to account for the nonnormality of the t statistic and lower order variance terms, which are functions of the covariates in the model. But it does not require specifying the covariate distribution. The noniterative procedure is suitable for superiority tests, noninferiority tests and a special case of the tests for equivalence or bioequivalence, and generally yields the exact or nearly exact sample size estimate after rounding to an integer. The method for calculating the exact power of the two sample t test with unequal variance in superiority trials is extended to equivalence trials. We also derive accurate power formulae for ANCOVA and MMRM, and the formula for ANCOVA is exact for normally distributed covariates. Numerical examples demonstrate the accuracy of the proposed methods particularly in small samples.
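
    The exact power calculations mentioned above rest on the noncentral t distribution, which is also what a noniterative sample-size search would invert. A minimal sketch of exact power for a two-sided, equal-variance two-sample t test with assumed parameter values:

    ```python
    # Sketch: exact power of a two-sided, equal-variance two-sample t test
    # via the noncentral t distribution. Parameter values are assumptions.
    from scipy import stats

    n1, n2 = 30, 30
    delta, sigma, alpha = 0.75, 1.0, 0.05

    df = n1 + n2 - 2
    ncp = delta / (sigma * (1.0 / n1 + 1.0 / n2) ** 0.5)   # noncentrality parameter
    t_crit = stats.t.ppf(1 - alpha / 2, df)

    power = (1 - stats.nct.cdf(t_crit, df, ncp)) + stats.nct.cdf(-t_crit, df, ncp)
    print(f"power = {power:.3f}")
    ```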

  6. _Attention what is it like [Dataset]

    • data.niaid.nih.gov
    Updated Mar 7, 2021
    Cite
    Dinis Pereira, Vitor Manuel (2021). _Attention what is it like [Dataset] [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_780412
    Dataset updated
    Mar 7, 2021
    Dataset provided by
    LanCog Research Group, Universidade de Lisboa
    Authors
    Dinis Pereira, Vitor Manuel
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    R Core Team. (2016). R: A language and environment for statistical computing. R Foundation for Statistical Computing.

    Supplement to Occipital and left temporal instantaneous amplitude and frequency oscillations correlated with access and phenomenal consciousness (https://philpapers.org/rec/PEROAL-2).

    Occipital and left temporal instantaneous amplitude and frequency oscillations correlated with access and phenomenal consciousness move from the features of the ERP characterized in Occipital and Left Temporal EEG Correlates of Phenomenal Consciousness (Pereira, 2015, https://doi.org/10.1016/b978-0-12-802508-6.00018-1, https://philpapers.org/rec/PEROAL) towards the instantaneous amplitude and frequency of event-related changes correlated with a contrast in access and in phenomenology.

    Occipital and left temporal instantaneous amplitude and frequency oscillations correlated with access and phenomenal consciousness proceeds as follows.

    The first section applies empirical mode decomposition (EMD) with post-processing, Ensemble Empirical Mode Decomposition (postEEMD), and the Hilbert-Huang transform (HHT) (Xie, G., Guo, Y., Tong, S., and Ma, L., 2014. Calculate excess mortality during heatwaves using Hilbert-Huang transform algorithm. BMC Medical Research Methodology, 14, 35).

    The second section calculates the variance inflation factor (VIF).

    The third section applies partial least squares regression (PLSR), using the minimal root mean squared error of prediction (RMSEP).

    The last section applies partial least squares regression (PLSR) with the significance multivariate correlation (sMC) statistic.
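
    As an aside on the second step, the variance inflation factor is simply 1/(1 - R²) from regressing each predictor on the others. A minimal sketch with synthetic predictors (the study's actual predictors are EEG-derived features):

    ```python
    # Sketch: variance inflation factors (VIF) for a small design matrix.
    # Predictors are synthetic; x2 is deliberately collinear with x1.
    import numpy as np
    import pandas as pd
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    rng = np.random.default_rng(1)
    x1 = rng.normal(size=100)
    x2 = 0.9 * x1 + 0.1 * rng.normal(size=100)
    x3 = rng.normal(size=100)
    X = pd.DataFrame({"const": 1.0, "x1": x1, "x2": x2, "x3": x3})

    for i in range(1, X.shape[1]):                      # skip the constant column
        print(X.columns[i], variance_inflation_factor(X.values, i))
    ```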

  7. Dataset for: Power analysis for multivariable Cox regression models

    • wiley.figshare.com
    • search.datacite.org
    txt
    Updated May 31, 2023
    Cite
    Emil Scosyrev; Ekkehard Glimm (2023). Dataset for: Power analysis for multivariable Cox regression models [Dataset]. http://doi.org/10.6084/m9.figshare.7010483.v1
    Available download formats: txt
    Dataset updated
    May 31, 2023
    Dataset provided by
    Wiley (https://www.wiley.com/)
    Authors
    Emil Scosyrev; Ekkehard Glimm
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    In power analysis for multivariable Cox regression models, variance of the estimated log-hazard ratio for the treatment effect is usually approximated by inverting the expected null information matrix. Because in many typical power analysis settings assumed true values of the hazard ratios are not necessarily close to unity, the accuracy of this approximation is not theoretically guaranteed. To address this problem, the null variance expression in power calculations can be replaced with one of alternative expressions derived under the assumed true value of the hazard ratio for the treatment effect. This approach is explored analytically and by simulations in the present paper. We consider several alternative variance expressions, and compare their performance to that of the traditional null variance expression. Theoretical analysis and simulations demonstrate that while the null variance expression performs well in many non-null settings, it can also be very inaccurate, substantially underestimating or overestimating the true variance in a wide range of realistic scenarios, particularly those where the numbers of treated and control subjects are very different and the true hazard ratio is not close to one. The alternative variance expressions have much better theoretical properties, confirmed in simulations. The most accurate of these expressions has a relatively simple form - it is the sum of inverse expected event counts under treatment and under control scaled up by a variance inflation factor.
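
    To make the final point concrete, the recommended variance expression translates directly into a power formula. A sketch using assumed event counts, hazard ratio, and variance inflation factor (the numbers are illustrative, not from the paper):

    ```python
    # Sketch: power for a Cox treatment effect using the variance expression
    # var(log HR) ~ VIF * (1/E1 + 1/E0), with E1, E0 the expected event counts
    # under treatment and control. All numbers below are assumptions.
    from math import log
    from scipy.stats import norm

    hr = 0.70           # assumed true hazard ratio
    E1, E0 = 120, 240   # expected events under treatment and control
    vif = 1.15          # assumed variance inflation from covariate adjustment
    alpha = 0.05

    var_loghr = vif * (1.0 / E1 + 1.0 / E0)
    z = norm.ppf(1 - alpha / 2)
    power = norm.cdf(abs(log(hr)) / var_loghr ** 0.5 - z)
    print(f"power = {power:.3f}")
    ```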

  8. Stock Portfolio Optimization Dataset for Efficient

    • kaggle.com
    zip
    Updated Aug 30, 2023
    Cite
    Emmanuel Ochiba (2023). Stock Portfolio Optimization Dataset for Efficient [Dataset]. https://www.kaggle.com/datasets/chibss/stock-dataset-for-portfolio-optimization
    Available download formats: zip (8610 bytes)
    Dataset updated
    Aug 30, 2023
    Authors
    Emmanuel Ochiba
    Description

    This dataset has been meticulously curated to assist investment analysts, like you, in performing mean-variance optimization for constructing efficient portfolios. The dataset contains historical financial data for a selection of assets, enabling the calculation of risk and return characteristics necessary for portfolio optimization. The goal is to help you determine the most effective allocation of assets to achieve optimal risk-return trade-offs.
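
    A minimal sketch of the kind of mean-variance calculation this dataset supports: estimate expected returns and the covariance matrix from a return history (synthetic here) and compute the global minimum-variance weights in closed form.

    ```python
    # Sketch: global minimum-variance portfolio weights from a return history.
    # Returns below are synthetic stand-ins for the dataset's assets.
    import numpy as np

    rng = np.random.default_rng(7)
    returns = rng.normal(0.0005, 0.01, size=(250, 4))    # 250 days x 4 assets

    mu = returns.mean(axis=0)                            # expected daily returns
    cov = np.cov(returns, rowvar=False)                  # covariance matrix (risk)

    ones = np.ones(len(mu))
    inv = np.linalg.inv(cov)
    w = inv @ ones / (ones @ inv @ ones)                 # closed-form minimum-variance weights

    print("weights:", np.round(w, 3))
    print("expected daily return:", mu @ w)
    print("portfolio variance:", w @ cov @ w)
    ```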

  9. Data from: Estimation across Data Sets: Two-Stage Auxiliary Instrumental Variables Estimation (2SAIV)

    • dataverse.harvard.edu
    Updated Dec 21, 2009
    Cite
    Charles H. Franklin (2009). Estimation across Data Sets: Two-Stage Auxiliary Instrumental Variables Estimation (2SAIV) [Dataset]. http://doi.org/10.7910/DVN/HL5YUY
    Available formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 21, 2009
    Dataset provided by
    Harvard Dataverse
    Authors
    Charles H. Franklin
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    Theories demand much of data, often more than a single data collection can provide. For example, many important research questions are set in the past and must rely on data collected at that time and for other purposes. As a result, we often find that the data lack crucial variables. Another common problem arises when we wish to estimate the relationship between variables that are measured in different data sets. A variation of this occurs with a split half sample design in which one or more important variables appear on the "wrong" half. Finally, we may need panel data but have only cross sections available. In each of these cases, our ability to estimate the theoretically determined equation is limited by the data that are available. In many cases there is simply no solution, and theory must await new opportunities for testing. Under certain circumstances, however, we may still be able to estimate relationships between variables even though they are not measured on the same set of observations. This technique, which I call two-stage auxiliary instrumental variables (2SAIV), provides some new leverage on such problems and offers the opportunity to test hypotheses that were previously out of reach. This article develops the 2SAIV estimator, proves its consistency, and derives its asymptotic variance. A set of simulations illustrates the performance of the estimator in finite samples, and several applications are sketched out.
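
    Stated loosely, the idea resembles two-stage least squares except that the two stages are run on different data sets that share only the auxiliary variables. The sketch below is a generic illustration of that idea with synthetic data; it is not Franklin's 2SAIV estimator or its variance formula.

    ```python
    # Sketch (generic two-stage idea, not Franklin's exact 2SAIV): estimate the
    # effect of x on y when x and y are never observed together, using auxiliary
    # variables z measured in both data sets.
    import numpy as np

    rng = np.random.default_rng(3)
    n = 1000
    # Data set A observes (z, x); data set B observes (z, y).
    z_a = rng.normal(size=(n, 2))
    x_a = z_a @ [1.0, -0.5] + rng.normal(size=n)
    z_b = rng.normal(size=(n, 2))
    x_b_latent = z_b @ [1.0, -0.5] + rng.normal(size=n)
    y_b = 2.0 * x_b_latent + rng.normal(size=n)          # true coefficient = 2

    # Stage 1 (data set A): auxiliary regression of x on z.
    Z_a = np.column_stack([np.ones(n), z_a])
    gamma = np.linalg.lstsq(Z_a, x_a, rcond=None)[0]

    # Stage 2 (data set B): regress y on x-hat predicted from z in B.
    Z_b = np.column_stack([np.ones(n), z_b])
    x_hat = Z_b @ gamma
    X2 = np.column_stack([np.ones(n), x_hat])
    beta = np.linalg.lstsq(X2, y_b, rcond=None)[0]
    print("estimated effect of x on y:", round(beta[1], 3))   # close to 2
    ```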

  10. Multi-Laboratory Hematoxylin and Eosin Staining Variance Supervised Machine Learning Dataset

    • datasetcatalog.nlm.nih.gov
    • dataverse.harvard.edu
    • +1more
    Updated Nov 4, 2022
    + more versions
    Cite
    Ruusuvuori, Pekka; Äyrämö, Sami; Pölönen, Ilkka; Prezja, Fabi; Kuopio, Teijo (2022). Multi-Laboratory Hematoxylin and Eosin Staining Variance Supervised Machine Learning Dataset [Dataset]. http://doi.org/10.7910/DVN/5YNF3B
    Dataset updated
    Nov 4, 2022
    Authors
    Ruusuvuori, Pekka; Äyrämö, Sami; Pölönen, Ilkka; Prezja, Fabi; Kuopio, Teijo
    Description

    We provide the generated dataset used for supervised machine learning in the related article. The data are in tabular format and contain all principal components and ground-truth labels per tissue type. The tissue type codes used are: C1 for kidney, C2 for skin, and C3 for colon. 'PC' stands for principal component. For feature extraction specifications, please see the original design in the related article. Features have been extracted independently for each tissue type.
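
    A minimal sketch of how such a table of principal components plus ground-truth labels can feed a supervised model; the data here are synthetic placeholders with the column naming described above, not the deposited files.

    ```python
    # Sketch: train a classifier on principal-component features with
    # ground-truth labels. Data are synthetic; the real table would be
    # loaded from the deposited files for C1/C2/C3.
    import numpy as np
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = pd.DataFrame(rng.normal(size=(300, 10)),
                     columns=[f"PC{i+1}" for i in range(10)])
    y = rng.integers(0, 2, size=300)                 # hypothetical ground-truth labels

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                              random_state=0, stratify=y)
    clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
    print("accuracy:", accuracy_score(y_te, clf.predict(X_te)))
    ```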

  11. IMP-8 Weimer Propagation Details at 1 min Resolution - Dataset - NASA Open Data Portal

    • data.nasa.gov
    Updated Apr 1, 2025
    Cite
    nasa.gov (2025). IMP-8 Weimer Propagation Details at 1 min Resolution - Dataset - NASA Open Data Portal [Dataset]. https://data.nasa.gov/dataset/imp-8-weimer-propagation-details-at-1-min-resolution
    Dataset updated
    Apr 1, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    IMP-8 Weimer propagated solar wind data and linearly interpolated time delay, cosine angle, and goodness information of the propagated data at 1 min resolution. This data set consists of propagated solar wind data that has first been propagated to a position just outside of the nominal bow shock (about 17, 0, 0 Re) and then linearly interpolated to 1 min resolution using the interp1.m function in MATLAB. The input data for this data set is 1 min resolution processed solar wind data constructed by Dr. J.M. Weygand. The method of propagation is similar to the minimum variance technique and is outlined in Weimer et al. [2003; 2004]. The basic method is to find the minimum variance direction of the magnetic field in the plane orthogonal to the mean magnetic field direction. This minimum variance direction is then dotted with the difference between the final and original position vectors, and the result is divided by the minimum variance direction dotted with the solar wind velocity vector, which gives the propagation time. The method does not work well for shocks or for minimum variance directions tilted more than 70 degrees from the Sun-Earth line. This data set was originally constructed by Dr. J.M. Weygand for Prof. R.L. McPherron, who was the principal investigator of two National Science Foundation studies: GEM Grant ATM 02-1798 and Space Weather Grant ATM 02-08501. These data were primarily used in superposed epoch studies.

    References: Weimer, D. R. (2004), Correction to "Predicting interplanetary magnetic field (IMF) propagation delay times using the minimum variance technique," J. Geophys. Res., 109, A12104, doi:10.1029/2004JA010691. Weimer, D.R., D.M. Ober, N.C. Maynard, M.R. Collier, D.J. McComas, N.F. Ness, C.W. Smith, and J. Watermann (2003), Predicting interplanetary magnetic field (IMF) propagation delay times using the minimum variance technique, J. Geophys. Res., 108, 1026, doi:10.1029/2002JA009405.
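
    A numerical sketch of the propagation step described above: take the minimum-variance direction of the magnetic field within the plane orthogonal to the mean field, project the displacement onto it, and divide by the projected solar-wind velocity. Inputs are synthetic placeholders, not the production code used to build the data set.

    ```python
    # Sketch: propagation delay from the minimum-variance direction of B,
    # following the description above. All inputs are synthetic placeholders.
    import numpy as np

    rng = np.random.default_rng(42)
    B = rng.normal([4.0, -2.0, 1.0], [0.8, 0.5, 0.2], size=(600, 3))  # nT, IMF samples
    v_sw = np.array([-450.0, 10.0, 5.0])             # km/s, solar wind velocity
    RE = 6371.0                                      # km per Earth radius
    r_orig = np.array([230.0, 40.0, -10.0]) * RE     # km, assumed spacecraft position
    r_final = np.array([17.0, 0.0, 0.0]) * RE        # km, just outside the nominal bow shock

    # Mean-field direction and an orthonormal basis of the plane orthogonal to it.
    b_hat = B.mean(axis=0)
    b_hat /= np.linalg.norm(b_hat)
    e1 = np.cross(b_hat, [0.0, 0.0, 1.0])
    e1 /= np.linalg.norm(e1)
    e2 = np.cross(b_hat, e1)

    # Minimum-variance direction within that plane (smallest-eigenvalue eigenvector).
    proj = np.column_stack([B @ e1, B @ e2])
    eigvals, eigvecs = np.linalg.eigh(np.cov(proj, rowvar=False))
    n_min = eigvecs[0, 0] * e1 + eigvecs[1, 0] * e2

    # Propagation time = (n . delta_r) / (n . v), in seconds for km and km/s inputs.
    dt = n_min @ (r_final - r_orig) / (n_min @ v_sw)
    print(f"propagation delay ~ {dt / 60:.1f} minutes")
    ```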

  12. Origin of Variances in the Oldest-Old: Octogenarian Twins (OCTO-Twin) Wave 5

    • researchdata.se
    • demo.researchdata.se
    • +1more
    Updated Apr 3, 2023
    + more versions
    Cite
    Linda Hassing (2023). Origin of Variances in the Oldest-Old: Octogenarian Twins (OCTO-Twin) Wave 5 [Dataset]. https://researchdata.se/en/catalogue/dataset/2021-195-5
    Dataset updated
    Apr 3, 2023
    Dataset provided by
    University of Gothenburg
    Authors
    Linda Hassing
    Area covered
    Sweden
    Description

    The OCTO-Twin Study aims to investigate the etiology of individual differences among twin pairs aged 80 and older across a range of domains including health and functional capacity, cognitive functioning, psychological well-being, personality, and personal control. In the study, twin pairs were drawn from the Swedish Twin Registry. At the first wave, the twins had to be born in 1913 or earlier and both partners in the pair had to accept participation. At baseline in 1991-94, 351 twin pairs (149 monozygotic and 202 like-sex dizygotic pairs) were investigated (mean age: 83.6 years; 67% female). Two-year longitudinal follow-ups were conducted on all twins who were alive and agreed to participate. Data have been collected at five waves over a total of eight years.

    In wave 5, 43 twin pairs participated, with a total of 222 individuals. Refer to the description of wave 1 (the baseline) and the individual datasets in the NEAR portal for more details on variable groups and individual variables.

  13. Data from: Selective increases in inter-individual variability in response to environmental enrichment in female mice

    • datadryad.org
    • datasetcatalog.nlm.nih.gov
    • +3more
    zip
    Updated Oct 29, 2018
    Cite
    Julia C. Korholz; Sara Zocher; Anna N. Grzyb; Benjamin Morisse; Alexandra Poetzsch; Fanny Ehret; Christopher Schmied; Gerd Kempermann (2018). Selective increases in inter-individual variability in response to environmental enrichment in female mice [Dataset]. http://doi.org/10.5061/dryad.12cm083
    Available download formats: zip
    Dataset updated
    Oct 29, 2018
    Dataset provided by
    Dryad
    Authors
    Julia C. Korholz; Sara Zocher; Anna N. Grzyb; Benjamin Morisse; Alexandra Poetzsch; Fanny Ehret; Christopher Schmied; Gerd Kempermann
    Time period covered
    Feb 22, 2018
    Area covered
    Not applicable
    Description

    Supplementary File 1_phenotypes: this comma-delimited txt file contains the phenotypes assessed in our study for all mice under control (CTRL) or enriched conditions, with one mouse per line. All abbreviations and the phenotypes are explained in the article.
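
    A short sketch of reading such a table and comparing inter-individual variability between housing groups (file and grouping-column names are hypothetical):

    ```python
    # Sketch: read the comma-delimited phenotype table (one mouse per row) and
    # compare per-group variance. File and column names are hypothetical.
    import pandas as pd

    df = pd.read_csv("Supplementary_File1_phenotypes.txt")
    print(df.groupby("housing").var(numeric_only=True))   # variance per housing group
    ```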

  14. ACE Solar Wind Weimer Propagation Details at 1 min Resolution - Dataset - NASA Open Data Portal

    • data.nasa.gov
    Updated Aug 21, 2025
    Cite
    nasa.gov (2025). ACE Solar Wind Weimer Propagation Details at 1 min Resolution - Dataset - NASA Open Data Portal [Dataset]. https://data.nasa.gov/dataset/ace-solar-wind-weimer-propagation-details-at-1-min-resolution
    Dataset updated
    Aug 21, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    ACE Weimer propagated solar wind data and linearly interpolated time delay, cosine angle, and goodness information of the propagated data at 1 min resolution. This data set consists of propagated solar wind data that has first been propagated to a position just outside of the nominal bow shock (about 17, 0, 0 Re) and then linearly interpolated to 1 min resolution using the interp1.m function in MATLAB. The input data for this data set is 1 min resolution processed solar wind data constructed by Dr. J.M. Weygand. The method of propagation is similar to the minimum variance technique and is outlined in Weimer et al. [2003; 2004]. The basic method is to find the minimum variance direction of the magnetic field in the plane orthogonal to the mean magnetic field direction. This minimum variance direction is then dotted with the difference between the final and original position vectors, and the result is divided by the minimum variance direction dotted with the solar wind velocity vector, which gives the propagation time. The method does not work well for shocks or for minimum variance directions tilted more than 70 degrees from the Sun-Earth line. This data set was originally constructed by Dr. J.M. Weygand for Prof. R.L. McPherron, who was the principal investigator of two National Science Foundation studies: GEM Grant ATM 02-1798 and Space Weather Grant ATM 02-08501. These data were primarily used in superposed epoch studies.

    References: Weimer, D. R. (2004), Correction to "Predicting interplanetary magnetic field (IMF) propagation delay times using the minimum variance technique," J. Geophys. Res., 109, A12104, doi:10.1029/2004JA010691. Weimer, D.R., D.M. Ober, N.C. Maynard, M.R. Collier, D.J. McComas, N.F. Ness, C.W. Smith, and J. Watermann (2003), Predicting interplanetary magnetic field (IMF) propagation delay times using the minimum variance technique, J. Geophys. Res., 108, 1026, doi:10.1029/2002JA009405.

  15. Data from: Uncertainties Associated with Arithmetic Map Operations in GIS

    • scielo.figshare.com
    • figshare.com
    jpeg
    Updated Jun 2, 2023
    Cite
    JORGE K. YAMAMOTO; ANTÔNIO T. KIKUDA; GUILHERME J. RAMPAZZO; CLAUDIO B.B. LEITE (2023). Uncertainties Associated with Arithmetic Map Operations in GIS [Dataset]. http://doi.org/10.6084/m9.figshare.6991718.v1
    Available download formats: jpeg
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    SciELO (http://www.scielo.org/)
    Authors
    JORGE K. YAMAMOTO; ANTÔNIO T. KIKUDA; GUILHERME J. RAMPAZZO; CLAUDIO B.B. LEITE
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Abstract: Arithmetic map operations are very common procedures used in GIS to combine raster maps, resulting in a new and improved raster map. It is essential that this new map be accompanied by an assessment of uncertainty. This paper shows how to calculate the uncertainty of the resulting map after performing an arithmetic operation. The propagation of uncertainty depends on a reliable measurement of both the local accuracy and the local covariance. In this sense, the use of the interpolation variance is proposed because it takes into account both data configuration and data values. A Taylor series expansion is used to derive the mean and variance of the function defined by an arithmetic operation. We show exact results for the means and variances of arithmetic operations involving addition, subtraction, and multiplication, and show that it is possible to obtain an approximate mean and variance for the quotient of raster maps.
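
    For reference, the moment formulas behind such propagation are easy to evaluate once local means, variances, and the covariance are available. A sketch with assumed numbers; the sum is exact, and the product variance shown here is the first-order Taylor term (the paper derives exact expressions):

    ```python
    # Sketch: mean and variance propagation for the sum and product of two map
    # values A and B with given local means, variances, and covariance.
    # Numbers are illustrative assumptions.
    mu_a, mu_b = 10.0, 4.0
    var_a, var_b = 2.5, 1.2
    cov_ab = 0.6

    # Sum (exact): E[A+B] = mu_a + mu_b, Var[A+B] = var_a + var_b + 2*cov_ab.
    mean_sum = mu_a + mu_b
    var_sum = var_a + var_b + 2 * cov_ab

    # Product: the mean below is exact; the variance is the first-order Taylor
    # approximation (exact expressions add higher-moment terms).
    mean_prod = mu_a * mu_b + cov_ab
    var_prod = (mu_b ** 2) * var_a + (mu_a ** 2) * var_b + 2 * mu_a * mu_b * cov_ab

    print(mean_sum, var_sum, mean_prod, var_prod)
    ```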

  16. Data from: Boosting Random Forests to Reduce Bias; One-Step Boosted Forest and Its Variance Estimate

    • tandf.figshare.com
    pdf
    Updated May 31, 2023
    + more versions
    Cite
    Indrayudh Ghosal; Giles Hooker (2023). Boosting Random Forests to Reduce Bias; One-Step Boosted Forest and Its Variance Estimate [Dataset]. http://doi.org/10.6084/m9.figshare.12946990.v2
    Available download formats: pdf
    Dataset updated
    May 31, 2023
    Dataset provided by
    Taylor & Francis
    Authors
    Indrayudh Ghosal; Giles Hooker
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    In this article, we propose using the principle of boosting to reduce the bias of a random forest prediction in the regression setting. From the original random forest fit, we extract the residuals and then fit another random forest to these residuals. We call the sum of these two random forests a one-step boosted forest. We show with simulated and real data that the one-step boosted forest has a reduced bias compared to the original random forest. The article also provides a variance estimate of the one-step boosted forest by an extension of the infinitesimal Jackknife estimator. Using this variance estimate, we can construct prediction intervals for the boosted forest and we show that they have good coverage probabilities. Combining the bias reduction and the variance estimate, we show that the one-step boosted forest has a significant reduction in predictive mean squared error and thus an improvement in predictive performance. When applied on datasets from the UCI database, one-step boosted forest performs better than random forest and gradient boosting machine algorithms. Theoretically, we can also extend such a boosting process to more than one step and the same principles outlined in this article can be used to find variance estimates for such predictors. Such boosting will reduce bias even further but it risks over-fitting and also increases the computational burden. Supplementary materials for this article are available online.
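
    The core construction is easy to state in code: fit a random forest, fit a second forest to its residuals, and add the two predictions. A minimal scikit-learn sketch; details such as how the residuals are formed, and the infinitesimal-jackknife variance estimate, follow the article and are not reproduced here.

    ```python
    # Sketch: one-step boosted forest = random forest + a second forest fit to
    # the first forest's residuals. In-sample residuals are used here purely
    # for illustration; the article's variance estimate is not reproduced.
    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split

    X, y = make_regression(n_samples=1000, n_features=10, noise=5.0, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    rf1 = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_tr, y_tr)
    resid = y_tr - rf1.predict(X_tr)                        # stage-one residuals
    rf2 = RandomForestRegressor(n_estimators=300, random_state=1).fit(X_tr, resid)

    pred_rf = rf1.predict(X_te)
    pred_boosted = pred_rf + rf2.predict(X_te)              # one-step boosted forest

    print("RF MSE:     ", mean_squared_error(y_te, pred_rf))
    print("Boosted MSE:", mean_squared_error(y_te, pred_boosted))
    ```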

  17. US Average, Maximum, and Minimum Temperatures

    • kaggle.com
    zip
    Updated Jan 18, 2023
    Cite
    The Devastator (2023). US Average, Maximum, and Minimum Temperatures [Dataset]. https://www.kaggle.com/datasets/thedevastator/2015-us-average-maximum-and-minimum-temperatures
    Available download formats: zip (9429155 bytes)
    Dataset updated
    Jan 18, 2023
    Authors
    The Devastator
    Area covered
    United States
    Description

    US Average, Maximum, and Minimum Temperatures

    Analyzing Daily Temperatures Across the USA

    By Matthew Winter

    About this dataset

    This dataset features daily temperature summaries from various weather stations across the United States. It includes information such as location, average temperature, maximum temperature, minimum temperature, state name, state code, and zip code. All values equaling -999 have been filtered out. With this data you can explore how climate conditions changed throughout the year and how they varied across different regions of the country, whether to uncover broad climate trends or to narrow a study to a specific region or city.


    How to use the dataset

    This dataset offers a detailed look at daily average, minimum, and maximum temperatures across the United States, drawing on 1120 weather stations to provide a comprehensive view of temperature trends for the year.

    The data contain a variety of columns, including station, station name, location (latitude and longitude), state name, zip code, and date. The primary focus is on the AvgTemp, MaxTemp, and MinTemp columns, which give the daily average, maximum, and minimum temperature records, respectively, in degrees Fahrenheit.

    To use this dataset effectively, it helps to consider multiple views before drawing conclusions:
    - Plot each individual record against time as a line graph, with one line per station. This makes it easier to spot outliers that may need further examination, much as confidence bands or point-to-point variance are easier to see on a scatterplot than when all points are plotted on a single graph.
    - Compare states with grouped bar charts in which the Avg/Max/Min temperatures for each state are shown side by side. This makes between-state differences visible at a glance; for example, you could check whether California saw an abnormally large temperature increase in July compared with other US states without having to compute the figures manually from the raw data.

    Beyond these two starting points, further visualizations are possible: correlating particular geographical areas with different climatic conditions, or relating areas warmer or colder than the median to relative population densities. Combining these metrics with data from multiple years, rather than a single year, would allow wider inferences to be drawn.
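
    A minimal pandas/matplotlib sketch of the state-level comparison described above. The file name comes from the Columns section below and the column names from this description; treat both as assumptions until checked against the actual CSV.

    ```python
    # Sketch: compare average/max/min temperatures by state.
    # File and column names follow the dataset description but are unverified.
    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("2015 USA Weather Data FINAL.csv")
    # Defensive filter; the description says -999 sentinel values were already removed.
    df = df[(df[["AvgTemp", "MaxTemp", "MinTemp"]] != -999).all(axis=1)]

    by_state = df.groupby("StateName")[["AvgTemp", "MaxTemp", "MinTemp"]].mean()
    by_state.sort_values("AvgTemp").plot(kind="bar", figsize=(14, 5))
    plt.ylabel("Temperature (°F)")
    plt.tight_layout()
    plt.show()
    ```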

    Research Ideas

    • Using the Latitude and Longitude values, this dataset can be used to create a map of average temperatures across the USA, showing which areas were consistently hotter or colder than others throughout the year.
    • Using the AvgTemp and StateName columns, regression models could predict the temperature an area will have in a given month based on its average temperature.
    • Plotting the Date column alongside MaxTemp or MinTemp values, visualizations such as timelines could show how temperatures changed during different times of year across various US states.

    Acknowledgements

    If you use this dataset in your research, please credit the original author, Matthew Winter.

    License

    Unknown License - Please check the dataset description for more information.

    Columns

    File: 2015 USA Weather Data FINAL.csv


  18. The data output from the analysis of variance (+Tukey's post hoc tests) to determine the differences in the mean protein content of Naja ashei venom and antivenom

    • figshare.com
    • datasetcatalog.nlm.nih.gov
    docx
    Updated Jun 26, 2020
    Cite
    Mitchel Okumu (2020). The data output from the analysis of variance (+Tukey's post hoc tests) to determine the differences in the mean protein content of Naja ashei venom and antivenom [Dataset]. http://doi.org/10.6084/m9.figshare.12573425.v1
    Available download formats: docx
    Dataset updated
    Jun 26, 2020
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Mitchel Okumu
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    This dataset contains the following:
    1. ANOVA table (variate: protein content)
    2. Table of effects
    3. Table of means
    4. Standard errors of differences of means
    5. Tukey's 95% confidence intervals
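
    For context, the same kind of output (a one-way ANOVA followed by Tukey comparisons with 95% intervals) can be produced as follows; the groups and values here are synthetic placeholders, not the venom measurements.

    ```python
    # Sketch: one-way ANOVA followed by Tukey's HSD, mirroring the outputs
    # listed above. Data are synthetic placeholders.
    import pandas as pd
    from scipy import stats
    from statsmodels.stats.multicomp import pairwise_tukeyhsd

    df = pd.DataFrame({
        "group":   ["venom"] * 5 + ["antivenom_A"] * 5 + ["antivenom_B"] * 5,
        "protein": [78, 81, 75, 79, 80, 62, 65, 60, 64, 63, 70, 72, 69, 71, 73],
    })

    groups = [g["protein"].values for _, g in df.groupby("group")]
    print(stats.f_oneway(*groups))                                      # ANOVA F and p-value
    print(pairwise_tukeyhsd(df["protein"], df["group"], alpha=0.05))    # Tukey 95% intervals
    ```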

  19. Data from: Mixed-strain housing for female C57BL/6, DBA/2, and BALB/c mice: Validating a split-plot design that promotes refinement and reduction

    • borealisdata.ca
    • search.dataone.org
    Updated Oct 23, 2024
    Cite
    Georgia Mason; Michael Walker (2024). Mixed-strain housing for female C57BL/6, DBA/2, and BALB/c mice: Validating a split-plot design that promotes refinement and reduction [Dataset]. http://doi.org/10.5683/SP3/EBN26Q
    Available formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 23, 2024
    Dataset provided by
    Borealis
    Authors
    Georgia Mason; Michael Walker
    License

    https://borealisdata.ca/api/datasets/:persistentId/versions/2.4/customlicense?persistentId=doi:10.5683/SP3/EBN26Q

    Time period covered
    May 2013 - Aug 2013
    Area covered
    Canada
    Dataset funded by
    Natural Sciences and Engineering Research Council of Canada
    Description

    This dataset validates a novel housing method for inbred mice: mixed-strain housing. The aims were to see whether this housing method affected strain-typical mouse phenotypes, whether variance in the data was affected, and how statistical power was increased through the split-plot design.
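
    In analysis terms, a split-plot housing design of this kind is typically handled with a mixed model in which cage is the whole-plot (random) grouping factor and strain varies within cage. A generic sketch with synthetic data and hypothetical variable names, not the authors' analysis code:

    ```python
    # Sketch: mixed model for a split-plot design (cage = whole plot / random
    # group, strain = within-cage factor). Synthetic data, hypothetical names.
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(5)
    cage = np.repeat(np.arange(20), 3)                    # 20 cages, 3 mice each
    strain = np.tile(["C57BL/6", "DBA/2", "BALB/c"], 20)
    cage_effect = rng.normal(0, 0.5, size=20)[cage]       # shared whole-plot variation
    phenotype = 10 + (strain == "DBA/2") * 1.0 + cage_effect + rng.normal(0, 1, size=60)

    df = pd.DataFrame({"cage": cage, "strain": strain, "phenotype": phenotype})
    result = smf.mixedlm("phenotype ~ strain", df, groups=df["cage"]).fit()
    print(result.summary())
    ```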

  20. Life Expectancy WHO

    • kaggle.com
    zip
    Updated Jun 19, 2023
    Cite
    vikram amin (2023). Life Expectancy WHO [Dataset]. https://www.kaggle.com/datasets/vikramamin/life-expectancy-who
    Available download formats: zip (121472 bytes)
    Dataset updated
    Jun 19, 2023
    Authors
    vikram amin
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)

    Description

    The objective behind this dataset was to understand the predictors that contribute to life expectancy around the world. I have used Linear Regression, Decision Tree, and Random Forest for this purpose. Steps involved:

    • Read the csv file.
    • Data cleaning: the variables Country and Status had character data types and were converted to factors. 2563 missing values were encountered, with the Population variable having the most (652). Rows with missing values were dropped before running the analysis.
    • Run linear regression: before running it, 3 variables (Country, Year, and Status) were dropped because they did not have much of an effect on the dependent variable, Life Expectancy. This left 19 variables (1 dependent and 18 independent). Multiple R squared is 83%, meaning the independent variables explain 83% of the variance in the dependent variable.
    • Outlier detection: we check for outliers using the IQR and find 54. These outliers are removed before running the regression again; multiple R squared increases from 83% to 86%.
    • Multicollinearity: we check for multicollinearity with the VIF (Variance Inflation Factor), which flags cases where two or more independent variables are highly correlated. The rule of thumb is that variables with absolute VIF values above 5 should be removed. Six variables have a VIF above 5: Infant.deaths, percentage.expenditure, Under.five.deaths, GDP, thinness1.19, and thinness5.9. Infant deaths and under-five deaths are strongly collinear, so we drop Infant.deaths (the higher VIF).
    • When the linear regression is run again, the VIF of Under.five.deaths drops from 211.46 to 2.74, while the other variables' VIF values decrease only slightly. The variable thinness1.19 is then dropped and the regression is run once more.
    • The variable thinness5.9, whose absolute VIF was 7.61, drops to 1.95. GDP and Population still have VIF values above 5, but I decided against dropping them as I consider them important independent variables.
    • Set the seed and split the data into train and test sets. The train data gives a multiple R squared of 86% with a p value below alpha, so the model is statistically significant. We use the train data to predict the test data and compute RMSE and MAPE, loading library(Metrics) for this purpose.
    • In linear regression, RMSE (Root Mean Squared Error) is 3.2, meaning the predicted values are off by 3.2 years on average compared with the actual life expectancy values.
    • MAPE (Mean Absolute Percentage Error) is 0.037, indicating a prediction accuracy of 96.20% (1 - 0.037).
    • MAE (Mean Absolute Error) is 2.55, meaning the predicted values deviate from the actual values by approximately 2.55 years on average.

    We use DECISION TREE MODEL for the analysis.

    • Load the required libraries (rpart, rpart.plot, RColorBrewer, rattle).
    • We run the decision tree analysis using rpart and plot the tree with fancyRpartPlot.
    • We use 5-fold cross-validation with the complexity parameter (CP) set to 0.01.
    • For the decision tree, RMSE (Root Mean Squared Error) is 3.06, meaning the predicted values are off by 3.06 years on average compared with the actual life expectancy values.
    • MAPE (Mean Absolute Percentage Error) is 0.035, indicating a prediction accuracy of 96.45% (1 - 0.035).
    • MAE (Mean Absolute Error) is 2.35, meaning the predicted values deviate from the actual values by approximately 2.35 years on average.

    We use RANDOM FOREST for the analysis.

    • Run library(randomForest)
    • We use varImpPlot to find out which variables are most and least significant. Income composition is the most important, followed by adult mortality; the least relevant independent variable is Population.
    • Predict life expectancy with the random forest model.
    • For the random forest, RMSE (Root Mean Squared Error) is 1.73, meaning the predicted values are off by 1.73 years on average compared with the actual life expectancy values.
    • MAPE (Mean Absolute Percentage Error) is 0.01, indicating a prediction accuracy of 98.27% (1 - 0.01).
    • MAE (Mean Absolute Error) is 1.14, meaning the predicted values deviate from the actual values by approximately 1.14 years on average.

    Conclusion: Random Forest is the best model for predicting the life expectancy values as it has the lowest RMSE, MAPE and MAE.
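
    The workflow above is written for R. A compact Python analogue of the core comparison (drop rows with missing values, drop Country/Year/Status, split into train and test sets, then compare linear regression, a decision tree, and a random forest on RMSE/MAPE/MAE) is sketched below; the column names follow the Kaggle file but should be treated as assumptions.

    ```python
    # Sketch: Python analogue of the model comparison described above.
    # File and column names (including the trailing space in "Life expectancy ")
    # are assumptions about the Kaggle CSV.
    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import (mean_absolute_error,
                                 mean_absolute_percentage_error,
                                 mean_squared_error)
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeRegressor

    df = pd.read_csv("Life Expectancy Data.csv").dropna()
    y = df["Life expectancy "]
    X = df.drop(columns=["Life expectancy ", "Country", "Year", "Status"])

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

    models = {
        "Linear regression": LinearRegression(),
        "Decision tree": DecisionTreeRegressor(random_state=42),
        "Random forest": RandomForestRegressor(n_estimators=300, random_state=42),
    }
    for name, model in models.items():
        pred = model.fit(X_tr, y_tr).predict(X_te)
        rmse = mean_squared_error(y_te, pred) ** 0.5
        mape = mean_absolute_percentage_error(y_te, pred)
        mae = mean_absolute_error(y_te, pred)
        print(f"{name}: RMSE={rmse:.2f}  MAPE={mape:.3f}  MAE={mae:.2f}")
    ```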
