CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
In survival analyses, inverse-probability-of-treatment (IPT) and inverse-probability-of-censoring (IPC) weighted estimators of parameters in marginal structural Cox models (Cox MSMs) are often used to estimate treatment effects in the presence of time-dependent confounding and censoring. In most applications, a robust variance estimator of the IPT and IPC weighted estimator is calculated leading to conservative confidence intervals. This estimator assumes that the weights are known rather than estimated from the data. Although a consistent estimator of the asymptotic variance of the IPT and IPC weighted estimator is generally available, applications and thus information on the performance of the consistent estimator are lacking. Reasons might be a cumbersome implementation in statistical software, which is further complicated by missing details on the variance formula. In this paper, we therefore provide a detailed derivation of the variance of the asymptotic distribution of the IPT and IPC weighted estimator and explicitly state the necessary terms to calculate a consistent estimator of this variance. We compare the performance of the robust and the consistent variance estimator in an application based on routine health care data and in a simulation study. The simulation reveals no substantial differences between the two estimators in medium and large data sets with no unmeasured confounding, but the consistent variance estimator performs poorly in small samples or under unmeasured confounding, if the number of confounders is large. We thus conclude that the robust estimator is more appropriate for all practical purposes.
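A minimal sketch of the kind of weighted Cox fit with robust variance discussed above, using the Python lifelines package as an assumed tool (not the authors' implementation); the column names and the combined weight w = ipt_w * ipc_w are illustrative assumptions.

```python
# Minimal sketch (assumed tooling, not the paper's implementation): fit a weighted Cox
# model with the robust "sandwich" variance that treats the IPT*IPC weights as known.
# Assumes a pandas DataFrame `df` with columns: time, event, treatment, ipt_w, ipc_w.
import pandas as pd
from lifelines import CoxPHFitter

def fit_weighted_cox(df: pd.DataFrame) -> CoxPHFitter:
    df = df.copy()
    df["w"] = df["ipt_w"] * df["ipc_w"]          # combined IPT * IPC weights
    cph = CoxPHFitter()
    cph.fit(
        df[["time", "event", "treatment", "w"]],
        duration_col="time",
        event_col="event",
        weights_col="w",
        robust=True,                              # robust variance; weights treated as known
    )
    return cph
```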
I needed a low-variance dataset for my project to make a point. I could not find one here, so I got one from elsewhere, and here it is!
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Script to calculate the variance partitioning and hierarchical partitioning methods at regional and local scales.
The analysis of critical states during the fracture of wood materials is crucial for wood building safety monitoring, wood processing, and related applications. In this paper, beech and camphor pine are selected as the research objects, and the acoustic emission signals during the fracture process of the specimens are analyzed through three-point bending load experiments. On the one hand, the critical state interval of a complex acoustic emission signal system is determined by selecting characteristic parameters in the natural time domain. On the other hand, an improved method of b_value analysis in the natural time domain is proposed based on the characteristics of the acoustic emission signal. The K-value, which marks the beginning of the critical state of a complex acoustic emission signal system, is further defined through this improved natural-time-domain b_value method. For beech, the analysis of critical state time based on characteristic parameters can predict the "collapse" time 8.01 s in advance, while for camphor pine it can do so 3.74 s in advance. The K-value can be identified at least 3 s in advance of the system "collapse" time for beech and 4 s in advance for camphor pine. The results show that, compared with traditional time-domain acoustic emission signal analysis, natural-time-domain analysis can uncover more usable feature information to characterize the state of the signal. Both the characteristic parameters and the Natural_Time_b_value analysis in the natural time domain can effectively characterize the time at which the complex acoustic emission signal system enters the critical state. Critical state analysis can provide new ideas for wood health monitoring, complex signal processing, and related fields.
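As a loose illustration of natural-time analysis of an event series (not the paper's exact Natural_Time_b_value or K-value definitions), the sketch below computes the natural-time variance kappa_1, a statistic commonly used in the natural-time literature to flag an approaching critical state; the squared-amplitude energy proxy and the roughly 0.070 reference value are assumptions, not quantities taken from the paper.

```python
# Minimal sketch (assumptions noted above): natural-time variance kappa_1 for an
# acoustic-emission hit series, where each hit k contributes an "energy" Q_k
# (here approximated by its squared amplitude).
import numpy as np

def natural_time_kappa1(energies: np.ndarray) -> float:
    """kappa_1 = <chi^2> - <chi>^2 with chi_k = k/N and weights p_k = Q_k / sum(Q)."""
    q = np.asarray(energies, dtype=float)
    n = q.size
    chi = np.arange(1, n + 1) / n            # natural time of each event
    p = q / q.sum()                          # normalized energies
    return float(np.sum(p * chi**2) - np.sum(p * chi) ** 2)

# Example: track kappa_1 as hits accumulate; values approaching ~0.070 are often taken
# as a signature of an approaching critical state in the natural-time literature.
amplitudes = np.random.default_rng(0).lognormal(size=500)
kappa_trace = [natural_time_kappa1(amplitudes[:k] ** 2) for k in range(10, 501, 10)]
```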
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
A noniterative sample size procedure is proposed for a general hypothesis test based on the t distribution by modifying and extending Guenther’s (1981) approach for the one sample and two sample t tests. The generalized procedure is employed to determine the sample size for treatment comparisons using the analysis of covariance (ANCOVA) and the mixed effects model for repeated measures (MMRM) in randomized clinical trials. The sample size is calculated by adding a few simple correction terms to the sample size from the normal approximation to account for the nonnormality of the t statistic and lower order variance terms, which are functions of the covariates in the model. But it does not require specifying the covariate distribution. The noniterative procedure is suitable for superiority tests, noninferiority tests and a special case of the tests for equivalence or bioequivalence, and generally yields the exact or nearly exact sample size estimate after rounding to an integer. The method for calculating the exact power of the two sample t test with unequal variance in superiority trials is extended to equivalence trials. We also derive accurate power formulae for ANCOVA and MMRM, and the formula for ANCOVA is exact for normally distributed covariates. Numerical examples demonstrate the accuracy of the proposed methods particularly in small samples.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
R Core Team. (2016). R: A language and environment for statistical computing. R Foundation for Statistical Computing.
Supplement to Occipital and left temporal instantaneous amplitude and frequency oscillations correlated with access and phenomenal consciousness (https://philpapers.org/rec/PEROAL-2).
Occipital and left temporal instantaneous amplitude and frequency oscillations correlated with access and phenomenal consciousness moves from the features of the ERP characterized in Occipital and Left Temporal EEG Correlates of Phenomenal Consciousness (Pereira, 2015, https://doi.org/10.1016/b978-0-12-802508-6.00018-1, https://philpapers.org/rec/PEROAL) towards the instantaneous amplitude and frequency of event-related changes correlated with a contrast in access and in phenomenology.
Occipital and left temporal instantaneous amplitude and frequency oscillations correlated with access and phenomenal consciousness proceeds as follows.
In the first section, empirical mode decomposition (EMD) with post-processed ensemble empirical mode decomposition (postEEMD) and the Hilbert-Huang transform (HHT) are applied (Xie, G., Guo, Y., Tong, S., and Ma, L., 2014. Calculate excess mortality during heatwaves using Hilbert-Huang transform algorithm. BMC Medical Research Methodology, 14, 35).
In the second section, the variance inflation factor (VIF) is calculated (a minimal sketch of this step follows the list below).
In the third section, partial least squares regression (PLSR) is fitted and the minimal root mean squared error of prediction (RMSEP) is determined.
In the last section, partial least squares regression (PLSR) is applied with the significance multivariate correlation (sMC) statistic.
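A minimal sketch of the VIF step mentioned in the second section, using statsmodels as an assumed tool; the supplement's own scripts are not reproduced here.

```python
# Minimal sketch (assumed tooling, not the supplement's own script): variance
# inflation factors for a predictor matrix X, as referenced in the second section.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(X: pd.DataFrame) -> pd.DataFrame:
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 regresses predictor j on the others."""
    Xc = sm.add_constant(X)                      # include an intercept column
    rows = [
        {"predictor": col, "VIF": variance_inflation_factor(Xc.values, i)}
        for i, col in enumerate(Xc.columns)
        if col != "const"
    ]
    return pd.DataFrame(rows)
```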
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
In power analysis for multivariable Cox regression models, variance of the estimated log-hazard ratio for the treatment effect is usually approximated by inverting the expected null information matrix. Because in many typical power analysis settings assumed true values of the hazard ratios are not necessarily close to unity, the accuracy of this approximation is not theoretically guaranteed. To address this problem, the null variance expression in power calculations can be replaced with one of alternative expressions derived under the assumed true value of the hazard ratio for the treatment effect. This approach is explored analytically and by simulations in the present paper. We consider several alternative variance expressions, and compare their performance to that of the traditional null variance expression. Theoretical analysis and simulations demonstrate that while the null variance expression performs well in many non-null settings, it can also be very inaccurate, substantially underestimating or overestimating the true variance in a wide range of realistic scenarios, particularly those where the numbers of treated and control subjects are very different and the true hazard ratio is not close to one. The alternative variance expressions have much better theoretical properties, confirmed in simulations. The most accurate of these expressions has a relatively simple form - it is the sum of inverse expected event counts under treatment and under control scaled up by a variance inflation factor.
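As a rough illustration of the two variance approximations discussed above, the sketch below compares power computed from the inverted expected null information, 1/(d p (1-p)), with power computed from the sum of inverse expected event counts per arm; the event counts, allocation fraction, and the inflation factor (set to 1 as a placeholder) are illustrative assumptions, not values from the paper.

```python
# Minimal sketch (illustrative inputs, not the paper's settings): power for the
# log hazard ratio under two different variance approximations.
import math
from scipy.stats import norm

def power_log_hr(log_hr: float, var: float, alpha: float = 0.05) -> float:
    z = norm.ppf(1 - alpha / 2)
    return float(norm.cdf(abs(log_hr) / math.sqrt(var) - z))

d1, d0 = 60, 240          # assumed expected event counts under treatment / control
p = 0.2                   # assumed fraction of subjects treated
d = d1 + d0

var_null = 1.0 / (d * p * (1 - p))     # inverted expected null information
var_alt = (1.0 / d1 + 1.0 / d0) * 1.0  # sum of inverse event counts; inflation factor = 1 (placeholder)

hr = 0.6
print(power_log_hr(math.log(hr), var_null), power_log_hr(math.log(hr), var_alt))
```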
This dataset has been meticulously curated to assist investment analysts, like you, in performing mean-variance optimization for constructing efficient portfolios. The dataset contains historical financial data for a selection of assets, enabling the calculation of risk and return characteristics necessary for portfolio optimization. The goal is to help you determine the most effective allocation of assets to achieve optimal risk-return trade-offs.
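A minimal sketch of one mean-variance building block, the unconstrained minimum-variance portfolio with weights proportional to Sigma^-1 1, assuming a pandas DataFrame of daily closing prices with one column per asset (the dataset's actual layout may differ).

```python
# Minimal sketch (assumed data layout): estimate the return covariance matrix from
# price history, then compute unconstrained minimum-variance portfolio weights
# w = (Sigma^-1 1) / (1' Sigma^-1 1).
import numpy as np
import pandas as pd

def min_variance_weights(prices: pd.DataFrame) -> pd.Series:
    returns = prices.pct_change().dropna()        # simple daily returns
    cov = returns.cov().values
    ones = np.ones(cov.shape[0])
    w = np.linalg.solve(cov, ones)                # Sigma^-1 1
    w /= w.sum()                                  # normalize so weights sum to 1
    return pd.Series(w, index=prices.columns)
```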
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Theories demand much of data, often more than a single data collection can provide. For example, many important research questions are set in the past and must rely on data collected at that time and for other purposes. As a result, we often find that the data lack crucial variables. Another common problem arises when we wish to estimate the relationship between variables that are measured in different data sets. A variation of this occurs with a split half sample design in which one or more important variables appear on the "wrong" half. Finally, we may need panel data but have only cross sections available. In each of these cases our ability to estimate the theoretically determined equation is limited by the data that are available. In many cases there is simply no solution, and theory must await new opportunities for testing. Under certain circumstances, however, we may still be able to estimate relationships between variables even though they are not measured on the same set of observations. This technique, which I call two-stage auxiliary instrumental variables (2SAIV), provides some new leverage on such problems and offers the opportunity to test hypotheses that were previously out of reach. This article develops the 2SAIV estimator, proves its consistency, and derives its asymptotic variance. A set of simulations illustrates the performance of the estimator in finite samples, and several applications are sketched out.
We provide the generated dataset used for supervised machine learning in the related article. The data are in tabular format and contain all principal components and ground truth labels per tissue type. Tissue type codes used are: C1 for kidney, C2 for skin, and C3 for colon. 'PC' stands for principal component. For feature extraction specifications, please see the original design in the related article. Features have been extracted independently for each tissue type.
IMP-8 Weimer propagated solar wind data and linearly interpolated time delay, cosine angle, and goodness information of the propagated data at 1 min resolution. This data set consists of solar wind data that has first been propagated to a position just outside of the nominal bow shock (about (17, 0, 0) Re) and then linearly interpolated to 1 min resolution using the interp1.m function in MATLAB. The input for this data set is 1 min resolution processed solar wind data constructed by Dr. J.M. Weygand. The method of propagation is similar to the minimum variance technique and is outlined in Weimer et al. [2003; 2004]. The basic method is to find the minimum variance direction of the magnetic field in the plane orthogonal to the mean magnetic field direction. This minimum variance direction is then dotted with the difference between the final position vector and the original position vector, and the quantity is divided by the minimum variance direction dotted with the solar wind velocity vector, which gives the propagation time. This method does not work well for shocks or for minimum variance directions tilted more than 70 degrees from the Sun-Earth line. This data set was originally constructed by Dr. J.M. Weygand for Prof. R.L. McPherron, who was the principal investigator of two National Science Foundation studies: GEM Grant ATM 02-1798 and Space Weather Grant ATM 02-08501. These data were primarily used in superposed epoch studies.
References: Weimer, D. R. (2004), Correction to "Predicting interplanetary magnetic field (IMF) propagation delay times using the minimum variance technique," J. Geophys. Res., 109, A12104, doi:10.1029/2004JA010691. Weimer, D. R., D. M. Ober, N. C. Maynard, M. R. Collier, D. J. McComas, N. F. Ness, C. W. Smith, and J. Watermann (2003), Predicting interplanetary magnetic field (IMF) propagation delay times using the minimum variance technique, J. Geophys. Res., 108, 1026, doi:10.1029/2002JA009405.
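A minimal sketch of the delay formula described above (find the minimum-variance direction n of the field in the plane orthogonal to the mean field, then delay = n . (r_target - r_sc) / (n . v_sw)); this is an illustration only, not Dr. Weygand's processing code, and it ignores the sign ambiguity of the eigenvector and the quality checks a real implementation needs.

```python
# Minimal sketch of the stated propagation-delay formula (not the actual processing code).
import numpy as np

def propagation_delay(b: np.ndarray, v_sw: np.ndarray, r_sc: np.ndarray, r_target: np.ndarray) -> float:
    """b: (N,3) IMF samples [nT]; v_sw: (3,) solar wind velocity [km/s];
    r_sc, r_target: (3,) positions [km]. Returns the delay in seconds."""
    b_mean = b.mean(axis=0)
    b_hat = b_mean / np.linalg.norm(b_mean)
    # orthonormal basis (e1, e2) of the plane orthogonal to the mean field direction
    trial = np.array([1.0, 0.0, 0.0])
    if abs(trial @ b_hat) > 0.9:
        trial = np.array([0.0, 1.0, 0.0])
    e1 = np.cross(b_hat, trial); e1 /= np.linalg.norm(e1)
    e2 = np.cross(b_hat, e1)
    # 2-D covariance of the field components within that plane
    bp = np.column_stack([b @ e1, b @ e2])
    eigvals, eigvecs = np.linalg.eigh(np.cov(bp.T))
    u = eigvecs[:, np.argmin(eigvals)]            # minimum-variance direction (sign arbitrary)
    n = u[0] * e1 + u[1] * e2                     # back to 3-D coordinates
    return float(n @ (r_target - r_sc) / (n @ v_sw))
```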
The OCTO-Twin Study aims to investigate the etiology of individual differences among twin pairs aged 80 and older across a range of domains, including health and functional capacity, cognitive functioning, psychological well-being, personality, and personal control. Twin pairs were drawn from the Swedish Twin Registry. At the first wave, the twins had to be born in 1913 or earlier, and both partners in the pair had to agree to participate. At baseline in 1991-94, 351 twin pairs (149 monozygotic and 202 like-sex dizygotic pairs) were investigated (mean age 83.6 years; 67% female). Two-year longitudinal follow-ups were conducted on all twins who were alive and agreed to participate. Data have been collected at five waves over a total of eight years.
In wave 5, 43 twin pairs participated, with a total of 222 individuals. Refer to the description of wave 1 (the baseline) and the individual datasets in the NEAR portal for more details on variable groups and individual variables.
Supplementary File1_phenotypes: this txt file contains the phenotypes assessed in our study for all mice under control (CTRL) or enriched conditions. The file is comma-delimited, with one line per mouse. All abbreviations and phenotypes are explained in the article.
ACE Weimer propagated solar wind data and linearly interpolated time delay, cosine angle, and goodness information of the propagated data at 1 min resolution. This data set consists of solar wind data that has first been propagated to a position just outside of the nominal bow shock (about (17, 0, 0) Re) and then linearly interpolated to 1 min resolution using the interp1.m function in MATLAB. The input for this data set is 1 min resolution processed solar wind data constructed by Dr. J.M. Weygand. The method of propagation is similar to the minimum variance technique and is outlined in Weimer et al. [2003; 2004]. The basic method is to find the minimum variance direction of the magnetic field in the plane orthogonal to the mean magnetic field direction. This minimum variance direction is then dotted with the difference between the final position vector and the original position vector, and the quantity is divided by the minimum variance direction dotted with the solar wind velocity vector, which gives the propagation time. This method does not work well for shocks or for minimum variance directions tilted more than 70 degrees from the Sun-Earth line. This data set was originally constructed by Dr. J.M. Weygand for Prof. R.L. McPherron, who was the principal investigator of two National Science Foundation studies: GEM Grant ATM 02-1798 and Space Weather Grant ATM 02-08501. These data were primarily used in superposed epoch studies.
References: Weimer, D. R. (2004), Correction to "Predicting interplanetary magnetic field (IMF) propagation delay times using the minimum variance technique," J. Geophys. Res., 109, A12104, doi:10.1029/2004JA010691. Weimer, D. R., D. M. Ober, N. C. Maynard, M. R. Collier, D. J. McComas, N. F. Ness, C. W. Smith, and J. Watermann (2003), Predicting interplanetary magnetic field (IMF) propagation delay times using the minimum variance technique, J. Geophys. Res., 108, 1026, doi:10.1029/2002JA009405.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Abstract Arithmetic map operations are very common procedures used in GIS to combine raster maps resulting in a new and improved raster map. It is essential that this new map be accompanied by an assessment of uncertainty. This paper shows how we can calculate the uncertainty of the resulting map after performing some arithmetic operation. Actually, the propagation of uncertainty depends on a reliable measurement of the local accuracy and local covariance, as well. In this sense, the use of the interpolation variance is proposed because it takes into account both data configuration and data values. Taylor series expansion is used to derive the mean and variance of the function defined by an arithmetic operation. We show exact results for means and variances for arithmetic operations involving addition, subtraction and multiplication and that it is possible to get approximate mean and variance for the quotient of raster maps.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
In this article, we propose using the principle of boosting to reduce the bias of a random forest prediction in the regression setting. From the original random forest fit, we extract the residuals and then fit another random forest to these residuals. We call the sum of these two random forests a one-step boosted forest. We show with simulated and real data that the one-step boosted forest has a reduced bias compared to the original random forest. The article also provides a variance estimate of the one-step boosted forest by an extension of the infinitesimal Jackknife estimator. Using this variance estimate, we can construct prediction intervals for the boosted forest and we show that they have good coverage probabilities. Combining the bias reduction and the variance estimate, we show that the one-step boosted forest has a significant reduction in predictive mean squared error and thus an improvement in predictive performance. When applied on datasets from the UCI database, one-step boosted forest performs better than random forest and gradient boosting machine algorithms. Theoretically, we can also extend such a boosting process to more than one step and the same principles outlined in this article can be used to find variance estimates for such predictors. Such boosting will reduce bias even further but it risks over-fitting and also increases the computational burden. Supplementary materials for this article are available online.
By Matthew Winter [source]
This dataset features daily temperature summaries from various weather stations across the United States. It includes information such as location, average temperature, maximum temperature, minimum temperature, state name, state code, and zip code. All data in this dataset have been filtered so that any values equaling -999 were removed. With this set of data you can explore how climate conditions changed throughout the year and how they varied across different regions of the country. Dive into your own research to uncover climate trends, or narrow your analysis to a specific region or city.
This dataset offers a detailed look at daily average, minimum, and maximum temperatures across the United States. It contains information from 1120 weather stations to provide a comprehensive look at temperature trends over the year.
The data contain a variety of columns including station, station name, location (latitude and longitude), state name, zip code, and date. The primary focus of this dataset is on the AvgTemp, MaxTemp, and MinTemp columns, which provide the daily average, maximum, and minimum temperature records, respectively, in degrees Fahrenheit.
To use this dataset effectively, it is useful to consider multiple views before undertaking any analysis or drawing conclusions:
- Plot each individual record versus time as a line graph, with stations labeled on separate lines to show changes over time. This helps identify outliers that may need further examination, much as a scatterplot with confidence bands reveals variance between points that is hard to see when all points are plotted on a single graph.
- Compare states using grouped bar charts, with Avg/Max/Min temperatures shown within each chart, to reveal any variance between states during a specific period. For example, you could check whether California showed an abnormally high temperature increase in July compared with other US states, since all measurements would be represented visually rather than having to be calculated manually from the raw data. Beyond these two starting points, further visualizations are possible: correlations between particular geographical areas and different climatic conditions, or population analyses that relate areas warmer or colder than the median to relative population densities. Combining such metrics across multiple years, rather than a single year, would allow wider inferences for anyone exploring this data set beyond its original compilation.
- Using the Latitude and Longitude values, this dataset can be used to create a map of average temperatures across the USA. This would be useful for seeing which areas were consistently hotter or colder than others throughout the year.
- Using the AvgTemp and StateName columns, analysts could use regression modeling to predict what temperature an area will have in a given month based on its average temperature.
- By using the Date column and plotting it alongside MaxTemp or MinTemp values, visualization methods such as timelines could be used to show how temperatures changed during different times of year across various states in the US (a minimal loading and aggregation sketch follows this list).
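A minimal loading and aggregation sketch for the ideas above; the column names (Date, AvgTemp, MaxTemp, MinTemp, StateName) and the exact file name are assumptions taken from the description and may need adjusting to the actual CSV.

```python
# Minimal sketch (assumed column names): load the daily summaries and compute
# state-level monthly means of the daily average temperature (degrees Fahrenheit).
import pandas as pd

df = pd.read_csv("2015 USA Weather Data FINAL.csv", parse_dates=["Date"])
# Defensive filter; the description says -999 values were already removed.
df = df[(df["AvgTemp"] != -999) & (df["MaxTemp"] != -999) & (df["MinTemp"] != -999)]

monthly = (
    df.assign(Month=df["Date"].dt.month)
      .groupby(["StateName", "Month"])["AvgTemp"]
      .mean()
      .reset_index()
)
print(monthly.head())
```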
If you use this dataset in your research, please credit the original authors.
Data Source
Unknown License - Please check the dataset description for more information.
File: 2015 USA Weather Data FINAL.csv
If you use this dataset in your research, please credit Matthew Winter.
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
This dataset contains the following:
1. ANOVA table (variate: protein content)
2. Table of effects
3. Table of means
4. Standard errors of differences of means
5. Tukey's 95% confidence intervals
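A minimal sketch of how output of this kind could be reproduced in Python with statsmodels, under assumed column names (protein_content, treatment) and an assumed file name; this is not the original analysis script, and the standard errors of differences of means are not printed directly.

```python
# Minimal sketch (assumed column and file names, not the original analysis):
# one-way ANOVA on protein content followed by Tukey 95% comparisons.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.multicomp import pairwise_tukeyhsd

df = pd.read_csv("protein.csv")                      # hypothetical file with the columns below
model = ols("protein_content ~ C(treatment)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))               # 1. ANOVA table
print(model.params)                                  # 2.-3. effects and means via coefficients
print(pairwise_tukeyhsd(df["protein_content"], df["treatment"], alpha=0.05))  # 5. Tukey 95% CIs
```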
https://borealisdata.ca/api/datasets/:persistentId/versions/2.4/customlicense?persistentId=doi:10.5683/SP3/EBN26Q
Validating a novel housing method for inbred mice: mixed-strain housing. The aim was to see whether this housing method affected strain-typical mouse phenotypes, whether variance in the data was affected, and how statistical power was increased through this split-plot design.
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
The objective behind this dataset was to understand the predictors that contribute to life expectancy around the world. I used Linear Regression, Decision Tree, and Random Forest for this purpose. Steps involved:
1. Read the csv file.
2. Data cleaning: the variables Country and Status had character data types and had to be converted to factors. 2563 missing values were encountered, with the Population variable having the most missing values (652). Rows with missing values were dropped before running the analysis.
3. Run linear regression: before fitting the model, 3 variables were dropped as they did not have much of an effect on the dependent variable, Life Expectancy. These 3 variables were Country, Year, and Status. This left 19 variables (1 dependent and 18 independent variables). Multiple R squared is 83%, which means the independent variables can explain 83% of the variance in the dependent variable.
4. Outlier detection: we check for outliers using the IQR and find 54 outliers. These outliers are removed before running the regression again, and multiple R squared increases from 83% to 86%.
5. Multicollinearity: we check for multicollinearity using the variance inflation factor (VIF), which flags cases where two or more independent variables are highly correlated. The rule of thumb is that variables with VIF values above 5 should be removed. We find 6 variables with a VIF higher than 5: Infant.deaths, percentage.expenditure, Under.five.deaths, GDP, thinness1.19, and thinness5.9. Infant deaths and under-five deaths are strongly collinear, so we drop Infant.deaths (which has the higher VIF). When we run the linear regression again, the VIF of Under.five.deaths drops from 211.46 to 2.74, while the other variables' VIF values decrease only slightly. Variable thinness1.19 is then dropped and the regression is run once more; the VIF of thinness5.9 drops from 7.61 to 1.95. GDP and Population still have VIF values above 5, but I decided against dropping them as I consider them important independent variables.
6. Set the seed and split the data into train and test sets. We fit on the train data and get a multiple R squared of 86% with a p value less than alpha, which indicates statistical significance. We then use the trained model to predict the test data and compute the RMSE and MAPE using library(Metrics).
7. In linear regression, RMSE (root mean squared error) is 3.2, indicating that on average the predicted values differ from the actual life expectancy values by 3.2 years. MAPE (mean absolute percentage error) is 0.037, indicating a prediction accuracy of about 96.3% (1 - 0.037). MAE (mean absolute error) is 2.55, indicating that on average the predicted values deviate by approximately 2.55 years from the actual values.
Conclusion: Random Forest is the best model for predicting the life expectancy values as it has the lowest RMSE, MAPE and MAE.
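A minimal sketch of the train/test evaluation step, using scikit-learn rather than the author's R workflow; the metric definitions match those quoted above, but the split ratio and seed are assumptions.

```python
# Minimal sketch (assumed split and seed, not the author's R code): train/test split
# and RMSE / MAPE / MAE for a linear regression, mirroring the figures quoted above.
# Requires scikit-learn >= 0.24 for mean_absolute_percentage_error.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, mean_absolute_percentage_error

def evaluate(X, y, seed=42):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
    pred = LinearRegression().fit(X_tr, y_tr).predict(X_te)
    rmse = np.sqrt(mean_squared_error(y_te, pred))   # average error in years
    mape = mean_absolute_percentage_error(y_te, pred)
    mae = mean_absolute_error(y_te, pred)
    return rmse, mape, mae
```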