CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Case-crossover study designs are observational studies used to assess the post-market safety of medical products (e.g., vaccines or drugs). Because a case-crossover study is self-controlled, its advantages include better control for confounding, since the design controls for any time-invariant measured and unmeasured confounding, and potentially greater feasibility, since only data from those experiencing an event (the cases) are required. However, self-matching also introduces correlation between case and control periods within a subject or matched unit. To estimate sample size in a case-crossover study, investigators currently use Dupont’s formula (Biometrics 1988; 44:1157-1168), which was originally developed for a matched case-control study. This formula is relevant because it accounts for the correlation in exposure between cases and controls, which is expected to be high in self-controlled studies. However, in our study, we show that Dupont’s formula and other currently used methods for determining sample size in case-crossover studies may be inadequate. Specifically, these formulae tend to underestimate the true required sample size, determined through simulations, for a range of values in the parameter space. We present mathematical derivations to explain where some currently used methods fail and propose two new sample size estimation methods that provide a more accurate estimate of the true required sample size.
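A minimal sketch of the simulation idea the abstract alludes to: it estimates the "true" required sample size for a 1:1 case-crossover design by simulating correlated binary exposures and applying McNemar's test to the discordant pairs. The logistic exposure model, the random-intercept correlation structure, and all parameter values below are illustrative assumptions, not the authors' actual simulation setup.

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_power(n_cases, odds_ratio, base_logit=-1.5, sigma_u=1.0,
                   n_sims=1000):
    """Estimate power of McNemar's test for n_cases self-matched pairs.

    A subject-level random intercept u induces the within-subject
    exposure correlation that self-matching creates.
    """
    z_crit = 1.96  # two-sided 5% critical value
    rejections = 0
    for _ in range(n_sims):
        u = rng.normal(0.0, sigma_u, n_cases)
        p_control = 1.0 / (1.0 + np.exp(-(base_logit + u)))
        p_case = 1.0 / (1.0 + np.exp(-(base_logit + u + np.log(odds_ratio))))
        exposed_case = rng.random(n_cases) < p_case
        exposed_control = rng.random(n_cases) < p_control
        b = np.sum(exposed_case & ~exposed_control)  # discordant: exposed in case window only
        c = np.sum(~exposed_case & exposed_control)  # discordant: exposed in control window only
        if b + c > 0 and abs(b - c) / np.sqrt(b + c) > z_crit:
            rejections += 1
    return rejections / n_sims

def required_n(odds_ratio, target_power=0.80, grid=range(50, 2001, 50)):
    """Smallest n on the grid whose simulated power reaches the target."""
    for n in grid:
        if simulate_power(n, odds_ratio) >= target_power:
            return n
    return None

print(required_n(odds_ratio=2.0))
```

Comparing `required_n` against a closed-form estimate such as Dupont's formula at the same parameter values is one way to see the underestimation the abstract describes.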
This brief provides more information about how a State may, for planning purposes, calculate a sample size for the NYTD follow-up population. Metadata-only record linking to the original dataset.
Background: Published formulas for case-control designs provide sample sizes required to determine that a given disease-exposure odds ratio is significantly different from one, adjusting for a potential confounder and possible interaction. Results: The formulas are extended from one control per case to F controls per case and are adjusted for a potential multi-category confounder in unmatched or matched designs. Interactive FORTRAN programs that compute the formulas are described. The effect of potential disease-exposure-confounder interaction may be explored. Conclusions: Software is now available for computing adjusted sample sizes for case-control designs.
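As a point of reference for the formulas being extended, here is the classical unmatched case-control sample-size calculation with F controls per case (a Schlesselman-style formula, without the confounder adjustment that the published FORTRAN programs provide); the function name and default values are illustrative:

```python
import math
from scipy.stats import norm

def case_control_n(p0, odds_ratio, F=1, alpha=0.05, power=0.80):
    """Cases needed to detect `odds_ratio` at control exposure prevalence
    `p0`, with F controls per case; returns (n_cases, n_controls)."""
    p1 = odds_ratio * p0 / (1 + p0 * (odds_ratio - 1))  # exposure prob. in cases
    p_bar = (p1 + F * p0) / (1 + F)                     # pooled exposure prob.
    z_a = norm.ppf(1 - alpha / 2)
    z_b = norm.ppf(power)
    numerator = (z_a * math.sqrt((1 + 1 / F) * p_bar * (1 - p_bar))
                 + z_b * math.sqrt(p1 * (1 - p1) + p0 * (1 - p0) / F)) ** 2
    n_cases = math.ceil(numerator / (p1 - p0) ** 2)
    return n_cases, n_cases * F

# e.g., 20% exposure among controls, target OR = 2, two controls per case
print(case_control_n(p0=0.2, odds_ratio=2.0, F=2))  # -> (126, 252)
```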
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Advances in neuroimaging, genomics, motion tracking, eye-tracking, and many other technology-based data collection methods have led to a torrent of high-dimensional datasets, which commonly have a small number of samples because of the intrinsically high cost of data collection involving human participants. High-dimensional data with a small number of samples are of critical importance for identifying biomarkers and conducting feasibility and pilot work, but they can lead to biased machine learning (ML) performance estimates. Our review of studies that applied ML to distinguish autistic from non-autistic individuals showed that small sample size is associated with higher reported classification accuracy. We therefore investigated whether this bias could be caused by the use of validation methods that do not sufficiently control overfitting. Our simulations show that K-fold Cross-Validation (CV) produces strongly biased performance estimates with small sample sizes, and the bias is still evident at a sample size of 1000. Nested CV and train/test split approaches produce robust and unbiased performance estimates regardless of sample size. We also show that feature selection, when performed on pooled training and testing data, contributes considerably more to this bias than parameter tuning. In addition, the contributions of data dimensionality, hyper-parameter space, and number of CV folds to the bias were explored, and the validation methods were also compared on discriminable data. The results suggest how to design robust testing methodologies when working with small datasets and how to interpret the results of other studies depending on which validation method was used.
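The feature-selection leakage described above is easy to reproduce. The following sketch (illustrative sizes and models, not the study's exact setup) uses pure-noise data: selecting features on the pooled data before K-fold CV inflates accuracy well above chance, while performing selection inside each training fold keeps it near 0.5.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5000))    # small n, high dimensionality
y = rng.integers(0, 2, size=40)    # labels carry no signal at all

# Biased: feature selection sees all the data before CV is run.
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
biased = cross_val_score(SVC(), X_leaky, y, cv=5).mean()

# Unbiased: selection is refit on each training fold only.
pipe = make_pipeline(SelectKBest(f_classif, k=20), SVC())
unbiased = cross_val_score(pipe, X, y, cv=5).mean()

print(f"leaky CV accuracy:  {biased:.2f}")    # typically far above 0.5
print(f"proper CV accuracy: {unbiased:.2f}")  # typically near 0.5
```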
Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This is the replication dataset corresponding to the publication "Sample size requirements for riverbank macrolitter characterization". We refer to this publication for the data description. Additionally the document contents_data_publication.pdf will provide an overview of the contents of this database.
These detailed tables show standard errors for sample sizes and population estimates from the 2012 National Survey on Drug Use and Health (NSDUH). Standard errors for sample sizes and population estimates are provided by age group, gender, race/ethnicity, education level, employment status, geographic area, pregnancy status, college enrollment status, and probation/parole status.
These detailed tables show sample sizes and population estimates pertaining to mental health from the 2010 National Survey on Drug Use and Health (NSDUH). Sample sizes and population estimates are provided by age group, gender, race/ethnicity, education level, employment status, poverty level, geographic area, and insurance status.
Data collection techniques, study participants, and sample size.
This dataset was created by Tatag Suryo Pambudi.
This dataset tracks the updates made on the dataset "Sample Size and Population Estimates Tables (Standard Errors and P Values) - 8.1 to 8.13" as a repository for previous versions of the data and metadata.
Activity, sample size, study site contributing data, and age ranges for each of the activities examined in this study. *B = Baylor, M = Massachusetts-Boston, MS = Michigan State, NC = North Carolina, O = Oregon State.
This dataset tracks the updates made on the dataset "Sample Size and Population Estimates Tables (Prevalence Estimates) - 8.1 to 8.13" as a repository for previous versions of the data and metadata.
DHS datasets and sample size.
These detailed tables show sample sizes and population estimates from the 2012 National Survey on Drug Use and Health (NSDUH). Sample sizes and population estimates are provided by age group, gender, race/ethnicity, education level, employment status, geographic area, pregnancy status, college enrollment status, and probation/parole status.
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Accurate sample size estimation is a cornerstone of successful Institutional Review Board (IRB) proposals, as it establishes the feasibility of clinical studies and ensures they are sufficiently powered to detect meaningful effects. Underestimating sample size poses the risk of insufficient statistical power, compromising the ability to identify significant outcomes. Conversely, overestimating sample size can lead to prolonged data collection, wasting valuable time and resources. One of the primary challenges in sample size estimation lies in the uncertainty surrounding variance and effect size before the study begins. Group Sequential Design with Sample Size Re-estimation (GSD-SSR) effectively addresses this issue by utilizing interim data at predefined stages to refine these estimates. GSD-SSR enables dynamic adjustments to sample size during the study, optimizing resource allocation and improving overall efficiency. We offer a comprehensive introduction to the theoretical background of GSD-SSR and provide step-by-step guidance for its practical application in clinical research. To further facilitate adoption, we have developed a user-friendly online platform that streamlines the GSD-SSR process and integrates it seamlessly into IRB proposals. By incorporating GSD-SSR into the power analysis of IRB proposals, researchers can significantly increase the likelihood of successful clinical studies while enhancing budget efficiency and optimizing timelines.
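A minimal sketch of the re-estimation step under simplifying assumptions: a two-arm design powered with the standard two-sample normal-approximation formula, where the interim variance estimate replaces the planning value. Real GSD-SSR designs also adjust critical values for the interim look (e.g., O'Brien-Fleming boundaries), which this sketch omits; all numbers are illustrative.

```python
import math
from scipy.stats import norm

def n_per_arm(sd, delta, alpha=0.05, power=0.80):
    """Standard two-sample normal approximation for n per arm."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return math.ceil(2 * (z * sd / delta) ** 2)

# Planning stage: guessed SD of 10 against a meaningful difference of 4.
n_planned = n_per_arm(sd=10.0, delta=4.0)

# Interim stage: the observed pooled SD is larger than the planning guess,
# so the re-estimated target grows (capped at a pre-specified maximum).
sd_interim = 13.0
n_reestimated = min(n_per_arm(sd=sd_interim, delta=4.0), 4 * n_planned)

print(n_planned, n_reestimated)  # 99 -> 166
```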
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Sample size calculation per Cochrane review group; random review # generator (used to help pick reviews at random)
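A hedged guess at what such a random review number generator might do (the actual implementation ships with the dataset and may differ; the function and parameters here are hypothetical): draw k distinct review numbers from a group's 1..N range.

```python
import random

def pick_reviews(n_reviews_in_group: int, k: int, seed: int = 0) -> list[int]:
    """Select k distinct review numbers at random, reproducibly."""
    random.seed(seed)
    return sorted(random.sample(range(1, n_reviews_in_group + 1), k))

print(pick_reviews(250, 10))
```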
Summary of compiled dataset sample sizes, location of lakes, and collection years.
These detailed tables show sample sizes and population estimates from the 2012 National Survey on Drug Use and Health (NSDUH) Mental Health Detailed Tables. Sample sizes and population estimates are provided by age group, gender, race/ethnicity, education level, employment status, county type, poverty level, insurance status, overall health, and geographic area.
The dataset is a relational dataset of 8,000 households, representing a sample of the population of an imaginary middle-income country. The dataset contains two data files: one with variables at the household level and the other with variables at the individual level. It includes variables that are typically collected in population censuses (demography, education, occupation, dwelling characteristics, fertility, mortality, and migration) and in household surveys (household expenditure, anthropometric data for children, asset ownership). The data include only ordinary households (no community households). The dataset was created using REaLTabFormer, a model that leverages deep learning methods. The dataset was created for the purposes of training and simulation and is not intended to be representative of any specific country.
The full-population dataset (with about 10 million individuals) is also distributed as open data.
The dataset is a synthetic dataset for an imaginary country. It was created to represent the population of this country by province (equivalent to admin1) and by urban/rural areas of residence.
Household, Individual
The dataset is a fully-synthetic dataset representative of the resident population of ordinary households for an imaginary middle-income country.
The sample size was set to 8,000 households. The fixed number of households to be selected from each enumeration area was set to 25. In a first stage, the number of enumeration areas to be selected in each stratum was calculated, proportional to the size of each stratum (stratification by geo_1 and urban/rural). Then 25 households were randomly selected within each enumeration area. The R script used to draw the sample is provided as an external resource.
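A sketch of this two-stage design in code, assuming a hypothetical enumeration-area frame with columns `stratum` (geo_1 crossed with urban/rural), `ea_id`, and `hh_ids` (the list of household IDs in each EA); the dataset's actual sample was drawn with the R script distributed as an external resource.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
HH_PER_EA = 25
TOTAL_HH = 8000
N_EAS = TOTAL_HH // HH_PER_EA  # 320 enumeration areas in total

def draw_sample(frame: pd.DataFrame) -> pd.DataFrame:
    """Stage 1: allocate EAs to strata proportionally to stratum size,
    then sample EAs without replacement within each stratum.
    Stage 2: draw 25 households at random within each selected EA."""
    shares = frame.groupby("stratum").size() / len(frame)
    alloc = (shares * N_EAS).round().astype(int)
    sampled_eas = []
    for stratum, n_eas in alloc.items():
        eas = frame.loc[frame["stratum"] == stratum, "ea_id"].to_numpy()
        sampled_eas.extend(rng.choice(eas, size=n_eas, replace=False))
    rows = []
    for ea in sampled_eas:
        hh_ids = frame.loc[frame["ea_id"] == ea, "hh_ids"].iloc[0]
        rows.extend({"ea_id": ea, "hh_id": hh}
                    for hh in rng.choice(hh_ids, size=HH_PER_EA, replace=False))
    return pd.DataFrame(rows)
```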
The dataset is a synthetic dataset. Although the variables it contains are typically collected in sample surveys or population censuses, no questionnaire is available for this dataset. However, a "fake" questionnaire was created for the sample dataset extracted from this dataset, to be used as training material.
The synthetic data generation process included a set of "validators" (consistency checks against which synthetic observations were assessed and rejected or replaced when needed). Some post-processing was also applied to the data to produce the distributed data files.
This is a synthetic dataset; the "response rate" is 100%.
CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
The dataset was created using the open-source code released by LNDS (Luxembourg National Data Service). It is meant to be an example of the dataset structure anyone can generate and personalize in terms of fixed parameters, including the sample size. The file format is .csv, and the data are organized with individual profiles on the rows and their personal features on the columns. The information in the dataset was generated from statistical information about the age-structure distribution, population counts per municipality, the number of different nationalities present in Luxembourg, and salary statistics per municipality. The STATEC platform, the statistics portal of Luxembourg, is the public source we used to gather the real information ingested into our synthetic generation model. Other features, such as Date of birth, Social matricule, First name, Surname, Ethnicity, and physical attributes, were derived through logical relationships between variables without exploiting any additional real information. In compliance with the law, the risk of completely identifying a real person by chance is close to zero.
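A sketch of the kind of marginal-distribution sampling this paragraph describes. All weights, salary figures, and column names below are placeholders rather than STATEC statistics, and the real LNDS generator covers many more features and logical relationships between variables.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

municipalities = ["Luxembourg", "Esch-sur-Alzette", "Differdange"]
pop_weights = np.array([0.55, 0.30, 0.15])        # placeholder population shares
age_bins = np.array([10, 30, 50, 70])             # placeholder bin midpoints
age_weights = np.array([0.20, 0.35, 0.30, 0.15])  # placeholder age structure
median_salary = {"Luxembourg": 48000, "Esch-sur-Alzette": 42000,
                 "Differdange": 40000}            # placeholder salary statistics

def generate(sample_size: int) -> pd.DataFrame:
    """Draw synthetic profiles from independent marginal distributions."""
    muni = rng.choice(municipalities, size=sample_size, p=pop_weights)
    age = rng.choice(age_bins, size=sample_size, p=age_weights)
    salary = np.array([rng.lognormal(np.log(median_salary[m]), 0.3)
                       for m in muni]).round(-2)
    return pd.DataFrame({"municipality": muni, "age": age, "salary": salary})

print(generate(5))
```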