Median values, interquartile range (IQR) and number of outliers.

Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Median, interquartile range (IQR) and significance level of the difference between discipline medians and distributions for all parameters, and percentage of DH for GS and SG. DH represents 100% for the relative measure. Differences between medians and distributions were significant between all disciplines if marked with *, between GS and SG if marked with 1, between GS and DH if marked with 2, and between SG and DH if marked with 3. If no parameter differed significantly, the column is empty. Columns marked with "—" indicate that the measure was not calculated.

Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
Descriptive statistics, mean ± SD, range, median and interquartile range (IQR).

Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
Median (interquartile range) of percentage of adult respondents with need for and access to care in 53 countries.

Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
*n = 1041 (35 missing data). BMI = body mass index (kg/m²); SD = standard deviation; IQR = interquartile range; EI = energy intake (MJ/d); BMR = basal metabolic rate (MJ/d).

The median, interquartile range (IQR) and range of the minimum (Factors I, II, V, VII, VIII, IX, X) or maximum (PT/INR, aPTT, D-Dimer) factor concentrations/clotting times measured for the 146 patients during their hospital admission.

We include a description of the data sets in the metadata as well as sample code and results from a simulated data set.

This dataset is not publicly accessible because: EPA cannot release personally identifiable information regarding living individuals, according to the Privacy Act and the Freedom of Information Act (FOIA). This dataset contains information about human research subjects. Because there is potential to identify individual participants and disclose personal information, either alone or in combination with other datasets, individual-level data are not appropriate to post for public access. Restricted access may be granted to authorized persons by contacting the party listed. It can be accessed through the following means: The R code is available online here: https://github.com/warrenjl/SpGPCW.

Format:

Abstract: The data used in the application section of the manuscript consist of geocoded birth records from the North Carolina State Center for Health Statistics, 2005-2008. In the simulation study section of the manuscript, we simulate synthetic data that closely match some of the key features of the birth certificate data while maintaining confidentiality of any actual pregnant women.

Availability: Due to the highly sensitive and identifying information contained in the birth certificate data (including latitude/longitude and address of residence at delivery), we are unable to make the data from the application section publicly available. However, we will make one of the simulated datasets available for any reader interested in applying the method to realistic simulated birth records data. This will also allow the user to become familiar with the required inputs of the model, how the data should be structured, and what type of output is obtained. While we cannot provide the application data here, access to the North Carolina birth records can be requested through the North Carolina State Center for Health Statistics and requires an appropriate data use agreement.

Description. Permissions: These are simulated data without any identifying information or informative birth-level covariates. We also standardize the pollution exposures on each week by subtracting the median exposure amount for that week and dividing by the interquartile range (IQR) (as in the actual application to the true NC birth records data). The dataset that we provide includes weekly average pregnancy exposures that have already been standardized in this way, while the medians and IQRs are not given. This further protects against identification of the spatial locations used in the analysis.

File format: R workspace file.

Metadata (including data dictionary):
• y: Vector of binary responses (1: preterm birth, 0: control)
• x: Matrix of covariates; one row for each simulated individual
• z: Matrix of standardized pollution exposures
• n: Number of simulated individuals
• m: Number of exposure time periods (e.g., weeks of pregnancy)
• p: Number of columns in the covariate design matrix
• alpha_true: Vector of “true” critical window locations/magnitudes (i.e., the ground truth that we want to estimate)

This dataset is associated with the following publication: Warren, J., W. Kong, T. Luben, and H. Chang. Critical Window Variable Selection: Estimating the Impact of Air Pollution on Very Preterm Birth. Biostatistics. Oxford University Press, OXFORD, UK, 1-30, (2019).
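
The weekly standardization described above (subtract the week's median, divide by the week's IQR) can be sketched in a few lines. This is only an illustration in Python using the standard library, not the authors' R code: the function name `robust_standardize`, the toy exposure values, and the choice of the "inclusive" quantile method are all assumptions, since the record does not say which quantile definition was used.

```python
from statistics import median, quantiles

def robust_standardize(column):
    """Return (x - median) / IQR for each value in one week's exposures."""
    # Assumption: quartiles use the "inclusive" interpolation method.
    q1, _, q3 = quantiles(column, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("IQR is zero; cannot standardize this week")
    med = median(column)
    return [(x - med) / iqr for x in column]

# Hypothetical exposures: one row per individual, one column per week.
exposures = [
    [10.0, 20.0],
    [12.0, 24.0],
    [14.0, 28.0],
    [16.0, 32.0],
    [18.0, 36.0],
]

# Standardize each week (column) separately, then restore row-per-individual layout.
weeks = [list(week) for week in zip(*exposures)]
z_columns = [robust_standardize(week) for week in weeks]
z = [list(row) for row in zip(*z_columns)]
```

Because each week is centered and scaled by its own median and IQR, the two toy weeks end up on the same scale even though their raw values differ by a factor of two.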

These are simulated data without any identifying information or informative birth-level covariates. We also standardize the pollution exposures on each week by subtracting the median exposure amount for that week and dividing by the interquartile range (IQR) (as in the actual application to the true NC birth records data). The dataset that we provide includes weekly average pregnancy exposures that have already been standardized in this way, while the medians and IQRs are not given. This further protects against identification of the spatial locations used in the analysis.

This dataset is not publicly accessible because: EPA cannot release personally identifiable information regarding living individuals, according to the Privacy Act and the Freedom of Information Act (FOIA). This dataset contains information about human research subjects. Because there is potential to identify individual participants and disclose personal information, either alone or in combination with other datasets, individual-level data are not appropriate to post for public access. Restricted access may be granted to authorized persons by contacting the party listed. It can be accessed through the following means: File format: R workspace file; “Simulated_Dataset.RData”.

Metadata (including data dictionary):
• y: Vector of binary responses (1: adverse outcome, 0: control)
• x: Matrix of covariates; one row for each simulated individual
• z: Matrix of standardized pollution exposures
• n: Number of simulated individuals
• m: Number of exposure time periods (e.g., weeks of pregnancy)
• p: Number of columns in the covariate design matrix
• alpha_true: Vector of “true” critical window locations/magnitudes (i.e., the ground truth that we want to estimate)

Code

Abstract: We provide R statistical software code (“CWVS_LMC.txt”) to fit the linear model of coregionalization (LMC) version of the Critical Window Variable Selection (CWVS) method developed in the manuscript. We also provide R code (“Results_Summary.txt”) to summarize/plot the estimated critical windows and posterior marginal inclusion probabilities.

Description:
“CWVS_LMC.txt”: This code is delivered to the user in the form of a .txt file that contains R statistical software code. Once the “Simulated_Dataset.RData” workspace has been loaded into R, the text in the file can be used to identify/estimate critical windows of susceptibility and posterior marginal inclusion probabilities.
“Results_Summary.txt”: This code is also delivered to the user in the form of a .txt file that contains R statistical software code. Once the “CWVS_LMC.txt” code is applied to the simulated dataset and the program has completed, this code can be used to summarize and plot the identified/estimated critical windows and posterior marginal inclusion probabilities (similar to the plots shown in the manuscript).

Optional Information (complete as necessary)

Required R packages:
• For running “CWVS_LMC.txt”:
  • msm: Sampling from the truncated normal distribution
  • mnormt: Sampling from the multivariate normal distribution
  • BayesLogit: Sampling from the Polya-Gamma distribution
• For running “Results_Summary.txt”:
  • plotrix: Plotting the posterior means and credible intervals

Instructions for Use

Reproducibility (Mandatory)

What can be reproduced: The data and code can be used to identify/estimate critical windows from one of the actual simulated datasets generated under setting E4 from the presented simulation study.

How to use the information:
• Load the “Simulated_Dataset.RData” workspace
• Run the code contained in “CWVS_LMC.txt”
• Once the “CWVS_LMC.txt” code is complete, run “Results_Summary.txt”.

Format: Below is the replication procedure for the attached data set for the portion of the analyses using a simulated data set:

Data: The data used in the application section of the manuscript consist of geocoded birth records from the North Carolina State Center for Health Statistics, 2005-2008. In the simulation study section of the manuscript, we simulate synthetic data that closely match some of the key features of the birth certificate data while maintaining confidentiality of any actual pregnant women.

Availability: Due to the highly sensitive and identifying information contained in the birth certificate data (including latitude/longitude and address of residence at delivery), we are unable to make the data from the application section publicly available. However, we will make one of the simulated datasets available for any reader interested in applying the method to realistic simulated birth records data. This will also allow the user to become familiar with the required inputs of the model, how the data should be structured, and what type of output is obtained. While we cannot provide the application data here, access to the North Carolina birth records can be requested through the North Carolina State Center for Health Statistics and requires an appropriate data use agreement.

Description. Permissions: These are simulated data without any identifying information or informative birth-level covariates. We also standardize the pollution exposures on each week by subtracting the median exposure amount for that week and dividing by the interquartile range (IQR) (as in the actual application to the true NC birth records data). The dataset that we provide includes weekly average pregnancy exposures that have already been standardized in this way, while the medians and IQRs are not given. This further protects against identification of the spatial locations used in the analysis.

This dataset is associated with the following publication: Warren, J., W. Kong, T. Luben, and H. Chang. Critical Window Variable Selection: Estimating the Impact of Air Pollution on Very Preterm Birth. Biostatistics. Oxford University Press, OXFORD, UK, 1-30, (2019).

Median and interquartile range of R0 by serotype and by province.

Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
The median (and interquartile range) of the individuals’ median and interquartile range (range) of both the time on the stone and the number of steps on the stone in the standardized and nonstandardized configurations.

Geoscience Australia's GEOMACS model was used to produce hindcast hourly time series of continental shelf (~20 to 300 m depth) bed shear stress (unit of measure: Pascal, Pa) on a 0.1 degree grid covering the period March 1997 to February 2008 (inclusive). The hindcast data represent the combined contribution to the bed shear stress by waves, tides, wind and density-driven circulation. The parameters calculated to represent the magnitude of the bulk of the data include the quartiles of the distribution: Q25, Q50 and Q75 (i.e. the values below which 25, 50 and 75 percent of the observations fall). The interquartile range of the GEOMACS output takes the observations between Q25 and Q75 to provide an accurate representation of the spread of observations. The interquartile range was shown to provide a more robust representation of the observations than the standard deviation, which produced highly skewed observations (Hughes and Harris 2008). This dataset is a contribution to the CERF Marine Biodiversity Hub and is hosted temporarily by CMAR on behalf of Geoscience Australia.
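
As a rough illustration of the quartile parameters named above, the following Python sketch computes Q25, Q50, Q75 and the interquartile range for a short series. The stress values are made up for the example, the function name `quartile_summary` is hypothetical, and the "inclusive" interpolation method is an assumption; the record does not say how GEOMACS quartiles were interpolated.

```python
from statistics import quantiles

def quartile_summary(values):
    """Return Q25, Q50, Q75 and the interquartile range (Q75 - Q25)."""
    # Assumption: "inclusive" interpolation between order statistics.
    q25, q50, q75 = quantiles(values, n=4, method="inclusive")
    return {"Q25": q25, "Q50": q50, "Q75": q75, "IQR": q75 - q25}

# Hypothetical hourly bed shear stress values in Pa, with one energetic event.
stress = [0.25, 0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 2.0, 4.0]
summary = quartile_summary(stress)

# Share of observations strictly below Q75 (close to, not exactly, 75%
# for a sample this small).
share_below_q75 = sum(v < summary["Q75"] for v in stress) / len(stress)
```

Note that the single large value (4.0 Pa) does not move any of the quartiles, which is the sense in which the IQR describes the bulk of the data.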

https://cdla.io/permissive-1-0/
The dataset has been created specifically for practicing Python, NumPy, Pandas, and Matplotlib. It is designed to provide a hands-on learning experience in data manipulation, analysis, and visualization using these libraries.
Specifics of the Dataset:
The dataset consists of 5000 rows and 20 columns, representing various features with different data types and distributions. The features include numerical variables with continuous and discrete distributions, categorical variables with multiple categories, binary variables, and ordinal variables. Each feature has been generated using different probability distributions and parameters to introduce variations and simulate real-world data scenarios. The dataset is synthetic and does not represent any real-world data. It has been created solely for educational purposes.
One of the defining characteristics of this dataset is the intentional incorporation of various real-world data challenges:
- Certain columns are randomly selected to be populated with NaN values, effectively simulating the common challenge of missing data.
- The proportion of these missing values in each column varies randomly between 1% and 70%.
- Statistical noise has been introduced in the dataset. For numerical values in some features, this noise follows a distribution with mean 0 and standard deviation 0.1.
- Categorical noise is introduced in some features, with categories randomly altered in about 1% of the rows.
- Outliers have also been embedded in the dataset, consistent with the Interquartile Range (IQR) rule.
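
For readers who want to find such outliers, here is a minimal standard-library sketch of the IQR rule that also skips NaN entries. The multiplier k = 1.5 is the conventional Tukey fence and is an assumption here, as are the function name `iqr_outliers`, the sample column, and the "inclusive" quantile method; the dataset description does not say how its outliers were generated.

```python
import math
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR], ignoring NaNs."""
    # Drop missing values first; quantiles() cannot sort NaNs meaningfully.
    clean = [v for v in values if not math.isnan(v)]
    q1, _, q3 = quantiles(clean, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in clean if v < lower or v > upper]

# Hypothetical numeric column with two missing values and one embedded outlier.
column = [10.0, 12.0, float("nan"), 11.0, 13.0, 12.0, 95.0, 11.0, float("nan"), 10.0]
outliers = iqr_outliers(column)
```

In Pandas the same idea is usually written with `Series.quantile(0.25)` and `Series.quantile(0.75)`, which skip NaN values by default.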
Context of the Dataset:
The dataset aims to provide a comprehensive playground for practicing Python, NumPy, Pandas, and Matplotlib. It allows learners to explore data manipulation techniques, perform statistical analysis, and create visualizations using the provided features. By working with this dataset, learners can gain hands-on experience in data cleaning, preprocessing, feature engineering, and visualization.
Sources of the Dataset:
The dataset has been generated programmatically using Python's random number generation functions and probability distributions. No external sources or real-world data have been used in creating this dataset.

The Precipitation Estimation from Remotely Sensed Information using an Artificial Neural Network-Climate Data Record (PERSIANN-CDR) is a satellite-based precipitation dataset for hydrological and climate studies, spanning from 1983 to present. It is the longest satellite-based precipitation record available, with daily data at 0.25° resolution for the 60°S–60°N latitude band.

PERSIANN rain rate estimates are generated at 0.25° resolution and calibrated to a monthly merged in-situ and satellite product from the Global Precipitation Climatology Project (GPCP). The model uses Gridded Satellite (GridSat-B1) infrared data at 3-hourly time steps, with the raw output (PERSIANN-B1) bias-corrected and accumulated to produce the daily PERSIANN-CDR.

The maps show 31 years (1984–2014) of annual and seasonal median and interquartile range (IQR) data. The median represents the 50th percentile of precipitation, and the IQR reflects the range between the 75th and 25th percentiles, showing data variability. Median and IQR are preferred over mean and standard deviation as they are less influenced by extreme values and better represent non-normally distributed data, such as precipitation, which is skewed and zero-limited.

Data and Metadata: NCEI

This is a component of the Gulf Data Atlas (V1.0) for the Physical topic area.
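
The claim that median and IQR are less influenced by extreme values than mean and standard deviation can be checked on a toy series. The sketch below is illustrative only: the precipitation values are invented, the function name `summarize` is hypothetical, and the "inclusive" quantile method is an assumption rather than the method used for the maps.

```python
from statistics import mean, median, quantiles, stdev

def summarize(values):
    """Return mean, sample SD, median and IQR for a list of numbers."""
    q25, _, q75 = quantiles(values, n=4, method="inclusive")
    return {
        "mean": mean(values),
        "sd": stdev(values),
        "median": median(values),
        "iqr": q75 - q25,
    }

# Hypothetical daily precipitation (mm): several dry days, bounded below by zero.
typical = [0.0, 0.0, 0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
# Same series with the wettest day replaced by one extreme event.
with_extreme = typical[:-1] + [120.0]

base = summarize(typical)
extreme = summarize(with_extreme)
```

In this toy case the median and IQR are identical for both series, while the mean and standard deviation of the second series are dominated by the single 120 mm day.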

Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
Median (interquartile range; IQR) demographic and clinical data of participants.

Median response times in seconds (interquartile range in parentheses) as a function of response type and CRT problem.

Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
A: There were no statistically significant differences in baseline characteristics between groups.

Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
This airquality.csv file has 5,999 rows and 5 numeric columns (PM2.5, co2, no2, so2, and o3) with no missing values and no duplicate rows. The variables look like pollutant concentrations, each showing a distinct spread: PM2.5 has a median of 18 with an interquartile range (IQR) of 11–28 (range 3–48); co2 is the most variable, with a median of 1,183 and a long right tail (IQR 625–4,093, range 40–6,999); no2 centers at 48 (IQR 27–174, range 5–300); so2 at 59 (IQR 35–229, range 1–400); and o3 at 123 (IQR 77–167, range 10–250). In short, it is a clean, fully numeric pollution dataset with notable dispersion, especially in co2 and so2, and is well suited for quick EDA (distributions, outliers, correlations) or modeling once you decide on a prediction target.

Proportion of positive results, interquartile range (IQR), minimum-maximum range, and median per diagnostic test at three different time points (baseline) of 24 S. haematobium-positive subjects.