License: https://cdla.io/permissive-1-0/
The dataset has been created specifically for practicing Python, NumPy, Pandas, and Matplotlib. It is designed to provide a hands-on learning experience in data manipulation, analysis, and visualization using these libraries.
Specifics of the Dataset:
The dataset consists of 5000 rows and 20 columns, representing various features with different data types and distributions. The features include numerical variables with continuous and discrete distributions, categorical variables with multiple categories, binary variables, and ordinal variables. Each feature has been generated using different probability distributions and parameters to introduce variations and simulate real-world data scenarios. The dataset is synthetic and does not represent any real-world data. It has been created solely for educational purposes.
One of the defining characteristics of this dataset is the intentional incorporation of various real-world data challenges:
- Certain columns are randomly selected to be populated with NaN values, simulating the common challenge of missing data.
- The proportion of these missing values in each column varies randomly between 1% and 70%.
- Statistical noise has been introduced into the dataset. For numerical values in some features, this noise follows a distribution with mean 0 and standard deviation 0.1.
- Categorical noise has been introduced into some features, with categories randomly altered in about 1% of the rows.
- Outliers have also been embedded in the dataset, in line with the Interquartile Range (IQR) rule.
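The missing-value share and the IQR rule mentioned above can be checked on a single column with a few lines of code. The following is a minimal sketch that uses only the Python standard library; the column values, the 10% missing share, and the conventional multiplier k = 1.5 are assumptions chosen for illustration, not documented properties of the dataset.

```python
import math
import random
import statistics

def missing_fraction(values):
    """Share of entries that are NaN (the missing-value marker)."""
    return sum(1 for v in values if math.isnan(v)) / len(values)

def iqr_outliers(values, k=1.5):
    """Return the non-missing values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    clean = [v for v in values if not math.isnan(v)]
    q1, _, q3 = statistics.quantiles(clean, n=4)  # the three quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in clean if v < low or v > high]

# Hypothetical column: 1000 draws from N(50, 5) plus N(0, 0.1) noise.
rng = random.Random(0)
column = [rng.gauss(50, 5) + rng.gauss(0, 0.1) for _ in range(1000)]
for i in rng.sample(range(1000), 100):  # blank out 100 rows (assumed share)
    column[i] = math.nan
column += [150.0, -60.0]                # two planted extreme values

outliers = iqr_outliers(column)
print(f"missing: {missing_fraction(column):.1%}, outliers found: {len(outliers)}")
```

The same two checks map directly onto the libraries the dataset is meant for, but the standard-library version keeps the logic of each step visible.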
Context of the Dataset:
The dataset aims to provide a comprehensive playground for practicing Python, NumPy, Pandas, and Matplotlib. It allows learners to explore data manipulation techniques, perform statistical analysis, and create visualizations using the provided features. By working with this dataset, learners can gain hands-on experience in data cleaning, preprocessing, feature engineering, and visualization.
Sources of the Dataset:
The dataset has been generated programmatically using Python's random number generation functions and probability distributions. No external sources or real-world data have been used in creating this dataset.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Trusted Research Environments (TREs) enable analysis of sensitive data under strict security assertions that protect the data, through technical, organizational, and legal measures, from being (accidentally) leaked outside the facility. While many TREs exist in Europe, little information is publicly available on their architecture, their building blocks, and their slight technical variations. To shed light on these problems, we give an overview of existing, publicly described TREs and a bibliography linking to the system descriptions. We further analyze their technical characteristics, especially their commonalities and variations, and provide insight into their data type characteristics and availability. Our literature study shows that 47 TREs worldwide provide access to sensitive data, of which two-thirds provide data themselves, predominantly via secure remote access. Statistical offices make available the majority of the sensitive data records included in this study.
We performed a literature study covering 47 TREs worldwide using scholarly databases (Scopus, Web of Science, IEEE Xplore, Science Direct), a computer science library (dblp.org), Google and grey literature focusing on retrieving the following source material:
The goal of this literature study is to discover existing TREs and to analyze their characteristics and data availability, giving an overview of the available infrastructure for sensitive data research, as many European initiatives have emerged in recent months.
This dataset consists of five comma-separated values (.csv) files describing our inventory:
Additionally, a MariaDB (10.5 or higher) schema definition .sql file is needed, properly modelling the schema for databases:
The analysis was done in a Jupyter Notebook, which can be found in our source code repository: https://gitlab.tuwien.ac.at/martin.weise/tres/-/blob/master/analysis.ipynb
License: https://creativecommons.org/publicdomain/zero/1.0/
By FiveThirtyEight [source]
This dataset contains survey responses about people's daily weather report usage and how they check the weather. It consists of columns such as Do You Typically Check a Daily Weather Report?, How do you Typically Check the Weather?, If You Had a Smartwatch (like the Soon to be Released Apple Watch), How Likely or Unlikely Would You Be to Check the Weather on That Device?, Age, What is Your Gender?, and US Region. With this data, we can explore usage patterns in checking daily weather reports across different regions, genders, ages and preferences for smartwatch devices. This dataset offers insight into current attitudes towards checking the weather with technology, and by understanding these patterns better, we can create more engaging experiences tailored to individuals' needs.
To get started, it is helpful to first examine the columns in the dataset. The columns are Do you typically check a daily weather report?, How do you typically check the weather?, If you had a smartwatch (like the soon to be released Apple Watch), how likely or unlikely would you be to check the weather on that device?, Age, What is your gender?, US Region. Each row contains data for one survey participant, with their answers for each column included in each row.
The data can be used for exploring correlations between factors such as age, gender, region/location, and daily weather checking habits/preferences. Some of these variables are numerical (such as age) and others are categorical (such as gender). You can use this data when creating visualizations showing relationships between these factors. You may also want to create summary tables showing average values for different categories of each factor, allowing for easy comparison across groups or over time periods (depending on how much historical data is available).
- Analyzing trends in the usage of daily weather reports by age, gender and region.
- Exploring consumer preferences for checking the weather via smartwatches and mobile devices in comparison to other methods (e.g., TV or radio).
- Examining correlations between people's likelihood to check their daily weather report and their demographic characteristics (location, age, gender)
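A summary table of the kind described above can be sketched as a per-group share of one answer. This is a hedged illustration: the two column names come from the dataset description, but the rows, the answer values ("Yes"/"No") and the region names are made up, and the helper `share_by_group` is hypothetical. In practice the rows could be read from weather-check.csv with `csv.DictReader`.

```python
from collections import Counter, defaultdict

CHECK = "Do you typically check a daily weather report?"
REGION = "US Region"

# Made-up rows for illustration; real answer values may differ.
rows = [
    {CHECK: "Yes", REGION: "Pacific"},
    {CHECK: "No", REGION: "Pacific"},
    {CHECK: "Yes", REGION: "New England"},
    {CHECK: "Yes", REGION: "New England"},
]

def share_by_group(rows, group_col, answer_col, answer="Yes"):
    """Fraction of respondents in each group who gave `answer`."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[group_col]][row[answer_col]] += 1
    return {group: c[answer] / sum(c.values()) for group, c in counts.items()}

shares = share_by_group(rows, REGION, CHECK)
for region, share in shares.items():
    print(f"{region}: {share:.0%} check a daily weather report")
```

Swapping `REGION` for the age or gender column gives the other breakdowns listed above.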
If you use this dataset in your research, please credit the original authors.
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication. No Copyright: you can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission.
File: weather-check.csv

| Column name | Description |
|:------------|:------------|
| Do you typically check a daily weather report? | This column indicates whether or not the respondent typically checks a daily weather report. (Categorical) |
| How do you typically check the weather? | This column indicates how the respondent typically checks the weather. (Categorical) |
| If you had a smartwatch (like the soon to be released Apple Watch), how likely or unlikely would you be to check the weather on that device? | This column indicates how likely or unlikely the respondent would be to check the weather on a smartwatch. (Categorical) |
| Age | This column indicates the age of the respondent. (Numerical) |

...
By Throwback Thursday [source]
Throwback Thursday: US Christmas Tree Sales
This dataset provides a comprehensive record of the annual Christmas tree sales in the United States from 2010 to 2016. The dataset consists of six columns which include relevant information about each year's sales data.
The Year column indicates the specific year in which the Christmas tree sales data was recorded, allowing analysts to compare and track trends over time.
The Type of tree column specifies the various species or types of Christmas trees that were sold during each particular year, enabling researchers to analyze market preferences and consumer choices.
The Number of trees sold column represents the total quantity of Christmas trees that were purchased by customers in a given year. Identifying fluctuations in this metric can offer insights into changes in demand and market performance.
The Average Tree Price column provides important information on pricing dynamics within the industry. By calculating and tracking this average price for each year, analysts can assess variations in consumer spending behavior as well as identify potential economic factors influencing purchasing decisions.
Finally, the Sales column presents valuable data on total revenue generated from these Christmas tree sales annually. This metric offers a holistic perspective on market performance and business profitability within the holiday season.
Overall, this detailed dataset serves as a reliable resource for researchers aiming to understand historical trends and patterns within the US Christmas tree industry from 2010 to 2016. By analyzing variations across years, types of trees, number of units sold, average prices, and total sales revenue statistics, professionals can gain meaningful insights into consumer preferences while also uncovering opportunities for growth or operational improvements within this festive market segment
Introduction:
Year: The column Year indicates the specific year in which the Christmas tree sales data was recorded. You can analyze trends over time by grouping data by year or comparing different years' performance.
Type of tree: The Type of tree column specifies the type or species of Christmas trees sold. This information allows you to analyze which types are popular among consumers and explore any notable shifts or preferences over time.
Number of trees sold: The Number of trees sold column represents the total count or quantity of Christmas trees sold in a given year. You can perform various analyses such as finding annual growth rates, identifying peak selling years, or comparing sales between different types of trees.
Average Tree Price: The Average Tree Price column indicates the average price at which each Christmas tree was sold in a particular year. By analyzing this data, you can identify pricing trends across different types of trees and understand consumer behavior regarding affordability and willingness to pay.
Sales: The Sales column represents the total revenue generated from Christmas tree sales in a given year. This information allows you to assess overall market performance, compare revenue generated by different types of trees, or calculate yearly growth rates.
Example Analysis:
a) Analyzing Revenue Over Time: Plotting a line graph with years on X-axis and sales revenue on Y-axis will help visualize if there is any increasing or decreasing trend in total revenue for all years combined.
b) Comparing Average Tree Prices: Create a bar chart comparing the average prices of different tree types. This analysis can reveal insights into consumer preferences and price elasticity for specific tree species.
c) Correlation Analysis: Explore the relationship between the number of trees sold and sales revenue by calculating correlation coefficients or creating a scatter plot. This will help identify if increased sales volume directly correlates to higher revenue.
d) Seasonal Variations: Analyze seasonal patterns by grouping data month-wise or quarter-wise. This can provide insights into peak buying periods, allowing businesses to optimize marketing strategies around these times. This analysis requires monthly or quarterly figures, whereas the dataset as described above records annual sales.
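Analyses a) and c) above can be sketched in a few lines. The figures below are made up purely for illustration and are NOT values from the dataset; the sketch also assumes one row per year, whereas the real file may hold several tree types per year that would first need to be summed.

```python
import math

# Made-up illustrative figures, not taken from the dataset.
records = [
    {"Year": 2010, "Number of trees sold": 100, "Sales": 1000.0},
    {"Year": 2011, "Number of trees sold": 110, "Sales": 1210.0},
    {"Year": 2012, "Number of trees sold": 121, "Sales": 1452.0},
]

def yearly_growth(values):
    """Relative change from each year to the next."""
    return [(b - a) / a for a, b in zip(values, values[1:])]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

sold = [r["Number of trees sold"] for r in records]
sales = [r["Sales"] for r in records]
growth = yearly_growth(sales)   # revenue growth, year over year
r = pearson(sold, sales)        # volume vs. revenue
print([f"{g:.1%}" for g in growth], f"r = {r:.3f}")
```

A high correlation here only says that volume and revenue move together; the average price column is what separates a volume effect from a price effect.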
Conclusion:
- Analyzing the trends in Christmas tree sales over the years: By examining the number of trees sold, average tree price, and sales revenue for each year, this dataset can provide insights into consumer preferences and economic factors that ...
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Gulf Stream paths (daily, monthly, and annual) from 1993-01-01 to 2023-12-31 are identified via the longest 25-cm sea surface height contour in the Northwest Atlantic (75°W–55°W; 33°N–43°N) from the daily 1/8° resolution maps of absolute dynamic topography from the E.U. Copernicus Marine Service product Global Ocean Gridded Level 4 Sea Surface Heights and Derived Variables Reprocessed 1993 Ongoing, following the methodology of Andres (2016). The daily sea surface height fields are averaged to monthly and annual fields to identify the corresponding monthly and annual Gulf Stream paths.
Additionally, an updated Gulf Stream destabilization point time series (1993–2023), which builds upon the work of Andres (2016), was generated using the E.U. Copernicus Marine Service product Global Ocean Gridded Level 4 Sea Surface Heights and Derived Variables Reprocessed 1993 Ongoing (1/8°). Similar to Andres (2016), the monthly Gulf Stream path is identified as the 25-cm SSH contour from absolute dynamic topography maps. The 12 monthly mean paths are divided yearly into 0.5° longitude bins (from 75°W to 55°W). In some months, the Gulf Stream can take a meandering path and contort over itself in an “S” curve. In these cases, the northernmost latitude is used in the variance calculation to resolve the issue of multiple latitudes for a single longitude. The variance of the Gulf Stream position (latitude) is then calculated for each year using the 12 monthly mean paths. The destabilization point is defined as the first downstream distance (longitude) at which the variance of the Gulf Stream position exceeds 0.4 (°)², which differs from the original threshold value of 0.5 (°)² in Andres (2016). The threshold value of 0.4 (°)² is the 70th percentile of variance for all years, which marks the transition from a relatively stable jet to an unstable, meandering current in the new higher-resolution (1/8°) maps of absolute dynamic topography.
Thanks to improvements in recent years in processing and combining satellite altimeter data (Taburet et al., 2019), the maps of absolute dynamic topography differ from the maps used by Andres (2016), which had 1/4° resolution. To account for the differences in the resolution of the data and corrections to the processing standards of altimeter data, a new threshold value was chosen that is consistent with the methods of Andres (2016), i.e., the threshold still signifies the transition between a stable and unstable Gulf Stream. However, a lower threshold value is necessary in the new absolute dynamic topography maps, since finer-resolution data can separate distinct local maxima in variance that could be smoothed together in coarser data, which may cause the destabilization point to be identified further downstream if the threshold were not adjusted. The 70th percentile of variance (0.4 (°)²) for all years (1993–2023) was chosen as the threshold because the distribution of variance is right-skewed with a long tail, and the 70th percentile separates the lower variance associated with meridional shifts in the Gulf Stream path from the extreme, vigorous meandering that occurs downstream of the "destabilization point".
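The destabilization point definition above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the input layout (one dict per month mapping a longitude bin to the latitudes where the contour crosses it), the function name `destabilization_point`, the toy path values, and the use of population variance are all hypothetical choices. Only the threshold of 0.4 squared degrees, the 12 monthly paths, and the northernmost-latitude rule for "S" curves come from the text.

```python
import statistics

THRESHOLD = 0.4  # squared degrees of latitude, as stated in the text

def destabilization_point(monthly_paths, bins, threshold=THRESHOLD):
    """Return the first longitude bin (ordered upstream to downstream)
    whose latitude variance across the monthly paths exceeds the threshold,
    or None if no bin does.

    monthly_paths: list of 12 dicts, bin longitude -> list of latitudes.
    A bin holds several latitudes when the path folds into an "S" curve;
    the northernmost one is used.
    """
    for lon in bins:
        latitudes = [max(path[lon]) for path in monthly_paths]
        # Population variance is an assumption; the text does not say
        # whether a sample or population estimator was used.
        if statistics.pvariance(latitudes) > threshold:
            return lon
    return None

# Toy data: four 0.5-degree bins with meanders growing downstream.
bins = [-75.0, -74.5, -74.0, -73.5]
amplitude = {-75.0: 0.1, -74.5: 0.2, -74.0: 1.0, -73.5: 1.5}
monthly_paths = []
for month in range(12):
    sign = 1 if month % 2 == 0 else -1
    monthly_paths.append({lon: [38.0 + sign * amplitude[lon]] for lon in bins})
monthly_paths[0][-75.0].append(37.5)  # "S" curve: extra, more southerly crossing

print(destabilization_point(monthly_paths, bins))
```

In the toy data the variance in each bin equals the squared amplitude, so the first bin above 0.4 is the one at 74°W.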
The daily, monthly, and annual Gulf Stream paths and the updated destabilization point time series were generated using the E.U. Copernicus Marine Service product Global Ocean Gridded Level 4 Sea Surface Heights and Derived Variables Reprocessed 1993 Ongoing (https://doi.org/10.48670/moi-00148).
License: https://spdx.org/licenses/CC0-1.0.html
Besides variation among animal species in ground colour and in the number, size, shape, and distribution of pattern elements, there is also considerable intraspecific variation in colour patterns that can manifest both between populations inhabiting different environments, and among individuals within populations. In previous investigations into the consequences of inter-individual variation in colour patterns in moths we have relied on a discrete classification with three categories: non-variable; variable; or highly variable colour patterns, as jointly assessed by Per-Eric Betzholtz and Markus Franzén (e.g., Forsman et al. 2015, 2016, Franzén et al. 2019). Here we provide the raw data from the anonymized assessments of colour pattern variation of 489 species of Erebidae and Noctuidae moths in Sweden performed by twelve lepidopterologists with extensive experience and expertise of the moth fauna in Sweden. In addition, the raw data (on a discrete scale) provided by the experts is used to generate a continuously distributed measure of the intra-specific colour pattern variation in moths. Despite variation among the independent scorers in their assessments of the average level of forewing colour pattern variation, there were statistically significant consistent differences in average colour pattern variation among the different species of moths (for details see Supporting Information I in Betzholtz et al. 2019).
References
Betzholtz, P.-E., A. Forsman, and M. Franzén. 2019. Inter-individual variation in colour patterns in noctuid moths characterizes long-distance dispersers and agricultural pests. Journal of Applied Entomology 143: 992-999.
Forsman, A., P. E. Betzholtz, and M. Franzén. 2015. Variable coloration is associated with dampened population fluctuations in noctuid moths. Proceedings of the Royal Society B 282: 20142922.
Forsman, A., P. E. Betzholtz, and M. Franzén. 2016. Faster poleward range shifts in moths with more variable colour patterns. Scientific Reports 6: 36265.
Franzén, M., P. E. Betzholtz, and A. Forsman. 2019. Variable color patterns influence continental range size and species-area relationships on islands. Ecosphere 10: e02577.
Methods
Data on expert assessments of colour pattern variation in moths
To obtain a quantitative measure of the inter-individual variation in colour pattern of moths, a questionnaire with instructions was distributed together with an Excel file that contained a list of 489 moth species in the families Noctuidae and Erebidae (according to Aarvik et al. 2017) to 26 lepidopterologists with extensive experience and expertise of the moth fauna in Sweden (for details see Supporting Information I-II in Betzholtz et al. 2019). Respondents were asked to classify species for inter-individual colour pattern variability as 0 (non-variable), 1 (variable), or 2 (highly variable) based on their own personal experience of individuals observed or captured in Sweden until 2015. Species that were sexually dichromatic were classified as having variable coloration only if variation was apparent within one or both sexes; otherwise they were considered non-variable. Twelve of the expert lepidopterologists provided independent classifications of colour pattern variation for all or for a subset of the 489 moth species. To obtain a continuous measure of colour pattern variability for each species, a mean value of colour pattern variation was calculated for each species across the values assigned by the twelve independent experts. The mean values were then divided by 2 (the highest possible value) to generate a variable with a continuous distribution that ranges from 0 to 1. Finally, the resulting values (+0.1) were transformed to natural logarithms to generate an approximately normal distribution (see Supporting Information Fig S2 in Betzholtz et al. 2019).
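The scoring steps above (mean across experts, division by 2, natural logarithm after adding 0.1) can be sketched for a single species. The function name `colour_variation_index`, the use of `None` for an expert who did not rate the species, and the example scores are assumptions for illustration; the sketch also reads "(+0.1)" as adding 0.1 before taking the logarithm.

```python
import math

def colour_variation_index(scores):
    """Turn expert ratings (0, 1 or 2; None = species not rated by that
    expert) into the continuous 0-1 measure and its log-transformed value."""
    rated = [s for s in scores if s is not None]
    mean_score = sum(rated) / len(rated)
    scaled = mean_score / 2               # 2 is the highest possible rating
    return scaled, math.log(scaled + 0.1) # natural log of (value + 0.1)

# Hypothetical ratings from five experts, one of whom skipped the species.
scaled, log_value = colour_variation_index([0, 1, 2, 1, None])
print(scaled, round(log_value, 4))
```

The added constant keeps the logarithm defined for species that every expert rated as non-variable, where the scaled value is exactly 0.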
References
Aarvik, L., B. Å. Bengtsson, H. Elven, P. Ivinskis, U. Jürivete, O. Karsholt, M. Mutanen, and N. Savenkov. 2017. Nordic-Baltic Checklist of Lepidoptera. Norwegian Journal of Entomology Supplement 3: 1-236.
Betzholtz, P.-E., A. Forsman, and M. Franzén. 2019. Inter-individual variation in colour patterns in noctuid moths characterizes long-distance dispersers and agricultural pests. Journal of Applied Entomology 143: 992-999.
License: Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Intensive longitudinal interventions (ILIs) have gained prominence as powerful tools for treating and preventing mental and behavioral disorders (Heron & Smyth, 2010). However, most studies that analyze ILI data use traditional methods like ANOVA or linear mixed models, which overlook individual differences and the autocorrelation structure inherent in time series data (Hedeker et al., 2008). Moreover, existing methods typically assess intervention effects based solely on changes in the mean level of key variables (e.g., anxiety). This study demonstrates how to model ILI data within the framework of dynamic structural equation modeling (DSEM) to evaluate intervention effects across three dimensions: mean, autoregression, and individual intra-variation (IIV), for two intervention designs: non-randomized single-arm trial (NST) and randomized control trial (RCT). We conducted two simulation studies to investigate sample size recommendations for DSEM in ILI studies, considering both statistical power and accuracy in parameter estimation (AIPE). Additionally, we compared the two designs based on type I error rate in a separate simulation. Finally, we illustrated sample size planning using data from a pre-ILI study focused on reducing appearance anxiety.

Simulation Studies 1 and 2 investigated the power and AIPE across varying sample sizes, as well as the required sample size for both NST and RCT designs. The effect sizes of intervention effects for mean, autoregression and IIV were fixed at the medium level. Two factors regarding sample size were manipulated: number of participants (N = 30, 60, 100, 150, 200, 300, 400) and number of time-points (T = 10, 20, 40, 60, 80, 100). The data-generating models and fitted models were identical, with analysis conducted using Mplus 8.10 and Bayesian estimation. Model performance was assessed in terms of convergence rate, power and AIPE for intervention effects, as well as bias in the standard errors of the intervention effects. Simulation Study 3 assessed the type I error rate for both designs when the change in the control group differed from zero, indicating a change (on average) due to time. Last, the empirical study conducted sample size planning based on a pre-study aimed at reducing appearance anxiety using an ILI design.

The results are as follows. First, there were no convergence issues under any of the conditions. Second, power increased and the width of the credible intervals decreased as either N or T increased. However, a minimum of 60 participants was required to achieve adequate power (i.e., ). The relative bias in intervention effects was generally small, with two exceptions: in the NST design, the intervention effects on autoregression and IIV were underestimated when the number of time-points was low (i.e., T = 10 or 20), and in the RCT design, the intervention effect on mean was underestimated when sample sizes at both levels were small (i.e., N = 30 or 60, T = 10). Bias in the standard error was also minimal across conditions. Third, a credible interval width contour plot could be applied to recommend sample sizes in DSEM. The sample size requirements based on power and AIPE differed between the NST and RCT designs, with RCT requiring larger samples due to the addition of a control group. Fourth, when a natural change (on average) occurred between pre- and post-intervention phases, the NST design led to inflated type I error rates compared to the RCT design, particularly with larger sample sizes.

In conclusion, we first recommend using DSEM to analyze ILI data, as it better captures intervention effects on mean, autoregression, and IIV. Second, practitioners should select either the NST or RCT design based on theoretical and empirical considerations. While the RCT design controls for confounding factors like time-related changes in mean, it requires a larger sample size. NST designs are usually conducted before large RCTs with relatively small samples, especially for rare participants. Finally, choosing the true parameters for the data-generating model is crucial in sample size planning using a Monte Carlo method. We suggest deriving these parameters from pre-studies, similar empirical studies or meta-analyses when possible, as many parameters (i.e., regarding fixed effects and random effects) must be set in DSEM. If no prior information is available, we suggest following the procedures outlined in this study.

This database includes the code for data generation and analysis in the simulation studies, and the data, code and results for the empirical example.
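To make the three intervention effects concrete, the sketch below generates one participant's series from a first-order autoregressive process whose mean, autoregression, and innovation standard deviation all change after the intervention. It is a simplified stand-in written for illustration, not the study's data-generating model: it covers a single participant without random effects, treats IIV as the innovation variance (an assumption), uses hypothetical parameter values, and does not perform any DSEM estimation, which the study carried out in Mplus.

```python
import random
import statistics

def simulate_phase(rng, n, mean, phi, innovation_sd, start=None):
    """AR(1) phase: y_t = mean + phi * (y_{t-1} - mean) + e_t,
    with e_t drawn from N(0, innovation_sd)."""
    y_prev = mean if start is None else start
    series = []
    for _ in range(n):
        y_prev = mean + phi * (y_prev - mean) + rng.gauss(0, innovation_sd)
        series.append(y_prev)
    return series

def simulate_person(rng, t_pre, t_post, pre, post):
    """Pre-intervention phase followed by a post-intervention phase that
    starts from the last pre-intervention value."""
    before = simulate_phase(rng, t_pre, **pre)
    after = simulate_phase(rng, t_post, start=before[-1], **post)
    return before, after

# Hypothetical parameter values, chosen only for illustration.
pre = {"mean": 5.0, "phi": 0.4, "innovation_sd": 1.0}
post = {"mean": 4.0, "phi": 0.2, "innovation_sd": 0.7}

before, after = simulate_person(random.Random(1), 50, 50, pre, post)
print(round(statistics.mean(before), 2), round(statistics.mean(after), 2))
```

Repeating such a generator over many participants and replications, and fitting the intended model to each replication, is the general shape of a Monte Carlo sample size study.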
License: https://spdx.org/licenses/CC0-1.0.html
It has long been suspected that the rate of mutation varies across the human genome at a large scale based on the divergence between humans and other species. However, it is now possible to directly investigate this question using the large number of de novo mutations (DNMs) that have been discovered in humans through the sequencing of trios. We investigate a number of questions pertaining to the distribution of mutations using more than 130,000 DNMs from three large datasets. We demonstrate that the amount and pattern of variation differ between datasets at the 1MB and 100KB scales, probably as a consequence of differences in sequencing technology and processing. In particular, datasets show different patterns of correlation to genomic variables such as replication time. Nevertheless, there are many commonalities between datasets, which likely represent true patterns. We show that there is variation in the mutation rate at the 100KB, 1MB and 10MB scales that cannot be explained by variation at smaller scales; however, the level of this variation is modest at large scales: at the 1MB scale we infer that ~90% of regions have a mutation rate within 50% of the mean. Different types of mutation show similar levels of variation and appear to vary in concert, which suggests the pattern of mutation is relatively constant across the genome. We demonstrate that variation in the mutation rate does not generate large-scale variation in GC-content, and hence that mutation bias does not maintain the isochore structure of the human genome. We find that genomic features explain less than 40% of the explainable variance in the rate of DNM. As expected, the rate of divergence between species is correlated to the rate of DNM. However, the correlations are weaker than expected if all the variation in divergence was due to variation in the mutation rate. We provide evidence that this is due to the effect of biased gene conversion on the probability that a mutation will become fixed.
In contrast to divergence, we find that most of the variation in diversity can be explained by variation in the mutation rate. Finally, we show that the correlation between divergence and DNM density declines as increasingly divergent species are considered.
License: https://spdx.org/licenses/CC0-1.0.html
Seed size affects individual fitness in wild plant populations, but its ability to evolve may be limited by low narrow-sense heritability (h2). h2 is estimated as the proportion of total phenotypic variance (s2P) attributable to additive genetic variance (s2A), so low values of h2 may be due to low s2A (potentially eroded by natural selection) or to high values of the other factors that contribute to s2P, such as extranuclear maternal effects (m2) and environmental variance effects (e2). Here, we reviewed the published literature and performed a meta-analysis to determine whether h2 of seed size is routinely low in wild populations and, if so, which components of s2P contribute most strongly to total phenotypic variance. We analyzed available estimates of narrow-sense heritability (h2) of seed size, as well as the variance components contributing to these parameters. Maternal and environmental components of s2P were significantly greater than s2A, dominance, paternal, and epistatic components. These results suggest that low h2 of seed size in wild populations (the mean value observed in this study was 0.13) is due to both high values of maternally derived and environmental (residual) s2, and low values of s2A in seed size. The type of breeding design used to estimate h2 and m2 also influenced their values, with studies using diallel designs generating lower variance ratios than nested and other designs. e2 was not influenced by breeding design. For some breeding designs, the number of genotypes included in a study also influenced the resulting h2 and e2 estimates, but not m2. Our data support the view that a diallel design is better suited than the alternatives for the accurate estimation of s2A in seed size due to its factorial design and the inclusion of reciprocal crosses, which allows the independent estimation of both additive and non-additive components of variance. 
Methods
Data search
We performed a literature search to investigate published estimates of variance components for seed size in wild plant species. We used a query containing the following keywords: (“heritab*” or “variance component*”) and (“seed size” or “seed mass” or “seed weight”). We included only studies that examined the quantitative genetics of seed size of wild plant populations and excluded those that investigated agricultural or commercial species. We also excluded studies that pooled genotypes from multiple populations to estimate species-level heritabilities. Studies included for analysis reported narrow- and/or broad-sense heritabilities from individual populations, the variance components (as raw values) used to estimate these parameters, or both. We performed our search using the Web of Science search engine.
Data extraction
From all selected studies, we extracted the following estimated parameters: 1) raw values of variance components of seed size (s2A, s2M, s2P, s2E, s2D, s2K; Table 1); and 2) narrow-sense heritability (h2 = s2A:s2P) when available. Whenever a raw variance component was reported for the same population but in different years, we averaged the values and reported a single value. When parameters were reported for more than one population, we included each estimate in our data set. In cases where seed size was measured in multiple ways (e.g., seed weight, seed length, or seed area), we used the value of seed weight and discarded the others. Additionally, for each published parameter we recorded the following: publication identity, breeding design, and the number of maternal and paternal genotypes used in the breeding design that generated the parameter estimate. Breeding design refers to the pattern by which controlled pollinations were performed or by which natural pollinations occurred in each study.
Categories examined here include: diallel and nested designs, clonal replication (hereafter “clones”), and autogamously self-fertilizing genotypes (hereafter “selfing”). Studies that reported heritability estimates but no breeding design (e.g., naturally pollinated maternal lines) were also excluded because such estimates were derived from open-pollinated genotypes and were likely to be confounded by environmentally induced maternal effects.
Data analysis
Standardization of variance components. In order to compare the relative contributions to seed size of each type of variance component, in each published study used here, we standardized the raw variance components by calculating variance ratios, which were computed by dividing a given raw variance component by the total phenotypic variance in seed size (i.e., the sum of all reported variance components, including the focal raw variance component). h2, for instance, is equivalent to additive genetic variance divided by total phenotypic variance (s2A:s2P). In the same manner, we defined m2 as the maternal variance component divided by total phenotypic variance (s2M:s2P); and e2 as the environmental (residual) variance component divided by total phenotypic variance (s2E:s2P). The same procedure was applied to the remaining components of seed size: paternal variance (Pat2 or s2Pat:s2P), dominance variance (d2 or s2D:s2P), and epistasis variance (k2 or s2K:s2P). Such standardized values are well-suited for comparisons among species, among independently conducted studies, and among different units of measurement (e.g., mass vs. linear measures) because standardized components are unitless.
Model construction. Because the variance ratios analyzed here originate from publications that, in some cases, reported multiple estimates of the same variance component per publication, such estimates were not fully independent, thereby violating an assumption of ordinary least-squares methods.
To account for variation among publications in the variance ratios of seed size, we analyzed the data with generalized linear mixed-effects models (GLMMs) that included publication identity as a random effect. We chose to include only publication identity as a random effect because most studies estimated variance components for a single species, so that publication and species identity contain nearly the same information and their simultaneous inclusion in the model as random effects would be problematic. The number of maternal and paternal genotypes used to estimate each variance component varied greatly among published studies (range = 2-170 genotypes; Table S1), and we reasoned that the number of genotypes contributing to a genetically determined parameter estimate might influence its estimated value. Accordingly, we controlled for variation among studies in the number of genotypes sampled by including a weighting factor in the GLMMs. Weights were estimated as the number of genotypes (maternal or paternal, depending on the variance component) used in each study. Specifically, we used the number of paternal genotypes as weightings for models using estimates of h2 or Pat2 because these parameters are usually estimated using trait variation derived from the paternal lines. We used the number of maternal genotypes as weightings for models using estimates of m2 and e2 because maternal and environmental sources of variance in seed size are more likely to be influenced by maternal than by paternal genotypes. We used the number of maternal genotypes as weightings for models using estimates of d2 and k2 because these parameters may be influenced by the number of either maternal or paternal genotypes. Among the studies analyzed here, the number of maternal genotypes equaled or exceeded the number of paternal genotypes (Table S1), so using the higher number would take this into account.
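The standardization and weighting rules described above can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the function names, the dictionary keys, and the example numbers are assumptions made for this sketch, and the total phenotypic variance is taken to be the sum of whichever components a study reported, as the text states.

```python
# Map each raw variance component to the name of its standardized ratio.
RATIO_NAME = {"A": "h2", "M": "m2", "E": "e2", "Pat": "Pat2", "D": "d2", "K": "k2"}

# Ratios weighted by the number of paternal genotypes; all others use maternal counts.
PATERNAL_WEIGHTED = {"h2", "Pat2"}


def variance_ratios(raw_components):
    """Divide each reported raw component by the sum of all reported components."""
    total = sum(raw_components.values())
    if total <= 0:
        raise ValueError("total phenotypic variance must be positive")
    return {RATIO_NAME[key]: value / total for key, value in raw_components.items()}


def genotype_weight(ratio_name, n_maternal, n_paternal):
    """Pick the genotype count used as the model weight for one ratio type."""
    return n_paternal if ratio_name in PATERNAL_WEIGHTED else n_maternal


# Hypothetical study reporting additive, maternal, and residual variance.
example_raw = {"A": 2.0, "M": 5.0, "E": 3.0}
example_ratios = variance_ratios(example_raw)
example_weights = {
    name: genotype_weight(name, n_maternal=20, n_paternal=10)
    for name in example_ratios
}
```

Because every ratio shares the same denominator, the ratios from one study sum to one, which is what makes them comparable across studies that measured seed size in different units.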
GLMMs are robust tools for the analysis of variables that vary over multiple levels and can be used with alternative distributional assumptions of the residuals. Because the variance ratios that we used as response variables have values in the closed interval from zero to one (with many zero values present), distributions commonly employed to model proportion data, such as the beta distribution, were inappropriate for our response variables. Because of this, we used a pseudo-likelihood approach in which the relationship between the mean and the variance of the observations, and the range of the response (but not its precise distribution), are assumed. Specifically, we used a quasi-binomial GLMM with a logit link function for all models, which uses the variance structure of a binomial distribution while allowing for continuous values in the [0, 1] range. We fitted all GLMMs with the ‘glmmPQL’ function of the R package ‘MASS’ version 7.3-58.1, which uses penalized quasi-likelihood (PQL) for parameter estimation.

Comparison of variance components of seed size. To compare the means of the distinct proportional variance component types (h2, d2, Pat2, m2, e2, and k2), we used a quasi-binomial GLMM with the observed value of each variance ratio estimate as a response, and the type of variance ratio for each observation as a categorical predictor. Publication identity was included as a random effect, and each observation was weighted based on the number of paternal or maternal genotypes used in estimating the reported ratio. As described above, whether we used the number of maternal or paternal genotypes depended on the identity of the variance ratio to which each observation corresponded.
Marginal means of each variance ratio type and pairwise statistical comparisons between ratio types were conducted using the ‘emmeans’ function of the ‘emmeans’ R package version 1.7.5, with Tukey contrasts used to assess the significance of pairwise differences, and a Tukey correction of p-values to account for multiple hypothesis testing. Heritability and variance components of seed mass in relation to breeding design and number of genotypes. To determine whether breeding design and number of genotypes influenced h2, m2, and e2 of seed size, we fitted
As drought is the major bottleneck for rain-fed tef (Eragrostis tef) production, developing a workable strategy that can mitigate its impacts is mandatory. To draw up this strategy, knowledge of how rainfall has behaved in past decades is important. The central theme of this paper is the study of rainfall behavior over the past six decades in relation to the major rainfall-induced risks for the rain-fed tef production system, using 59 years of rainfall data. The risk of a dry spell during germination and flowering is computed, whereas a crop water requirement satisfaction index is generated using a water balance approach. The study shows strong intra-annual variation but no trend in the annual and monthly mean rainfall totals or the number of rain days. This intra-annual variation has enabled a wide range of possible planting dates, running from late June to late August, and there was no indication of a trend toward later or earlier planting dates in recent years. The results also depict an early onset of the rains once in five years and a late onset once in nine years. The wide range of possible planting dates, the early and late onset of the rains, the high intra-year variability in rainfall amount and number of rain days, and the absence of any apparent trend in rainfall amount and number of rain days may shed some light on how farmers now face frequent extremes that may result in frequent crop failures. This signifies the need for yearly rainfall forecasts and their appropriate analysis to achieve successful planting, minimize related risks, and consequently attain a better and more consistent production system.
How population size influences quantitative genetic variation and differentiation among natural, fragmented populations remains unresolved. Small, isolated populations might occupy poor quality habitats and lose genetic variation more rapidly due to genetic drift than large populations. Genetic drift might furthermore overcome selection as population size decreases. Collectively, this might result in directional changes in additive genetic variation (VA) and trait differentiation (QST) from small to large population size. Alternatively, small populations might exhibit larger variation in VA and QST if habitat fragmentation increases variability in habitat types. We explored these alternatives by investigating VA and QST using nine fragmented populations of brook trout varying 50-fold in census size N (179-8416) and 10-fold in effective number of breeders, Nb (18-135). Across 15 traits, no evidence was found for consistent differences in VA and QST with population size and almost no evid...
For many animals, options abound when choosing a mate in socially complex environments like a breeding chorus or lek. In such environments, receivers often choose their mate based on individual differences in signal repetition rate. However, signallers also differ in the regularity with which they produce repeated signals. Irregularity in signalling introduces uncertainty in decision-making by masking the among-individual variation in signalling rate that is a target of mate choice. At present, we know little about how the complexity of the choice environment impacts selection on rate and regularity, two signalling behaviours that receivers can only compare after sampling series of signals produced by multiple signallers. In this study of female grey treefrogs (Hyla chrysoscelis), we measured multivariate sexual selection on the rate and regularity of male calling behaviour using two-, four-, and eight-choice tests. Receivers overwhelmingly chose faster, more regular calling rates in tw...

These are data generated from phonotaxis (movement toward sound) of female gray treefrogs Hyla chrysoscelis. We calculated the Relative Fitness score for each hypothetical male phenotype in the dataset by dividing the number of times he ...

# Complex choice environments shelter unattractive signallers from sexual selection
https://doi.org/10.5061/dryad.gb5mkkx1v
We measured sexual selection on call rate and call regularity (the coefficient of variation within males in call rate; standard deviation scaled by the mean) in Cope's gray treefrog *Hyla chrysoscelis*. We used phonotaxis tests of gravid gray treefrog females to infer how selection was acting on male phenotypes in choice set sizes of 2, 4, or 8 competing males. We generated 'stimulus sets' of 8 hypothetical male phenotypes by drawing their mean values from normal distributions to make a distribution representing each male's unique combination of call rate and regularity. Then we drew each instantaneous call rate in a sequence of calls from that distribution to generate an audio track that had a given level of rate and regularity. Complete details are given in the paper.
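A rough Python sketch of the stimulus-generation idea follows. It is not the procedure used in the paper: the population means and standard deviations, the number of calls per track, and the function names are all hypothetical placeholders chosen for illustration. It only shows how a per-male mean rate and a coefficient of variation together define the normal distribution from which instantaneous call rates are drawn.

```python
import random
import statistics


def coefficient_of_variation(values):
    """Standard deviation scaled by the mean, the regularity measure in the text."""
    return statistics.stdev(values) / statistics.mean(values)


def simulate_call_rates(mean_rate, cv, n_calls, rng):
    """Draw instantaneous call rates from a normal distribution with the given mean and CV."""
    sd = cv * mean_rate
    return [rng.gauss(mean_rate, sd) for _ in range(n_calls)]


rng = random.Random(1)  # fixed seed so the sketch is repeatable

# One stimulus set of 8 hypothetical males: (mean call rate, target CV).
# The values 50.0, 5.0, 0.10 and 0.03 are placeholders, not values from the study.
males = [(rng.gauss(50.0, 5.0), abs(rng.gauss(0.10, 0.03))) for _ in range(8)]

tracks = [simulate_call_rates(mean_rate, cv, 200, rng) for mean_rate, cv in males]
observed_cv = [coefficient_of_variation(track) for track in tracks]
```

With enough calls per track, the observed coefficient of variation of each sequence should sit close to the target value assigned to that male.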
https://spdx.org/licenses/CC0-1.0.html
Birds build nests primarily as a receptacle to lay their eggs in, but nests can also provide secondary benefits including structural support, camouflage, and adjustment of the microclimate surrounding the eggs and offspring. The factors underlying intraspecific variation in nest characteristics are poorly understood. In this study, we aim to identify the environmental factors that predict nest height variation and the duration of nest building in blue tits (Cyanistes caeruleus), evaluating latitude, elevation, temperature, and the timing of egg-laying as predictors of nest height, while also taking into account female and male parental identity. Using 713 nest height observations collected over a period of five years along a 220km transect in Scotland, we found that if the annual mean timing of egg-laying was earlier, nests were taller. However, there was no correlation between nest height and elevation, latitude, the minimum temperature in the 14 days pre-egg-laying or the phenology of birds within a year. Female parental identity accounted for a large amount of variation in nest height, suggesting that individual behaviour has an influence on nest structure. We also found that nest building duration was shorter in years when egg laying occurred earlier in the year, and that across all observations taller nests took longer to build. Overall, our results show that blue tits are able to alter their nest characteristics based on environmental gradients like latitude (in the case of building duration) and the annual mean timing of egg laying, and that birds build relatively taller nests faster.
Methods Nest data were collected along a 220km transect spanning from Edinburgh (55.98°N -3.40°E) to Dornoch (57.89°N -4.08°E) in Scotland from 2014-2018. Six to eight nest boxes (front-opening Schwegler 1B with 26mm entrance holes located 141mm above the outer bottom of the nest box and a 120mm internal diameter) were placed at 44 different sites along this transect; 40 sites were installed for the period 2014 – 2016 with a further four added in 2017. Pre-2017 all but one site had six nest boxes, and post-2017 the number of nest boxes was increased to eight at most sites. Nest boxes were placed at approximately 40m intervals, ca 1.5m off the ground and facing away from the prevailing south-west wind. Nest box location was determined using a handheld GPS and site elevation was obtained from the Google Maps elevation API, with elevations ranging from 10 to 433 m.a.s.l. Each site was visited every two days from mid-March (2014-2015) or early April (2016-2018) until the breeding season had finished (early- to mid-June).
Nest height was measured at each visit from the first visit in which nesting material was found in a nest box (either the first or second day of nest building) until the first egg was laid, and the dates of commencement of both nest building and egg-laying were recorded as ordinal dates. Nest height was recorded using a ruler placed at the outside of the nest box to the topmost point of the nest, discounting straggly parts. In 2014, nest height was measured from the inside of the nest box, but this was changed to the current method to improve between-recorder accuracy, as the bottom of the nest box is more repeatably determinable than the top of the inside. To correct for this difference in technique, the average height of the lower ledge of the nest box (30mm) was added to the 2014 height measurements.
The ordinal date of the first egg laid in every nest box was recorded based on whether one or two eggs were present, assuming a one-egg-per-day lay-rate. Adult birds were caught and individually identified by a metal ring when the chicks were over 10 days old. Temperature data was available for each site from two iButton dataloggers (Maxim: DS1922L) accurate to a resolution of 0.0625°C and placed at opposite ends of each site from mid-February, recording every hour. They were placed in water-tight cylinders on the north side of trees, with a hole at the bottom to allow ambient air circulation.
The "Nest Height Data" file includes three fixed effect predictors that capture various aspects of breeding phenology: the annual mean first egg date across all sites to control for early versus late seasons (yearmeanfedcentre), the annual mean phenological deviation of each site from the study-wide annual mean first egg date to capture spatial variation in phenology between sites (siteyearfeddevs), and the deviation of each nest box from the site-year mean to capture between-individual variation in phenology within a site and year (relfeddevs). Mean centred site latitude, elevation, and mean site minimum temperature in the 14 days before the first egg was laid are also included. Where a parental identity was unknown it was assigned a unique identifier, so that nest height observations with missing parental information could be retained (femaleID and maleID). The term "sitebox" corresponds to the nest box number at each site (1-8) and site ID. Also included are the year in which the nest was built, the site abbreviation, and the "nestrecorder" who measured the nest height.
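The three phenology predictors amount to splitting each first egg date into a between-year part, a between-site part within a year, and a within-site part. The Python sketch below illustrates that decomposition on invented records; the data values are hypothetical, and it assumes that each mean is taken over nests (the description does not say whether site means were averaged first), so it should be read as one plausible reading and not as the authors' exact calculation.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (year, site, nest box, first egg date as ordinal day).
records = [
    (2014, "A", 1, 120.0), (2014, "A", 2, 124.0),
    (2014, "B", 1, 130.0), (2014, "B", 2, 126.0),
    (2015, "A", 1, 110.0), (2015, "A", 2, 114.0),
    (2015, "B", 1, 118.0), (2015, "B", 2, 122.0),
]


def group_mean(rows, key):
    """Mean first egg date within each group defined by key(row)."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row[3])
    return {group: mean(values) for group, values in groups.items()}


year_mean = group_mean(records, lambda r: r[0])
site_year_mean = group_mean(records, lambda r: (r[0], r[1]))
grand_mean = mean(list(year_mean.values()))

predictors = []
for year, site, box, fed in records:
    predictors.append({
        "fed": fed,
        # Early versus late years, centred on the mean of the annual means.
        "yearmeanfedcentre": year_mean[year] - grand_mean,
        # How early or late a site is relative to its year.
        "siteyearfeddevs": site_year_mean[(year, site)] - year_mean[year],
        # How early or late a nest box is relative to its site and year.
        "relfeddevs": fed - site_year_mean[(year, site)],
    })
```

Adding the three terms back to the grand mean recovers each original first egg date, which is a quick check that the partition is complete.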
The "Building Duration Data" file includes the same "year", "nestrecorder", "femaleID", "sitebox", and mean-centred elevation and latitude values as in the "Nest Height Data" file. Also included in the "Building Duration Data" file are mean-centred nest height (heightmeancentre) and first egg date across all sites and years (fedmeancentre). The "buildingtime" variable was calculated as the date at which the final nest height +/- 3mm was reached minus the date of first recorded nest building. The "mintempsmeancentre" variable is the mean minimum site temperature in the 7 days pre-egg-laying.
Open Government Licence - Canada 2.0 https://open.canada.ca/en/open-government-licence-canada
License information was derived automatically
Description: Spatial information on the distribution of juvenile Pacific salmon is needed to support Marine Spatial Planning in the Pacific Region of Canada. Here we provide spatial estimates of the distribution of juvenile fish in the Strait of Georgia for all five species of Pacific salmon. These estimates were generated using a spatiotemporal generalized linear model and are based on standardized fishery-independent survey data from the Strait of Georgia juvenile salmon mid-water trawl survey from 2010 to 2020. We provide predicted catch per unit effort (CPUE), year-to-year variation in CPUE, and prediction uncertainty for both summer (June–July) and fall (September–October) at a 0.5 km resolution, covering the majority of the strait. These results show that the surface 75 m of the entire Strait of Georgia is habitat for juvenile salmon from June through early October, but that distributions within the strait differ across species and across seasons. While there is interannual variability in abundances and distributions, our analysis identifies areas that have consistently high abundances across years. The results from this study illustrate juvenile habitat use in the Strait of Georgia for the five species of Pacific salmon and can support ongoing marine spatial planning initiatives in the Pacific region of Canada. Methods: Juvenile Salmon Survey Data This analysis is based on surveys conducted between 2010 and 2020. Sets that lasted between 12 and 50 minutes and at depths less than or equal to 60 m (head rope depth) were included. The resulting survey dataset consists of 1588 sets. The analysis included all five species of Pacific salmon. For pink salmon, only even-year surveys were included, as pink salmon have a two-year life cycle and are effectively absent from the Strait in odd years.
Geostatistical model of salmon abundance and Predictions We estimated the spatial distribution and abundance of each species of Pacific Salmon using geostatistical models fit with sdmTMB (Anderson et al. 2022). For each species, we modelled the number of individuals caught in a set, at a location and time using a negative binomial observation model with a log link. Predictions were made for each survey season (summer and fall) in each year from 2010 to 2020 over a 500 m by 500 m grid based on a 3 km buffer around the outer concave hull of the trawl coordinates. The concave hull was calculated using the ‘sf_concave_hull’ function from the sf package using a concavity ratio of 0.3, and excluding holes. Predictions were made as catch per unit effort (CPUE, for 60 minutes) for tows conducted in the surface waters (i.e., head rope at 0 m). Continuous estimates are provided at a 0.5 km resolution throughout the Strait of Georgia. These estimates consist of 1) mean catch per unit effort (CPUE), 2) year-to-year coefficient of variation (CV) of CPUE as a measure of the temporal variability, 3) binned biscale measures of mean vs. CV of CPUE to distinguish areas where abundance is consistently high vs. areas where it is high on average, but with high year-to-year variability, and 4) mean standard error in CPUE as a measure of uncertainty. See Thompson and Neville for full method details. Uncertainties: Although the models had relatively low uncertainty and the estimated spatial patterns reflected the spatial and temporal variation in CPUE in the surveys, it is important to understand the limitations of these model predictions. Because juvenile salmon are often aggregated, there is high variability in the CPUE in the survey data. Our model predictions represent the geometric mean CPUE and so are an average expectation, but do not reproduce the high inter-tow variability that is present in the survey data. 
Spatially, our predictions have low uncertainty in areas that are central within the standard survey track line. However, uncertainty is higher on the margins of the survey area, where there are fewer sets to inform those predictions. Data Sources: Juvenile salmon survey database from Salmon Marine Interactions Program, REEFF, ESD, Pacific Biological Station. Data is also available through Canadian Data Report of Fisheries and Aquatic Sciences publications.
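The per-cell summaries described above (mean CPUE, year-to-year coefficient of variation, and a combined mean-versus-CV class) can be illustrated with a small Python sketch. It works on made-up yearly predictions for a single grid cell and does not reproduce the sdmTMB model; the function name, the thresholds, the class labels, and the use of the sample standard deviation are all assumptions made for this example.

```python
from statistics import mean, stdev


def summarize_cell(cpue_by_year, mean_threshold, cv_threshold):
    """Summarize one grid cell's yearly CPUE predictions.

    Returns the across-year mean, the coefficient of variation (sample
    standard deviation divided by the mean), and a coarse two-way class.
    The CV is NaN when the mean is zero, in which case the cell is
    labelled 'consistent' by default.
    """
    m = mean(cpue_by_year)
    cv = stdev(cpue_by_year) / m if m > 0 else float("nan")
    level = "high" if m >= mean_threshold else "low"
    variability = "variable" if cv > cv_threshold else "consistent"
    return {"mean": m, "cv": cv, "class": f"{level}-{variability}"}


# Hypothetical predictions for two cells over three survey years.
steady_cell = summarize_cell([10.0, 10.0, 10.0], mean_threshold=5.0, cv_threshold=0.3)
patchy_cell = summarize_cell([5.0, 10.0, 15.0], mean_threshold=5.0, cv_threshold=0.3)
```

Both example cells have the same mean CPUE, but only the first would count as consistently high, which is the distinction the binned measure is meant to draw.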
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Temporal Coverage: September 1992 - September 2024
This file contains Global Mean Sea Level (GMSL) variations computed at the NASA Goddard Space Flight Center under the auspices of the NASA Sea Level Change program. The GMSL was generated using the Integrated Multi-Mission Ocean Altimeter Data for Climate Research (http://podaac.jpl.nasa.gov/dataset/MERGED_TP_J1_OSTM_OST_ALL_V52). It combines Sea Surface Heights from the TOPEX/Poseidon, Jason-1, OSTM/Jason-2, Jason-3, and Sentinel-6 Michael Freilich missions to a common terrestrial reference frame with all inter-mission biases, range and geophysical corrections applied and placed onto a georeferenced orbit. This creates a consistent data record throughout time, regardless of the instrument used. Note, the most recent estimates of GMSL (post March 28, 2022) derived from the Sentinel-6 Michael Freilich mission are preliminary as validation and reprocessing procedures for Sentinel-6 are ongoing.
For information on how the data were generated, please refer to:
- Beckley, B. D., Callahan, P. S., Hancock, D. W., Mitchum, G. T., & Ray, R. D. (2017). On the 'cal-mode' correction to TOPEX satellite altimetry and its effect on the global mean sea level time series. Journal of Geophysical Research: Oceans, 122. https://doi.org/10.1002/2017JC013090
- Beckley, B.D., N. P. Zelensky, S. A. Holmes, F. G. Lemoine, R. D. Ray, G. T. Mitchum, S. D. Desai & S. T. Brown, Assessment of the Jason-2 Extension to the TOPEX/Poseidon, Jason-1 Sea-Surface Height Time Series for Global Mean Sea Level Monitoring, Marine Geodesy, Vol 33, Suppl 1, 2010. DOI:10.1080/01490419.2010.491029
Global Mean Sea Level (GMSL) variations from TPJAOS v5.1
Missing or bad value flag: 99900.000
TOPEX/Jason 1996-2016 collinear mean reference derived from cycles 121 to 858.
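Since the file marks missing or bad values with the flag 99900.000, those entries need to be masked before any statistic is computed. The Python sketch below shows one way to do that; the helper names and the sample values are hypothetical, and the actual column layout of the file is not assumed here.

```python
MISSING_FLAG = 99900.0  # flag value stated in the file header


def mask_missing(values, flag=MISSING_FLAG, tol=1e-6):
    """Replace flagged entries with None so they cannot leak into statistics."""
    return [None if abs(v - flag) <= tol else v for v in values]


def mean_ignoring_missing(values):
    """Average the valid entries, or return None if nothing is valid."""
    valid = [v for v in values if v is not None]
    return sum(valid) / len(valid) if valid else None


# Hypothetical GMSL variations with one flagged record.
raw_values = [-1.5, 99900.0, 0.5, 2.5]
cleaned = mask_missing(raw_values)
average = mean_ignoring_missing(cleaned)
```

Leaving the flag in place would pull the average to roughly 24975 in this example, so masking has to happen before any trend or mean is calculated.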
GSFC. 2024. Global Mean Sea Level Trend from Integrated Multi-Mission Ocean Altimeters TOPEX/Poseidon, Jason-1, OSTM/Jason-2, Jason-3, and Sentinel-6A Version 5.2. Ver. 5.2 PO.DAAC, CA, USA. Dataset accessed [2024-12-19] at https://doi.org/10.5067/GMSLM-TJ152.
The report on the Household Income and Expenditure Survey (HIES) of the Maldives 2002-2003 presents an analysis of findings from the survey. This is the first nationwide HIES conducted in the Maldives. The survey was conducted during the period September 2002 to June 2003 in four quarterly rounds during the months of September and December in 2002 and March and June in 2003. The survey was designed as a scientific random sample with separate strata for Male' and the five development regions. It covered the capital Male' and forty islands from the Atolls and includes data for 834 households.
As HIES data provide a snapshot of the socio-economic situation of the households prior to the Tsunami, it would also serve as significant baseline data in the future assessments and comparisons in this regard.
The tables presented in the report provide the data by standard classifications that include income quintiles, income classes, and daily per capita expenditure classes, cross-tabulated by demographic variables such as the size of household, number of earners, proportion of expenditure on different expenditure categories. To the extent possible, all information is presented separately for Male' and the Atolls and also at the regional level for the atolls and comparisons are made with the 1997/98 VPA.
National
persons, households, expenditure items
Sample survey data [ssd]
Sample design
The frame
The country is divided into 5 development regions and 20 administrative atolls. The administrative atolls consist of 199 inhabited islands with clearly marked census enumeration blocks. The capital, Male', has separate administrative status. HIES uses an area frame; thus the administrative and geographic structure of the country is taken as the basis for making the sample representative. The data required for sampling were obtained from the Population Census 2000. Major characteristics of the frame are given below.
Atolls are too big to serve as sampling units, while the size of islands in terms of the number of households varies, even after some exclusions, from merely 20 to 1500. Initially, it was thought to split some big islands and combine smaller ones to get evenly sized area units. Instead, census enumeration blocks were chosen as primary sampling units for practical reasons. The size of enumeration blocks varies from 20 to 64 households.
Sample size
The mean of a characteristic estimated from the sample survey results may deviate from the population mean, resulting in a margin of sampling error. The relative error of the sample mean can be controlled by determining the sample size based on the coefficient of variation.
The variance of household income was estimated from the results of earlier surveys. As expected, the distribution of households by income group was skewed because of the small number of households in high-income groups. After some extreme cases were excluded, the distribution became fairly normal, with a coefficient of variation of 7.5%.
In order to produce results with the same precision at the 95% level of confidence, the sample size n was estimated at 885 households (that is, the probability that a sample mean calculated from a survey of 885 households contains more than 7.5% sampling error does not exceed 5%). In the course of the allocation process, the actual number of households to be surveyed turned out to be 880, and this deviation was accepted.
Stratification and allocation of sample
The purpose of the stratification is to divide the population into relatively homogeneous groups and thereby reduce the total variation by the margin of inter-group variation. Stratification allows proper allocation of sample in different groups and makes it more representative.
First, it was essential to treat Male' separately in the whole sampling procedure. The income opportunities and expenditure pattern in Male' is very much different from the rest of the country. It is also necessary to produce separate estimates for Male' like all other national surveys in Maldives. So, there are two domains of the survey namely: Male' and Atolls.
Stratification in Male' was done by wards and sub-wards to spread the sample over different urban areas. Male' has 5 wards, two of which were each split into 2 sub-wards because of their larger size.
Stratification in the Atolls was done by development region and by the distance of islands from the atoll centre, with the aim of separating the islands in each region that are remote from the atoll centres from those located nearby. It was believed that in most cases atoll centres are also economic centres, providing more income opportunities and better access to different kinds of social services. Islands remote from the atoll centres have limited facilities, which affects the income and expenditure of households.
For grouping purposes, the islands of each development region were listed in ascending order of distance from the respective atoll centres, and a median value was located for the number of households. Then each atoll was divided into Central and Remote islands, where both groups were of more or less equal size in terms of the number of households. The average distance of Central islands in the different regions varied from 10 to 14 km, and that of Remote islands from 27 to 37 km. Cross-stratification by five development regions and two distance categories formed ten strata in the domain of atolls, as shown in the chart.
The allocation of the sample between domains was intended to be proportional to the number of households. However, this was not possible because of resource constraints. The survey cost per PSU in the Atolls was estimated at four to five thousand M.Rf., while the survey cost in Male' is limited to stationery because staff do not receive any extra allowance for working in Male'.
Further allocation was made based on equal number of sample for all strata. The number of enumeration blocks as well as the number of households did not vary much, so the fixed number of samples over all strata resulted in sampling fraction ranging from 2% to 7% for enumeration blocks and 1.1% to 3.6% for households.
Allocation of the sample in the Male' strata was made on the same principle. Each of the 7 strata was allocated 4 blocks at a rate of 10 households per block, which gives a total of 28 sample blocks and 280 households.
The sampling procedures are fully described in Appendix 1 of "Maldives Household Income and Expenditure Survey 2002-2003 - Final Report".
Face-to-face [f2f]
As mentioned above, the sample design for Male' included enumeration of the households during two rounds of the survey. It proved difficult to obtain the co-operation of some of the selected households for participation during the second round of enumeration and as a result, a larger than expected non-response was encountered. In total, 24 households, that is about 15 percent of the total, did not participate in their second period of enumeration. In the atolls, enumeration was not accomplished for only five households, or less than one percent of the total. In addition to the non-response of 29 households, in a small number of cases it was necessary to remove the household information during processing, basically because insufficient information was available. As a result, the final data set contains information for 834 households instead of the 880 in the design.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Supporting documentation on code lists, subject definitions, data accuracy, and statistical testing can be found on the American Community Survey website in the Data and Documentation section.
Sample size and data quality measures (including coverage rates, allocation rates, and response rates) can be found on the American Community Survey website in the Methodology section.
Tell us what you think. Provide feedback to help make American Community Survey data more useful for you.
Although the American Community Survey (ACS) produces population, demographic and housing unit estimates, it is the Census Bureau's Population Estimates Program that produces and disseminates the official estimates of the population for the nation, states, counties, cities and towns and estimates of housing units for states and counties.
Explanation of Symbols:
An '**' entry in the margin of error column indicates that either no sample observations or too few sample observations were available to compute a standard error and thus the margin of error. A statistical test is not appropriate.
An '-' entry in the estimate column indicates that either no sample observations or too few sample observations were available to compute an estimate, or a ratio of medians cannot be calculated because one or both of the median estimates falls in the lowest interval or upper interval of an open-ended distribution.
An '-' following a median estimate means the median falls in the lowest interval of an open-ended distribution.
An '+' following a median estimate means the median falls in the upper interval of an open-ended distribution.
An '***' entry in the margin of error column indicates that the median falls in the lowest interval or upper interval of an open-ended distribution. A statistical test is not appropriate.
An '*****' entry in the margin of error column indicates that the estimate is controlled. A statistical test for sampling variability is not appropriate.
An 'N' entry in the estimate and margin of error columns indicates that data for this geographic area cannot be displayed because the number of sample cases is too small.
An '(X)' means that the estimate is not applicable or not available.
Estimates of urban and rural population, housing units, and characteristics reflect boundaries of urban areas defined based on Census 2010 data. As a result, data for urban and rural areas from the ACS do not necessarily reflect the results of ongoing urbanization.
While the 2016 American Community Survey (ACS) data generally reflect the February 2013 Office of Management and Budget (OMB) definitions of metropolitan and micropolitan statistical areas; in certain instances the names, codes, and boundaries of the principal cities shown in ACS tables may differ from the OMB definitions due to differences in the effective dates of the geographic entities.
When information is missing or inconsistent, the Census Bureau logically assigns an acceptable value using the response to a related question or questions. If a logical assignment is not possible, data are filled using a statistical process called allocation, which uses a similar individual or household to provide a donor value. The "Allocated" section is the number of respondents who received an allocated value for a particular subject.
Workers include members of the Armed Forces and civilians who were at work last week.
The 12 selected states are Connecticut, Maine, Massachusetts, Michigan, Minnesota, New Hampshire, New Jersey, New York, Pennsylvania, Rhode Island, Vermont, and Wisconsin.
Data are based on a sample and are subject to sampling variability. The degree of uncertainty for an estimate arising from sampling variability is represented through the use of a margin of error. The value shown here is the 90 percent margin of error. The margin of error can be interpreted roughly as providing a 90 percent probability that the interval defined by the estimate minus the margin of error and the estimate plus the margin of error (the lower and upper confidence bounds) contains the true value. In addition to sampling variability, the ACS estimates are subject to nonsampling error (for a discussion of nonsampling variability, see Accuracy of the Data). The effect of nonsampling error is not represented in these tables.
Source: U.S. Census Bureau, 2016 American Community Survey 1-Year Estimates
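As a rough illustration of the interpretation above, the confidence bounds follow directly from an estimate and its published margin of error. This is a minimal sketch: the function names and the sample figures are hypothetical, and the 1.645 divisor is the standard normal critical value for a 90 percent level, used here to convert a margin of error to an approximate standard error.

```python
Z_90 = 1.645  # z critical value for a 90 percent confidence level


def confidence_bounds(estimate, moe):
    """Return (lower, upper): the estimate minus and plus the margin of error."""
    return estimate - moe, estimate + moe


def standard_error(moe, z=Z_90):
    """Convert a 90 percent margin of error to an approximate standard error."""
    return moe / z


# Hypothetical figures, not taken from any ACS table.
lower, upper = confidence_bounds(12_500, 329)
print(lower, upper)                    # lower and upper confidence bounds
print(round(standard_error(329), 1))   # approximate standard error
```

Nonsampling error is not reflected in these bounds, as the notes above point out.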
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Supporting documentation on code lists, subject definitions, data accuracy, and statistical testing can be found on the American Community Survey website in the Data and Documentation section.
Sample size and data quality measures (including coverage rates, allocation rates, and response rates) can be found on the American Community Survey website in the Methodology section.
The number of people moving out of Alaska to a different state has been overestimated in previous years due to collection issues. See Errata Notes for details.
This table provides geographical mobility for persons relative to their residence at the time they were surveyed. The characteristics crossed by geographical mobility reflect the current survey year.
Tell us what you think. Provide feedback to help make American Community Survey data more useful for you.
Although the American Community Survey (ACS) produces population, demographic and housing unit estimates, it is the Census Bureau's Population Estimates Program that produces and disseminates the official estimates of the population for the nation, states, counties, cities and towns and estimates of housing units for states and counties.
Explanation of Symbols:
An '**' entry in the margin of error column indicates that either no sample observations or too few sample observations were available to compute a standard error and thus the margin of error. A statistical test is not appropriate.
An '-' entry in the estimate column indicates that either no sample observations or too few sample observations were available to compute an estimate, or a ratio of medians cannot be calculated because one or both of the median estimates falls in the lowest interval or upper interval of an open-ended distribution.
An '-' following a median estimate means the median falls in the lowest interval of an open-ended distribution.
An '+' following a median estimate means the median falls in the upper interval of an open-ended distribution.
An '***' entry in the margin of error column indicates that the median falls in the lowest interval or upper interval of an open-ended distribution. A statistical test is not appropriate.
An '*****' entry in the margin of error column indicates that the estimate is controlled. A statistical test for sampling variability is not appropriate.
An 'N' entry in the estimate and margin of error columns indicates that data for this geographic area cannot be displayed because the number of sample cases is too small.
An '(X)' means that the estimate is not applicable or not available.
Estimates of urban and rural population, housing units, and characteristics reflect boundaries of urban areas defined based on Census 2010 data. As a result, data for urban and rural areas from the ACS do not necessarily reflect the results of ongoing urbanization.
While the 2011-2015 American Community Survey (ACS) data generally reflect the February 2013 Office of Management and Budget (OMB) definitions of metropolitan and micropolitan statistical areas; in certain instances the names, codes, and boundaries of the principal cities shown in ACS tables may differ from the OMB definitions due to differences in the effective dates of the geographic entities.
Data are based on a sample and are subject to sampling variability. The degree of uncertainty for an estimate arising from sampling variability is represented through the use of a margin of error. The value shown here is the 90 percent margin of error. The margin of error can be interpreted roughly as providing a 90 percent probability that the interval defined by the estimate minus the margin of error and the estimate plus the margin of error (the lower and upper confidence bounds) contains the true value. In addition to sampling variability, the ACS estimates are subject to nonsampling error (for a discussion of nonsampling variability, see Accuracy of the Data). The effect of nonsampling error is not represented in these tables.
Source: U.S. Census Bureau, 2011-2015 American Community Survey 5-Year Estimates
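The notes above mention statistical testing without showing a test. One common approach for comparing two estimates published with 90 percent margins of error is a z-test on their difference. The sketch below assumes the two estimates are independent. The function name and the example figures are hypothetical. It also refuses to test when a margin of error is one of the symbols the notes flag as not appropriate for testing.

```python
import math

Z_90 = 1.645  # z critical value for a 90 percent confidence level
NO_TEST_SYMBOLS = {"**", "***", "*****"}  # per the Explanation of Symbols


def is_significant_difference(est_a, moe_a, est_b, moe_b, z_crit=Z_90):
    """Return True if two estimates differ significantly at the 90 percent level.

    Assumes independent estimates and 90 percent margins of error.
    """
    for moe in (moe_a, moe_b):
        if moe in NO_TEST_SYMBOLS:
            raise ValueError("a statistical test is not appropriate for this entry")
    se_a = moe_a / Z_90  # margin of error -> standard error
    se_b = moe_b / Z_90
    z = (est_a - est_b) / math.sqrt(se_a ** 2 + se_b ** 2)
    return abs(z) > z_crit


# Hypothetical figures, not taken from any ACS table.
print(is_significant_difference(1000, 100, 1300, 100))
print(is_significant_difference(1000, 100, 1050, 100))
```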
Reporting of Aggregate Case and Death Count data was discontinued on May 11, 2023, with the expiration of the COVID-19 public health emergency declaration. Although these data will continue to be publicly available, this dataset will no longer be updated.
The surveillance case definition for COVID-19, a nationally notifiable disease, was first described in a position statement from the Council of State and Territorial Epidemiologists, which was later revised. However, there is some variation in how jurisdictions implemented these case definitions. More information on how CDC collects COVID-19 case surveillance data can be found at FAQ: COVID-19 Data and Surveillance.
Aggregate Data Collection Process
Since the beginning of the COVID-19 pandemic, data were reported from state and local health departments through a robust process with the following steps:
This process was collaborative, with CDC and jurisdictions working together to ensure the accuracy of COVID-19 case and death numbers. County counts provided the most up-to-date numbers on cases and deaths by report date. Throughout data collection, CDC retrospectively updated counts to correct known data quality issues.
Description
This archived public use dataset focuses on the cumulative and weekly case and death rates per 100,000 persons within various sociodemographic factors across all states and their counties. All resulting data are expressed as rates calculated as the number of cases or deaths per 100,000 persons in counties meeting various classification criteria using the US Census Bureau Population Estimates Program (2019 Vintage).
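The rate calculation described above can be sketched as follows. This is only an illustration: the function names, the dictionary keys, and the county figures are hypothetical, and pooling a category's rate as total counts over total population is an assumption about how counties in a category are combined.

```python
def rate_per_100k(count, population):
    """Cases or deaths per 100,000 persons."""
    if population <= 0:
        raise ValueError("population must be positive")
    return count * 100_000 / population


def pooled_rate_per_100k(counties):
    """Rate for a group of counties: summed counts over summed population."""
    total_count = sum(c["count"] for c in counties)
    total_population = sum(c["population"] for c in counties)
    return rate_per_100k(total_count, total_population)


# Hypothetical counties that fall into the same category of one factor.
category_counties = [
    {"count": 100, "population": 20_000},
    {"count": 150, "population": 30_000},
]
print(pooled_rate_per_100k(category_counties))
```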
Each county within jurisdictions is classified into multiple categories for each factor. All rates in this dataset are based on classification of counties by the characteristics of their population, not individual-level factors. This applies to each of the available factors observed in this dataset. Specific factors and their corresponding categories are detailed below.
Population-level factors
Each unique population factor is detailed below. Please note that the “Classification” column describes each of the 12 factors in the dataset, including a data dictionary describing what each numeric digit means within each classification. The “Category” column uses numeric digits (2-6, depending on the factor) defined in the “Classification” column.
Metro vs. Non-Metro – “Metro_Rural”
Metro vs. Non-Metro classification type is an aggregation of the 6 National Center for Health Statistics (NCHS) Urban-Rural classifications, where “Metro” counties include Large Central Metro, Large Fringe Metro, Medium Metro, and Small Metro areas and “Non-Metro” counties include Micropolitan and Non-Core (Rural) areas.
1 – Metro, including “Large Central Metro, Large Fringe Metro, Medium Metro, and Small Metro” areas
2 – Non-Metro, including “Micropolitan, and Non-Core” areas
Urban/rural - “NCHS_Class”
Urban/rural classification type is based on the 2013 National Center for Health Statistics Urban-Rural Classification Scheme for Counties. Levels consist of:
1 Large Central Metro
2 Large Fringe Metro
3 Medium Metro
4 Small Metro
5 Micropolitan
6 Non-Core (Rural)
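The aggregation from the six NCHS levels to the two “Metro_Rural” codes can be sketched as a simple lookup. The dictionary and function names are hypothetical; the codes and labels are the ones listed above.

```python
# NCHS_Class codes and labels as listed above.
NCHS_CLASS = {
    1: "Large Central Metro",
    2: "Large Fringe Metro",
    3: "Medium Metro",
    4: "Small Metro",
    5: "Micropolitan",
    6: "Non-Core (Rural)",
}


def metro_rural_code(nchs_class):
    """Map an NCHS_Class code (1-6) to a Metro_Rural code (1 Metro, 2 Non-Metro)."""
    if nchs_class not in NCHS_CLASS:
        raise ValueError(f"unknown NCHS_Class code: {nchs_class!r}")
    return 1 if nchs_class <= 4 else 2


for code, label in NCHS_CLASS.items():
    print(code, label, "->", metro_rural_code(code))
```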
American Community Survey (ACS) data were used to classify counties based on their age, race/ethnicity, household size, poverty level, and health insurance status distributions. Cut points were generated by using tertiles and categorized as High, Moderate, and Low percentages. The classification “Percent non-Hispanic, Native Hawaiian/Pacific Islander” is only available for “Hawaii” due to low numbers in this category for other available locations. This limitation also applies to other race/ethnicity categories within certain jurisdictions, where no counties fall into a given category. The cut points for each ACS category are further detailed below:
Age 65 - “Age65”
1 Low (0-24.4%)
2 Moderate (>24.4%-28.6%)
3 High (>28.6%)
Non-Hispanic, Asian - “NHAA”
1 Low (<=5.7%)
2 Moderate (>5.7%-17.4%)
3 High (>17.4%)
Non-Hispanic, American Indian/Alaskan Native - “NHIA”
1 Low (<=0.7%)
2 Moderate (>0.7%-30.1%)
3 High (>30.1%)
Non-Hispanic, Black - “NHBA”
1 Low (<=2.5%)
2 Moderate (>2.5%-37%)
3 High (>37%)
Hispanic - “HISP”
1 Low (<=18.3%)
2 Moderate (>18.3%-45.5%)
3 High (>45.5%)
Population in Poverty - “Pov”
1 Low (0-12.3%)
2 Moderate (>12.3%-17.3%)
3 High (>17.3%)
Population Uninsured - “Unins”
1 Low (0-7.1%)
2 Moderate (>7.1%-11.4%)
3 High (>11.4%)
Average Household Size - “HH”
1 Low (1-2.4)
2 Moderate (>2.4-2.6)
3 High (>2.6)
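The cut points above translate into a small lookup that assigns a county value to category 1, 2, or 3. This is a sketch, not the dataset authors' code: the `CUT_POINTS` table and `classify` function are hypothetical names, while the thresholds are copied from the list above. A value equal to a cut point falls in the lower category, matching the ">" bounds shown.

```python
import bisect

# (upper bound of Low, upper bound of Moderate) for each ACS factor,
# in percent except "HH", which is persons per household.
CUT_POINTS = {
    "Age65": (24.4, 28.6),
    "NHAA": (5.7, 17.4),
    "NHIA": (0.7, 30.1),
    "NHBA": (2.5, 37.0),
    "HISP": (18.3, 45.5),
    "Pov": (12.3, 17.3),
    "Unins": (7.1, 11.4),
    "HH": (2.4, 2.6),
}


def classify(factor, value):
    """Return 1 (Low), 2 (Moderate), or 3 (High) for a county-level value."""
    # bisect_left keeps a value equal to a cut point in the lower category.
    return bisect.bisect_left(CUT_POINTS[factor], value) + 1


print(classify("Pov", 12.3))  # on the Low/Moderate boundary
print(classify("Pov", 15.0))
print(classify("HH", 2.7))
```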
Community Vulnerability Index Value - “CCVI”
COVID-19 Community Vulnerability Index (CCVI) scores, which range from 0 to 1, are from Surgo Ventures. Cut points were generated based on tertiles and categorized as:
1 Low Vulnerability (0.0-0.4)
2 Moderate Vulnerability (0.4-0.6)
3 High Vulnerability (0.6-1.0)
Social Vulnerability Index Value – “SVI”
Social Vulnerability Index (SVI) scores (vintage 2020), which also range from 0 to 1, are from CDC/ATSDR’s Geospatial Research, Analysis & Service Program. Cut points for CCVI and SVI scores were generated based on tertiles and categorized as:
1 Low Vulnerability (0-0.333)
2 Moderate Vulnerability (0.334-0.666)
3 High Vulnerability (0.667-1)
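For readers unfamiliar with tertile cut points, the sketch below shows one way to derive them from a list of scores and then assign categories. The exact quantile method used for this dataset is not stated, so the default method of Python's `statistics.quantiles` is an assumption, as are the function names and the evenly spaced example scores.

```python
import bisect
import statistics
from collections import Counter


def tertile_cut_points(values):
    """Return the two cut points that split values into three equal-sized groups."""
    low, high = statistics.quantiles(values, n=3)
    return low, high


def tertile_category(value, cut_points):
    """Return 1 (Low), 2 (Moderate), or 3 (High) relative to the cut points."""
    return bisect.bisect_left(cut_points, value) + 1


# Hypothetical scores between 0 and 1, evenly spaced for illustration.
scores = [i / 100 for i in range(1, 100)]
cuts = tertile_cut_points(scores)
counts = Counter(tertile_category(s, cuts) for s in scores)
print(cuts)
print(sorted(counts.items()))
```

With evenly spaced scores the cut points land near one third and two thirds, which is consistent with the SVI ranges listed above.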
https://cdla.io/permissive-1-0/
The dataset has been created specifically for practicing Python, NumPy, Pandas, and Matplotlib. It is designed to provide a hands-on learning experience in data manipulation, analysis, and visualization using these libraries.
Specifics of the Dataset:
The dataset consists of 5000 rows and 20 columns, representing various features with different data types and distributions. The features include numerical variables with continuous and discrete distributions, categorical variables with multiple categories, binary variables, and ordinal variables. Each feature has been generated using different probability distributions and parameters to introduce variations and simulate real-world data scenarios. The dataset is synthetic and does not represent any real-world data. It has been created solely for educational purposes.
One of the defining characteristics of this dataset is the intentional incorporation of various real-world data challenges:
- Certain columns are randomly selected to be populated with NaN values, simulating the common challenge of missing data.
- The proportion of these missing values in each column varies randomly between 1% and 70%.
- Statistical noise has been introduced in the dataset. For numerical values in some features, this noise follows a distribution with mean 0 and standard deviation 0.1.
- Categorical noise is introduced in some features, with their categories randomly altered in about 1% of the rows.
- Outliers have also been embedded in the dataset, in line with the Interquartile Range (IQR) rule.
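As a practice starting point, the IQR rule mentioned above can be sketched with the standard library alone. The 1.5 multiplier is the conventional choice and is an assumption here, since the description does not state which multiplier was used. The function name and the sample column are hypothetical. NaN values are dropped first because the dataset contains missing data.

```python
import math
import statistics


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR], ignoring NaN entries."""
    clean = [v for v in values if not math.isnan(v)]
    q1, _, q3 = statistics.quantiles(clean, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in clean if v < lower or v > upper]


# Hypothetical column with one obvious outlier and one missing value.
column = [10, 11, 12, 13, 14, 15, 16, 100, float("nan")]
print(iqr_outliers(column))
```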
Context of the Dataset:
The dataset aims to provide a comprehensive playground for practicing Python, NumPy, Pandas, and Matplotlib. It allows learners to explore data manipulation techniques, perform statistical analysis, and create visualizations using the provided features. By working with this dataset, learners can gain hands-on experience in data cleaning, preprocessing, feature engineering, and visualization.
Sources of the Dataset:
The dataset has been generated programmatically using Python's random number generation functions and probability distributions. No external sources or real-world data have been used in creating this dataset.
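The generation script itself is not provided, so the following is only a sketch of how a column with the described characteristics could be produced with Python's `random` module. The function names, the seed, and the uniform base column are hypothetical. Treating the noise as Gaussian is also an assumption, since the description gives only its mean (0) and standard deviation (0.1).

```python
import math
import random


def add_noise(values, rng, sigma=0.1):
    """Add zero-mean noise with standard deviation sigma (assumed Gaussian)."""
    return [v + rng.gauss(0.0, sigma) for v in values]


def inject_missing(values, rng, min_frac=0.01, max_frac=0.70):
    """Replace a random share of entries (between 1% and 70%) with NaN."""
    frac = rng.uniform(min_frac, max_frac)
    n_missing = round(frac * len(values))
    missing_idx = set(rng.sample(range(len(values)), n_missing))
    return [math.nan if i in missing_idx else v for i, v in enumerate(values)]


rng = random.Random(42)  # fixed seed so the sketch is reproducible
base = [rng.uniform(0, 100) for _ in range(5000)]  # 5000 rows, as in the dataset
noisy = add_noise(base, rng)
with_gaps = inject_missing(noisy, rng)

missing_share = sum(math.isnan(v) for v in with_gaps) / len(with_gaps)
print(f"missing share: {missing_share:.1%}")
```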