License: CC0 1.0 (https://spdx.org/licenses/CC0-1.0.html)
This dataset contains simulated datasets, empirical data, and R scripts described in the paper: “Li, Q. and Kou, X. (2021) WiBB: An integrated method for quantifying the relative importance of predictive variables. Ecography (DOI: 10.1111/ecog.05651)”.
A fundamental goal of scientific research is to identify the underlying variables that govern crucial processes of a system. Here we proposed a new index, WiBB, which integrates the merits of several existing methods: a model-weighting method from information theory (Wi), a standardized regression coefficient method measured by β* (B), and the bootstrap resampling technique (B). We applied WiBB to simulated datasets with known correlation structures, for both linear models (LM) and generalized linear models (GLM), to evaluate its performance. We also applied two other methods, the relative sum of weights (SWi) and the standardized beta (β*), to compare their ability to rank predictor importance under various scenarios. We further applied WiBB to an empirical dataset of the plant genus Mimulus to select bioclimatic predictors of species' presence across the landscape. Results on the simulated datasets showed that the WiBB method outperformed the β* and SWi methods in scenarios with small and large sample sizes, respectively, and that the bootstrap resampling technique significantly improved the discriminant ability. When testing WiBB on the empirical dataset with GLM, it sensibly identified four important predictors with high credibility out of six candidates in modeling the geographical distributions of 71 Mimulus species. This integrated index has great advantages in evaluating predictor importance and hence reducing the dimensionality of data without losing interpretive power. The simplicity of calculating the new metric, compared with more sophisticated statistical procedures, makes it a handy method in the statistical toolbox.
Methods
To simulate independent datasets (size = 1000), we adopted Galipaud et al.'s (2014) approach with custom modifications of the data.simulation function, which used the multivariate normal distribution function rmvnorm in the R package mvtnorm (v1.0-5, Genz et al. 2016). Each dataset was simulated with a preset correlation structure between a response variable (y) and four predictors (x1, x2, x3, x4). The first three (genuine) predictors were set to be strongly, moderately, and weakly correlated with the response variable, respectively (denoted by large, medium, and small Pearson correlation coefficients, r), while the correlation between the response and the last (spurious) predictor was set to zero. We simulated datasets with three levels of difference in the correlation coefficients of consecutive predictors, ∆r = 0.1, 0.2, and 0.3. These three levels of ∆r resulted in three correlation structures between the response and the four predictors: (0.3, 0.2, 0.1, 0.0), (0.6, 0.4, 0.2, 0.0), and (0.8, 0.6, 0.3, 0.0), respectively. We repeated the simulation procedure 200 times for each of the three preset correlation structures (600 datasets in total) for later LM fitting. For GLM fitting, we modified the simulation procedure with additional steps, converting the continuous response into binary data O (e.g., occurrence data with 0 for absence and 1 for presence). We tested the WiBB method, along with two other methods, the relative sum of weights (SWi) and the standardized beta (β*), to evaluate their ability to correctly rank predictor importance under various scenarios. The empirical dataset of 71 Mimulus species was compiled from occurrence coordinates and corresponding values extracted from climatic layers of the WorldClim dataset (www.worldclim.org), and we applied the WiBB method to infer important predictors of their geographical distributions.
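For readers who want to reproduce the flavor of this simulation without the original R scripts, the sketch below draws a response and four predictors from a multivariate normal distribution with a preset vector of correlations between the response and each predictor, then converts the response to binary occurrences for the GLM case. It is a minimal Python illustration, not the authors' data.simulation code; the default correlation vector and the median-split conversion are assumptions, and the full covariance matrix must be positive definite (the largest ∆r scenario requires nonzero inter-predictor correlations).

```python
import numpy as np

def simulate_dataset(r=(0.6, 0.4, 0.2, 0.0), n=1000, seed=0):
    """Simulate a response y and four predictors x1..x4 whose correlations with y
    follow the preset vector r (one of the structures described above)."""
    rng = np.random.default_rng(seed)
    cov = np.eye(5)
    cov[0, 1:] = r                      # corr(y, x_i); predictors assumed uncorrelated here,
    cov[1:, 0] = r                      # which keeps this particular covariance matrix valid
    data = rng.multivariate_normal(np.zeros(5), cov, size=n)
    y, X = data[:, 0], data[:, 1:]
    occ = (y > np.median(y)).astype(int)  # assumed median split to get binary occurrences for GLM
    return y, occ, X

y, occ, X = simulate_dataset()
print(np.corrcoef(np.column_stack([y, X]), rowvar=False)[0, 1:])  # roughly (0.6, 0.4, 0.2, 0.0)
```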
The Best Management Practices Statistical Estimator (BMPSE) version 1.2.0 was developed by the U.S. Geological Survey (USGS), in cooperation with the Federal Highway Administration (FHWA) Office of Project Delivery and Environmental Review, to provide planning-level information about the performance of structural best management practices for decision makers, planners, and highway engineers to assess and mitigate possible adverse effects of highway and urban runoff on the Nation's receiving waters (Granato 2013, 2014; Granato and others, 2021). The BMPSE was assembled by using a Microsoft Access® database application to facilitate calculation of BMP performance statistics. Granato (2014) developed quantitative methods to estimate values of the trapezoidal-distribution statistics, correlation coefficients, and the minimum irreducible concentration (MIC) from available data. Granato (2014) developed the BMPSE to hold and process data from the International Stormwater Best Management Practices Database (BMPDB, www.bmpdatabase.org). Version 1.0 of the BMPSE contained a subset of the data from the 2012 version of the BMPDB; the current version of the BMPSE (1.2.0) contains a subset of the data from the December 2019 version of the BMPDB. Selected data from the BMPDB were screened for import into the BMPSE in consultation with Jane Clary, the data manager for the BMPDB. Modifications included identifying water quality constituents, making measurement units consistent, identifying paired inflow and outflow values, and converting BMPDB water quality values set as half the detection limit back to the detection limit. Total polycyclic aromatic hydrocarbon (PAH) values were added to the BMPSE from BMPDB data; they were calculated from individual PAH measurements at sites with enough data to calculate totals. The BMPSE tool can sort and rank the data, calculate plotting positions, calculate initial estimates, and calculate potential correlations to facilitate the distribution-fitting process (Granato, 2014). For water-quality ratio analysis, the BMPSE generates the input files and the list of filenames for each constituent within the Graphical User Interface (GUI). The BMPSE calculates the Spearman's rho (ρ) and Kendall's tau (τ) correlation coefficients with their respective 95-percent confidence limits and the probability that each correlation coefficient value is not significantly different from zero by using standard methods (Granato, 2014). If the 95-percent confidence limit values are of the same sign, then the correlation coefficient is statistically different from zero. For hydrograph extension, the BMPSE calculates ρ and τ between the inflow volume and the hydrograph-extension values (Granato, 2014). For volume reduction, the BMPSE calculates ρ and τ between the inflow volume and the ratio of outflow to inflow volumes (Granato, 2014). For water-quality treatment, the BMPSE calculates ρ and τ between the inflow concentrations and the ratio of outflow to inflow concentrations (Granato, 2014, 2020). The BMPSE also calculates ρ between the inflow and the outflow concentrations when a water-quality treatment analysis is done. The current version (1.2.0) of the BMPSE also has the option to calculate urban-runoff quality statistics from inflows to BMPs by using computer code developed for the Highway Runoff Database (Granato and Cazenas, 2009; Granato, 2019).
References:
Granato, G.E., 2013, Stochastic empirical loading and dilution model (SELDM) version 1.0.0: U.S. Geological Survey Techniques and Methods, book 4, chap. C3, 112 p., CD-ROM, https://pubs.usgs.gov/tm/04/c03.
Granato, G.E., 2014, Statistics for stochastic modeling of volume reduction, hydrograph extension, and water-quality treatment by structural stormwater runoff best management practices (BMPs): U.S. Geological Survey Scientific Investigations Report 2014–5037, 37 p., http://dx.doi.org/10.3133/sir20145037.
Granato, G.E., 2019, Highway-Runoff Database (HRDB) Version 1.1.0: U.S. Geological Survey data release, https://doi.org/10.5066/P94VL32J.
Granato, G.E., and Cazenas, P.A., 2009, Highway-Runoff Database (HRDB Version 1.0)--A data warehouse and preprocessor for the stochastic empirical loading and dilution model: Washington, D.C., U.S. Department of Transportation, Federal Highway Administration, FHWA-HEP-09-004, 57 p., https://pubs.usgs.gov/sir/2009/5269/disc_content_100a_web/FHWA-HEP-09-004.pdf.
Granato, G.E., Spaetzel, A.B., and Medalie, L., 2021, Statistical methods for simulating structural stormwater runoff best management practices (BMPs) with the stochastic empirical loading and dilution model (SELDM): U.S. Geological Survey Scientific Investigations Report 2020–5136, 41 p., https://doi.org/10.3133/sir20205136.
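The BMPSE itself is a Microsoft Access application, but the correlation screening it performs can be illustrated in a few lines. The sketch below computes Spearman's rho and Kendall's tau for an inflow volume versus an outflow/inflow ratio and builds an approximate 95-percent confidence interval for rho via the Fisher z-transform; if both confidence limits share a sign, the correlation is treated as different from zero. This is a generic illustration, not USGS code, and the Fisher-z standard error shown is only one common approximation.

```python
import numpy as np
from scipy import stats

def corr_with_ci(x, y):
    """Spearman's rho with an approximate 95-percent confidence interval and Kendall's tau."""
    rho, _ = stats.spearmanr(x, y)
    tau, p_tau = stats.kendalltau(x, y)
    n = len(x)
    z = np.arctanh(rho)
    se = 1.03 / np.sqrt(n - 3)               # one common approximation for Spearman's rho
    lo, hi = np.tanh(z - 1.96 * se), np.tanh(z + 1.96 * se)
    significant = (lo > 0) or (hi < 0)        # confidence limits of the same sign
    return rho, (lo, hi), significant, tau, p_tau

rng = np.random.default_rng(1)
inflow = rng.lognormal(3, 1, 50)              # synthetic inflow volumes
ratio = rng.uniform(0.2, 1.2, 50)             # synthetic outflow/inflow concentration ratios
print(corr_with_ci(inflow, ratio))
```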
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Vitamin D insufficiency appears to be prevalent in SLE patients. Multiple factors potentially contribute to lower vitamin D levels, including limited sun exposure, the use of sunscreen, darker skin complexion, aging, obesity, specific medical conditions, and certain medications. The study aims to assess the risk factors associated with low vitamin D levels in SLE patients in the southern part of Bangladesh, a region noted for a high prevalence of SLE. The research additionally investigates the possible correlation between vitamin D and the SLEDAI score, seeking to understand the potential benefits of vitamin D in enhancing disease outcomes for SLE patients. The study incorporates a dataset consisting of 50 patients from the southern part of Bangladesh and evaluates their clinical and demographic data. An initial exploratory data analysis is conducted to gain insights into the data, which includes calculating means and standard deviations, performing correlation analysis, and generating heat maps. Relevant inferential statistical tests, such as the Student’s t-test, are also employed. In the machine learning part of the analysis, this study utilizes supervised learning algorithms, specifically Linear Regression (LR) and Random Forest (RF). To optimize the hyperparameters of the RF model and mitigate the risk of overfitting given the small dataset, a 3-Fold cross-validation strategy is implemented. The study also calculates bootstrapped confidence intervals to provide robust uncertainty estimates and further validate the approach. A comprehensive feature importance analysis is carried out using RF feature importance, permutation-based feature importance, and SHAP values. The LR model yields an RMSE of 4.83 (CI: 2.70, 6.76) and MAE of 3.86 (CI: 2.06, 5.86), whereas the RF model achieves better results, with an RMSE of 2.98 (CI: 2.16, 3.76) and MAE of 2.68 (CI: 1.83,3.52). Both models identify Hb, CRP, ESR, and age as significant contributors to vitamin D level predictions. Despite the lack of a significant association between SLEDAI and vitamin D in the statistical analysis, the machine learning models suggest a potential nonlinear dependency of vitamin D on SLEDAI. These findings highlight the importance of these factors in managing vitamin D levels in SLE patients. The study concludes that there is a high prevalence of vitamin D insufficiency in SLE patients. Although a direct linear correlation between the SLEDAI score and vitamin D levels is not observed, machine learning models suggest the possibility of a nonlinear relationship. Furthermore, factors such as Hb, CRP, ESR, and age are identified as more significant in predicting vitamin D levels. Thus, the study suggests that monitoring these factors may be advantageous in managing vitamin D levels in SLE patients. Given the immunological nature of SLE, the potential role of vitamin D in SLE disease activity could be substantial. Therefore, it underscores the need for further large-scale studies to corroborate this hypothesis.
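As a rough illustration of the modeling workflow described above (a 3-fold cross-validated random forest plus bootstrapped error intervals), the sketch below uses hypothetical file and column names (Hb, CRP, ESR, age, SLEDAI, vitamin_d); it is not the study's code and will not reproduce the reported numbers.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import mean_squared_error

# Hypothetical file and column names standing in for the 50-patient dataset.
df = pd.read_csv("sle_vitamin_d.csv")
X, y = df[["Hb", "CRP", "ESR", "age", "SLEDAI"]], df["vitamin_d"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

# 3-fold cross-validated hyperparameter search to limit overfitting on a small dataset.
search = GridSearchCV(RandomForestRegressor(random_state=42),
                      {"n_estimators": [100, 300], "max_depth": [2, 4, None]}, cv=3)
search.fit(X_tr, y_tr)
pred = search.predict(X_te)

# Bootstrapped 95% confidence interval for the held-out RMSE.
rng = np.random.default_rng(0)
rmses = [np.sqrt(mean_squared_error(y_te.iloc[idx], pred[idx]))
         for idx in (rng.integers(0, len(y_te), len(y_te)) for _ in range(1000))]
print("RMSE %.2f (CI %.2f, %.2f)" % (np.mean(rmses), *np.percentile(rmses, [2.5, 97.5])))
```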
License: Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Objective: Build a robust predictive model to estimate the log_price of homestay listings based on
comprehensive analysis of their characteristics, amenities, and host information.
First, make sure the entire dataset is clean and ready to use.
1. Feature Engineering:
Task: Enhance the dataset by creating actionable and insightful features. Calculate Host_Tenure by
determining the number of years from host_since to the current date, providing a measure of host
experience. Generate Amenities_Count by counting the items listed in the amenities array to quantify
property offerings. Determine Days_Since_Last_Review by calculating the days between last_review
and today to assess listing activity and relevance.
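A minimal pandas sketch of these three engineered features follows; the file name is hypothetical, and the amenities parsing assumes a brace- or bracket-delimited string, so adjust it to the actual storage format.

```python
import pandas as pd

df = pd.read_csv("homestay_listings.csv")        # hypothetical file name
today = pd.Timestamp.today().normalize()

# Host_Tenure: years of hosting experience, from host_since to today.
df["Host_Tenure"] = (today - pd.to_datetime(df["host_since"])).dt.days / 365.25

# Amenities_Count: number of items in the amenities string, e.g. "{TV,Wifi,Kitchen}".
df["Amenities_Count"] = (
    df["amenities"].fillna("")
      .str.strip("{}[]")
      .str.split(",")
      .apply(lambda items: sum(1 for i in items if i.strip()))
)

# Days_Since_Last_Review: recency of the last review.
df["Days_Since_Last_Review"] = (today - pd.to_datetime(df["last_review"])).dt.days
```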
2. Exploratory Data Analysis (EDA):
Task: Conduct a deep dive into the dataset to uncover underlying patterns and relationships. Analyze how
pricing (log_price) correlates with both categorical (such as room_type and property_type) and
numerical features (like accommodates and number_of_reviews). Utilize statistical tools and
visualizations such as correlation matrices, histograms for distribution analysis, and scatter plots to explore
relationships between variables.
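One possible starting point for this EDA, assuming the dataframe from the previous sketch, is shown below.

```python
import matplotlib.pyplot as plt
import seaborn as sns

num_cols = ["log_price", "accommodates", "number_of_reviews", "Amenities_Count", "Host_Tenure"]
sns.heatmap(df[num_cols].corr(), annot=True, cmap="vlag")   # correlation matrix
plt.show()

df["log_price"].hist(bins=50)                               # distribution of the target
plt.show()

sns.boxplot(data=df, x="room_type", y="log_price")          # categorical feature vs. price
plt.show()
```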
3. Geospatial Analysis:
Task: Investigate the geographical data to understand regional pricing trends. Plot listings on a map using
latitude and longitude data to visually assess price distribution. Examine if certain neighbourhoods or
proximity to city centres influence pricing, providing a spatial perspective to the pricing strategy.
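A simple latitude/longitude scatter colored by log_price is often enough for a first spatial look; a sketch follows, with column names as given in the brief.

```python
import matplotlib.pyplot as plt

plt.scatter(df["longitude"], df["latitude"], c=df["log_price"], s=2, cmap="viridis")
plt.colorbar(label="log_price")
plt.xlabel("longitude"); plt.ylabel("latitude")
plt.title("Listing prices by location")
plt.show()
```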
4. Sentiment Analysis on Textual Data:
Task: Apply advanced natural language processing techniques to the description texts to extract
sentiment scores. Use sentiment analysis tools to determine whether positive or negative descriptions
influence listing prices, incorporating these sentiment scores as a feature in the predictive model being trained.
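One way to generate such a score is NLTK's VADER analyzer, sketched below; any sentiment tool could be substituted, and the column names follow the brief.

```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")
sia = SentimentIntensityAnalyzer()

# Compound score in [-1, 1]; used later as a model feature.
df["description_sentiment"] = df["description"].fillna("").apply(
    lambda text: sia.polarity_scores(text)["compound"])
print(df[["description_sentiment", "log_price"]].corr())
```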
5. Amenities Analysis:
Task: Thoroughly parse and analyse the amenities provided in the listings. Identify which amenities are
most associated with higher or lower prices by applying statistical tests to determine correlations, thereby
informing both pricing strategy and model inputs.
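A sketch of one such test is below: it compares log_price between listings with and without a given amenity using Welch's t-test; the amenity names are illustrative only.

```python
from scipy import stats

amenities_to_test = ["Wireless Internet", "Air conditioning", "Pool"]   # illustrative names
for amenity in amenities_to_test:
    has = df["amenities"].str.contains(amenity, case=False, na=False)
    with_a, without_a = df.loc[has, "log_price"], df.loc[~has, "log_price"]
    t, p = stats.ttest_ind(with_a, without_a, equal_var=False)           # Welch's t-test
    print(f"{amenity}: mean difference {with_a.mean() - without_a.mean():.3f}, p = {p:.3g}")
```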
6. Categorical Data Encoding:
Task: Convert categorical data into a format suitable for machine learning analysis. Apply one-hot encoding
to variables like room_type, city, and property_type, ensuring that the model can interpret these as
distinct features without any ordinal implication.
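With pandas this is a one-liner, sketched below on the columns named in the brief.

```python
import pandas as pd

encoded = pd.get_dummies(df, columns=["room_type", "city", "property_type"], dtype=int)
print(encoded.filter(like="room_type_").columns.tolist())   # the new indicator columns
```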
7. Model Development and Training:
Task: Design and train predictive models to estimate log_price. Begin with a simple linear regression to
establish a baseline, then explore more complex models such as RandomForest and GradientBoosting to
better capture non-linear relationships and interactions between features. Document (briefly, within the
Jupyter notebook itself) the model-building process, specifying the choice of algorithms and rationale.
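A compact baseline-versus-ensemble comparison, assuming the encoded dataframe from the previous sketch, might look like the following.

```python
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

X = encoded.drop(columns=["log_price"]).select_dtypes("number").fillna(0)
y = encoded["log_price"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

for name, model in [("linear baseline", LinearRegression()),
                    ("random forest", RandomForestRegressor(random_state=0)),
                    ("gradient boosting", GradientBoostingRegressor(random_state=0))]:
    model.fit(X_tr, y_tr)
    print(name, "R^2 =", round(r2_score(y_te, model.predict(X_te)), 3))
```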
8. Model Optimization and Validation:
Task: Systematically optimize the models to achieve the best performance. Employ techniques like grid
search to experiment with different hyperparameter settings. Validate model choices through techniques
like k-fold cross-validation, ensuring the model generalizes well to unseen data.
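A possible grid-search setup with k-fold cross-validation is sketched below; the parameter grid is illustrative, not prescriptive, and X_tr, y_tr come from the previous sketch.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = {"n_estimators": [200, 500], "max_depth": [None, 10, 20], "min_samples_leaf": [1, 5]}
search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid,
                      cv=5, scoring="neg_root_mean_squared_error", n_jobs=-1)
search.fit(X_tr, y_tr)
print(search.best_params_, "cross-validated RMSE:", -search.best_score_)
best_model = search.best_estimator_
```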
9. Feature Importance and Model Insights:
Task: Analyze the trained models to identify which features most significantly impact log_price. Utilize
model-specific methods like feature importance scores for tree-based models and SHAP values for an in-depth understanding of feature contributions.
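Assuming the tuned random forest from the previous sketch, the code below prints impurity-based importances and produces a SHAP summary plot (requires the shap package).

```python
import pandas as pd
import shap

# Impurity-based importances from the tuned forest (best_model from the previous sketch).
importances = pd.Series(best_model.feature_importances_, index=X_tr.columns).sort_values(ascending=False)
print(importances.head(10))

# SHAP values give per-prediction, signed feature contributions.
explainer = shap.TreeExplainer(best_model)
shap_values = explainer.shap_values(X_te)
shap.summary_plot(shap_values, X_te)
```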
10. Predictive Performance Assessment:
Task: Critically evaluate the performance of the final model on a reserved test set. Use metrics such as
Root Mean Squared Error (RMSE) and R-squared to assess accuracy and goodness of fit. Provide a detailed
analysis of the residuals to check for any patterns that might suggest model biases or misfit.
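A minimal evaluation sketch on the held-out set, continuing from the previous sketches, is shown below.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error, r2_score

pred = best_model.predict(X_te)                     # best_model, X_te, y_te from earlier sketches
rmse = np.sqrt(mean_squared_error(y_te, pred))
print(f"RMSE = {rmse:.3f}, R^2 = {r2_score(y_te, pred):.3f}")

# Residuals vs. predictions: structure here would point to bias or misfit.
residuals = y_te - pred
plt.scatter(pred, residuals, s=2)
plt.axhline(0, color="red")
plt.xlabel("predicted log_price"); plt.ylabel("residual")
plt.show()
```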
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The aim of this study was to use linear body measurements to develop and validate a regression-based model for prediction of live weights (LW) of pigs reared under smallholder settings in rural areas in the southern highlands of Tanzania. LW of 400 pigs (range 7 to 91 kg) was measured, along with their heart girths (HG) and body lengths (BL). BL was measured from the midpoint between the ears to the tail base. HG was measured as chest circumference just behind the front legs. LW was determined using a portable hanging scale. An analysis of covariance was performed to test for differences in LW between male and female pigs, including age, HG and BL as covariates. LW was regressed on HG and BL using simple and multiple linear regressions. Models were developed for all pig ages, and separately for market/breeding-age pigs and those below market/breeding age. Model validation was done using a split-samples approach, followed by PRESS-related statistics. Model efficiency and accuracy were assessed using the coefficient of determination, R2, and the standard deviation of the random error, respectively. Model stability was determined by assessing 'shrinkage' of the R2 value. Results showed that HG was the best predictor of LW in market/breeding-age pigs (model equation: LW = 1.22 × HG − 52.384; R2 = 0.94, error = 3.7). BL, age and sex of pigs did not influence LW estimates. It is expected that LW estimation tools will be developed to enable more accurate estimation of LW in the pig value chain in the area.
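The reported heart-girth equation can be applied directly; a small sketch follows, assuming HG is in centimetres and LW in kilograms as in the study.

```python
import numpy as np

def predict_live_weight(heart_girth_cm):
    """Live weight (kg) of a market/breeding-age pig from heart girth (cm),
    using the reported equation LW = 1.22*HG - 52.384 (R^2 = 0.94)."""
    return 1.22 * np.asarray(heart_girth_cm, dtype=float) - 52.384

print(predict_live_weight([90, 100, 110]))   # roughly 57, 70, 82 kg
```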
By Noah Rippner [source]
This dataset offers a unique opportunity to examine patterns and trends in cancer rates in the United States at the individual county level. Using data from cancer.gov and the US Census American Community Survey, it provides insight into how the age-adjusted death rate, average deaths per year, and recent trends vary between counties, along with other key metrics such as average annual counts, whether the objective of 45.5 was met, and the recent trend (2) in death rates. Linear regression models built on these data can reveal correlations between variables that help us better understand cancer prevalence across different counties over time, making it easier to target health initiatives and resources where they are needed.
This Kaggle dataset provides county-level data from the US Census American Community Survey and cancer.gov for exploring correlations between county-level cancer rates, trends, and mortality statistics. It contains records for all U.S. counties covering the age-adjusted death rate, average deaths per year, recent trend (2) in death rates, average annual count of cases detected within 5 years, and whether or not an objective of 45.5 (1) was met in the county associated with each row in the table.
To use this dataset to its fullest potential, you should be comfortable with simple descriptive analytics: calculating summary statistics such as the mean and median; summarizing categorical variables with frequency tables; creating data visualizations such as charts and histograms; applying linear regression or other machine learning techniques such as support vector machines (SVMs), random forests, or neural networks; distinguishing supervised from unsupervised learning; reviewing diagnostic tests to evaluate your models; interpreting your findings; forming hypotheses about the patterns discovered during exploration; and communicating results through effective presentations or documents. That understanding will let you apply these methods of analysis to this dataset accurately and effectively.
Once these concepts are understood, start by importing the data into your analysis tool of choice (Tableau Public or Desktop, QlikView, the SAS analytical suite, or Python notebooks), loading packages such as scikit-learn if you plan to build predictive models in Python. A brief description of the table's column structure is provided above. With basic SQL you can run simple statistical queries, select subsets of columns under specified conditions, and sort by specific attributes; in Python you can parse the portions of the data you need, group and aggregate categories, and join tables where necessary before modeling. From there, explore the available features, build correlation and covariance matrices, and examine distributions and scatter plots of relevant metrics to reveal trends.
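As a concrete starting point, the sketch below loads the table into pandas and fits a simple linear regression with statsmodels; the file name and exact column labels are assumptions and should be matched to the actual CSV.

```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("county_cancer.csv")                 # hypothetical file name
print(df[["Age-Adjusted Death Rate", "Average Deaths per Year"]].describe())

# Simple linear regression of the death rate on the average annual count (illustrative labels).
X = sm.add_constant(df["Average Annual Count"])
model = sm.OLS(df["Age-Adjusted Death Rate"], X, missing="drop").fit()
print(model.summary())
```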
- Building a predictive cancer incidence model based on county-level demographic data to identify high-risk areas and target public health interventions.
- Analyzing correlations between age-adjusted death rate, average annual count, and recent trends in order to develop more effective policy initiatives for cancer prevention and healthcare access.
- Utilizing the dataset to construct a machine learning algorithm that can predict county-level mortality rates based on socio-economic factors such as poverty levels and educational attainment rates
If you use this dataset i...
License: CC0 1.0 (https://spdx.org/licenses/CC0-1.0.html)
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Repeated-measures MANOVA table combining all three performance measures of all 10 classifiers.
License: CC0 1.0, https://creativecommons.org/publicdomain/zero/1.0/
By [source]
Welcome to Kaggle's dataset, where we provide rich and detailed insights into professional football players. Analyze player performance and team data with over 125 different metrics covering everything from goal involvement to tackles won, errors made and clean sheets kept. With the high levels of granularity included in our analysis, you can identify which players are underperforming or stand out from their peers for areas such as defense, shot stopping and key passes. Discover current trends in the game or uncover players' hidden value with this comprehensive dataset - a must-have resource for any aspiring football analyst!
Define Performance: The first step of using this dataset is defining what type of performance you are measuring. Are you looking at total goals scored? Assists made? Shots on target? This will allow you to choose which metrics from the dataset best fit your criteria.
Descriptive Analysis: Once you have chosen your metric(s), it's time for descriptive analysis. This means analyzing the patterns within the data that contribute towards that metric(s). Does one team have more potential assist makers than another? What about shot accuracy or tackles won %? With descriptive analysis, we'll look for general trends across teams or specific players that influence performance in a meaningful way.
Predictive Analysis: Finally, we can move on to predictive analysis. This type of analysis seeks to answer two questions: what factors predict player performance, and which factors are most important when predicting performance? Utilizing various predictive models (e.g., logistic regression or random forest), we can determine which variables in our dataset best explain a certain metric's outcome, such as expected goals per match, and build models that accurately predict future outcomes based on given input values associated with those factors.
By following these steps outlined here, you'll be able to get started in finding relationships between different metrics from this dataset and leveraging these insights into predictions about player performance!
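A sketch of such a predictive step is shown below; it uses the defender file described further down this page, and the target column ("Goals") is a placeholder that may not exist under that exact name.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

df = pd.read_csv("DEF PerApp 2GWs.csv")
features = ["App.", "Minutes", "Shots", "Shots on Target"]   # columns described below
target = "Goals"                                             # hypothetical target column

model = RandomForestRegressor(random_state=0)
scores = cross_val_score(model, df[features].fillna(0), df[target], cv=5, scoring="r2")
print("cross-validated R^2:", scores.mean().round(3))
```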
- Creating an advanced predictive analytics model: By using the data in this dataset, it would be possible to create an advanced predictive analytics model that can analyze player performance and provide more accurate insights on which players are likely to have the most impact during a given season.
- Using Machine Learning algorithms to identify potential transfer targets: By using a variety of metrics included in this dataset, such as shots, shots on target and goals scored, it would be possible to use Machine Learning algorithms to identify potential transfer targets for a team.
- Analyzing positional differences between players: This dataset contains information about each player's position as well as their performance metrics across various aspects of the game (e.g., crosses attempted, defensive clearances). Thus it could be used for analyzing how certain positional groupings perform differently from one another in certain aspects of their play over different stretches of time or within one season or matchday in particular.
If you use this dataset in your research, please credit the original authors.
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: DEF PerApp 2GWs.csv

| Column name | Description |
|:----------------|:--------------------------------------|
| Name | Name of the player. (String) |
| App. | Number of appearances. (Integer) |
| Minutes | Number of minutes played. (Integer) |
| Shots | Number of shots taken. (Integer) |
| Shots on Target | Number of shots on target. (Integer) |
| ... | ... |
This data release consists of Microsoft Excel workbooks, shapefiles, and a figure (png format) related to a cooperative project between the U.S. Geological Survey (USGS) and the South Florida Water Management District (SFWMD) to derive projected future change factors for precipitation depth-duration-frequency (DDF) curves at 174 National Oceanic and Atmospheric Administration (NOAA) Atlas 14 stations in central and south Florida. The change factors were computed as the ratio of projected future (2050-2089) to historical (1966-2005) extreme precipitation depths fitted to extreme precipitation data using a constrained maximum likelihood (CML) approach. The change factors are tabulated by duration (1, 3, and 7 days) and return period (5, 10, 25, 50, 100, and 200 years). The official historical NOAA Atlas 14 DDF curves based on partial-duration series (PDS) can be multiplied by the change factors derived in this project to determine projected future extreme precipitation for events of a given duration and return period. Various statistical, dynamical and hybrid downscaled precipitation datasets were used to derive the change factors at the grid cells closest to the NOAA Atlas 14 stations including (1) the Coordinated Regional Downscaling Experiment (CORDEX), (2) the Localized Constructed Analogues (LOCA) dataset, (3) the Multivariate Adaptive Constructed Analogs (MACA) dataset, (4) the Analog Resampling and Statistical Scaling Method by Jupiter Intelligence using the Weather Research and Forecasting Model (JupiterWRF). The emission scenarios evaluated include representative concentration pathways RCP4.5 and RCP8.5 from the Coupled Model Intercomparison Project Phase 5 (CMIP5) for the downscaled climate datasets CORDEX, LOCA, and MACA. The emission scenarios evaluated for the JupiterWRF downscaled dataset include RCP8.5 from CMIP5, and shared socioeconomic pathways SSP2-4.5 and SSP5-8.5 from the Coupled Model Intercomparison Project Phase 6 (CMIP6). Only daily durations are evaluated for JupiterWRF. When applying change factors to the historical NOAA Atlas 14 DDF curves to derive projected future precipitation DDF curves for the entire range of durations and return periods evaluated as part of this project, there is a possibility that the resulting projected future DDF curves may be inconsistent across duration and return period. By inconsistent it is meant that the precipitation depths may decrease for longer durations instead of increasing. Depending on the change factors used, this may happen in up to 6% of cases. In such a case, it is recommended that users use the higher of the projected future precipitation depths derived for the duration of interest and the previous shorter duration. This data release consists of four shapefiles: (1) polygons for the basins defined in the South Florida Water Management District (SFWMD)'s ArcHydro Enhanced Database (AHED) (AHED_basins.shp); (2) polygons of climate regions (Climate_regions.shp); (3) polygons of Areal Reduction Factor (ARF) regions for the state of Florida (ARF_regions.shp); and (4) point locations of NOAA Atlas 14 stations in central and south Florida for which depth-duration-frequency curves and change factors of precipitation depths were developed as part of this project (Atlas14_stations.shp). This data release also includes 21 tables.
Four tables contain computed change factors for the four downscaled climate datasets: (1) CORDEX (CF_CORDEX_future_to_historical.xlsx); (2) LOCA (CF_LOCA_future_to_historical.xlsx); (3) MACA (CF_MACA_future_to_historical.xlsx); and (4) JupiterWRF (CF_JupiterWRF_future_to_historical.xlsx). Eight tables contain the corresponding DDF values for the historical and projected future periods in each of the four downscaled climate datasets: (1) CORDEX historical (DDF_CORDEX_historical.xlsx); (2) CORDEX projected future (DDF_CORDEX_future.xlsx); (3) LOCA historical (DDF_LOCA_historical.xlsx); (4) LOCA projected future (DDF_LOCA_future.xlsx); (5) MACA historical (DDF_MACA_historical.xlsx); (6) MACA projected future (DDF_MACA_future.xlsx); (7) JupiterWRF historical (DDF_JupiterWRF_historical.xlsx); and (8) JupiterWRF projected future (DDF_JupiterWRF_future.xlsx). Six tables contain quantiles of change factors at 174 NOAA Atlas 14 stations in central and south Florida derived from various downscaled climate datasets considering: (1) all models and all future emission scenarios evaluated (CFquantiles_future_to_historical_all_models_allRCPs.xlsx); (2) all models and only the RCP4.5 and SSP2-4.5 future emission scenarios (CFquantiles_future_to_historical_all_models_RCP4.5.xlsx); (3) all models and only the RCP8.5 and SSP5-8.5 future emission scenarios (CFquantiles_future_to_historical_all_models_RCP8.5.xlsx); (4) best models and all future emission scenarios evaluated (CFquantiles_future_to_historical_best_models_allRCPs.xlsx); (5) best models and only the RCP4.5 and SSP2-4.5 future emission scenarios (CFquantiles_future_to_historical_best_models_RCP4.5.xlsx); and (6) best models and only the RCP8.5 and SSP5-8.5 future emission scenarios (CFquantiles_future_to_historical_best_models_RCP8.5.xlsx). Finally, three tables contain miscellaneous information: (1) information about downscaled climate datasets and National Oceanic and Atmospheric Administration (NOAA) Atlas 14 stations used in this project (Datasets_station_information.xlsx); (2) best models for each downscaled climate dataset and for all downscaled climate datasets considered together (Best_model_lists.xlsx); and (3) areal reduction factors by region in Florida (Areal_reduction_factors.xlsx). An R script is provided which generates boxplots of change factors at a NOAA Atlas 14 station, or for all NOAA Atlas 14 stations in an ArcHydro Enhanced Database (AHED) basin or county (create_boxplot.R). A Microsoft Word file documenting code usage and available options is also provided within this data release (Documentation_R_script_create_boxplot.docx). Disclaimer: As a reminder, projected future (2050-89) and historical (1966-2005) DDF curves fitted to extreme precipitation data from models in each downscaled climate dataset are provided as part of this data release as a way to verify the computed change factors. However, these model-based projected future and historical DDF curves are expected to be biased and only their ratio (change factor) is considered a reasonable approximation of how historically-observed DDF depths might be multiplicatively amplified or muted in the future period 2050-89. An error was identified in the bias-corrected CORDEX data used as described at https://na-cordex.org/bias-correction-error.html. 
Datasets developed previously by the USGS for this data release were based on these erroneous data and were originally published at: Irizarry-Ortiz, M.M., and Stamm, J.F., 2021, Change factors to derive future precipitation depth-duration-frequency (DDF) curves at 174 National Oceanic and Atmospheric Administration (NOAA) Atlas 14 stations in central and south Florida: U.S. Geological Survey data release, https://doi.org/10.5066/P9KEMHYM. Data downloaded from that ScienceBase page prior to April 1, 2022, are based on this erroneous bias-corrected CORDEX dataset and have been superseded by the data on this page. On January 10, 2022, the University Corporation for Atmospheric Research notified the USGS that a revised set of bias-corrected CORDEX data was available for download. The USGS recomputed depth-duration-frequency (DDF) curves and change factors based on the revised CORDEX dataset, and the updated results were posted on this ScienceBase page on April 1, 2022. Data downloaded from this page are based on the revised bias-corrected CORDEX dataset. To obtain the previous superseded dataset, please contact Michelle Irizarry-Ortiz at mirizarry-ortiz@usgs.gov. First release: October 2021 Revised: March 2022
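To make the application of the change factors concrete, the sketch below multiplies a small, made-up historical DDF table by made-up change factors and then enforces the recommended consistency rule (projected depths should not decrease with duration) with a running maximum down the duration axis; the numbers are illustrative only and are not taken from this release.

```python
import numpy as np

# Hypothetical historical NOAA Atlas 14 depths (inches) and change factors for one station,
# indexed by duration (rows: 1, 3, 7 days) and return period (columns: three return periods).
historical_ddf = np.array([[4.0, 5.2, 6.5],
                           [5.5, 7.0, 8.8],
                           [6.8, 8.6, 10.9]])
change_factor = np.array([[1.10, 1.12, 1.15],
                          [1.08, 1.11, 1.14],
                          [1.05, 1.09, 1.13]])

future_ddf = historical_ddf * change_factor

# Consistency rule: take the higher of the current duration's depth and the previous
# shorter duration's depth (a running maximum down each return-period column).
future_ddf = np.maximum.accumulate(future_ddf, axis=0)
print(future_ddf)
```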
Label-free proteomics expression data sets often exhibit data heterogeneity and missing values, necessitating the development of effective normalization and imputation methods. The selection of appropriate normalization and imputation methods is inherently data-specific, and choosing the optimal approach from the available options is critical for ensuring robust downstream analysis. This study aimed to identify the most suitable combination of these methods for quality control and accurate identification of differentially expressed proteins. In this study, we developed nine combinations by integrating three normalization methods, locally weighted linear regression (LOESS), variance stabilization normalization (VSN), and robust linear regression (RLR), with three imputation methods: k-nearest neighbors (k-NN), local least-squares (LLS), and singular value decomposition (SVD). We utilized statistical measures, including the pooled coefficient of variation (PCV), pooled estimate of variance (PEV), and pooled median absolute deviation (PMAD), to assess intragroup and intergroup variation. The combinations yielding the lowest values corresponding to each statistical measure were chosen as the data set's suitable normalization and imputation methods. The performance of this approach was tested using two spiked-in standard label-free proteomics benchmark data sets. The identified combinations returned a low NRMSE and showed better performance in identifying spiked-in proteins. The developed approach can be accessed through the R package named 'lfproQC' and a user-friendly Shiny web application (https://dabiniasri.shinyapps.io/lfproQC and http://omics.icar.gov.in/lfproQC), making it a valuable resource for researchers looking to apply this method to their data sets.
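The lfproQC package computes these measures internally; purely as an illustration of the pooled coefficient of variation (PCV) idea, a small Python sketch follows (rows are proteins, columns are samples; everything here is synthetic).

```python
import numpy as np
import pandas as pd

def pooled_cv(intensities: pd.DataFrame, sample_groups: dict) -> float:
    """Pooled coefficient of variation: per-protein CV within each replicate group,
    averaged over proteins and then over groups (lower = less intragroup variation)."""
    groups = pd.Series(sample_groups)
    per_group = []
    for _, members in groups.groupby(groups):
        sub = intensities[list(members.index)]
        per_group.append((sub.std(axis=1) / sub.mean(axis=1)).mean())
    return float(np.mean(per_group))

# Toy example: 100 proteins, two groups of three replicates each.
rng = np.random.default_rng(1)
data = pd.DataFrame(rng.lognormal(10, 1, (100, 6)), columns=list("ABCDEF"))
print(pooled_cv(data, {"A": "g1", "B": "g1", "C": "g1", "D": "g2", "E": "g2", "F": "g2"}))
```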
This study evaluates the consistency between in-situ measurements and gridded datasets for precipitation and temperature within the Great Salt Lake Basin, highlighting the significant implications for hydrological modelling and climate analysis. We analysed five widely recognized gridded datasets: GRIDMET, DAYMET, PRISM, NLDAS-2, and CONUS404, utilizing statistical metrics such as the Pearson Correlation Coefficient, Root Mean Square Error (RMSE), and Kling-Gupta Efficiency to assess their accuracy and reliability against ground truth data from 30 meteorological stations. Our findings indicate that the PRISM dataset outperformed others, demonstrating the lowest median RMSE values for both precipitation (approximately 1.9 mm/day) and temperature (approximately 0.9°C), which is attributed to its advanced interpolation methods that effectively incorporate orographic adjustments. In contrast, NLDAS-2 and CONUS404, despite their finer temporal resolutions, showed greater error variability and lower performance metrics, which may limit their utility for detailed hydrological applications. Through the use of visual analytical tools such as heatmaps and boxplots, we were able to vividly illustrate the performance disparities across the datasets, thereby providing a clear comparative analysis that underscores the strengths and weaknesses of each dataset. The study emphasizes the need for careful selection of gridded datasets based on specific regional characteristics to improve the accuracy and reliability of hydroclimatological studies and supports better-informed decisions in climate-related adaptations and policy-making. The insights gained from this analysis aim to guide researchers and practitioners in selecting the most appropriate datasets that align with the unique climatic and topographical conditions of the Great Salt Lake Basin, enhancing the efficacy of environmental forecasting and resource management strategies.
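For reference, the Kling-Gupta Efficiency and RMSE used in this comparison can be computed as in the sketch below (this uses the 2009 KGE formulation; the study's exact implementation may differ).

```python
import numpy as np

def kling_gupta_efficiency(sim, obs):
    """KGE = 1 - sqrt((r-1)^2 + (alpha-1)^2 + (beta-1)^2), where r is the Pearson
    correlation, alpha the ratio of standard deviations, beta the ratio of means."""
    sim, obs = np.asarray(sim, float), np.asarray(obs, float)
    r = np.corrcoef(sim, obs)[0, 1]
    alpha = sim.std() / obs.std()
    beta = sim.mean() / obs.mean()
    return 1.0 - np.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)

def rmse(sim, obs):
    sim, obs = np.asarray(sim, float), np.asarray(obs, float)
    return float(np.sqrt(np.mean((sim - obs) ** 2)))
```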
The provided content consists of two parts: first, the code of the agent-based simulation model (in the folder "Model") and, second, the data that are generated using the model and analyzed in the manuscript (in the folder "Datasets").
This agent-based simulation model aims to analyze the effects of limited information access (and limited memory) in Holmström's hidden action model
on the principal’s and the agent’s respective utilities,
the effort (action) the agent makes to perform a task, and
the premium parameter in the rule to share the outcome between the principal and the agent.
MATLAB R2019b or higher is required to run the model and to analyze the datasets. In addition, the following packages are required to run the model:
Parallel Computing Toolbox
Symbolic Math Toolbox
Optimization Toolbox
Global Optimization Toolbox
Statistics and Machine Learning Toolbox
Open the folder "Model". Find and double-click the file main.m (in the folder "agentization"). The MATLAB editor opens, and you can change the simulation parameters.
To run the model, you can either:
Type the script name (main) in the command line and press enter
Select the main.m file in the editor and press the run button (green triangle)
Please note: If a message pops up with the options "Change Folder", "Add to Path", "Cancel", and "Help", please choose "Add to Path".
You can set all relevant parameters in the file main.m.
umwSD: This is the standard deviation of the normal distribution from which the environmental variable is drawn. It is defined relative (in %) to the optimal outcome. We set it to either 5, 25, or 45.
jto: This is the number of simulation runs. We set it to 700 in all scenarios. You are free to change it to any number. However, please note that performing many simulation runs might take a long time.
limitedMemoryP: This parameter defines whether the principal’s memory is limited or not. The variable can be set to either true or false. If set to false, the principal’s memory is unlimited and changes in the variable "memoryP" have no effects.
limitedMemoryA: This parameter defines whether the agent’s memory is limited or not. The variable can be set to either true or false. If set to false, the agent’s memory is unlimited and changes in the variable "memoryA" have no effects.
memoryP: This variable defines the length of the principal’s memory (in periods). We set it either to 1, 3, or 5.
memoryA: This variable defines the length of the agent’s memory (in periods). We set it either to 1, 3, or 5.
The simulation model creates the folder "Results" in the project directory. This folder contains at least one subfolder. The subfolder's name includes, among other things, the values assigned to the variables umwSD (environment) and jto (number of simulation runs). This subfolder contains two further folders named "einzelneSims" (in which only intermediate results are saved) and "final" (in which the final simulation data are saved). The simulation output includes 61 variables. However, not all of these variables are used in the analysis because some are saved for verification only. The most important variables are the following (the ones used in the study are printed in bold font):
opta: The effort level proposed by the second-best solution of Holmström’s model.
a_A_sims: This is a 700*20 matrix in our case (700 simulation runs, 20 periods). It contains information on the effort made by the agent to perform a task (in every timestep).
a_P_sims: This is a 700*20 matrix in our case (700 simulation runs, 20 periods). It contains information on the effort levels incited by the principal (in every timestep).
optp: The premium parameter proposed by the second-best solution of Holmström’s model.
p_P_sims: This is a 700*20 matrix in our case (700 simulation runs, 20 periods). It contains information on the premium parameter set by the principal (in every timestep).
optUA: The agent's utility proposed by the second-best solution of Holmström’s model.
UA_A_sims: This is a 700*20 matrix in our case (700 simulation runs, 20 periods). It contains information on the (agent’s) utility expected by the agent (in every timestep).
UA_P_sims: This is a 700*20 matrix in our case (700 simulation runs, 20 periods). It contains information on the agent's utility expected by the principal (in every timestep).
UA_realized_sims: This is a 700*20 matrix in our case (700 simulation runs, 20 periods). It contains information on the utility realized by the agent (in every timestep).
lostUA-sims: This is a 700*20 matrix in our case (700 simulation runs, 20 periods). It contains the difference between the optimal and the realized utility for the agent in every timestep (i.e., the optimal minus the achieved utility of the agent).
optUP: The principal's utility proposed by the second-best solution of Holmström’s model.
UP_P_sims: This is a 700*20 matrix in our case (700 simulation runs, 20 periods). It contains information on the (principal’s) utility expected by the principal (in every timestep).
UP_realized_sims: This is a 700*20 matrix in our case (700 simulation runs, 20 periods). It contains information on the utility realized by the principal (in every timestep).
lostUP-sims: This is a 700*20 matrix in our case (700 simulation runs, 20 periods). It contains the difference between the optimal and the realized utility for the principal in every timestep (i.e., the optimal minus the achieved utility of the principal).
optoutcome: The outcome proposed by the second-best solution of Holmström’s model.
outcome_realized_sims: This is a 700*20 matrix in our case (700 simulation runs, 20 periods). It contains information on the outcome that materialized (in every timestep).
limitedMemoryA: This variable gives information on whether the agent’s memory was limited or not. Either set to 1 or 0 (If set to 0, the agent’s memory is unlimited).
limitedMemoryP: This variable gives information on whether the principal’s memory was limited or not. Either set to 1 or 0 (If set to 0, the principal’s memory is unlimited).
lostoutcome_sims: This is a 700*20 matrix in our case (700 simulation runs, 20 periods). It contains the result of the optimal outcome minus the achieved outcome (in every timestep).
uwmM: This variable gives information on the mean of the normally distributed environmental factor (set to 0 in our scenarios).
umwSD: This variable contains the standard deviation of the environmental variable; it is calculated as the chosen deviation in main.m multiplied by the optoutcome.
jto: This is the number of simulation runs (we set it to 700 in all scenarios).
The folder "Datasets" contains simulation data for scenarios with both limited and unlimited memory, covering all four observations (premium parameter, agent's effort, principal's utility, and agent's utility) in CSV format. Each row represents one simulation run, and columns represent timesteps within each run. For unlimited memory scenarios, the first column details environmental turbulence, followed by 200 columns representing the timesteps of a single run. For limited memory scenarios, the first three columns provide information on environmental turbulence, the principal's memory, and the agent's memory, followed by 20 columns that capture the results of each simulation run.
For any remaining questions, please contact me via stephan.leitner@aau.at
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Context
The dataset tabulates the population of Two Buttes by gender, including both male and female populations. This dataset can be utilized to understand the population distribution of Two Buttes across both sexes and to determine which sex constitutes the majority.
Key observations
There is a majority of female population, with 62.5% of total population being female. Source: U.S. Census Bureau American Community Survey (ACS) 2017-2021 5-Year Estimates.
When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2017-2021 5-Year Estimates.
Scope of gender :
Please note that the American Community Survey asks a question about the respondent's current sex, but not about gender, sexual orientation, or sex at birth. The question is intended to capture data on biological sex, not gender. Respondents are expected to answer either Male or Female. Our research and this dataset mirror the data reported as Male and Female for gender distribution analysis. No further analysis is done on the data reported by the Census Bureau.
Variables / Data Columns
Good to know
Margin of Error
Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.
Custom data
If you need custom data for your research project, report, or presentation, you can contact our research staff at research@neilsberg.com about the feasibility of a custom tabulation on a fee-for-service basis.
The Neilsberg Research Team curates, analyzes, and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.
This dataset is a part of the main dataset for Two Buttes Population by Gender. You can refer to the same here.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Context
The dataset tabulates the population of Two Rivers town by gender, including both male and female populations. This dataset can be utilized to understand the population distribution of Two Rivers town across both sexes and to determine which sex constitutes the majority.
Key observations
There is a majority of male population, with 57.96% of total population being male. Source: U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates.
When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates.
Scope of gender :
Please note that the American Community Survey asks a question about the respondent's current sex, but not about gender, sexual orientation, or sex at birth. The question is intended to capture data on biological sex, not gender. Respondents are expected to answer either Male or Female. Our research and this dataset mirror the data reported as Male and Female for gender distribution analysis. No further analysis is done on the data reported by the Census Bureau.
Variables / Data Columns
Good to know
Margin of Error
Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.
Custom data
If you need custom data for your research project, report, or presentation, you can contact our research staff at research@neilsberg.com about the feasibility of a custom tabulation on a fee-for-service basis.
The Neilsberg Research Team curates, analyzes, and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.
This dataset is a part of the main dataset for Two Rivers town Population by Race & Ethnicity. You can refer to the same here.
The open science movement produces vast quantities of openly published data connected to journal articles, creating an enormous resource for educators to engage students in current topics and analyses. However, educators face challenges using these materials to meet course objectives. I present a case study using open science (published articles and their corresponding datasets) and open educational practices in a capstone course. While engaging in current topics of conservation, students trace connections in the research process, learn statistical analyses, and recreate analyses using the programming language R. I assessed the presence of best practices in open articles and datasets, examined student selection in the open grading policy, surveyed students on their perceived learning gains, and conducted a thematic analysis on student reflections. First, articles and datasets met just over half of the assessed fairness practices, but this increased with the publication date. There was a...

Article and dataset fairness: To assess the utility of open articles and their datasets as an educational tool in an undergraduate academic setting, I measured the congruence of each pair to a set of best practices and guiding principles. I assessed ten guiding principles and best practices (Table 1), where each category was scored '1' or '0' based on whether it met that criterion, with a total possible score of ten.

Open grading policies: Students were allowed to specify the percentage weight for each assessment category in the course, including 1) six coding exercises (Exercises), 2) one lead exercise (Lead Exercise), 3) fourteen annotation assignments of readings (Annotations), 4) one final project (Final Project), 5) five discussion board posts and a statement of learning reflection (Discussion), and 6) attendance and participation (Participation). I examined if assessment categories (independent variable) were weighted (dependent variable) differently by students using an analysis of ...

# Data for: Integrating open education practices with data analysis of open science in an undergraduate course
Author: Marja H Bakermans Affiliation: Worcester Polytechnic Institute, 100 Institute Rd, Worcester, MA 01609 USA ORCID: https://orcid.org/0000-0002-4879-7771 Institutional IRB approval: IRB-24–0314
The full dataset file called OEPandOSdata (.xlsx extension) contains 8 files. Below are descriptions of the name and contents of each file. NA = not applicable or no data available
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Context
The dataset tabulates the population of Two Inlets township by gender, including both male and female populations. This dataset can be utilized to understand the population distribution of Two Inlets township across both sexes and to determine which sex constitutes the majority.
Key observations
There is a majority of male population, with 54.02% of total population being male. Source: U.S. Census Bureau American Community Survey (ACS) 2017-2021 5-Year Estimates.
When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2017-2021 5-Year Estimates.
Scope of gender :
Please note that the American Community Survey asks a question about the respondent's current sex, but not about gender, sexual orientation, or sex at birth. The question is intended to capture data on biological sex, not gender. Respondents are expected to answer either Male or Female. Our research and this dataset mirror the data reported as Male and Female for gender distribution analysis. No further analysis is done on the data reported by the Census Bureau.
Variables / Data Columns
Good to know
Margin of Error
Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.
Custom data
If you need custom data for your research project, report, or presentation, you can contact our research staff at research@neilsberg.com about the feasibility of a custom tabulation on a fee-for-service basis.
The Neilsberg Research Team curates, analyzes, and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.
This dataset is a part of the main dataset for Two Inlets township Population by Gender. You can refer to the same here.
With increasing concerns about freshwater cyanobacteria blooms, there is a need to identify which waterbodies are at risk for developing these blooms, especially those that produce cyanotoxins. To address this concern, we developed spatial statistical models using the US National Lakes Assessment, a survey with over 3,000 spring and summer observations of cyanobacteria abundance and microcystin concentration in lakes across the conterminous US. We combined these observations with other nationally available data to model which lake and watershed factors best explain the presence of harmful cyanobacterial blooms. We then used these models to estimate the cyanobacteria abundance and probability of microcystin detection in 124,500 lakes across the CONUS. This dataset includes the compiled data used to generate the models and the dataset used to generate predictions for a much larger population of lakes. The data package includes two tabular data files, two tabular metadata files, and one methods document.
License: Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0), https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Summary:
Estimated stand-off distance between ADS-B equipped aircraft and obstacles. Obstacle information was sourced from the FAA Digital Obstacle File and the FHWA National Bridge Inventory. Aircraft tracks were sourced from processed data curated from the OpenSky Network. Results are presented as histograms organized by aircraft type and distance away from runways.
Description:
For many aviation safety studies, aircraft behavior is represented using encounter models, which are statistical models of how aircraft behave during close encounters. They are used to provide a realistic representation of the range of encounter flight dynamics where an aircraft collision avoidance system would be likely to alert. These models currently and have historically have been limited to interactions between aircraft; they have not represented the specific interactions between obstacles and aircraft equipped transponders. In response, we calculated the standoff distance between obstacles and ADS-B equipped manned aircraft.
For robustness, this assessment considered two different datasets of manned aircraft tracks and two datasets of obstacles. For robustness, MIT LL calculated the standoff distance using two different datasets of aircraft tracks and two datasets of obstacles. This approach aligned with the foundational research used to support the ASTM F3442/F3442M-20 well clear criteria of 2000 feet laterally and 250 feet AGL vertically.
The two datasets of processed tracks of ADS-B equipped aircraft were curated from the OpenSky Network. It is likely that rotorcraft were underrepresented in these datasets. There were also no considerations for aircraft equipped only with Mode C or not equipped with any transponder. The first dataset was used to train the v1.3 uncorrelated encounter models and is referred to as the “Monday” dataset. The second dataset is referred to as the “aerodrome” dataset and was used to train the v2.0 and v3.x terminal encounter models. The Monday dataset consisted of 104 Mondays across North America. The other dataset was based on observations within 8 nautical miles of Class B, C, and D aerodromes in the United States for the first 14 days of each month from January 2019 through February 2020. Prior to any processing, the datasets required 714 and 847 gigabytes of storage. For more details on these datasets, please refer to "Correlated Bayesian Model of Aircraft Encounters in the Terminal Area Given a Straight Takeoff or Landing" and “Benchmarking the Processing of Aircraft Tracks with Triples Mode and Self-Scheduling.”
Two different datasets of obstacles were also considered. The first consisted of point obstacles defined by the FAA digital obstacle file (DOF): point obstacle structures of antenna, lighthouse, meteorological tower (met), monument, sign, silo, spire (steeple), stack (chimney; industrial smokestack), transmission line tower (t-l tower), tank (water; fuel), tramway, utility pole (telephone pole, or pole of similar height, supporting wires), windmill (wind turbine), and windsock. Each obstacle was represented by a cylinder with the height reported by the DOF and a radius based on the reported horizontal accuracy. We did not consider the actual width and height of the structure itself. Additionally, we only considered obstacles at least 50 feet tall and marked as verified in the DOF.
The other obstacle dataset, termed “bridges,” was based on the bridges identified in the FAA DOF and additional information provided by the National Bridge Inventory (NBI). Due to the potential size and extent of bridges, it would not be appropriate to model them as point obstacles; however, the FAA DOF only provides a point location and no information about the size of the bridge. In response, we correlated the FAA DOF with the National Bridge Inventory, which provides information about the length of many bridges. Instead of sizing the simulated bridge based on horizontal accuracy, as with the point obstacles, each bridge was represented as a circle with a radius based on the length of the longest, nearest bridge in the NBI. A circle representation was required because neither the FAA DOF nor the NBI provides sufficient information about orientation to represent bridges as rectangular cuboids. As with the point obstacles, the height of the obstacle was based on the height reported by the FAA DOF. Accordingly, the analysis using the bridge dataset should be viewed as risk averse and conservative: it is possible that a manned aircraft was actually hundreds of feet away from an obstacle while the estimated standoff distance was significantly less. Additionally, because all obstacles are represented with a fixed height, the potentially flat and low-level entrances of a bridge are assumed to have the same height as the tall bridge towers. The attached figure illustrates an example simulated bridge.
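To make the geometric representation concrete, the sketch below shows one way the two obstacle types described above could be encoded: point obstacles as cylinders whose height comes from the DOF and whose radius comes from the reported horizontal accuracy, and bridges as circles sized from the NBI bridge length. The class and field names are illustrative assumptions, not the released data schema.

    # Illustrative encoding of the obstacle representations described above.
    # Field names and units are assumptions; the actual implementation may differ.
    from dataclasses import dataclass

    @dataclass
    class Obstacle:
        lat: float
        lon: float
        max_height_msl_ft: float   # height taken from the FAA DOF
        radius_ft: float           # horizontal extent of the circular footprint

    def point_obstacle(lat, lon, height_msl_ft, horizontal_accuracy_ft):
        # Point obstacle: radius based on the DOF reported horizontal accuracy.
        return Obstacle(lat, lon, height_msl_ft, horizontal_accuracy_ft)

    def bridge_obstacle(lat, lon, height_msl_ft, nbi_bridge_length_ft):
        # Bridge: circle sized from the NBI bridge length, since neither source
        # reports orientation; this is deliberately conservative.
        return Obstacle(lat, lon, height_msl_ft, nbi_bridge_length_ft)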
It would have been extremely computationally inefficient to calculate the standoff distance for all possible track points. Instead, we defined an encounter between an aircraft and an obstacle as an aircraft flying at 3069 feet AGL or less coming within 3000 feet laterally of any obstacle within a 60 second time interval. If the criteria were satisfied, then for that 60 second track segment we calculated the standoff distance to all nearby obstacles. Vertical separation was based on the MSL altitude of the track and the maximum MSL height of an obstacle.
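A minimal sketch of the encounter screening and standoff computation described above is given below. The function and field names are assumptions for illustration; the actual implementation in the processing workflow may differ, and a real lateral distance would be computed geodesically from latitude and longitude rather than on a planar grid.

    # Sketch of the encounter screen (<= 3069 ft AGL, within 3000 ft laterally,
    # per 60 s segment) and the standoff calculation described above.
    # Dictionary keys and helper names are illustrative assumptions; feet throughout.
    import math

    def lateral_distance_ft(point, obstacle):
        # Planar distance on locally projected coordinates; a real implementation
        # would compute a geodesic distance from lat/lon positions.
        return math.hypot(point["x_ft"] - obstacle["x_ft"],
                          point["y_ft"] - obstacle["y_ft"])

    def is_encounter(segment, obstacles):
        # 60 second track segment: any point at or below 3069 ft AGL that comes
        # within 3000 ft laterally of any obstacle.
        return any(p["alt_agl_ft"] <= 3069 and lateral_distance_ft(p, o) <= 3000
                   for p in segment for o in obstacles)

    def standoff(point, obstacle):
        # Lateral standoff plus vertical separation, with vertical separation
        # based on track MSL altitude and the obstacle's maximum MSL height.
        lateral = lateral_distance_ft(point, obstacle)
        vertical = point["alt_msl_ft"] - obstacle["max_height_msl_ft"]
        return lateral, vertical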
For each combination of aircraft track and obstacle datasets, the results were organized seven different ways. Filtering criteria were based on aircraft type and distance away from runways. Runway data was sourced from the FAA runways of the United States, Puerto Rico, and Virgin Islands open dataset. Aircraft type was identified as part of the em-processing-opensky workflow.
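As an illustration of the kind of organization described above, the snippet below bins standoff distances into histograms grouped by aircraft type. The file name, column names, and bin edges are assumptions for illustration, not the released tabulation.

    # Illustrative grouping of standoff results into histograms by aircraft type.
    # Column names, file name, and bin edges are assumptions only.
    import numpy as np
    import pandas as pd

    results = pd.read_csv("standoff_results.csv")   # hypothetical file name
    bins = np.arange(0, 3250, 250)                   # 250 ft lateral bins out to 3000 ft
    for aircraft_type, group in results.groupby("aircraft_type"):
        counts, _ = np.histogram(group["lateral_standoff_ft"], bins=bins)
        print(aircraft_type, counts)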
License
This dataset is licensed under Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0).
This license requires that reusers give credit to the creator. It allows reusers to copy and distribute the material in any medium or format, in unadapted form and for noncommercial purposes only. Noncommercial means not primarily intended for or directed towards commercial advantage or monetary compensation. Exceptions are given for the not-for-profit standards organizations ASTM International and RTCA.
MIT is releasing this dataset in good faith to promote open and transparent research of the low altitude airspace. Given the limitations of the dataset and the need for more research, a more restrictive license was warranted. Namely, the dataset is based only on observations of ADS-B equipped aircraft, which not all aircraft in the airspace are required to employ, and the observations were sourced from a crowdsourced network whose surveillance coverage has not been robustly characterized.
As more research is conducted and the low altitude airspace is further characterized or regulated, it is expected that a future version of this dataset may have a more permissive license.
Distribution Statement
DISTRIBUTION STATEMENT A. Approved for public release. Distribution is unlimited.
© 2021 Massachusetts Institute of Technology.
Delivered to the U.S. Government with Unlimited Rights, as defined in DFARS Part 252.227-7013 or 7014 (Feb 2014). Notwithstanding any copyright notice, U.S. Government rights in this work are defined by DFARS 252.227-7013 or DFARS 252.227-7014 as detailed above. Use of this work other than as specifically authorized by the U.S. Government may violate any copyrights that exist in this work.
This material is based upon work supported by the Federal Aviation Administration under Air Force Contract No. FA8702-15-D-0001. Any opinions, findings, conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Federal Aviation Administration.
This document is derived from work done for the FAA (and possibly others); it is not the direct product of work done for the FAA. The information provided herein may include content supplied by third parties. Although the data and information contained herein have been produced or processed from sources believed to be reliable, the Federal Aviation Administration makes no warranty, expressed or implied, regarding the accuracy, adequacy, completeness, legality, reliability or usefulness of any information, conclusions or recommendations provided herein. Distribution of the information contained herein does not constitute an endorsement or warranty of the data or information provided herein by the Federal Aviation Administration or the U.S. Department of Transportation. Neither the Federal Aviation Administration nor the U.S. Department of
These data represent the predicted (modeled) prevalence of Diabetes among adults (Age 18+) for each census tract in Colorado. Diabetes is defined as ever being diagnosed with Diabetes by a doctor, nurse, or other health professional; this definition does not include gestational, borderline, or pre-diabetes. The estimate for each census tract represents an average that was derived from multiple years of Colorado Behavioral Risk Factor Surveillance System data (2014-2017).

CDPHE used a model-based approach to measure the relationship between age, race, gender, poverty, education, location, and health conditions or risk behavior indicators, and applied this relationship to predict the number of persons who have the health conditions or risk behavior for each census tract in Colorado. We then applied these probabilities, based on demographic stratification, to the 2013-2017 American Community Survey population estimates and determined the percentage of adults with the health conditions or risk behavior for each census tract in Colorado. The estimates are based on statistical models and are not direct survey estimates. Using the best available data, CDPHE was able to model census tract estimates based on demographic data and background knowledge about the distribution of specific health conditions and risk behaviors.

The estimates are displayed in both the map and the data table using point estimate values for each census tract and a quintile range. The high and low value for each color on the map is calculated by dividing the total number of census tracts in Colorado (1,249) into five groups based on the total range of estimates for all Colorado census tracts. Each quintile range represents roughly 20% of the census tracts in Colorado. No estimates are provided for census tracts with a known population of less than 50; these census tracts are displayed in the map as "No Est, Pop < 50." No estimates are provided for the 7 census tracts with a known population of less than 50 or for the 2 census tracts that exclusively contain a federal correctional institution as 100% of their population. These 9 census tracts are displayed in the map as "No Estimate."
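The quintile display described above can be reproduced mechanically. The short sketch below splits the tract-level point estimates into five equal-count groups of roughly 20% each; the file name, column names, and the use of equal-count quantiles are assumptions for illustration, not CDPHE's implementation.

    # Illustrative quintile classification of tract-level point estimates.
    # File and column names are hypothetical; the official workflow may differ.
    import pandas as pd

    tracts = pd.read_csv("co_tract_diabetes_estimates.csv")   # hypothetical file
    # Exclude suppressed tracts (population < 50 or prison-only tracts).
    displayable = tracts[~tracts["suppressed"]].copy()
    # Five equal-count groups (~20% of census tracts each).
    displayable["quintile"] = pd.qcut(displayable["estimate_pct"], q=5,
                                      labels=["Q1", "Q2", "Q3", "Q4", "Q5"])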