91 datasets found
  1. Data from: WiBB: An integrated method for quantifying the relative...

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    • +1 more
    zip
    Updated Aug 20, 2021
    Cite
    Qin Li; Xiaojun Kou (2021). WiBB: An integrated method for quantifying the relative importance of predictive variables [Dataset]. http://doi.org/10.5061/dryad.xsj3tx9g1
    Available download formats: zip
    Dataset updated
    Aug 20, 2021
    Dataset provided by
    Field Museum of Natural History
    Beijing Normal University
    Authors
    Qin Li; Xiaojun Kou
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    This dataset contains simulated datasets, empirical data, and R scripts described in the paper: “Li, Q. and Kou, X. (2021) WiBB: An integrated method for quantifying the relative importance of predictive variables. Ecography (DOI: 10.1111/ecog.05651)”.

    A fundamental goal of scientific research is to identify the underlying variables that govern crucial processes of a system. Here we proposed a new index, WiBB, which integrates the merits of several existing methods: a model-weighting method from information theory (Wi), a standardized regression coefficient method measured by ß* (B), and the bootstrap resampling technique (B). We applied WiBB to simulated datasets with known correlation structures, for both linear models (LM) and generalized linear models (GLM), to evaluate its performance. We also applied two other methods, the relative sum of weights (SWi) and the standardized beta (ß*), to compare their ability to rank predictor importance under various scenarios. We further applied WiBB to an empirical dataset of the plant genus Mimulus to select bioclimatic predictors of species' presence across the landscape. Results on the simulated datasets showed that the WiBB method outperformed the ß* and SWi methods in scenarios with small and large sample sizes, respectively, and that the bootstrap resampling technique significantly improved the discriminant ability. When testing WiBB on the empirical dataset with GLM, it sensibly identified four important predictors with high credibility out of six candidates in modeling the geographical distributions of 71 Mimulus species. This integrated index has great advantages in evaluating predictor importance, and hence in reducing the dimensionality of data, without losing interpretive power. The simplicity of calculating the new metric, compared with more sophisticated statistical procedures, makes it a handy addition to the statistical toolbox.

    Methods: To simulate independent datasets (size = 1000), we adopted Galipaud et al.'s (2014) approach with custom modifications of the data.simulation function, which uses the multivariate normal distribution function rmvnorm in the R package mvtnorm (v1.0-5, Genz et al. 2016). Each dataset was simulated with a preset correlation structure between a response variable (y) and four predictors (x1, x2, x3, x4). The first three (genuine) predictors were set to be strongly, moderately, and weakly correlated with the response variable, respectively (denoted by large, medium, and small Pearson correlation coefficients, r), while the correlation between the response and the last (spurious) predictor was set to zero. We simulated datasets with three levels of difference between the correlation coefficients of consecutive predictors, ∆r = 0.1, 0.2, and 0.3. These three levels of ∆r resulted in three correlation structures between the response and the four predictors: (0.3, 0.2, 0.1, 0.0), (0.6, 0.4, 0.2, 0.0), and (0.8, 0.6, 0.3, 0.0), respectively. We repeated the simulation procedure 200 times for each of the three preset correlation structures (600 datasets in total) for later LM fitting. For GLM fitting, we modified the simulation procedure with additional steps, converting the continuous response into binary data O (e.g., occurrence data having 0 for absence and 1 for presence). We tested the WiBB method, along with two other methods, the relative sum of weights (SWi) and the standardized beta (ß*), to evaluate the ability to correctly rank predictor importance under various scenarios. The empirical dataset of 71 Mimulus species was assembled from occurrence coordinates and corresponding values extracted from climatic layers of the WorldClim dataset (www.worldclim.org), and we applied the WiBB method to infer important predictors of their geographical distributions.
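    A minimal Python sketch of this simulation design (an illustrative re-implementation, not the authors' R code, which uses rmvnorm from mvtnorm): it draws one dataset with the ∆r = 0.2 correlation structure and ranks the four predictors by bootstrapped standardized coefficients, i.e., the ß* and bootstrap ingredients of WiBB, with the information-theoretic weighting (Wi) step omitted for brevity.

      import numpy as np

      rng = np.random.default_rng(42)
      r = np.array([0.6, 0.4, 0.2, 0.0])  # preset y-x correlations (the dr = 0.2 structure)

      # Joint covariance of (y, x1..x4): unit variances, predictors mutually
      # uncorrelated, predictor i correlated r[i] with the response.
      cov = np.eye(5)
      cov[0, 1:] = r
      cov[1:, 0] = r
      sample = rng.multivariate_normal(np.zeros(5), cov, size=1000)
      y, X = sample[:, 0], sample[:, 1:]

      def std_beta(X, y):
          # standardized regression coefficients (ß*): least squares on z-scores
          Xz = (X - X.mean(0)) / X.std(0)
          yz = (y - y.mean()) / y.std()
          return np.linalg.lstsq(Xz, yz, rcond=None)[0]

      # Bootstrap resampling (the second "B" in WiBB): resample rows, refit,
      # and average the absolute standardized coefficients.
      scores = []
      for _ in range(200):
          idx = rng.integers(0, len(y), len(y))
          scores.append(np.abs(std_beta(X[idx], y[idx])))
      print("bootstrap mean |ß*|:", np.array(scores).mean(axis=0))  # expect x1 > x2 > x3 > x4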

  2. Data from: Best Management Practices Statistical Estimator (BMPSE) Version...

    • catalog.data.gov
    • data.usgs.gov
    Updated Nov 27, 2025
    + more versions
    Cite
    U.S. Geological Survey (2025). Best Management Practices Statistical Estimator (BMPSE) Version 1.2.0 [Dataset]. https://catalog.data.gov/dataset/best-management-practices-statistical-estimator-bmpse-version-1-2-0
    Dataset updated
    Nov 27, 2025
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Description

    The Best Management Practices Statistical Estimator (BMPSE) version 1.2.0 was developed by the U.S. Geological Survey (USGS), in cooperation with the Federal Highway Administration (FHWA) Office of Project Delivery and Environmental Review, to provide planning-level information about the performance of structural best management practices for decision makers, planners, and highway engineers to assess and mitigate possible adverse effects of highway and urban runoff on the Nation's receiving waters (Granato 2013, 2014; Granato and others, 2021). The BMPSE was assembled by using a Microsoft Access® database application to facilitate calculation of BMP performance statistics. Granato (2014) developed quantitative methods to estimate values of the trapezoidal-distribution statistics, correlation coefficients, and the minimum irreducible concentration (MIC) from available data. Granato (2014) developed the BMPSE to hold and process data from the International Stormwater Best Management Practices Database (BMPDB, www.bmpdatabase.org). Version 1.0 of the BMPSE contained a subset of the data from the 2012 version of the BMPDB; the current version of the BMPSE (1.2.0) contains a subset of the data from the December 2019 version of the BMPDB. Selected data from the BMPDB were screened for import into the BMPSE in consultation with Jane Clary, the data manager for the BMPDB. Modifications included identifying water quality constituents, making measurement units consistent, identifying paired inflow and outflow values, and converting BMPDB water quality values set as half the detection limit back to the detection limit. Total polycyclic aromatic hydrocarbon (PAH) values were added to the BMPSE from BMPDB data; they were calculated from individual PAH measurements at sites with enough data to calculate totals. The BMPSE tool can sort and rank the data, calculate plotting positions, calculate initial estimates, and calculate potential correlations to facilitate the distribution-fitting process (Granato, 2014). For water-quality ratio analysis, the BMPSE generates the input files and the list of filenames for each constituent within the Graphical User Interface (GUI). The BMPSE calculates the Spearman's rho (ρ) and Kendall's tau (τ) correlation coefficients with their respective 95-percent confidence limits and the probability that each correlation coefficient value is not significantly different from zero by using standard methods (Granato, 2014). If the 95-percent confidence limit values are of the same sign, then the correlation coefficient is statistically different from zero. For hydrograph extension, the BMPSE calculates ρ and τ between the inflow volume and the hydrograph-extension values (Granato, 2014). For volume reduction, the BMPSE calculates ρ and τ between the inflow volume and the ratio of outflow to inflow volumes (Granato, 2014). For water-quality treatment, the BMPSE calculates ρ and τ between the inflow concentrations and the ratio of outflow to inflow concentrations (Granato, 2014; 2020). The BMPSE also calculates ρ between the inflow and the outflow concentrations when a water-quality treatment analysis is done. The current version (1.2.0) of the BMPSE also has the option to calculate urban-runoff quality statistics from inflows to BMPs by using computer code developed for the Highway Runoff Database (Granato and Cazenas, 2009; Granato, 2019). (A generic sketch of the same-sign confidence-limit test follows the reference list below.)

    References:
    Granato, G.E., 2013, Stochastic empirical loading and dilution model (SELDM) version 1.0.0: U.S. Geological Survey Techniques and Methods, book 4, chap. C3, 112 p., CD-ROM, https://pubs.usgs.gov/tm/04/c03
    Granato, G.E., 2014, Statistics for stochastic modeling of volume reduction, hydrograph extension, and water-quality treatment by structural stormwater runoff best management practices (BMPs): U.S. Geological Survey Scientific Investigations Report 2014–5037, 37 p., http://dx.doi.org/10.3133/sir20145037
    Granato, G.E., 2019, Highway-Runoff Database (HRDB) Version 1.1.0: U.S. Geological Survey data release, https://doi.org/10.5066/P94VL32J
    Granato, G.E., and Cazenas, P.A., 2009, Highway-Runoff Database (HRDB Version 1.0), a data warehouse and preprocessor for the stochastic empirical loading and dilution model: Washington, D.C., U.S. Department of Transportation, Federal Highway Administration, FHWA-HEP-09-004, 57 p., https://pubs.usgs.gov/sir/2009/5269/disc_content_100a_web/FHWA-HEP-09-004.pdf
    Granato, G.E., Spaetzel, A.B., and Medalie, L., 2021, Statistical methods for simulating structural stormwater runoff best management practices (BMPs) with the stochastic empirical loading and dilution model (SELDM): U.S. Geological Survey Scientific Investigations Report 2020–5136, 41 p., https://doi.org/10.3133/sir20205136
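    As a rough illustration of the correlation screening described above (generic statistics, not the BMPSE code itself, which runs inside a Microsoft Access application), the sketch below computes Spearman's rho on synthetic inflow data with an approximate 95-percent confidence interval and applies the same-sign rule:

      import numpy as np
      from scipy import stats

      rng = np.random.default_rng(1)
      inflow = rng.lognormal(0, 1, 60)                        # synthetic inflow concentrations
      ratio = 0.6 * inflow**-0.2 * rng.lognormal(0, 0.3, 60)  # synthetic outflow/inflow ratios

      rho, p = stats.spearmanr(inflow, ratio)
      n = len(inflow)
      # Fisher z-transform with the Fieller et al. variance approximation for Spearman's rho
      z, se = np.arctanh(rho), np.sqrt(1.06 / (n - 3))
      lo, hi = np.tanh(z - 1.96 * se), np.tanh(z + 1.96 * se)
      nonzero = (lo > 0) or (hi < 0)  # both confidence limits share a sign
      print(f"rho={rho:.2f}, 95% CI=({lo:.2f}, {hi:.2f}), significantly nonzero: {nonzero}")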

  3. Strengths and weaknesses of different methods.

    • plos.figshare.com
    xls
    Updated Oct 31, 2023
    + more versions
    Cite
    Mrinal Saha; Aparna Deb; Imtiaz Sultan; Sujat Paul; Jishan Ahmed; Goutam Saha (2023). Strengths and weaknesses of different methods. [Dataset]. http://doi.org/10.1371/journal.pgph.0002475.t002
    Available download formats: xls
    Dataset updated
    Oct 31, 2023
    Dataset provided by
    PLOS Global Public Health
    Authors
    Mrinal Saha; Aparna Deb; Imtiaz Sultan; Sujat Paul; Jishan Ahmed; Goutam Saha
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Vitamin D insufficiency appears to be prevalent in SLE patients. Multiple factors potentially contribute to lower vitamin D levels, including limited sun exposure, the use of sunscreen, darker skin complexion, aging, obesity, specific medical conditions, and certain medications. The study aims to assess the risk factors associated with low vitamin D levels in SLE patients in the southern part of Bangladesh, a region noted for a high prevalence of SLE. The research additionally investigates the possible correlation between vitamin D and the SLEDAI score, seeking to understand the potential benefits of vitamin D in enhancing disease outcomes for SLE patients. The study incorporates a dataset consisting of 50 patients from the southern part of Bangladesh and evaluates their clinical and demographic data. An initial exploratory data analysis is conducted to gain insights into the data, which includes calculating means and standard deviations, performing correlation analysis, and generating heat maps. Relevant inferential statistical tests, such as the Student's t-test, are also employed. In the machine learning part of the analysis, this study utilizes supervised learning algorithms, specifically Linear Regression (LR) and Random Forest (RF). To optimize the hyperparameters of the RF model and mitigate the risk of overfitting given the small dataset, a 3-fold cross-validation strategy is implemented. The study also calculates bootstrapped confidence intervals to provide robust uncertainty estimates and further validate the approach. A comprehensive feature importance analysis is carried out using RF feature importance, permutation-based feature importance, and SHAP values. The LR model yields an RMSE of 4.83 (CI: 2.70, 6.76) and MAE of 3.86 (CI: 2.06, 5.86), whereas the RF model achieves better results, with an RMSE of 2.98 (CI: 2.16, 3.76) and MAE of 2.68 (CI: 1.83, 3.52). Both models identify Hb, CRP, ESR, and age as significant contributors to vitamin D level predictions. Despite the lack of a significant association between SLEDAI and vitamin D in the statistical analysis, the machine learning models suggest a potential nonlinear dependency of vitamin D on SLEDAI. These findings highlight the importance of these factors in managing vitamin D levels in SLE patients. The study concludes that there is a high prevalence of vitamin D insufficiency in SLE patients. Although a direct linear correlation between the SLEDAI score and vitamin D levels is not observed, machine learning models suggest the possibility of a nonlinear relationship. Furthermore, factors such as Hb, CRP, ESR, and age are identified as more significant in predicting vitamin D levels. Thus, the study suggests that monitoring these factors may be advantageous in managing vitamin D levels in SLE patients. Given the immunological nature of SLE, the potential role of vitamin D in SLE disease activity could be substantial. Therefore, it underscores the need for further large-scale studies to corroborate this hypothesis.
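    A minimal sketch of the modeling protocol described above, on synthetic stand-in data (the invented columns only mimic the roles of Hb, CRP, ESR, and age): a Random Forest tuned with 3-fold cross-validation and a bootstrapped 95% confidence interval for the test RMSE.

      import numpy as np
      from sklearn.ensemble import RandomForestRegressor
      from sklearn.metrics import mean_squared_error
      from sklearn.model_selection import GridSearchCV, train_test_split

      rng = np.random.default_rng(0)
      X = rng.normal(size=(50, 4))                                # stand-ins for Hb, CRP, ESR, age
      y = 20 + X @ [3.0, -2.0, -1.5, 1.0] + rng.normal(0, 3, 50)  # stand-in vitamin D level

      X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
      search = GridSearchCV(RandomForestRegressor(random_state=0),
                            {"n_estimators": [100, 300], "max_depth": [2, 4, None]},
                            cv=3, scoring="neg_root_mean_squared_error")
      search.fit(X_tr, y_tr)
      pred = search.predict(X_te)

      # Bootstrap the test-set RMSE for a rough 95% interval
      rmses = [np.sqrt(mean_squared_error(y_te[i], pred[i]))
               for i in (rng.integers(0, len(y_te), len(y_te)) for _ in range(1000))]
      rmse = np.sqrt(mean_squared_error(y_te, pred))
      print(f"RMSE {rmse:.2f}, 95% CI ({np.percentile(rmses, 2.5):.2f}, {np.percentile(rmses, 97.5):.2f})")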

  4. Homestays data

    • kaggle.com
    zip
    Updated May 25, 2024
    Cite
    Priyanshu shukla (2024). Homestays data [Dataset]. https://www.kaggle.com/datasets/priyanshu594/homestays-data
    Available download formats: zip (44330689 bytes)
    Dataset updated
    May 25, 2024
    Authors
    Priyanshu shukla
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Objective: Build a robust predictive model to estimate the log_price of homestay listings based on a comprehensive analysis of their characteristics, amenities, and host information. First make sure that the entire dataset is clean and ready to be used.

    1. Feature Engineering: Enhance the dataset by creating actionable and insightful features. Calculate Host_Tenure by determining the number of years from host_since to the current date, providing a measure of host experience. Generate Amenities_Count by counting the items listed in the amenities array to quantify property offerings. Determine Days_Since_Last_Review by calculating the days between last_review and today to assess listing activity and relevance. (A pandas sketch of this step, together with the encoding in step 6, follows this list.)
    2. Exploratory Data Analysis (EDA): Conduct a deep dive into the dataset to uncover underlying patterns and relationships. Analyze how pricing (log_price) correlates with both categorical features (such as room_type and property_type) and numerical features (like accommodates and number_of_reviews). Use statistical tools and visualizations such as correlation matrices, histograms for distribution analysis, and scatter plots to explore relationships between variables.
    3. Geospatial Analysis: Investigate the geographical data to understand regional pricing trends. Plot listings on a map using latitude and longitude data to visually assess price distribution. Examine whether certain neighbourhoods or proximity to city centres influence pricing, providing a spatial perspective on the pricing strategy.
    4. Sentiment Analysis on Textual Data: Apply natural language processing techniques to the description texts to extract sentiment scores. Use sentiment analysis tools to determine whether positive or negative descriptions influence listing prices, incorporating these findings into the predictive model as a feature.
    5. Amenities Analysis: Thoroughly parse and analyse the amenities provided in the listings. Identify which amenities are most associated with higher or lower prices by applying statistical tests for correlation, thereby informing both pricing strategy and model inputs.
    6. Categorical Data Encoding: Convert categorical data into a format suitable for machine learning. Apply one-hot encoding to variables like room_type, city, and property_type, ensuring that the model can interpret these as distinct features without any ordinal implication.
    7. Model Development and Training: Design and train predictive models to estimate log_price. Begin with a simple linear regression to establish a baseline, then explore more complex models such as RandomForest and GradientBoosting to better capture non-linear relationships and interactions between features. Briefly document the model-building process in the Jupyter notebook itself, specifying the choice of algorithms and the rationale.
    8. Model Optimization and Validation: Systematically optimize the models to achieve the best performance. Employ techniques like grid search to experiment with different hyperparameter settings. Validate model choices through techniques like k-fold cross-validation, ensuring the model generalizes well to unseen data.
    9. Feature Importance and Model Insights: Analyze the trained models to identify which features most significantly impact log_price. Use model-specific methods like feature importance scores for tree-based models and SHAP values for an in-depth understanding of feature contributions.
    10. Predictive Performance Assessment: Critically evaluate the performance of the final model on a reserved test set. Use metrics such as Root Mean Squared Error (RMSE) and R-squared to assess accuracy and goodness of fit. Provide a detailed analysis of the residuals to check for any patterns that might suggest model biases or misfit.
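    A hedged pandas sketch of steps 1 and 6 on two invented rows; the column names follow the task description, but the raw format of the amenities field is an assumption.

      import pandas as pd

      df = pd.DataFrame({
          "host_since": ["2015-03-01", "2020-07-15"],
          "last_review": ["2024-04-30", "2024-05-10"],
          "amenities": ['{"Wifi","Kitchen","Heating"}', '{"Wifi"}'],
          "room_type": ["Entire home/apt", "Private room"],
      })
      today = pd.Timestamp("2024-05-25")

      df["Host_Tenure"] = (today - pd.to_datetime(df["host_since"])).dt.days / 365.25
      df["Amenities_Count"] = df["amenities"].str.strip("{}").str.split(",").str.len()
      df["Days_Since_Last_Review"] = (today - pd.to_datetime(df["last_review"])).dt.days
      df = pd.get_dummies(df, columns=["room_type"])  # one-hot: no ordinal implication
      print(df.filter(regex="Host_|Amenities_|Days_|room_type_"))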

  5. Data from: S1 Dataset -

    • plos.figshare.com
    xlsx
    Updated Dec 6, 2023
    Cite
    Mwemezi L. Kabululu (2023). S1 Dataset - [Dataset]. http://doi.org/10.1371/journal.pone.0295433.s001
    Available download formats: xlsx
    Dataset updated
    Dec 6, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Mwemezi L. Kabululu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The aim of this study was to use linear body measurements to develop and validate a regression-based model for prediction of live weights (LW) of pigs reared under smallholder settings in rural areas in the southern highlands of Tanzania. LW of 400 pigs (range 7 to 91 kg) was measured, along with their heart girths (HG) and body lengths (BL). BL was measured from the midpoint between the ears to the tail base. HG was measured as chest circumference just behind the front legs. LW was determined using a portable hanging scale. An analysis of covariance was performed to test for differences in LW between male and female pigs, including age, HG and BL as covariates. LW was regressed on HG and BL using simple and multiple linear regressions. Models were developed for all pig ages, and separately for market/breeding-age pigs and those below market/breeding age. Model validation was done using a split-samples approach, followed by PRESS-related statistics. Model efficiency and accuracy were assessed using the coefficient of determination, R2, and standard deviation of the random error, respectively. Model stability was determined by assessing 'shrinkage' of the R2 value. Results showed that HG was the best predictor of LW in market/breeding-age pigs (model equation: LW = 1.22HG - 52.384; R2 = 0.94, error = 3.7). BL, age and sex of pigs did not influence LW estimates. It is expected that LW estimation tools will be developed to enable more accurate estimation of LW in the pig value chain in the area.
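    For convenience, the quoted model can be wrapped as a one-line helper; note that the heart-girth unit (cm) is an assumption, as the abstract does not state units.

      def live_weight_kg(heart_girth_cm: float) -> float:
          """LW = 1.22*HG - 52.384 (market/breeding-age pigs; R2 = 0.94, error = 3.7).
          Assumes heart girth measured in centimeters."""
          return 1.22 * heart_girth_cm - 52.384

      print(live_weight_kg(100.0))  # a 100 cm heart girth predicts about 69.6 kg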

  6. Cancer County-Level

    • kaggle.com
    zip
    Updated Dec 3, 2022
    Cite
    The Devastator (2022). Cancer County-Level [Dataset]. https://www.kaggle.com/datasets/thedevastator/exploring-county-level-correlations-in-cancer-ra
    Available download formats: zip (146998 bytes)
    Dataset updated
    Dec 3, 2022
    Authors
    The Devastator
    Description

    Exploring County-Level Correlations in Cancer Rates and Trends

    A Multivariate Ordinary Least Squares Regression Model

    By Noah Rippner [source]

    About this dataset

    This dataset offers a unique opportunity to examine patterns and trends in county-level cancer rates in the United States. Using data from cancer.gov and the US Census American Community Survey, it provides insight into how the age-adjusted death rate, average deaths per year, and recent trends vary between counties, along with other key metrics such as average annual counts, whether the objective of 45.5 (1) was met, and the recent trend (2) in death rates. With these data you can build linear regression models to determine correlations between variables, helping you better understand cancer prevalence across counties over time and making it easier to target health initiatives and resources where they are needed.


    How to use the dataset

    This Kaggle dataset provides county-level data from the US Census American Community Survey and cancer.gov for exploring correlations between county-level cancer rates, trends, and mortality statistics. It contains records for all U.S. counties covering the age-adjusted death rate, average deaths per year, recent trend (2) in death rates, average annual count of cases detected within 5 years, and whether or not the objective of 45.5 (1) was met in the county associated with each row.

    To use this dataset to its fullest potential, you should be comfortable with basic descriptive analytics: calculating summary statistics such as the mean and median; summarizing categorical variables with frequency tables; creating visualizations such as charts and histograms; applying linear regression or other machine learning techniques such as support vector machines (SVMs), random forests, or neural networks; distinguishing supervised from unsupervised learning; reviewing diagnostic tests to evaluate your models; interpreting your findings; forming hypotheses about the patterns discovered during exploration; and communicating results through effective presentations. This understanding will let you apply the different methods of analysis to this dataset accurately and effectively.

    Once these concepts are understood, start by importing the data into your tool of choice: Tableau Public or Desktop, QlikView, SAS, or Python notebooks (loading packages such as scikit-learn if you plan to build predictive models). A brief description of the table's column structure is provided above. With basic SQL you can compute summary statistics, select subsets by specifying conditions, and sort on specific attributes; grouping, aggregating, and joining tables are useful preprocessing steps before fitting any models. From there, compute correlation and covariance matrices and examine scatter plots of the relevant metrics to identify distributional relationships and trends.
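    To make that workflow concrete, here is a minimal sketch of the kind of multivariate OLS regression this dataset is intended for; the data and column names are synthetic stand-ins, not the actual schema.

      import numpy as np
      import pandas as pd
      import statsmodels.formula.api as smf

      rng = np.random.default_rng(3)
      df = pd.DataFrame({
          "poverty_rate": rng.uniform(5, 30, 200),   # hypothetical covariates
          "avg_annual_count": rng.poisson(80, 200),
      })
      df["age_adjusted_death_rate"] = 140 + 1.5 * df["poverty_rate"] + rng.normal(0, 10, 200)

      model = smf.ols("age_adjusted_death_rate ~ poverty_rate + avg_annual_count", data=df).fit()
      print(model.params)    # fitted coefficients
      print(model.rsquared)  # goodness of fit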

    Research Ideas

    • Building a predictive cancer incidence model based on county-level demographic data to identify high-risk areas and target public health interventions.
    • Analyzing correlations between age-adjusted death rate, average annual count, and recent trends in order to develop more effective policy initiatives for cancer prevention and healthcare access.
    • Utilizing the dataset to construct a machine learning algorithm that can predict county-level mortality rates based on socio-economic factors such as poverty levels and educational attainment rates

    Acknowledgements

    If you use this dataset i...

  7. Data from: Functional traits and community composition: a comparison among...

    • data.niaid.nih.gov
    • search.dataone.org
    • +1 more
    zip
    Updated Nov 2, 2018
    Cite
    Jesse E.D. Miller; Ellen I. Damschen; Anthony R. Ives (2018). Functional traits and community composition: a comparison among community-weighted means, weighted correlations, and multilevel models [Dataset]. http://doi.org/10.5061/dryad.7gj0s3b
    Available download formats: zip
    Dataset updated
    Nov 2, 2018
    Dataset provided by
    University of Wisconsin–Madison
    Authors
    Jesse E.D. Miller; Ellen I. Damschen; Anthony R. Ives
    License

    https://spdx.org/licenses/CC0-1.0.html

    Area covered
    Siskiyou Mountains, Oregon
    Description
    1. Of the several approaches that are used to analyze functional trait-environment relationships, the most popular is community-weighted mean regression (CWMr), in which species trait values are averaged at the site level and then regressed against environmental variables (a minimal sketch of CWMr follows this list). Other approaches include model-based methods and weighted correlations of different metrics of trait-environment associations, the best known of which is the fourth-corner correlation method.
    2. We investigated these three general statistical approaches for trait-environment associations: CWMr, five weighted correlation metrics (Peres-Neto et al. 2017), and two multilevel models (MLM) using four different methods for computing p-values. We first compared the methods applied to a plant community dataset. To determine the validity of the statistical conclusions, we then performed a simulation study.
    3. CWMr gave highly significant associations for both traits, while the other methods gave a mix of support. CWMr had inflated type I errors for some simulation scenarios, implying that the significant results for the data could be spurious. The weighted correlation methods had generally good type I error control but had low power. One of the multilevel models, that from Jamil et al. (2013), had both good type I error control and high power when an appropriate method was used to obtain p-values. In particular, if there was no correlation among species in their abundances among sites, a parametric bootstrap likelihood ratio test (LRT) gave the best power. When there was correlation among species in their abundances, a conditional parametric LRT had correct type I errors but had lower power.
    4. There is no overall best method for identifying trait-environment associations. For the simple task of testing, one-by-one, associations between single environmental variables and single traits, the weighted correlations with permutation tests all had good type I error control, and their ease of implementation is an advantage. For the more complex task of multivariate analyses and model fitting, and when high statistical power is needed, we recommend MLM2 (Jamil et al. 2013); however, care must be taken to ensure against inflated type I errors. Because CWMr exhibited highly inflated type I error rates, it should always be avoided.
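    For readers unfamiliar with CWMr (point 1 above), a minimal sketch on synthetic data: abundance-weighted mean trait values are computed per site and regressed on an environmental variable. Bear in mind the authors' finding that this simple procedure can badly inflate type I error.

      import numpy as np

      rng = np.random.default_rng(7)
      n_sites, n_species = 30, 50
      abund = rng.poisson(2, (n_sites, n_species)).astype(float)  # site-by-species abundances
      traits = rng.normal(size=n_species)                         # one trait value per species
      env = rng.normal(size=n_sites)                              # one env. variable per site

      cwm = (abund * traits).sum(axis=1) / abund.sum(axis=1)      # community-weighted means
      slope, intercept = np.polyfit(env, cwm, 1)                  # the regression in CWMr
      print(f"CWM ~ env slope: {slope:.3f}")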
  8. Repeated measure MANOVA table combining all three performance measures of...

    • plos.figshare.com
    • datasetcatalog.nlm.nih.gov
    xls
    Updated Jun 2, 2023
    Cite
    Nivedita Bhadra; Shre Kumar Chatterjee; Saptarshi Das (2023). Repeated measure MANOVA table combining all three performance measures of all 10 classifiers. [Dataset]. http://doi.org/10.1371/journal.pone.0285321.t004
    Available download formats: xls
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Nivedita Bhadra; Shre Kumar Chatterjee; Saptarshi Das
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Repeated measure MANOVA table combining all three performance measures of all 10 classifiers.

  9. Goalkeeper and Midfielder Statistics

    • kaggle.com
    zip
    Updated Dec 8, 2022
    Cite
    The Devastator (2022). Goalkeeper and Midfielder Statistics [Dataset]. https://www.kaggle.com/datasets/thedevastator/maximizing-player-performance-with-goalkeeper-an
    Available download formats: zip (108659 bytes)
    Dataset updated
    Dec 8, 2022
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Goalkeeper and Midfielder Statistics

    Leveraging Statistical Data Of Goalkeepers and Midfielders

    By [source]

    About this dataset

    Welcome to Kaggle's dataset, where we provide rich and detailed insights into professional football players. Analyze player performance and team data with over 125 different metrics covering everything from goal involvement to tackles won, errors made and clean sheets kept. With the high levels of granularity included in our analysis, you can identify which players are underperforming or stand out from their peers for areas such as defense, shot stopping and key passes. Discover current trends in the game or uncover players' hidden value with this comprehensive dataset - a must-have resource for any aspiring football analyst!


    How to use the dataset

    • Define Performance: The first step of using this dataset is defining what type of performance you are measuring. Are you looking at total goals scored? Assists made? Shots on target? This will allow you to choose which metrics from the dataset best fit your criteria.

    • Descriptive Analysis: Once you have chosen your metric(s), it's time for descriptive analysis. This means analyzing the patterns within the data that contribute towards that metric(s). Does one team have more potential assist makers than another? What about shot accuracy or tackles won %? With descriptive analysis, we'll look for general trends across teams or specific players that influence performance in a meaningful way.

    • Predictive Analysis: Finally, we can move on to predictive analysis. This type of analysis seeks to answer two questions: which factors predict player performance, and which of those factors matter most? Using various predictive models (for example, logistic regression or random forest), we can determine which variables in our dataset best explain a given metric's outcome, such as expected goals per match, and build models that accurately predict future outcomes from the input values associated with those factors. (A toy sketch follows below.)

    By following these steps outlined here, you'll be able to get started in finding relationships between different metrics from this dataset and leveraging these insights into predictions about player performance!
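    As a toy illustration of the predictive-analysis step (synthetic data and invented feature roles, not this dataset's columns), the sketch below fits a logistic regression and reports cross-validated accuracy.

      import numpy as np
      from sklearn.linear_model import LogisticRegression
      from sklearn.model_selection import cross_val_score

      rng = np.random.default_rng(5)
      X = rng.normal(size=(300, 3))  # e.g., shots, shots on target, key passes (invented)
      y = (X[:, 1] + 0.5 * X[:, 2] + rng.normal(0, 1, 300) > 0).astype(int)  # scored or not

      clf = LogisticRegression()
      print("5-fold CV accuracy:", cross_val_score(clf, X, y, cv=5).mean())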

    Research Ideas

    • Creating an advanced predictive analytics model: By using the data in this dataset, it would be possible to create an advanced predictive analytics model that can analyze player performance and provide more accurate insights on which players are likely to have the most impact during a given season.
    • Using Machine Learning algorithms to identify potential transfer targets: By using a variety of metrics included in this dataset, such as shots, shots on target and goals scored, it would be possible to use Machine Learning algorithms to identify potential transfer targets for a team.
    • Analyzing positional differences between players: This dataset contains information about each player's position as well as their performance metrics across various aspects of the game (e.g., crosses attempted, defensive clearances). Thus it could be used for analyzing how certain positional groupings perform differently from one another in certain aspects of their play over different stretches of time or within one season or matchday in particular.

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: DEF PerApp 2GWs.csv

    | Column name | Description |
    |:----------------|:--------------------------------------|
    | Name | Name of the player. (String) |
    | App. | Number of appearances. (Integer) |
    | Minutes | Number of minutes played. (Integer) |
    | Shots | Number of shots taken. (Integer) |
    | Shots on Target | Number of shots on target. (Integer) |
    ...

  10. Change factors to derive projected future precipitation...

    • catalog.data.gov
    • data.usgs.gov
    Updated Nov 13, 2025
    + more versions
    Cite
    U.S. Geological Survey (2025). Change factors to derive projected future precipitation depth-duration-frequency (DDF) curves at 174 National Oceanic and Atmospheric Administration (NOAA) Atlas 14 stations in central and south Florida [Dataset]. https://catalog.data.gov/dataset/r-script-to-create-boxplots-of-change-factors-by-noaa-atlas-14-station-or-for-all-stations-a92e3
    Dataset updated
    Nov 13, 2025
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Area covered
    Florida
    Description

    This data release consists of Microsoft Excel workbooks, shapefiles, and a figure (png format) related to a cooperative project between the U.S. Geological Survey (USGS) and the South Florida Water Management District (SFWMD) to derive projected future change factors for precipitation depth-duration-frequency (DDF) curves at 174 National Oceanic and Atmospheric Administration (NOAA) Atlas 14 stations in central and south Florida. The change factors were computed as the ratio of projected future (2050-2089) to historical (1966-2005) extreme precipitation depths fitted to extreme precipitation data using a constrained maximum likelihood (CML) approach. The change factors are tabulated by duration (1, 3, and 7 days) and return period (5, 10, 25, 50, 100, and 200 years). The official historical NOAA Atlas 14 DDF curves based on partial-duration series (PDS) can be multiplied by the change factors derived in this project to determine projected future extreme precipitation for events of a given duration and return period. Various statistical, dynamical and hybrid downscaled precipitation datasets were used to derive the change factors at the grid cells closest to the NOAA Atlas 14 stations including (1) the Coordinated Regional Downscaling Experiment (CORDEX), (2) the Localized Constructed Analogues (LOCA) dataset, (3) the Multivariate Adaptive Constructed Analogs (MACA) dataset, (4) the Analog Resampling and Statistical Scaling Method by Jupiter Intelligence using the Weather Research and Forecasting Model (JupiterWRF). The emission scenarios evaluated include representative concentration pathways RCP4.5 and RCP8.5 from the Coupled Model Intercomparison Project Phase 5 (CMIP5) for the downscaled climate datasets CORDEX, LOCA, and MACA. The emission scenarios evaluated for the JupiterWRF downscaled dataset include RCP8.5 from CMIP5, and shared socioeconomic pathways SSP2-4.5 and SSP5-8.5 from the Coupled Model Intercomparison Project Phase 6 (CMIP6). Only daily durations are evaluated for JupiterWRF. When applying change factors to the historical NOAA Atlas 14 DDF curves to derive projected future precipitation DDF curves for the entire range of durations and return periods evaluated as part of this project, there is a possibility that the resulting projected future DDF curves may be inconsistent across duration and return period. By inconsistent it is meant that the precipitation depths may decrease for longer durations instead of increasing. Depending on the change factors used, this may happen in up to 6% of cases. In such a case, it is recommended that users use the higher of the projected future precipitation depths derived for the duration of interest and the previous shorter duration. This data release consists of four shapefiles: (1) polygons for the basins defined in the South Florida Water Management District (SFWMD)'s ArcHydro Enhanced Database (AHED) (AHED_basins.shp); (2) polygons of climate regions (Climate_regions.shp); (3) polygons of Areal Reduction Factor (ARF) regions for the state of Florida (ARF_regions.shp); and (4) point locations of NOAA Atlas 14 stations in central and south Florida for which depth-duration-frequency curves and change factors of precipitation depths were developed as part of this project (Atlas14_stations.shp). This data release also includes 21 tables. 
Four tables contain computed change factors for the four downscaled climate datasets: (1) CORDEX (CF_CORDEX_future_to_historical.xlsx); (2) LOCA (CF_LOCA_future_to_historical.xlsx); (3) MACA (CF_MACA_future_to_historical.xlsx); and (4) JupiterWRF (CF_JupiterWRF_future_to_historical.xlsx). Eight tables contain the corresponding DDF values for the historical and projected future periods in each of the four downscaled climate datasets: (1) CORDEX historical (DDF_CORDEX_historical.xlsx); (2) CORDEX projected future (DDF_CORDEX_future.xlsx); (3) LOCA historical (DDF_LOCA_historical.xlsx); (4) LOCA projected future (DDF_LOCA_future.xlsx); (5) MACA historical (DDF_MACA_historical.xlsx); (6) MACA projected future (DDF_MACA_future.xlsx); (7) JupiterWRF historical (DDF_JupiterWRF_historical.xlsx); and (8) JupiterWRF projected future (DDF_JupiterWRF_future.xlsx). Six tables contain quantiles of change factors at 174 NOAA Atlas 14 stations in central and south Florida derived from various downscaled climate datasets considering: (1) all models and all future emission scenarios evaluated (CFquantiles_future_to_historical_all_models_allRCPs.xlsx); (2) all models and only the RCP4.5 and SSP2-4.5 future emission scenarios (CFquantiles_future_to_historical_all_models_RCP4.5.xlsx); (3) all models and only the RCP8.5 and SSP5-8.5 future emission scenarios (CFquantiles_future_to_historical_all_models_RCP8.5.xlsx); (4) best models and all future emission scenarios evaluated (CFquantiles_future_to_historical_best_models_allRCPs.xlsx); (5) best models and only the RCP4.5 and SSP2-4.5 future emission scenarios (CFquantiles_future_to_historical_best_models_RCP4.5.xlsx); and (6) best models and only the RCP8.5 and SSP5-8.5 future emission scenarios (CFquantiles_future_to_historical_best_models_RCP8.5.xlsx). Finally, three tables contain miscellaneous information: (1) information about downscaled climate datasets and National Oceanic and Atmospheric Administration (NOAA) Atlas 14 stations used in this project (Datasets_station_information.xlsx); (2) best models for each downscaled climate dataset and for all downscaled climate datasets considered together (Best_model_lists.xlsx); and (3) areal reduction factors by region in Florida (Areal_reduction_factors.xlsx). An R script is provided which generates boxplots of change factors at a NOAA Atlas 14 station, or for all NOAA Atlas 14 stations in an ArcHydro Enhanced Database (AHED) basin or county (create_boxplot.R). A Microsoft Word file documenting code usage and available options is also provided within this data release (Documentation_R_script_create_boxplot.docx). Disclaimer: As a reminder, projected future (2050-89) and historical (1966-2005) DDF curves fitted to extreme precipitation data from models in each downscaled climate dataset are provided as part of this data release as a way to verify the computed change factors. However, these model-based projected future and historical DDF curves are expected to be biased and only their ratio (change factor) is considered a reasonable approximation of how historically-observed DDF depths might be multiplicatively amplified or muted in the future period 2050-89. An error was identified in the bias-corrected CORDEX data used as described at https://na-cordex.org/bias-correction-error.html. 
Datasets developed previously by the USGS for this data release were based on these erroneous data and were originally published at: Irizarry-Ortiz, M.M., and Stamm, J.F., 2021, Change factors to derive future precipitation depth-duration-frequency (DDF) curves at 174 National Oceanic and Atmospheric Administration (NOAA) Atlas 14 stations in central and south Florida: U.S. Geological Survey data release, https://doi.org/10.5066/P9KEMHYM. Data downloaded from that ScienceBase page prior to April 1, 2022 are based on this erroneous bias-corrected CORDEX dataset and have been superseded by the data on this page. On January 10, 2022, the University Corporation for Atmospheric Research notified the USGS that a revised set of bias-corrected CORDEX data was available for download. The USGS recomputed depth-duration-frequency (DDF) curves and change factors based on the revised CORDEX dataset, and the updated results were posted on this ScienceBase page on April 1, 2022. Data downloaded from this page are based on the revised bias-corrected CORDEX dataset. To obtain the previous superseded dataset, please contact Michelle Irizarry-Ortiz at mirizarry-ortiz@usgs.gov. First release: October 2021. Revised: March 2022.
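    A hedged sketch of how the change factors are meant to be applied, including the recommended fix for the duration inconsistency noted above (use the higher of the projected depth for the duration of interest and that of the previous shorter duration); all numbers are illustrative, not values from this release.

      import numpy as np

      durations = [1, 3, 7]                          # days
      hist_depth = np.array([8.0, 10.5, 12.0])       # illustrative Atlas 14 depths (inches)
      change_factor = np.array([1.25, 1.10, 1.05])   # illustrative factors for one station/return period

      future = hist_depth * change_factor
      future = np.maximum.accumulate(future)         # depths must not decrease with duration
      print(dict(zip(durations, np.round(future, 2))))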

  11. Data from: A Statistical Approach for Identifying the Best Combination of...

    • datasetcatalog.nlm.nih.gov
    • acs.figshare.com
    • +1 more
    Updated Dec 11, 2024
    Cite
    Jha, Girish Kumar; Mishra, Dwijesh Chandra; Sakthivel, Kabilan; Khan, Yasin Jeshima; Lal, Shashi Bhushan; Madival, Sharanbasappa D; Vaidhyanathan, Ramasubramanian; Chaturvedi, Krishna Kumar; Srivastava, Sudhir (2024). A Statistical Approach for Identifying the Best Combination of Normalization and Imputation Methods for Label-Free Proteomics Expression Data [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001385078
    Dataset updated
    Dec 11, 2024
    Authors
    Jha, Girish Kumar; Mishra, Dwijesh Chandra; Sakthivel, Kabilan; Khan, Yasin Jeshima; Lal, Shashi Bhushan; Madival, Sharanbasappa D; Vaidhyanathan, Ramasubramanian; Chaturvedi, Krishna Kumar; Srivastava, Sudhir
    Description

    Label-free proteomics expression data sets often exhibit data heterogeneity and missing values, necessitating the development of effective normalization and imputation methods. The selection of appropriate normalization and imputation methods is inherently data-specific, and choosing the optimal approach from the available options is critical for ensuring robust downstream analysis. This study aimed to identify the most suitable combination of these methods for quality control and accurate identification of differentially expressed proteins. In this study, we developed nine combinations by pairing three normalization methods, locally weighted linear regression (LOESS), variance stabilization normalization (VSN), and robust linear regression (RLR), with three imputation methods: k-nearest neighbors (k-NN), local least-squares (LLS), and singular value decomposition (SVD). We utilized statistical measures, including the pooled coefficient of variation (PCV), pooled estimate of variance (PEV), and pooled median absolute deviation (PMAD), to assess intragroup and intergroup variation. The combinations yielding the lowest values for each statistical measure were chosen as the data set's suitable normalization and imputation methods. The performance of this approach was tested using two spiked-in standard label-free proteomics benchmark data sets. The identified combinations returned a low NRMSE and performed better at identifying spiked-in proteins. The developed approach can be accessed through the R package 'lfproQC' and a user-friendly Shiny web application (https://dabiniasri.shinyapps.io/lfproQC and http://omics.icar.gov.in/lfproQC), making it a valuable resource for researchers looking to apply this method to their data sets.
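    The selection logic reduces to scoring every candidate combination with a pooled statistic and keeping the minimum. A minimal sketch using PCV and two stand-in normalizations (not the lfproQC implementations, and with the imputation axis omitted for brevity):

      import numpy as np

      rng = np.random.default_rng(11)
      data = rng.lognormal(10, 1, (500, 6))  # proteins x samples, two groups of three
      groups = [slice(0, 3), slice(3, 6)]

      def log_norm(x):
          return np.log2(x)

      def median_norm(x):
          return x / np.median(x, axis=0)

      def pcv(x):
          # pooled coefficient of variation: per-protein CVs, pooled over groups
          cvs = [np.nanstd(x[:, g], axis=1) / np.nanmean(x[:, g], axis=1) for g in groups]
          return float(np.nanmean(cvs))

      combos = {"log2": log_norm, "median": median_norm}
      best = min(combos, key=lambda k: pcv(combos[k](data)))
      print("lowest-PCV normalization:", best)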

  12. Evaluation of Precipitation and Temperature: An Analysis of In-Situ...

    • search.dataone.org
    Updated Nov 16, 2024
    + more versions
    Cite
    Reza Morovati; Ehsan Ebrahimi; Ehsan Kahrizi; Pamela Claure (2024). Evaluation of Precipitation and Temperature: An Analysis of In-Situ Observations Versus Gridded Data within the Great Salt Lake Basin [Dataset]. https://search.dataone.org/view/sha256%3A92c95ba4c90218686b62df33cdf28ceb2d3ffa62dbba1bd1dc88977c51f4d151
    Dataset updated
    Nov 16, 2024
    Dataset provided by
    Hydroshare
    Authors
    Reza Morovati; Ehsan Ebrahimi; Ehsan Kahrizi; Pamela Claure
    Time period covered
    Jan 1, 1990 - Dec 31, 2020
    Description

    This study evaluates the consistency between in-situ measurements and gridded datasets for precipitation and temperature within the Great Salt Lake Basin, highlighting the significant implications for hydrological modelling and climate analysis. We analysed five widely recognized gridded datasets: GRIDMET, DAYMET, PRISM, NLDAS-2, and CONUS404, utilizing statistical metrics such as the Pearson Correlation Coefficient, Root Mean Square Error (RMSE), and Kling-Gupta Efficiency to assess their accuracy and reliability against ground truth data from 30 meteorological stations. Our findings indicate that the PRISM dataset outperformed others, demonstrating the lowest median RMSE values for both precipitation (approximately 1.9 mm/day) and temperature (approximately 0.9°C), which is attributed to its advanced interpolation methods that effectively incorporate orographic adjustments. In contrast, NLDAS-2 and CONUS404, despite their finer temporal resolutions, showed greater error variability and lower performance metrics, which may limit their utility for detailed hydrological applications. Through the use of visual analytical tools such as heatmaps and boxplots, we were able to vividly illustrate the performance disparities across the datasets, thereby providing a clear comparative analysis that underscores the strengths and weaknesses of each dataset. The study emphasizes the need for careful selection of gridded datasets based on specific regional characteristics to improve the accuracy and reliability of hydroclimatological studies and supports better-informed decisions in climate-related adaptations and policy-making. The insights gained from this analysis aim to guide researchers and practitioners in selecting the most appropriate datasets that align with the unique climatic and topographical conditions of the Great Salt Lake Basin, enhancing the efficacy of environmental forecasting and resource management strategies.
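    For reference, the three headline metrics are straightforward to compute for one station/dataset pair; the sketch below uses synthetic daily precipitation as stand-ins for station observations and a gridded estimate.

      import numpy as np

      rng = np.random.default_rng(12)
      obs = rng.gamma(0.5, 4.0, 365)          # observed daily precipitation (mm), synthetic
      sim = obs * rng.lognormal(0, 0.3, 365)  # a gridded estimate with multiplicative noise

      r = np.corrcoef(obs, sim)[0, 1]                 # Pearson correlation
      rmse = np.sqrt(np.mean((sim - obs) ** 2))       # root mean square error
      alpha = sim.std() / obs.std()                   # variability ratio
      beta = sim.mean() / obs.mean()                  # bias ratio
      kge = 1 - np.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)  # Kling-Gupta
      print(f"r={r:.2f}  RMSE={rmse:.2f} mm/day  KGE={kge:.2f}")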

  13. Source-code and datasets for: Performance-based pay and limited information...

    • resodate.org
    Updated Oct 2, 2025
    Cite
    Patrick Reinwald; Stephan Leitner; Friederike Wall (2025). Source-code and datasets for: Performance-based pay and limited information access. An agent-based model of the hidden-action problem [Dataset]. https://resodate.org/resources/aHR0cHM6Ly9qb3VybmFsZGF0YS56YncuZXUvZGF0YXNldC9qYm5zdC0yMDIzLTAxMDE=
    Dataset updated
    Oct 2, 2025
    Dataset provided by
    ZBW Journal Data Archive
    ZBW
    Journal of Economics and Statistics
    Authors
    Patrick Reinwald; Stephan Leitner; Friederike Wall
    Description

    Content

    The provided content consists of two parts: first, the code of the agent-based simulation model (in the folder "Model"); and second, the data generated using the model and analyzed in the manuscript (in the folder "Datasets").

    Summary

    This agent-based simulation model aims to analyze the effects of limited information access (and limited memory) in Holmström's hidden action model

    • on the principal’s and the agent’s respective utilities,

    • the effort (action) the agent makes to perform a task, and

    • the premium parameter in the rule to share the outcome between the principal and the agent.

    Requirements

    MATLAB R2019b or higher is required to run the model and to analyze the datasets. In addition, the following packages are required to run the model:

    • Parallel Computing Toolbox

    • Symbolic Math Toolbox

    • Optimization Toolbox

    • Global Optimization Toolbox

    • Statistics and Machine Learning Toolbox

    Running the model

    Open the folder "Model". Find and double-click the file main.m (in the folder "agentization"). The MATLAB editor opens, and you can change the simulation parameters.

    To run the model, you can either:

    • Type the script name (main) in the command line and press enter

    • Select the main.m file in the editor and press the run button (green triangle)

    Please note: If a message pops up with the options "Change Folder", "Add to Path", "Cancel", and "Help", please choose "Add to Path".

    You can set all relevant parameters in the file main.m

    • umwSD: This is the standard deviation of the normal distribution from which the environmental variable is drawn. It is defined relative (in %) to the optimal outcome. We set it to either 5, 25, or 45.

    • jto: This is the number of simulation runs. We set it to 700 in all scenarios. You are free to change it to any number. However, please note that performing many simulation runs might take a long time.

    • limitedMemoryP: This parameter defines whether the principal’s memory is limited or not. The variable can be set to either true or false. If set to false, the principal’s memory is unlimited and changes in the variable "memoryP" have no effects.

    • limitedMemoryA: This parameter defines whether the agent’s memory is limited or not. The variable can be set to either true or false. If set to false, the agent’s memory is unlimited and changes in the variable "memoryA" have no effects.

    • memoryP: This variable defines the length of the principal’s memory (in periods). We set it either to 1, 3, or 5.

    • memoryA: This variable defines the length of the agent’s memory (in periods). We set it either to 1, 3, or 5.

    Simulation output

    The simulation model creates the folder "Results" in the project directory. This folder contains at least one subfolder whose name includes, among other things, the values assigned to the variables umwSD (environment) and jto (number of simulation runs). Each such subfolder contains two further folders named "einzelneSims" (in which only intermediate results are saved) and "final" (in which the final simulation data are saved). The simulation output includes 61 variables. However, not all of these variables are used in the analysis because some are saved for verification only. The most important variables are the following (the ones used in the study are printed in bold font):

    • opta: The effort level proposed by the second-best solution of Holmström’s model.

    • a_A_sims: This is a 700*20 matrix in our case (700 simulation runs, 20 periods). It contains information on the effort made by the agent to perform a task (in every timestep).

    • a_P_sims: This is a 700*20 matrix in our case (700 simulation runs, 20 periods). It contains information on the effort levels incited by the principal (in every timestep).

    • optp: The premium parameter proposed by the second-best solution of Holmström’s model.

    • p_P_sims: This is a 700*20 matrix in our case (700 simulation runs, 20 periods). It contains information on the premium parameter set by the principal (in every timestep).

    • optUA: The agent's utility proposed by the second-best solution of Holmström’s model.

    • UA_A_sims: This is a 700*20 matrix in our case (700 simulation runs, 20 periods). It contains information on the (agent’s) utility expected by the agent (in every timestep).

    • UA_P_sims: This is a 700*20 matrix in our case (700 simulation runs, 20 periods). It contains information on the agent's utility expected by the principal (in every timestep).

    • UA_realized_sims: This is a 700*20 matrix in our case (700 simulation runs, 20 periods). It contains information on the utility that realized for the agent (in every timestep).

    • lostUA_sims: This is a 700*20 matrix in our case (700 simulation runs, 20 periods). It contains the difference between the optimal and the realized utility for the agent in every timestep (i.e., the optimal minus the achieved utility of the agent).

    • optUP: The principal's utility proposed by the second-best solution of Holmström’s model.

    • UP_P_sims: This is a 700*20 matrix in our case (700 simulation runs, 20 periods). It contains information on the (principal’s) utility expected by the principal (in every timestep).

• UP_realized_sims: This is a 700*20 matrix in our case (700 simulation runs, 20 periods). It contains information on the utility actually realized by the principal (in every timestep).

• lostUP_sims: This is a 700*20 matrix in our case (700 simulation runs, 20 periods). It contains the difference between the optimal and the realized utility for the principal in every timestep (i.e., the optimal minus the achieved utility of the principal).

    • optoutcome: The outcome proposed by the second-best solution of Holmström’s model.

    • outcome_realized_sims: This is a 700*20 matrix in our case (700 simulation runs, 20 periods). It contains information on the outcome that materialized (in every timestep).

• limitedMemoryA: This variable indicates whether the agent’s memory was limited, set to either 1 or 0 (if 0, the agent’s memory was unlimited).

• limitedMemoryP: This variable indicates whether the principal’s memory was limited, set to either 1 or 0 (if 0, the principal’s memory was unlimited).

    • lostoutcome_sims: This is a 700*20 matrix in our case (700 simulation runs, 20 periods). It contains the result of the optimal outcome minus the achieved outcome (in every timestep).

• umwM: This variable gives the mean of the normally distributed environmental factor (set to 0 in our scenarios).

• umwSD: This variable contains the standard deviation of the environmental variable; it is calculated as the deviation chosen in main.m multiplied by optoutcome.

    • jto: This is the number of simulation runs (we set it to 700 in all scenarios).
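As a quick orientation, the sketch below loads the final output and recomputes the agent’s utility loss from the variables described above. The .mat file name and the scenario subfolder name are hypothetical placeholders; use the names the model actually writes to Results/<scenario>/final.

    % Hedged sketch: recompute the agent's per-period utility loss from the output.
    % The path and file name below are hypothetical placeholders.
    S = load(fullfile('Results', 'umwSD25_jto700', 'final', 'results.mat'));
    lostUA = S.optUA - S.UA_realized_sims;   % 700x20: optimal minus realized agent utility
    meanLostUA = mean(lostUA, 1);            % 1x20: mean loss per period across all runs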

    Datasets

    The folder "Datasets" contains simulation data for scenarios with both limited and unlimited memory, covering all four observations (premium parameter, agent's effort, principal's utility, and agent's utility) in CSV format. Each row represents one simulation run, and columns represent timesteps within each run. For unlimited memory scenarios, the first column details environmental turbulence, followed by 200 columns representing the timesteps of a single run. For limited memory scenarios, the first three columns provide information on environmental turbulence, the principal's memory, and the agent's memory, followed by 20 columns that capture the results of each simulation run.

    Contact

For any remaining questions, please contact me at stephan.leitner@aau.at

14. Two Buttes, CO Population Breakdown by Gender

    • neilsberg.com
    csv, json
    Updated Sep 14, 2023
    Cite
    Neilsberg Research (2023). Two Buttes, CO Population Breakdown by Gender [Dataset]. https://www.neilsberg.com/research/datasets/65b7d550-3d85-11ee-9abe-0aa64bf2eeb2/
Explore at: json, csv (available download formats)
    Dataset updated
    Sep 14, 2023
    Dataset authored and provided by
    Neilsberg Research
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Colorado, Two Buttes
    Variables measured
    Male Population, Female Population, Male Population as Percent of Total Population, Female Population as Percent of Total Population
    Measurement technique
    The data presented in this dataset is derived from the latest U.S. Census Bureau American Community Survey (ACS) 2017-2021 5-Year Estimates. To measure the two variables, namely (a) population and (b) population as a percentage of the total population, we initially analyzed and categorized the data for each of the gender classifications (biological sex) reported by the US Census Bureau. For further information regarding these estimates, please feel free to reach out to us via email at research@neilsberg.com.
    Dataset funded by
    Neilsberg Research
    Description
    About this dataset

    Context

    The dataset tabulates the population of Two Buttes by gender, including both male and female populations. This dataset can be utilized to understand the population distribution of Two Buttes across both sexes and to determine which sex constitutes the majority.

    Key observations

There is a female majority, with 62.5% of the total population being female. Source: U.S. Census Bureau American Community Survey (ACS) 2017-2021 5-Year Estimates.

    Content

    When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2017-2021 5-Year Estimates.

Scope of gender:

Please note that the American Community Survey asks a question about the respondent’s current sex, but not about gender, sexual orientation, or sex at birth. The question is intended to capture data for biological sex, not gender. Respondents are asked to answer either Male or Female. Our research and this dataset mirror the data reported as Male and Female for gender distribution analysis. No further analysis is done on the data reported by the Census Bureau.

    Variables / Data Columns

• Gender: This column displays the gender (Male / Female).
• Population: This column shows the population of each gender in Two Buttes.
• % of Total Population: This column displays each gender’s share of the total population of Two Buttes. Please note that the percentages may not sum to exactly 100% due to rounding.

    Good to know

    Margin of Error

Data in the dataset are based on estimates and are therefore subject to sampling variability and a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.

    Custom data

If you need custom data for your research project, report, or presentation, you can contact our research staff at research@neilsberg.com about the feasibility of a custom tabulation on a fee-for-service basis.

    Inspiration

The Neilsberg Research team curates, analyzes, and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.

    Recommended for further research

This dataset is a part of the main dataset for Two Buttes Population by Gender. You can refer to it here.

15. Two Rivers Town, Wisconsin Population Breakdown by Gender Dataset: Male and...

    • neilsberg.com
    csv, json
    Updated Feb 24, 2025
    Cite
    Neilsberg Research (2025). Two Rivers Town, Wisconsin Population Breakdown by Gender Dataset: Male and Female Population Distribution // 2025 Edition [Dataset]. https://www.neilsberg.com/research/datasets/b258b316-f25d-11ef-8c1b-3860777c1fe6/
Explore at: csv, json (available download formats)
    Dataset updated
    Feb 24, 2025
    Dataset authored and provided by
    Neilsberg Research
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Two Rivers, Wisconsin
    Variables measured
    Male Population, Female Population, Male Population as Percent of Total Population, Female Population as Percent of Total Population
    Measurement technique
    The data presented in this dataset is derived from the latest U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates. To measure the two variables, namely (a) population and (b) population as a percentage of the total population, we initially analyzed and categorized the data for each of the gender classifications (biological sex) reported by the US Census Bureau. For further information regarding these estimates, please feel free to reach out to us via email at research@neilsberg.com.
    Dataset funded by
    Neilsberg Research
    Description
    About this dataset

    Context

    The dataset tabulates the population of Two Rivers town by gender, including both male and female populations. This dataset can be utilized to understand the population distribution of Two Rivers town across both sexes and to determine which sex constitutes the majority.

    Key observations

There is a male majority, with 57.96% of the total population being male. Source: U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates.

    Content

    When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates.

Scope of gender:

Please note that the American Community Survey asks a question about the respondent’s current sex, but not about gender, sexual orientation, or sex at birth. The question is intended to capture data for biological sex, not gender. Respondents are asked to answer either Male or Female. Our research and this dataset mirror the data reported as Male and Female for gender distribution analysis. No further analysis is done on the data reported by the Census Bureau.

    Variables / Data Columns

• Gender: This column displays the gender (Male / Female).
• Population: This column shows the population of each gender in Two Rivers town.
• % of Total Population: This column displays each gender’s share of the total population of Two Rivers town. Please note that the percentages may not sum to exactly 100% due to rounding.

    Good to know

    Margin of Error

Data in the dataset are based on estimates and are therefore subject to sampling variability and a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.

    Custom data

If you need custom data for your research project, report, or presentation, you can contact our research staff at research@neilsberg.com about the feasibility of a custom tabulation on a fee-for-service basis.

    Inspiration

The Neilsberg Research team curates, analyzes, and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.

    Recommended for further research

This dataset is a part of the main dataset for Two Rivers town Population by Race & Ethnicity. You can refer to it here.

16. Data for: Integrating open education practices with data analysis of open...

    • search.dataone.org
    • data.niaid.nih.gov
    • +1more
    Updated Jul 27, 2024
    Cite
    Marja Bakermans (2024). Data for: Integrating open education practices with data analysis of open science in an undergraduate course [Dataset]. http://doi.org/10.5061/dryad.37pvmcvst
    Explore at:
    Dataset updated
    Jul 27, 2024
    Dataset provided by
    Dryad Digital Repository
    Authors
    Marja Bakermans
    Description

The open science movement produces vast quantities of openly published data connected to journal articles, creating an enormous resource for educators to engage students in current topics and analyses. However, educators face challenges using these materials to meet course objectives. I present a case study using open science (published articles and their corresponding datasets) and open educational practices in a capstone course. While engaging in current topics of conservation, students trace connections in the research process, learn statistical analyses, and recreate analyses using the programming language R. I assessed the presence of best practices in open articles and datasets, examined student selection in the open grading policy, surveyed students on their perceived learning gains, and conducted a thematic analysis on student reflections. First, articles and datasets met just over half of the assessed fairness practices, but this increased with the publication date. There was a...

Article and dataset fairness

To assess the utility of open articles and their datasets as an educational tool in an undergraduate academic setting, I measured the congruence of each pair to a set of best practices and guiding principles. I assessed ten guiding principles and best practices (Table 1), where each category was scored ‘1’ or ‘0’ based on whether it met the criteria, with a total possible score of ten.

Open grading policies

Students were allowed to specify the percentage weight for each assessment category in the course, including 1) six coding exercises (Exercises), 2) one lead exercise (Lead Exercise), 3) fourteen annotation assignments of readings (Annotations), 4) one final project (Final Project), 5) five discussion board posts and a statement of learning reflection (Discussion), and 6) attendance and participation (Participation). I examined if assessment categories (independent variable) were weighted (dependent variable) differently by students using an analysis of ...

Data for: Integrating open education practices with data analysis of open science in an undergraduate course

Author: Marja H Bakermans
Affiliation: Worcester Polytechnic Institute, 100 Institute Rd, Worcester, MA 01609 USA
ORCID: https://orcid.org/0000-0002-4879-7771
Institutional IRB approval: IRB-24–0314

    Data and file overview

    The full dataset file called OEPandOSdata (.xlsx extension) contains 8 files. Below are descriptions of the name and contents of each file. NA = not applicable or no data available

    1. BestPracticesData.csv
      • Description: Data to assess the adherence of articles and datasets to open science best practices.
      • Column headers and descriptions:
        • Article: articles used in the study, numbered randomly
        • F1: Findable, Data are assigned a unique and persistent doi
        • F2: Findable, Metadata includes an identifier of data
        • F3: Findable, Data are registered in a searchable database
        • A1: ...
17. Two Inlets Township, Minnesota Population Breakdown by Gender

    • neilsberg.com
    csv, json
    Updated Sep 14, 2023
    Cite
    Neilsberg Research (2023). Two Inlets Township, Minnesota Population Breakdown by Gender [Dataset]. https://www.neilsberg.com/research/datasets/65b7dec0-3d85-11ee-9abe-0aa64bf2eeb2/
Explore at: csv, json (available download formats)
    Dataset updated
    Sep 14, 2023
    Dataset authored and provided by
    Neilsberg Research
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Minnesota, Two Inlets Township, Two Inlets
    Variables measured
    Male Population, Female Population, Male Population as Percent of Total Population, Female Population as Percent of Total Population
    Measurement technique
    The data presented in this dataset is derived from the latest U.S. Census Bureau American Community Survey (ACS) 2017-2021 5-Year Estimates. To measure the two variables, namely (a) population and (b) population as a percentage of the total population, we initially analyzed and categorized the data for each of the gender classifications (biological sex) reported by the US Census Bureau. For further information regarding these estimates, please feel free to reach out to us via email at research@neilsberg.com.
    Dataset funded by
    Neilsberg Research
    Description
    About this dataset

    Context

    The dataset tabulates the population of Two Inlets township by gender, including both male and female populations. This dataset can be utilized to understand the population distribution of Two Inlets township across both sexes and to determine which sex constitutes the majority.

    Key observations

There is a male majority, with 54.02% of the total population being male. Source: U.S. Census Bureau American Community Survey (ACS) 2017-2021 5-Year Estimates.

    Content

    When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2017-2021 5-Year Estimates.

Scope of gender:

Please note that the American Community Survey asks a question about the respondent’s current sex, but not about gender, sexual orientation, or sex at birth. The question is intended to capture data for biological sex, not gender. Respondents are asked to answer either Male or Female. Our research and this dataset mirror the data reported as Male and Female for gender distribution analysis. No further analysis is done on the data reported by the Census Bureau.

    Variables / Data Columns

• Gender: This column displays the gender (Male / Female).
• Population: This column shows the population of each gender in Two Inlets township.
• % of Total Population: This column displays each gender’s share of the total population of Two Inlets township. Please note that the percentages may not sum to exactly 100% due to rounding.

    Good to know

    Margin of Error

Data in the dataset are based on estimates and are therefore subject to sampling variability and a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.

    Custom data

If you need custom data for your research project, report, or presentation, you can contact our research staff at research@neilsberg.com about the feasibility of a custom tabulation on a fee-for-service basis.

    Inspiration

The Neilsberg Research team curates, analyzes, and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.

    Recommended for further research

This dataset is a part of the main dataset for Two Inlets township Population by Gender. You can refer to it here.

  18. Dataset: Predictions of Cyanobacteria and Microcystin in Lakes across the...

    • catalog.data.gov
    Updated Jun 27, 2025
    Cite
    U.S. EPA Office of Research and Development (ORD) (2025). Dataset: Predictions of Cyanobacteria and Microcystin in Lakes across the Conterminous United States [Dataset]. https://catalog.data.gov/dataset/dataset-predictions-of-cyanobacteria-and-microcystin-in-lakes-across-the-conterminous-unit
    Explore at:
    Dataset updated
    Jun 27, 2025
    Dataset provided by
United States Environmental Protection Agency (http://www.epa.gov/)
    Area covered
    Contiguous United States, United States
    Description

With increasing concerns about freshwater cyanobacteria blooms, there is a need to identify which waterbodies are at risk of developing these blooms, especially blooms that produce cyanotoxins. To address this concern, we developed spatial statistical models using the US National Lakes Assessment, a survey with over 3,000 spring and summer observations of cyanobacteria abundance and microcystin concentration in lakes across the conterminous US. We combined these observations with other nationally available data to model which lake and watershed factors best explain the presence of harmful cyanobacterial blooms. We then used these models to estimate the cyanobacteria abundance and the probability of microcystin detection in 124,500 lakes across the CONUS. This dataset includes the compiled data used to generate the models and the dataset used to generate predictions for a much larger population of lakes. The data package includes two tabular data files, two tabular metadata files, and one methods document.

  19. Estimated stand-off distance between ADS-B equipped aircraft and obstacles

    • zenodo.org
    • data.niaid.nih.gov
    • +1more
    jpeg, zip
    Updated Jul 12, 2024
    Cite
    Andrew Weinert; Andrew Weinert (2024). Estimated stand-off distance between ADS-B equipped aircraft and obstacles [Dataset]. http://doi.org/10.5281/zenodo.7741273
Explore at: zip, jpeg (available download formats)
    Dataset updated
    Jul 12, 2024
    Dataset provided by
Zenodo (http://zenodo.org/)
    Authors
    Andrew Weinert; Andrew Weinert
    License

Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    Summary:

    Estimated stand-off distance between ADS-B equipped aircraft and obstacles. Obstacle information was sourced from the FAA Digital Obstacle File and the FHWA National Bridge Inventory. Aircraft tracks were sourced from processed data curated from the OpenSky Network. Results are presented as histograms organized by aircraft type and distance away from runways.

    Description:

For many aviation safety studies, aircraft behavior is represented using encounter models, which are statistical models of how aircraft behave during close encounters. They are used to provide a realistic representation of the range of encounter flight dynamics where an aircraft collision avoidance system would be likely to alert. These models are currently, and historically have been, limited to interactions between aircraft; they have not represented the specific interactions between obstacles and transponder-equipped aircraft. In response, we calculated the standoff distance between obstacles and ADS-B equipped manned aircraft.

For robustness, MIT LL calculated the standoff distance using two different datasets of manned aircraft tracks and two datasets of obstacles. This approach aligned with the foundational research used to support the ASTM F3442/F3442M-20 well clear criteria of 2000 feet laterally and 250 feet AGL vertically.

The two datasets of manned aircraft tracks were processed tracks of ADS-B equipped aircraft curated from the OpenSky Network. It is likely that rotorcraft were underrepresented in these datasets, and there was no consideration of aircraft equipped only with Mode C or not equipped with any transponder. The first dataset, referred to as the "Monday" dataset, was used to train the v1.3 uncorrelated encounter models. The second dataset, referred to as the "aerodrome" dataset, was used to train the v2.0 and v3.x terminal encounter models. The Monday dataset consisted of 104 Mondays across North America. The aerodrome dataset was based on observations within 8 nautical miles of Class B, C, and D aerodromes in the United States for the first 14 days of each month from January 2019 through February 2020. Prior to any processing, the datasets required 714 and 847 gigabytes of storage, respectively. For more details on these datasets, please refer to "Correlated Bayesian Model of Aircraft Encounters in the Terminal Area Given a Straight Takeoff or Landing" and "Benchmarking the Processing of Aircraft Tracks with Triples Mode and Self-Scheduling."

Two different datasets of obstacles were also considered. The first was point obstacles defined by the FAA Digital Obstacle File (DOF), consisting of the point obstacle structure types antenna, lighthouse, meteorological tower (met), monument, sign, silo, spire (steeple), stack (chimney; industrial smokestack), transmission line tower (t-l tower), tank (water; fuel), tramway, utility pole (telephone pole, or pole of similar height, supporting wires), windmill (wind turbine), and windsock. Each obstacle was represented by a cylinder with the height reported by the DOF and a radius based on the reported horizontal accuracy. We did not consider the actual width and height of the structure itself. Additionally, we only considered obstacles at least 50 feet tall that were marked as verified in the DOF.

The other obstacle dataset, termed "bridges," was based on the bridges identified in the FAA DOF and additional information provided by the National Bridge Inventory (NBI). Due to the potential size and extent of bridges, it would not be appropriate to model them as point obstacles; however, the FAA DOF only provides a point location and no information about the size of a bridge. In response, we correlated the FAA DOF with the National Bridge Inventory, which provides information about the length of many bridges. Instead of sizing the simulated bridge based on horizontal accuracy, as with the point obstacles, each bridge was represented as a circle with a radius equal to the length of the longest, nearest bridge from the NBI. A circle representation was required because neither the FAA DOF nor the NBI provides sufficient information about orientation to represent bridges as rectangular cuboids. As with the point obstacles, the height of the obstacle was based on the height reported by the FAA DOF. Accordingly, the analysis using the bridge dataset should be viewed as risk averse and conservative: a manned aircraft may have been hundreds of feet away from an obstacle in actuality while the estimated standoff distance was significantly less. Additionally, because all obstacles are represented with a fixed height, the potentially flat and low-level entrances of a bridge are assumed to have the same height as the tall bridge towers. The attached figure illustrates an example simulated bridge.

It would have been extremely computationally inefficient to calculate the standoff distance for all possible track points. Instead, we defined an encounter between an aircraft and an obstacle as an aircraft flying at 3069 feet AGL or less coming within 3000 feet laterally of any obstacle within a 60 second time interval. If the criteria were satisfied, then for that 60 second track segment we calculated the standoff distance to all nearby obstacles. Vertical separation was based on the MSL altitude of the track and the maximum MSL height of an obstacle.
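A minimal sketch of this encounter filter, assuming track and obstacle positions have already been projected into a common local frame in feet, might look as follows (variable names are hypothetical; this is not the authors' code):

    % Encounter condition: aircraft at or below 3069 ft AGL and within 3000 ft
    % laterally of the obstacle cylinder at some point in the 60 s segment.
    latSep  = hypot(trackX_ft - obsX_ft, trackY_ft - obsY_ft) - obsRadius_ft; % lateral standoff per track point
    vertSep = trackMSL_ft - obsTopMSL_ft;       % track MSL altitude minus max obstacle MSL height
    isEncounter = any(trackAGL_ft <= 3069 & latSep <= 3000);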

    For each combination of aircraft track and obstacle datasets, the results were organized seven different ways. Filtering criteria were based on aircraft type and distance away from runways. Runway data was sourced from the FAA runways of the United States, Puerto Rico, and Virgin Islands open dataset. Aircraft type was identified as part of the em-processing-opensky workflow.

    • All: No filter, all observations that satisfied encounter conditions
    • nearRunway: Aircraft within or at 2 nautical miles of a runway
    • awayRunway: Observations more than 2 nautical miles from a runway
    • glider: Observations when aircraft type is a glider
    • fwme: Observations when aircraft type is a fixed-wing multi-engine
    • fwse: Observations when aircraft type is a fixed-wing single engine
    • rotorcraft: Observations when aircraft type is a rotorcraft

    License

    This dataset is licensed under Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International(CC BY-NC-ND 4.0).

This license requires that reusers give credit to the creator. It allows reusers to copy and distribute the material in any medium or format, in unadapted form and for noncommercial purposes only. Noncommercial means not primarily intended for or directed towards commercial advantage or monetary compensation. Exceptions are given for the not-for-profit standards organizations ASTM International and RTCA.

MIT is releasing this dataset in good faith to promote open and transparent research of the low altitude airspace. Given the limitations of the dataset and the need for more research, a more restrictive license was warranted. Namely, the dataset is based only on observations of ADS-B equipped aircraft, which not all aircraft in the airspace are required to employ, and the observations were sourced from a crowdsourced network whose surveillance coverage has not been robustly characterized.

    As more research is conducted and the low altitude airspace is further characterized or regulated, it is expected that a future version of this dataset may have a more permissive license.

    Distribution Statement

    DISTRIBUTION STATEMENT A. Approved for public release. Distribution is unlimited.

    © 2021 Massachusetts Institute of Technology.

    Delivered to the U.S. Government with Unlimited Rights, as defined in DFARS Part 252.227-7013 or 7014 (Feb 2014). Notwithstanding any copyright notice, U.S. Government rights in this work are defined by DFARS 252.227-7013 or DFARS 252.227-7014 as detailed above. Use of this work other than as specifically authorized by the U.S. Government may violate any copyrights that exist in this work.

    This material is based upon work supported by the Federal Aviation Administration under Air Force Contract No. FA8702-15-D-0001. Any opinions, findings, conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Federal Aviation Administration.

    This document is derived from work done for the FAA (and possibly others); it is not the direct product of work done for the FAA. The information provided herein may include content supplied by third parties. Although the data and information contained herein has been produced or processed from sources believed to be reliable, the Federal Aviation Administration makes no warranty, expressed or implied, regarding the accuracy, adequacy, completeness, legality, reliability or usefulness of any information, conclusions or recommendations provided herein. Distribution of the information contained herein does not constitute an endorsement or warranty of the data or information provided herein by the Federal Aviation Administration or the U.S. Department of Transportation. Neither the Federal Aviation Administration nor the U.S. Department of

  20. Diabetes in Adults - CDPHE Community Level Estimates (Census Tracts)

    • data-cdphe.opendata.arcgis.com
    • hub.arcgis.com
    • +1more
    Updated May 12, 2016
    Cite
    Colorado Department of Public Health and Environment (2016). Diabetes in Adults - CDPHE Community Level Estimates (Census Tracts) [Dataset]. https://data-cdphe.opendata.arcgis.com/datasets/diabetes-in-adults-cdphe-community-level-estimates-census-tracts
    Explore at:
    Dataset updated
    May 12, 2016
    Dataset authored and provided by
Colorado Department of Public Health and Environment (https://cdphe.colorado.gov/)
    Area covered
    Description

These data represent the predicted (modeled) prevalence of diabetes among adults (age 18+) for each census tract in Colorado. Diabetes is defined as ever being diagnosed with diabetes by a doctor, nurse, or other health professional; this definition does not include gestational, borderline, or pre-diabetes. The estimate for each census tract represents an average derived from multiple years of Colorado Behavioral Risk Factor Surveillance System data (2014-2017).

CDPHE used a model-based approach to measure the relationship between age, race, gender, poverty, education, location, and health conditions or risk behavior indicators, and applied this relationship to predict the number of persons who have the health condition or risk behavior for each census tract in Colorado. We then applied these probabilities, based on demographic stratification, to the 2013-2017 American Community Survey population estimates and determined the percentage of adults with the health condition or risk behavior for each census tract in Colorado. The estimates are based on statistical models and are not direct survey estimates. Using the best available data, CDPHE was able to model census tract estimates based on demographic data and background knowledge about the distribution of specific health conditions and risk behaviors.

The estimates are displayed in both the map and the data table as point estimate values for each census tract, grouped into quintile ranges. The high and low value for each color on the map is calculated by dividing the total number of census tracts in Colorado (1,249) into five groups based on the total range of estimates for all Colorado census tracts, so each quintile range represents roughly 20% of the census tracts in Colorado.

No estimates are provided for the 7 census tracts with a known population of less than 50 (displayed in the map as "No Est, Pop < 50") or for the 2 census tracts whose population consists exclusively of a federal correctional institution. Together, these 9 census tracts are displayed in the map as "No Estimate."
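Under one plausible reading of this binning (equal-count quintiles over the tract-level point estimates), the grouping could be reproduced roughly as follows; the file name and variable names are hypothetical:

    % Hedged sketch: assign each of the 1,249 tract estimates to a quintile (1 = lowest).
    est = readmatrix('diabetes_tract_estimates.csv');   % hypothetical input file
    n = numel(est);
    [~, order] = sort(est);            % rank tracts by their point estimate
    q = zeros(n, 1);
    q(order) = ceil((1:n)' * 5 / n);   % quintile group 1..5, roughly 20% of tracts each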
