Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
K-Means Cluster Analysis Syntax RStudio
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
The datasets are generated using either Gaussian or Uniform distributions. Each dataset contains several known sub-groups intended for testing centroid-based clustering results and cluster validity indices.
Cluster analysis is a popular machine learning technique for segmenting a dataset so that similar data points fall into the same group. For those who are familiar with R, there is a new R package called "UniversalCVI" (https://CRAN.R-project.org/package=UniversalCVI) for cluster evaluation. This package provides algorithms for checking the accuracy of a clustering result against known classes, computing cluster validity indices, and generating plots for comparing them. The package is compatible with K-means, fuzzy C-means, EM clustering, and hierarchical clustering (single, average, and complete linkage). To use the "UniversalCVI" package, one can follow the instructions provided in the R documentation.
For more in-depth details of the package and cluster evaluation, please see the papers https://doi.org/10.1016/j.patcog.2023.109910 and https://arxiv.org/abs/2308.14785.
GitHub repository: https://github.com/O-PREEDASAWAKUL/FuzzyDatasets.git
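A minimal sketch of how a dataset with known sub-groups, like those described above, can be simulated and clustered in base R before evaluation (the accuracy and validity-index functions of UniversalCVI itself should be called as described in its documentation; this snippet only prepares a clustering result to evaluate):

set.seed(1)
# Simulate three Gaussian sub-groups with known class labels (illustrative only)
x <- rbind(
  matrix(rnorm(100, mean = 0,  sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 2,  sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = -2, sd = 0.3), ncol = 2)
)
true_class <- rep(1:3, each = 50)

# Centroid-based clustering result to be checked against the known classes
km <- kmeans(x, centers = 3, nstart = 25)
table(true_class, km$cluster)   # cross-tabulation of known vs. assigned clusters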
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
We share the complete aerosol optical depth dataset with high spatial (1x1km^2) and temporal (daily) resolution and the Beijing 1954 projection (https://epsg.io/2412) for mainland China (2015-2018). The original aerosol optical depth images are from the Multi-Angle Implementation of Atmospheric Correction Aerosol Optical Depth (MAIAC AOD) product (https://lpdaac.usgs.gov/products/mcd19a2v006/) with a similar spatiotemporal resolution and the sinusoidal projection (https://en.wikipedia.org/wiki/Sinusoidal_projection). After projection conversion, eighteen tiles of MAIAC AOD were merged to obtain a large AOD image covering the entire area of mainland China. Due to cloud cover and high surface reflectance, each original MAIAC AOD image usually has many missing values, and the average missing percentage of each AOD image may exceed 60%. Such a high percentage of missing values severely limits the applicability of the original MAIAC AOD product. We used the method of full residual deep networks (Li et al., 2020, https://ieeexplore.ieee.org/document/9186306) to impute the daily missing MAIAC AOD, thus obtaining a complete (no missing values) high-resolution AOD data product covering mainland China. The covariates used in imputation included coordinates, elevation, MERRA2 coarse-resolution PBLH and AOD variables, cloud fraction, high-resolution meteorological variables (air pressure, air temperature, relative humidity and wind speed) and/or a time index. Ground monitoring data were used to generate the high-resolution meteorological variables to ensure the reliability of the interpolation. Overall, our daily imputation models achieved an average training R^2 of 0.90 with a range of 0.75 to 0.97 (average RMSE: 0.075, with a range of 0.026 to 0.32) and an average test R^2 of 0.90 with a range of 0.75 to 0.97 (average RMSE: 0.075, with a range of 0.026 to 0.32). With almost no difference between training and test metrics, the high test R^2 and low test RMSE show the reliability of the AOD imputation. In an evaluation using ground AOD data from the monitoring stations of the Aerosol Robotic Network (AERONET) in mainland China, our method obtained an R^2 of 0.78 and an RMSE of 0.27, which further illustrates the reliability of the method. This database contains four datasets:
- Daily complete high-resolution AOD image dataset for mainland China from January 1, 2015 to December 31, 2018. The archived resources contain 1461 images stored in 1461 files, plus 3 summary Excel files.
- The table “CHN_AOD_INFO.xlsx” describes the properties of the 1461 images, including projection, training R^2 and RMSE, test R^2 and RMSE, and the minimum, mean, median and maximum predicted AOD.
- The table “Model_and_Accuracy_of_Meteorological_Elements.xlsx” describes the performance metrics of the interpolation of the high-resolution meteorological dataset.
- The table “Evaluation_Using_AERONET_AOD.xlsx” shows the AERONET evaluation results, including R^2, RMSE, and the monitoring information used in this study.
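As a small illustration of the evaluation metrics reported above (this is not the authors' imputation code), the R^2 and RMSE between observed and imputed AOD values can be computed in R as follows, using short hypothetical vectors:

# obs and pred are hypothetical observed and imputed AOD values
obs  <- c(0.21, 0.35, 0.18, 0.40, 0.27)
pred <- c(0.23, 0.33, 0.20, 0.37, 0.29)

rsq  <- 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)   # coefficient of determination
rmse <- sqrt(mean((obs - pred)^2))                           # root mean squared error
c(R2 = rsq, RMSE = rmse)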
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
The objective behind attempting this dataset was to understand the predictors that contribute to life expectancy around the world. I have used Linear Regression, Decision Tree and Random Forest for this purpose. Steps involved:
1) Read the csv file.
2) Data cleaning:
- The variables Country and Status were showing as character data types and had to be converted to factors.
- 2563 missing values were encountered, with the Population variable having the most missing values, i.e., 652.
- Rows with missing values were dropped before running the analysis.
3) Run linear regression:
- Before running the linear regression, 3 variables were dropped as they were not found to have much of an effect on the dependent variable, i.e., Life Expectancy. These 3 variables were Country, Year and Status. This meant we were now working with 19 variables (1 dependent and 18 independent variables).
- We run the linear regression. Multiple R-squared is 83%, which means the independent variables can explain 83% of the variance in the dependent variable.
- OUTLIER DETECTION. We check for outliers using the IQR and find 54 outliers. These outliers are removed before we run the regression analysis once again. Multiple R-squared increases from 83% to 86%.
- MULTICOLLINEARITY. We check for multicollinearity using VIF (Variance Inflation Factor). This is done because two or more independent variables may show high correlation. The rule of thumb is that variables with absolute VIF values above 5 should be removed. We find 6 variables with a VIF value higher than 5, namely Infant.deaths, percentage.expenditure, Under.five.deaths, GDP, thinness1.19 and thinness5.9. Infant deaths and under-five deaths have strong collinearity, so we drop infant deaths (which has the higher VIF value).
- When we run the linear regression model again, the VIF value of Under.five.deaths drops from 211.46 to 2.74, while the other variables' VIF values decrease only slightly. Variable thinness1.19 is then dropped and we run the regression once more.
- Variable thinness5.9, whose absolute VIF value was 7.61, has now dropped to 1.95. GDP and Population still have VIF values above 5, but I decided against dropping these as I consider them to be important independent variables.
- SET THE SEED AND SPLIT THE DATA INTO TRAIN AND TEST DATA. We fit the model on the train data and get a multiple R-squared of 86% and a p-value less than alpha, which indicates that the model is statistically significant. We use the trained model to predict on the test data to obtain the RMSE and MAPE, loading library(Metrics) for this purpose.
- In linear regression, RMSE (Root Mean Squared Error) is 3.2. This indicates that, on average, the predicted values have an error of 3.2 years compared to the actual life expectancy values.
- MAPE (Mean Absolute Percentage Error) is 0.037, which corresponds to a prediction accuracy of about 96% (1 - MAPE).
- MAE (Mean Absolute Error) is 2.55. This indicates that, on average, the predicted values deviate by approximately 2.55 years from the actual values.
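A condensed, hypothetical R sketch of the workflow described above (the file name and column names are placeholders for the actual dataset; car supplies vif() and Metrics supplies the error metrics):

library(car)      # vif()
library(Metrics)  # rmse(), mape(), mae()

df <- read.csv("Life_Expectancy_Data.csv")                     # placeholder file name
df <- na.omit(subset(df, select = -c(Country, Year, Status)))  # drop unused variables and missing rows

fit <- lm(Life.expectancy ~ ., data = df)
vif(fit)                                                       # flag variables with VIF > 5

set.seed(123)
idx   <- sample(nrow(df), floor(0.7 * nrow(df)))
train <- df[idx, ]
test  <- df[-idx, ]

fit_tr <- lm(Life.expectancy ~ ., data = train)
pred   <- predict(fit_tr, newdata = test)

c(RMSE = rmse(test$Life.expectancy, pred),
  MAPE = mape(test$Life.expectancy, pred),
  MAE  = mae(test$Life.expectancy, pred))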
Conclusion: Random Forest is the best model for predicting the life expectancy values as it has the lowest RMSE, MAPE and MAE.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview
This dataset is the repository for the following paper submitted to Data in Brief:
Kempf, M. A dataset to model Levantine landcover and land-use change connected to climate change, the Arab Spring and COVID-19. Data in Brief (submitted: December 2023).
The Data in Brief article contains the supplement information and is the related data paper to:
Kempf, M. Climate change, the Arab Spring, and COVID-19 - Impacts on landcover transformations in the Levant. Journal of Arid Environments (revision submitted: December 2023).
Description/abstract
The Levant region is highly vulnerable to climate change, experiencing prolonged heat waves that have led to societal crises and population displacement. Since 2010, the area has been marked by socio-political turmoil, including the Syrian civil war and currently the escalation of the so-called Israeli-Palestinian Conflict, which has strained neighbouring countries like Jordan due to the influx of Syrian refugees and has increased the population's vulnerability to governmental decision-making. Jordan, in particular, has seen rapid population growth and significant changes in land-use and infrastructure, leading to over-exploitation of the landscape through irrigation and construction. This dataset uses climate data, satellite imagery, and land cover information to illustrate the substantial increase in construction activity and highlights the intricate relationship between climate change predictions and current socio-political developments in the Levant.
Folder structure
The main folder after download contains all data; the following subfolders are stored as zipped files:
“code” stores the above described 9 code chunks to read, extract, process, analyse, and visualize the data.
“MODIS_merged” contains the 16-day, 250 m resolution NDVI imagery merged from three tiles (h20v05, h21v05, h21v06) and cropped to the study area, n=510, covering January 2001 to December 2022 and including January and February 2023.
“mask” contains a single shapefile, which is the merged product of administrative boundaries, including Jordan, Lebanon, Israel, Syria, and Palestine (“MERGED_LEVANT.shp”).
“yield_productivity” contains .csv files of yield information for all countries listed above.
“population” contains two files with the same name but different format. The .csv file is for processing and plotting in R. The .ods file is for enhanced visualization of population dynamics in the Levant (Socio_cultural_political_development_database_FAO2023.ods).
“GLDAS” stores the raw data of the NASA Global Land Data Assimilation System datasets that can be read, extracted (variable name), and processed using code “8_GLDAS_read_extract_trend” from the respective folder. One folder contains data from 1975-2022 and a second the additional January and February 2023 data.
“built_up” contains the landcover and built-up change data from 1975 to 2022. This folder is subdivided into two subfolders, which contain the raw data and the already processed data. “raw_data” contains the unprocessed datasets and “derived_data” stores the cropped built_up datasets at 5 year intervals, e.g., “Levant_built_up_1975.tif”.
Code structure
1_MODIS_NDVI_hdf_file_extraction.R
This is the first code chunk and refers to the extraction of MODIS data from the .hdf file format. The following packages must be installed, and the raw data must be downloaded using a simple mass downloader, e.g., from Google Chrome. Packages: terra. Download the MODIS data, after registration, from https://lpdaac.usgs.gov/products/mod13q1v061/ or https://search.earthdata.nasa.gov/search (MODIS/Terra Vegetation Indices 16-Day L3 Global 250m SIN Grid V061, last accessed 09th of October 2023). The code reads a list of files, extracts the NDVI, and saves each file to a single .tif file with the indication “NDVI”. Because the study area is quite large, we have to load three spatially different time series (tiles) and merge them later. Note that the time series are temporally consistent.
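A minimal sketch of this extraction step, assuming the .hdf files sit in a single folder and that the NDVI subdataset can be selected by its name (the folder name and the name pattern are assumptions; adjust them to the actual MOD13Q1 layer naming):

library(terra)

files <- list.files("your_directory_MODIS", pattern = "\\.hdf$", full.names = TRUE)

for (f in files) {
  s    <- sds(f)                                   # subdatasets of the MODIS .hdf file
  ndvi <- s[[grep("NDVI", names(s))[1]]]           # keep only the NDVI subdataset
  writeRaster(ndvi, sub("\\.hdf$", "_NDVI.tif", basename(f)), overwrite = TRUE)
}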
2_MERGE_MODIS_tiles.R
In this code, we load and merge the three different stacks to produce a large and consistent time series of NDVI imagery across the study area. We further use the package gtools to load the files in their natural order (1, 2, 3, 4, 5, 6, etc.). Here, we have three stacks, from which we first merge two (stack 1, stack 2) and store the result. We then merge this stack with stack 3. We produce single files named NDVI_final_*consecutivenumber*.tif. Before saving the final output of single merged files, create a folder called “merged” and set the working directory to this folder, e.g., setwd("your directory_MODIS/merged").
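A sketch of the merge step, assuming per-tile NDVI .tif files whose names carry a consecutive number so that gtools::mixedsort() restores the intended order (tile folder names are placeholders):

library(terra)
library(gtools)

t1 <- mixedsort(list.files("tile_h20v05", pattern = "NDVI.*\\.tif$", full.names = TRUE))
t2 <- mixedsort(list.files("tile_h21v05", pattern = "NDVI.*\\.tif$", full.names = TRUE))
t3 <- mixedsort(list.files("tile_h21v06", pattern = "NDVI.*\\.tif$", full.names = TRUE))

setwd("your directory_MODIS/merged")    # create the "merged" folder first, as noted above

for (i in seq_along(t1)) {
  m12 <- merge(rast(t1[i]), rast(t2[i]))            # merge stack 1 and stack 2
  m   <- merge(m12, rast(t3[i]))                    # then merge with stack 3
  writeRaster(m, paste0("NDVI_final_", i, ".tif"), overwrite = TRUE)
}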
3_CROP_MODIS_merged_tiles.R
Now we want to crop the derived MODIS tiles to our study area. We are using a mask, which is provided as a .shp file in the repository, named "MERGED_LEVANT.shp". We load the merged .tif files and crop the stack with the vector. Saving to individual files, we name them “NDVI_merged_clip_*consecutivenumber*.tif”. We have now produced single cropped NDVI time series data from MODIS.
The repository provides the already clipped and merged NDVI datasets.
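A sketch of the cropping step, assuming the merged files from the previous chunk and the mask shapefile from the repository:

library(terra)

mask_vec <- vect("MERGED_LEVANT.shp")
merged   <- gtools::mixedsort(list.files("merged", pattern = "^NDVI_final_.*\\.tif$", full.names = TRUE))

for (i in seq_along(merged)) {
  clipped <- mask(crop(rast(merged[i]), mask_vec), mask_vec)   # crop to extent, then mask to the outline
  writeRaster(clipped, paste0("NDVI_merged_clip_", i, ".tif"), overwrite = TRUE)
}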
4_TREND_analysis_NDVI.R
Now, we want to perform a trend analysis on the derived data. The data we load are tricky, as they contain 16-day return periods across each year for a period of 22 years. Growing-season sums cover MAM (March-May), JJA (June-August), and SON (September-November). December is represented as a single file, which means that the period DJF (December-February) is represented by 5 images instead of 6. For the last DJF period (December 2022), the data from January and February 2023 can be added. The code selects the respective images from the stack, depending on which period is under consideration. From these stacks, individual annually resolved growing-season sums are generated and the slope is calculated. We can then extract the p-values of the trend and flag all values at a high confidence level (0.05). Using the ggplot2 package and the melt function from the reshape2 package, we can create a plot of the reclassified NDVI trends together with a local smoother (LOESS) with a span of 0.3.
To increase comparability and understand the amplitude of the trends, z-scores were calculated and plotted, which show the deviation of the values from the mean. This has been done for the NDVI values as well as the GLDAS climate variables as a normalization technique.
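A simplified sketch of the seasonal aggregation, z-score normalization, and LOESS plot (the date sequence is a placeholder, and the spatial mean is used instead of the full per-pixel trend and p-value extraction performed in the repository code):

library(terra)
library(ggplot2)

ndvi  <- rast(gtools::mixedsort(list.files("merged", pattern = "^NDVI_merged_clip_.*\\.tif$",
                                           full.names = TRUE)))
dates <- seq(as.Date("2001-01-01"), by = "16 days", length.out = nlyr(ndvi))  # placeholder time stamps

# Spatially averaged NDVI per image, summed over one growing season (here MAM) per year
v   <- global(ndvi, "mean", na.rm = TRUE)$mean
sel <- format(dates, "%m") %in% c("03", "04", "05")
mam <- tapply(v[sel], format(dates[sel], "%Y"), sum)

df <- data.frame(year = as.numeric(names(mam)),
                 z    = (mam - mean(mam)) / sd(mam))            # z-scores of the annual sums

ggplot(df, aes(year, z)) +
  geom_line() +
  geom_smooth(method = "loess", span = 0.3)                     # local smoother as in the paper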
5_BUILT_UP_change_raster.R
Let us look at the landcover changes now. We are working with the terra package and get raster data from here: https://ghsl.jrc.ec.europa.eu/download.php?ds=bu (last accessed 03 March 2023, 100 m resolution, global coverage). Here, one can download the temporal coverage that is aimed for and reclassify it using the code after cropping to the individual study area. Here, I summed up the different rasters to characterize the built-up change in continuous values between 1975 and 2022.
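A sketch of this summing step, assuming the raw GHSL built-up epochs have been downloaded into the “raw_data” folder (the file-name pattern is an assumption):

library(terra)

built    <- rast(list.files("raw_data", pattern = "GHS_BUILT.*\\.tif$", full.names = TRUE))
mask_vec <- vect("MERGED_LEVANT.shp")

built_levant <- mask(crop(built, mask_vec), mask_vec)
built_change <- sum(built_levant)          # summed epochs as a continuous built-up change surface
writeRaster(built_change, "Levant_built_up_change_1975_2022.tif", overwrite = TRUE)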
6_POPULATION_numbers_plot.R
For this plot, one needs to load the .csv-file “Socio_cultural_political_development_database_FAO2023.csv” from the repository. The ggplot script provided produces the desired plot with all countries under consideration.
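A minimal sketch of such a plot (the column names Year, Population, and Country are assumptions; adjust them to the header of the repository file):

library(ggplot2)

pop <- read.csv("Socio_cultural_political_development_database_FAO2023.csv")

ggplot(pop, aes(x = Year, y = Population, colour = Country)) +
  geom_line()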
7_YIELD_plot.R
In this section, we are using the country productivity data from the supplement in the repository “yield_productivity” (e.g., "Jordan_yield.csv"). Each of the single-country yield datasets is plotted in a ggplot and the plots are combined using the patchwork package in R.
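A sketch of the combination step (the second file name and the column names Year and Yield are assumptions):

library(ggplot2)
library(patchwork)

p1 <- ggplot(read.csv("Jordan_yield.csv"),  aes(Year, Yield)) + geom_line() + ggtitle("Jordan")
p2 <- ggplot(read.csv("Lebanon_yield.csv"), aes(Year, Yield)) + geom_line() + ggtitle("Lebanon")

p1 + p2    # patchwork combines the single-country plots into one figure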
8_GLDAS_read_extract_trend
The last code chunk provides the basis for the trend analysis of the climate variables used in the paper. The raw data can be accessed at https://disc.gsfc.nasa.gov/datasets?keywords=GLDAS%20Noah%20Land%20Surface%20Model%20L4%20monthly&page=1 (last accessed 9th of October 2023). The raw data come in .nc file format and the various variables can be extracted using the [“^a variable name”] command on the SpatRaster collection. Each time you run the code, this variable name must be adjusted to the variable of interest (see this link for the abbreviations: https://disc.gsfc.nasa.gov/datasets/GLDAS_CLSM025_D_2.0/summary, last accessed 09th of October 2023, or the respective code chunk when reading a .nc file with the ncdf4 package in R); alternatively, run print(nc) from the code or use names() on the SpatRaster collection.
Choosing one variable, the code uses the MERGED_LEVANT.shp mask from the repository to crop and mask the data to the outline of the study area.
From the processed data, trend analyses are conducted and z-scores are calculated following the code described above. However, annual trends require the frequency of the time series analysis to be set to value = 12. Regarding, e.g., rainfall, which is measured as annual sums and not means, the chunk r.sum=r.sum/12 has to be removed or set to r.sum=r.sum/1 to avoid calculating annual mean values (see other variables). Seasonal subsets can be calculated as described in the code. Here, 3-month subsets were chosen for the growing seasons, e.g., March-May (MAM), June-August (JJA), September-November (SON), and DJF (December-February, including Jan/Feb of the consecutive year).
From the data, mean values of 48 consecutive years are calculated and trend analyses are performed as described above. In the same way, p-values are extracted and values at the 95 % confidence level are marked with dots on the raster plot. This analysis can be performed with a much longer time series, other variables, and different spatial extents across the globe due to the availability of the GLDAS variables.
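A heavily simplified sketch of the GLDAS steps (the variable name pattern, the year index, and reading all .nc files in one call are assumptions; the repository code should be followed for the full per-pixel trend and p-value extraction):

library(terra)

files <- list.files("GLDAS", pattern = "\\.nc4?$", full.names = TRUE)
r     <- rast(files)                        # all variables and time steps as layers
print(names(r))                             # inspect the available variable abbreviations

rainf <- r[[grep("^Rainf", names(r))]]      # assumed rainfall variable name; adjust as needed

mask_vec <- vect("MERGED_LEVANT.shp")
rainf    <- mask(crop(rainf, mask_vec), mask_vec)

# Spatial mean per monthly layer, aggregated to annual sums (rainfall: do not divide by 12),
# followed by a simple linear trend; the year index assumes 12 layers per year from 1975 to 2022
v   <- global(rainf, "mean", na.rm = TRUE)$mean
yrs <- rep(1975:2022, each = 12)[seq_along(v)]
ann <- tapply(v, yrs, sum)
summary(lm(ann ~ as.numeric(names(ann))))   # slope and p-value of the annual trend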
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
A common descriptive statistic in cluster analysis is the $R^2$ that measures the overall proportion of variance explained by the cluster means. This note highlights properties of the $R^2$ for clustering. In particular, we show that generally the $R^2$ can be artificially inflated by linearly transforming the data by "stretching" and by projecting. Also, the $R^2$ for clustering will often be a poor measure of clustering quality in high-dimensional settings. We also investigate the $R^2$ for clustering for misspecified models. Several simulation illustrations are provided highlighting weaknesses in the clustering $R^2$, especially in high-dimensional settings. A functional data example is given showing how the $R^2$ for clustering can vary dramatically depending on how the curves are estimated.
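For concreteness, the quantity discussed here is the between-cluster sum of squares divided by the total sum of squares, which for a K-means fit in R can be read off directly (a minimal illustration; the note explains why this value can nevertheless be misleading):

set.seed(1)
km <- kmeans(scale(iris[, 1:4]), centers = 3, nstart = 25)
km$betweenss / km$totss    # proportion of variance explained by the cluster means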
This data release provides comprehensive results of monotonic trend assessment for long-term U.S. Geological Survey (USGS) streamgages in or proximal to the watersheds of Mobile and Perdido Bays, south-central United States (Tatum and others, 2024). Long-term is defined as streamgages having at least five complete decades of daily streamflow data since January 1, 1950, exclusive to those streamgages also having the entire 2010s decade represented. Input data for the trend assessment are daily streamflow data retrieved on March 8, 2024 (U.S. Geological Survey, 2024) and formatted using the fill_dvenv() function in akqdecay (Crowley-Ornelas and others, 2024). Monotonic trends were assessed for each of 69 streamgages using 26 Mann-Kendall hypothesis tests covering 20 hydrologic metrics understood as particularly useful in ecological studies (Henriksen and others, 2006) plus another 6 metrics measuring well-known streamflow properties, such as annual harmonic mean streamflow (Asquith and Heitmuller, 2008) and annual mean streamflow with decadal flow-duration curve quantiles (10th, 50th, and 90th percentiles) (Crowley-Ornelas and others, 2023). Helsel and others (2020) provide background and description of the Mann-Kendall hypothesis test. Some of the trend analyses are based on the annual values of a hydrologic metric (calendar year is the time interval for the test) whereas others are decadal (decade is the time interval for the test). The principal result output for this data release (monotrnd_1hyp.txt) clearly distinguishes the time interval for the respective tests. This data release includes the computational workflow to conduct the hypothesis testing and the requisite data manipulations to do so. The workflow is composed of the core computation script monotrnd_script.R and an auxiliary script containing functions for 20 ecological flow metrics. This means that the script monotrnd_script.R requires additional functions to be loaded into the R workspace and sources the file monotrnd_ecomets_include.R. This design isolates the 20 ecologically oriented hydrologic metrics (subroutines), whose logic and nomenclature are informed by Henriksen and others (2006), from the streamgage-looping workflow and other data manipulation features in monotrnd_script.R. The script monotrnd_script.R is designed to use time series of daily mean streamflow stored in an R environment data object using the streamgage identification number as the key and a data frame (table) of the daily streamflows in the format defined by the dvget() function and filled by the filldv_env() function of the akqdecay R package (see the supplemental information section; Crowley-Ornelas and others, 2024). Additionally, monotrnd_script.R tags a specific subset of streamgages within the workflow, identified by the authors as "major nodes," with a binary indicator (1 or 0) to support targeted analyses on these selected locations. The data in file monotrnd_1hyp.txt are comma-delimited results of Kendall tau or other test statistics and p-values of the Mann-Kendall hypothesis tests as part of the monotonic trend assessment for 69 USGS streamgages using 26 Mann-Kendall hypothesis tests on a variety of streamflow metrics.
The data include USGS streamgage identification numbers with a prepended "S" character, decimal latitudes and longitudes for the streamgage locations, the range of calendar years and decades of streamflow processed along with integer counts of the number of calendar years and decades, and the Kendall tau (or other test statistic) and associated p-value of the test statistic for the 26 streamflow metrics considered. Broadly, the "left side of the table" presents the results for the tests on metrics using calendar-year time steps, and the "right side of the table" presents the results for the tests on metrics using decade time steps. The content of the file does not assign or draw conclusions on statistical significance because the p-values are provided. The file monotrnd_dictionary_1hyp.txt is a simple plain-text, pipe-delimited file of directly human-readable short definitions for the columns in monotrnd_1hyp.txt. (This dictionary and two others accompany this data release to facilitate potential reuse of information by some users.) The source of monotrnd_1hyp.txt stems from ending computational steps in script monotrnd_script.R. Short summaries synthesizing information in file monotrnd_1hyp.txt are available in files monotrnd_3cnt.txt and monotrnd_2stn.txt, which also accompany this data release. The data in file monotrnd_2stn.txt are comma-delimited summaries by streamgage identification number of the monotonic trend assessments for the 26 Mann-Kendall hypothesis tests on streamflow metrics as described elsewhere in this data release. The summary data herein are composed of records (rows) by streamgage that include columns of (1) streamgage identification numbers with a prepended "S" character, (2) decimal latitudes and longitudes for the streamgage locations, (3) the integer count of the number of hypothesis tests, (4) the integer count of the number of tests for which the computed hypothesis test p-values are less than the 0.05 level of statistical significance (so-called alpha = 0.05), and (5) colon-delimited strings of alphanumeric characters identifying each of the statistically significant tests for the respective streamgage. The file monotrnd_dictionary_2stn.txt is a simple plain-text, pipe-delimited file of directly human-readable short definitions for the columns in monotrnd_2stn.txt. The source of monotrnd_2stn.txt stems from ending computational steps in script monotrnd_script.R described elsewhere in this data release from its production of monotrnd_1hyp.txt; this latter data file provides the values used to assemble monotrnd_2stn.txt. The information in file monotrnd_3cnt.txt comprises comma-delimited summaries of arithmetic means of Kendall tau (or other test statistics) as well as integer counts of statistically significant trends as part of the monotonic trend assessment using 26 Mann-Kendall hypothesis tests on a variety of streamflow metrics for 69 USGS streamgages as described elsewhere in this data release. The two-column summary data herein are composed of a first row indicating, as a character string, the integer number of streamgages (69), and then subsequent rows in pairs of a three-decimal character-string representation of the mean Kendall tau (or the test statistic of a seasonal Mann-Kendall test) followed by a character string of the integer count of statistically significant tests for the respective test as it was applied to the 69 streamgages. Statistical significance is defined as p-values less than the 0.05 level (so-called alpha = 0.05).
The file monotrnd_dictionary_3cnt.txt is a simple plain-text, pipe-delimited file of directly human-readable short definitions for the columns in monotrnd_3cnt.txt. The source of monotrnd_3cnt.txt stems from ending computational steps in script monotrnd_script.R described elsewhere in this data release from its production of monotrnd_1hyp.txt; this latter data file provides the values used to assemble monotrnd_3cnt.txt.
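As a generic illustration of the kind of test reported in monotrnd_1hyp.txt (this is not the release's monotrnd_script.R, which also applies seasonal and other Mann-Kendall variants), Kendall's tau and its p-value for one annual metric series can be obtained with base R:

# Hypothetical annual values of one hydrologic metric at one streamgage
set.seed(42)
years  <- 1950:2023
metric <- 100 + 0.15 * (years - 1950) + rnorm(length(years), sd = 5)

ct <- cor.test(metric, years, method = "kendall")   # Mann-Kendall trend test against time
ct$estimate   # Kendall tau
ct$p.value    # p-value reported alongside tau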
The means and standard deviations of the 50 MAP estimates, based upon data with 400 infected households, are shown for each parameter in the form mean (standard deviation) for the BPA and DA-MCMC methods. The last row shows the difference in the mean and standard deviation between the two methods.
In 2024, the average weekly earnings of employees in the ventilation, heating, air-conditioning, and commercial refrigeration equipment manufacturing (HVAC-R) industry in the United States were lower than in 2022. However, those weekly earnings amounted to *** U.S. dollars in 2007, and they had risen to ***** U.S. dollars by 2024.
Attribution 2.5 (CC BY 2.5): https://creativecommons.org/licenses/by/2.5/
License information was derived automatically
The dataset was derived by the Bioregional Assessment Programme from multiple source datasets. The source datasets are identified in the Lineage field in this metadata statement. The processes undertaken to produce this derived dataset are described in the History field in this metadata statement.

There are 4 csv files here:

BAWAP_P_annual_BA_SYB_GLO.csv
Desc: Time series of mean annual BAWAP rainfall from 1900 - 2012.
Source data: annual BILO rainfall on \\wron\Project\BA\BA_N_Sydney\Working\li036_Lingtao_LI\Grids\BILO_Rain_Ann\

P_PET_monthly_BA_SYB_GLO.csv
Long-term average BAWAP rainfall and Penman PET from 198101 - 201212 for each month.

Climatology_Trend_BA_SYB_GLO.csv
Values calculated over the years 1981 - 2012 (inclusive), for 17 time periods (i.e., annual, 4 seasons and 12 months) for the following 8 meteorological variables: (i) BAWAP_P; (ii) Penman ETp; (iii) Tavg; (iv) Tmax; (v) Tmin; (vi) VPD; (vii) Rn; and (viii) Wind speed. For each of the 17 time periods and each of the 8 meteorological variables we have calculated the: (a) average; (b) maximum; (c) minimum; (d) average plus standard deviation (stddev); (e) average minus stddev; (f) stddev; and (g) trend.

Risbey_Remote_Rainfall_Drivers_Corr_Coeffs_BA_NSB_GLO.csv
Correlation coefficients (-1 to 1) between rainfall and 4 remote rainfall drivers between 1957-2006 for the four seasons. The data and methodology are described in Risbey et al. (2009). All data used in this analysis came directly from James Risbey, CMAR, Hobart. As described in the Risbey et al. (2009) paper, the rainfall was from the 0.05 degree gridded data described in Jeffrey et al. (2001 - known as the SILO datasets); sea surface temperature was from the Hadley Centre Sea Ice and Sea Surface Temperature dataset (HadISST) on a 1 degree grid. BLK=Blocking; DMI=Dipole Mode Index; SAM=Southern Annular Mode; SOI=Southern Oscillation Index; DJF=December, January, February; MAM=March, April, May; JJA=June, July, August; SON=September, October, November. The analysis is a summary of Fig. 15 of Risbey et al. (2009).

The dataset was created from various BILO source data, including monthly BILO rainfall, Tmax, Tmin, VPD, etc., and other source data including monthly Penman PET (calculated by Randall Donohue) and correlation coefficient data from James Risbey.

Bioregional Assessment Programme (XXXX) SYD ALL climate data statistics summary. Bioregional Assessment Derived Dataset. Viewed 13 March 2019, http://data.bioregionalassessments.gov.au/dataset/b0a6ccf1-395d-430e-adf1-5068f8371dea.

* Derived From BILO Gridded Climate Data: Daily Climate Data for each year from 1900 to 2012
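A small sketch of the per-period statistics listed for Climatology_Trend_BA_SYB_GLO.csv, computed for one hypothetical annual series of a single meteorological variable:

# x: hypothetical values of one variable for one time period over 1981-2012
yrs <- 1981:2012
set.seed(1)
x   <- rnorm(length(yrs), mean = 600, sd = 80)

c(average      = mean(x),
  maximum      = max(x),
  minimum      = min(x),
  avg_plus_sd  = mean(x) + sd(x),
  avg_minus_sd = mean(x) - sd(x),
  stddev       = sd(x),
  trend        = unname(coef(lm(x ~ yrs))[2]))   # linear slope per year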
This data release contains the input-data files and R scripts associated with the analysis presented in [citation of manuscript]. The spatial extent of the data is the contiguous U.S. The input-data files include one comma-separated value (csv) file of county-level data and one csv file of city-level data. The county-level csv (“county_data.csv”) contains data for 3,109 counties. These data include two measures of water use, descriptive information about each county, three grouping variables (climate region, urban class, and economic dependency), and 18 explanatory variables: proportion of population growth from 2000-2010, fraction of withdrawals from surface water, average daily water yield, mean annual maximum temperature from 1970-2010, 2005-2010 maximum temperature departure from the 40-year maximum, mean annual precipitation from 1970-2010, 2005-2010 mean precipitation departure from the 40-year mean, Gini income disparity index, percent of county population with at least some college education, Cook Partisan Voting Index, housing density, median household income, average number of people per household, median age of structures, percent of renters, percent of single-family homes, percent apartments, and a numeric version of urban class. The city-level csv (city_data.csv) contains data for 83 cities. These data include descriptive information for each city, water-use measures, one grouping variable (climate region), and 6 explanatory variables: type of water bill (increasing block rate, decreasing block rate, or uniform), average price of water bill, number of requirement-oriented water conservation policies, number of rebate-oriented water conservation policies, aridity index, and regional price parity. The R scripts construct fixed-effects and Bayesian hierarchical regression models. The primary difference between these models relates to how they handle possible clustering in the observations that define unique water-use settings. Fixed-effects models address possible clustering in one of two ways. In a "fully pooled" fixed-effects model, any clustering by group is ignored, and a single, fixed estimate of the coefficient for each covariate is developed using all of the observations. Conversely, in an unpooled fixed-effects model, separate coefficient estimates are developed only using the observations in each group. A hierarchical model provides a compromise between these two extremes. Hierarchical models extend single-level regression to data with a nested structure, whereby the model parameters vary at different levels in the model, including a lower level that describes the actual data and an upper level that influences the values taken by parameters in the lower level. The county-level models were compared using the Watanabe-Akaike information criterion (WAIC), which is derived from the log pointwise predictive density of the models and can be shown to approximate out-of-sample predictive performance. All script files are intended to be used with R statistical software (R Core Team (2017). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org) and Stan probabilistic modeling software (Stan Development Team. 2017. RStan: the R interface to Stan. R package version 2.16.2. http://mc-stan.org).
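A schematic sketch of the three pooling structures described above, using simulated data and placeholder variable names (the actual release fits the hierarchical models with Stan via RStan; lme4 is used here only to illustrate partial pooling):

library(lme4)

set.seed(1)
cnty <- data.frame(
  climate_region  = factor(rep(LETTERS[1:5], each = 40)),
  housing_density = runif(200, 50, 4000),
  median_income   = runif(200, 30, 90)
)
cnty$water_use <- 80 - 0.002 * cnty$housing_density + 0.3 * cnty$median_income + rnorm(200, sd = 5)

# Fully pooled fixed-effects model: clustering by group is ignored
m_pooled   <- lm(water_use ~ housing_density + median_income, data = cnty)

# Unpooled fixed-effects model: separate coefficients within each group
m_unpooled <- lm(water_use ~ climate_region / (housing_density + median_income), data = cnty)

# Hierarchical (partially pooled) model: group-level intercepts and slopes share an upper level
m_hier     <- lmer(water_use ~ housing_density + median_income +
                     (1 + housing_density | climate_region), data = cnty)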
This dataset contains data collected aboard the ship Ron Brown for EPIC. The data are made up of computations of bulk meteorological variables and fluxes derived from the ETL system, based on preliminary analysis done during Leg 1 of the EPIC2001 cruise. Most quantities given are subject to future modification based on accounting for other sources of data and revised calibrations. No direct turbulent flux calculations are included in the present data.
Descriptive statistics (Pearson’s r, means, and standard deviations).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset tracks annual average revenue per student from 1992 to 2023 for Lakeland R-III School District.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Nowadays, there is a growing tendency to use Python and R in the analytics world for physical/statistical modeling and data visualization. As scientists, analysts, or statisticians, we often choose the tool that allows us to perform the task in the quickest and most accurate way possible. For some, that means Python. For others, that means R. For many, that means a combination of the two. However, it may take considerable time to switch between these two languages, passing data and models through .csv files or database systems. There is a solution that allows researchers to quickly and easily interface R and Python together in a single Jupyter Notebook. Here we provide a Jupyter Notebook that serves as a tutorial showing how to interface R and Python on CUAHSI JupyterHub. The tutorial walks you through the installation of the rpy2 library and shows simple examples illustrating this interface.
These are simulated data without any identifying information or informative birth-level covariates. We also standardize the pollution exposures on each week by subtracting off the median exposure amount on a given week and dividing by the interquartile range (IQR) (as in the actual application to the true NC birth records data). The dataset that we provide includes weekly average pregnancy exposures that have already been standardized in this way, while the medians and IQRs are not given. This further protects identifiability of the spatial locations used in the analysis. This dataset is not publicly accessible because: EPA cannot release personally identifiable information regarding living individuals, according to the Privacy Act and the Freedom of Information Act (FOIA). This dataset contains information about human research subjects. Because there is potential to identify individual participants and disclose personal information, either alone or in combination with other datasets, individual level data are not appropriate to post for public access. Restricted access may be granted to authorized persons by contacting the party listed. It can be accessed through the following means: File format: R workspace file; “Simulated_Dataset.RData”.
Metadata (including data dictionary):
• y: Vector of binary responses (1: adverse outcome, 0: control)
• x: Matrix of covariates; one row for each simulated individual
• z: Matrix of standardized pollution exposures
• n: Number of simulated individuals
• m: Number of exposure time periods (e.g., weeks of pregnancy)
• p: Number of columns in the covariate design matrix
• alpha_true: Vector of “true” critical window locations/magnitudes (i.e., the ground truth that we want to estimate)
Code Abstract: We provide R statistical software code (“CWVS_LMC.txt”) to fit the linear model of coregionalization (LMC) version of the Critical Window Variable Selection (CWVS) method developed in the manuscript. We also provide R code (“Results_Summary.txt”) to summarize/plot the estimated critical windows and posterior marginal inclusion probabilities.
Description:
“CWVS_LMC.txt”: This code is delivered to the user in the form of a .txt file that contains R statistical software code. Once the “Simulated_Dataset.RData” workspace has been loaded into R, the text in the file can be used to identify/estimate critical windows of susceptibility and posterior marginal inclusion probabilities.
“Results_Summary.txt”: This code is also delivered to the user in the form of a .txt file that contains R statistical software code. Once the “CWVS_LMC.txt” code has been applied to the simulated dataset and the program has completed, this code can be used to summarize and plot the identified/estimated critical windows and posterior marginal inclusion probabilities (similar to the plots shown in the manuscript).
Required R packages:
• For running “CWVS_LMC.txt”: msm (sampling from the truncated normal distribution), mnormt (sampling from the multivariate normal distribution), BayesLogit (sampling from the Polya-Gamma distribution)
• For running “Results_Summary.txt”: plotrix (plotting the posterior means and credible intervals)
Instructions for Use / Reproducibility: What can be reproduced: The data and code can be used to identify/estimate critical windows from one of the actual simulated datasets generated under setting E4 from the presented simulation study.
How to use the information:
• Load the “Simulated_Dataset.RData” workspace.
• Run the code contained in “CWVS_LMC.txt”.
• Once the “CWVS_LMC.txt” code is complete, run “Results_Summary.txt”.
Format: Below is the replication procedure for the attached data set for the portion of the analyses using a simulated data set.
Data: The data used in the application section of the manuscript consist of geocoded birth records from the North Carolina State Center for Health Statistics, 2005-2008. In the simulation study section of the manuscript, we simulate synthetic data that closely match some of the key features of the birth certificate data while maintaining confidentiality of any actual pregnant women.
Availability: Due to the highly sensitive and identifying information contained in the birth certificate data (including latitude/longitude and address of residence at delivery), we are unable to make the data from the application section publicly available. However, we will make one of the simulated datasets available for any reader interested in applying the method to realistic simulated birth records data. This will also allow the user to become familiar with the required inputs of the model, how the data should be structured, and what type of output is obtained. While we cannot provide the application data here, access to the North Carolina birth records can be requested through the North Carolina State Center for Health Statistics and requires an appropriate data use agreement.
Description/Permissions: These are simulated data without any identifying information or informative birth-level covariates. We also standardize the pollution exposures on each week by subtracting off the median exposure amount on a given week and dividing by the interquartile range (IQR) (as in the actual application to the true NC birth records data). The dataset that we provide includes weekly average pregnancy exposures that have already been standardized in this way, while the medians and IQRs are not given. This further protects identifiability of the spatial locations used in the analysis. This dataset is associated with the following publication: Warren, J., W. Kong, T. Luben, and H. Chang. Critical Window Variable Selection: Estimating the Impact of Air Pollution on Very Preterm Birth. Biostatistics. Oxford University Press, OXFORD, UK, 1-30, (2019).
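A short sketch of the setup described above, assuming access to the restricted workspace file (the analysis code ships as .txt files whose contents can be pasted into R or run with source()):

# Install the required packages listed above
install.packages(c("msm", "mnormt", "BayesLogit", "plotrix"))

# Load the simulated workspace and inspect the documented objects
load("Simulated_Dataset.RData")
length(y); dim(x); dim(z)
c(n = n, m = m, p = p)
length(alpha_true)

source("CWVS_LMC.txt")        # fit the CWVS-LMC model
source("Results_Summary.txt") # summarize/plot critical windows and inclusion probabilities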
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset tracks annual average revenue per student from 1992 to 2023 for Farmington R-VII School District.
To assist communities in identifying racially/ethnically-concentrated areas of poverty (R/ECAPs), HUD has developed a census tract-based definition of R/ECAPs. The definition involves a racial/ethnic concentration threshold and a poverty test. The racial/ethnic concentration threshold is straightforward: R/ECAPs must have a non-white population of 50 percent or more. Regarding the poverty threshold, Wilson (1980) defines neighborhoods of extreme poverty as census tracts with 40 percent or more of individuals living at or below the poverty line. Because overall poverty levels are substantially lower in many parts of the country, HUD supplements this with an alternate criterion. Thus, a neighborhood can be an R/ECAP if it has a poverty rate that exceeds 40% or is three or more times the average tract poverty rate for the metropolitan/micropolitan area, whichever threshold is lower. Census tracts with this extreme poverty that satisfy the racial/ethnic concentration threshold are deemed R/ECAPs. This translates into the following condition: tract $i$ is an R/ECAP if $\frac{Pop_i - NHW_i}{Pop_i} \ge 0.5$ and $p_i \ge \min(0.4,\ 3\,\bar{p})$, where $i$ represents census tracts, $\bar{p}$ is the metropolitan/micropolitan (CBSA) mean tract poverty rate, $p_i$ is the $i$th tract poverty rate, $NHW_i$ is the non-Hispanic white population in tract $i$, and $Pop_i$ is the population in tract $i$. While this definition of R/ECAP works well for tracts in CBSAs, places outside of these geographies are unlikely to have racial or ethnic concentrations as high as 50 percent. In these areas, the racial/ethnic concentration threshold is set at 20 percent.
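A small R sketch of this condition applied to hypothetical tract-level inputs (all column names and values are placeholders):

tracts <- data.frame(
  poverty_rate      = c(0.45, 0.12, 0.30),
  nhwhite           = c(1200, 3000,  900),   # non-Hispanic white population
  pop               = c(4000, 3500, 4200),   # total tract population
  cbsa_mean_poverty = c(0.15, 0.15, 0.09)    # CBSA mean tract poverty rate
)

nonwhite_share <- (tracts$pop - tracts$nhwhite) / tracts$pop
pov_threshold  <- pmin(0.40, 3 * tracts$cbsa_mean_poverty)   # whichever threshold is lower
tracts$recap   <- as.integer(nonwhite_share >= 0.50 & tracts$poverty_rate >= pov_threshold)
tracts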
Data Source: American Community Survey (ACS), 2009-2013; Decennial Census (2010); Brown Longitudinal Tract Database (LTDB) based on decennial census data, 1990, 2000 & 2010.
Related AFFH-T Local Government, PHA Tables/Maps: Table 4, 7; Maps 1-17. Related AFFH-T State Tables/Maps: Table 4, 7; Maps 1-15, 18.
References: Wilson, William J. (1980). The Declining Significance of Race: Blacks and Changing American Institutions. Chicago: University of Chicago Press.
To learn more about R/ECAPs visit: https://www.hud.gov/program_offices/fair_housing_equal_opp/affh ; https://www.hud.gov/sites/dfiles/FHEO/documents/AFFH-T-Data-Documentation-AFFHT0006-July-2020.pdf. For questions about the spatial attribution of this dataset, please reach out to us at GISHelpdesk@hud.gov. Date of Coverage: 11/2017
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Four R scripts, used for creating:
1) Sample-wide weighted means of public concern for the human health impacts of 16 marine threats, including marine plastic pollution, and country-level breakdowns of these means (Figures 1 & 3). A repeated-measures ANOVA via a linear mixed effects model is also scripted, which tests the effect of marine threat on concern level, as well as post hoc analysis testing the differences between marine threats.
2) Sample-wide weighted means of public support for research into the human health implications of 15 marine research areas, including marine plastic pollution, and country-level breakdowns of these means (Figures 2 & 4). A repeated-measures ANOVA via a linear mixed effects model is also scripted, which tests the effect of marine research area on support level, as well as post hoc analysis testing the differences between marine research areas.
3) A data frame for multi-level regression analysis.
4) Multi-level modelling predicting public concern for, and desire for research into, marine plastic pollution, as well as mediation analysis formally testing the mediating effect of concern.
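A schematic sketch of the repeated-measures ANOVA and post hoc comparisons described in scripts 1) and 2), using simulated data and placeholder names (lmerTest and emmeans are one common way to fit and probe such a model; the actual scripts should be consulted for the exact specification and weighting):

library(lmerTest)   # lmer() with p-values for the repeated-measures ANOVA
library(emmeans)

set.seed(1)
d <- data.frame(
  respondent = factor(rep(1:200, each = 3)),
  threat     = factor(rep(c("plastic pollution", "chemical pollution", "oil spills"), times = 200)),
  concern    = sample(1:5, 600, replace = TRUE),
  w          = runif(600, 0.5, 1.5)          # survey weights
)

m <- lmer(concern ~ threat + (1 | respondent), data = d, weights = w)
anova(m)                          # effect of marine threat on concern level
emmeans(m, pairwise ~ threat)     # post hoc differences between marine threats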
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Pakistan Average Monthly Wages: Female: Finance, Insurance & R.Estate data was reported at 33,985.000 PKR in 2015. This records an increase from the previous number of 31,182.000 PKR for 2014. Pakistan Average Monthly Wages: Female: Finance, Insurance & R.Estate data is updated yearly, averaging 28,624.000 PKR from Jun 2008 (Median) to 2015, with 7 observations. The data reached an all-time high of 33,985.000 PKR in 2015 and a record low of 12,626.000 PKR in 2008. Pakistan Average Monthly Wages: Female: Finance, Insurance & R.Estate data remains active status in CEIC and is reported by Pakistan Bureau of Statistics. The data is categorized under Global Database’s Pakistan – Table PK.G004: Average Monthly Wages: By Industry.