100+ datasets found

f
Variable definition and statistics.
datasetcatalog.nlm.nih.gov
plos.figshare.com
Updated Nov 30, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
He, Weiming; Wang, Jiaxue; Guo, Linan (2023). Variable definition and statistics. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001021701
Explore at:
Dataset updated
Nov 30, 2023
Authors
He, Weiming; Wang, Jiaxue; Guo, Linan
Description
China is one of the countries hardest hit by disasters. Disaster shocks not only cause a large number of casualties and property damage but also have an impact on the risk preference of those who experience it. Current research has not reached a consensus conclusion on the impact of risk preferences. This paper empirically analyzes the effects of natural and man-made disasters on residents’ risk preference based on the data of the China Household Financial Survey (CHFS) in 2019. The results indicate that: (1) Both natural and man-made disasters can significantly lead to an increase in the risk aversion of residents, and man-made disasters have a greater impact. (2) Education background plays a negative moderating role in the impact of man-made disasters on residents’ risk preference. (3) Natural disaster experiences have a greater impact on the risk preference of rural residents, while man-made disaster experiences have a greater impact on the risk preference of urban residents. Natural disaster experiences make rural residents more risk-averse, while man-made disaster experiences make urban residents more risk-averse. The results provide new evidence and perspective on the negative impact of disaster shocks on the social life of residents.
f
Variable definition and descriptive statistics.
datasetcatalog.nlm.nih.gov
figshare.com
+1more
Updated Dec 30, 2022
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Yang, Mingyu; Wang, Kun; Zhang, Zixian; Liu, Tianlan; Zhao, Jinxu; Shi, Shaocheng; Yang, Tianyi; He, Li; Li, Tianyang; Wang, Jiangyin (2022). Variable definition and descriptive statistics. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000389889
Explore at:
Dataset updated
Dec 30, 2022
Authors
Yang, Mingyu; Wang, Kun; Zhang, Zixian; Liu, Tianlan; Zhao, Jinxu; Shi, Shaocheng; Yang, Tianyi; He, Li; Li, Tianyang; Wang, Jiangyin
Description
Variable definition and descriptive statistics.
f
Data from: Variable definition.
datasetcatalog.nlm.nih.gov
plos.figshare.com
Updated Mar 17, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Min, Liangyu; Huang, Xiaohong; Zhang, Xiaorong; Zhang, Jun; Zeng, Qianqian; Liu, Jiangwei (2023). Variable definition. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000998892
Explore at:
Dataset updated
Mar 17, 2023
Authors
Min, Liangyu; Huang, Xiaohong; Zhang, Xiaorong; Zhang, Jun; Zeng, Qianqian; Liu, Jiangwei
Description
The impact of a chief executive officer’s (CEO’s) functional experience on firm performance has gained the attention of many scholars. However, the measurement of functional experience is rarely disclosed in the public database. Few studies have been conducted on the comprehensive functional experience of CEOs. This paper used the upper echelons theory and obtained deep-level curricula vitae (CVs) data through the named entity recognition technique. First, we mined 15 consecutive years of CEOs’ CVs from 2006 to 2020 from Chinese listed companies. Second, we extracted information throughout their careers and automatically classified their functional hierarchy. Finally, we constructed breadth (functional breadth: functional experience richness) and depth (functional depth: average tenure and the hierarchy of function) for empirical analysis. We found that a CEO’s breadth is significantly negatively related to firm performance, and the quadratic term is significantly positive. A CEO’s depth is significantly positively related to firm performance, and the quadratic term is significantly negative. The research results indicate a u-shaped relationship between a CEO’s breadth and firm performance and an inverted u-shaped relationship between their depth and firm performance. The study’s findings extend the literature on factors influencing firm performance and CEOs’ functional experience. The study expands from the horizontal macro to the vertical micro level, providing new evidence to support the recruitment and selection of high-level corporate talent.
Definitions of independent variables used in the statistical analysis.
figshare.com
datasetcatalog.nlm.nih.gov
xls
Updated Jun 3, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Jonathan Silvertown; Laurence Cook; Robert Cameron; Mike Dodd; Kevin McConway; Jenny Worthington; Peter Skelton; Christian Anton; Oliver Bossdorf; Bruno Baur; Menno Schilthuizen; Benoît Fontaine; Helmut Sattmann; Giorgio Bertorelle; Maria Correia; Cristina Oliveira; Beata Pokryszko; Małgorzata Ożgo; Arturs Stalažs; Eoin Gill; Üllar Rammul; Péter Sólymos; Zoltan Féher; Xavier Juan (2023). Definitions of independent variables used in the statistical analysis. [Dataset]. http://doi.org/10.1371/journal.pone.0018927.t001
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0018927.t001
Dataset updated
Jun 3, 2023
Dataset provided by
PLOShttp://plos.org/
Authors
Jonathan Silvertown; Laurence Cook; Robert Cameron; Mike Dodd; Kevin McConway; Jenny Worthington; Peter Skelton; Christian Anton; Oliver Bossdorf; Bruno Baur; Menno Schilthuizen; Benoît Fontaine; Helmut Sattmann; Giorgio Bertorelle; Maria Correia; Cristina Oliveira; Beata Pokryszko; Małgorzata Ożgo; Arturs Stalažs; Eoin Gill; Üllar Rammul; Péter Sólymos; Zoltan Féher; Xavier Juan
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Definitions of independent variables used in the statistical analysis.
Variable definitions, sources and summary statistics.
figshare.com
datasetcatalog.nlm.nih.gov
xls
Updated Jun 2, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Antonios Garas; Sophie Guthmuller; Athanasios Lapatinas (2023). Variable definitions, sources and summary statistics. [Dataset]. http://doi.org/10.1371/journal.pone.0244843.t002
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0244843.t002
Dataset updated
Jun 2, 2023
Dataset provided by
PLOShttp://plos.org/
Authors
Antonios Garas; Sophie Guthmuller; Athanasios Lapatinas
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Variable definitions, sources and summary statistics.
Variable definitions.
plos.figshare.com
xls
Updated May 29, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Kübranur Çebi Karaaslan; Hülya Diğer; Tubanur Çebi (2025). Variable definitions. [Dataset]. http://doi.org/10.1371/journal.pone.0324125.t002
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0324125.t002
Dataset updated
May 29, 2025
Dataset provided by
PLOShttp://plos.org/
Authors
Kübranur Çebi Karaaslan; Hülya Diğer; Tubanur Çebi
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The aim of this study is to evaluate the level of satisfaction with health examination services in Türkiye. It is thought that the findings will contribute to the more effective management of the health service process and offer potential solutions to identified problems. Notably, a significant portion of the problems encountered in healthcare services tends to arise during the examination phase. Therefore, this research was conducted to address these problems by thoroughly analyzing public satisfaction, with the expectation that such an approach could provide actionable insights for resolving these problems. In the study, the micro data set of the 2023 Life Satisfaction Survey conducted by the Turkish Statistical Institute was used. The analysis process was carried out with a two-stage method. In the first stage, Pearson’s χ² test was used to evaluate whether the independent variables had a statistically significant relationship with satisfaction with health examination services. In the second stage, a considering the binary categorical structure of the dependent variable, a logit regression model was applied to estimate the relationship between satisfaction with health examination services and the independent variables. The findings revealed that 61.85% of Turkish citizens were satisfied with health examination services. Furthermore, this level of satisfaction was significantly affected by a wide range of sociodemographic, individual, and institution-related factors. The study’s findings suggest that aligning individuals’ demands in the health service process with guidance from field experts and developing targeted policies could lead to improved satisfaction with health examination services. In addition, it is foreseen that the concept of trust is important in the satisfaction that constitutes the main subject of the study in health services and in the negative situations experienced in different subjects. Based on these insights, initiatives can be taken to increase trust in the health system through health policies to be designed. Furthermore, the results highlight the growing importance of digitalization and digital hospitals in healthcare. Further progress in this direction will increase the satisfaction with health examination and contribute to positive results in health services.
d
Data from: 2010 County and City-Level Water-Use Data and Associated...
catalog.data.gov
data.usgs.gov
+1more
Updated Nov 20, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. Geological Survey (2025). 2010 County and City-Level Water-Use Data and Associated Explanatory Variables [Dataset]. https://catalog.data.gov/dataset/2010-county-and-city-level-water-use-data-and-associated-explanatory-variables
Explore at:
Dataset updated
Nov 20, 2025
Dataset provided by
U.S. Geological Survey
Description
This data release contains the input-data files and R scripts associated with the analysis presented in [citation of manuscript]. The spatial extent of the data is the contiguous U.S. The input-data files include one comma separated value (csv) file of county-level data, and one csv file of city-level data. The county-level csv (“county_data.csv”) contains data for 3,109 counties. This data includes two measures of water use, descriptive information about each county, three grouping variables (climate region, urban class, and economic dependency), and contains 18 explanatory variables: proportion of population growth from 2000-2010, fraction of withdrawals from surface water, average daily water yield, mean annual maximum temperature from 1970-2010, 2005-2010 maximum temperature departure from the 40-year maximum, mean annual precipitation from 1970-2010, 2005-2010 mean precipitation departure from the 40-year mean, Gini income disparity index, percent of county population with at least some college education, Cook Partisan Voting Index, housing density, median household income, average number of people per household, median age of structures, percent of renters, percent of single family homes, percent apartments, and a numeric version of urban class. The city-level csv (city_data.csv) contains data for 83 cities. This data includes descriptive information for each city, water-use measures, one grouping variable (climate region), and 6 explanatory variables: type of water bill (increasing block rate, decreasing block rate, or uniform), average price of water bill, number of requirement-oriented water conservation policies, number of rebate-oriented water conservation policies, aridity index, and regional price parity. The R scripts construct fixed-effects and Bayesian Hierarchical regression models. The primary difference between these models relates to how they handle possible clustering in the observations that define unique water-use settings. Fixed-effects models address possible clustering in one of two ways. In a "fully pooled" fixed-effects model, any clustering by group is ignored, and a single, fixed estimate of the coefficient for each covariate is developed using all of the observations. Conversely, in an unpooled fixed-effects model, separate coefficient estimates are developed only using the observations in each group. A hierarchical model provides a compromise between these two extremes. Hierarchical models extend single-level regression to data with a nested structure, whereby the model parameters vary at different levels in the model, including a lower level that describes the actual data and an upper level that influences the values taken by parameters in the lower level. The county-level models were compared using the Watanabe-Akaike information criterion (WAIC) which is derived from the log pointwise predictive density of the models and can be shown to approximate out-of-sample predictive performance. All script files are intended to be used with R statistical software (R Core Team (2017). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org) and Stan probabilistic modeling software (Stan Development Team. 2017. RStan: the R interface to Stan. R package version 2.16.2. http://mc-stan.org).
d
Mean climate variables for all subregions
data.gov.au
researchdata.edu.au
+1more
zip
Updated Apr 13, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Bioregional Assessment Program (2022). Mean climate variables for all subregions [Dataset]. https://data.gov.au/data/dataset/3f568840-0c77-4f74-bbf3-6f82d189a1fc
Explore at:
zip(134609)Available download formats
Dataset updated
Apr 13, 2022
Dataset authored and provided by
Bioregional Assessment Program
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Abstract

The dataset was derived by the Bioregional Assessment Programme. This dataset was derived from BILO Gridded Climate Data data provided by the CSIRO. You can find a link to the parent datasets in the Lineage Field in this metadata statement. The History Field in this metadata statement describes how this dataset was derived.

Various climate variable summary for all 15 subregions. Including:

Time series mean annual Bureau of Meteorology Australian Water Availability Project (BAWAP) rainfall from 1900 - 2012.

Long term average BAWAP rainfall and Penman Potential Evapotranspiration (PET) from Jan 1981 - Dec 2012 for each month

Values calculated over the years 1981 - 2012 (inclusive), for 17 time periods (i.e., annual, 4 seasons and 12 months) for the following 8 meteorological variables: (i) BAWAP_P; (ii) Penman ETp; (iii) Tavg (average temperature); (iv) Tmax (maximum temperature); (v) Tmin (minimum temperature); (vi) VPD (Vapour Pressure Deficit); (vii) Rn (net Radiation); and (viii) Wind speed. For each of the 17 time periods for each of the 8 meteorological variables have calculated the: (a) average; (b) maximum; (c) minimum; (d) average plus standard deviation (stddev); (e) average minus stddev; (f) stddev; and (g) trend.

Correlation coefficients (-1 to 1) between rainfall and 4 remote rainfall drivers between 1957-2006 for the four seasons. The data and methodology are described in Risbey et al. (2009). All data used in this analysis came directly from James Risbey, CSIRO Marine and Atmospheric Research (CMAR), Hobart. As described in the Risbey et al. (2009) paper, the rainfall was from 0.05 degree gridded data described in Jeffrey et al. (2001 - known as the SILO datasets); sea surface temperature was from the Hadley Centre Sea Ice and Sea Surface Temperature dataset (HadISST) on a 1 degree grid. BLK=Blocking; DMI=Dipole Mode Index; SAM=Southern Annular Mode; SOI=Southern Oscillation Index; DJF=December, January, February; MAM=March, April, May; JJA=June, July, August; SON=September, October, November. The analysis is a summary of Fig. 15 of Risbey et al. (2009).

Dataset History

Dataset was generated using various source data:

annual BAWAP rainfall

annual BAWAPrainfall

long term average BAWAP rainfall and Penman PET for each month

Monthly BAWAP rainfall

Monthly Penman PET

climate varaible trends

Monthly BAWAP rainfall

Monthly Penman PET

Monthly BAWAP Tair

Monthly BAWAP Tmax

Monthly BAWAP Tmin

Monthly VPD

Actual vapour measured at 9:00am, the saturated vapour is calculated from Tmax and Tmin.

Monthly Rn

Monthly Wind

This dataset is created by CLW Ecohydrological Time Series Remote Sensing Team. See http://www-data.iwis.csiro.au/ts/climate/wind/mcvicar_etal_grl2008/.

Correlation coefficients (-1 to 1) between rainfall and 4 remote rainfall drivers

Dataset Citation

Bioregional Assessment Programme (2013) Mean climate variables for all subregions. Bioregional Assessment Derived Dataset. Viewed 12 March 2019, http://data.bioregionalassessments.gov.au/dataset/3f568840-0c77-4f74-bbf3-6f82d189a1fc.

Dataset Ancestors

Derived From BILO Gridded Climate Data: Daily Climate Data for each year from 1900 to 2012
o
University SET data, with faculty and courses characteristics
openicpsr.org
Updated Sep 12, 2021
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Under blind review in refereed journal (2021). University SET data, with faculty and courses characteristics [Dataset]. http://doi.org/10.3886/E149801V1
Explore at:
Unique identifier
https://doi.org/10.3886/E149801V1
Dataset updated
Sep 12, 2021
Authors
Under blind review in refereed journal
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This paper explores a unique dataset of all the SET ratings provided by students of one university in Poland at the end of the winter semester of the 2020/2021 academic year. The SET questionnaire used by this university is presented in Appendix 1. The dataset is unique for several reasons. It covers all SET surveys filled by students in all fields and levels of study offered by the university. In the period analysed, the university was entirely in the online regime amid the Covid-19 pandemic. While the expected learning outcomes formally have not been changed, the online mode of study could have affected the grading policy and could have implications for some of the studied SET biases. This Covid-19 effect is captured by econometric models and discussed in the paper. The average SET scores were matched with the characteristics of the teacher for degree, seniority, gender, and SET scores in the past six semesters; the course characteristics for time of day, day of the week, course type, course breadth, class duration, and class size; the attributes of the SET survey responses as the percentage of students providing SET feedback; and the grades of the course for the mean, standard deviation, and percentage failed. Data on course grades are also available for the previous six semesters. This rich dataset allows many of the biases reported in the literature to be tested for and new hypotheses to be formulated, as presented in the introduction section. The unit of observation or the single row in the data set is identified by three parameters: teacher unique id (j), course unique id (k) and the question number in the SET questionnaire (n ϵ {1, 2, 3, 4, 5, 6, 7, 8, 9} ). It means that for each pair (j,k), we have nine rows, one for each SET survey question, or sometimes less when students did not answer one of the SET questions at all. For example, the dependent variable SET_score_avg(j,k,n) for the triplet (j=Calculus, k=John Smith, n=2) is calculated as the average of all Likert-scale answers to question nr 2 in the SET survey distributed to all students that took the Calculus course taught by John Smith. The data set has 8,015 such observations or rows. The full list of variables or columns in the data set included in the analysis is presented in the attached filesection. Their description refers to the triplet (teacher id = j, course id = k, question number = n). When the last value of the triplet (n) is dropped, it means that the variable takes the same values for all n ϵ {1, 2, 3, 4, 5, 6, 7, 8, 9}.Two attachments:- word file with variables description- Rdata file with the data set (for R language).Appendix 1. Appendix 1. The SET questionnaire was used for this paper. Evaluation survey of the teaching staff of [university name] Please, complete the following evaluation form, which aims to assess the lecturer’s performance. Only one answer should be indicated for each question. The answers are coded in the following way: 5- I strongly agree; 4- I agree; 3- Neutral; 2- I don’t agree; 1- I strongly don’t agree. Questions 1 2 3 4 5 I learnt a lot during the course. ○ ○ ○ ○ ○ I think that the knowledge acquired during the course is very useful. ○ ○ ○ ○ ○ The professor used activities to make the class more engaging. ○ ○ ○ ○ ○ If it was possible, I would enroll for the course conducted by this lecturer again. ○ ○ ○ ○ ○ The classes started on time. ○ ○ ○ ○ ○ The lecturer always used time efficiently. ○ ○ ○ ○ ○ The lecturer delivered the class content in an understandable and efficient way. ○ ○ ○ ○ ○ The lecturer was available when we had doubts. ○ ○ ○ ○ ○ The lecturer treated all students equally regardless of their race, background and ethnicity. ○ ○
Life Expectancy WHO
kaggle.com
zip
Updated Jun 19, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
vikram amin (2023). Life Expectancy WHO [Dataset]. https://www.kaggle.com/datasets/vikramamin/life-expectancy-who
Explore at:
zip(121472 bytes)Available download formats
Dataset updated
Jun 19, 2023
Authors
vikram amin
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Description
The objective behind attempting this dataset was to understand the predictors that contribute to the life expectancy around the world. I have used Linear Regression, Decision Tree and Random Forest for this purpose. Steps Involved: - Read the csv file - Data Cleaning: - Variables Country and Status were showing as having character data types. These had to be converted to factor - 2563 missing values were encountered with Population variable having the most of the missing values i.e 652 - Missing rows were dropped before we could run the analysis. 3) Run Linear Regression - Before running linear regression, 3 variables were dropped as they were not found to be having that much of an effect on the dependent variable i.e Life Expectancy. These 3 variables were Country, Year & Status. This meant we are now working with 19 variables (1 dependent and 18 independent variables) - We run the linear regression. Multiple R squared is 83% which means that independent variables can explain 83% change or variance in the dependent variable. - OULTLIER DETECTION. We check for outliers using IQR and find 54 outliers. These outliers are then removed before we run the regression analysis once again. Multiple R squared increased from 83% to 86%. - MULTICOLLINEARITY. We check for multicollinearity using the VIF model(Variance Inflation Factor). This is being done in case when two or more independent variables showing high correlation. The thumb rule is that absolute VIF values above 5 should be removed. We find 6 variables that have a VIF value higher than 5 namely Infant.deaths, percentage.expenditure,Under.five.deaths,GDP,thinness1.19,thinness5.9. Infant deaths and Under Five deaths have strong collinearity so we drop infant deaths(which has the higher VIF value). - When we run the linear regression model again, VIF value of Under.Five.Deaths goes down from 211.46 to 2.74 while the other variable's VIF values reduce very less. Variable thinness1.19 is now dropped and we run the regression once more. - Variable thinness5.9 whose absolute VIF value was 7.61 has now dropped to 1.95. GDP and Population are still having VIF value more than 5 but I decided against dropping these as I consider them to be important independent variables. - SET THE SEED AND SPLIT THE DATA INTO TRAIN AND TEST DATA. We run the train data and get multiple R squared of 86% and p value less than that of alpha which states that it is statistically significant. We use the train data to predict the test data to find out the RMSE and MAPE. We run the library(Metrics) for this purpose. - In Linear Regression, RMSE (Root Mean Squared Error) is 3.2. This indicates that on an average, the predicted values have an error of 3.2 years as compared to the actual life expectancy values. - MAPE (Mean Absolute Percentage Error) is 0.037. This indicates an accuracy prediction of 96.20% (1-0.037). - MAE (Mean Absolute Error) is 2.55. This indicates that on an average, the predicted values deviate by approximately 2.83 years from the actual values.

We use DECISION TREE MODEL for the analysis.

Run the required libraries (rpart, rpart.plot, RColorBrewer, rattle).

We run the decision tree analysis using rpart and plot the tree. We use fancyRpartPlot.

We use 5 fold cross validation method with CP (complexity parameter) being 0.01.

In Decision Tree , RMSE (Root Mean Squared Error) is 3.06. This indicates that on an average, the predicted values have an error of 3.06 years as compared to the actual life expectancy values.

MAPE (Mean Absolute Percentage Error) is 0.035. This indicates an accuracy prediction of 96.45% (1-0.035).

MAE (Mean Absolute Error) is 2.35. This indicates that on an average, the predicted values deviate by approximately 2.35 years from the actual values.

We use RANDOM FOREST for the analysis.

Run library(randomForest)

We use varImpPlot to find out which variables are most significant and least significant. Income composition is the most important followed by adult mortality and the least relevant independent variable is Population.

Predict Life expectancy through random forest model.

In Random Forest , RMSE (Root Mean Squared Error) is 1.73. This indicates that on an average, the predicted values have an error of 1.73 years as compared to the actual life expectancy values.

MAPE (Mean Absolute Percentage Error) is 0.01. This indicates an accuracy prediction of 98.27% (1-0.01).

MAE (Mean Absolute Error) is 1.14. This indicates that on an average, the predicted values deviate by approximately 1.14 years from the actual values.

Conclusion: Random Forest is the best model for predicting the life expectancy values as it has the lowest RMSE, MAPE and MAE.
f
Variable definition and descriptive statistics.
datasetcatalog.nlm.nih.gov
plos.figshare.com
Updated Aug 30, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Li, Xiaodi; Huang, Huan; Ma, Zhifei; Qin, Dongxue; Zhang, Xiangmin (2024). Variable definition and descriptive statistics. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001288711
Explore at:
Dataset updated
Aug 30, 2024
Authors
Li, Xiaodi; Huang, Huan; Ma, Zhifei; Qin, Dongxue; Zhang, Xiangmin
Description
This study examines the impact of internet usage on farmer’s adoption behavior of fertilizer reduction and efficiency enhancement technologies in China. Based on 1,295 questionnaires in Henan Province, this study constructs a counterfactual analysis framework and used endogenous switching probit model to analyze the effects and pathways of internet usage on farmer’s adoption behavior of chemical fertilizer reduction and efficiency enhancement technologies. The results indicate that. (1) The proportion of farmers adopting chemical fertilizer reduction and efficiency enhancement technologies is 60.15%, while the proportion of farmers not adopting these technologies is 39.85%. (2) Internet usage directly influences farmers’ adoption of fertilizer reduction and efficiency enhancement technologies. According to counterfactual assumption analysis, if farmers who currently use the Internet were to stop using it, the probability of them adopting these technologies would decrease by 28.09%. Conversely, for farmers who do not currently use the Internet, if they were to start using it, the probability of them adopting fertilizer reduction and efficiency enhancement technologies would increase by 40.67%. (3) Internet usage indirectly influences farmers’ adoption behavior through mediating pathways of expected benefits and risk perception. In addition, social networks negatively moderate the impact of internet usage on farmers’ behavior of chemical fertilizer reduction and efficiency enhancement technologies.
Descriptive statistics of the variables in the database.
plos.figshare.com
xls
Updated Jun 10, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
José M. Gavilán; Juan Haro; José Antonio Hinojosa; Isabel Fraga; Pilar Ferré (2023). Descriptive statistics of the variables in the database. [Dataset]. http://doi.org/10.1371/journal.pone.0254484.t001
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0254484.t001
Dataset updated
Jun 10, 2023
Dataset provided by
PLOShttp://plos.org/
Authors
José M. Gavilán; Juan Haro; José Antonio Hinojosa; Isabel Fraga; Pilar Ferré
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Descriptive statistics of the variables in the database.
f
UC_vs_US Statistic Analysis.xlsx
figshare.com
xlsx
Updated Jul 9, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
F. (Fabiano) Dalpiaz (2020). UC_vs_US Statistic Analysis.xlsx [Dataset]. http://doi.org/10.23644/uu.12631628.v1
Explore at:
xlsxAvailable download formats
Unique identifier
https://doi.org/10.23644/uu.12631628.v1
Dataset updated
Jul 9, 2020
Dataset provided by
Utrecht University
Authors
F. (Fabiano) Dalpiaz
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Sheet 1 (Raw-Data): The raw data of the study is provided, presenting the tagging results for the used measures described in the paper. For each subject, it includes multiple columns: A. a sequential student ID B an ID that defines a random group label and the notation C. the used notation: user Story or use Cases D. the case they were assigned to: IFA, Sim, or Hos E. the subject's exam grade (total points out of 100). Empty cells mean that the subject did not take the first exam F. a categorical representation of the grade L/M/H, where H is greater or equal to 80, M is between 65 included and 80 excluded, L otherwise G. the total number of classes in the student's conceptual model H. the total number of relationships in the student's conceptual model I. the total number of classes in the expert's conceptual model J. the total number of relationships in the expert's conceptual model K-O. the total number of encountered situations of alignment, wrong representation, system-oriented, omitted, missing (see tagging scheme below) P. the researchers' judgement on how well the derivation process explanation was explained by the student: well explained (a systematic mapping that can be easily reproduced), partially explained (vague indication of the mapping ), or not present.

Tagging scheme: Aligned (AL) - A concept is represented as a class in both models, either

with the same name or using synonyms or clearly linkable names; Wrongly represented (WR) - A class in the domain expert model is incorrectly represented in the student model, either (i) via an attribute, method, or relationship rather than class, or (ii) using a generic term (e.g., user'' instead ofurban planner''); System-oriented (SO) - A class in CM-Stud that denotes a technical implementation aspect, e.g., access control. Classes that represent legacy system or the system under design (portal, simulator) are legitimate; Omitted (OM) - A class in CM-Expert that does not appear in any way in CM-Stud; Missing (MI) - A class in CM-Stud that does not appear in any way in CM-Expert.

All the calculations and information provided in the following sheets

originate from that raw data.

Sheet 2 (Descriptive-Stats): Shows a summary of statistics from the data collection,

including the number of subjects per case, per notation, per process derivation rigor category, and per exam grade category.

Sheet 3 (Size-Ratio):

The number of classes within the student model divided by the number of classes within the expert model is calculated (describing the size ratio). We provide box plots to allow a visual comparison of the shape of the distribution, its central value, and its variability for each group (by case, notation, process, and exam grade) . The primary focus in this study is on the number of classes. However, we also provided the size ratio for the number of relationships between student and expert model.

Sheet 4 (Overall):

Provides an overview of all subjects regarding the encountered situations, completeness, and correctness, respectively. Correctness is defined as the ratio of classes in a student model that is fully aligned with the classes in the corresponding expert model. It is calculated by dividing the number of aligned concepts (AL) by the sum of the number of aligned concepts (AL), omitted concepts (OM), system-oriented concepts (SO), and wrong representations (WR). Completeness on the other hand, is defined as the ratio of classes in a student model that are correctly or incorrectly represented over the number of classes in the expert model. Completeness is calculated by dividing the sum of aligned concepts (AL) and wrong representations (WR) by the sum of the number of aligned concepts (AL), wrong representations (WR) and omitted concepts (OM). The overview is complemented with general diverging stacked bar charts that illustrate correctness and completeness.

For sheet 4 as well as for the following four sheets, diverging stacked bar

charts are provided to visualize the effect of each of the independent and mediated variables. The charts are based on the relative numbers of encountered situations for each student. In addition, a "Buffer" is calculated witch solely serves the purpose of constructing the diverging stacked bar charts in Excel. Finally, at the bottom of each sheet, the significance (T-test) and effect size (Hedges' g) for both completeness and correctness are provided. Hedges' g was calculated with an online tool: https://www.psychometrica.de/effect_size.html. The independent and moderating variables can be found as follows:

Sheet 5 (By-Notation):

Model correctness and model completeness is compared by notation - UC, US.

Sheet 6 (By-Case):

Model correctness and model completeness is compared by case - SIM, HOS, IFA.

Sheet 7 (By-Process):

Model correctness and model completeness is compared by how well the derivation process is explained - well explained, partially explained, not present.

Sheet 8 (By-Grade):

Model correctness and model completeness is compared by the exam grades, converted to categorical values High, Low , and Medium.
i
Household Health Survey 2012-2013, Economic Research Forum (ERF)...
catalog.ihsn.org
datacatalog.ihsn.org
Updated Jun 26, 2017
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Central Statistical Organization (CSO) (2017). Household Health Survey 2012-2013, Economic Research Forum (ERF) Harmonization Data - Iraq [Dataset]. https://catalog.ihsn.org/index.php/catalog/6937
Explore at:
Dataset updated
Jun 26, 2017
Dataset provided by
Kurdistan Regional Statistics Office (KRSO)
Central Statistical Organization (CSO)
Economic Research Forum
Time period covered
2012 - 2013
Area covered
Iraq
Description
Abstract

The harmonized data set on health, created and published by the ERF, is a subset of Iraq Household Socio Economic Survey (IHSES) 2012. It was derived from the household, individual and health modules, collected in the context of the above mentioned survey. The sample was then used to create a harmonized health survey, comparable with the Iraq Household Socio Economic Survey (IHSES) 2007 micro data set.

----> Overview of the Iraq Household Socio Economic Survey (IHSES) 2012:

Iraq is considered a leader in household expenditure and income surveys where the first was conducted in 1946 followed by surveys in 1954 and 1961. After the establishment of Central Statistical Organization, household expenditure and income surveys were carried out every 3-5 years in (1971/ 1972, 1976, 1979, 1984/ 1985, 1988, 1993, 2002 / 2007). Implementing the cooperation between CSO and WB, Central Statistical Organization (CSO) and Kurdistan Region Statistics Office (KRSO) launched fieldwork on IHSES on 1/1/2012. The survey was carried out over a full year covering all governorates including those in Kurdistan Region.

The survey has six main objectives. These objectives are:

Provide data for poverty analysis and measurement and monitor, evaluate and update the implementation Poverty Reduction National Strategy issued in 2009.

Provide comprehensive data system to assess household social and economic conditions and prepare the indicators related to the human development.

Provide data that meet the needs and requirements of national accounts.

Provide detailed indicators on consumption expenditure that serve making decision related to production, consumption, export and import.

Provide detailed indicators on the sources of households and individuals income.

Provide data necessary for formulation of a new consumer price index number.

The raw survey data provided by the Statistical Office were then harmonized by the Economic Research Forum, to create a comparable version with the 2006/2007 Household Socio Economic Survey in Iraq. Harmonization at this stage only included unifying variables' names, labels and some definitions. See: Iraq 2007 & 2012- Variables Mapping & Availability Matrix.pdf provided in the external resources for further information on the mapping of the original variables on the harmonized ones, in addition to more indications on the variables' availability in both survey years and relevant comments.

Geographic coverage

National coverage: Covering a sample of urban, rural and metropolitan areas in all the governorates including those in Kurdistan Region.

Analysis unit

1- Household/family. 2- Individual/person.

Universe

The survey was carried out over a full year covering all governorates including those in Kurdistan Region.

Kind of data

Sample survey data [ssd]

Sampling procedure

----> Design:

Sample size was (25488) household for the whole Iraq, 216 households for each district of 118 districts, 2832 clusters each of which includes 9 households distributed on districts and governorates for rural and urban.

----> Sample frame:

Listing and numbering results of 2009-2010 Population and Housing Survey were adopted in all the governorates including Kurdistan Region as a frame to select households, the sample was selected in two stages: Stage 1: Primary sampling unit (blocks) within each stratum (district) for urban and rural were systematically selected with probability proportional to size to reach 2832 units (cluster). Stage two: 9 households from each primary sampling unit were selected to create a cluster, thus the sample size of total survey clusters was 25488 households distributed on the governorates, 216 households in each district.

----> Sampling Stages:

In each district, the sample was selected in two stages: Stage 1: based on 2010 listing and numbering frame 24 sample points were selected within each stratum through systematic sampling with probability proportional to size, in addition to the implicit breakdown urban and rural and geographic breakdown (sub-district, quarter, street, county, village and block). Stage 2: Using households as secondary sampling units, 9 households were selected from each sample point using systematic equal probability sampling. Sampling frames of each stages can be developed based on 2010 building listing and numbering without updating household lists. In some small districts, random selection processes of primary sampling may lead to select less than 24 units therefore a sampling unit is selected more than once , the selection may reach two cluster or more from the same enumeration unit when it is necessary.

Mode of data collection

Face-to-face [f2f]

Research instrument

----> Preparation:

The questionnaire of 2006 survey was adopted in designing the questionnaire of 2012 survey on which many revisions were made. Two rounds of pre-test were carried out. Revision were made based on the feedback of field work team, World Bank consultants and others, other revisions were made before final version was implemented in a pilot survey in September 2011. After the pilot survey implemented, other revisions were made in based on the challenges and feedbacks emerged during the implementation to implement the final version in the actual survey.

----> Questionnaire Parts:

The questionnaire consists of four parts each with several sections: Part 1: Socio – Economic Data: - Section 1: Household Roster - Section 2: Emigration - Section 3: Food Rations - Section 4: housing - Section 5: education - Section 6: health - Section 7: Physical measurements - Section 8: job seeking and previous job

Part 2: Monthly, Quarterly and Annual Expenditures: - Section 9: Expenditures on Non – Food Commodities and Services (past 30 days). - Section 10 : Expenditures on Non – Food Commodities and Services (past 90 days). - Section 11: Expenditures on Non – Food Commodities and Services (past 12 months). - Section 12: Expenditures on Non-food Frequent Food Stuff and Commodities (7 days). - Section 12, Table 1: Meals Had Within the Residential Unit. - Section 12, table 2: Number of Persons Participate in the Meals within Household Expenditure Other Than its Members.

Part 3: Income and Other Data: - Section 13: Job - Section 14: paid jobs - Section 15: Agriculture, forestry and fishing - Section 16: Household non – agricultural projects - Section 17: Income from ownership and transfers - Section 18: Durable goods - Section 19: Loans, advances and subsidies - Section 20: Shocks and strategy of dealing in the households - Section 21: Time use - Section 22: Justice - Section 23: Satisfaction in life - Section 24: Food consumption during past 7 days

Part 4: Diary of Daily Expenditures: Diary of expenditure is an essential component of this survey. It is left at the household to record all the daily purchases such as expenditures on food and frequent non-food items such as gasoline, newspapers…etc. during 7 days. Two pages were allocated for recording the expenditures of each day, thus the roster will be consists of 14 pages.

Cleaning operations

----> Raw Data:

Data Editing and Processing: To ensure accuracy and consistency, the data were edited at the following stages: 1. Interviewer: Checks all answers on the household questionnaire, confirming that they are clear and correct. 2. Local Supervisor: Checks to make sure that questions has been correctly completed. 3. Statistical analysis: After exporting data files from excel to SPSS, the Statistical Analysis Unit uses program commands to identify irregular or non-logical values in addition to auditing some variables. 4. World Bank consultants in coordination with the CSO data management team: the World Bank technical consultants use additional programs in SPSS and STAT to examine and correct remaining inconsistencies within the data files. The software detects errors by analyzing questionnaire items according to the expected parameter for each variable.

----> Harmonized Data:

The SPSS package is used to harmonize the Iraq Household Socio Economic Survey (IHSES) 2007 with Iraq Household Socio Economic Survey (IHSES) 2012.

The harmonization process starts with raw data files received from the Statistical Office.

A program is generated for each dataset to create harmonized variables.

Data is saved on the household and individual level, in SPSS and then converted to STATA, to be disseminated.

Response rate

Iraq Household Socio Economic Survey (IHSES) reached a total of 25488 households. Number of households refused to response was 305, response rate was 98.6%. The highest interview rates were in Ninevah and Muthanna (100%) while the lowest rates were in Sulaimaniya (92%).
Film Circulation dataset
zenodo.org
data.niaid.nih.gov
bin, csv, png
Updated Jul 12, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Skadi Loist; Skadi Loist; Evgenia (Zhenya) Samoilova; Evgenia (Zhenya) Samoilova (2024). Film Circulation dataset [Dataset]. http://doi.org/10.5281/zenodo.7887672
Explore at:
csv, png, binAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.7887672
Dataset updated
Jul 12, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Skadi Loist; Skadi Loist; Evgenia (Zhenya) Samoilova; Evgenia (Zhenya) Samoilova
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Complete dataset of “Film Circulation on the International Film Festival Network and the Impact on Global Film Culture”

A peer-reviewed data paper for this dataset is in review to be published in NECSUS_European Journal of Media Studies - an open access journal aiming at enhancing data transparency and reusability, and will be available from https://necsus-ejms.org/ and https://mediarep.org

Please cite this when using the dataset.

Detailed description of the dataset:

1 Film Dataset: Festival Programs

The Film Dataset consists a data scheme image file, a codebook and two dataset tables in csv format.

The codebook (csv file “1_codebook_film-dataset_festival-program”) offers a detailed description of all variables within the Film Dataset. Along with the definition of variables it lists explanations for the units of measurement, data sources, coding and information on missing data.

The csv file “1_film-dataset_festival-program_long” comprises a dataset of all films and the festivals, festival sections, and the year of the festival edition that they were sampled from. The dataset is structured in the long format, i.e. the same film can appear in several rows when it appeared in more than one sample festival. However, films are identifiable via their unique ID.

The csv file “1_film-dataset_festival-program_wide” consists of the dataset listing only unique films (n=9,348). The dataset is in the wide format, i.e. each row corresponds to a unique film, identifiable via its unique ID. For easy analysis, and since the overlap is only six percent, in this dataset the variable sample festival (fest) corresponds to the first sample festival where the film appeared. For instance, if a film was first shown at Berlinale (in February) and then at Frameline (in June of the same year), the sample festival will list “Berlinale”. This file includes information on unique and IMDb IDs, the film title, production year, length, categorization in length, production countries, regional attribution, director names, genre attribution, the festival, festival section and festival edition the film was sampled from, and information whether there is festival run information available through the IMDb data.

2 Survey Dataset

The Survey Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.

The codebook “2_codebook_survey-dataset” includes coding information for both survey datasets. It lists the definition of the variables or survey questions (corresponding to Samoilova/Loist 2019), units of measurement, data source, variable type, range and coding, and information on missing data.

The csv file “2_survey-dataset_long-festivals_shared-consent” consists of a subset (n=161) of the original survey dataset (n=454), where respondents provided festival run data for films (n=206) and gave consent to share their data for research purposes. This dataset consists of the festival data in a long format, so that each row corresponds to the festival appearance of a film.

The csv file “2_survey-dataset_wide-no-festivals_shared-consent” consists of a subset (n=372) of the original dataset (n=454) of survey responses corresponding to sample films. It includes data only for those films for which respondents provided consent to share their data for research purposes. This dataset is shown in wide format of the survey data, i.e. information for each response corresponding to a film is listed in one row. This includes data on film IDs, film title, survey questions regarding completeness and availability of provided information, information on number of festival screenings, screening fees, budgets, marketing costs, market screenings, and distribution. As the file name suggests, no data on festival screenings is included in the wide format dataset.

3 IMDb & Scripts

The IMDb dataset consists of a data scheme image file, one codebook and eight datasets, all in csv format. It also includes the R scripts that we used for scraping and matching.

The codebook “3_codebook_imdb-dataset” includes information for all IMDb datasets. This includes ID information and their data source, coding and value ranges, and information on missing data.

The csv file “3_imdb-dataset_aka-titles_long” contains film title data in different languages scraped from IMDb in a long format, i.e. each row corresponds to a title in a given language.

The csv file “3_imdb-dataset_awards_long” contains film award data in a long format, i.e. each row corresponds to an award of a given film.

The csv file “3_imdb-dataset_companies_long” contains data on production and distribution companies of films. The dataset is in a long format, so that each row corresponds to a particular company of a particular film.

The csv file “3_imdb-dataset_crew_long” contains data on names and roles of crew members in a long format, i.e. each row corresponds to each crew member. The file also contains binary gender assigned to directors based on their first names using the GenderizeR application.

The csv file “3_imdb-dataset_festival-runs_long” contains festival run data scraped from IMDb in a long format, i.e. each row corresponds to the festival appearance of a given film. The dataset does not include each film screening, but the first screening of a film at a festival within a given year. The data includes festival runs up to 2019.

The csv file “3_imdb-dataset_general-info_wide” contains general information about films such as genre as defined by IMDb, languages in which a film was shown, ratings, and budget. The dataset is in wide format, so that each row corresponds to a unique film.

The csv file “3_imdb-dataset_release-info_long” contains data about non-festival release (e.g., theatrical, digital, tv, dvd/blueray). The dataset is in a long format, so that each row corresponds to a particular release of a particular film.

The csv file “3_imdb-dataset_websites_long” contains data on available websites (official websites, miscellaneous, photos, video clips). The dataset is in a long format, so that each row corresponds to a website of a particular film.

The dataset includes 8 text files containing the script for webscraping. They were written using the R-3.6.3 version for Windows.

The R script “r_1_unite_data” demonstrates the structure of the dataset, that we use in the following steps to identify, scrape, and match the film data.

The R script “r_2_scrape_matches” reads in the dataset with the film characteristics described in the “r_1_unite_data” and uses various R packages to create a search URL for each film from the core dataset on the IMDb website. The script attempts to match each film from the core dataset to IMDb records by first conducting an advanced search based on the movie title and year, and then potentially using an alternative title and a basic search if no matches are found in the advanced search. The script scrapes the title, release year, directors, running time, genre, and IMDb film URL from the first page of the suggested records from the IMDb website. The script then defines a loop that matches (including matching scores) each film in the core dataset with suggested films on the IMDb search page. Matching was done using data on directors, production year (+/- one year), and title, a fuzzy matching approach with two methods: “cosine” and “osa.” where the cosine similarity is used to match titles with a high degree of similarity, and the OSA algorithm is used to match titles that may have typos or minor variations.

The script “r_3_matching” creates a dataset with the matches for a manual check. Each pair of films (original film from the core dataset and the suggested match from the IMDb website was categorized in the following five categories: a) 100% match: perfect match on title, year, and director; b) likely good match; c) maybe match; d) unlikely match; and e) no match). The script also checks for possible doubles in the dataset and identifies them for a manual check.

The script “r_4_scraping_functions” creates a function for scraping the data from the identified matches (based on the scripts described above and manually checked). These functions are used for scraping the data in the next script.

The script “r_5a_extracting_info_sample” uses the function defined in the “r_4_scraping_functions”, in order to scrape the IMDb data for the identified matches. This script does that for the first 100 films, to check, if everything works. Scraping for the entire dataset took a few hours. Therefore, a test with a subsample of 100 films is advisable.

The script “r_5b_extracting_info_all” extracts the data for the entire dataset of the identified matches.

The script “r_5c_extracting_info_skipped” checks the films with missing data (where data was not scraped) and tried to extract data one more time to make sure that the errors were not caused by disruptions in the internet connection or other technical issues.

The script “r_check_logs” is used for troubleshooting and tracking the progress of all of the R scripts used. It gives information on the amount of missing values and errors.

4 Festival Library Dataset

The Festival Library Dataset consists of a data scheme image file, one codebook and one dataset, all in csv format.

The codebook (csv file “4_codebook_festival-library_dataset”) offers a detailed description of all variables within the Library Dataset. It lists the definition of variables, such as location and festival name, and festival categories,
Customer Satisfaction Scores and Behavior Data
kaggle.com
zip
Updated Apr 6, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Salahuddin Ahmed (2025). Customer Satisfaction Scores and Behavior Data [Dataset]. https://www.kaggle.com/datasets/salahuddinahmedshuvo/customer-satisfaction-scores-and-behavior-data/discussion
Explore at:
zip(2456 bytes)Available download formats
Dataset updated
Apr 6, 2025
Authors
Salahuddin Ahmed
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset contains customer satisfaction scores collected from a survey, alongside key demographic and behavioral data. It includes variables such as customer age, gender, location, purchase history, support contact status, loyalty level, and satisfaction factors. The dataset is designed to help analyze customer satisfaction, identify trends, and develop insights that can drive business decisions.

File Information: File Name: customer_satisfaction_data.csv (or your specific file name)

File Type: CSV (or the actual file format you are using)

Number of Rows: 120

Number of Columns: 10

Column Names:

Customer_ID – Unique identifier for each customer (e.g., 81-237-4704)

Group – The group to which the customer belongs (A or B)

Satisfaction_Score – Customer's satisfaction score on a scale of 1-10

Age – Age of the customer

Gender – Gender of the customer (Male, Female)

Location – Customer's location (e.g., Phoenix.AZ, Los Angeles.CA)

Purchase_History – Whether the customer has made a purchase (Yes or No)

Support_Contacted – Whether the customer has contacted support (Yes or No)

Loyalty_Level – Customer's loyalty level (Low, Medium, High)

Satisfaction_Factor – Primary factor contributing to customer satisfaction (e.g., Price, Product Quality)

Statistical Analyses:

Descriptive Statistics:

Calculate mean, median, mode, standard deviation, and range for key numerical variables (e.g., Satisfaction Score, Age).

Summarize categorical variables (e.g., Gender, Loyalty Level, Purchase History) with frequency distributions and percentages.

Two-Sample t-Test (Independent t-test):

Compare the mean satisfaction scores between two independent groups (e.g., Group A vs. Group B) to determine if there is a significant difference in their average satisfaction scores.

Paired t-Test:

If there are two related measurements (e.g., satisfaction scores before and after a certain event), you can compare the means using a paired t-test.

One-Way ANOVA (Analysis of Variance):

Test if there are significant differences in mean satisfaction scores across more than two groups (e.g., comparing the mean satisfaction score across different Loyalty Levels).

Chi-Square Test for Independence:

Examine the relationship between two categorical variables (e.g., Gender vs. Purchase History or Loyalty Level vs. Support Contacted) to determine if there’s a significant association.

Mann-Whitney U Test:

For non-normally distributed data, use this test to compare satisfaction scores between two independent groups (e.g., Group A vs. Group B) to see if their distributions differ significantly.

Kruskal-Wallis Test:

Similar to ANOVA, but used for non-normally distributed data. This test can compare the median satisfaction scores across multiple groups (e.g., comparing satisfaction scores across Loyalty Levels or Satisfaction Factors).

Spearman’s Rank Correlation:

Test for a monotonic relationship between two ordinal or continuous variables (e.g., Age vs. Satisfaction Score or Satisfaction Score vs. Loyalty Level).

Regression Analysis:

Linear Regression: Model the relationship between a continuous dependent variable (e.g., Satisfaction Score) and independent variables (e.g., Age, Gender, Loyalty Level).

Logistic Regression: If analyzing binary outcomes (e.g., Purchase History or Support Contacted), you could model the probability of an outcome based on predictors.

Factor Analysis:

To identify underlying patterns or groups in customer behavior or satisfaction factors, you can apply Factor Analysis to reduce the dimensionality of the dataset and group similar variables.

Cluster Analysis:

Use K-Means Clustering or Hierarchical Clustering to group customers based on similarity in their satisfaction scores and other features (e.g., Loyalty Level, Purchase History).

Confidence Intervals:

Calculate confidence intervals for the mean of satisfaction scores or any other metric to estimate the range in which the true population mean might lie.
Weather and Housing in North America
kaggle.com
zip
Updated Feb 13, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
The Devastator (2023). Weather and Housing in North America [Dataset]. https://www.kaggle.com/datasets/thedevastator/weather-and-housing-in-north-america
Explore at:
zip(512280 bytes)Available download formats
Dataset updated
Feb 13, 2023
Authors
The Devastator
License
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Area covered
North America
Description
Weather and Housing in North America

Exploring the Relationship between Weather and Housing Conditions in 2012

By [source]

About this dataset

This comprehensive dataset explores the relationship between housing and weather conditions across North America in 2012. Through a range of climate variables such as temperature, wind speed, humidity, pressure and visibility it provides unique insights into the weather-influenced environment of numerous regions. The interrelated nature of housing parameters such as longitude, latitude, median income, median house value and ocean proximity further enhances our understanding of how distinct climates play an integral part in area real estate valuations. Analyzing these two data sets offers a wealth of knowledge when it comes to understanding what factors can dictate the value and comfort level offered by residential areas throughout North America

More Datasets

For more datasets, click here.

Featured Notebooks

🚨 Your notebook can be here! 🚨!

How to use the dataset

This dataset offers plenty of insights into the effects of weather and housing on North American regions. To explore these relationships, you can perform data analysis on the variables provided.

First, start by examining descriptive statistics (i.e., mean, median, mode). This can help show you the general trend and distribution of each variable in this dataset. For example, what is the most common temperature in a given region? What is the average wind speed? How does this vary across different regions? By looking at descriptive statistics, you can get an initial idea of how various weather conditions and housing attributes interact with one another.

Next, explore correlations between variables. Are certain weather variables correlated with specific housing attributes? Is there a link between wind speeds and median house value? Or between humidity and ocean proximity? Analyzing correlations allows for deeper insights into how different aspects may influence one another for a given region or area. These correlations may also inform broader patterns that are present across multiple North American regions or countries.

Finally, use visualizations to further investigate this relationship between climate and housing attributes in North America in 2012. Graphs allow you visualize trends like seasonal variations or long-term changes over time more easily so they are useful when interpreting large amounts of data quickly while providing larger context beyond what numbers alone can tell us about relationships between different aspects within this dataset

Research Ideas

Analyzing the effect of climate change on housing markets across North America. By looking at temperature and weather trends in combination with housing values, researchers can better understand how climate change may be impacting certain regions differently than others.

Investigating the relationship between median income, house values and ocean proximity in coastal areas. Understanding how ocean proximity plays into housing prices may help inform real estate investment decisions and urban planning initiatives related to coastal development.

Utilizing differences in weather patterns across different climates to determine optimal seasonal rental prices for property owners. By analyzing changes in temperature, wind speed, humidity, pressure and visibility from season to season an investor could gain valuable insights into seasonal market trends to maximize their profits from rentals or Airbnb listings over time

Acknowledgements

If you use this dataset in your research, please credit the original authors. Data Source

License

License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

Columns

File: Weather.csv | Column name | Description | |:---------------------|:-----------------------------------------------| | Date/Time | Date and time of the observation. (Date/Time) | | Temp_C | Temperature in Celsius. (Numeric) | | Dew Point Temp_C | Dew point temperature in Celsius. (Numeric) | | Rel Hum_% | Relative humidity in percent. (Numeric) | | Wind Speed_km/h | Wind speed in kilometers per hour. (Numeric) | | Visibility_km | Visibilit...
Data from: S1 Dataset -
plos.figshare.com
xlsx
Updated Jul 18, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Navid Behzadi Koochani; Raúl Muñoz Romo; Ignacio Hernández Palencia; Sergio López Bernal; Carmen Martin Curto; José Cabezas Rodríguez; Almudena Castaño Reguillo (2024). S1 Dataset - [Dataset]. http://doi.org/10.1371/journal.pone.0305699.s002
Explore at:
xlsxAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0305699.s002
Dataset updated
Jul 18, 2024
Dataset provided by
PLOShttp://plos.org/
Authors
Navid Behzadi Koochani; Raúl Muñoz Romo; Ignacio Hernández Palencia; Sergio López Bernal; Carmen Martin Curto; José Cabezas Rodríguez; Almudena Castaño Reguillo
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
IntroductionThere is a need to develop harmonized procedures and a Minimum Data Set (MDS) for cross-border Multi Casualty Incidents (MCI) in medical emergency scenarios to ensure appropriate management of such incidents, regardless of place, language and internal processes of the institutions involved. That information should be capable of real-time communication to the command-and-control chain. It is crucial that the models adopted are interoperable between countries so that the rights of patients to cross-border healthcare are fully respected.ObjectiveTo optimize management of cross-border Multi Casualty Incidents through a Minimum Data Set collected and communicated in real time to the chain of command and control for each incident. To determine the degree of agreement among experts.MethodWe used the modified Delphi method supplemented with the Utstein technique to reach consensus among experts. In the first phase, the minimum requirements of the project, the profile of the experts who were to participate, the basic requirements of each variable chosen and the way of collecting the data were defined by providing bibliography on the subject. In the second phase, the preliminary variables were grouped into 6 clusters, the objectives, the characteristics of the variables and the logistics of the work were approved. Several meetings were held to reach a consensus to choose the MDS variables using a Modified Delphi technique. Each expert had to score each variable from 1 to 10. Non-voting variables were eliminated, and the round of voting ended. In the third phase, the Utstein Style was applied to discuss each group of variables and choose the ones with the highest consensus. After several rounds of discussion, it was agreed to eliminate the variables with a score of less than 5 points. In phase four, the researchers submitted the variables to the external experts for final assessment and validation before their use in the simulations. Data were analysed with SPSS Statistics (IBM, version 2) software.ResultsSix data entities with 31 sub-entities were defined, generating 127 items representing the final MDS regarded as essential for incident management. The level of consensus for the choice of items was very high and was highest for the category ‘Incident’ with an overall kappa of 0.7401 (95% CI 0.1265–0.5812, p 0.000), a good level of consensus in the Landis and Koch model. The items with the greatest degree of consensus at ten were those relating to location, type of incident, date, time and identification of the incident. All items met the criteria set, such as digital collection and real-time transmission to the chain of command and control.ConclusionsThis study documents the development of a MDS through consensus with a high degree of agreement among a group of experts of different nationalities working in different fields. All items in the MDS were digitally collected and forwarded in real time to the chain of command and control. This tool has demonstrated its validity in four large cross-border simulations involving more than eight countries and their emergency services.
Data from: A dataset to model Levantine landcover and land-use change...
zenodo.org
data.niaid.nih.gov
+1more
zip
Updated Dec 16, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Michael Kempf; Michael Kempf (2023). A dataset to model Levantine landcover and land-use change connected to climate change, the Arab Spring and COVID-19 [Dataset]. http://doi.org/10.5281/zenodo.10396148
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.10396148
Dataset updated
Dec 16, 2023
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Michael Kempf; Michael Kempf
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Dec 16, 2023
Area covered
Levant
Description
Overview

This dataset is the repository for the following paper submitted to Data in Brief:

Kempf, M. A dataset to model Levantine landcover and land-use change connected to climate change, the Arab Spring and COVID-19. Data in Brief (submitted: December 2023).

The Data in Brief article contains the supplement information and is the related data paper to:

Kempf, M. Climate change, the Arab Spring, and COVID-19 - Impacts on landcover transformations in the Levant. Journal of Arid Environments (revision submitted: December 2023).

Description/abstract

The Levant region is highly vulnerable to climate change, experiencing prolonged heat waves that have led to societal crises and population displacement. Since 2010, the area has been marked by socio-political turmoil, including the Syrian civil war and currently the escalation of the so-called Israeli-Palestinian Conflict, which strained neighbouring countries like Jordan due to the influx of Syrian refugees and increases population vulnerability to governmental decision-making. Jordan, in particular, has seen rapid population growth and significant changes in land-use and infrastructure, leading to over-exploitation of the landscape through irrigation and construction. This dataset uses climate data, satellite imagery, and land cover information to illustrate the substantial increase in construction activity and highlights the intricate relationship between climate change predictions and current socio-political developments in the Levant.

Folder structure

The main folder after download contains all data, in which the following subfolders are stored are stored as zipped files:

“code” stores the above described 9 code chunks to read, extract, process, analyse, and visualize the data.

“MODIS_merged” contains the 16-days, 250 m resolution NDVI imagery merged from three tiles (h20v05, h21v05, h21v06) and cropped to the study area, n=510, covering January 2001 to December 2022 and including January and February 2023.

“mask” contains a single shapefile, which is the merged product of administrative boundaries, including Jordan, Lebanon, Israel, Syria, and Palestine (“MERGED_LEVANT.shp”).

“yield_productivity” contains .csv files of yield information for all countries listed above.

“population” contains two files with the same name but different format. The .csv file is for processing and plotting in R. The .ods file is for enhanced visualization of population dynamics in the Levant (Socio_cultural_political_development_database_FAO2023.ods).

“GLDAS” stores the raw data of the NASA Global Land Data Assimilation System datasets that can be read, extracted (variable name), and processed using code “8_GLDAS_read_extract_trend” from the respective folder. One folder contains data from 1975-2022 and a second the additional January and February 2023 data.

“built_up” contains the landcover and built-up change data from 1975 to 2022. This folder is subdivided into two subfolder which contain the raw data and the already processed data. “raw_data” contains the unprocessed datasets and “derived_data” stores the cropped built_up datasets at 5 year intervals, e.g., “Levant_built_up_1975.tif”.

Code structure

1_MODIS_NDVI_hdf_file_extraction.R

This is the first code chunk that refers to the extraction of MODIS data from .hdf file format. The following packages must be installed and the raw data must be downloaded using a simple mass downloader, e.g., from google chrome. Packages: terra. Download MODIS data from after registration from: https://lpdaac.usgs.gov/products/mod13q1v061/ or https://search.earthdata.nasa.gov/search (MODIS/Terra Vegetation Indices 16-Day L3 Global 250m SIN Grid V061, last accessed, 09th of October 2023). The code reads a list of files, extracts the NDVI, and saves each file to a single .tif-file with the indication “NDVI”. Because the study area is quite large, we have to load three different (spatially) time series and merge them later. Note that the time series are temporally consistent.

2_MERGE_MODIS_tiles.R

In this code, we load and merge the three different stacks to produce large and consistent time series of NDVI imagery across the study area. We further use the package gtools to load the files in (1, 2, 3, 4, 5, 6, etc.). Here, we have three stacks from which we merge the first two (stack 1, stack 2) and store them. We then merge this stack with stack 3. We produce single files named NDVI_final_*consecutivenumber*.tif. Before saving the final output of single merged files, create a folder called “merged” and set the working directory to this folder, e.g., setwd("your directory_MODIS/merged").

3_CROP_MODIS_merged_tiles.R

Now we want to crop the derived MODIS tiles to our study area. We are using a mask, which is provided as .shp file in the repository, named "MERGED_LEVANT.shp". We load the merged .tif files and crop the stack with the vector. Saving to individual files, we name them “NDVI_merged_clip_*consecutivenumber*.tif. We now produced single cropped NDVI time series data from MODIS.
The repository provides the already clipped and merged NDVI datasets.

4_TREND_analysis_NDVI.R

Now, we want to perform trend analysis from the derived data. The data we load is tricky as it contains 16-days return period across a year for the period of 22 years. Growing season sums contain MAM (March-May), JJA (June-August), and SON (September-November). December is represented as a single file, which means that the period DJF (December-February) is represented by 5 images instead of 6. For the last DJF period (December 2022), the data from January and February 2023 can be added. The code selects the respective images from the stack, depending on which period is under consideration. From these stacks, individual annually resolved growing season sums are generated and the slope is calculated. We can then extract the p-values of the trend and characterize all values with high confidence level (0.05). Using the ggplot2 package and the melt function from reshape2 package, we can create a plot of the reclassified NDVI trends together with a local smoother (LOESS) of value 0.3.
To increase comparability and understand the amplitude of the trends, z-scores were calculated and plotted, which show the deviation of the values from the mean. This has been done for the NDVI values as well as the GLDAS climate variables as a normalization technique.

5_BUILT_UP_change_raster.R

Let us look at the landcover changes now. We are working with the terra package and get raster data from here: https://ghsl.jrc.ec.europa.eu/download.php?ds=bu (last accessed 03. March 2023, 100 m resolution, global coverage). Here, one can download the temporal coverage that is aimed for and reclassify it using the code after cropping to the individual study area. Here, I summed up different raster to characterize the built-up change in continuous values between 1975 and 2022.

6_POPULATION_numbers_plot.R

For this plot, one needs to load the .csv-file “Socio_cultural_political_development_database_FAO2023.csv” from the repository. The ggplot script provided produces the desired plot with all countries under consideration.

7_YIELD_plot.R

In this section, we are using the country productivity from the supplement in the repository “yield_productivity” (e.g., "Jordan_yield.csv". Each of the single country yield datasets is plotted in a ggplot and combined using the patchwork package in R.

8_GLDAS_read_extract_trend

The last code provides the basis for the trend analysis of the climate variables used in the paper. The raw data can be accessed https://disc.gsfc.nasa.gov/datasets?keywords=GLDAS%20Noah%20Land%20Surface%20Model%20L4%20monthly&page=1 (last accessed 9th of October 2023). The raw data comes in .nc file format and various variables can be extracted using the [“^a variable name”] command from the spatraster collection. Each time you run the code, this variable name must be adjusted to meet the requirements for the variables (see this link for abbreviations: https://disc.gsfc.nasa.gov/datasets/GLDAS_CLSM025_D_2.0/summary, last accessed 09th of October 2023; or the respective code chunk when reading a .nc file with the ncdf4 package in R) or run print(nc) from the code or use names(the spatraster collection).
Choosing one variable, the code uses the MERGED_LEVANT.shp mask from the repository to crop and mask the data to the outline of the study area.
From the processed data, trend analysis are conducted and z-scores were calculated following the code described above. However, annual trends require the frequency of the time series analysis to be set to value = 12. Regarding, e.g., rainfall, which is measured as annual sums and not means, the chunk r.sum=r.sum/12 has to be removed or set to r.sum=r.sum/1 to avoid calculating annual mean values (see other variables). Seasonal subset can be calculated as described in the code. Here, 3-month subsets were chosen for growing seasons, e.g. March-May (MAM), June-July (JJA), September-November (SON), and DJF (December-February, including Jan/Feb of the consecutive year).
From the data, mean values of 48 consecutive years are calculated and trend analysis are performed as describe above. In the same way, p-values are extracted and 95 % confidence level values are marked with dots on the raster plot. This analysis can be performed with a much longer time series, other variables, ad different spatial extent across the globe due to the availability of the GLDAS variables.
d
Data from: Population dynamics of an invasive forest insect and associated...
catalog.data.gov
agdatacommons.nal.usda.gov
+1more
Updated Apr 21, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Agricultural Research Service (2025). Data from: Population dynamics of an invasive forest insect and associated natural enemies in the aftermath of invasion [Dataset]. https://catalog.data.gov/dataset/data-from-population-dynamics-of-an-invasive-forest-insect-and-associated-natural-enemies--cb1db
Explore at:
Dataset updated
Apr 21, 2025
Dataset provided by
Agricultural Research Service
Description
Datasets archived here consist of all data analyzed in Duan et al. 2015 from Journal of Applied Ecology. Specifically, these data were collected from annual sampling of emerald ash borer (Agrilus planipennis) immature stages and associated parasitoids on infested ash trees (Fraxinus) in Southern Michigan, where three introduced biological control agents had been released between 2007 - 2010. Detailed data collection procedures can be found in Duan et al. 2012, 2013, and 2015. Resources in this dataset:Resource Title: Duan J Data on EAB larval density-bird predation and unknown factor from Journal of Applied Ecology. File Name: Duan J Data on EAB larval density-bird predation and unknown factor from Journal of Applied Ecology.xlsxResource Description: This data set is used to calculate mean EAB density (per m2 of ash phloem area), bird predation rate and mortality rate caused by unknown factors and analyzed with JMP (10.2) scripts for mixed effect linear models in Duan et al. 2015 (Journal of Applied Ecology).Resource Title: DUAN J Data on Parasitism L1-L2 Excluded from Journal of Applied Ecology. File Name: DUAN J Data on Parasitism L1-L2 Excluded from Journal of Applied Ecology.xlsxResource Description: This data set is used to construct life tables and calculation of net population growth rate of emerald ash borer for each site. The net population growth rates were then analyzed with JMP (10.2) scripts for mixed effect linear models in Duan et al. 2015 (Journal of Applied Ecology).Resource Title: DUAN J Data on EAB Life Tables Calculation from Journal of Applied Ecology. File Name: DUAN J Data on EAB Life Tables Calculation from Journal of Applied Ecology.xlsxResource Description: This data set is used to calculate parasitism rate of EAB larvae for each tree and then analyzed with JMP (10.2) scripts for mixed effect linear models on in Duan et al. 2015 (Journal of Applied Ecology).Resource Title: READ ME for Emerald Ash Borer Biocontrol Study from Journal of Applied Ecology. File Name: READ_ME_for_Emerald_Ash_Borer_Biocontrol_Study_from_Journal_of_Applied_Ecology.docxResource Description: Additional information and definitions for the variables/content in the three Emerald Ash Borer Biocontrol Study tables: Data on EAB Life Tables Calculation Data on EAB larval density-bird predation and unknown factor Data on Parasitism L1-L2 Excluded from Journal of Applied Ecology Resource Title: Data Dictionary for Emerald Ash Borer Biocontrol Study from Journal of Applied Ecology. File Name: AshBorerAnd Parasitoids_DataDictionary.csvResource Description: CSV data dictionary for the variables/content in the three Emerald Ash Borer Biocontrol Study tables: Data on EAB Life Tables Calculation Data on EAB larval density-bird predation and unknown factor Data on Parasitism L1-L2 Excluded from Journal of Applied Ecology Fore more information see the related READ ME file.

Facebook

Twitter

Click to copy link

Link copied

Cite

He, Weiming; Wang, Jiaxue; Guo, Linan (2023). Variable definition and statistics. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001021701

Variable definition and statistics.

Explore at:

Dataset updated

Nov 30, 2023

Authors

He, Weiming; Wang, Jiaxue; Guo, Linan

Description

China is one of the countries hardest hit by disasters. Disaster shocks not only cause a large number of casualties and property damage but also have an impact on the risk preference of those who experience it. Current research has not reached a consensus conclusion on the impact of risk preferences. This paper empirically analyzes the effects of natural and man-made disasters on residents’ risk preference based on the data of the China Household Financial Survey (CHFS) in 2019. The results indicate that: (1) Both natural and man-made disasters can significantly lead to an increase in the risk aversion of residents, and man-made disasters have a greater impact. (2) Education background plays a negative moderating role in the impact of man-made disasters on residents’ risk preference. (3) Natural disaster experiences have a greater impact on the risk preference of rural residents, while man-made disaster experiences have a greater impact on the risk preference of urban residents. Natural disaster experiences make rural residents more risk-averse, while man-made disaster experiences make urban residents more risk-averse. The results provide new evidence and perspective on the negative impact of disaster shocks on the social life of residents.

Clear search

Close search

Google apps

Main menu

Variable definition and statistics.

Variable definition and descriptive statistics.

Data from: Variable definition.

Definitions of independent variables used in the statistical analysis.

Variable definitions, sources and summary statistics.

Variable definitions.

Data from: 2010 County and City-Level Water-Use Data and Associated...

Mean climate variables for all subregions

Abstract

Dataset History

Dataset Citation

Dataset Ancestors

University SET data, with faculty and courses characteristics

Life Expectancy WHO

We use DECISION TREE MODEL for the analysis.

We use RANDOM FOREST for the analysis.

Variable definition and descriptive statistics.

Descriptive statistics of the variables in the database.

UC_vs_US Statistic Analysis.xlsx

Household Health Survey 2012-2013, Economic Research Forum (ERF)...

Abstract

Geographic coverage

Analysis unit

Universe

Kind of data

Sampling procedure

Mode of data collection

Research instrument

Cleaning operations

Response rate

Film Circulation dataset

Customer Satisfaction Scores and Behavior Data

Weather and Housing in North America

Weather and Housing in North America

Exploring the Relationship between Weather and Housing Conditions in 2012

About this dataset

More Datasets

Featured Notebooks

How to use the dataset

Research Ideas

Acknowledgements

License

Columns

Data from: S1 Dataset -

Data from: A dataset to model Levantine landcover and land-use change...

Data from: Population dynamics of an invasive forest insect and associated...

Variable definition and statistics.See More Versions

Variable definition and statistics.