China is one of the countries hardest hit by disasters. Disaster shocks not only cause large numbers of casualties and substantial property damage but also affect the risk preferences of those who experience them. Existing research has not reached a consensus on how disaster shocks affect risk preferences. This paper empirically analyzes the effects of natural and man-made disasters on residents' risk preferences using data from the 2019 China Household Finance Survey (CHFS). The results indicate that: (1) Both natural and man-made disasters significantly increase residents' risk aversion, with man-made disasters having the greater impact. (2) Educational background negatively moderates the impact of man-made disasters on residents' risk preferences. (3) Natural disaster experiences have a greater impact on the risk preferences of rural residents, making them more risk-averse, while man-made disaster experiences have a greater impact on the risk preferences of urban residents, making them more risk-averse. The results provide new evidence and a new perspective on the negative impact of disaster shocks on residents' social lives.
Variable definitions and descriptive statistics.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Definitions of independent variables used in the statistical analysis.
Variable definitions, sources and summary statistics.
The impact of a chief executive officer's (CEO's) functional experience on firm performance has attracted the attention of many scholars. However, measures of functional experience are rarely disclosed in public databases, and few studies have examined CEOs' comprehensive functional experience. Drawing on upper echelons theory, this paper obtained deep-level curriculum vitae (CV) data through named entity recognition. First, we mined CEOs' CVs from Chinese listed companies covering 15 consecutive years, from 2006 to 2020. Second, we extracted information spanning their entire careers and automatically classified their functional hierarchy. Finally, we constructed breadth (functional breadth: richness of functional experience) and depth (functional depth: average tenure and hierarchy within a function) measures for empirical analysis. We found that a CEO's breadth is significantly negatively related to firm performance, with a significantly positive quadratic term, while a CEO's depth is significantly positively related to firm performance, with a significantly negative quadratic term. These results indicate a U-shaped relationship between a CEO's breadth and firm performance and an inverted U-shaped relationship between their depth and firm performance. The findings extend the literature on the determinants of firm performance and on CEOs' functional experience, expanding from the horizontal macro level to the vertical micro level and providing new evidence to support the recruitment and selection of high-level corporate talent.
This data release contains the input-data files and R scripts associated with the analysis presented in [citation of manuscript]. The spatial extent of the data is the contiguous U.S. The input-data files include one comma-separated values (csv) file of county-level data and one csv file of city-level data.

The county-level csv ("county_data.csv") contains data for 3,109 counties. It includes two measures of water use, descriptive information about each county, three grouping variables (climate region, urban class, and economic dependency), and 18 explanatory variables: proportion of population growth from 2000-2010, fraction of withdrawals from surface water, average daily water yield, mean annual maximum temperature from 1970-2010, 2005-2010 maximum temperature departure from the 40-year maximum, mean annual precipitation from 1970-2010, 2005-2010 mean precipitation departure from the 40-year mean, Gini income disparity index, percent of county population with at least some college education, Cook Partisan Voting Index, housing density, median household income, average number of people per household, median age of structures, percent of renters, percent of single-family homes, percent apartments, and a numeric version of urban class.

The city-level csv ("city_data.csv") contains data for 83 cities. It includes descriptive information for each city, water-use measures, one grouping variable (climate region), and 6 explanatory variables: type of water bill (increasing block rate, decreasing block rate, or uniform), average price of water bill, number of requirement-oriented water conservation policies, number of rebate-oriented water conservation policies, aridity index, and regional price parity.

The R scripts construct fixed-effects and Bayesian hierarchical regression models. The primary difference between these models relates to how they handle possible clustering in the observations that define unique water-use settings. Fixed-effects models address possible clustering in one of two ways. In a "fully pooled" fixed-effects model, any clustering by group is ignored, and a single, fixed estimate of the coefficient for each covariate is developed using all of the observations. Conversely, in an unpooled fixed-effects model, separate coefficient estimates are developed using only the observations in each group. A hierarchical model provides a compromise between these two extremes. Hierarchical models extend single-level regression to data with a nested structure, whereby the model parameters vary at different levels in the model, including a lower level that describes the actual data and an upper level that influences the values taken by parameters in the lower level. The county-level models were compared using the Watanabe-Akaike information criterion (WAIC), which is derived from the log pointwise predictive density of the models and can be shown to approximate out-of-sample predictive performance.

All script files are intended to be used with R statistical software (R Core Team (2017). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org) and Stan probabilistic modeling software (Stan Development Team. 2017. RStan: the R interface to Stan. R package version 2.16.2. http://mc-stan.org).
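To make the pooling distinction concrete, here is a schematic sketch in R. It uses lme4's lmer as a stand-in for the Stan-based hierarchical models in this release, with hypothetical names (wu for water use, x for one covariate, region for the grouping variable):

```r
# Schematic only: the actual scripts fit these models in Stan via RStan.
library(lme4)

pooled   <- lm(wu ~ x, data = counties)                      # clustering ignored
unpooled <- lm(wu ~ 0 + region + x:region, data = counties)  # separate fit per group
partial  <- lmer(wu ~ x + (1 + x | region), data = counties) # hierarchical compromise
```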
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Sheet 1 (Raw-Data): The raw data of the study is provided, presenting the tagging results for the measures described in the paper. For each subject, it includes multiple columns:
A. a sequential student ID
B. an ID that defines a random group label and the notation
C. the used notation: user story or use case
D. the case they were assigned to: IFA, Sim, or Hos
E. the subject's exam grade (total points out of 100); empty cells mean that the subject did not take the first exam
F. a categorical representation of the grade (L/M/H), where H is greater than or equal to 80, M is between 65 (included) and 80 (excluded), and L otherwise
G. the total number of classes in the student's conceptual model
H. the total number of relationships in the student's conceptual model
I. the total number of classes in the expert's conceptual model
J. the total number of relationships in the expert's conceptual model
K-O. the total number of encountered situations of alignment, wrong representation, system-oriented, omitted, and missing (see tagging scheme below)
P. the researchers' judgement on how well the derivation process was explained by the student: well explained (a systematic mapping that can be easily reproduced), partially explained (vague indication of the mapping), or not present.
Tagging scheme:
Aligned (AL) - A concept is represented as a class in both models, either
with the same name or using synonyms or clearly linkable names;
Wrongly represented (WR) - A class in the domain expert model is
incorrectly represented in the student model, either (i) via an attribute,
method, or relationship rather than class, or
(ii) using a generic term (e.g., "user" instead of "urban planner");
System-oriented (SO) - A class in CM-Stud that denotes a technical
implementation aspect, e.g., access control. Classes that represent a legacy
system or the system under design (portal, simulator) are legitimate;
Omitted (OM) - A class in CM-Expert that does not appear in any way in
CM-Stud;
Missing (MI) - A class in CM-Stud that does not appear in any way in
CM-Expert.
All the calculations and information provided in the following sheets
originate from that raw data.
Sheet 2 (Descriptive-Stats): Shows a summary of statistics from the data collection,
including the number of subjects per case, per notation, per process derivation rigor category, and per exam grade category.
Sheet 3 (Size-Ratio):
The number of classes within the student model divided by the number of classes within the expert model is calculated (describing the size ratio). We provide box plots to allow a visual comparison of the shape of the distribution, its central value, and its variability for each group (by case, notation, process, and exam grade). The primary focus in this study is on the number of classes; however, we also provide the size ratio for the number of relationships between the student and expert models.
Sheet 4 (Overall):
Provides an overview of all subjects regarding the encountered situations, completeness, and correctness. Correctness is defined as the ratio of classes in a student model that are fully aligned with the classes in the corresponding expert model. It is calculated by dividing the number of aligned concepts (AL) by the sum of the number of aligned concepts (AL), omitted concepts (OM), system-oriented concepts (SO), and wrong representations (WR). Completeness, on the other hand, is defined as the ratio of classes in a student model that are correctly or incorrectly represented over the number of classes in the expert model. It is calculated by dividing the sum of aligned concepts (AL) and wrong representations (WR) by the sum of the number of aligned concepts (AL), wrong representations (WR), and omitted concepts (OM). The overview is complemented with general diverging stacked bar charts that illustrate correctness and completeness.
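In formula form, using the tag counts defined above:

$$\text{correctness} = \frac{AL}{AL + WR + SO + OM}, \qquad \text{completeness} = \frac{AL + WR}{AL + WR + OM}$$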
For sheet 4 as well as for the following four sheets, diverging stacked bar charts are provided to visualize the effect of each of the independent and moderating variables. The charts are based on the relative numbers of encountered situations for each student. In addition, a "Buffer" is calculated, which solely serves the purpose of constructing the diverging stacked bar charts in Excel. Finally, at the bottom of each sheet, the significance (t-test) and effect size (Hedges' g) for both completeness and correctness are provided. Hedges' g was calculated with an online tool: https://www.psychometrica.de/effect_size.html (a formula sketch follows the sheet list below). The independent and moderating variables can be found as follows:
Sheet 5 (By-Notation):
Model correctness and model completeness are compared by notation - UC, US.
Sheet 6 (By-Case):
Model correctness and model completeness are compared by case - SIM, HOS, IFA.
Sheet 7 (By-Process):
Model correctness and model completeness are compared by how well the derivation process is explained - well explained, partially explained, not present.
Sheet 8 (By-Grade):
Model correctness and model completeness are compared by exam grade, converted to the categorical values High, Medium, and Low.
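For reference, the Hedges' g reported at the bottom of each sheet follows the standard pooled-SD formula with a small-sample correction; a minimal R sketch (not the online tool's code):

```r
# Hedges' g for two groups x1, x2 (standard formula with bias correction).
hedges_g <- function(x1, x2) {
  n1 <- length(x1); n2 <- length(x2)
  s_pooled <- sqrt(((n1 - 1) * var(x1) + (n2 - 1) * var(x2)) / (n1 + n2 - 2))
  g <- (mean(x1) - mean(x2)) / s_pooled
  g * (1 - 3 / (4 * (n1 + n2) - 9))   # small-sample bias correction
}
```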
https://creativecommons.org/publicdomain/zero/1.0/
By [source]
This comprehensive dataset explores the relationship between housing and weather conditions across North America in 2012. Through a range of climate variables such as temperature, wind speed, humidity, pressure, and visibility, it provides unique insight into the weather-influenced environment of numerous regions. The interrelated housing parameters, such as longitude, latitude, median income, median house value, and ocean proximity, further enhance our understanding of how distinct climates play an integral part in regional real estate valuations. Analyzing these two data sets offers a wealth of knowledge about which factors dictate the value and comfort of residential areas throughout North America.
This dataset offers plenty of insights into the effects of weather and housing on North American regions. To explore these relationships, you can perform data analysis on the variables provided.
First, start by examining descriptive statistics (e.g., mean, median, mode). These can show the general trend and distribution of each variable in the dataset. For example, what is the most common temperature in a given region? What is the average wind speed? How does this vary across different regions? Looking at descriptive statistics gives an initial idea of how various weather conditions and housing attributes interact with one another.
Next, explore correlations between variables. Are certain weather variables correlated with specific housing attributes? Is there a link between wind speeds and median house value? Or between humidity and ocean proximity? Analyzing correlations allows for deeper insights into how different aspects may influence one another for a given region or area. These correlations may also inform broader patterns that are present across multiple North American regions or countries.
Finally, use visualizations to further investigate the relationship between climate and housing attributes in North America in 2012. Graphs let you visualize trends such as seasonal variations or long-term changes more easily, so they are useful for interpreting large amounts of data quickly while providing context beyond what the numbers alone can tell us.
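As a starting point, a minimal exploratory sketch in R, assuming Weather.csv is in the working directory (column names follow the data dictionary below):

```r
# Read with check.names = FALSE so names like "Rel Hum_%" are kept verbatim.
weather <- read.csv("Weather.csv", check.names = FALSE)

summary(weather[["Temp_C"]])                      # descriptive statistics
mean(weather[["Wind Speed_km/h"]], na.rm = TRUE)  # average wind speed

# Correlation between two climate variables:
cor(weather[["Temp_C"]], weather[["Rel Hum_%"]], use = "complete.obs")
```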
- Analyzing the effect of climate change on housing markets across North America. By looking at temperature and weather trends in combination with housing values, researchers can better understand how climate change may be impacting certain regions differently than others.
- Investigating the relationship between median income, house values and ocean proximity in coastal areas. Understanding how ocean proximity plays into housing prices may help inform real estate investment decisions and urban planning initiatives related to coastal development.
- Utilizing differences in weather patterns across climates to determine optimal seasonal rental prices for property owners. By analyzing changes in temperature, wind speed, humidity, pressure, and visibility from season to season, an investor could gain valuable insights into seasonal market trends to maximize profits from rentals or Airbnb listings over time.
If you use this dataset in your research, please credit the original authors.
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: Weather.csv

| Column name | Description |
|:-----------------|:-----------------------------------------------|
| Date/Time | Date and time of the observation. (Date/Time) |
| Temp_C | Temperature in Celsius. (Numeric) |
| Dew Point Temp_C | Dew point temperature in Celsius. (Numeric) |
| Rel Hum_% | Relative humidity in percent. (Numeric) |
| Wind Speed_km/h | Wind speed in kilometers per hour. (Numeric) |
| Visibility_km | Visibilit... |
https://creativecommons.org/publicdomain/zero/1.0/
The objective of this dataset was to understand the predictors that contribute to life expectancy around the world. I used Linear Regression, Decision Tree, and Random Forest models for this purpose. Steps involved:

1. Read the csv file.
2. Data cleaning:
- The variables Country and Status had character data types and had to be converted to factors.
- 2,563 missing values were encountered, with the Population variable having the most (652).
- Rows with missing values were dropped before running the analysis.
3. Run linear regression:
- Before running the regression, 3 variables (Country, Year, and Status) were dropped, as they did not have much of an effect on the dependent variable, Life Expectancy. This left 19 variables (1 dependent and 18 independent).
- Multiple R-squared is 83%, meaning the independent variables explain 83% of the variance in the dependent variable.
- OUTLIER DETECTION: outliers were checked using the IQR, and 54 were found. After removing them and rerunning the regression, multiple R-squared increased from 83% to 86%.
- MULTICOLLINEARITY: multicollinearity was checked using the Variance Inflation Factor (VIF), since two or more independent variables may be highly correlated. The rule of thumb is that variables with absolute VIF values above 5 should be removed. Six variables had a VIF above 5: Infant.deaths, percentage.expenditure, Under.five.deaths, GDP, thinness1.19, and thinness5.9. Infant deaths and under-five deaths were strongly collinear, so Infant.deaths (the higher VIF value) was dropped.
- After rerunning the model, the VIF of Under.five.deaths fell from 211.46 to 2.74, while the other variables' VIF values decreased only slightly. The variable thinness1.19 was then dropped and the regression run once more.
- The absolute VIF of thinness5.9 dropped from 7.61 to 1.95. GDP and Population still had VIF values above 5, but I decided against dropping them, as I consider them important independent variables.
4. Set the seed and split the data into train and test sets:
- The train data gives a multiple R-squared of 86% with a p-value below alpha, i.e., statistically significant. The trained model was used to predict the test data, computing RMSE and MAPE with library(Metrics).
- RMSE (Root Mean Squared Error) is 3.2: on average, the predicted values have an error of 3.2 years compared to the actual life expectancy values.
- MAPE (Mean Absolute Percentage Error) is 0.037, indicating a prediction accuracy of about 96.3% (1 − 0.037).
- MAE (Mean Absolute Error) is 2.55: on average, the predicted values deviate by approximately 2.55 years from the actual values.
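A minimal R sketch of the core steps (not the author's exact script; the data frame life and the column LifeExpectancy are placeholders):

```r
library(car)      # vif()
library(Metrics)  # rmse(), mape(), mae()

fit <- lm(LifeExpectancy ~ ., data = life)
summary(fit)$r.squared          # multiple R-squared
vif(fit)                        # drop predictors with VIF > 5, then refit

set.seed(42)
idx   <- sample(nrow(life), 0.7 * nrow(life))
train <- life[idx, ]
test  <- life[-idx, ]
fit   <- lm(LifeExpectancy ~ ., data = train)
pred  <- predict(fit, newdata = test)

rmse(test$LifeExpectancy, pred)
mape(test$LifeExpectancy, pred)
mae(test$LifeExpectancy, pred)
```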
Conclusion: Random Forest is the best model for predicting the life expectancy values as it has the lowest RMSE, MAPE and MAE.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Descriptive statistics of the variables in the database.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This paper explores a unique dataset of all the SET ratings provided by students of one university in Poland at the end of the winter semester of the 2020/2021 academic year. The SET questionnaire used by this university is presented in Appendix 1. The dataset is unique for several reasons. It covers all SET surveys filled in by students in all fields and levels of study offered by the university. In the period analysed, the university operated entirely online amid the Covid-19 pandemic. While the expected learning outcomes formally have not been changed, the online mode of study could have affected the grading policy and could have implications for some of the studied SET biases. This Covid-19 effect is captured by econometric models and discussed in the paper. The average SET scores were matched with the characteristics of the teacher (degree, seniority, gender, and SET scores in the past six semesters); the course characteristics (time of day, day of the week, course type, course breadth, class duration, and class size); the attributes of the SET survey responses (the percentage of students providing SET feedback); and the grades of the course (mean, standard deviation, and percentage failed). Data on course grades are also available for the previous six semesters. This rich dataset allows many of the biases reported in the literature to be tested for and new hypotheses to be formulated, as presented in the introduction section.

The unit of observation, or single row in the data set, is identified by three parameters: teacher unique id (j), course unique id (k), and the question number in the SET questionnaire (n ∈ {1, 2, 3, 4, 5, 6, 7, 8, 9}). This means that for each pair (j, k) we have nine rows, one for each SET survey question, or sometimes fewer when students did not answer one of the SET questions at all. For example, the dependent variable SET_score_avg(j,k,n) for the triplet (j = John Smith, k = Calculus, n = 2) is calculated as the average of all Likert-scale answers to question no. 2 in the SET survey distributed to all students that took the Calculus course taught by John Smith. The data set has 8,015 such observations or rows. The full list of variables or columns in the data set included in the analysis is presented in the attached file section. Their description refers to the triplet (teacher id = j, course id = k, question number = n). When the last value of the triplet (n) is dropped, it means that the variable takes the same values for all n ∈ {1, 2, 3, 4, 5, 6, 7, 8, 9}.

Two attachments:
- Word file with variable descriptions
- Rdata file with the data set (for the R language)

Appendix 1. The SET questionnaire used for this paper.

Evaluation survey of the teaching staff of [university name]. Please complete the following evaluation form, which aims to assess the lecturer's performance. Only one answer should be indicated for each question. The answers are coded in the following way: 5 = I strongly agree; 4 = I agree; 3 = Neutral; 2 = I don't agree; 1 = I strongly don't agree.

Questions (each rated on the 1-5 scale above):
1. I learnt a lot during the course.
2. I think that the knowledge acquired during the course is very useful.
3. The professor used activities to make the class more engaging.
4. If it was possible, I would enroll for the course conducted by this lecturer again.
5. The classes started on time.
6. The lecturer always used time efficiently.
7. The lecturer delivered the class content in an understandable and efficient way.
8. The lecturer was available when we had doubts.
9. The lecturer treated all students equally regardless of their race, background and ethnicity.
This study examines the impact of internet usage on farmers' adoption of fertilizer reduction and efficiency enhancement technologies in China. Based on 1,295 questionnaires from Henan Province, the study constructs a counterfactual analysis framework and uses an endogenous switching probit model to analyze the effects and pathways of internet usage on farmers' adoption of chemical fertilizer reduction and efficiency enhancement technologies. The results indicate that: (1) The proportion of farmers adopting chemical fertilizer reduction and efficiency enhancement technologies is 60.15%, while the proportion not adopting them is 39.85%. (2) Internet usage directly influences farmers' adoption of these technologies. According to the counterfactual analysis, if farmers who currently use the internet were to stop using it, the probability of their adopting these technologies would decrease by 28.09%; conversely, if farmers who do not currently use the internet were to start using it, the probability of their adopting these technologies would increase by 40.67%. (3) Internet usage indirectly influences farmers' adoption behavior through the mediating pathways of expected benefits and risk perception. In addition, social networks negatively moderate the impact of internet usage on farmers' adoption of chemical fertilizer reduction and efficiency enhancement technologies.
https://creativecommons.org/publicdomain/zero/1.0/
The training set should be used to build your machine learning models. For the training set, we provide the outcome (also known as the “ground truth”) for each passenger. Your model will be based on “features” like passengers’ gender and class. You can also use feature engineering to create new features.
The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.
We also include gender_submission.csv, a set of predictions that assume all and only female passengers survive, as an example of what a submission file should look like.
| Variable | Definition | Key |
| --- | --- | --- |
| survival | Survival | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Sex | |
| Age | Age in years | |
| sibsp | # of siblings / spouses aboard the Titanic | |
| parch | # of parents / children aboard the Titanic | |
| ticket | Ticket number | |
| fare | Passenger fare | |
| cabin | Cabin number | |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |
pclass: A proxy for socio-economic status (SES) 1st = Upper 2nd = Middle 3rd = Lower
age: Age is fractional if less than 1. If the age is estimated, it is in the form xx.5.
sibsp: The dataset defines family relations in this way... Sibling = brother, sister, stepbrother, stepsister Spouse = husband, wife (mistresses and fiancés were ignored)
parch: The dataset defines family relations in this way... Parent = mother, father Child = daughter, son, stepdaughter, stepson Some children travelled only with a nanny, therefore parch=0 for them.
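A minimal baseline sketch in R (illustrative, not a reference solution; column names follow the standard Kaggle files):

```r
# Logistic regression on two features, then a submission file in the
# same format as gender_submission.csv.
train <- read.csv("train.csv")
test  <- read.csv("test.csv")

fit  <- glm(Survived ~ Sex + Pclass, data = train, family = binomial)
prob <- predict(fit, newdata = test, type = "response")

submission <- data.frame(PassengerId = test$PassengerId,
                         Survived    = as.integer(prob > 0.5))
write.csv(submission, "submission.csv", row.names = FALSE)
```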
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview
This dataset is the repository for the following paper submitted to Data in Brief:
Kempf, M. A dataset to model Levantine landcover and land-use change connected to climate change, the Arab Spring and COVID-19. Data in Brief (submitted: December 2023).
The Data in Brief article contains the supplement information and is the related data paper to:
Kempf, M. Climate change, the Arab Spring, and COVID-19 - Impacts on landcover transformations in the Levant. Journal of Arid Environments (revision submitted: December 2023).
Description/abstract
The Levant region is highly vulnerable to climate change, experiencing prolonged heat waves that have led to societal crises and population displacement. Since 2010, the area has been marked by socio-political turmoil, including the Syrian civil war and, currently, the escalation of the so-called Israeli-Palestinian conflict, which has strained neighbouring countries like Jordan through the influx of Syrian refugees and has increased the population's vulnerability to governmental decision-making. Jordan, in particular, has seen rapid population growth and significant changes in land use and infrastructure, leading to over-exploitation of the landscape through irrigation and construction. This dataset uses climate data, satellite imagery, and land cover information to illustrate the substantial increase in construction activity and highlights the intricate relationship between climate change predictions and current socio-political developments in the Levant.
Folder structure
The main folder after download contains all data; the following subfolders are stored as zipped files:
“code” stores the 9 code chunks described below to read, extract, process, analyse, and visualize the data.
“MODIS_merged” contains the 16-day, 250 m resolution NDVI imagery merged from three tiles (h20v05, h21v05, h21v06) and cropped to the study area, n=510, covering January 2001 to December 2022 and including January and February 2023.
“mask” contains a single shapefile, which is the merged product of administrative boundaries, including Jordan, Lebanon, Israel, Syria, and Palestine (“MERGED_LEVANT.shp”).
“yield_productivity” contains .csv files of yield information for all countries listed above.
“population” contains two files with the same name but different format. The .csv file is for processing and plotting in R. The .ods file is for enhanced visualization of population dynamics in the Levant (Socio_cultural_political_development_database_FAO2023.ods).
“GLDAS” stores the raw data of the NASA Global Land Data Assimilation System datasets, which can be read, extracted (by variable name), and processed using code “8_GLDAS_read_extract_trend” from the respective folder. One folder contains data from 1975-2022 and a second contains the additional January and February 2023 data.
“built_up” contains the landcover and built-up change data from 1975 to 2022. This folder is subdivided into two subfolders, which contain the raw data and the already processed data: “raw_data” contains the unprocessed datasets and “derived_data” stores the cropped built-up datasets at 5-year intervals, e.g., “Levant_built_up_1975.tif”.
Code structure
1_MODIS_NDVI_hdf_file_extraction.R
This is the first code chunk and refers to the extraction of MODIS data from the .hdf file format. The following packages must be installed, and the raw data must be downloaded using a simple mass downloader, e.g., from Google Chrome. Packages: terra. Download MODIS data, after registration, from: https://lpdaac.usgs.gov/products/mod13q1v061/ or https://search.earthdata.nasa.gov/search (MODIS/Terra Vegetation Indices 16-Day L3 Global 250m SIN Grid V061, last accessed 9th of October 2023). The code reads a list of files, extracts the NDVI, and saves each file to a single .tif file with the indication “NDVI”. Because the study area is quite large, we have to load three different (spatial) time series and merge them later. Note that the time series are temporally consistent.
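A minimal sketch of this step (assuming the NDVI layer is the first subdataset of each MOD13Q1 granule; adjust the index or layer name if needed):

```r
library(terra)

files <- list.files("your_directory_MODIS", pattern = "\\.hdf$", full.names = TRUE)
for (f in files) {
  s    <- sds(f)   # subdatasets of the granule
  ndvi <- s[1]     # assumes NDVI is the first subdataset
  writeRaster(ndvi, sub("\\.hdf$", "_NDVI.tif", f), overwrite = TRUE)
}
```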
2_MERGE_MODIS_tiles.R
In this code, we load and merge the three different stacks to produce a large and consistent time series of NDVI imagery across the study area. We further use the gtools package to load the files in natural order (1, 2, 3, 4, 5, 6, etc.). Here, we have three stacks, of which we merge the first two (stack 1, stack 2) and store the result. We then merge this stack with stack 3. We produce single files named NDVI_final_*consecutivenumber*.tif. Before saving the final output of single merged files, create a folder called “merged” and set the working directory to this folder, e.g., setwd("your directory_MODIS/merged").
3_CROP_MODIS_merged_tiles.R
Now we want to crop the merged MODIS tiles to our study area. We use a mask, provided as a .shp file in the repository, named "MERGED_LEVANT.shp". We load the merged .tif files and crop the stack with the vector. Saving to individual files, we name them “NDVI_merged_clip_*consecutivenumber*.tif”. This produces the single cropped NDVI time series data from MODIS.
The repository provides the already clipped and merged NDVI datasets.
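A condensed sketch of steps 2-3 (folder names are placeholders; the actual scripts write each date to its own file as described above):

```r
library(terra)
library(gtools)

# Load the three tile stacks, with files sorted in natural order (1, 2, 3, ...).
stack1 <- rast(mixedsort(list.files("tile_h20v05", pattern = "\\.tif$", full.names = TRUE)))
stack2 <- rast(mixedsort(list.files("tile_h21v05", pattern = "\\.tif$", full.names = TRUE)))
stack3 <- rast(mixedsort(list.files("tile_h21v06", pattern = "\\.tif$", full.names = TRUE)))

merged <- merge(merge(stack1, stack2), stack3)   # one large NDVI time series

levant  <- vect("MERGED_LEVANT.shp")
clipped <- mask(crop(merged, levant), levant)    # crop/mask to the study area
writeRaster(clipped, "NDVI_merged_clip.tif", overwrite = TRUE)
```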
4_TREND_analysis_NDVI.R
Now we want to perform trend analysis on the derived data. The data we load are tricky, as they contain 16-day return periods across a year for a period of 22 years. Growing-season sums cover MAM (March-May), JJA (June-August), and SON (September-November). December is represented as a single file, which means that the period DJF (December-February) is represented by 5 images instead of 6. For the last DJF period (December 2022), the data from January and February 2023 can be added. The code selects the respective images from the stack, depending on which period is under consideration. From these stacks, individual annually resolved growing-season sums are generated and the slope is calculated. We can then extract the p-values of the trend and characterize all values at a high confidence level (0.05). Using the ggplot2 package and the melt function from the reshape2 package, we can create a plot of the reclassified NDVI trends together with a local smoother (LOESS) with a span of 0.3.
To increase comparability and understand the amplitude of the trends, z-scores were calculated and plotted, which show the deviation of the values from the mean. This has been done for the NDVI values as well as the GLDAS climate variables as a normalization technique.
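The z-score used here is the standard standardization; for a numeric vector x in R:

```r
# Deviation of each value from the series mean, in standard deviations.
z <- (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
```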
5_BUILT_UP_change_raster.R
Let us look at the landcover changes now. We are working with the terra package and get raster data from here: https://ghsl.jrc.ec.europa.eu/download.php?ds=bu (last accessed 3 March 2023, 100 m resolution, global coverage). One can download the temporal coverage that is aimed for and reclassify it using the code after cropping to the individual study area. Here, I summed up different rasters to characterize the built-up change in continuous values between 1975 and 2022.
6_POPULATION_numbers_plot.R
For this plot, one needs to load the .csv-file “Socio_cultural_political_development_database_FAO2023.csv” from the repository. The ggplot script provided produces the desired plot with all countries under consideration.
7_YIELD_plot.R
In this section, we use the country productivity data from the “yield_productivity” supplement in the repository (e.g., "Jordan_yield.csv"). Each of the single-country yield datasets is plotted with ggplot and combined using the patchwork package in R.
8_GLDAS_read_extract_trend
The last code chunk provides the basis for the trend analysis of the climate variables used in the paper. The raw data can be accessed at https://disc.gsfc.nasa.gov/datasets?keywords=GLDAS%20Noah%20Land%20Surface%20Model%20L4%20monthly&page=1 (last accessed 9th of October 2023). The raw data come in .nc file format, and various variables can be extracted using the [“^a variable name”] command from the spatraster collection. Each time you run the code, this variable name must be adjusted to meet the requirements for the variables (see this link for abbreviations: https://disc.gsfc.nasa.gov/datasets/GLDAS_CLSM025_D_2.0/summary, last accessed 9th of October 2023; or the respective code chunk when reading a .nc file with the ncdf4 package in R), or run print(nc) from the code, or use names(the spatraster collection).
Choosing one variable, the code uses the MERGED_LEVANT.shp mask from the repository to crop and mask the data to the outline of the study area.
From the processed data, trend analyses are conducted and z-scores are calculated following the code described above. However, annual trends require the frequency of the time series analysis to be set to value = 12. Regarding, e.g., rainfall, which is measured as annual sums and not means, the chunk r.sum=r.sum/12 has to be removed or set to r.sum=r.sum/1 to avoid calculating annual mean values (see other variables). Seasonal subsets can be calculated as described in the code. Here, 3-month subsets were chosen for growing seasons, e.g., March-May (MAM), June-August (JJA), September-November (SON), and DJF (December-February, including Jan/Feb of the consecutive year).
From the data, mean values of 48 consecutive years are calculated and trend analyses are performed as described above. In the same way, p-values are extracted and values at the 95% confidence level are marked with dots on the raster plot. This analysis can be performed with a much longer time series, other variables, and different spatial extents across the globe, given the availability of the GLDAS variables.
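A minimal sketch of the GLDAS workflow (the variable pattern "^Tair" is only an example; see the abbreviation link above for real variable names):

```r
library(terra)

# Read the .nc files as one collection; which files combine cleanly depends
# on the GLDAS product, so treat this as a starting point.
nc   <- rast(list.files("GLDAS", pattern = "\\.nc$", full.names = TRUE))
tair <- nc[[grep("^Tair", names(nc))]]       # pick one variable by name pattern

levant <- vect("MERGED_LEVANT.shp")
tair   <- mask(crop(tair, levant), levant)   # clip to the study-area outline

z <- (tair - mean(tair)) / stdev(tair)       # per-cell z-scores across the series
```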
Datasets archived here consist of all data analyzed in Duan et al. 2015 from the Journal of Applied Ecology. Specifically, these data were collected from annual sampling of emerald ash borer (Agrilus planipennis) immature stages and associated parasitoids on infested ash trees (Fraxinus) in southern Michigan, where three introduced biological control agents had been released between 2007 and 2010. Detailed data collection procedures can be found in Duan et al. 2012, 2013, and 2015.

Resources in this dataset:
- Resource Title: Duan J Data on EAB larval density-bird predation and unknown factor from Journal of Applied Ecology. File Name: Duan J Data on EAB larval density-bird predation and unknown factor from Journal of Applied Ecology.xlsx. Description: This data set is used to calculate mean EAB density (per m2 of ash phloem area), bird predation rate, and mortality rate caused by unknown factors, analyzed with JMP (10.2) scripts for mixed-effect linear models in Duan et al. 2015 (Journal of Applied Ecology).
- Resource Title: DUAN J Data on Parasitism L1-L2 Excluded from Journal of Applied Ecology. File Name: DUAN J Data on Parasitism L1-L2 Excluded from Journal of Applied Ecology.xlsx. Description: This data set is used to construct life tables and calculate the net population growth rate of emerald ash borer for each site. The net population growth rates were then analyzed with JMP (10.2) scripts for mixed-effect linear models in Duan et al. 2015 (Journal of Applied Ecology).
- Resource Title: DUAN J Data on EAB Life Tables Calculation from Journal of Applied Ecology. File Name: DUAN J Data on EAB Life Tables Calculation from Journal of Applied Ecology.xlsx. Description: This data set is used to calculate the parasitism rate of EAB larvae for each tree, then analyzed with JMP (10.2) scripts for mixed-effect linear models in Duan et al. 2015 (Journal of Applied Ecology).
- Resource Title: READ ME for Emerald Ash Borer Biocontrol Study from Journal of Applied Ecology. File Name: READ_ME_for_Emerald_Ash_Borer_Biocontrol_Study_from_Journal_of_Applied_Ecology.docx. Description: Additional information and definitions for the variables/content in the three Emerald Ash Borer Biocontrol Study tables listed above.
- Resource Title: Data Dictionary for Emerald Ash Borer Biocontrol Study from Journal of Applied Ecology. File Name: AshBorerAnd Parasitoids_DataDictionary.csv. Description: CSV data dictionary for the variables/content in the three Emerald Ash Borer Biocontrol Study tables listed above. For more information see the related READ ME file.
Open Government Licence 3.0: http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
This dataset contains zonal-mean atmospheric diagnostics computed from reanalysis datasets on pressure levels. Primary variables include temperature, geopotential height, and the three-dimensional wind field. Advanced diagnostics include zonal covariance terms that can be used to compute, for instance, eddy kinetic energy and eddy fluxes. Terms from the primitive zonal-mean momentum equation and the transformed Eulerian momentum equation are also provided.
This dataset was produced to facilitate the comparison of reanalysis datasets for collaborators of the SPARC Reanalysis Intercomparison Project (S-RIP). The dataset is substantially smaller than the full three-dimensional reanalysis fields and uses unified numerical methods. It includes all global reanalyses available at the time of its development and will be extended to new reanalysis products in the future.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction
There is a need to develop harmonized procedures and a Minimum Data Set (MDS) for cross-border Multi-Casualty Incidents (MCI) in medical emergency scenarios to ensure appropriate management of such incidents, regardless of the place, language, and internal processes of the institutions involved. That information should be capable of real-time communication to the command-and-control chain. It is crucial that the models adopted are interoperable between countries so that the rights of patients to cross-border healthcare are fully respected.

Objective
To optimize the management of cross-border Multi-Casualty Incidents through a Minimum Data Set collected and communicated in real time to the chain of command and control for each incident, and to determine the degree of agreement among experts.

Method
We used the modified Delphi method supplemented with the Utstein technique to reach consensus among experts. In the first phase, the minimum requirements of the project, the profile of the experts who were to participate, the basic requirements of each variable chosen, and the way of collecting the data were defined, with bibliography on the subject provided. In the second phase, the preliminary variables were grouped into 6 clusters, and the objectives, the characteristics of the variables, and the logistics of the work were approved. Several meetings were held to reach consensus on the MDS variables using a modified Delphi technique. Each expert scored each variable from 1 to 10. Non-voted variables were eliminated, and the round of voting ended. In the third phase, the Utstein style was applied to discuss each group of variables and choose those with the highest consensus. After several rounds of discussion, it was agreed to eliminate the variables with a score of less than 5 points. In phase four, the researchers submitted the variables to external experts for final assessment and validation before their use in the simulations. Data were analysed with SPSS Statistics (IBM, version 2) software.

Results
Six data entities with 31 sub-entities were defined, generating 127 items representing the final MDS regarded as essential for incident management. The level of consensus for the choice of items was very high and was highest for the category 'Incident', with an overall kappa of 0.7401 (95% CI 0.1265-0.5812, p = 0.000), a good level of consensus in the Landis and Koch model. The items with the greatest degree of consensus, at ten, were those relating to the location, type, date, time, and identification of the incident. All items met the criteria set, such as digital collection and real-time transmission to the chain of command and control.

Conclusions
This study documents the development of an MDS through consensus, with a high degree of agreement among a group of experts of different nationalities working in different fields. All items in the MDS were digitally collected and forwarded in real time to the chain of command and control. This tool has demonstrated its validity in four large cross-border simulations involving more than eight countries and their emergency services.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ERA5 is the fifth generation ECMWF reanalysis for the global climate and weather for the past 8 decades. Data is available from 1940 onwards. ERA5 replaces the ERA-Interim reanalysis.

Reanalysis combines model data with observations from across the world into a globally complete and consistent dataset using the laws of physics. This principle, called data assimilation, is based on the method used by numerical weather prediction centres, where every so many hours (12 hours at ECMWF) a previous forecast is combined with newly available observations in an optimal way to produce a new best estimate of the state of the atmosphere, called an analysis, from which an updated, improved forecast is issued. Reanalysis works in the same way, but at reduced resolution, to allow for the provision of a dataset spanning back several decades. Reanalysis does not have the constraint of issuing timely forecasts, so there is more time to collect observations and, when going further back in time, to allow for the ingestion of improved versions of the original observations, all of which benefit the quality of the reanalysis product.

ERA5 provides hourly estimates for a large number of atmospheric, ocean-wave, and land-surface quantities. An uncertainty estimate is sampled by an underlying 10-member ensemble at three-hourly intervals. Ensemble mean and spread have been pre-computed for convenience. Such uncertainty estimates are closely related to the information content of the available observing system, which has evolved considerably over time. They also indicate flow-dependent sensitive areas. To facilitate many climate applications, monthly-mean averages have been pre-calculated too, though monthly means are not available for the ensemble mean and spread.

ERA5 is updated daily with a latency of about 5 days (monthly means are available around the 6th of each month). If serious flaws are detected in this early release (called ERA5T), the data could differ from the final release 2 to 3 months later; in that case, users are notified.

The dataset presented here is a regridded subset of the full ERA5 dataset on native resolution. It is online on spinning disk, which should ensure fast and easy access. It should satisfy the requirements for most common applications. An overview of all ERA5 datasets can be found in this article. Information on access to ERA5 data on native resolution is provided in these guidelines. Data has been regridded to a regular lat-lon grid of 0.25 degrees for the reanalysis and 0.5 degrees for the uncertainty estimate (0.5 and 1 degree respectively for ocean waves). There are four main subsets: hourly and monthly products, both on pressure levels (upper-air fields) and single levels (atmospheric, ocean-wave, and land-surface quantities). The present entry is "ERA5 monthly mean data on single levels from 1940 to present".
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The National Health and Nutrition Examination Survey (NHANES) provides data with considerable potential to study the health and environmental exposure of the non-institutionalized US population. However, as NHANES data are plagued with multiple inconsistencies, processing these data is required before deriving new insights through large-scale analyses. Thus, we developed a set of curated and unified datasets by merging 614 separate files and harmonizing unrestricted data across NHANES III (1988-1994) and Continuous NHANES (1999-2018), totaling 135,310 participants and 5,078 variables. The variables convey:
- demographics (281 variables),
- dietary consumption (324 variables),
- physiological functions (1,040 variables),
- occupation (61 variables),
- questionnaires (1,444 variables, e.g., physical activity, medical conditions, diabetes, reproductive health, blood pressure and cholesterol, early childhood),
- medications (29 variables),
- mortality information linked from the National Death Index (15 variables),
- survey weights (857 variables),
- environmental exposure biomarker measurements (598 variables), and
- chemical comments indicating which measurements are below or above the lower limit of detection (505 variables).

csv Data Record: The curated NHANES datasets and the data dictionaries include 23 .csv files and 1 Excel file. The curated NHANES datasets involve 20 .csv formatted files, two for each module, with one as the uncleaned version and the other as the cleaned version. The modules are labeled as follows: 1) mortality, 2) dietary, 3) demographics, 4) response, 5) medications, 6) questionnaire, 7) chemicals, 8) occupation, 9) weights, and 10) comments.
- "dictionary_nhanes.csv" is a dictionary that lists the variable name, description, module, category, units, CAS Number, comment use, chemical family, chemical family shortened, number of measurements, and cycles available for all 5,078 variables in NHANES.
- "dictionary_harmonized_categories.csv" contains the harmonized categories for the categorical variables.
- "dictionary_drug_codes.csv" contains the dictionary of descriptors for the drug codes.
- "nhanes_inconsistencies_documentation.xlsx" is an Excel file that contains the cleaning documentation, which records all the inconsistencies for all affected variables to help curate each of the NHANES modules.

R Data Record: For researchers who want to conduct their analysis in the R programming language, the cleaned NHANES modules and the data dictionaries can be downloaded as a .zip file, which includes an .RData file and an .R file.
- "w - nhanes_1988_2018.RData" contains all the aforementioned datasets as R data objects. We make available all R scripts on customized functions that were written to curate the data.
- "m - nhanes_1988_2018.R" shows how we used the customized functions (i.e., our pipeline) to curate the original NHANES data.

Example starter codes: The set of starter code to help users conduct exposome analysis consists of four R Markdown files (.Rmd).
We recommend going through the tutorials in order:
- "example_0 - merge_datasets_together.Rmd" demonstrates how to merge the curated NHANES datasets together.
- "example_1 - account_for_nhanes_design.Rmd" demonstrates how to conduct a linear regression model, a survey-weighted regression model, a Cox proportional hazard model, and a survey-weighted Cox proportional hazard model.
- "example_2 - calculate_summary_statistics.Rmd" demonstrates how to calculate summary statistics for one variable and multiple variables, with and without accounting for the NHANES sampling design.
- "example_3 - run_multiple_regressions.Rmd" demonstrates how to run multiple regression models with and without adjusting for the sampling design.
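As a quick-start illustration (not part of the distributed tutorials), one can load the curated .RData file and fit a survey-weighted estimate with the survey package; the object name demographics and the design columns below are assumptions to be checked against the dictionary:

```r
load("w - nhanes_1988_2018.RData")   # loads the curated modules as R objects

library(survey)
# Standard NHANES design variables; confirm the exact names in dictionary_nhanes.csv.
des <- svydesign(ids = ~SDMVPSU, strata = ~SDMVSTRA,
                 weights = ~WTMEC2YR, nest = TRUE, data = demographics)
svymean(~RIDAGEYR, des, na.rm = TRUE)   # survey-weighted mean age
```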
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Description
This dataset contains a simulated collection of 10,000 patient records designed to explore hypertension management in resource-constrained settings. It provides comprehensive data for analyzing blood pressure control rates, associated risk factors, and complications. The dataset is ideal for predictive modelling, risk analysis, and treatment optimization, offering insights into demographic, clinical, and treatment-related variables.
Dataset Structure
Dataset Volume
• Size: 10,000 records.
• Features: 19 variables, categorized into Sociodemographic, Clinical, Complications, and Treatment/Control groups.
Variables and Categories
A. Sociodemographic Variables
1. Age:
• Continuous variable in years.
• Range: 18–80 years.
• Mean ± SD: 49.37 ± 12.81.
2. Sex:
• Categorical variable.
• Values: Male, Female.
3. Education:
• Categorical variable.
• Values: No Education, Primary, Secondary, Higher Secondary, Graduate, Post-Graduate, Madrasa.
4. Occupation:
• Categorical variable.
• Values: Service, Business, Agriculture, Retired, Unemployed, Housewife.
5. Monthly Income:
• Categorical variable in Bangladeshi Taka.
• Values: <5000, 5001–10000, 10001–15000, >15000.
6. Residence:
• Categorical variable.
• Values: Urban, Sub-urban, Rural.
B. Clinical Variables
7. Systolic BP:
• Continuous variable in mmHg.
• Range: 100–200 mmHg.
• Mean ± SD: 140 ± 15 mmHg.
8. Diastolic BP:
• Continuous variable in mmHg.
• Range: 60–120 mmHg.
• Mean ± SD: 90 ± 10 mmHg.
9. Elevated Creatinine:
• Binary variable (≥ 1.4 mg/dL).
• Values: Yes, No.
10. Diabetes Mellitus:
• Binary variable.
• Values: Yes, No.
11. Family History of CVD:
• Binary variable.
• Values: Yes, No.
12. Elevated Cholesterol:
• Binary variable (≥ 200 mg/dL).
• Values: Yes, No.
13. Smoking:
• Binary variable.
• Values: Yes, No.
C. Complications
14. LVH (Left Ventricular Hypertrophy):
• Binary variable (ECG diagnosis).
• Values: Yes, No.
15. IHD (Ischemic Heart Disease):
• Binary variable.
• Values: Yes, No.
16. CVD (Cerebrovascular Disease):
• Binary variable.
• Values: Yes, No.
17. Retinopathy:
• Binary variable.
• Values: Yes, No.
D. Treatment and Control
18. Treatment:
• Categorical variable indicating therapy type.
• Values: Single Drug, Combination Drugs.
19. Control Status:
• Binary variable.
• Values: Controlled, Uncontrolled.
Dataset Applications
1. Predictive Modeling:
• Develop models to predict blood pressure control status using demographic and clinical data (see the sketch after this list).
2. Risk Analysis:
• Identify significant factors influencing hypertension control and complications.
3. Severity Scoring:
• Quantify hypertension severity for patient risk stratification.
4. Complications Prediction:
• Forecast complications like IHD, LVH, and CVD for early intervention.
5. Treatment Guidance:
• Analyze therapy efficacy to recommend optimal treatment strategies.
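A minimal sketch of application 1 in R (the file and column names are assumptions, since the exact header row is not shown here):

```r
# Logistic regression for control status; hypothetical file and column names.
hr <- read.csv("hypertension.csv")
hr$Controlled <- as.integer(hr$ControlStatus == "Controlled")

fit <- glm(Controlled ~ Age + Sex + SystolicBP + DiastolicBP + Smoking,
           data = hr, family = binomial)
summary(fit)   # which risk factors shift the odds of controlled blood pressure
```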