China is one of the countries hardest hit by disasters. Disaster shocks not only cause large numbers of casualties and extensive property damage but also affect the risk preferences of those who experience them. Current research has not reached a consensus on how disaster experiences affect risk preferences. This paper empirically analyzes the effects of natural and man-made disasters on residents’ risk preferences using data from the 2019 China Household Finance Survey (CHFS). The results indicate that: (1) Both natural and man-made disasters significantly increase residents’ risk aversion, and man-made disasters have the greater impact. (2) Educational background plays a negative moderating role in the impact of man-made disasters on residents’ risk preferences. (3) Natural disaster experiences have a greater impact on the risk preferences of rural residents, while man-made disaster experiences have a greater impact on the risk preferences of urban residents: natural disaster experiences make rural residents more risk-averse, while man-made disaster experiences make urban residents more risk-averse. The results provide new evidence and a new perspective on the negative impact of disaster shocks on residents’ social lives.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Variable definition and descriptive statistics.
The impact of a chief executive officer’s (CEO’s) functional experience on firm performance has gained the attention of many scholars. However, measures of functional experience are rarely disclosed in public databases, and few studies have examined CEOs’ functional experience comprehensively. Drawing on upper echelons theory, this paper obtained deep-level curricula vitae (CV) data through named entity recognition. First, we mined 15 consecutive years (2006–2020) of CEOs’ CVs from Chinese listed companies. Second, we extracted information spanning their entire careers and automatically classified their functional hierarchy. Finally, we constructed breadth (functional breadth: richness of functional experience) and depth (functional depth: average tenure and functional hierarchy) measures for empirical analysis. We found that a CEO’s breadth is significantly negatively related to firm performance, with a significantly positive quadratic term, while a CEO’s depth is significantly positively related to firm performance, with a significantly negative quadratic term. These results indicate a U-shaped relationship between a CEO’s breadth and firm performance and an inverted U-shaped relationship between depth and firm performance. The findings extend the literature on the determinants of firm performance and on CEOs’ functional experience, expanding the analysis from the horizontal macro level to the vertical micro level and providing new evidence to support the recruitment and selection of high-level corporate talent.
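U-shaped and inverted U-shaped relationships of this kind are commonly tested by adding quadratic terms to the performance regression. The R sketch below illustrates that setup only; the data frame and variable names (ceo_panel, firm_perf, breadth, depth, firm_size) are hypothetical placeholders, not the study’s actual specification.

# Minimal sketch of testing curvilinear (U-shaped) relationships with quadratic terms.
# All names below are hypothetical placeholders, not the study's data.
fit <- lm(firm_perf ~ breadth + I(breadth^2) + depth + I(depth^2) + firm_size,
          data = ceo_panel)
summary(fit)
# A U shape is suggested by a negative linear and positive quadratic coefficient
# for breadth; an inverted U by a positive linear and negative quadratic
# coefficient for depth.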
Definitions of independent variables used in the statistical analysis.
Variable definitions, sources, and summary statistics.
This study examines the impact of internet usage on farmers’ adoption of fertilizer reduction and efficiency enhancement technologies in China. Based on 1,295 questionnaires from Henan Province, this study constructs a counterfactual analysis framework and uses an endogenous switching probit model to analyze the effects and pathways of internet usage on farmers’ adoption of chemical fertilizer reduction and efficiency enhancement technologies. The results indicate that: (1) The proportion of farmers adopting chemical fertilizer reduction and efficiency enhancement technologies is 60.15%, while the proportion not adopting these technologies is 39.85%. (2) Internet usage directly influences farmers’ adoption of fertilizer reduction and efficiency enhancement technologies. According to the counterfactual analysis, if farmers who currently use the internet were to stop using it, the probability of their adopting these technologies would decrease by 28.09%; conversely, if farmers who do not currently use the internet were to start using it, the probability of their adopting these technologies would increase by 40.67%. (3) Internet usage indirectly influences farmers’ adoption behavior through the mediating pathways of expected benefits and risk perception. In addition, social networks negatively moderate the impact of internet usage on farmers’ adoption of chemical fertilizer reduction and efficiency enhancement technologies.
This data release contains the input-data files and R scripts associated with the analysis presented in [citation of manuscript]. The spatial extent of the data is the contiguous U.S. The input-data files include one comma-separated value (csv) file of county-level data and one csv file of city-level data. The county-level csv (“county_data.csv”) contains data for 3,109 counties. These data include two measures of water use, descriptive information about each county, three grouping variables (climate region, urban class, and economic dependency), and 18 explanatory variables: proportion of population growth from 2000–2010, fraction of withdrawals from surface water, average daily water yield, mean annual maximum temperature from 1970–2010, 2005–2010 maximum temperature departure from the 40-year maximum, mean annual precipitation from 1970–2010, 2005–2010 mean precipitation departure from the 40-year mean, Gini income disparity index, percent of county population with at least some college education, Cook Partisan Voting Index, housing density, median household income, average number of people per household, median age of structures, percent of renters, percent of single-family homes, percent apartments, and a numeric version of urban class. The city-level csv (“city_data.csv”) contains data for 83 cities. These data include descriptive information for each city, water-use measures, one grouping variable (climate region), and 6 explanatory variables: type of water bill (increasing block rate, decreasing block rate, or uniform), average price of water bill, number of requirement-oriented water conservation policies, number of rebate-oriented water conservation policies, aridity index, and regional price parity. The R scripts construct fixed-effects and Bayesian hierarchical regression models. The primary difference between these models relates to how they handle possible clustering in the observations that define unique water-use settings. Fixed-effects models address possible clustering in one of two ways. In a “fully pooled” fixed-effects model, any clustering by group is ignored, and a single, fixed estimate of the coefficient for each covariate is developed using all of the observations. Conversely, in an unpooled fixed-effects model, separate coefficient estimates are developed using only the observations in each group. A hierarchical model provides a compromise between these two extremes. Hierarchical models extend single-level regression to data with a nested structure, whereby the model parameters vary at different levels in the model, including a lower level that describes the actual data and an upper level that influences the values taken by parameters in the lower level. The county-level models were compared using the Watanabe-Akaike information criterion (WAIC), which is derived from the log pointwise predictive density of the models and can be shown to approximate out-of-sample predictive performance. All script files are intended to be used with R statistical software (R Core Team (2017). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org) and Stan probabilistic modeling software (Stan Development Team. 2017. RStan: the R interface to Stan. R package version 2.16.2. http://mc-stan.org).
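As a rough illustration of the pooling distinction described above, the sketch below fits a fully pooled fixed-effects model and a varying-intercept hierarchical model to the county file. It uses lme4 rather than the Stan models distributed in this release, and the column names (water_use, precip, income, climate_region) are assumed placeholders; only the file name comes from the description above.

# Illustration only: fully pooled vs. hierarchical (partially pooled) fits.
# Column names are assumed; the release itself fits hierarchical models in Stan
# and compares the county-level models with WAIC.
library(lme4)

counties <- read.csv("county_data.csv")

# Fully pooled fixed-effects model: clustering by climate region is ignored and
# one coefficient set is estimated from all counties.
pooled <- lm(water_use ~ precip + income, data = counties)

# Hierarchical model: intercepts vary by climate region and are partially pooled
# toward the overall mean.
hier <- lmer(water_use ~ precip + income + (1 | climate_region), data = counties)

summary(pooled)
summary(hier)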
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
a. Average across raters by sex (ratings for 91 different Tsimane' subjects).
b. All explanatory variables were taken as the average for each subject across the 2002–2006 panel.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset was derived by the Bioregional Assessment Programme. This dataset was derived from BILO Gridded Climate Data data provided by the CSIRO. You can find a link to the parent datasets in the Lineage Field in this metadata statement. The History Field in this metadata statement describes how this dataset was derived.
Summary of various climate variables for all 15 subregions, including:
Time series of mean annual Bureau of Meteorology Australian Water Availability Project (BAWAP) rainfall from 1900 to 2012.
Long-term average BAWAP rainfall and Penman potential evapotranspiration (PET) for each month, January 1981 to December 2012.
Values calculated over the years 1981–2012 (inclusive), for 17 time periods (i.e., annual, 4 seasons and 12 months) for the following 8 meteorological variables: (i) BAWAP_P; (ii) Penman ETp; (iii) Tavg (average temperature); (iv) Tmax (maximum temperature); (v) Tmin (minimum temperature); (vi) VPD (vapour pressure deficit); (vii) Rn (net radiation); and (viii) wind speed. For each of the 17 time periods and each of the 8 meteorological variables, the following were calculated: (a) average; (b) maximum; (c) minimum; (d) average plus standard deviation (stddev); (e) average minus stddev; (f) stddev; and (g) trend.
Correlation coefficients (-1 to 1) between rainfall and 4 remote rainfall drivers between 1957-2006 for the four seasons. The data and methodology are described in Risbey et al. (2009). All data used in this analysis came directly from James Risbey, CSIRO Marine and Atmospheric Research (CMAR), Hobart. As described in the Risbey et al. (2009) paper, the rainfall was from 0.05 degree gridded data described in Jeffrey et al. (2001 - known as the SILO datasets); sea surface temperature was from the Hadley Centre Sea Ice and Sea Surface Temperature dataset (HadISST) on a 1 degree grid. BLK=Blocking; DMI=Dipole Mode Index; SAM=Southern Annular Mode; SOI=Southern Oscillation Index; DJF=December, January, February; MAM=March, April, May; JJA=June, July, August; SON=September, October, November. The analysis is a summary of Fig. 15 of Risbey et al. (2009).
Dataset was generated using various source data:
Annual BAWAP rainfall
Monthly BAWAP rainfall
Monthly Penman PET
Monthly BAWAP rainfall
Monthly Penman PET
Monthly BAWAP Tair
Monthly BAWAP Tmax
Monthly BAWAP Tmin
Monthly VPD
Actual vapour pressure is measured at 9:00 am; the saturated vapour pressure is calculated from Tmax and Tmin (see the formula sketch after this list).
Monthly Rn
Monthly Wind
This dataset was created by the CLW Ecohydrological Time Series Remote Sensing Team. See http://www-data.iwis.csiro.au/ts/climate/wind/mcvicar_etal_grl2008/.
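For reference, a common way to compute VPD from these inputs is given below; this Tetens-style formulation is assumed here for illustration and may differ from the exact BAWAP method.

e_s(T) = 0.6108 * exp(17.27 * T / (T + 237.3))   [kPa]   (saturation vapour pressure, T in degrees Celsius)
VPD    = (e_s(Tmax) + e_s(Tmin)) / 2  -  e_a      [kPa]

where e_a is the actual vapour pressure observed at 9:00 am.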
Bioregional Assessment Programme (2013) Mean climate variables for all subregions. Bioregional Assessment Derived Dataset. Viewed 12 March 2019, http://data.bioregionalassessments.gov.au/dataset/3f568840-0c77-4f74-bbf3-6f82d189a1fc.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Descriptive statistics of the variables in the database.
https://creativecommons.org/publicdomain/zero/1.0/
The objective of this analysis was to understand the predictors that contribute to life expectancy around the world. Linear regression, decision tree, and random forest models were used. Steps involved:
1) Read the csv file.
2) Data cleaning: the variables Country and Status had character data types and were converted to factors. There were 2,563 missing values, with the Population variable having the most (652). Rows with missing values were dropped before running the analysis.
3) Run linear regression: before running the regression, three variables (Country, Year, and Status) were dropped because they were not found to have much effect on the dependent variable, Life Expectancy. This left 19 variables (1 dependent and 18 independent). The initial model had a multiple R-squared of 83%, meaning the independent variables explain 83% of the variance in the dependent variable.
4) Outlier detection: outliers were identified using the IQR method; 54 were found and removed before rerunning the regression, which increased the multiple R-squared from 83% to 86%.
5) Multicollinearity: multicollinearity was checked with the variance inflation factor (VIF), which flags cases where two or more independent variables are highly correlated. The rule of thumb is that variables with absolute VIF values above 5 should be removed. Six variables had VIF values above 5: Infant.deaths, percentage.expenditure, Under.five.deaths, GDP, thinness1.19, and thinness5.9. Infant deaths and under-five deaths were strongly collinear, so infant deaths (the higher VIF value) was dropped. After refitting, the VIF value of Under.five.deaths fell from 211.46 to 2.74, while the other VIF values changed very little. The variable thinness1.19 was then dropped and the regression rerun; the VIF value of thinness5.9 fell from 7.61 to 1.95. GDP and Population still had VIF values above 5 but were retained as important independent variables.
6) Set the seed and split the data into train and test sets. The model fit on the training data had a multiple R-squared of 86% with a p-value below alpha, indicating statistical significance. The training model was then used to predict the test data, and RMSE, MAPE, and MAE were computed with library(Metrics).
7) For linear regression, RMSE (root mean squared error) is 3.2, indicating that on average the predicted values differ from the actual life expectancy values by 3.2 years. MAPE (mean absolute percentage error) is 0.037, indicating a prediction accuracy of roughly 96% (1 − MAPE). MAE (mean absolute error) is 2.55, indicating that on average the predicted values deviate by approximately 2.55 years from the actual values.
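A condensed R sketch of this workflow is shown below. The file name and column names (e.g. Life.expectancy) are assumed for illustration, and the VIF-based variable dropping is abbreviated, so it will not reproduce the exact figures above.

# Condensed sketch of the workflow described above; the file and column names are
# assumed placeholders and the iterative variable dropping is abbreviated.
library(car)      # vif()
library(Metrics)  # rmse(), mape(), mae()

df <- na.omit(read.csv("Life Expectancy Data.csv"))
df <- df[, !(names(df) %in% c("Country", "Year", "Status"))]

fit <- lm(Life.expectancy ~ ., data = df)
vif(fit)                                   # drop predictors with VIF above 5, then refit

set.seed(123)
idx   <- sample(nrow(df), floor(0.7 * nrow(df)))
train <- df[idx, ]
test  <- df[-idx, ]

fit_tr <- lm(Life.expectancy ~ ., data = train)
pred   <- predict(fit_tr, newdata = test)

rmse(test$Life.expectancy, pred)           # average error in years
mape(test$Life.expectancy, pred)           # 1 - MAPE gives approximate accuracy
mae(test$Life.expectancy, pred)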
Conclusion: Random Forest is the best model for predicting the life expectancy values as it has the lowest RMSE, MAPE and MAE.
Tecplot (ASCII) and MATLAB files are posted here for the static pressure coefficient data sets. To download all of the data in either Tecplot or MATLAB format, go to https://c3.nasa.gov/dashlink/resources/485/. Please consult the documentation found on this page under Support/Documentation for information regarding variable definitions, data processing, etc.
Vital Statistics tables that contain aggregate metrics on the natality (births) of Philadelphia residents. Included in these datasets are natality metrics by planning district or citywide. You can find mortality (deaths) metrics and social determinants of health metrics at the city and planning district levels of geography as well. Population metrics are provided at the city, planning district, and census tract levels of geography. Please refer to the metadata links below for variable definitions and this technical notes document to access detailed technical notes about the datasets.
Social determinants of health metrics at the city and planning district levels of geography. Please refer to the metadata links below for variable definitions and this technical notes document to access detailed technical notes and variable definitions. You can find related vital statistics tables that contain aggregate metrics on vital events, including natality (births) metrics and mortality (deaths) by planning district or citywide. Population metrics are provided at the city, planning district, and census tract levels of geography.
Open Government Licence 3.0: http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
This dataset contains zonal-mean atmospheric diagnostics computed from reanalysis datasets on pressure levels. Primary variables include temperature, geopotential height, and the three-dimensional wind field. Advanced diagnostics include zonal covariance terms that can be used to compute, for instance, eddy kinetic energy and eddy fluxes. Terms from the primitive zonal-mean momentum equation and the transformed Eulerian momentum equation are also provided.
This dataset was produced to facilitate the comparison of reanalysis datasets for collaborators of the SPARC Reanalysis Intercomparison Project (S-RIP). The dataset is substantially smaller than the full three-dimensional reanalysis fields and uses unified numerical methods. It includes all global reanalyses available at the time of its development and will be extended to new reanalysis products in the future.
Most previous studies of environmental innovation focus on the impact of environmental innovation on carbon emissions; research rarely examines the internal causes and influence mechanisms of low-carbon innovation. This study focuses on the effect of carbon emissions on low-carbon innovation in firms. Using a panel data set of Chinese A-share firms, this study finds that increases in carbon emissions promote low-carbon innovation. This promoting effect arises because high carbon emissions increase the pressure on firms to reduce emissions and prompt them to increase R&D investment, and the effect is more pronounced in firms with lower equity concentration or in high-tech firms. It is also found that indirect carbon emissions do not promote low-carbon innovation, while other types of carbon emissions do. This study expands research on the internal causes of low-carbon innovation in firms, examines the logic influencing low-carbon innovation from the perspective of emission-reduction motives and methods, reveals that global warming contains opportunities for the development of low-carbon innovation in firms, and provides a reference for optimizing the carbon emissions calculation system.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
a. Transformed to power 0.3.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Description
This dataset contains a simulated collection of 10,000 patient records designed to explore hypertension management in resource-constrained settings. It provides comprehensive data for analyzing blood pressure control rates, associated risk factors, and complications. The dataset is suited to predictive modelling, risk analysis, and treatment optimization, offering insights into demographic, clinical, and treatment-related variables.
Dataset Structure
Dataset Volume
• Size: 10,000 records.
• Features: 19 variables, categorized into Sociodemographic, Clinical, Complications, and Treatment/Control groups.
Variables and Categories
A. Sociodemographic Variables
1. Age:
• Continuous variable in years.
• Range: 18–80 years.
• Mean ± SD: 49.37 ± 12.81.
2. Sex:
• Categorical variable.
• Values: Male, Female.
3. Education:
• Categorical variable.
• Values: No Education, Primary, Secondary, Higher Secondary, Graduate, Post-Graduate, Madrasa.
4. Occupation:
• Categorical variable.
• Values: Service, Business, Agriculture, Retired, Unemployed, Housewife.
5. Monthly Income:
• Categorical variable in Bangladeshi Taka.
• Values: <5000, 5001–10000, 10001–15000, >15000.
6. Residence:
• Categorical variable.
• Values: Urban, Sub-urban, Rural.
B. Clinical Variables
7. Systolic BP:
• Continuous variable in mmHg.
• Range: 100–200 mmHg.
• Mean ± SD: 140 ± 15 mmHg.
8. Diastolic BP:
• Continuous variable in mmHg.
• Range: 60–120 mmHg.
• Mean ± SD: 90 ± 10 mmHg.
9. Elevated Creatinine:
• Binary variable (≥ 1.4 mg/dL).
• Values: Yes, No.
10. Diabetes Mellitus:
• Binary variable.
• Values: Yes, No.
11. Family History of CVD:
• Binary variable.
• Values: Yes, No.
12. Elevated Cholesterol:
• Binary variable (≥ 200 mg/dL).
• Values: Yes, No.
13. Smoking:
• Binary variable.
• Values: Yes, No.
C. Complications
14. LVH (Left Ventricular Hypertrophy):
• Binary variable (ECG diagnosis).
• Values: Yes, No.
15. IHD (Ischemic Heart Disease):
• Binary variable.
• Values: Yes, No.
16. CVD (Cerebrovascular Disease):
• Binary variable.
• Values: Yes, No.
17. Retinopathy:
• Binary variable.
• Values: Yes, No.
D. Treatment and Control
18. Treatment:
• Categorical variable indicating therapy type.
• Values: Single Drug, Combination Drugs.
19. Control Status:
• Binary variable.
• Values: Controlled, Uncontrolled.
Dataset Applications
1. Predictive Modeling:
• Develop models to predict blood pressure control status using demographic and clinical data.
2. Risk Analysis:
• Identify significant factors influencing hypertension control and complications.
3. Severity Scoring:
• Quantify hypertension severity for patient risk stratification.
4. Complications Prediction:
• Forecast complications like IHD, LVH, and CVD for early intervention.
5. Treatment Guidance:
• Analyze therapy efficacy to recommend optimal treatment strategies.
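As a minimal illustration of the first application, the R sketch below fits a logistic regression for control status. The file name and column spellings are assumed, since the exact csv headers are not listed above.

# Minimal sketch for Application 1 (predicting control status). The file name and
# column spellings (ControlStatus, SystolicBP, ...) are assumed placeholders.
htn <- read.csv("hypertension_data.csv")
htn$Controlled <- as.integer(htn$ControlStatus == "Controlled")

fit <- glm(Controlled ~ Age + Sex + SystolicBP + DiastolicBP +
             DiabetesMellitus + Smoking + Treatment,
           data = htn, family = binomial())
summary(fit)

# Predicted probability that each patient's blood pressure is controlled.
htn$p_controlled <- predict(fit, type = "response")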
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Complete dataset of “Film Circulation on the International Film Festival Network and the Impact on Global Film Culture”
A peer-reviewed data paper for this dataset is under review for publication in NECSUS_European Journal of Media Studies, an open access journal aiming to enhance data transparency and reusability, and will be available from https://necsus-ejms.org/ and https://mediarep.org.
Please cite this when using the dataset.
Detailed description of the dataset:
1 Film Dataset: Festival Programs
The Film Dataset consists of a data scheme image file, a codebook, and two dataset tables in csv format.
The codebook (csv file “1_codebook_film-dataset_festival-program”) offers a detailed description of all variables within the Film Dataset. Along with the definition of variables it lists explanations for the units of measurement, data sources, coding and information on missing data.
The csv file “1_film-dataset_festival-program_long” comprises a dataset of all films and the festivals, festival sections, and the year of the festival edition that they were sampled from. The dataset is structured in the long format, i.e. the same film can appear in several rows when it appeared in more than one sample festival. However, films are identifiable via their unique ID.
The csv file “1_film-dataset_festival-program_wide” consists of the dataset listing only unique films (n=9,348). The dataset is in the wide format, i.e. each row corresponds to a unique film, identifiable via its unique ID. For easy analysis, and since the overlap is only six percent, in this dataset the variable sample festival (fest) corresponds to the first sample festival where the film appeared. For instance, if a film was first shown at Berlinale (in February) and then at Frameline (in June of the same year), the sample festival will list “Berlinale”. This file includes information on unique and IMDb IDs, the film title, production year, length, categorization in length, production countries, regional attribution, director names, genre attribution, the festival, festival section and festival edition the film was sampled from, and information whether there is festival run information available through the IMDb data.
2 Survey Dataset
The Survey Dataset consists of a data scheme image file, a codebook and two dataset tables in csv format.
The codebook “2_codebook_survey-dataset” includes coding information for both survey datasets. It lists the definition of the variables or survey questions (corresponding to Samoilova/Loist 2019), units of measurement, data source, variable type, range and coding, and information on missing data.
The csv file “2_survey-dataset_long-festivals_shared-consent” consists of a subset (n=161) of the original survey dataset (n=454), where respondents provided festival run data for films (n=206) and gave consent to share their data for research purposes. This dataset consists of the festival data in a long format, so that each row corresponds to the festival appearance of a film.
The csv file “2_survey-dataset_wide-no-festivals_shared-consent” consists of a subset (n=372) of the original dataset (n=454) of survey responses corresponding to sample films. It includes data only for those films for which respondents provided consent to share their data for research purposes. This dataset is shown in wide format of the survey data, i.e. information for each response corresponding to a film is listed in one row. This includes data on film IDs, film title, survey questions regarding completeness and availability of provided information, information on number of festival screenings, screening fees, budgets, marketing costs, market screenings, and distribution. As the file name suggests, no data on festival screenings is included in the wide format dataset.
3 IMDb & Scripts
The IMDb dataset consists of a data scheme image file, one codebook and eight datasets, all in csv format. It also includes the R scripts that we used for scraping and matching.
The codebook “3_codebook_imdb-dataset” includes information for all IMDb datasets. This includes ID information and their data source, coding and value ranges, and information on missing data.
The csv file “3_imdb-dataset_aka-titles_long” contains film title data in different languages scraped from IMDb in a long format, i.e. each row corresponds to a title in a given language.
The csv file “3_imdb-dataset_awards_long” contains film award data in a long format, i.e. each row corresponds to an award of a given film.
The csv file “3_imdb-dataset_companies_long” contains data on production and distribution companies of films. The dataset is in a long format, so that each row corresponds to a particular company of a particular film.
The csv file “3_imdb-dataset_crew_long” contains data on names and roles of crew members in a long format, i.e. each row corresponds to each crew member. The file also contains binary gender assigned to directors based on their first names using the GenderizeR application.
The csv file “3_imdb-dataset_festival-runs_long” contains festival run data scraped from IMDb in a long format, i.e. each row corresponds to the festival appearance of a given film. The dataset does not include each film screening, but the first screening of a film at a festival within a given year. The data includes festival runs up to 2019.
The csv file “3_imdb-dataset_general-info_wide” contains general information about films such as genre as defined by IMDb, languages in which a film was shown, ratings, and budget. The dataset is in wide format, so that each row corresponds to a unique film.
The csv file “3_imdb-dataset_release-info_long” contains data about non-festival release (e.g., theatrical, digital, tv, dvd/blueray). The dataset is in a long format, so that each row corresponds to a particular release of a particular film.
The csv file “3_imdb-dataset_websites_long” contains data on available websites (official websites, miscellaneous, photos, video clips). The dataset is in a long format, so that each row corresponds to a website of a particular film.
The dataset includes 8 text files containing the scripts for web scraping. They were written using R 3.6.3 for Windows.
The R script “r_1_unite_data” demonstrates the structure of the dataset that we use in the following steps to identify, scrape, and match the film data.
The R script “r_2_scrape_matches” reads in the dataset with the film characteristics described in “r_1_unite_data” and uses various R packages to create a search URL for each film from the core dataset on the IMDb website. The script attempts to match each film from the core dataset to IMDb records by first conducting an advanced search based on the movie title and year, and then, if no matches are found, using an alternative title and a basic search. The script scrapes the title, release year, directors, running time, genre, and IMDb film URL from the first page of the suggested records on the IMDb website. The script then defines a loop that matches (including matching scores) each film in the core dataset with suggested films on the IMDb search page. Matching was done using data on directors, production year (+/- one year), and title, with a fuzzy matching approach using two methods, “cosine” and “osa”: cosine similarity is used to match titles with a high degree of overall similarity, while the OSA algorithm is used to match titles that may have typos or minor variations.
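The core of this fuzzy matching step can be illustrated with the stringdist package; the snippet below is a simplified sketch with made-up titles, not the actual r_2_scrape_matches code.

# Simplified illustration of fuzzy title matching; the titles are made up and this
# is not the actual r_2_scrape_matches script.
library(stringdist)

core_title  <- "The Watermelon Woman"
imdb_titles <- c("The Watermelon Woman", "Watermelon Women", "An Unrelated Film")

# Cosine similarity (on character q-grams) scores overall similarity of titles.
stringsim(core_title, imdb_titles, method = "cosine", q = 2)

# OSA (optimal string alignment) distance tolerates typos and transpositions;
# smaller distances indicate closer matches.
stringdist(core_title, imdb_titles, method = "osa")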
The script “r_3_matching” creates a dataset with the matches for a manual check. Each pair of films (the original film from the core dataset and the suggested match from the IMDb website) was categorized into one of five categories: a) 100% match (perfect match on title, year, and director); b) likely good match; c) maybe match; d) unlikely match; and e) no match. The script also checks for possible duplicates in the dataset and flags them for a manual check.
The script “r_4_scraping_functions” creates a function for scraping the data from the identified matches (based on the scripts described above and manually checked). These functions are used for scraping the data in the next script.
The script “r_5a_extracting_info_sample” uses the functions defined in “r_4_scraping_functions” to scrape the IMDb data for the identified matches. It does so for the first 100 films only, as a check that everything works; scraping the entire dataset took a few hours, so a test with a subsample of 100 films is advisable.
The script “r_5b_extracting_info_all” extracts the data for the entire dataset of the identified matches.
The script “r_5c_extracting_info_skipped” checks the films with missing data (where data was not scraped) and tries to extract the data one more time, to make sure that the errors were not caused by disruptions in the internet connection or other technical issues.
The script “r_check_logs” is used for troubleshooting and tracking the progress of all of the R scripts used. It gives information on the amount of missing values and errors.
4 Festival Library Dataset
The Festival Library Dataset consists of a data scheme image file, one codebook and one dataset, all in csv format.
The codebook (csv file “4_codebook_festival-library_dataset”) offers a detailed description of all variables within the Festival Library Dataset. It lists the definition of variables such as location, festival name, and festival categories.
To facilitate the use of data collected through the high-frequency phone surveys on COVID-19, the Living Standards Measurement Study (LSMS) team has created harmonized datafiles using two household surveys: 1) the country’s latest face-to-face survey, which has become the sample frame for the phone survey, and 2) the country’s high-frequency phone survey on COVID-19.
The LSMS team has extracted and harmonized variables from these surveys, based on the harmonized definitions and ensuring the same variable names. These variables include demography as well as housing, household consumption expenditure, food security, and agriculture. Inevitably, many of the original variables are collected using questions that are asked differently. The harmonized datafiles include the best available variables with harmonized definitions.
Two harmonized datafiles are prepared for each survey. The two datafiles are:
1. HH: This datafile contains household-level variables. The information includes basic household characteristics, housing, water and sanitation, asset ownership, consumption expenditure, consumption quintile, food security, and livestock ownership. It also contains information on agricultural activities such as crop cultivation, use of organic and inorganic fertilizer, hired labor, use of tractors, and crop sales.
2. IND: This datafile contains individual-level variables. It includes basic characteristics of individuals such as age, sex, marital status, disability status, literacy, education and work.
National coverage
The survey covered all de jure households excluding prisons, hospitals, military barracks, and school dormitories.
Sample survey data [ssd]
See “Nigeria - General Household Survey, Panel 2018-2019, Wave 4” and “Nigeria - COVID-19 National Longitudinal Phone Survey 2020” available in the Microdata Library for details.
Computer Assisted Personal Interview [capi]
Nigeria General Household Survey, Panel (GHS-Panel) 2018-2019 and Nigeria COVID-19 National Longitudinal Phone Survey (COVID-19 NLPS) 2020 data were harmonized following the harmonization guidelines (see “Harmonized Datafiles and Variables for High-Frequency Phone Surveys on COVID-19” for more details).
The high-frequency phone survey on COVID-19 has multiple rounds of data collection. When variables are extracted from multiple rounds of the survey, the originating round is noted with “_rX” in the variable name, where X represents the round number. For example, a variable with the suffix “_r3” was extracted from Round 3 of the high-frequency phone survey. Round 0 refers to the country’s latest face-to-face survey, which has become the sample frame for the high-frequency phone surveys on COVID-19. Variables without an “_rX” suffix were extracted from Round 0.
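In practice, the round suffix makes it easy to pull out variables from a specific round; a minimal R sketch (file name assumed, not part of this release’s documentation) is shown below.

# Minimal sketch of working with the "_rX" suffix convention; the file name is
# an assumed placeholder.
library(dplyr)

hh <- read.csv("harmonized_hh.csv")

# Variables collected in Round 3 of the phone survey.
round3 <- select(hh, ends_with("_r3"))

# Variables from Round 0 (no round suffix), i.e. the face-to-face baseline survey.
baseline <- hh[, !grepl("_r[0-9]+$", names(hh))]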
See “Nigeria - General Household Survey, Panel 2018-2019, Wave 4” and “Nigeria - COVID-19 National Longitudinal Phone Survey 2020” available in the Microdata Library for details.