Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The National Health and Nutrition Examination Survey (NHANES) provides data with considerable potential for studying the health and environmental exposures of the non-institutionalized US population. However, NHANES data contain multiple inconsistencies, so the data must be processed before new insights can be derived through large-scale analyses. We therefore developed a set of curated and unified datasets by merging 614 separate files and harmonizing unrestricted data across NHANES III (1988-1994) and Continuous NHANES (1999-2018), totaling 135,310 participants and 5,078 variables. The variables convey demographics (281 variables), dietary consumption (324 variables), physiological functions (1,040 variables), occupation (61 variables), questionnaires (1,444 variables, e.g., physical activity, medical conditions, diabetes, reproductive health, blood pressure and cholesterol, early childhood), medications (29 variables), mortality information linked from the National Death Index (15 variables), survey weights (857 variables), environmental exposure biomarker measurements (598 variables), and chemical comments indicating which measurements are below or above the lower limit of detection (505 variables).
CSV Data Record: The curated NHANES datasets and the data dictionaries include 23 .csv files and 1 Excel file. The curated NHANES datasets consist of 20 .csv files, two for each module: one uncleaned version and one cleaned version.
The modules are labeled as follows: 1) mortality, 2) dietary, 3) demographics, 4) response, 5) medications, 6) questionnaire, 7) chemicals, 8) occupation, 9) weights, and 10) comments. "dictionary_nhanes.csv" is a dictionary that lists the variable name, description, module, category, units, CAS number, comment use, chemical family, chemical family shortened, number of measurements, and cycles available for all 5,078 variables in NHANES. "dictionary_harmonized_categories.csv" contains the harmonized categories for the categorical variables. "dictionary_drug_codes.csv" contains the dictionary of descriptors for the drug codes. "nhanes_inconsistencies_documentation.xlsx" is an Excel file that contains the cleaning documentation, which records all the inconsistencies for all affected variables to help curate each of the NHANES modules.
R Data Record: For researchers who want to conduct their analysis in the R programming language, the cleaned NHANES modules and the data dictionaries can be downloaded as a .zip file, which includes an .RData file and an .R file. "w - nhanes_1988_2018.RData" contains all the aforementioned datasets as R data objects. We make available all R scripts for the customized functions that were written to curate the data. "m - nhanes_1988_2018.R" shows how we used the customized functions (i.e., our pipeline) to curate the original NHANES data.
Example starter code: The set of starter code to help users conduct exposome analyses consists of four R Markdown files (.Rmd).
We recommend going through the tutorials in order. "example_0 - merge_datasets_together.Rmd" demonstrates how to merge the curated NHANES datasets together. "example_1 - account_for_nhanes_design.Rmd" demonstrates how to conduct a linear regression model, a survey-weighted regression model, a Cox proportional hazards model, and a survey-weighted Cox proportional hazards model. "example_2 - calculate_summary_statistics.Rmd" demonstrates how to calculate summary statistics for one variable and for multiple variables, with and without accounting for the NHANES sampling design. "example_3 - run_multiple_regressions.Rmd" demonstrates how to run multiple regression models with and without adjusting for the sampling design.
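The starter tutorials are written in R, but the merge logic they demonstrate is straightforward to sketch in any language. Below is a minimal, hypothetical illustration in Python/pandas, assuming each curated module shares NHANES's participant identifier SEQN as the merge key; the toy values are invented, not taken from the dataset:

```python
import pandas as pd

# Hypothetical stand-ins for two cleaned module files; the curated release
# stores one cleaned CSV per module. NHANES identifies each participant by
# the SEQN variable, which serves as the merge key across modules.
demographics = pd.DataFrame({"SEQN": [1, 2, 3], "RIDAGEYR": [34, 51, 27]})
chemicals = pd.DataFrame({"SEQN": [1, 3], "LBXBPB": [1.2, 0.8]})

# A left join keeps every participant from the demographics module and
# fills missing chemical measurements with NaN, mirroring the approach
# shown in the example_0 merge tutorial.
merged = demographics.merge(chemicals, on="SEQN", how="left")
print(merged.shape)  # (3, 3)
```

In the real data, repeating this left join module by module reproduces the combined participant-level table the tutorials build.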
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset comprises a comprehensive set of files designed for the analysis and 2D correlation of spectral data, specifically focusing on ATR and NIR spectra. It includes MATLAB scripts and supporting functions necessary to replicate the analysis, as well as the raw datasets used in the study. Below is a detailed description of the included files:
Data Analysis: Data_Analysis.mlx
2D Correlation Data Analysis: Data_Analysis_2Dcorr.mlx
Functions: Functions folder
Datasets: ATR_dataset.xlsx, NIR_dataset.xlsx, Reference_data.csv

To replicate the analysis, run the Data_Analysis.mlx and Data_Analysis_2Dcorr.mlx scripts in MATLAB, ensuring that the Functions folder is in the MATLAB path.

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Firm-level data from 2009 to 2018 for 34 large gold mines in developing countries. The data are used to compute the deterministic, dynamic environmental and technical efficiencies of large gold mines in developing countries.
Steps to reproduce:
1. Run the R command to generate dynamic technical and dynamic environmental inefficiencies for every two subsequent periods (i.e., period t and t+1).
2. Combine the result files of inefficiencies per period generated in R into a panel (see the Excel files in the results folder).
3. Import the Excel files into Stata and generate the final results indicated in the paper.
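Step 2 above can be sketched as follows. This is an illustrative Python/pandas equivalent of the Excel combining step, with invented mine identifiers and inefficiency values; the actual workflow uses R output files, Excel, and Stata:

```python
import pandas as pd

# Hypothetical per-period inefficiency results, one frame per (t, t+1)
# window, standing in for the files produced by the R step. The column
# names are illustrative, not the paper's actual variable names.
period_results = {
    (2009, 2010): pd.DataFrame({"mine_id": [1, 2], "tech_ineff": [0.12, 0.30]}),
    (2010, 2011): pd.DataFrame({"mine_id": [1, 2], "tech_ineff": [0.10, 0.28]}),
}

# Stack the per-period results into a long panel (mine x period), the
# same shape the combined Excel files take before the Stata step.
frames = []
for (t, t1), df in period_results.items():
    frames.append(df.assign(period_start=t, period_end=t1))
panel = pd.concat(frames, ignore_index=True)
print(len(panel))  # 4
```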
Analyzing sales data is essential for any business looking to make informed decisions and optimize its operations. In this project, we will utilize Microsoft Excel and Power Query to conduct a comprehensive analysis of Superstore sales data. Our primary objectives will be to establish meaningful connections between various data sheets, ensure data quality, and calculate critical metrics such as the Cost of Goods Sold (COGS) and discount values. Below are the key steps and elements of this analysis:
1- Data Import and Transformation:
2- Data Quality Assessment:
3- Calculating COGS:
4- Discount Analysis:
5- Sales Metrics:
6- Visualization:
7- Report Generation:
Throughout this analysis, the goal is to provide a clear and comprehensive understanding of the Superstore's sales performance. By using Excel and Power Query, we can efficiently manage and analyze the data, ensuring that the insights gained contribute to the store's growth and success.
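As a rough sketch of steps 3 and 4, COGS and discount value are simple row-wise products. The Python/pandas fragment below uses assumed column names (quantity, unit_cost, unit_price, discount_rate) and invented values, not the actual Superstore schema, so it illustrates the arithmetic rather than the Power Query implementation:

```python
import pandas as pd

# Illustrative order lines; all column names and values are assumptions.
# COGS = quantity x unit cost; the discount value is the revenue given
# up at each line's discount rate.
orders = pd.DataFrame({
    "quantity": [3, 5],
    "unit_cost": [4.0, 2.5],
    "unit_price": [10.0, 6.0],
    "discount_rate": [0.10, 0.00],
})
orders["cogs"] = orders["quantity"] * orders["unit_cost"]
orders["gross_sales"] = orders["quantity"] * orders["unit_price"]
orders["discount_value"] = orders["gross_sales"] * orders["discount_rate"]
orders["net_sales"] = orders["gross_sales"] - orders["discount_value"]
print(orders["cogs"].sum())  # 24.5
```

In Power Query the same calculations would be added as custom columns with the equivalent formulas.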
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset was generated and used in the publication "Resource Optimization with MPI Process Malleability for Dynamic Workloads in HPC Clusters." The dataset is organized into three stages: raw data, preprocessed data, and processed data.
Each workload execution includes log files, application code, executables, and launching scripts. Execution times are extracted from log files, specifically from "slurm-dmr_*.out" and "slurm-dmr_*.info", where "*" represents a number corresponding to a specific job execution.
raw_data: Contains the output files from executing workloads on the MareNostrum V HPC cluster. This section is divided into three subsections, each corresponding to a different workload type.
preprocessed_data: This section contains the collected raw data in .pkl files, following the same structure as the raw_data folder. For each workload execution, four .pkl files are generated. The variable name can take values from [baseline, merge, static], while X represents a workload execution number. When X = J, the file contains a compilation of all workloads with the same configuration. When A appears before X, it refers to an asynchronous execution.

The four types of .pkl files are:
1) nameX_data, described in nameX_data_description.txt.
2) nameX_data_resize, described in nameX_data_resize_description.txt.
3) nameAX_iter_data, generated only for asynchronous (A) workloads; a description is available in nameAX_iter_data_description.txt.
4) nameX_workload, described in nameX_workload_description.txt.
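The file-naming rule described above can be captured by a small parser. The Python sketch below encodes that rule, with the caveat that the exact base names of the .pkl files are inferred from the description-file names and may differ in the dataset:

```python
import re

# Sketch of the naming rule: <name><A?><X>_<kind>.pkl, where name is one
# of baseline/merge/static, an optional "A" marks an asynchronous
# execution, and X is a workload number ("J" = compilation of all
# workloads with the same configuration).
PATTERN = re.compile(
    r"^(?P<name>baseline|merge|static)"
    r"(?P<async_>A?)"
    r"(?P<x>\d+|J)"
    r"_(?P<kind>.+)\.pkl$"
)

def parse_pkl_name(fname):
    """Return the fields encoded in a .pkl file name, or None if it
    does not follow the convention."""
    m = PATTERN.match(fname)
    if m is None:
        return None
    return {
        "name": m.group("name"),
        "asynchronous": m.group("async_") == "A",
        "workload": m.group("x"),   # "J" means all workloads combined
        "kind": m.group("kind"),
    }

print(parse_pkl_name("mergeA3_iter_data.pkl"))
```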
processed_data: Includes the analyzed results from the preprocessed_data folder. This section contains .xlsx files and images used in the Experimental Setup section of the paper. The Excel files are categorized as follows:
This folder contains the scripts used to convert raw_data into preprocessed_data, along with a Jupyter Notebook used for data analysis and visualization. To understand or use these codes, please contact the dataset creators.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The shared archives combined in the Supplementary Datasets represent the actual databases used in the investigation considered in two papers:
Meteorological conditions affecting black vulture (Coragyps atratus) soaring behavior in the southeast of Brazil: Implications for bird strike abatement (in submission)
Remote sensing applications for abating the aircraft-bird strike risks in the southeast of Brazil (Human-Wildlife Interactions Journal, in print)
The papers were based on my Master's thesis, defended in 2016 at the Institute of Biology of the University of Campinas (UNICAMP) in partial fulfilment of the requirements for the degree of Master in Ecology. Our investigation was devoted to reducing the risk of aircraft collisions with black vultures. It had two parts, considered in these two papers. In the first, we studied the relationship between the soaring activity of black vultures and meteorological characteristics. In the second, we explored the dependence of the soaring activity of vultures on superficial and anthropogenic characteristics. The study was implemented within the surroundings of two airports in the southeast of Brazil, taken as case studies. We developed methodological approaches combining GIS and remote sensing technologies for data processing, which were used as the main research instrument. Using them, we joined into georeferenced databases (shapefiles) the bird observation data and three types of environmental factors: (i) meteorological characteristics collected together with the bird observations; (ii) superficial parameters (relief and surface temperature) obtained from ASTER imagery products; (iii) parameters of surface cover and anthropogenic pressure obtained from high-resolution satellite images. Based on the analyses of the georeferenced databases, the relationship between the soaring activity of vultures and environmental factors was studied; the behavioral patterns of vultures in soaring flight were revealed; the landscape types highly attractive for this species, over which increased concentrations of birds form, were detected; maps giving a numerical estimation of the hazard of bird strike events over the airport vicinities were constructed; and practical recommendations for decreasing the risk of collisions with vultures and other bird species were formulated.
This archive contains all materials elaborated and used for the study, including the GIS database for the two papers, remote sensing data, and Microsoft Excel datasets. You can find the description of the supplementary files in Description of Supplementary Dataset.docx. The links to the supplementary files and their attribution to the text of the papers are covered in Attribution to the text of papers.docx. The supplementary files are in the folders Datasets, GIS_others, GIS_Raster, and GIS_Shape.
For any questions, please write to me at this email: natalieenov@gmail.com
Natalia Novoselova
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The climate-disease relationship is complex, where multiple driver-pressure factors and interactions combine with climate change in determining the occurrence geographies and timings of various infectious diseases. However, studies often focus on just some selected factor(s) and interaction(s). Such focus choices may limit and bias our understanding and predictive capability of disease sensitivity to climate and other driver-pressure changes. To assess these research choices and identify possible remaining key gaps and biases, a scoping review is applied to climate-disease related publications for Lyme disease and cryptosporidiosis.
This dataset includes three Excel files, for Lyme disease ("literature_search_LD_raw.xls"), cryptosporidiosis ("literature_search_CY_raw.xls"), and categories of driver-pressure factors ("Table_1.xlsx") respectively, covering all the relevant information for further analysis. The first two Excel files contain three sheets named "Search", "Filter", and "Quantitative". The contents of each sheet are explained as follows.
1. “Search”
Sheet “Search” lists all the publications found from the literature searches, including publication year, title, DOI, and authors. The searches were performed in Web of Science™ (WoS) and considered publications from 1 January 2000 to 10 February 2022. Search terms were (('borreliosis' OR 'Lyme disease') AND ('climate' OR 'climate change' OR 'climate variability')) for Lyme disease, and ((‘cryptosporidiosis’ OR ‘cryptosporidium’ OR ‘crypto.’) AND (‘climate’ OR ‘climate change’ OR ‘climate variability’)) for cryptosporidiosis. The search yielded 555 publication results for Lyme disease and 185 for cryptosporidiosis.
2. “Filter”
Sheet “Filter” lists the inclusions and exclusions for the searched publications, and the categories to which each included publication belongs. Excluded articles (marked as blank in column “Included”) are ones that: (i) do not consider both climate and disease; (ii) are for Lyme disease but not about the Ixodes transmission of the Borrelia pathogen; (iii) are not written in English; or (iv) are not full-text open access. Included articles (marked as “x” in column “Included”) are further classified into the following categories based on their focus: for Lyme disease, categories of publications include reviews (mentioning), reviews (specifically discussing), public awareness, mitigation, survey (implications), investigations, projections, and others; for cryptosporidiosis, categories include reviews (mentioning), reviews (specifically discussing), survey (implications), investigations, projections, and others. Articles belonging to any category are marked with “x”.
3. “Quantitative”
Sheet “Quantitative” lists further information extracted from quantitative studies, which are articles under the categories of investigations and projections in sheet “Filter”. The further extracted information comes from the methods section of each study, or from the full text where necessary. The information includes transmission components, study region (if applicable), category of investigations or projections, included driver-pressure factors, and methods. Transmission components are reproduction host, transmission host, vector, and human for Lyme disease; and animal reservoir, environmental reservoir, and human for cryptosporidiosis. Study region considers specific countries as the smallest scale for spatial resolution, so that study sites smaller than a whole country were counted as studies of the associated countries. Included driver-pressure factors are categorized as shown in Table 1 in the file "Table_1.xlsx", including main categories and their covered variables. Methods include laboratory/field experimentation/observation, statistical analysis, mechanistic modelling, and synthesis/meta-analysis.
This reference contains tabular datasets resulting from the eDNA pilot study on National Wildlife Refuges. The ZIP file contains all datasets as received from the authors: a folder for each participating refuge containing two Excel workbooks, one for the MiFish marker results and one for the COI marker results. Each workbook has several sheets, including one for the raw compiled data, one for each site, and one for filtered combined data. A CSV of filtered data for all participating refuges combined was compiled by extracting the filtered datasheet for each refuge from the Excel workbooks and combining them into a CSV using an R script. A CSV of the total OTUs, OTU species, unique families, and number of fish, mammal, amphibian, mollusk, and bird species for each participating refuge was compiled by Rachel Maxey (I&M Data Manager) by extracting the data from the refuge workbooks and combining them manually into a CSV. Also included are a CSV of the full site data download from Survey123, and data dictionaries and metadata for the site information and eDNA results tables.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Welcome to New York City, one of the most-visited cities in the world. There are many Airbnb listings in New York City to meet the high demand for temporary lodging for travelers, which can last anywhere from a few nights to many months. In this project, we will take a closer look at the New York Airbnb market by combining data from multiple file types: .csv, .tsv, and .xlsx.
Recall that CSV, TSV, and Excel files are three common formats for storing data. Three files containing data on 2019 Airbnb listings are available to you:
data/airbnb_price.csv This is a CSV file containing data on Airbnb listing prices and locations.
listing_id: unique identifier of listing
price: nightly listing price in USD
nbhood_full: name of borough and neighborhood where listing is located

data/airbnb_room_type.xlsx This is an Excel file containing data on Airbnb listing descriptions and room types.

listing_id: unique identifier of listing
description: listing description
room_type: Airbnb has three types of rooms: shared rooms, private rooms, and entire homes/apartments

data/airbnb_last_review.tsv This is a TSV file containing data on Airbnb host names and review dates.

listing_id: unique identifier of listing
host_name: name of listing host
last_review: date when the listing was last reviewed
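A minimal Python/pandas sketch of loading and joining the three formats follows. It writes tiny stand-in files to a temporary directory so the example is self-contained; in the project itself you would read the data/ paths directly. Note that read_excel relies on an engine such as openpyxl for .xlsx files:

```python
import os
import tempfile
import pandas as pd

# Tiny invented stand-ins for the three project files.
tmp = tempfile.mkdtemp()
pd.DataFrame({"listing_id": [1, 2], "price": ["225 dollars", "89 dollars"]}) \
    .to_csv(os.path.join(tmp, "airbnb_price.csv"), index=False)
pd.DataFrame({"listing_id": [1, 2], "room_type": ["Entire home/apt", "Private room"]}) \
    .to_excel(os.path.join(tmp, "airbnb_room_type.xlsx"), index=False)
pd.DataFrame({"listing_id": [1, 2], "last_review": ["2019-05-21", "2019-07-05"]}) \
    .to_csv(os.path.join(tmp, "airbnb_last_review.tsv"), sep="\t", index=False)

# read_csv covers both CSV and TSV (via sep="\t"); read_excel handles .xlsx.
prices = pd.read_csv(os.path.join(tmp, "airbnb_price.csv"))
rooms = pd.read_excel(os.path.join(tmp, "airbnb_room_type.xlsx"))
reviews = pd.read_csv(os.path.join(tmp, "airbnb_last_review.tsv"), sep="\t")

# listing_id is the shared key, so the three tables chain-merge into one.
listings = prices.merge(rooms, on="listing_id").merge(reviews, on="listing_id")
print(listings.shape)
```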
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The SMARTDEST DATASET WP3 v1.0 includes data at sub-city level for 7 cities: Amsterdam, Barcelona, Edinburgh, Lisbon, Ljubljana, Turin, and Venice. It is made up of information extracted from public sources at the local level (mostly city council open data portals) or volunteered geographic information, that is, geospatial content generated by non-professionals using mapping systems available on the Internet (e.g., Geofabrik). Details on data sources and variables are included in a ‘metadata’ spreadsheet in the Excel file. The same Excel file contains 5 additional spreadsheets. The first one, labelled #1, was used to perform the analysis on the determinants of the geographical spread of tourism supply in the SMARTDEST case study cities (in the main document D3.3, section 4.1). The second one (labelled #2) offers information that would allow replication of the analysis on tourism-led population decline reported in section 4.3. As for the spreadsheets named #3-AMS, #4-BCN, and #5-EDI, they refer to data sources and variables used to run the follow-up analyses discussed in section 5.1, with the objective of digging into the causes of depopulation in Amsterdam, Barcelona, and Edinburgh, respectively. The column ‘row’ can be used to merge the Excel file with the shapefile ‘db_task3.3_SmartDest’. Data are available at the buurt level in Amsterdam (an administrative unit roughly corresponding to a neighbourhood), at the census tract level in Barcelona and Ljubljana, for data zones in Edinburgh, statistical zones in Turin, and località in Venice.
State estimates for these years are no longer available due to methodological concerns with combining 2019 and 2020 data. We apologize for any inconvenience or confusion this may cause. Because of the COVID-19 pandemic, most respondents answered the survey via the web in Quarter 4 of 2020, even though all responses in Quarter 1 were from in-person interviews. It is known that people may respond to the survey differently while taking it online, thus introducing what is called a mode effect. When the state estimates were released, it was assumed that the mode effect was similar for different groups of people. However, later analyses have shown that this assumption should not be made. Because of these analyses, along with concerns about the rapid societal changes in 2020, it was determined that averages across the two years could be misleading. For more detail on this decision, see the 2019-2020 state data page.
The documentation covers Enterprise Survey panel datasets that were collected in Slovenia in 2009, 2013 and 2019.
The Slovenia ES 2009 was conducted between 2008 and 2009. The Slovenia ES 2013 was conducted between March 2013 and September 2013. Finally, the Slovenia ES 2019 was conducted between December 2018 and November 2019. The objective of the Enterprise Survey is to gain an understanding of what firms experience in the private sector.
As part of its strategic goal of building a climate for investment, job creation, and sustainable growth, the World Bank has promoted improving the business environment as a key strategy for development, which has led to a systematic effort in collecting enterprise data across countries. The Enterprise Surveys (ES) are an ongoing World Bank project in collecting both objective data based on firms' experiences and enterprises' perception of the environment in which they operate.
National
The primary sampling unit of the study is the establishment. An establishment is a physical location where business is carried out and where industrial operations take place or services are provided. A firm may be composed of one or more establishments. For example, a brewery may have several bottling plants and several establishments for distribution. For the purposes of this survey an establishment must take its own financial decisions and have its own financial statements separate from those of the firm. An establishment must also have its own management and control over its payroll.
As it is standard for the ES, the Slovenia ES was based on the following size stratification: small (5 to 19 employees), medium (20 to 99 employees), and large (100 or more employees).
Sample survey data [ssd]
The samples for the Slovenia ES 2009, 2013, and 2019 were selected using stratified random sampling, following the methodology explained in the Sampling Manual for the Slovenia 2009 ES and the Slovenia 2013 ES, and in the Sampling Note for the 2019 Slovenia ES.
Three levels of stratification were used in this country: industry, establishment size, and oblast (region). The original sample designs with specific information on the industries and regions chosen are included in the attached Excel file (Sampling Report.xls) for the Slovenia 2009 ES. For the Slovenia 2013 and 2019 ES, specific information on the industries and regions chosen is described in "The Slovenia 2013 Enterprise Surveys Data Set" and "The Slovenia 2019 Enterprise Surveys Data Set" reports respectively, Appendix E.
For the Slovenia 2009 ES, industry stratification was designed as follows: the universe was stratified into manufacturing industries, services industries, and one residual (core) sector as defined in the sampling manual. Each industry had a target of 90 interviews. For the manufacturing industries, sample sizes were inflated by about 17% to account for potential non-response when requesting sensitive financial data and because of likely attrition in future surveys that would affect the construction of a panel. For the other industries (residuals), sample sizes were inflated by about 12% to account for under-sampling of firms in service industries.
For Slovenia 2013 ES, industry stratification was designed in the way that follows: the universe was stratified into one manufacturing industry, and two service industries (retail, and other services).
Finally, for Slovenia 2019 ES, three levels of stratification were used in this country: industry, establishment size, and region. The original sample design with specific information of the industries and regions chosen is described in "The Slovenia 2019 Enterprise Surveys Data Set" report, Appendix C. Industry stratification was done as follows: Manufacturing – combining all the relevant activities (ISIC Rev. 4.0 codes 10-33), Retail (ISIC 47), and Other Services (ISIC 41-43, 45, 46, 49-53, 55, 56, 58, 61, 62, 79, 95).
For Slovenia 2009 and 2013 ES, size stratification was defined following the standardized definition for the rollout: small (5 to 19 employees), medium (20 to 99 employees), and large (more than 99 employees). For stratification purposes, the number of employees was defined on the basis of reported permanent full-time workers. This seems to be an appropriate definition of the labor force since seasonal/casual/part-time employment is not a common practice, except in the sectors of construction and agriculture.
For Slovenia 2009 ES, regional stratification was defined in 2 regions. These regions are Vzhodna Slovenija and Zahodna Slovenija. The Slovenia sample contains panel data. The wave 1 panel “Investment Climate Private Enterprise Survey implemented in Slovenia” consisted of 223 establishments interviewed in 2005. A total of 57 establishments have been re-interviewed in the 2008 Business Environment and Enterprise Performance Survey.
For Slovenia 2013 ES, regional stratification was defined in 2 regions (city and the surrounding business area) throughout Slovenia.
Finally, for Slovenia 2019 ES, regional stratification was done across two regions: Eastern Slovenia (NUTS code SI03) and Western Slovenia (SI04).
Computer Assisted Personal Interview [capi]
Questionnaires comprise common questions (the core module) plus additional manufacturing- and services-specific questions. The eligible manufacturing industries were surveyed using the Manufacturing questionnaire (which includes the core module plus manufacturing-specific questions). Retail firms were interviewed using the Services questionnaire (which includes the core module plus retail-specific questions), and the residual eligible services were covered using the Services questionnaire (core module only). Each variation of the questionnaire is identified by the index variable, a0.
Survey non-response must be differentiated from item non-response. The former refers to refusals to participate in the survey altogether whereas the latter refers to the refusals to answer some specific questions. Enterprise Surveys suffer from both problems and different strategies were used to address these issues.
Item non-response was addressed by two strategies: a- For sensitive questions that may generate negative reactions from the respondent, such as corruption or tax evasion, enumerators were instructed to collect the refusal to respond as (-8). b- Establishments with incomplete information were re-contacted in order to complete this information, whenever necessary. However, there were clear cases of low response.
For 2009 and 2013 Slovenia ES, the survey non-response was addressed by maximizing efforts to contact establishments that were initially selected for interview. Up to 4 attempts were made to contact the establishment for interview at different times/days of the week before a replacement establishment (with similar strata characteristics) was suggested for interview. Survey non-response did occur but substitutions were made in order to potentially achieve strata-specific goals. Further research is needed on survey non-response in the Enterprise Surveys regarding potential introduction of bias.
For 2009, the number of contacted establishments per realized interview was 6.18. This number is the result of two factors: explicit refusals to participate in the survey, as reflected by the rate of rejection (which includes rejections of the screener and the main survey), and the quality of the sample frame, as represented by the presence of ineligible units. The relatively low ratio of contacted establishments per realized interview (6.18) suggests that the main source of error in estimates for Slovenia may be selection bias rather than frame inaccuracy.
For 2013, the rate of realized interviews per contacted establishment was 25%. This number is the result of two factors: explicit refusals to participate in the survey, as reflected by the rate of rejection (which includes rejections of the screener and the main survey), and the quality of the sample frame, as represented by the presence of ineligible units. The rate of rejections per contact was 44%.
Finally, for 2019, the rate of interviews per contacted establishment was 9.7%. This number is the result of two factors: explicit refusals to participate in the survey, as reflected by the rate of rejection (which includes rejections of the screener and the main survey), and the quality of the sample frame, as represented by the presence of ineligible units. The share of rejections per contact was 75.2%.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The content of the dataset is twofold: I. It contains the results of two case studies made with the ENCKEP Simulation tool, developed within the ENCKEP COST Action: European Network for Collaboration on Kidney Exchange Programmes (https://www.enckep-cost.eu/). II. It provides the source code of the Evaluation module, developed in R for evaluation of the results based on various metrics. The details of the case studies' creation, the file formats, and the usage of the Evaluation module are provided in the Handbook of Working Groups 3 and 4 of the ENCKEP COST Action, "International Kidney Exchange Programmes in Europe: Practice, Solution Models, Simulation and Evaluation Tools", available at https://www.enckep-cost.eu/.

Case studies. Each case study is organised in a zipped folder.

Case study #1: Case_study_French.zip. The folder contains the results of the computational experiment on data created by modifying the real data of the French Kidney Exchange Programme (KEP).
The folder contains the following files:
input_arcs_FR.csv - compatibility graph
input_config_FR.json - distribution parameters for generation of missing data
input_failed_arcs_FR.csv - list of arcs that fail after matching
input_failed_pairs_FR.csv - list of pairs that fail after matching
input_hla_FR.csv - HLA data of donors and patients
input_objective_FR.json - list of criteria for optimisation and the parameters of their usage, including the approach to multi-objective optimisation (lexicographic and/or weighted optimisation)
Input_paris.csv - characteristics of pairs
Input_policy.csv - policy file with the settings for simulations
output_file_FR.xlsx - Excel file that merges all the output files of the simulator, each file in a separate sheet

Case study #2: Case_study_international.zip. The folder contains the results of computational experiments for several countries where only distribution parameters were provided for each country.
List of files (see the descriptions of the contents of each file above):
Input_arcs.csv
input_config_FR.json - distribution parameters for generation of data for country FR
input_config_NL.json - distribution parameters for generation of data for country NL
input_config_UK.json - distribution parameters for generation of data for country UK
Input_failed_arcs.csv
Input_faield_paris.csv
Input_objective.json
Input_pairs.csv

The three subfolders contain the results for three different policy runs; each subfolder has:
input_NameOfPolicy_policy.json - policy file for the corresponding policy
Ouput_files_NameOfPolicy.xlsx - output files

The "no collaboration" folder contains the results for the policy where there is no collaboration between the countries (NameOfPolicy = individual); "Consecutive_collaboration" is the collaboration where countries first run internal matching rounds, and only the remaining pairs participate in the international pool (NameOfPolicy = consecutive); "Borderless_collaboration" contains the results when countries completely merge their pools (NameOfPolicy = borderless).

II. R Evaluation module. ENCKEP_evaluator.R is the code of the Evaluation module in R. The details of usage of the module are provided in the above-mentioned Handbook. ENCKEP_evaluator_report.pdf is an example of the report generated by the module for Case study #1.
This dataset comprises 10 Excel files of fishing data and relevant locations. File 1 was used to produce figure 1 using QGIS. We used R scripts to select and combine data from files 2 and 3 to produce figures 2, 3, 4, 5, and 6. We crossed file 4 with files 5, 6, 7, 8, 9, and 10 separately to add the georeferences of the fishing spots to the catch data, and then we used QGIS to plot the data and produce figures S1 A (file 5), S1 B (file 6), S2 A (file 7), S2 B (file 8), S2 C (file 9), and S2 D (file 10).
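The "crossing" step above is a table join: catch records gain coordinates by matching on the fishing-spot identifier. A minimal pandas sketch of that idea follows; the column names ("spot", "catch_kg", "lat", "lon") are illustrative assumptions, not the spreadsheets' actual headers:

```python
import pandas as pd

# Synthetic stand-ins for a catch file (file 4) and a locations file (files 5-10).
catches = pd.DataFrame({"spot": ["A", "B"], "catch_kg": [12.5, 7.0]})
spots = pd.DataFrame({"spot": ["A", "B"], "lat": [3.1, 3.4], "lon": [-52.0, -52.3]})

# Left join keeps every catch record and attaches coordinates where known,
# leaving the result ready for plotting in QGIS.
georeferenced = catches.merge(spots, on="spot", how="left")
```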
There are two datasets in this data repository. The first, named “RH.RF.720.360.1980.2016.Yearly.nc”, is a global heterotrophic respiration (RH) product with a spatial resolution of 0.5 degrees and a temporal resolution of one year. The global RH product is modelled with a Random Forest algorithm using field observations and environmental variables. The environmental variables include temperature, precipitation, and diurnal temperature range from CRU TS v.4.01 from 1901 to 2016; shortwave radiation; soil organic carbon content from soil grid data (Hengl et al., 2017); soil nitrogen content from ORNL DAAC; nitrogen deposition data from the Earth System Models GISS-E2-R, CCSM-CAM3.5, and GFDL-AM3 from the 1850s to the 2000s; the Palmer Drought Severity Index (PDSI); and soil water content.
The RH product is provided in network Common Data Form, version 4 (netCDF-4, short name: nc) data format (https://www.unidata.ucar.edu/software/netcdf/). The RH product is named according to the following convention: "RH.modelling approach.spatial resolution.start YYYY.end YYYY.temporal resolution.nc". “RH.RF.720.360.1980.2016.Yearly.nc” therefore means modelled RH flux (g C m-2 yr-1) by Random Forest (RF) with a 0.5° spatial resolution (size 720 along longitude and 360 along latitude) from start year 1980 to end year 2016 at a yearly temporal resolution.
The second file, named “dataset.xlsx”, contains the field observations from peer-reviewed publications, combined with the Global Soil Respiration Database (SRDB, version 3; Bond-Lamberty and Thomson, 2014), which is publicly available at https://github.com/bpbond/srdb. In addition, the database was further updated with observations collected from the China Knowledge Resource Integrated Database (www.cnki.net) up to March 2018, according to the criteria of SRDB. This dataset is provided in Microsoft Excel “.xlsx” format.
R code to reproduce the main results, and a land-area file (named land.area.nc, in km^2), are also available.
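The naming convention above can be unpacked programmatically. The short sketch below parses the dotted fields of a product file name into metadata; it only encodes the convention stated in this record:

```python
def parse_rh_filename(name):
    """Split an RH product name of the form
    RH.<model>.<nlon>.<nlat>.<startYYYY>.<endYYYY>.<frequency>.nc
    into a metadata dict."""
    parts = name.split(".")
    if len(parts) != 8 or parts[0] != "RH" or parts[-1] != "nc":
        raise ValueError(f"unexpected file name: {name}")
    return {
        "model": parts[1],       # e.g. "RF" for Random Forest
        "nlon": int(parts[2]),   # grid size along longitude
        "nlat": int(parts[3]),   # grid size along latitude
        "start_year": int(parts[4]),
        "end_year": int(parts[5]),
        "frequency": parts[6],   # e.g. "Yearly"
    }

meta = parse_rh_filename("RH.RF.720.360.1980.2016.Yearly.nc")
```

The netCDF payload itself would then be read with a netCDF-capable library such as netCDF4 or xarray.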
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Scott M. Brown (University of Puerto Rico)
Email: scott.brown@upr.edu
Data DOI: 10.5281/zenodo.15050209
This project empirically tests how language regimes embedded in legal and administrative systems create institutional traps that constrain multinational enterprise (MNE) operations and economic integration.
The study combines national and subnational data across four key datasets to measure how symbolic misalignment (such as monolingualism in non-commercial languages) affects regulatory quality, business formation, and workforce access.
You must upload the following four files into your Google Colab session before running the code:
Uploaded File | Description |
---|---|
/content/2020_Rankings.xlsx | World Bank Ease of Doing Business (EODB) — Global regulatory efficiency indicators (2020 Edition) |
/content/DBNA 2022 Rank and Scores.xlsx | Doing Business North America (DBNA 2022) — City-level institutional performance across 83 U.S. cities |
/content/Spanish_Speakers_All_States.xlsx | U.S. Census American Community Survey (ACS) — State-level Spanish-speaking and English proficiency data |
/content/wgidataset.xlsx | World Governance Indicators (WGI) — Governance quality measures (Regulatory Quality, Government Effectiveness, etc.) |
Open Google Colab.
Upload the four Excel files listed above.
Copy and paste the Python code provided below into a Colab notebook cell.
Run the code to automatically load the datasets, clean the data, and estimate key regression models.
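The project's full Colab code is not reproduced in this record. As a hedged illustration of the load-clean-regress step it describes, the sketch below fits a plain OLS model with numpy; the column names ("pct_spanish_speakers", "EODB_score") are hypothetical stand-ins, to be checked against the real headers after loading the Excel files:

```python
import numpy as np
import pandas as pd

def fit_ols(df, y_col, x_cols):
    """Ordinary least squares via numpy; returns {name: coefficient}."""
    X = np.column_stack(
        [np.ones(len(df))] + [df[c].to_numpy(float) for c in x_cols]
    )
    y = df[y_col].to_numpy(float)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return dict(zip(["intercept"] + list(x_cols), beta))

# In Colab the real data would be loaded with, e.g.:
#   eodb = pd.read_excel("/content/2020_Rankings.xlsx")
# Here a tiny synthetic frame stands in for the merged dataset.
demo = pd.DataFrame({
    "pct_spanish_speakers": [5.0, 10.0, 20.0, 40.0],
    "EODB_score": [84.0, 82.0, 78.0, 70.0],
})
coefs = fit_ols(demo, "EODB_score", ["pct_spanish_speakers"])
```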
Symbolic Institutional Traps: Language regimes act as hidden barriers, complicating regulatory navigation and labor market integration.
Symbolic Misalignment: Misfit between administrative languages and global commercial norms raises onboarding costs for MNEs.
Institutional Friction: Language encapsulation isolates economies and reduces foreign direct investment (FDI) attractiveness.
Each dataset has been:
Cleaned for consistent formatting.
Harmonized for cross-dataset integration.
Standardized to facilitate reproducible econometric analysis.
Full codebooks and metadata are available in the appendix of the research paper.
The EF EPI (English Proficiency) dataset was not uploaded here. If available, further regressions on symbolic distance can be run.
If any columns do not match exactly (e.g., different spellings), adjust the variable names in the code based on the output of print(dbna.columns).
The code generates:
Regression outputs on how Spanish-speaking prevalence correlates with:
Starting a business
Ease of Doing Business
Regulatory quality
Subnational institutional performance differences (Puerto Rico vs. U.S. states).
Open Data: CC BY 4.0 License
Citation Requested:
Brown, S.M. (2025). Symbolic Institutional Traps and the Liability of Foreignness: Language Regimes as Hidden Barriers to Multinational Entry. University of Puerto Rico. DOI: 10.5281/zenodo.15050209
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ABSTRACT
The Albero study analyzes the personal transitions of a cohort of high school students at the end of their studies. The data consist of (a) the longitudinal social network of the students, before (n = 69) and after (n = 57) finishing their studies; and (b) the longitudinal study of the personal networks of each participant in the research. The two observations of the complete social network are presented in two matrices in Excel format. For each respondent, two square matrices of 45 alters from their personal networks are also provided in Excel format. For each respondent, both psychological sense of community and frequency of commuting are provided in a SAV (SPSS) file. The database allows the combined analysis of social networks and personal networks of the same set of individuals.
INTRODUCTION
Ecological transitions are key moments in the life of an individual that occur as a result of a change of role or context. This is the case, for example, of the completion of high school studies, when young people start their university studies or try to enter the labor market. These transitions are turning points that carry a risk or an opportunity (Seidman & French, 2004). That is why they have received special attention in research and psychological practice, both from a developmental point of view and in the situational analysis of stress or in the implementation of preventive strategies.
The data we present in this article describe the ecological transition of a group of young people from Alcala de Guadaira, a town located about 16 kilometers from Seville. Specifically, in the “Albero” study we monitored the transition of a cohort of secondary school students at the end of the last pre-university academic year. It is a turning point in which most of them began a metropolitan lifestyle, with more displacements to the capital and a slight decrease in identification with the place of residence (Maya-Jariego, Holgado & Lubbers, 2018).
Normative transitions, such as the completion of studies, affect a group of individuals simultaneously, so they can be analyzed both individually and collectively. From an individual point of view, each student stops attending the institute, which is replaced by new interaction contexts. Consequently, the structure and composition of their personal networks are transformed. From a collective point of view, the network of friendships of the cohort of high school students enters into a gradual process of disintegration and fragmentation into subgroups (Maya-Jariego, Lubbers & Molina, 2019).
These two levels, individual and collective, were evaluated in the “Albero” study. One of the peculiarities of this database is that we combine the analysis of a complete social network with a survey of personal networks in the same set of individuals, with a longitudinal design before and after finishing high school. This combines the study of the multiple contexts in which each individual participates, assessed through the analysis of a sample of personal networks (Maya-Jariego, 2018), with the in-depth analysis of a specific context (the relationships among a cohort of students at the high school), through the analysis of the complete network of interactions. This potentially allows us to examine the covariation of the social network with individual differences in the structure of personal networks.
PARTICIPANTS
The social network and personal networks of the students of the last two years of high school of an institute of Alcala de Guadaira (Seville) were analyzed. The longitudinal follow-up covered approximately a year and a half. The first wave was composed of 31 men (44.9%) and 38 women (55.1%) who live in Alcala de Guadaira, and who mostly expect to live in Alcala (36.2%) or in Seville (37.7%) in the future. In the second wave, information was obtained from 27 men (47.4%) and 30 women (52.6%).
DATA STRUCTURE AND FILE FORMATS
The data is organized in two longitudinal observations, with information on the complete social network of the cohort of students of the last year, the personal networks of each individual and complementary information on the sense of community and frequency of metropolitan movements, among other variables.
Social network
The file “Red_Social_t1.xlsx” is a valued matrix of 69 actors that gathers the relations of knowledge and friendship between the cohort of students of the last year of high school in the first observation. The file “Red_Social_t2.xlsx” is a valued matrix of 57 actors obtained 17 months after the first observation.
To generate each complete social network, the list of 77 students enrolled in the last year of high school was presented to the respondents, asking them to indicate in each case the type of relationship, according to the following values: 1, "his/her name sounds familiar"; 2, "I know him/her"; 3, "we talk from time to time"; 4, "we have a good relationship"; and 5, "we are friends." The two resulting complete networks are represented in Figure 2. The second observation shows a comparatively less dense network, reflecting the gradual disintegration process that the student cohort has begun.
Personal networks
Also in this case the information is organized in two observations. The compressed file “Redes_Personales_t1.csv” includes 69 folders, corresponding to personal networks. Each folder includes a valued matrix of 45 alters in CSV format. Likewise, in each case a graphic representation of the network obtained with Visone (Brandes and Wagner, 2004) is included. Relationship values range from 0 (do not know each other) to 2 (know each other very well).
Second, the compressed file “Redes_Personales_t2.csv” includes 57 folders, with the information equivalent to each respondent referred to the second observation, that is, 17 months after the first interview. The structure of the data is the same as in the first observation.
Sense of community and metropolitan displacements
The SPSS file “Albero.sav” collects the survey data, together with summary information from the network data for each respondent. The 69 rows correspond to the 69 individuals interviewed, and the 118 columns to the variables related to each of them in T1 and T2, according to the following list:
• Socio-economic data.
• Data on habitual residence.
• Information on intercity journeys.
• Identity and sense of community.
• Personal network indicators.
• Social network indicators.
DATA ACCESS
Social networks and personal networks are available in CSV format. This allows them to be used directly with UCINET, Visone, Pajek, or Gephi, among others, and they can be exported as Excel or text files for use with other programs.
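As a small illustration of working with these matrix files outside dedicated SNA software, the sketch below reads a square relationship matrix from CSV with the Python standard library and computes a simple density measure. The 3x3 matrix is synthetic (a real personal-network file holds 45 alters with the 0-2 value scale described above):

```python
import csv
import io

# Synthetic stand-in for one matrix file; a real file would use open(path).
SAMPLE = io.StringIO("0,2,1\n2,0,0\n1,0,0\n")
matrix = [[int(v) for v in row] for row in csv.reader(SAMPLE)]

# Density: share of possible directed ties (off-diagonal cells) that are
# present, counting any nonzero relationship value as a tie.
n = len(matrix)
ties = sum(1 for i in range(n) for j in range(n) if i != j and matrix[i][j] > 0)
density = ties / (n * (n - 1))
```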
The visual representation of the personal networks of the respondents in both waves is available in the following album of the Graphic Gallery of Personal Networks on Flickr: .
In previous work we analyzed the effects of personal networks on the longitudinal evolution of the socio-centric network. It also includes additional details about the instruments applied. In case of using the data, please quote the following reference:
Maya-Jariego, I., Holgado, D. & Lubbers, M. J. (2018). Efectos de la estructura de las redes personales en la red sociocéntrica de una cohorte de estudiantes en transición de la enseñanza secundaria a la universidad. Universitas Psychologica, 17(1), 86-98. https://doi.org/10.11144/Javeriana.upsy17-1.eerp
The English version of this article can be downloaded from: https://tinyurl.com/yy9s2byl
CONCLUSION
The database of the “Albero” study allows us to explore the co-evolution of social networks and personal networks. In this way, we can examine the mutual dependence of individual trajectories and the structure of the relationships of the cohort of students as a whole. The complete social network corresponds to the same context of interaction: the secondary school. However, personal networks collect information from the different contexts in which the individual participates. The structural properties of personal networks may partly explain individual differences in the position of each student in the entire social network. In turn, the properties of the entire social network partly determine the structure of opportunities in which individual trajectories are displayed.
The longitudinal design, and the combination of individuals' personal networks with a common complete social network, give this database unique characteristics. It may be of interest both for multi-level analysis and for the study of individual differences.
ACKNOWLEDGEMENTS
The fieldwork for this study was supported by the Complementary Actions of the Ministry of Education and Science (SEJ2005-25683), and was part of the project “Dynamics of actors and networks across levels: individuals, groups, organizations and social settings” (2006-2009) of the European Science Foundation (ESF). The data was presented for the first time on June 30, 2009, at the European Research Collaborative Project Meeting on Dynamic Analysis of Networks and Behaviors, held at the Nuffield College of the University of Oxford.
REFERENCES
Brandes, U., & Wagner, D. (2004). Visone - Analysis and Visualization of Social Networks. In M. Jünger, & P. Mutzel (Eds.), Graph Drawing Software (pp. 321-340). New York: Springer-Verlag.
Maya-Jariego, I. (2018). Why name generators with a fixed number of alters may be a pragmatic option for personal network analysis. American Journal of
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset accompanies the paper titled 'Issues and Their Causes in WebAssembly Applications: An Empirical Study.' The dataset is stored in a Microsoft Excel file, which comprises multiple worksheets. A brief description of each worksheet is provided below.
(1) The 'Selected Systems' worksheet contains information on the 12 chosen open-source WebAssembly applications, along with the URL for each application.
(2) The 'GitHub-Raw Data' worksheet contains information on the initially retrieved 6,667 issues, including the titles, links, and statuses of each individual issue discussion.
(3) The 'SOF-Raw Data' worksheet contains information on the initially retrieved 6,667 questions and answers, including the details of each question and answer, respective links, and associated tags.
(4) The 'GitHubData Random Selected' worksheet contains a list of issues randomly selected from the initial pool of 6,667 issues, as well as extracted data from the discussions associated with these randomly selected issues.
(5) The 'GitHub-(Issues, Causes)' worksheet contains the initial codes categorizing the types of issues and causes.
(6) The 'SOF (Issues, Causes)' worksheet contains information gleaned from a randomly selected subset of 354 Stack Overflow posts. This information includes the title and body of each question, the associated link, tags, as well as key points for types of issues and causes.
(7) The 'Combine (Git and SOF) Data' worksheet contains the compiled issues and causes extracted from both GitHub and Stack Overflow.
(8) The 'Issue Taxonomy' worksheet contains a comprehensive issue taxonomy, which is organized into 9 categories, 20 subcategories, and 120 specific types of issues.
(9) The 'Cause Taxonomy' worksheet contains a comprehensive cause taxonomy, which is organized into 10 categories, 35 subcategories, and 278 specific types of causes.
A comprehensive Quality Assurance (QA) and Quality Control (QC) statistical framework consists of three major phases:
Phase 1: preliminary exploration of the raw datasets, including time formatting and combining datasets of different lengths and different time intervals.
Phase 2: QA of the datasets, including detection and flagging of duplicates, outliers, and extreme values.
Phase 3: development of time series at the desired frequency, imputation of missing values, visualization, and a final statistical summary.
The time series data collected at the Billy Barr meteorological station (East River Watershed, Colorado) were analyzed. The developed statistical framework is suitable for both real-time and post-collection QA/QC analysis of meteorological datasets.
The files in this data package include one Excel file converted to CSV format (Billy_Barr_raw_qaqc.csv) that contains the raw meteorological data, i.e., the input data for the QA/QC analysis. The second CSV file (Billy_Barr_1hr.csv) is the QA/QC-processed and flagged meteorological data, i.e., the output of the QA/QC analysis. The last file (QAQC_Billy_Barr_2021-03-22.R) is an R script that implements the QA/QC and flagging process. The CSV files are included to provide the input and output of the R script.
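The package's pipeline is implemented in R; the short Python sketch below only mirrors the Phase 2 idea of detecting and flagging duplicates and out-of-range values. The column names and plausibility thresholds are illustrative assumptions, not the script's actual settings:

```python
import pandas as pd

def qa_flags(df, col, lo, hi):
    """Flag duplicate timestamps and values outside a plausible range."""
    out = df.copy()
    out["flag_duplicate"] = out["timestamp"].duplicated(keep="first")
    out["flag_range"] = ~out[col].between(lo, hi)
    return out

raw = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2021-01-01 00:00", "2021-01-01 01:00",
         "2021-01-01 01:00", "2021-01-01 02:00"]),
    "air_temp_C": [-12.3, -11.8, -11.8, 95.0],  # 95 C is physically implausible
})
flagged = qa_flags(raw, "air_temp_C", lo=-60.0, hi=60.0)
```

Flagging rather than deleting keeps the raw record intact, matching the raw-input / flagged-output split of the two CSV files above.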
This dataset was generated from a set of Excel spreadsheets from an Information and Communication Technology Services (ICTS) administrative database on student applications to the University of Cape Town (UCT). The database contains information on applications to UCT between January 2006 and December 2014. In the original form received by DataFirst, the data were ill suited to research purposes. This dataset represents an attempt to clean and organize these data into a more tractable format. To ensure confidentiality, direct identifiers have been removed, and the data are only made available to accredited researchers through DataFirst's Secure Data Service.
The dataset was separated into the following data files:
Applications, individuals
Administrative records [adm]
Other [oth]
The data files were made available to DataFirst as a group of Excel spreadsheets from an SQL database managed by the University of Cape Town's Information and Communication Technology Services (ICTS). The process of combining these original data files to create a research-ready dataset is summarised in a document entitled "Notes on preparing the UCT Student Application Data 2006-2014" accompanying the data.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The National Health and Nutrition Examination Survey (NHANES) provides data with considerable potential for studying the health and environmental exposure of the non-institutionalized US population. However, because NHANES data contain multiple inconsistencies, the data must be processed before new insights can be derived through large-scale analyses. We therefore developed a set of curated and unified datasets by merging 614 separate files and harmonizing unrestricted data across NHANES III (1988-1994) and Continuous NHANES (1999-2018), totaling 135,310 participants and 5,078 variables. The variables convey demographics (281 variables), dietary consumption (324 variables), physiological functions (1,040 variables), occupation (61 variables), questionnaires (1,444 variables, e.g., physical activity, medical conditions, diabetes, reproductive health, blood pressure and cholesterol, early childhood), medications (29 variables), mortality information linked from the National Death Index (15 variables), survey weights (857 variables), environmental exposure biomarker measurements (598 variables), and chemical comments indicating which measurements are below or above the lower limit of detection (505 variables).
CSV Data Record: The curated NHANES datasets and the data dictionaries comprise 23 .csv files and 1 Excel file. The curated NHANES datasets consist of 20 .csv files, two for each module: an uncleaned version and a cleaned version.
The modules are labeled as follows: 1) mortality, 2) dietary, 3) demographics, 4) response, 5) medications, 6) questionnaire, 7) chemicals, 8) occupation, 9) weights, and 10) comments. "dictionary_nhanes.csv" is a dictionary that lists the variable name, description, module, category, units, CAS number, comment use, chemical family, chemical family shortened, number of measurements, and cycles available for all 5,078 variables in NHANES. "dictionary_harmonized_categories.csv" contains the harmonized categories for the categorical variables. "dictionary_drug_codes.csv" contains the dictionary of descriptors for the drug codes. "nhanes_inconsistencies_documentation.xlsx" is an Excel file containing the cleaning documentation, which records all inconsistencies for all affected variables to help curate each of the NHANES modules.
R Data Record: For researchers who want to conduct their analysis in the R programming language, the cleaned NHANES modules and the data dictionaries can be downloaded as a .zip file that includes an .RData file and an .R file. "w - nhanes_1988_2018.RData" contains all the aforementioned datasets as R data objects. We make available all R scripts for the customized functions that were written to curate the data. "m - nhanes_1988_2018.R" shows how we used the customized functions (i.e., our pipeline) to curate the original NHANES data.
Example starter code: The set of starter code to help users conduct exposome analyses consists of four R Markdown files (.Rmd).
We recommend going through the tutorials in order. "example_0 - merge_datasets_together.Rmd" demonstrates how to merge the curated NHANES datasets together. "example_1 - account_for_nhanes_design.Rmd" demonstrates how to conduct a linear regression model, a survey-weighted regression model, a Cox proportional hazards model, and a survey-weighted Cox proportional hazards model. "example_2 - calculate_summary_statistics.Rmd" demonstrates how to calculate summary statistics for one variable and for multiple variables, with and without accounting for the NHANES sampling design. "example_3 - run_multiple_regressions.Rmd" demonstrates how to run multiple regression models with and without adjusting for the sampling design.
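The tutorials themselves are in R; as a language-agnostic illustration of the first step (merging modules on the participant identifier), here is a hedged Python sketch. NHANES keys its records by the respondent ID SEQN, but whether the curated files keep that exact column name should be verified against dictionary_nhanes.csv:

```python
import pandas as pd

# Tiny synthetic frames stand in for, e.g., the demographics and mortality
# module CSVs; the SEQN column name is an assumption to confirm.
demographics = pd.DataFrame({"SEQN": [1, 2, 3], "age": [34, 51, 68]})
mortality = pd.DataFrame({"SEQN": [1, 3], "deceased": [0, 1]})

# Left join keeps every participant; those without linked mortality
# records get missing values rather than being dropped.
merged = demographics.merge(mortality, on="SEQN", how="left")
```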