Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The National Health and Nutrition Examination Survey (NHANES) provides data with considerable potential for studying the health and environmental exposures of the non-institutionalized US population. However, NHANES data contain multiple inconsistencies, so the data must be processed before new insights can be derived through large-scale analyses. We therefore developed a set of curated and unified datasets by merging 614 separate files and harmonizing unrestricted data across NHANES III (1988-1994) and Continuous NHANES (1999-2018), totaling 135,310 participants and 5,078 variables. The variables convey demographics (281 variables), dietary consumption (324 variables), physiological functions (1,040 variables), occupation (61 variables), questionnaires (1,444 variables, e.g., physical activity, medical conditions, diabetes, reproductive health, blood pressure and cholesterol, early childhood), medications (29 variables), mortality information linked from the National Death Index (15 variables), survey weights (857 variables), environmental exposure biomarker measurements (598 variables), and chemical comments indicating which measurements are below or above the lower limit of detection (505 variables).
CSV Data Record: The curated NHANES datasets and the data dictionaries include 23 .csv files and 1 Excel file. The curated NHANES datasets consist of 20 .csv files, two for each module: one uncleaned version and one cleaned version.
The modules are labeled as follows: 1) mortality, 2) dietary, 3) demographics, 4) response, 5) medications, 6) questionnaire, 7) chemicals, 8) occupation, 9) weights, and 10) comments. "dictionary_nhanes.csv" is a dictionary that lists the variable name, description, module, category, units, CAS number, comment use, chemical family, chemical family shortened, number of measurements, and cycles available for all 5,078 variables in NHANES. "dictionary_harmonized_categories.csv" contains the harmonized categories for the categorical variables. "dictionary_drug_codes.csv" contains the dictionary of descriptors for the drug codes. "nhanes_inconsistencies_documentation.xlsx" is an Excel file that contains the cleaning documentation, which records all the inconsistencies for all affected variables to help curate each of the NHANES modules.
R Data Record: For researchers who want to conduct their analysis in the R programming language, the cleaned NHANES modules and the data dictionaries can be downloaded as a .zip file, which includes an .RData file and an .R file. "w - nhanes_1988_2018.RData" contains all the aforementioned datasets as R data objects. We make available all R scripts for the customized functions that were written to curate the data. "m - nhanes_1988_2018.R" shows how we used the customized functions (i.e., our pipeline) to curate the original NHANES data.
Example starter code: The set of starter code to help users conduct exposome analyses consists of four R Markdown files (.Rmd).
We recommend going through the tutorials in order. "example_0 - merge_datasets_together.Rmd" demonstrates how to merge the curated NHANES datasets together. "example_1 - account_for_nhanes_design.Rmd" demonstrates how to conduct a linear regression model, a survey-weighted regression model, a Cox proportional hazards model, and a survey-weighted Cox proportional hazards model. "example_2 - calculate_summary_statistics.Rmd" demonstrates how to calculate summary statistics for one variable and for multiple variables, with and without accounting for the NHANES sampling design. "example_3 - run_multiple_regressions.Rmd" demonstrates how to run multiple regression models with and without adjusting for the sampling design.
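The starter tutorials are written in R, but the merge logic they demonstrate is straightforward to sketch in any language. Below is a minimal, hypothetical illustration in Python/pandas, assuming each curated module shares NHANES's participant identifier SEQN as the merge key; the toy values are invented, not taken from the dataset:

```python
import pandas as pd

# Hypothetical stand-ins for two cleaned module files; the curated release
# stores one cleaned CSV per module. NHANES identifies each participant by
# the SEQN variable, which serves as the merge key across modules.
demographics = pd.DataFrame({"SEQN": [1, 2, 3], "RIDAGEYR": [34, 51, 27]})
chemicals = pd.DataFrame({"SEQN": [1, 3], "LBXBPB": [1.2, 0.8]})

# A left join keeps every participant from the demographics module and
# fills missing chemical measurements with NaN, mirroring the approach
# shown in the example_0 merge tutorial.
merged = demographics.merge(chemicals, on="SEQN", how="left")
print(merged.shape)  # (3, 3)
```

In the real data, repeating this left join module by module reproduces the combined participant-level table the tutorials build.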
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset comprises a comprehensive set of files designed for the analysis and 2D correlation of spectral data, specifically focusing on ATR and NIR spectra. It includes MATLAB scripts and supporting functions necessary to replicate the analysis, as well as the raw datasets used in the study. Below is a detailed description of the included files:
Data Analysis: Data_Analysis.mlx
2D Correlation Data Analysis: Data_Analysis_2Dcorr.mlx
Functions: Functions folder
Datasets: ATR_dataset.xlsx, NIR_dataset.xlsx, Reference_data.csv

To replicate the analysis, run the Data_Analysis.mlx and Data_Analysis_2Dcorr.mlx scripts in MATLAB, ensuring that the Functions folder is in the MATLAB path.

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Firm-level data from 2009 to 2018 for 34 large gold mines in developing countries. The data are used to compute the deterministic, dynamic environmental and technical efficiencies of large gold mines in developing countries.
Steps to reproduce:
1. Run the R command to generate dynamic technical and dynamic environmental inefficiencies for every two subsequent periods (i.e., period t and t+1).
2. Combine the result files of inefficiencies per period generated in R into a panel (see the Excel files in the results folder).
3. Import the Excel files into Stata and generate the final results indicated in the paper.
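Step 2 above can be sketched as follows. This is an illustrative Python/pandas equivalent of the Excel combining step, with invented mine identifiers and inefficiency values; the actual workflow uses R output files, Excel, and Stata:

```python
import pandas as pd

# Hypothetical per-period inefficiency results, one frame per (t, t+1)
# window, standing in for the files produced by the R step. The column
# names are illustrative, not the paper's actual variable names.
period_results = {
    (2009, 2010): pd.DataFrame({"mine_id": [1, 2], "tech_ineff": [0.12, 0.30]}),
    (2010, 2011): pd.DataFrame({"mine_id": [1, 2], "tech_ineff": [0.10, 0.28]}),
}

# Stack the per-period results into a long panel (mine x period), the
# same shape the combined Excel files take before the Stata step.
frames = []
for (t, t1), df in period_results.items():
    frames.append(df.assign(period_start=t, period_end=t1))
panel = pd.concat(frames, ignore_index=True)
print(len(panel))  # 4
```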
Analyzing sales data is essential for any business looking to make informed decisions and optimize its operations. In this project, we will utilize Microsoft Excel and Power Query to conduct a comprehensive analysis of Superstore sales data. Our primary objectives will be to establish meaningful connections between various data sheets, ensure data quality, and calculate critical metrics such as the Cost of Goods Sold (COGS) and discount values. Below are the key steps and elements of this analysis:
1- Data Import and Transformation:
2- Data Quality Assessment:
3- Calculating COGS:
4- Discount Analysis:
5- Sales Metrics:
6- Visualization:
7- Report Generation:
Throughout this analysis, the goal is to provide a clear and comprehensive understanding of the Superstore's sales performance. By using Excel and Power Query, we can efficiently manage and analyze the data, ensuring that the insights gained contribute to the store's growth and success.
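As a rough sketch of steps 3 and 4, COGS and discount value are simple row-wise products. The Python/pandas fragment below uses assumed column names (quantity, unit_cost, unit_price, discount_rate) and invented values, not the actual Superstore schema, so it illustrates the arithmetic rather than the Power Query implementation:

```python
import pandas as pd

# Illustrative order lines; all column names and values are assumptions.
# COGS = quantity x unit cost; the discount value is the revenue given
# up at each line's discount rate.
orders = pd.DataFrame({
    "quantity": [3, 5],
    "unit_cost": [4.0, 2.5],
    "unit_price": [10.0, 6.0],
    "discount_rate": [0.10, 0.00],
})
orders["cogs"] = orders["quantity"] * orders["unit_cost"]
orders["gross_sales"] = orders["quantity"] * orders["unit_price"]
orders["discount_value"] = orders["gross_sales"] * orders["discount_rate"]
orders["net_sales"] = orders["gross_sales"] - orders["discount_value"]
print(orders["cogs"].sum())  # 24.5
```

In Power Query the same calculations would be added as custom columns with the equivalent formulas.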
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset was generated and used in the publication "Resource Optimization with MPI Process Malleability for Dynamic Workloads in HPC Clusters." The dataset is organized into three stages: raw data, preprocessed data, and processed data.
Each workload execution includes log files, application code, executables, and launching scripts. Execution times are extracted from log files, specifically from "slurm-dmr_*.out" and "slurm-dmr_*.info", where "*" represents a number corresponding to a specific job execution.
raw_data: Contains the output files from executing workloads on the MareNostrum V HPC cluster. This section is divided into three subsections, each corresponding to a different workload type.
preprocessed_data: This section contains the collected raw data in .pkl files, following the same structure as the raw_data folder. For each workload execution, four .pkl files are generated. The variable name can take values from [baseline, merge, static], while X represents a workload execution number. When X = J, the file contains a compilation of all workloads with the same configuration. When A appears before X, it refers to an asynchronous execution.

The four types of .pkl files are:
1) nameX_data, described in nameX_data_description.txt.
2) nameX_data_resize, described in nameX_data_resize_description.txt.
3) nameAX_iter_data, generated only for asynchronous (A) workloads; a description is available in nameAX_iter_data_description.txt.
4) nameX_workload, described in nameX_workload_description.txt.
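The file-naming rule described above can be captured by a small parser. The Python sketch below encodes that rule, with the caveat that the exact base names of the .pkl files are inferred from the description-file names and may differ in the dataset:

```python
import re

# Sketch of the naming rule: <name><A?><X>_<kind>.pkl, where name is one
# of baseline/merge/static, an optional "A" marks an asynchronous
# execution, and X is a workload number ("J" = compilation of all
# workloads with the same configuration).
PATTERN = re.compile(
    r"^(?P<name>baseline|merge|static)"
    r"(?P<async_>A?)"
    r"(?P<x>\d+|J)"
    r"_(?P<kind>.+)\.pkl$"
)

def parse_pkl_name(fname):
    """Return the fields encoded in a .pkl file name, or None if it
    does not follow the convention."""
    m = PATTERN.match(fname)
    if m is None:
        return None
    return {
        "name": m.group("name"),
        "asynchronous": m.group("async_") == "A",
        "workload": m.group("x"),   # "J" means all workloads combined
        "kind": m.group("kind"),
    }

print(parse_pkl_name("mergeA3_iter_data.pkl"))
```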
processed_data: Includes the analyzed results from the preprocessed_data folder. This section contains .xlsx files and images used in the Experimental Setup section of the paper. The Excel files are categorized as follows:
This folder contains the scripts used to convert raw_data into preprocessed_data, along with a Jupyter Notebook used for data analysis and visualization. To understand or use these codes, please contact the dataset creators.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The shared archives combined in the Supplementary Datasets represent the actual databases used in the investigation considered in two papers:
Meteorological conditions affecting black vulture (Coragyps atratus) soaring behavior in the southeast of Brazil: Implications for bird strike abatement (in submission)
Remote sensing applications for abating the aircraft-bird strike risks in the southeast of Brazil (Human-Wildlife Interactions Journal, in print)
The papers were based on my Master's thesis, defended in 2016 at the Institute of Biology of the University of Campinas (UNICAMP) in partial fulfilment of the requirements for the degree of Master in Ecology. Our investigation was devoted to reducing the risk of aircraft collisions with black vultures. It had two parts, considered in these two papers. In the first, we studied the relationship between the soaring activity of black vultures and meteorological characteristics. In the second, we explored the dependence of the soaring activity of vultures on superficial and anthropogenic characteristics. The study was implemented within the surroundings of two airports in the southeast of Brazil, taken as case studies. We developed methodological approaches combining GIS and remote sensing technologies for data processing, which were used as the main research instrument. Using them, we joined into georeferenced databases (shapefiles) the bird observation data and three types of environmental factors: (i) meteorological characteristics collected together with the bird observations; (ii) superficial parameters (relief and surface temperature) obtained from ASTER imagery products; (iii) parameters of surface cover and anthropogenic pressure obtained from high-resolution satellite images. Based on the analyses of the georeferenced databases, the relationship between the soaring activity of vultures and environmental factors was studied; the behavioral patterns of vultures in soaring flight were revealed; the landscape types highly attractive for this species, over which increased concentrations of birds form, were detected; maps giving a numerical estimation of the hazard of bird strike events over the airport vicinities were constructed; and practical recommendations for decreasing the risk of collisions with vultures and other bird species were formulated.
This archive contains all materials elaborated and used for the study, including the GIS database for the two papers, remote sensing data, and Microsoft Excel datasets. You can find the description of the supplementary files in Description of Supplementary Dataset.docx. The links to the supplementary files and their attribution to the text of the papers are covered in Attribution to the text of papers.docx. The supplementary files are in the folders Datasets, GIS_others, GIS_Raster, and GIS_Shape.
For any questions, please write to me at this email: natalieenov@gmail.com
Natalia Novoselova
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The climate-disease relationship is complex, where multiple driver-pressure factors and interactions combine with climate change in determining the occurrence geographies and timings of various infectious diseases. However, studies often focus on just some selected factor(s) and interaction(s). Such focus choices may limit and bias our understanding and predictive capability of disease sensitivity to climate and other driver-pressure changes. To assess these research choices and identify possible remaining key gaps and biases, a scoping review is applied to climate-disease related publications for Lyme disease and cryptosporidiosis.
This dataset includes three Excel files, for Lyme disease ("literature_search_LD_raw.xls"), cryptosporidiosis ("literature_search_CY_raw.xls"), and categories of driver-pressure factors ("Table_1.xlsx") respectively, covering all the relevant information for further analysis. The first two Excel files contain three sheets named "Search", "Filter", and "Quantitative". The contents of each sheet are explained as follows.
1. “Search”
Sheet “Search” lists all the publications found from the literature searches, including publication year, title, DOI, and authors. The searches were performed in Web of Science™ (WoS) and considered publications from 1 January 2000 to 10 February 2022. Search terms were (('borreliosis' OR 'Lyme disease') AND ('climate' OR 'climate change' OR 'climate variability')) for Lyme disease, and ((‘cryptosporidiosis’ OR ‘cryptosporidium’ OR ‘crypto.’) AND (‘climate’ OR ‘climate change’ OR ‘climate variability’)) for cryptosporidiosis. The search yielded 555 publication results for Lyme disease and 185 for cryptosporidiosis.
2. “Filter”
Sheet “Filter” lists the inclusions and exclusions for the searched publications, and the categories to which each included publication belongs. Excluded articles (marked as blank in column “Included”) are ones that: (i) do not consider both climate and disease; (ii) are for Lyme disease but not about the Ixodes transmission of the Borrelia pathogen; (iii) are not written in English; or (iv) are not full-text open access. Included articles (marked as “x” in column “Included”) are further classified into the following categories based on their focus: for Lyme disease, categories of publications include reviews (mentioning), reviews (specifically discussing), public awareness, mitigation, survey (implications), investigations, projections, and others; for cryptosporidiosis, categories include reviews (mentioning), reviews (specifically discussing), survey (implications), investigations, projections, and others. Articles belonging to any category are marked with “x”.
3. “Quantitative”
Sheet “Quantitative” lists further information extracted from quantitative studies, which are articles under the categories of investigations and projections in sheet “Filter”. The further extracted information comes from the methods section of each study, or from the full text where necessary. The information includes transmission components, study region (if applicable), category of investigations or projections, included driver-pressure factors, and methods. Transmission components are reproduction host, transmission host, vector, and human for Lyme disease; and animal reservoir, environmental reservoir, and human for cryptosporidiosis. Study region considers specific countries as the smallest scale for spatial resolution, so that study sites smaller than a whole country were counted as studies of the associated countries. Included driver-pressure factors are categorized as shown in Table 1 in the file "Table_1.xlsx", including main categories and their covered variables. Methods include laboratory/field experimentation/observation, statistical analysis, mechanistic modelling, and synthesis/meta-analysis.
This reference contains tabular datasets resulting from the eDNA pilot study on National Wildlife Refuges. The ZIP file contains all datasets as received from the authors: a folder for each participating refuge containing two Excel workbooks, one for the MiFish marker results and one for the COI marker results. Each workbook has several sheets, including one for the raw compiled data, one for each site, and one for filtered combined data. A CSV of filtered data for all participating refuges combined was compiled by extracting the filtered datasheet for each refuge from the Excel workbooks and combining them into a CSV using an R script. A CSV of the total OTUs, OTU species, unique families, and number of fish, mammal, amphibian, mollusk, and bird species for each participating refuge was compiled by Rachel Maxey (I&M Data Manager) by extracting the data from the refuge workbooks and combining them manually into a CSV. Also included are a CSV of the full site data download from Survey123, and data dictionaries and metadata for the site information and eDNA results tables.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Welcome to New York City, one of the most-visited cities in the world. There are many Airbnb listings in New York City to meet the high demand for temporary lodging for travelers, which can last anywhere from a few nights to many months. In this project, we will take a closer look at the New York Airbnb market by combining data from multiple file types: .csv, .tsv, and .xlsx.
Recall that CSV, TSV, and Excel files are three common formats for storing data. Three files containing data on 2019 Airbnb listings are available to you:
data/airbnb_price.csv This is a CSV file containing data on Airbnb listing prices and locations.
listing_id: unique identifier of listing
price: nightly listing price in USD
nbhood_full: name of borough and neighborhood where listing is located

data/airbnb_room_type.xlsx This is an Excel file containing data on Airbnb listing descriptions and room types.

listing_id: unique identifier of listing
description: listing description
room_type: Airbnb has three types of rooms: shared rooms, private rooms, and entire homes/apartments

data/airbnb_last_review.tsv This is a TSV file containing data on Airbnb host names and review dates.

listing_id: unique identifier of listing
host_name: name of listing host
last_review: date when the listing was last reviewed
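A minimal Python/pandas sketch of loading and joining the three formats follows. It writes tiny stand-in files to a temporary directory so the example is self-contained; in the project itself you would read the data/ paths directly. Note that read_excel relies on an engine such as openpyxl for .xlsx files:

```python
import os
import tempfile
import pandas as pd

# Tiny invented stand-ins for the three project files.
tmp = tempfile.mkdtemp()
pd.DataFrame({"listing_id": [1, 2], "price": ["225 dollars", "89 dollars"]}) \
    .to_csv(os.path.join(tmp, "airbnb_price.csv"), index=False)
pd.DataFrame({"listing_id": [1, 2], "room_type": ["Entire home/apt", "Private room"]}) \
    .to_excel(os.path.join(tmp, "airbnb_room_type.xlsx"), index=False)
pd.DataFrame({"listing_id": [1, 2], "last_review": ["2019-05-21", "2019-07-05"]}) \
    .to_csv(os.path.join(tmp, "airbnb_last_review.tsv"), sep="\t", index=False)

# read_csv covers both CSV and TSV (via sep="\t"); read_excel handles .xlsx.
prices = pd.read_csv(os.path.join(tmp, "airbnb_price.csv"))
rooms = pd.read_excel(os.path.join(tmp, "airbnb_room_type.xlsx"))
reviews = pd.read_csv(os.path.join(tmp, "airbnb_last_review.tsv"), sep="\t")

# listing_id is the shared key, so the three tables chain-merge into one.
listings = prices.merge(rooms, on="listing_id").merge(reviews, on="listing_id")
print(listings.shape)
```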
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The SMARTDEST DATASET WP3 v1.0 includes data at sub-city level for 7 cities: Amsterdam, Barcelona, Edinburgh, Lisbon, Ljubljana, Turin, and Venice. It is made up of information extracted from public sources at the local level (mostly city council open data portals) or volunteered geographic information, that is, geospatial content generated by non-professionals using mapping systems available on the Internet (e.g., Geofabrik). Details on data sources and variables are included in a ‘metadata’ spreadsheet in the Excel file. The same Excel file contains 5 additional spreadsheets. The first one, labelled #1, was used to perform the analysis on the determinants of the geographical spread of tourism supply in the SMARTDEST case study cities (in the main document D3.3, section 4.1). The second one (labelled #2) offers information that would allow replication of the analysis on tourism-led population decline reported in section 4.3. As for the spreadsheets named #3-AMS, #4-BCN, and #5-EDI, they refer to data sources and variables used to run the follow-up analyses discussed in section 5.1, with the objective of digging into the causes of depopulation in Amsterdam, Barcelona, and Edinburgh, respectively. The column ‘row’ can be used to merge the Excel file with the shapefile ‘db_task3.3_SmartDest’. Data are available at the buurt level in Amsterdam (an administrative unit roughly corresponding to a neighbourhood), at the census tract level in Barcelona and Ljubljana, for data zones in Edinburgh, statistical zones in Turin, and località in Venice.
State estimates for these years are no longer available due to methodological concerns with combining 2019 and 2020 data. We apologize for any inconvenience or confusion this may cause. Because of the COVID-19 pandemic, most respondents answered the survey via the web in Quarter 4 of 2020, even though all responses in Quarter 1 were from in-person interviews. It is known that people may respond to the survey differently while taking it online, thus introducing what is called a mode effect. When the state estimates were released, it was assumed that the mode effect was similar for different groups of people. However, later analyses have shown that this assumption should not be made. Because of these analyses, along with concerns about the rapid societal changes in 2020, it was determined that averages across the two years could be misleading. For more detail on this decision, see the 2019-2020 state data page.
The documentation covers Enterprise Survey panel datasets that were collected in Slovenia in 2009, 2013 and 2019.
The Slovenia ES 2009 was conducted between 2008 and 2009. The Slovenia ES 2013 was conducted between March 2013 and September 2013. Finally, the Slovenia ES 2019 was conducted between December 2018 and November 2019. The objective of the Enterprise Survey is to gain an understanding of what firms experience in the private sector.
As part of its strategic goal of building a climate for investment, job creation, and sustainable growth, the World Bank has promoted improving the business environment as a key strategy for development, which has led to a systematic effort in collecting enterprise data across countries. The Enterprise Surveys (ES) are an ongoing World Bank project in collecting both objective data based on firms' experiences and enterprises' perception of the environment in which they operate.
National
The primary sampling unit of the study is the establishment. An establishment is a physical location where business is carried out and where industrial operations take place or services are provided. A firm may be composed of one or more establishments. For example, a brewery may have several bottling plants and several establishments for distribution. For the purposes of this survey an establishment must take its own financial decisions and have its own financial statements separate from those of the firm. An establishment must also have its own management and control over its payroll.
As it is standard for the ES, the Slovenia ES was based on the following size stratification: small (5 to 19 employees), medium (20 to 99 employees), and large (100 or more employees).
Sample survey data [ssd]
The samples for the Slovenia ES 2009, 2013, and 2019 were selected using stratified random sampling, following the methodology explained in the Sampling Manual for the Slovenia 2009 ES and the Slovenia 2013 ES, and in the Sampling Note for the 2019 Slovenia ES.
Three levels of stratification were used in this country: industry, establishment size, and oblast (region). The original sample designs with specific information on the industries and regions chosen are included in the attached Excel file (Sampling Report.xls) for the Slovenia 2009 ES. For the Slovenia 2013 and 2019 ES, specific information on the industries and regions chosen is described in "The Slovenia 2013 Enterprise Surveys Data Set" and "The Slovenia 2019 Enterprise Surveys Data Set" reports respectively, Appendix E.
For the Slovenia 2009 ES, industry stratification was designed as follows: the universe was stratified into manufacturing industries, services industries, and one residual (core) sector as defined in the sampling manual. Each industry had a target of 90 interviews. For the manufacturing industries, sample sizes were inflated by about 17% to account for potential non-response when requesting sensitive financial data and because of likely attrition in future surveys that would affect the construction of a panel. For the other industries (residuals), sample sizes were inflated by about 12% to account for under-sampling of firms in service industries.
For Slovenia 2013 ES, industry stratification was designed in the way that follows: the universe was stratified into one manufacturing industry, and two service industries (retail, and other services).
Finally, for Slovenia 2019 ES, three levels of stratification were used in this country: industry, establishment size, and region. The original sample design with specific information of the industries and regions chosen is described in "The Slovenia 2019 Enterprise Surveys Data Set" report, Appendix C. Industry stratification was done as follows: Manufacturing – combining all the relevant activities (ISIC Rev. 4.0 codes 10-33), Retail (ISIC 47), and Other Services (ISIC 41-43, 45, 46, 49-53, 55, 56, 58, 61, 62, 79, 95).
For Slovenia 2009 and 2013 ES, size stratification was defined following the standardized definition for the rollout: small (5 to 19 employees), medium (20 to 99 employees), and large (more than 99 employees). For stratification purposes, the number of employees was defined on the basis of reported permanent full-time workers. This seems to be an appropriate definition of the labor force since seasonal/casual/part-time employment is not a common practice, except in the sectors of construction and agriculture.
For Slovenia 2009 ES, regional stratification was defined in 2 regions. These regions are Vzhodna Slovenija and Zahodna Slovenija. The Slovenia sample contains panel data. The wave 1 panel “Investment Climate Private Enterprise Survey implemented in Slovenia” consisted of 223 establishments interviewed in 2005. A total of 57 establishments have been re-interviewed in the 2008 Business Environment and Enterprise Performance Survey.
For Slovenia 2013 ES, regional stratification was defined in 2 regions (city and the surrounding business area) throughout Slovenia.
Finally, for Slovenia 2019 ES, regional stratification was done across two regions: Eastern Slovenia (NUTS code SI03) and Western Slovenia (SI04).
Computer Assisted Personal Interview [capi]
Questionnaires comprise common questions (the core module) plus additional manufacturing- and services-specific questions. The eligible manufacturing industries were surveyed using the Manufacturing questionnaire (which includes the core module plus manufacturing-specific questions). Retail firms were interviewed using the Services questionnaire (which includes the core module plus retail-specific questions), and the residual eligible services were covered using the Services questionnaire (core module only). Each variation of the questionnaire is identified by the index variable, a0.
Survey non-response must be differentiated from item non-response. The former refers to refusals to participate in the survey altogether whereas the latter refers to the refusals to answer some specific questions. Enterprise Surveys suffer from both problems and different strategies were used to address these issues.
Item non-response was addressed by two strategies: a- For sensitive questions that may generate negative reactions from the respondent, such as corruption or tax evasion, enumerators were instructed to collect the refusal to respond as (-8). b- Establishments with incomplete information were re-contacted in order to complete this information, whenever necessary. However, there were clear cases of low response.
For 2009 and 2013 Slovenia ES, the survey non-response was addressed by maximizing efforts to contact establishments that were initially selected for interview. Up to 4 attempts were made to contact the establishment for interview at different times/days of the week before a replacement establishment (with similar strata characteristics) was suggested for interview. Survey non-response did occur but substitutions were made in order to potentially achieve strata-specific goals. Further research is needed on survey non-response in the Enterprise Surveys regarding potential introduction of bias.
For 2009, the number of contacted establishments per realized interview was 6.18. This number is the result of two factors: explicit refusals to participate in the survey, as reflected by the rate of rejection (which includes rejections of the screener and the main survey), and the quality of the sample frame, as represented by the presence of ineligible units. The relatively low ratio of contacted establishments per realized interview (6.18) suggests that the main source of error in estimates for Slovenia may be selection bias rather than frame inaccuracy.
For 2013, the rate of realized interviews per contacted establishment was 25%. This number is the result of two factors: explicit refusals to participate in the survey, as reflected by the rate of rejection (which includes rejections of the screener and the main survey), and the quality of the sample frame, as represented by the presence of ineligible units. The rate of rejections per contact was 44%.
Finally, for 2019, the rate of interviews per contacted establishment was 9.7%. This number is the result of two factors: explicit refusals to participate in the survey, as reflected by the rate of rejection (which includes rejections of the screener and the main survey), and the quality of the sample frame, as represented by the presence of ineligible units. The share of rejections per contact was 75.2%.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The content of the dataset is twofold: I. It contains the results of two case studies made with the ENCKEP Simulation tool, developed within the ENCKEP COST Action: European Network for Collaboration on Kidney Exchange Programmes (https://www.enckep-cost.eu/). II. It provides the source code of the Evaluation module, developed in R for evaluation of the results based on various metrics. The details of the case studies' creation, the file formats, and the usage of the Evaluation module are provided in the Handbook of Working Groups 3 and 4 of the ENCKEP COST Action, "International Kidney Exchange Programmes in Europe: Practice, Solution Models, Simulation and Evaluation Tools", available at https://www.enckep-cost.eu/.

Case studies. Each case study is organised in a zipped folder.

Case study #1: Case_study_French.zip. The folder contains the results of the computational experiment on data created by modifying the real data of the French Kidney Exchange Programme (KEP).
The folder contains the following files:
input_arcs_FR.csv - compatibility graph
input_config_FR.json - distribution parameters for generation of missing data
input_failed_arcs_FR.csv - list of arcs that fail after matching
input_failed_pairs_FR.csv - list of pairs that fail after matching
input_hla_FR.csv - HLA data of donors and patients
input_objective_FR.json - list of criteria for optimisation and the parameters of their usage, including the approach to multi-objective optimisation (lexicographic and/or weighted optimisation)
Input_paris.csv - characteristics of pairs
Input_policy.csv - policy file with the settings for simulations
output_file_FR.xlsx - Excel file that merges all the output files of the simulator, each file in a separate sheet

Case study #2: Case_study_international.zip. The folder contains the results of computational experiments for several countries where only distribution parameters were provided for each country.
List of files (see the descriptions of the contents of each file above):
Input_arcs.csv
input_config_FR.json - distribution parameters for generation of data for country FR
input_config_NL.json - distribution parameters for generation of data for country NL
input_config_UK.json - distribution parameters for generation of data for country UK
Input_failed_arcs.csv
Input_faield_paris.csv
Input_objective.json
Input_pairs.csv

The three subfolders contain the results for three different policy runs; each subfolder has:
input_NameOfPolicy_policy.json - policy file for the corresponding policy
Ouput_files_NameOfPolicy.xlsx - output files

The "no collaboration" folder contains the results for the policy where there is no collaboration between the countries (NameOfPolicy = individual); "Consecutive_collaboration" is the collaboration where countries first run internal matching rounds, and only the remaining pairs participate in the international pool (NameOfPolicy = consecutive); "Borderless_collaboration" contains the results when countries completely merge their pools (NameOfPolicy = borderless).

II. R Evaluation module. ENCKEP_evaluator.R is the code of the Evaluation module in R. The details of usage of the module are provided in the above-mentioned Handbook. ENCKEP_evaluator_report.pdf is an example of the report generated by the module for Case study #1.
This dataset comprises 10 Excel files of fishing data and relevant locations. File 1 was used to produce figure 1 using QGIS. We used R scripts to select and combine data from files 2 and 3 to produce figures 2, 3, 4, 5, and 6. We crossed file 4 with files 5, 6, 7, 8, 9, and 10 separately to add the georeferences of the fishing spots to the catch data, and then we used QGIS to plot the data and produce figures S1 A (file 5), S1 B (file 6), S2 A (file 7), S2 B (file 8), S2 C (file 9), and S2 D (file 10).
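The "crossing" step above is a table join: catch records gain coordinates by matching on the fishing-spot identifier. A minimal pandas sketch of that idea follows; the column names ("spot", "catch_kg", "lat", "lon") are illustrative assumptions, not the spreadsheets' actual headers:

```python
import pandas as pd

# Synthetic stand-ins for a catch file (file 4) and a locations file (files 5-10).
catches = pd.DataFrame({"spot": ["A", "B"], "catch_kg": [12.5, 7.0]})
spots = pd.DataFrame({"spot": ["A", "B"], "lat": [3.1, 3.4], "lon": [-52.0, -52.3]})

# Left join keeps every catch record and attaches coordinates where known,
# leaving the result ready for plotting in QGIS.
georeferenced = catches.merge(spots, on="spot", how="left")
```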
There are two datasets in this data repository. The first, named “RH.RF.720.360.1980.2016.Yearly.nc”, is a global heterotrophic respiration (RH) product with a spatial resolution of 0.5 degrees and a temporal resolution of one year. The global RH product is modelled with a Random Forest algorithm using field observations and environmental variables. The environmental variables include temperature, precipitation, and diurnal temperature range from CRU TS v.4.01 from 1901 to 2016; shortwave radiation; soil organic carbon content from soil grid data (Hengl et al., 2017); soil nitrogen content from ORNL DAAC; nitrogen deposition data from the Earth System Models GISS-E2-R, CCSM-CAM3.5, and GFDL-AM3 from the 1850s to the 2000s; the Palmer Drought Severity Index (PDSI); and soil water content.
The RH product is provided in network Common Data Form, version 4 (netCDF-4, short name: nc) data format (https://www.unidata.ucar.edu/software/netcdf/). The RH product is named according to the following convention: "RH.modelling approach.spatial resolution.start YYYY.end YYYY.temporal resolution.nc". “RH.RF.720.360.1980.2016.Yearly.nc” therefore means modelled RH flux (g C m-2 yr-1) by Random Forest (RF) with a 0.5° spatial resolution (size 720 along longitude and 360 along latitude) from start year 1980 to end year 2016 at a yearly temporal resolution.
The second file, named “dataset.xlsx”, contains the field observations from peer-reviewed publications, combined with the Global Soil Respiration Database (SRDB, version 3; Bond-Lamberty and Thomson, 2014), which is publicly available at https://github.com/bpbond/srdb. In addition, the database was further updated with observations collected from the China Knowledge Resource Integrated Database (www.cnki.net) up to March 2018, according to the criteria of SRDB. This dataset is provided in Microsoft Excel “.xlsx” format.
R code to reproduce the main results, and a land-area file (named land.area.nc, in km^2), are also available.
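The naming convention above can be unpacked programmatically. The short sketch below parses the dotted fields of a product file name into metadata; it only encodes the convention stated in this record:

```python
def parse_rh_filename(name):
    """Split an RH product name of the form
    RH.<model>.<nlon>.<nlat>.<startYYYY>.<endYYYY>.<frequency>.nc
    into a metadata dict."""
    parts = name.split(".")
    if len(parts) != 8 or parts[0] != "RH" or parts[-1] != "nc":
        raise ValueError(f"unexpected file name: {name}")
    return {
        "model": parts[1],       # e.g. "RF" for Random Forest
        "nlon": int(parts[2]),   # grid size along longitude
        "nlat": int(parts[3]),   # grid size along latitude
        "start_year": int(parts[4]),
        "end_year": int(parts[5]),
        "frequency": parts[6],   # e.g. "Yearly"
    }

meta = parse_rh_filename("RH.RF.720.360.1980.2016.Yearly.nc")
```

The netCDF payload itself would then be read with a netCDF-capable library such as netCDF4 or xarray.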
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Scott M. Brown (University of Puerto Rico)
Email: scott.brown@upr.edu
Data DOI: 10.5281/zenodo.15050209
This project empirically tests how language regimes embedded in legal and administrative systems create institutional traps that constrain multinational enterprise (MNE) operations and economic integration.
The study combines national and subnational data across four key datasets to measure how symbolic misalignment (such as monolingualism in non-commercial languages) affects regulatory quality, business formation, and workforce access.
You must upload the following four files into your Google Colab session before running the code:
Uploaded File | Description |
---|---|
/content/2020_Rankings.xlsx | World Bank Ease of Doing Business (EODB) — Global regulatory efficiency indicators (2020 Edition) |
/content/DBNA 2022 Rank and Scores.xlsx | Doing Business North America (DBNA 2022) — City-level institutional performance across 83 U.S. cities |
/content/Spanish_Speakers_All_States.xlsx | U.S. Census American Community Survey (ACS) — State-level Spanish-speaking and English proficiency data |
/content/wgidataset.xlsx | World Governance Indicators (WGI) — Governance quality measures (Regulatory Quality, Government Effectiveness, etc.) |
Open Google Colab.
Upload the four Excel files listed above.
Copy and paste the Python code provided below into a Colab notebook cell.
Run the code to automatically load the datasets, clean the data, and estimate key regression models.
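The project's full Colab code is not reproduced in this record. As a hedged illustration of the load-clean-regress step it describes, the sketch below fits a plain OLS model with numpy; the column names ("pct_spanish_speakers", "EODB_score") are hypothetical stand-ins, to be checked against the real headers after loading the Excel files:

```python
import numpy as np
import pandas as pd

def fit_ols(df, y_col, x_cols):
    """Ordinary least squares via numpy; returns {name: coefficient}."""
    X = np.column_stack(
        [np.ones(len(df))] + [df[c].to_numpy(float) for c in x_cols]
    )
    y = df[y_col].to_numpy(float)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return dict(zip(["intercept"] + list(x_cols), beta))

# In Colab the real data would be loaded with, e.g.:
#   eodb = pd.read_excel("/content/2020_Rankings.xlsx")
# Here a tiny synthetic frame stands in for the merged dataset.
demo = pd.DataFrame({
    "pct_spanish_speakers": [5.0, 10.0, 20.0, 40.0],
    "EODB_score": [84.0, 82.0, 78.0, 70.0],
})
coefs = fit_ols(demo, "EODB_score", ["pct_spanish_speakers"])
```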
Symbolic Institutional Traps: Language regimes act as hidden barriers, complicating regulatory navigation and labor market integration.
Symbolic Misalignment: Misfit between administrative languages and global commercial norms raises onboarding costs for MNEs.
Institutional Friction: Language encapsulation isolates economies and reduces foreign direct investment (FDI) attractiveness.
Each dataset has been:
Cleaned for consistent formatting.
Harmonized for cross-dataset integration.
Standardized to facilitate reproducible econometric analysis.
Full codebooks and metadata are available in the appendix of the research paper.
The EF EPI (English Proficiency) dataset was not uploaded here. If available, further regressions on symbolic distance can be run.
If any columns do not match exactly (e.g., different spellings), adjust the variable names in the code based on the output of print(dbna.columns).
The code generates:
Regression outputs on how Spanish-speaking prevalence correlates with:
Starting a business
Ease of Doing Business
Regulatory quality
Subnational institutional performance differences (Puerto Rico vs. U.S. states).
Open Data: CC BY 4.0 License
Citation Requested:
Brown, S.M. (2025). Symbolic Institutional Traps and the Liability of Foreignness: Language Regimes as Hidden Barriers to Multinational Entry. University of Puerto Rico. DOI: 10.5281/zenodo.15050209
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ABSTRACT
The Albero study analyzes the personal transitions of a cohort of high school students at the end of their studies. The data consist of (a) the longitudinal social network of the students, before (n = 69) and after (n = 57) finishing their studies; and (b) the longitudinal study of the personal networks of each participant in the research. The two observations of the complete social network are presented in two matrices in Excel format. For each respondent, two square matrices of 45 alters from their personal networks are also provided in Excel format. For each respondent, both psychological sense of community and frequency of commuting are provided in a SAV (SPSS) file. The database allows the combined analysis of social networks and personal networks of the same set of individuals.
INTRODUCTION
Ecological transitions are key moments in the life of an individual that occur as a result of a change of role or context. This is the case, for example, of the completion of high school studies, when young people start their university studies or try to enter the labor market. These transitions are turning points that carry a risk or an opportunity (Seidman & French, 2004). That is why they have received special attention in research and psychological practice, both from a developmental point of view and in the situational analysis of stress or in the implementation of preventive strategies.
The data we present in this article describe the ecological transition of a group of young people from Alcala de Guadaira, a town located about 16 kilometers from Seville. Specifically, in the “Albero” study we monitored the transition of a cohort of secondary school students at the end of the last pre-university academic year. It is a turning point in which most of them began a metropolitan lifestyle, with more displacements to the capital and a slight decrease in identification with the place of residence (Maya-Jariego, Holgado & Lubbers, 2018).
Normative transitions, such as the completion of studies, affect a group of individuals simultaneously, so they can be analyzed both individually and collectively. From an individual point of view, each student stops attending the institute, which is replaced by new interaction contexts. Consequently, the structure and composition of their personal networks are transformed. From a collective point of view, the network of friendships of the cohort of high school students enters into a gradual process of disintegration and fragmentation into subgroups (Maya-Jariego, Lubbers & Molina, 2019).
These two levels, individual and collective, were evaluated in the “Albero” study. One of the peculiarities of this database is that we combine the analysis of a complete social network with a survey of personal networks in the same set of individuals, with a longitudinal design before and after finishing high school. This combines the study of the multiple contexts in which each individual participates, assessed through the analysis of a sample of personal networks (Maya-Jariego, 2018), with the in-depth analysis of a specific context (the relationships among a cohort of students at the high school), through the analysis of the complete network of interactions. This potentially allows us to examine the covariation of the social network with individual differences in the structure of personal networks.
PARTICIPANTS
The social network and personal networks of the students of the last two years of high school of an institute of Alcala de Guadaira (Seville) were analyzed. The longitudinal follow-up covered approximately a year and a half. The first wave was composed of 31 men (44.9%) and 38 women (55.1%) who live in Alcala de Guadaira, and who mostly expect to live in Alcala (36.2%) or in Seville (37.7%) in the future. In the second wave, information was obtained from 27 men (47.4%) and 30 women (52.6%).
DATA STRUCTURE AND FILE FORMATS
The data is organized in two longitudinal observations, with information on the complete social network of the cohort of students of the last year, the personal networks of each individual and complementary information on the sense of community and frequency of metropolitan movements, among other variables.
Social network
The file “Red_Social_t1.xlsx” is a valued matrix of 69 actors that gathers the relations of knowledge and friendship between the cohort of students of the last year of high school in the first observation. The file “Red_Social_t2.xlsx” is a valued matrix of 57 actors obtained 17 months after the first observation.
To generate each complete social network, the list of 77 students enrolled in the last year of high school was presented to the respondents, asking them to indicate in each case the type of relationship, according to the following values: 1, "his/her name sounds familiar"; 2, "I know him/her"; 3, "we talk from time to time"; 4, "we have a good relationship"; and 5, "we are friends." The two resulting complete networks are represented in Figure 2. The second observation shows a comparatively less dense network, reflecting the gradual disintegration process that the student cohort has begun.
Personal networks
Also in this case the information is organized in two observations. The compressed file “Redes_Personales_t1.csv” includes 69 folders, corresponding to personal networks. Each folder includes a valued matrix of 45 alters in CSV format. Likewise, in each case a graphic representation of the network obtained with Visone (Brandes and Wagner, 2004) is included. Relationship values range from 0 (do not know each other) to 2 (know each other very well).
Second, the compressed file “Redes_Personales_t2.csv” includes 57 folders, with the information equivalent to each respondent referred to the second observation, that is, 17 months after the first interview. The structure of the data is the same as in the first observation.
Sense of community and metropolitan displacements
The SPSS file “Albero.sav” collects the survey data, together with summary information from the network data for each respondent. The 69 rows correspond to the 69 individuals interviewed, and the 118 columns to the variables related to each of them in T1 and T2, according to the following list:
• Socio-economic data.
• Data on habitual residence.
• Information on intercity journeys.
• Identity and sense of community.
• Personal network indicators.
• Social network indicators.
DATA ACCESS
Social networks and personal networks are available in CSV format. This allows them to be used directly with UCINET, Visone, Pajek, or Gephi, among others, and they can be exported as Excel or text files for use with other programs.
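As a small illustration of working with these matrix files outside dedicated SNA software, the sketch below reads a square relationship matrix from CSV with the Python standard library and computes a simple density measure. The 3x3 matrix is synthetic (a real personal-network file holds 45 alters with the 0-2 value scale described above):

```python
import csv
import io

# Synthetic stand-in for one matrix file; a real file would use open(path).
SAMPLE = io.StringIO("0,2,1\n2,0,0\n1,0,0\n")
matrix = [[int(v) for v in row] for row in csv.reader(SAMPLE)]

# Density: share of possible directed ties (off-diagonal cells) that are
# present, counting any nonzero relationship value as a tie.
n = len(matrix)
ties = sum(1 for i in range(n) for j in range(n) if i != j and matrix[i][j] > 0)
density = ties / (n * (n - 1))
```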
The visual representation of the personal networks of the respondents in both waves is available in the following album of the Graphic Gallery of Personal Networks on Flickr: .
In previous work we analyzed the effects of personal networks on the longitudinal evolution of the socio-centric network. It also includes additional details about the instruments applied. In case of using the data, please quote the following reference:
Maya-Jariego, I., Holgado, D. & Lubbers, M. J. (2018). Efectos de la estructura de las redes personales en la red sociocéntrica de una cohorte de estudiantes en transición de la enseñanza secundaria a la universidad. Universitas Psychologica, 17(1), 86-98. https://doi.org/10.11144/Javeriana.upsy17-1.eerp
The English version of this article can be downloaded from: https://tinyurl.com/yy9s2byl
CONCLUSION
The database of the “Albero” study allows us to explore the co-evolution of social networks and personal networks. In this way, we can examine the mutual dependence of individual trajectories and the structure of the relationships of the cohort of students as a whole. The complete social network corresponds to the same context of interaction: the secondary school. However, personal networks collect information from the different contexts in which the individual participates. The structural properties of personal networks may partly explain individual differences in the position of each student in the entire social network. In turn, the properties of the entire social network partly determine the structure of opportunities in which individual trajectories are displayed.
The longitudinal design, and the combination of individuals' personal networks with a common complete social network, give this database unique characteristics. It may be of interest both for multi-level analysis and for the study of individual differences.
ACKNOWLEDGEMENTS
The fieldwork for this study was supported by the Complementary Actions of the Ministry of Education and Science (SEJ2005-25683), and was part of the project “Dynamics of actors and networks across levels: individuals, groups, organizations and social settings” (2006-2009) of the European Science Foundation (ESF). The data was presented for the first time on June 30, 2009, at the European Research Collaborative Project Meeting on Dynamic Analysis of Networks and Behaviors, held at the Nuffield College of the University of Oxford.
REFERENCES
Brandes, U., & Wagner, D. (2004). Visone - Analysis and Visualization of Social Networks. In M. Jünger, & P. Mutzel (Eds.), Graph Drawing Software (pp. 321-340). New York: Springer-Verlag.
Maya-Jariego, I. (2018). Why name generators with a fixed number of alters may be a pragmatic option for personal network analysis. American Journal of
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset accompanies the paper titled 'Issues and Their Causes in WebAssembly Applications: An Empirical Study.' The dataset is stored in a Microsoft Excel file, which comprises multiple worksheets. A brief description of each worksheet is provided below.
(1) The 'Selected Systems' worksheet contains information on the 12 chosen open-source WebAssembly applications, along with the URL for each application.
(2) The 'GitHub-Raw Data' worksheet contains information on the initially retrieved 6,667 issues, including the titles, links, and statuses of each individual issue discussion.
(3) The 'SOF-Raw Data' worksheet contains information on the initially retrieved 6,667 questions and answers, including the details of each question and answer, respective links, and associated tags.
(4) The 'GitHubData Random Selected' worksheet contains a list of issues randomly selected from the initial pool of 6,667 issues, as well as extracted data from the discussions associated with these randomly selected issues.
(5) The 'GitHub-(Issues, Causes)' worksheet contains the initial codes categorizing the types of issues and causes.
(6) The 'SOF (Issues, Causes)' worksheet contains information gleaned from a randomly selected subset of 354 Stack Overflow posts. This information includes the title and body of each question, the associated link, tags, as well as key points for types of issues and causes.
(7) The 'Combine (Git and SOF) Data' worksheet contains the compiled issues and causes extracted from both GitHub and Stack Overflow.
(8) The 'Issue Taxonomy' worksheet contains a comprehensive issue taxonomy, which is organized into 9 categories, 20 subcategories, and 120 specific types of issues.
(9) The 'Cause Taxonomy' worksheet contains a comprehensive cause taxonomy, which is organized into 10 categories, 35 subcategories, and 278 specific types of causes.
A comprehensive Quality Assurance (QA) and Quality Control (QC) statistical framework consists of three major phases:
Phase 1: preliminary exploration of the raw datasets, including time formatting and combining datasets of different lengths and different time intervals.
Phase 2: QA of the datasets, including detection and flagging of duplicates, outliers, and extreme values.
Phase 3: development of time series at the desired frequency, imputation of missing values, visualization, and a final statistical summary.
The time series data collected at the Billy Barr meteorological station (East River Watershed, Colorado) were analyzed. The developed statistical framework is suitable for both real-time and post-collection QA/QC analysis of meteorological datasets.
The files in this data package include one Excel file converted to CSV format (Billy_Barr_raw_qaqc.csv) that contains the raw meteorological data, i.e., the input data for the QA/QC analysis. The second CSV file (Billy_Barr_1hr.csv) is the QA/QC-processed and flagged meteorological data, i.e., the output of the QA/QC analysis. The last file (QAQC_Billy_Barr_2021-03-22.R) is an R script that implements the QA/QC and flagging process. The CSV files are included to provide the input and output of the R script.
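The package's pipeline is implemented in R; the short Python sketch below only mirrors the Phase 2 idea of detecting and flagging duplicates and out-of-range values. The column names and plausibility thresholds are illustrative assumptions, not the script's actual settings:

```python
import pandas as pd

def qa_flags(df, col, lo, hi):
    """Flag duplicate timestamps and values outside a plausible range."""
    out = df.copy()
    out["flag_duplicate"] = out["timestamp"].duplicated(keep="first")
    out["flag_range"] = ~out[col].between(lo, hi)
    return out

raw = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2021-01-01 00:00", "2021-01-01 01:00",
         "2021-01-01 01:00", "2021-01-01 02:00"]),
    "air_temp_C": [-12.3, -11.8, -11.8, 95.0],  # 95 C is physically implausible
})
flagged = qa_flags(raw, "air_temp_C", lo=-60.0, hi=60.0)
```

Flagging rather than deleting keeps the raw record intact, matching the raw-input / flagged-output split of the two CSV files above.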
This dataset was generated from a set of Excel spreadsheets from an Information and Communication Technology Services (ICTS) administrative database on student applications to the University of Cape Town (UCT). The database contains information on applications to UCT between January 2006 and December 2014. In the original form received by DataFirst, the data were ill suited to research purposes. This dataset represents an attempt to clean and organize these data into a more tractable format. To ensure confidentiality, direct identifiers have been removed, and the data are only made available to accredited researchers through DataFirst's Secure Data Service.
The dataset was separated into the following data files:
Applications, individuals
Administrative records [adm]
Other [oth]
The data files were made available to DataFirst as a group of Excel spreadsheets from an SQL database managed by the University of Cape Town's Information and Communication Technology Services (ICTS). The process of combining these original data files to create a research-ready dataset is summarised in a document entitled "Notes on preparing the UCT Student Application Data 2006-2014" accompanying the data.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The National Health and Nutrition Examination Survey (NHANES) provides data with considerable potential for studying the health and environmental exposure of the non-institutionalized US population. However, because NHANES data contain multiple inconsistencies, the data must be processed before new insights can be derived through large-scale analyses. We therefore developed a set of curated and unified datasets by merging 614 separate files and harmonizing unrestricted data across NHANES III (1988-1994) and Continuous NHANES (1999-2018), totaling 135,310 participants and 5,078 variables. The variables convey demographics (281 variables), dietary consumption (324 variables), physiological functions (1,040 variables), occupation (61 variables), questionnaires (1,444 variables, e.g., physical activity, medical conditions, diabetes, reproductive health, blood pressure and cholesterol, early childhood), medications (29 variables), mortality information linked from the National Death Index (15 variables), survey weights (857 variables), environmental exposure biomarker measurements (598 variables), and chemical comments indicating which measurements are below or above the lower limit of detection (505 variables).
CSV Data Record: The curated NHANES datasets and the data dictionaries comprise 23 .csv files and 1 Excel file. The curated NHANES datasets consist of 20 .csv files, two for each module: an uncleaned version and a cleaned version.
The modules are labeled as follows: 1) mortality, 2) dietary, 3) demographics, 4) response, 5) medications, 6) questionnaire, 7) chemicals, 8) occupation, 9) weights, and 10) comments. "dictionary_nhanes.csv" is a dictionary that lists the variable name, description, module, category, units, CAS number, comment use, chemical family, chemical family shortened, number of measurements, and cycles available for all 5,078 variables in NHANES. "dictionary_harmonized_categories.csv" contains the harmonized categories for the categorical variables. "dictionary_drug_codes.csv" contains the dictionary of descriptors for the drug codes. "nhanes_inconsistencies_documentation.xlsx" is an Excel file containing the cleaning documentation, which records all inconsistencies for all affected variables to help curate each of the NHANES modules.
R Data Record: For researchers who want to conduct their analysis in the R programming language, the cleaned NHANES modules and the data dictionaries can be downloaded as a .zip file that includes an .RData file and an .R file. "w - nhanes_1988_2018.RData" contains all the aforementioned datasets as R data objects. We make available all R scripts for the customized functions that were written to curate the data. "m - nhanes_1988_2018.R" shows how we used the customized functions (i.e., our pipeline) to curate the original NHANES data.
Example starter code: The set of starter code to help users conduct exposome analyses consists of four R Markdown files (.Rmd).
We recommend going through the tutorials in order. "example_0 - merge_datasets_together.Rmd" demonstrates how to merge the curated NHANES datasets together. "example_1 - account_for_nhanes_design.Rmd" demonstrates how to conduct a linear regression model, a survey-weighted regression model, a Cox proportional hazards model, and a survey-weighted Cox proportional hazards model. "example_2 - calculate_summary_statistics.Rmd" demonstrates how to calculate summary statistics for one variable and for multiple variables, with and without accounting for the NHANES sampling design. "example_3 - run_multiple_regressions.Rmd" demonstrates how to run multiple regression models with and without adjusting for the sampling design.
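The tutorials themselves are in R; as a language-agnostic illustration of the first step (merging modules on the participant identifier), here is a hedged Python sketch. NHANES keys its records by the respondent ID SEQN, but whether the curated files keep that exact column name should be verified against dictionary_nhanes.csv:

```python
import pandas as pd

# Tiny synthetic frames stand in for, e.g., the demographics and mortality
# module CSVs; the SEQN column name is an assumption to confirm.
demographics = pd.DataFrame({"SEQN": [1, 2, 3], "age": [34, 51, 68]})
mortality = pd.DataFrame({"SEQN": [1, 3], "deceased": [0, 1]})

# Left join keeps every participant; those without linked mortality
# records get missing values rather than being dropped.
merged = demographics.merge(mortality, on="SEQN", how="left")
```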