100+ datasets found

Data from: Importing General-Purpose Graphics in R
figshare.com
auckland.figshare.com
application/gzip
Updated Sep 19, 2018
Cite
Paul Murrell (2018). Importing General-Purpose Graphics in R [Dataset]. http://doi.org/10.17608/k6.auckland.7108736.v1
Available download formats: application/gzip
Unique identifier
https://doi.org/10.17608/k6.auckland.7108736.v1
Dataset updated
Sep 19, 2018
Dataset provided by
The University of Auckland
Authors
Paul Murrell
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This report discusses some problems that can arise when attempting to import PostScript images into R, when the PostScript image contains coordinate transformations that skew the image. There is a description of some new features in the ‘grImport’ package for R that allow these sorts of images to be imported into R successfully.
Data from: Optimized SMRT-UMI protocol produces highly accurate sequence...
data.niaid.nih.gov
zenodo.org
+1 more
zip
Updated Dec 7, 2023
+ more versions
Cite
Optimized SMRT-UMI protocol produces highly accurate sequence datasets from diverse populations – application to HIV-1 quasispecies [Dataset]. https://data.niaid.nih.gov/resources?id=dryad_w3r2280w0
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.w3r2280w0
Dataset updated
Dec 7, 2023
Dataset provided by
HIV Vaccine Trials Network (http://www.hvtn.org/)
HIV Prevention Trials Network (http://www.hptn.org/)
National Institute of Allergy and Infectious Diseases (http://www.niaid.nih.gov/)
PEPFAR
Authors
Dylan Westfall; James Mullins
License
https://spdx.org/licenses/CC0-1.0.html
Description
Pathogen diversity resulting in quasispecies can enable persistence and adaptation to host defenses and therapies. However, accurate quasispecies characterization can be impeded by errors introduced during sample handling and sequencing, which can require extensive optimizations to overcome. We present complete laboratory and bioinformatics workflows to overcome many of these hurdles. The Pacific Biosciences single molecule real-time platform was used to sequence PCR amplicons derived from cDNA templates tagged with universal molecular identifiers (SMRT-UMI). Optimized laboratory protocols were developed through extensive testing of different sample preparation conditions to minimize between-template recombination during PCR. The use of UMIs allowed accurate template quantitation as well as removal of point mutations introduced during PCR and sequencing, producing a highly accurate consensus sequence from each template. Handling of the large datasets produced from SMRT-UMI sequencing was facilitated by a novel bioinformatic pipeline, Probabilistic Offspring Resolver for Primer IDs (PORPIDpipeline), that automatically filters and parses reads by sample, identifies and discards reads with UMIs likely created from PCR and sequencing errors, generates consensus sequences, checks for contamination within the dataset, and removes any sequence with evidence of PCR recombination or early cycle PCR errors, resulting in highly accurate sequence datasets. The optimized SMRT-UMI sequencing method presented here represents a highly adaptable and established starting point for accurate sequencing of diverse pathogens. These methods are illustrated through characterization of human immunodeficiency virus (HIV) quasispecies.

Methods

This serves as an overview of the analysis performed on PacBio sequence data that is summarized in Analysis Flowchart.pdf and was used as primary data for the paper by Westfall et al., "Optimized SMRT-UMI protocol produces highly accurate sequence datasets from diverse populations – application to HIV-1 quasispecies".

Five different PacBio sequencing datasets were used for this analysis: M027, M2199, M1567, M004, and M005. For the datasets which were indexed (M027, M2199), CCS reads from PacBio sequencing files and the chunked_demux_config files were used as input for the chunked_demux pipeline. Each config file lists the different Index primers added during PCR to each sample. The pipeline produces one fastq file for each Index primer combination in the config. For example, in dataset M027 there were 3–4 samples using each Index combination. The fastq files from each demultiplexed read set were moved to the sUMI_dUMI_comparison pipeline fastq folder for further demultiplexing by sample and consensus generation with that pipeline. More information about the chunked_demux pipeline can be found in the README.md file on GitHub.

The demultiplexed read collections from the chunked_demux pipeline, or the CCS read files from datasets which were not indexed (M1567, M004, M005), were each used as input for the sUMI_dUMI_comparison pipeline along with each dataset's config file. Each config file contains the primer sequences for each sample (including the sample ID block in the cDNA primer). The pipeline further demultiplexes the reads to prepare data tables summarizing all of the UMI sequences and counts for each family (tagged.tar.gz) as well as consensus sequences from each sUMI and rank 1 dUMI family (consensus.tar.gz). More information about the sUMI_dUMI_comparison pipeline can be found in the paper and the README.md file on GitHub.

The consensus.tar.gz and tagged.tar.gz files were moved from the sUMI_dUMI_comparison pipeline directory on the server to the Pipeline_Outputs folder in this analysis directory for each dataset and appended with the dataset name (e.g. consensus_M027.tar.gz). Also in this analysis directory is a Sample_Info_Table.csv containing information about how each of the samples was prepared, such as purification methods and number of PCRs. There are also three other folders: Sequence_Analysis, Indentifying_Recombinant_Reads, and Figures. Each has an .Rmd file with the same name inside, which is used to collect, summarize, and analyze the data. All of these collections of code were written and executed in RStudio to track notes and summarize results.

Sequence_Analysis.Rmd has instructions to decompress all of the consensus.tar.gz files, combine them, and create two fasta files, one with all sUMI and one with all dUMI sequences. Using these as input, two data tables were created that summarize all sequences and read counts for each sample that pass various criteria. These are used to help create Table 2 and as input for Indentifying_Recombinant_Reads.Rmd and Figures.Rmd. Next, two fasta files containing all of the rank 1 dUMI sequences and the matching sUMI sequences were created. These were used as input for the Python script compare_seqs.py, which identifies any matched sequences that are different between sUMI and dUMI read collections. This information was also used to help create Table 2. Finally, to populate the table with the number of sequences and bases in each sequence subset of interest, different sequence collections were saved and viewed in the Geneious program.

To investigate the cause of sequences where the sUMI and dUMI sequences do not match, tagged.tar.gz was decompressed and, for each family with discordant sUMI and dUMI sequences, the reads from the UMI1_keeping directory were aligned using Geneious. Reads from dUMI families failing the 0.7 filter were also aligned in Geneious. These read collections contain all of the reads in a UMI1 family and still include the UMI2 sequence. The uncompressed tagged folder was then removed to save space. By examining the alignment, and specifically the UMI2 sequences, the site of the discordance and its cause were identified for each family as described in the paper. These alignments were saved as "Sequence Alignments.geneious". The counts of how many families were the result of PCR recombination were used in the body of the paper.

Using Identifying_Recombinant_Reads.Rmd, the dUMI_ranked.csv file from each sample was extracted from all of the tagged.tar.gz files, combined, and used as input to create a single dataset containing all UMI information from all samples. This file, dUMI_df.csv, was used as input for Figures.Rmd. Figures.Rmd used dUMI_df.csv, sequence_counts.csv, and read_counts.csv as input to create draft figures and then individual datasets for each figure. These were copied into Prism software to create the final figures for the paper.
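The consensus step described above (grouping reads by UMI and removing point errors by out-voting them) can be illustrated with a minimal Python sketch. This is not the PORPIDpipeline or sUMI_dUMI_comparison code: the function name `umi_consensus`, the `min_family_size` threshold, and the toy reads are all hypothetical, and the sketch assumes reads within a family are already aligned and of comparable length.

```python
from collections import Counter, defaultdict


def umi_consensus(reads, min_family_size=2):
    """Group (umi, sequence) pairs by UMI and take a per-position majority vote.

    Hypothetical illustration only; real pipelines also filter erroneous UMIs,
    check for contamination, and detect PCR recombination.
    """
    families = defaultdict(list)
    for umi, seq in reads:
        families[umi].append(seq)

    consensus = {}
    for umi, seqs in families.items():
        if len(seqs) < min_family_size:
            continue  # too few reads to out-vote a sequencing error
        length = min(len(s) for s in seqs)  # assumes pre-aligned reads
        consensus[umi] = "".join(
            Counter(s[i] for s in seqs).most_common(1)[0][0]
            for i in range(length)
        )
    return consensus


reads = [
    ("AAGT", "ACGTAC"),
    ("AAGT", "ACGTAC"),
    ("AAGT", "ACGAAC"),  # one read carries a point error at position 3
    ("CCTA", "TTGTAC"),  # singleton family, dropped by the size threshold
]
result = umi_consensus(reads)
print(result)
```

In this toy input the erroneous base is out-voted by the two correct reads in the same family, which is the general idea behind using UMIs to suppress PCR and sequencing errors.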
Input data, model output, and R scripts for a machine learning streamflow...
gimi9.com
data.usgs.gov
+3 more
Updated Sep 16, 2021
+ more versions
Cite
(2021). Input data, model output, and R scripts for a machine learning streamflow model on the Wyoming Range, Wyoming, 2012–17 [Dataset]. https://gimi9.com/dataset/data-gov_input-data-model-output-and-r-scripts-for-a-machine-learning-streamflow-model-on-the-wyomi
Dataset updated
Sep 16, 2021
Area covered
Wyoming, Wyoming Range
Description
A machine learning streamflow (MLFLOW) model was developed in R (model is in the Rscripts folder) for modeling monthly streamflow from 2012 to 2017 in three watersheds on the Wyoming Range in the upper Green River basin. Geospatial information for 125 site features (vector data are in the Sites.shp file) and discrete streamflow observation data and environmental predictor data were used in fitting the MLFLOW model and predicting with the fitted model. Tabular calibration and validation data are in the Model_Fitting_Site_Data.csv file, totaling 971 discrete observations and predictions of monthly streamflow. Geospatial information for 17,518 stream grid cells (raster data are in the Streams.tif file) and environmental predictor data were used for continuous streamflow predictions with the MLFLOW model. Tabular prediction data for all the study area (17,518 stream grid cells) and study period (72 months; 2012–17) are in the Model_Prediction_Stream_Data.csv file, totaling 1,261,296 predictions of spatially and temporally continuous monthly streamflow. Additional information about the datasets is in the metadata included in the four zipped dataset files and about the MLFLOW model is in the readme included in the zipped model archive folder.
2010 County and City-Level Water-Use Data and Associated Explanatory...
catalog.data.gov
data.usgs.gov
+4 more
Updated Jul 6, 2024
+ more versions
Cite
U.S. Geological Survey (2024). 2010 County and City-Level Water-Use Data and Associated Explanatory Variables [Dataset]. https://catalog.data.gov/dataset/2010-county-and-city-level-water-use-data-and-associated-explanatory-variables
Dataset updated
Jul 6, 2024
Dataset provided by
U.S. Geological Survey
Description
This data release contains the input-data files and R scripts associated with the analysis presented in [citation of manuscript]. The spatial extent of the data is the contiguous U.S. The input-data files include one comma separated value (csv) file of county-level data, and one csv file of city-level data. The county-level csv (“county_data.csv”) contains data for 3,109 counties. This data includes two measures of water use, descriptive information about each county, three grouping variables (climate region, urban class, and economic dependency), and contains 18 explanatory variables: proportion of population growth from 2000-2010, fraction of withdrawals from surface water, average daily water yield, mean annual maximum temperature from 1970-2010, 2005-2010 maximum temperature departure from the 40-year maximum, mean annual precipitation from 1970-2010, 2005-2010 mean precipitation departure from the 40-year mean, Gini income disparity index, percent of county population with at least some college education, Cook Partisan Voting Index, housing density, median household income, average number of people per household, median age of structures, percent of renters, percent of single family homes, percent apartments, and a numeric version of urban class. The city-level csv (city_data.csv) contains data for 83 cities. This data includes descriptive information for each city, water-use measures, one grouping variable (climate region), and 6 explanatory variables: type of water bill (increasing block rate, decreasing block rate, or uniform), average price of water bill, number of requirement-oriented water conservation policies, number of rebate-oriented water conservation policies, aridity index, and regional price parity. The R scripts construct fixed-effects and Bayesian Hierarchical regression models. The primary difference between these models relates to how they handle possible clustering in the observations that define unique water-use settings. 
Fixed-effects models address possible clustering in one of two ways. In a "fully pooled" fixed-effects model, any clustering by group is ignored, and a single, fixed estimate of the coefficient for each covariate is developed using all of the observations. Conversely, in an unpooled fixed-effects model, separate coefficient estimates are developed only using the observations in each group. A hierarchical model provides a compromise between these two extremes. Hierarchical models extend single-level regression to data with a nested structure, whereby the model parameters vary at different levels in the model, including a lower level that describes the actual data and an upper level that influences the values taken by parameters in the lower level. The county-level models were compared using the Watanabe-Akaike information criterion (WAIC) which is derived from the log pointwise predictive density of the models and can be shown to approximate out-of-sample predictive performance. All script files are intended to be used with R statistical software (R Core Team (2017). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org) and Stan probabilistic modeling software (Stan Development Team. 2017. RStan: the R interface to Stan. R package version 2.16.2. http://mc-stan.org).
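The contrast between fully pooled, unpooled, and hierarchical estimates can be sketched with group means. The Python example below is a simplified stand-in, not the data release's R/Stan models: it replaces the Bayesian hierarchical fit with a fixed shrinkage weight `n / (n + k)`, where `k` is a hypothetical constant (a real hierarchical model estimates the amount of shrinkage from the data). The group names and values are invented.

```python
from statistics import mean


def pooling_estimates(groups, k=5.0):
    """Return (pooled, unpooled, partially pooled) mean estimates per group.

    k is a hypothetical shrinkage constant chosen for illustration.
    """
    all_values = [v for values in groups.values() for v in values]
    pooled = mean(all_values)  # fully pooled: group membership ignored
    estimates = {}
    for name, values in groups.items():
        unpooled = mean(values)  # unpooled: only this group's observations
        weight = len(values) / (len(values) + k)
        # Compromise: small groups are pulled more strongly toward the pooled mean
        partial = weight * unpooled + (1 - weight) * pooled
        estimates[name] = (pooled, unpooled, partial)
    return estimates


groups = {
    "arid": [10.0, 12.0],                       # few observations
    "humid": [4.0, 5.0, 6.0, 5.0, 5.0, 5.0],    # more observations
}
estimates = pooling_estimates(groups)
for name, (pooled, unpooled, partial) in estimates.items():
    print(name, round(pooled, 2), round(unpooled, 2), round(partial, 2))
```

The group with fewer observations moves further from its own mean toward the overall mean, which mirrors the compromise a hierarchical model strikes between the two fixed-effects extremes.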
Data from: Generalizable EHR-R-REDCap pipeline for a national...
data.niaid.nih.gov
explore.openaire.eu
+2 more
zip
Updated Jan 9, 2022
Cite
Sophia Shalhout; Farees Saqlain; Kayla Wright; Oladayo Akinyemi; David Miller (2022). Generalizable EHR-R-REDCap pipeline for a national multi-institutional rare tumor patient registry [Dataset]. http://doi.org/10.5061/dryad.rjdfn2zcm
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.rjdfn2zcm
Dataset updated
Jan 9, 2022
Dataset provided by
Massachusetts General Hospital
Harvard Medical School
Authors
Sophia Shalhout; Farees Saqlain; Kayla Wright; Oladayo Akinyemi; David Miller
License
https://spdx.org/licenses/CC0-1.0.html
Description
Objective: To develop a clinical informatics pipeline designed to capture large-scale structured EHR data for a national patient registry.

Materials and Methods: The EHR-R-REDCap pipeline is implemented using R-statistical software to remap and import structured EHR data into the REDCap-based multi-institutional Merkel Cell Carcinoma (MCC) Patient Registry using an adaptable data dictionary.

Results: Clinical laboratory data were extracted from EPIC Clarity across several participating institutions. Labs were transformed, remapped and imported into the MCC registry using the EHR labs abstraction (eLAB) pipeline. Forty-nine clinical tests encompassing 482,450 results were imported into the registry for 1,109 enrolled MCC patients. Data-quality assessment revealed highly accurate, valid labs. Univariate modeling was performed for labs at baseline on overall survival (N=176) using this clinical informatics pipeline.

Conclusion: We demonstrate feasibility of the facile eLAB workflow. EHR data is successfully transformed, and bulk-loaded/imported into a REDCap-based national registry to execute real-world data analysis and interoperability.

Methods

eLAB Development and Source Code (R statistical software):

eLAB is written in R (version 4.0.3), and utilizes the following packages for processing: DescTools, REDCapR, reshape2, splitstackshape, readxl, survival, survminer, and tidyverse. Source code for eLAB can be downloaded directly (https://github.com/TheMillerLab/eLAB).

eLAB reformats EHR data abstracted for an identified population of patients (e.g. medical record numbers (MRN)/name list) under an Institutional Review Board (IRB)-approved protocol. The MCCPR does not host MRNs/names and eLAB converts these to MCCPR assigned record identification numbers (record_id) before import for de-identification.

Functions were written to remap EHR bulk lab data pulls/queries from several sources, including Clarity/Crystal reports or institutional EDW including the Research Patient Data Registry (RPDR) at MGB. The input, a csv/delimited file of labs for user-defined patients, may vary. Thus, users may need to adapt the initial data wrangling script based on the data input format. However, the downstream transformation, code-lab lookup tables, outcomes analysis, and LOINC remapping are standard for use with the provided REDCap Data Dictionary, DataDictionary_eLAB.csv. The available R-markdown (https://github.com/TheMillerLab/eLAB) provides suggestions and instructions on where or when upfront script modifications may be necessary to accommodate input variability.

The eLAB pipeline takes several inputs. For example, the input for use with the ‘ehr_format(dt)’ single-line command is non-tabular data assigned as R object ‘dt’ with 4 columns: 1) Patient Name (MRN), 2) Collection Date, 3) Collection Time, and 4) Lab Results wherein several lab panels are in one data frame cell. A mock dataset in this ‘untidy-format’ is provided for demonstration purposes (https://github.com/TheMillerLab/eLAB).

Bulk lab data pulls often result in subtypes of the same lab. For example, potassium labs are reported as “Potassium,” “Potassium-External,” “Potassium(POC),” “Potassium,whole-bld,” “Potassium-Level-External,” “Potassium,venous,” and “Potassium-whole-bld/plasma.” eLAB utilizes a key-value lookup table with ~300 lab subtypes for remapping labs to the Data Dictionary (DD) code. eLAB reformats/accepts only those lab units pre-defined by the registry DD. The lab lookup table is provided for direct use or may be re-configured/updated to meet end-user specifications. eLAB is designed to remap, transform, and filter/adjust value units of semi-structured/structured bulk laboratory values data pulls from the EHR to align with the pre-defined code of the DD.
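The key-value remapping described above can be pictured as a dictionary lookup followed by a unit filter. The sketch below is written in Python rather than R and is not eLAB's code: the DD code `"potassium"`, the allowed unit `"mmol/L"`, and the function `remap_labs` are hypothetical placeholders, while the lab subtype names are the ones quoted in the description.

```python
# Lookup table: EHR lab subtype -> Data Dictionary code (the code is hypothetical)
LAB_LOOKUP = {
    "Potassium": "potassium",
    "Potassium-External": "potassium",
    "Potassium(POC)": "potassium",
    "Potassium,whole-bld": "potassium",
    "Potassium-Level-External": "potassium",
    "Potassium,venous": "potassium",
    "Potassium-whole-bld/plasma": "potassium",
}

# Units accepted per DD code (hypothetical; the real units are set by the registry DD)
ALLOWED_UNITS = {"potassium": "mmol/L"}


def remap_labs(rows):
    """Keep only labs with a known subtype and a DD-approved unit."""
    remapped = []
    for name, value, unit in rows:
        code = LAB_LOOKUP.get(name.strip())
        if code is None or ALLOWED_UNITS.get(code) != unit:
            continue  # unknown lab subtype or unit not pre-defined by the DD
        remapped.append((code, value, unit))
    return remapped


rows = [
    ("Potassium(POC)", 4.1, "mmol/L"),
    ("Potassium,venous", 3.9, "mmol/L"),
    ("Potassium", 4.0, "mg/dL"),       # unit not allowed, filtered out
    ("Unknown lab", 1.0, "mmol/L"),    # not in the lookup table, filtered out
]
clean = remap_labs(rows)
print(clean)
```

Keeping the mapping in a plain table rather than in code is what makes it straightforward to re-configure for a different registry, as the description suggests.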

Data Dictionary (DD)

EHR clinical laboratory data is captured in REDCap using the ‘Labs’ repeating instrument (Supplemental Figures 1-2). The DD is provided for use by researchers at REDCap-participating institutions and is optimized to accommodate the same lab-type captured more than once on the same day for the same patient. The instrument captures 35 clinical lab types. The DD serves several major purposes in the eLAB pipeline. First, it defines every lab type of interest and associated lab unit of interest with a set field/variable name. It also restricts/defines the type of data allowed for entry for each data field, such as a string or numerics. The DD is uploaded into REDCap by every participating site/collaborator and ensures each site collects and codes the data the same way. Automation pipelines, such as eLAB, are designed to remap/clean and reformat data/units utilizing key-value look-up tables that filter and select only the labs/units of interest. eLAB ensures the data pulled from the EHR contains the correct unit and format pre-configured by the DD. The use of the same DD at every participating site ensures that the data field code, format, and relationships in the database are uniform across each site to allow for the simple aggregation of the multi-site data. For example, since every site in the MCCPR uses the same DD, aggregation is efficient and different site csv files are simply combined.

Study Cohort

This study was approved by the MGB IRB. A search of the EHR was performed to identify patients diagnosed with MCC between 1975 and 2021 (N=1,109) for inclusion in the MCCPR. Subjects diagnosed with primary cutaneous MCC between 2016 and 2019 (N=176) were included in the test cohort for exploratory studies of lab result associations with overall survival (OS) using eLAB.

Statistical Analysis

OS is defined as the time from date of MCC diagnosis to date of death. Data was censored at the date of the last follow-up visit if no death event occurred. Univariable Cox proportional hazard modeling was performed among all lab predictors. Due to the hypothesis-generating nature of the work, p-values were exploratory and Bonferroni corrections were not applied.
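The OS definition and censoring rule above translate into a (time, event) pair per patient, which is the input a Cox model expects. The following Python sketch assumes time is measured in days and uses invented dates; the function name `overall_survival` is hypothetical and this is not the eLAB implementation.

```python
from datetime import date


def overall_survival(diagnosis, death=None, last_follow_up=None):
    """Return (time_in_days, event) where event is 1 for death, 0 for censored."""
    if death is not None:
        return (death - diagnosis).days, 1
    if last_follow_up is None:
        raise ValueError("need a death date or a last follow-up date")
    # No death observed: censor at the last follow-up visit
    return (last_follow_up - diagnosis).days, 0


died = overall_survival(date(2016, 3, 1), death=date(2017, 3, 1))
censored = overall_survival(date(2018, 6, 15), last_follow_up=date(2019, 6, 15))
print(died, censored)
```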
AWC to 60cm DSM data of the Roper catchment NT generated by the Roper River...
data.csiro.au
researchdata.edu.au
Updated Apr 16, 2024
Cite
Ian Watson; Mark Thomas; Seonaid Philip; Uta Stockmann; Ross Searle; Linda Gregory; jason hill; Elisabeth Bui; John Gallant; Peter R Wilson; Peter Wilson (2024). AWC to 60cm DSM data of the Roper catchment NT generated by the Roper River Water Resource Assessment [Dataset]. http://doi.org/10.25919/y0v9-7b58
Unique identifier
https://doi.org/10.25919/y0v9-7b58
Dataset updated
Apr 16, 2024
Dataset provided by
CSIRO (http://www.csiro.au/)
Authors
Ian Watson; Mark Thomas; Seonaid Philip; Uta Stockmann; Ross Searle; Linda Gregory; jason hill; Elisabeth Bui; John Gallant; Peter R Wilson; Peter Wilson
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Jul 1, 2020 - Jun 30, 2023
Area covered

Dataset funded by
CSIRO (http://www.csiro.au/)
Northern Territory Department of Environment, Parks and Water Security
Description
AWC to 60cm is one of 18 attributes of soils chosen to underpin the land suitability assessment of the Roper River Water Resource Assessment (ROWRA) through the digital soil mapping (DSM) process. AWC (available water capacity) indicates the ability of a soil to retain and supply water for plant growth. This AWC raster data represents a modelled dataset of AWC to 60cm (mm of water to 60cm of soil depth) and is derived from analysed site data, spline calculations and environmental covariates. AWC is a parameter used in land suitability assessments for rainfed cropping and for water use efficiency in irrigated land uses. This raster data provides improved soil information used to underpin and identify opportunities and promote detailed investigation for a range of sustainable regional development options, and was created within the ‘Land Suitability’ activity of the CSIRO ROWRA. A companion dataset and statistics reflecting the reliability of this data are also provided and are described in the lineage section of this metadata record. Processing information is supplied in ranger R scripts, and attributes were modelled using a Random Forest approach. The DSM process is described in the CSIRO ROWRA published report ‘Soils and land suitability for the Roper catchment, Northern Territory’, a technical report from the CSIRO Roper River Water Resource Assessment to the Government of Australia. The Roper River Water Resource Assessment provides a comprehensive overview and integrated evaluation of the feasibility of aquaculture and agriculture development in the Roper catchment NT as well as the ecological, social and cultural (indigenous water values, rights and aspirations) impacts of development.

Lineage: This AWC to 60cm dataset has been generated from a range of inputs and processing steps. An overview follows. For more information refer to the CSIRO ROWRA published reports, in particular ‘Soils and land suitability for the Roper catchment, Northern Territory’, a technical report from the CSIRO Roper River Water Resource Assessment to the Government of Australia.

1. Existing data were collated (relating to soils, climate, topography, natural resources and remote sensing, in various formats: reports, spatial vector, spatial raster, etc.).
2. Additional soil and land attribute site data locations were selected by a conditioned Latin hypercube statistical sampling method applied across the covariate data space.
3. Fieldwork was carried out to collect new attribute data and soil samples for analysis, and to build an understanding of geomorphology and landscape processes.
4. Database analysis was performed to extract the data according to the specific selection criteria required for the attribute to be modelled.
5. The R statistical programming environment was used for the attribute computing. Models were built from selected input data and covariate data using predictive learning from a Random Forest approach implemented in the ranger R package.
6. The AWC to 60cm Digital Soil Mapping (DSM) attribute raster dataset was created. DSM data is a geo-referenced dataset, generated from field observations and laboratory data, coupled with environmental covariate data through quantitative relationships. It applies pedometrics: the use of mathematical and statistical models that combine information from soil observations with information contained in correlated environmental variables, remote sensing images and some geophysical measurements.
7. Companion predicted reliability data was produced from the 500 individual Random Forest attribute models created.
8. Quality assessment (QA) of this DSM attribute data was conducted by three methods.

Method 1: Statistical (quantitative) assessment of the model and input data. The quality of the DSM models was tested using data withheld from model computations and expressed as OOB and R squared results, giving an estimate of the reliability of the model predictions. These results are supplied.

Method 2: Statistical (quantitative) assessment of the spatial attribute output data, presented as a raster of the attribute's “reliability”. This used the 500 individual trees of the attribute's RF models to generate 500 datasets of the attribute to estimate model reliability for each attribute. For continuous attributes the method for estimating reliability is the Coefficient of Variation. This data is supplied.

Method 3: Collection of independent external validation site data combined with on-ground expert (qualitative) examination of outputs during validation field trips. Across each of the study areas a two-week validation field trip was conducted using a new validation site set, which was produced by a random sampling design based on conditioned Latin Hypercube sampling using the reliability data of the attribute. The modelled DSM attribute value was assessed against the actual on-ground value. These results are published in the report cited in this metadata record.
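The reliability measure in Method 2 (a Coefficient of Variation across per-tree predictions) can be sketched for a couple of grid cells as follows. This Python example is only an illustration of the statistic, not the supplied ranger R scripts: it uses four invented predictions per cell instead of 500, the cell names are hypothetical, and it assumes the population standard deviation (the source does not say which variant was used).

```python
from statistics import mean, pstdev


def coefficient_of_variation(predictions):
    """Standard deviation of per-tree predictions divided by their mean."""
    m = mean(predictions)
    if m == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return pstdev(predictions) / m  # population SD is an assumption here


# Hypothetical per-tree AWC predictions (mm) for two raster cells
tree_predictions = {
    "cell_a": [80.0, 82.0, 78.0, 80.0],    # trees agree closely
    "cell_b": [40.0, 120.0, 60.0, 100.0],  # trees disagree strongly
}
reliability = {
    cell: coefficient_of_variation(preds)
    for cell, preds in tree_predictions.items()
}
print(reliability)
```

Both cells have the same mean prediction, but the cell where the trees disagree gets a much larger coefficient of variation, flagging a less reliable estimate.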
R 600 Import Data India, R 600 Customs Import Shipment Data
seair.co.in
+ more versions
Cite
Seair Exim, R 600 Import Data India, R 600 Customs Import Shipment Data [Dataset]. https://www.seair.co.in
Available download formats: .bin, .xml, .csv, .xls
Dataset provided by
Seair Exim Solutions
Authors
Seair Exim
Area covered
India
Description
Subscribers can find out export and import data of 23 countries by HS code or product’s name. This demo is helpful for market analysis.
Food and Agriculture Biomass Input–Output (FABIO) database
data.niaid.nih.gov
zenodo.org
Updated Jun 8, 2022
Cite
Kuschnig, Nikolas (2022). Food and Agriculture Biomass Input–Output (FABIO) database [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_2577066
Dataset updated
Jun 8, 2022
Dataset provided by
Bruckner, Martin
Kuschnig, Nikolas
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
This data repository provides the Food and Agriculture Biomass Input Output (FABIO) database, a global set of multi-regional physical supply-use and input-output tables covering global agriculture and forestry.

The work is based on mostly freely available data from FAOSTAT, IEA, EIA, and UN Comtrade/BACI. FABIO currently covers 191 countries + RoW, 118 processes and 125 commodities (raw and processed agricultural and food products) for 1986-2013. All R code and auxiliary data are available on GitHub. For more information please refer to https://fabio.fineprint.global.

The database consists of the following main components, in compressed .rds format:

Z: the inter-commodity input-output matrix, displaying the relationships of intermediate use of each commodity in the production of each commodity, in physical units (tons). The matrix has 24000 rows and columns (125 commodities x 192 regions), and is available in two versions, based on the method to allocate inputs to outputs in production processes: Z_mass (mass allocation) and Z_value (value allocation). Note that the row sums of the Z matrix (= total intermediate use by commodity) are identical in both versions.

Y: the final demand matrix, denoting the consumption of all 24000 commodities by destination country and final use category. There are six final use categories (yielding 192 x 6 = 1152 columns): 1) food use, 2) other use (non-food), 3) losses, 4) stock addition, 5) balancing, and 6) unspecified.

X: the total output vector of all 24000 commodities. Total output is equal to the sum of intermediate and final use by commodity.

L: the Leontief inverse, computed as $(I - A)^{-1}$, where A is the matrix of input coefficients derived from Z and X. Again, there are two versions, depending on the underlying version of Z (L_mass and L_value).

E: environmental extensions for each of the 24000 commodities, including four resource categories: 1) primary biomass extraction (in tons), 2) land use (in hectares), 3) blue water use (in m³), and 4) green water use (in m³).

mr_sup_mass/mr_sup_value: For each allocation method (mass/value), the supply table gives the physical supply quantity of each commodity by producing process, with processes in the rows (118 processes x 192 regions = 22656 rows) and commodities in columns (24000 columns).

mr_use: the use table captures the quantities of each commodity (rows) used as an input in each process (columns).
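To make the relationship between Z, Y, X and L in the component list above concrete, here is a small two-commodity Python sketch. It assumes the standard input-output convention that the input coefficients are A[i][j] = Z[i][j] / x[j]; the description only says A is "derived from" Z and the output vector, so treat that as an assumption. All numbers are invented, and the helper names (`input_coefficients`, `invert`, `leontief_inverse`) are hypothetical. The real matrices are 24000 x 24000 and sparse, so this dense pure-Python inversion is for illustration only.

```python
def input_coefficients(Z, x):
    """A[i][j] = Z[i][j] / x[j]: input of commodity i per unit output of j."""
    n = len(x)
    return [[Z[i][j] / x[j] for j in range(n)] for i in range(n)]


def invert(M):
    """Gauss-Jordan inversion with partial pivoting for a small dense matrix."""
    n = len(M)
    aug = [
        [float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
        for i, row in enumerate(M)
    ]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def leontief_inverse(Z, x):
    """L = (I - A)^-1 with A computed from Z and x."""
    A = input_coefficients(Z, x)
    n = len(x)
    i_minus_a = [
        [(1.0 if i == j else 0.0) - A[i][j] for j in range(n)] for i in range(n)
    ]
    return invert(i_minus_a)


Z = [[10.0, 20.0], [30.0, 10.0]]  # intermediate use (invented values, tons)
y = [70.0, 60.0]                  # final demand summed over all use categories
x = [sum(Z[i]) + y[i] for i in range(2)]  # total output = intermediate + final use

L = leontief_inverse(Z, x)
x_check = [sum(L[i][j] * y[j] for j in range(2)) for i in range(2)]
print(L, x_check)
```

Multiplying L by final demand reproduces total output, which is the identity that footprint calculations build on.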

A description of the included countries and commodities (i.e. the rows and columns of the Z matrix) can be found in the auxiliary file io_codes.csv. Separate lists of the country sample (including ISO3 codes and continental grouping) and commodities (including moisture content) are given in the files regions.csv and items.csv, respectively. For information on the individual processes, see auxiliary file su_codes.csv. RDS files can be opened in R. Information on how to read these files can be obtained here: https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/readRDS

Except for X.rds, which contains a matrix, all variables are organized as lists, where each element contains a sparse matrix. Please note that values are always given in physical units, i.e. tonnes or head, as specified in items.csv. The suffixes value and mass only indicate the form of allocation chosen for the construction of the symmetric IO tables (for more details see Bruckner et al. 2019). Product, process and country classifications can be found in the file fabio_classifications.xlsx.

Footprint results are not contained in the database but can be calculated, e.g. by using this script: https://github.com/martinbruckner/fabio_comparison/blob/master/R/fabio_footprints.R

How to cite:

To cite FABIO work please refer to this paper:

Bruckner, M., Wood, R., Moran, D., Kuschnig, N., Wieland, H., Maus, V., Börner, J. 2019. FABIO – The Construction of the Food and Agriculture Input–Output Model. Environmental Science & Technology 53(19), 11302–11312. DOI: 10.1021/acs.est.9b03554

License:

This data repository is distributed under the CC BY-NC-SA 4.0 License. You are free to share and adapt the material for non-commercial purposes using proper citation. If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original. In case you are interested in a collaboration, I am happy to receive enquiries at martin.bruckner@wu.ac.at.

Known issues:

The underlying FAO data have been manipulated to the minimum extent necessary. Data filling and supply-use balancing, however, required some adaptations. These are documented in the code and are also reflected in the balancing item in the final demand matrices. For proper use of the database, I recommend distributing the balancing item proportionally over all other uses and running analyses with and without balancing to illustrate uncertainties.
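
The recommendation to distribute the balancing item proportionally can be sketched as follows. This is only an illustration under stated assumptions: the use categories, the key name "balancing", and the numbers are hypothetical and do not reflect the actual column names of the final demand matrices.

```python
def distribute_balancing(uses, balancing_key="balancing"):
    """Spread the balancing item over all other uses in proportion to their size."""
    balancing = uses[balancing_key]
    others = {k: v for k, v in uses.items() if k != balancing_key}
    total = sum(others.values())
    if total == 0:
        # Nothing to allocate to, so leave the row unchanged.
        return dict(uses)
    return {k: v + balancing * v / total for k, v in others.items()}


# Hypothetical final demand row for one commodity (physical units).
row = {"food": 60.0, "other": 30.0, "stock_addition": 10.0, "balancing": 20.0}
adjusted = distribute_balancing(row)
```

The adjusted row keeps the same total as the original row, so an analysis can be run once on `row` and once on `adjusted` to see how sensitive the results are to the balancing item.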
World Input-Output Database, 2021 Release, 1965-2000, Long-run WIOD
test.dataverse.nl
dataverse.nl
csv, pdf, xlsx
Updated Mar 28, 2022
Cite
Pieter Woltjer; Reitze Gouma; Marcel P. Timmer (2022). World Input-Output Database, 2021 Release, 1965-2000, Long-run WIOD [Dataset]. http://doi.org/10.34894/A7AXDN
Available download formats: pdf(24967), xlsx(176646080), xlsx(7260079), csv(1776042757), xlsx(10079684), xlsx(9445396), csv(40131224), xlsx(176466965), csv(682506382), xlsx(8714473)
Unique identifier
https://doi.org/10.34894/A7AXDN
Dataset updated
Mar 28, 2022
Dataset provided by
DataverseNL (test)
Authors
Pieter Woltjer; Reitze Gouma; Marcel P. Timmer
License
https://tdvnl.dans.knaw.nl/api/datasets/:persistentId/versions/2.0/customlicense?persistentId=doi:10.34894/A7AXDN
Description
The Long-run WIOD covers the period 1965-2000. It includes World Input-Output Tables (WIOTs) that cover 25 countries and a model for the rest of the world. Data are classified into 23 sectors according to the International Standard Industrial Classification revision 3.1 (ISIC Rev. 3.1). The tables adhere to the 1993 version of the SNA. The WIOTs are available in millions of US dollars, as well as in millions of dollars of the previous year (previous-year prices). The input data used to construct the tables are available, as well as National Input-Output Tables (NIOTs) derived from the WIOTs, and Socio Economic Accounts (SEA), consistent with the data in the input-output tables. When using this database, a reference should be made to the following paper: Woltjer, P., Gouma, R. and Timmer, M. P. (2021), "Long-run World Input-Output Database: Version 1.0 Sources and Methods", GGDC Research Memorandum 190
Data to Incorporate Water Quality Analysis into Navigation Assessments as...
catalog.data.gov
Updated Jul 6, 2024
Cite
U.S. Geological Survey (2024). Data to Incorporate Water Quality Analysis into Navigation Assessments as Demonstrated in the Mississippi River Basin [Dataset]. https://catalog.data.gov/dataset/data-to-incorporate-water-quality-analysis-into-navigation-assessments-as-demonstrated-in-
Dataset updated
Jul 6, 2024
Dataset provided by
U.S. Geological Survey
Area covered
Mississippi River
Description
This data release includes estimates of annual and monthly mean concentrations and fluxes for nitrate plus nitrite, orthophosphate and suspended sediment for nine sites in the Mississippi River Basin (MRB) produced using the Weighted Regressions on Time, Discharge, and Season (WRTDS) model (Hirsch and De Cicco, 2015). It also includes a model archive (R scripts and readMe file) used to retrieve and format the model input data and run the model. Input data, including discrete concentrations and daily mean streamflow, were retrieved from the National Water Quality Network (https://doi.org/10.5066/P9AEWTB9). Annual and monthly estimates range from water year 1975 through water year 2019 (i.e. October 1, 1974 through September 30, 2019). Annual trends were estimated for three trend periods per parameter. The length of record at some sites required variations in the trend start year. For nitrate plus nitrite, the following trend periods were used at all sites: 1980-2019, 1980-2010 and 2010-2019. For orthophosphate, the same trend periods were used but with 1982 as the start year instead of 1980. For suspended sediment, 1997 was used as the start year for the upper MRB sites and the St. Francisville (MS-STFR) site, but 1980 was used for the rest of the sites. All parameters and sites used 2010 as the start year for the last 10-year trend period. Reference: Hirsch, R.M., and De Cicco, L.A., 2015, User guide to Exploration and Graphics for RivEr Trends (EGRET) and dataRetrieval: R packages for hydrologic data (version 2.0, February 2015): U.S. Geological Survey Techniques and Methods book 4, chap. A10, 93 p., doi:10.3133/tm4A10
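
The water-year convention used above (water year 1975 runs from October 1, 1974 through September 30, 1975) can be expressed as a small helper. The function name is an assumption for illustration and is not part of the model archive.

```python
from datetime import date


def water_year(d):
    """Return the water year of a date: October to December belong to the next calendar year."""
    return d.year + 1 if d.month >= 10 else d.year


first_day = water_year(date(1974, 10, 1))   # start of the record described above
last_day = water_year(date(2019, 9, 30))    # end of the record described above
```

With this convention, `first_day` is 1975 and `last_day` is 2019, matching the range stated in the description.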
Data for use in poscrptR post-fire conifer regeneration prediction model
catalog.data.gov
data.usgs.gov
Updated Jul 6, 2024
Cite
U.S. Geological Survey (2024). Data for use in poscrptR post-fire conifer regeneration prediction model [Dataset]. https://catalog.data.gov/dataset/data-for-use-in-poscrptr-post-fire-conifer-regeneration-prediction-model
Dataset updated
Jul 6, 2024
Dataset provided by
U.S. Geological Survey
Description
These data support poscrptR (Wright et al. 2021). poscrptR is a shiny app that predicts the probability of post-fire conifer regeneration for fire data supplied by the user. The predictive model was fit using presence/absence data collected in 4.4 m radius plots (60 square meters). Please refer to Stewart et al. (2020) for more details concerning field data collection, the model fitting process, and limitations. Learn more about shiny apps at https://shiny.rstudio.com. The app is designed to simplify the process of predicting post-fire conifer regeneration under different precipitation and seed production scenarios. The app requires the user to upload two input data sets: 1. a raster of Relativized differenced Normalized Burn Ratio (RdNBR), and 2. a .zip folder containing a fire perimeter shapefile. The app was designed to use Rapid Assessment of Vegetative Condition (RAVG) data inputs. The RAVG website (https://fsapps.nwcg.gov/ravg) has both RdNBR and fire perimeter data sets available for all fires with at least 1,000 acres of National Forest land from 2007 to the present. The fire perimeter must be a zipped shapefile (.zip file that includes all shapefile components: .cpg, .dbf, .prj, .sbn, .sbx, .shp, and .shx). RdNBR must be at 30 m resolution, and both the RdNBR and fire perimeter must use the USA Contiguous Albers Equal Area Conic coordinate reference system (USGS version). RdNBR must be aligned with (same origin as) the RAVG raster data. References: Stewart, J., van Mantgem, P., Young, D., Shive, K., Preisler, H., Das, A., Stephenson, N., Keeley, J., Safford, H., Welch, K., Thorne, J., 2020. Effects of postfire climate and seed availability on postfire conifer regeneration. Ecological Applications. Wright, M.C., Stewart, J.E., van Mantgem, P.J., Young, D.J., Shive, K.L., Preisler, H.K., Das, A.J., Stephenson, N.L., Keeley, J.E., Safford, H.D., Welch, K.R., and Thorne, J.H. 2021. poscrptR. R package version 0.1.3.
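
As a quick arithmetic check of the plot size quoted above, the area of a circular plot with a 4.4 m radius can be computed directly. The variable names are illustrative only.

```python
import math

radius_m = 4.4
# Area of a circle: pi * r^2
area_m2 = math.pi * radius_m ** 2
print(f"Plot area: {area_m2:.1f} square meters")
```

The result is slightly under 61 square meters, consistent with the rounded figure of 60 square meters given in the description.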
R Lubricant Import Data India, R Lubricant Customs Import Shipment Data
seair.co.in
Updated Nov 22, 2016
Cite
Seair Exim (2016). R Lubricant Import Data India, R Lubricant Customs Import Shipment Data [Dataset]. https://www.seair.co.in
Available download formats: .bin, .xml, .csv, .xls
Dataset updated
Nov 22, 2016
Dataset provided by
Seair Exim Solutions
Authors
Seair Exim
Area covered
India
Description
Subscribers can look up export and import data for 23 countries by HS code or product name. This demo is helpful for market analysis.
R Ft3 Import Data India, R Ft3 Customs Import Shipment Data
seair.co.in
Cite
Seair Exim, R Ft3 Import Data India, R Ft3 Customs Import Shipment Data [Dataset]. https://www.seair.co.in
Available download formats: .bin, .xml, .csv, .xls
Dataset provided by
Seair Exim Solutions
Authors
Seair Exim
Area covered
India
Description
Subscribers can look up export and import data for 23 countries by HS code or product name. This demo is helpful for market analysis.
R Kleen Import Data India, R Kleen Customs Import Shipment Data
seair.co.in
Cite
Seair Exim, R Kleen Import Data India, R Kleen Customs Import Shipment Data [Dataset]. https://www.seair.co.in
Available download formats: .bin, .xml, .csv, .xls
Dataset provided by
Seair Exim Solutions
Authors
Seair Exim
Area covered
India
Description
Subscribers can look up export and import data for 23 countries by HS code or product name. This demo is helpful for market analysis.
Input data for predicted dry bulk densities on the Norwegian continental...
zenodo.org
zip
Updated May 3, 2024
Cite
Markus Diesing (2024). Input data for predicted dry bulk densities on the Norwegian continental margin [Dataset]. http://doi.org/10.5281/zenodo.10057726
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.10057726
Dataset updated
May 3, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Markus Diesing
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Input data relating to R workflow for predicting dry bulk densities on the Norwegian continental margin (https://github.com/diesing-ngu/DBD). The following files are included:
DBD_2023-07-21.csv - Data on dry bulk densities in surface sediments
predictors_ngb.tif - Multi-band georeferenced TIFF-file of predictor variables
predictors_description.txt - Information on variables stored in predictors_ngb.tif including units, statistics, time period and sources.
GrainSizeReg_folk8_classes_2023-06-28.tif - Georeferenced TIFF-file of predicted substrate classes. Used to update the area of interest (exclude areas mapped as Rock and boulders).
GrainSizeReg_folk8_probabilities_2023-06-28.tif - Georeferenced TIFF-file of the prediction probabilities of the predicted substrate classes. Used as additional predictors.
Statistical Methods in Water Resources - Supporting Materials
catalog.data.gov
data.usgs.gov
Updated Jul 6, 2024
Cite
U.S. Geological Survey (2024). Statistical Methods in Water Resources - Supporting Materials [Dataset]. https://catalog.data.gov/dataset/statistical-methods-in-water-resources-supporting-materials
Dataset updated
Jul 6, 2024
Dataset provided by
United States Geological Survey (http://www.usgs.gov/)
Description
This dataset contains all of the supporting materials to accompany Helsel, D.R., Hirsch, R.M., Ryberg, K.R., Archfield, S.A., and Gilroy, E.J., 2020, Statistical methods in water resources: U.S. Geological Survey Techniques and Methods, book 4, chapter A3, 454 p., https://doi.org/10.3133/tm4a3. [Supersedes USGS Techniques of Water-Resources Investigations, book 4, chapter A3, version 1.1.] Supplemental material (SM) for each chapter is available to re-create all examples and figures, and to solve the exercises at the end of each chapter, with relevant datasets provided in an electronic format readable by R. The SM provide (1) datasets as .Rdata files for immediate input into R, (2) datasets as .csv files for input into R or for use with other software programs, (3) R functions that are used in the textbook but are not part of a published R package, (4) R scripts to produce virtually all of the figures in the book, and (5) solutions to the exercises as .html and .Rmd files. The suffix .Rmd refers to the file format for code written in the R Markdown language; the .Rmd file that is provided in the SM was used to generate the .html file containing the solutions to the exercises. None of the data used in the in-text examples, figures, and exercises are new; all are already available through public data portals.
Namoi AWRA-R (restricted input data implementation)
demo.dev.magda.io
data.gov.au
Updated Aug 8, 2023
Cite
Bioregional Assessment Program (2023). Namoi AWRA-R (restricted input data implementation) [Dataset]. https://demo.dev.magda.io/dataset/ds-dga-ee0710b9-e71c-43e7-b91e-389fc751f137
Dataset updated
Aug 8, 2023
Dataset provided by
Bioregional Assessment Program
Description
Abstract: The dataset was compiled by the Bioregional Assessment Programme from multiple sources referenced within the dataset and/or metadata. The processes undertaken to compile this dataset are described in the History field in this metadata statement. Namoi AWRA-R (restricted input data implementation): This dataset was supplied to the Bioregional Assessment Programme by DPI Water (NSW Government). Metadata was not provided and has been compiled by the Bioregional Assessment Programme based on known details at the time of acquisition. The metadata within the dataset contains the restricted input data to implement the Namoi AWRA-R model for model calibration or simulation. The restricted input contains simulated time-series extracted from the Namoi Integrated Quantity and Quality Model (IQQM) including: irrigation and other diversions (town water supply, mining), reservoir information (volumes, inflows, releases) and allocation information. Each sub-folder in the associated data has a readme file indicating folder contents and providing general instructions about the use of the data and how it was sourced. Detailed documentation of the AWRA-R model is provided in: https://publications.csiro.au/rpr/download?pid=csiro:EP154523&dsid=DS2 Documentation about the implementation of AWRA-R in the Namoi bioregion is provided in BA NAM 2.6.1.3 and 2.6.1.4 products. Purpose: The resource is used in the development of river system models. Dataset History: This dataset was supplied to the Bioregional Assessment Programme by DPI Water (NSW Government). The data was extracted from the IQQM interface and formatted accordingly.
It is considered a source dataset because the IQQM model cannot be registered as it was provided under a formal agreement between CSIRO and DPI Water with confidentiality clauses. Dataset Citation Bioregional Assessment Programme (2017) Namoi AWRA-R (restricted input data implementation). Bioregional Assessment Source Dataset. Viewed 12 March 2019, http://data.bioregionalassessments.gov.au/dataset/04fc0b56-ba1d-4981-aaf2-ca6c8eaae609.
Dataset: "Soils and topography control natural disturbance rates and thereby...
smithsonian.figshare.com
bin
Updated May 29, 2024
Cite
Katherine Cushman; Helene C. Muller-Landau; Matteo Detto; Milton Garcia (2024). Dataset: "Soils and topography control natural disturbance rates and thereby forest structure in a lowland tropical landscape" [Dataset]. http://doi.org/10.25573/data.17102600.v2
Available download formats: bin
Unique identifier
https://doi.org/10.25573/data.17102600.v2
Dataset updated
May 29, 2024
Dataset provided by
Smithsonian Tropical Research Institute
Authors
Katherine Cushman; Helene C. Muller-Landau; Matteo Detto; Milton Garcia
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset contains raw data, processed data, and code associated with the manuscript "Soils and topography control natural disturbance rates and thereby forest structure in a lowland tropical landscape". This project quantifies canopy disturbances between 2015 and 2020 across Barro Colorado Island, Panama, using drone photogrammetry.
Raw image/point cloud data
- Raw drone images are in folders "DroneImages_YEAR". Agisoft Metashape project files referencing files as organized in these folders are also included in these folders.
- Raw image orthomosaics are in folder "DroneOrthomosaics"
- Raw (unaligned) point clouds are in the folder "PointClouds/Raw"
Processed point cloud data
- Processed (aligned, tiled) point clouds from lidar (2009) and photogrammetry (2015, 2018, 2020) are in the folder "PointClouds/Processed"
Analysis data
All data files directly used in analyses are included in folders starting with "Data_".
- Data_Ancillary: shapefiles for soils (BCI_Soils), forest age (Enders_Forest_Age_1935), streams (StreamShapefile), 50 ha plot outline (BCI50ha), and island outline minus 25 m buffer (BCI_Outline_Minus25); information for blocks used for bootstrapping size frequency distributions (bootstrapBlocks.csv) and for aligning data in CloudCompare (gridInfo.csv)
- Data_GapShapefiles: shapefiles for canopy disturbance in each period created in Code_ProcessHeightData/DefineGaps.R
- Data_HeightRasters: height rasters produced in Code_ProcessHeightData/MakeDSMs.R and Code_ProcessHeightData/DefineGaps.R. Also includes previously created digital elevation model from 2009 lidar (LidarDEM_BCI.tif) and a digital surface model from higher-res 50 ha plot data (DSM_50haPlot_20150629_geo).
- Data_INLA: input data for INLA models created in Code_INLA/setupINLA.R (INLA_40m.RData) and .RData objects with results from all INLA models (described in Code_INLA/setupINLA.R). Results from sensitivity analysis for best curvature/slope smoothing scale (INLA_SmoothingScaleResults.csv, also output from Code_INLA/setupINLA.R).
- Data_QAQC: All data used to create cloud and QAQC masks (cloud raster products output from ArcGISPro, other rasters from Code_ProcessHeightData/MakeDSMs.R) in file Code_QAQC/AnnualQualityMasks, and resulting mask rasters. Also includes manually annotated gap shapefiles from the 50 ha plot for 2015-2018 from this project (QAQC_IslandData) and from the higher-res monthly data for the 50 ha plot (QAQC_50haData)
- Data_TopographyRasters: lidar DEM smoothed at different scales (output from Code_ProcessHeightData/SmoothDEMs.R), resulting curvature and slope smoothed rasters (output from ArcGISPro), and height above nearest drainage raster (distAboveStream_1000.tif, output from ArcGISPro).
Code
All code is in a zipped copy of the GitHub repository https://github.com/kccushman/BCI_Photogrammetry, saved at the time of publication (Code_GitHubRepository.zip). Code scripts reference all data files from the folders in which they are organized here.
- Code_AlignDroneData: R script for tiling raw point cloud data, and .bat files for aligning point cloud tiles in CloudCompare's command line tools.
- Code_GapSizeFrequency: R scripts for fitting and plotting gap size frequency data from gap rasters and shapefiles.
- Code_INLA: R scripts for configuring data for INLA models, running INLA models, and analyzing INLA results.
- Code_MakeFigures: R scripts for making main and supplemental figures.
- Code_ProcessHeightData: R scripts for making canopy height rasters from point cloud data, defining gap rasters/polygons from canopy height data, and smoothing digital elevation model (DEM) topography data from 2009 lidar.
- Code_QAQC: R scripts to make data quality masks based on cloud and photogrammetric reconstruction quality, and to find the optimal height correction for 2015 data with lower image overlap.
Data and Code for "A Ray-Based Input Distance Function to Model Zero-Valued...
data.niaid.nih.gov
zenodo.org
Updated Jun 17, 2023
Cite
Henningsen, Arne (2023). Data and Code for "A Ray-Based Input Distance Function to Model Zero-Valued Output Quantities: Derivation and an Empirical Application" [Dataset]. https://data.niaid.nih.gov/resources?id=ZENODO_7882078
Explore at:
Dataset updated
Jun 17, 2023
Dataset provided by
Henningsen, Arne
Price, Juan José
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This data and code archive provides all the data and code for replicating the empirical analysis that is presented in the journal article "A Ray-Based Input Distance Function to Model Zero-Valued Output Quantities: Derivation and an Empirical Application" authored by Juan José Price and Arne Henningsen and published in the Journal of Productivity Analysis (DOI: 10.1007/s11123-023-00684-1).

We conducted the empirical analysis with the "R" statistical software (version 4.3.0) using the add-on packages "combinat" (version 0.0.8), "miscTools" (version 0.6.28), "quadprog" (version 1.5.8), "sfaR" (version 1.0.0), "stargazer" (version 5.2.3), and "xtable" (version 1.8.4), which are available on CRAN. We created the R package "micEconDistRay", which provides the functions for empirical analyses with ray-based input distance functions that we developed for the above-mentioned paper. This R package is also available on CRAN (https://cran.r-project.org/package=micEconDistRay).

This replication package contains the following files and folders:

README This file

MuseumsDk.csv The original data obtained from the Danish Ministry of Culture and from Statistics Denmark. It includes the following variables:

museum: Name of the museum.

type: Type of museum (Kulturhistorisk museum = cultural history museum; Kunstmuseer = arts museum; Naturhistorisk museum = natural history museum; Blandet museum = mixed museum).

munic: Municipality, in which the museum is located.

yr: Year of the observation.

units: Number of visit sites.

resp: Whether or not the museum has special responsibilities (0 = no special responsibilities; 1 = at least one special responsibility).

vis: Number of (physical) visitors.

aarc: Number of articles published (archeology).

ach: Number of articles published (cultural history).

aah: Number of articles published (art history).

anh: Number of articles published (natural history).

exh: Number of temporary exhibitions.

edu: Number of primary school classes on educational visits to the museum.

ev: Number of events other than exhibitions.

ftesc: Scientific labor (full-time equivalents).

ftensc: Non-scientific labor (full-time equivalents).

expProperty: Running and maintenance costs [1,000 DKK].

expCons: Conservation expenditure [1,000 DKK].

ipc: Consumer Price Index in Denmark (the value for year 2014 is set to 1).

prepare_data.R This R script imports the data set MuseumsDk.csv, prepares it for the empirical analysis (e.g., removing unsuitable observations, preparing variables), and saves the resulting data set as DataPrepared.csv.

DataPrepared.csv This data set is prepared and saved by the R script prepare_data.R. It is used for the empirical analysis.

make_table_descriptive.R This R script imports the data set DataPrepared.csv and creates the LaTeX table /tables/table_descriptive.tex, which provides summary statistics of the variables that are used in the empirical analysis.

IO_Ray.R This R script imports the data set DataPrepared.csv, estimates a ray-based Translog input distance function with the 'optimal' ordering of outputs, imposes monotonicity on this distance function, creates the LaTeX table /tables/idfRes.tex that presents the estimated parameters of this function, and creates several figures in the folder /figures/ that illustrate the results.

IO_Ray_ordering_outputs.R This R script imports the data set DataPrepared.csv, estimates a ray-based Translog input distance function and imposes monotonicity for each of the 720 possible orderings of the outputs, and saves all the estimation results as a (huge) R object allOrderings.rds.

allOrderings.rds (not included in the ZIP file, uploaded separately) This is a saved R object created by the R script IO_Ray_ordering_outputs.R that contains the estimated ray-based Translog input distance functions (with and without monotonicity imposed) for each of the 720 possible orderings.

IO_Ray_model_averaging.R This R script loads the R object allOrderings.rds that contains the estimated ray-based Translog input distance functions for each of the 720 possible orderings, does model averaging, and creates several figures in the folder /figures/ that illustrate the results.

/tables/ This folder contains the two LaTeX tables table_descriptive.tex and idfRes.tex (created by R scripts make_table_descriptive.R and IO_Ray.R, respectively) that provide summary statistics of the data set and the estimated parameters (without and with monotonicity imposed) for the 'optimal' ordering of outputs.

/figures/ This folder contains 48 figures (created by the R scripts IO_Ray.R and IO_Ray_model_averaging.R) that illustrate the results obtained with the 'optimal' ordering of outputs and the model-averaged results and that compare these two sets of results.
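
The figure of 720 possible orderings mentioned in the file list above equals 6!, which would be consistent with six outputs being ordered. That count is an inference from the number, not something the description states, and the output labels below are placeholders rather than the variables used in the paper. A minimal sketch of enumerating such orderings:

```python
from itertools import permutations
from math import factorial

# Placeholder labels; the actual outputs used in the paper are not listed here.
outputs = ["out1", "out2", "out3", "out4", "out5", "out6"]

# Every ordering of the outputs is one permutation of the list.
orderings = list(permutations(outputs))
n_orderings = len(orderings)
```

A script that estimates one model per ordering would loop over `orderings` and store each result, which explains why the saved object grows large.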
Data from: A systematic evaluation of normalization methods and probe...
data.niaid.nih.gov
datadryad.org
zip
Updated May 30, 2023
Cite
H. Welsh; C. M. P. F. Batalha; W. Li; K. L. Mpye; N. C. Souza-Pinto; M. S. Naslavsky; E. J. Parra (2023). A systematic evaluation of normalization methods and probe replicability using infinium EPIC methylation data [Dataset]. http://doi.org/10.5061/dryad.cnp5hqc7v
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.cnp5hqc7v
Dataset updated
May 30, 2023
Dataset provided by
Hospital for Sick Children
Universidade de São Paulo
University of Toronto
Authors
H. Welsh; C. M. P. F. Batalha; W. Li; K. L. Mpye; N. C. Souza-Pinto; M. S. Naslavsky; E. J. Parra
License
https://spdx.org/licenses/CC0-1.0.html
Description
Background The Infinium EPIC array measures the methylation status of > 850,000 CpG sites. The EPIC BeadChip uses a two-array design: Infinium Type I and Type II probes. These probe types exhibit different technical characteristics which may confound analyses. Numerous normalization and pre-processing methods have been developed to reduce probe type bias as well as other issues such as background and dye bias.
Methods This study evaluates the performance of various normalization methods using 16 replicated samples and three metrics: absolute beta-value difference, overlap of non-replicated CpGs between replicate pairs, and effect on beta-value distributions. Additionally, we carried out Pearson’s correlation and intraclass correlation coefficient (ICC) analyses using both raw and SeSAMe 2 normalized data.
Results The method we define as SeSAMe 2, which consists of the application of the regular SeSAMe pipeline with an additional round of QC, pOOBAH masking, was found to be the best-performing normalization method, while quantile-based methods were found to be the worst performing methods. Whole-array Pearson’s correlations were found to be high. However, in agreement with previous studies, a substantial proportion of the probes on the EPIC array showed poor reproducibility (ICC < 0.50). The majority of poor-performing probes have beta values close to either 0 or 1, and relatively low standard deviations. These results suggest that probe reliability is largely the result of limited biological variation rather than technical measurement variation. Importantly, normalizing the data with SeSAMe 2 dramatically improved ICC estimates, with the proportion of probes with ICC values > 0.50 increasing from 45.18% (raw data) to 61.35% (SeSAMe 2).

Methods

Study Participants and Samples

The whole blood samples were obtained from the Health, Well-being and Aging (Saúde, Bem-estar e Envelhecimento, SABE) study cohort. SABE is a cohort of census-withdrawn elderly from the city of São Paulo, Brazil, followed up every five years since the year 2000, with DNA first collected in 2010. Samples from 24 elderly adults were collected at two time points for a total of 48 samples. The first time point is the 2010 collection wave, performed from 2010 to 2012, and the second time point was set in 2020 in a COVID-19 monitoring project (9±0.71 years apart). The 24 individuals, 13 men and 11 women, were 67.41±5.52 years of age (mean ± standard deviation) at time point one and 76.41±6.17 at time point two.

All individuals enrolled in the SABE cohort provided written consent, and the ethics protocols were approved by local and national institutional review boards COEP/FSP/USP OF.COEP/23/10, CONEP 2044/2014, CEP HIAE 1263-10, University of Toronto RIS 39685.

Blood Collection and Processing

Genomic DNA was extracted from whole peripheral blood samples collected in EDTA tubes. DNA extraction and purification followed the manufacturer’s recommended protocols, using the Qiagen AutoPure LS kit with Gentra automated extraction (first time point) or manual extraction (second time point); the switch was due to discontinuation of the equipment, and the same commercial reagents were used. DNA was quantified using a Nanodrop spectrometer and diluted to 50 ng/µL. To assess the reproducibility of the EPIC array, we also obtained technical replicates for 16 out of the 48 samples, for a total of 64 samples submitted for further analyses. Whole Genome Sequencing data is also available for the samples described above.

Characterization of DNA Methylation using the EPIC array

Approximately 1,000ng of human genomic DNA was used for bisulphite conversion. Methylation status was evaluated using the MethylationEPIC array at The Centre for Applied Genomics (TCAG, Hospital for Sick Children, Toronto, Ontario, Canada), following protocols recommended by Illumina (San Diego, California, USA).

Processing and Analysis of DNA Methylation Data

The R/Bioconductor packages Meffil (version 1.1.0), RnBeads (version 2.6.0), minfi (version 1.34.0) and wateRmelon (version 1.32.0) were used to import, process and perform quality control (QC) analyses on the methylation data. Starting with the 64 samples, we first used Meffil to infer the sex of each sample and compared the inferred sex to the reported sex. Utilizing the 59 SNP probes that are available as part of the EPIC array, we calculated concordance between the methylation intensities of the samples and the corresponding genotype calls extracted from their WGS data. We then performed comprehensive sample-level and probe-level QC using the RnBeads QC pipeline. Specifically, we (1) removed probes if their target sequences overlap with a SNP at any base, (2) removed known cross-reactive probes, (3) used the iterative Greedycut algorithm to filter out samples and probes, using a detection p-value threshold of 0.01, and (4) removed probes if more than 5% of the samples had a missing value. Since RnBeads does not have a function to perform probe filtering based on bead number, we used the wateRmelon package to extract bead numbers from the IDAT files and calculated the proportion of samples with bead number < 3. Probes with more than 5% of samples having low bead number (< 3) were removed. For the comparison of normalization methods, we also computed detection p-values using the empirical distribution of out-of-band probes with the pOOBAH() function in the SeSAMe (version 1.14.2) R package, with a p-value threshold of 0.05, and the combine.neg parameter set to TRUE. In the scenario where pOOBAH filtering was carried out, it was done in parallel with the previously mentioned QC steps, and the probes flagged in either analysis were combined and removed from the data.
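
The bead-number filter described above (remove a probe when more than 5% of samples have fewer than 3 beads) can be sketched in a few lines. This is an illustration only: the study used the wateRmelon R package to extract bead numbers, whereas the function name, probe IDs and counts below are hypothetical.

```python
def probes_to_remove(bead_counts, min_beads=3, max_fraction=0.05):
    """Return probe IDs for which too many samples have a low bead count."""
    removed = []
    for probe_id, counts in bead_counts.items():
        # Count samples whose bead number falls below the minimum.
        n_low = sum(1 for c in counts if c < min_beads)
        if n_low / len(counts) > max_fraction:
            removed.append(probe_id)
    return removed


# Hypothetical bead counts for two probes across 40 samples.
toy_counts = {
    "probe_a": [12] * 39 + [2],        # 1 of 40 samples low (2.5%), kept
    "probe_b": [12] * 37 + [1, 2, 0],  # 3 of 40 samples low (7.5%), removed
}
flagged = probes_to_remove(toy_counts)
```

The same thresholding pattern applies to the missing-value filter in step (4), with "missing" replacing "low bead count" as the per-sample condition.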

Normalization Methods Evaluated

The normalization methods compared in this study were implemented using different R/Bioconductor packages and are summarized in Figure 1. All data were read into the R workspace as RG Channel Sets using minfi’s read.metharray.exp() function. One sample that was flagged during QC was removed, and further normalization steps were carried out on the remaining set of 63 samples. Prior to all normalizations with minfi, probes that did not pass QC were removed. Noob, SWAN, Quantile, Funnorm and Illumina normalizations were implemented using minfi. BMIQ normalization was implemented with ChAMP (version 2.26.0), using as input the raw data produced by minfi’s preprocessRaw() function. In the combination of Noob with BMIQ (Noob+BMIQ), BMIQ normalization was carried out using minfi’s Noob-normalized data as input. Noob normalization was also implemented with SeSAMe, using a nonlinear dye bias correction. For SeSAMe normalization, two scenarios were tested. For both, the inputs were unmasked SigDF Sets converted from minfi’s RG Channel Sets. In the first, which we call “SeSAMe 1”, SeSAMe’s pOOBAH masking was not executed, and the only probes filtered out of the dataset prior to normalization were those that did not pass QC in the previous analyses. In the second scenario, which we call “SeSAMe 2”, pOOBAH masking was carried out on the unfiltered dataset, and masked probes were removed. This was followed by removal of the remaining probes that had not passed the previous QC and had not already been removed by pOOBAH. SeSAMe 2 therefore has two rounds of probe removal. Noob normalization with nonlinear dye bias correction was then carried out on the filtered dataset. Methods were then compared by subsetting the 16 replicated samples and evaluating the effect of each normalization method on the absolute difference of beta values (|Δβ|) between replicated samples.
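The replicate-based comparison can be illustrated with a small Python sketch. It is not the authors' R code: the function name `mean_abs_beta_difference`, the method labels and the beta values are hypothetical, and summarizing the per-probe absolute differences by their mean is an assumption, since the description does not state which summary statistic was used.

```python
from statistics import mean


def mean_abs_beta_difference(betas_1, betas_2):
    """Mean absolute difference of beta values across probes for one replicate pair."""
    if len(betas_1) != len(betas_2):
        raise ValueError("replicates must cover the same probes")
    return mean(abs(a - b) for a, b in zip(betas_1, betas_2))


# Hypothetical beta values (3 probes) for one replicate pair,
# as produced by two placeholder normalization methods.
replicates = {
    "method_x": ([0.10, 0.50, 0.90], [0.12, 0.47, 0.91]),
    "method_y": ([0.10, 0.50, 0.90], [0.20, 0.40, 0.80]),
}

# One score per method: smaller values mean replicates agree more closely.
scores = {
    method: mean_abs_beta_difference(rep_1, rep_2)
    for method, (rep_1, rep_2) in replicates.items()
}
best = min(scores, key=scores.get)
```

Under this assumed summary, the method with the smallest score is the one whose normalized replicates agree most closely; with several replicate pairs, the per-pair scores could be averaged or compared as distributions.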

Data from: Importing General-Purpose Graphics in R

Data from: Optimized SMRT-UMI protocol produces highly accurate sequence...

Input data, model output, and R scripts for a machine learning streamflow...

2010 County and City-Level Water-Use Data and Associated Explanatory...

Data from: Generalizable EHR-R-REDCap pipeline for a national...

AWC to 60cm DSM data of the Roper catchment NT generated by the Roper River...

R 600 Import Data India, R 600 Customs Import Shipment Data

Food and Agriculture Biomass Input–Output (FABIO) database

World Input-Output Database, 2021 Release, 1965-2000, Long-run WIOD

Data to Incorporate Water Quality Analysis into Navigation Assessments as...

Data for use in poscrptR post-fire conifer regeneration prediction model

R Lubricant Import Data India, R Lubricant Customs Import Shipment Data

R Ft3 Import Data India, R Ft3 Customs Import Shipment Data

R Kleen Import Data India, R Kleen Customs Import Shipment Data

Input data for predicted dry bulk densities on the Norwegian continental...

Statistical Methods in Water Resources - Supporting Materials

Namoi AWRA-R (restricted input data implementation)

Dataset: "Soils and topography control natural disturbance rates and thereby...

Data and Code for "A Ray-Based Input Distance Function to Model Zero-Valued...

Data from: A systematic evaluation of normalization methods and probe...
