Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Publication
The repository contains two raw input files (will_INF.txt and go_INF.txt). They represent the co-occurrence frequency of the top-200 infinitival collocates of will and be going to, respectively, across the twenty decades of the Corpus of Historical American English (COHA; from the 1810s to the 2000s). The first script is 1-script-create-input-data-raw.r. It preprocesses and combines the two files into a long-format data frame consisting of the following columns: (i) decade, (ii) coll (for "collocate"), (iii) BE going to (for the frequency of the collocates with be going to) and (iv) will (for the frequency of the collocates with will); the result is available in input_data_raw.txt. The second script, 2-script-create-motion-chart-input-data.R, processes input_data_raw.txt to normalise the co-occurrence frequency of the collocates per million words (the COHA size and normalising base frequency are available in coha_size.txt); its output is input_data_futurate.txt. input_data_futurate.txt contains the relevant input data for generating (i) the static motion chart shown as an image plot in the publication (using the script 3-script-create-motion-chart-plot.R) and (ii) the dynamic motion chart (using the script 4-script-motion-chart-dynamic.R). Open the Future Constructions.Rproj file to start an RStudio session whose working directory is associated with the contents of this repository.
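For orientation, here is a minimal sketch (not the repository's own script) of the per-million-words normalisation performed by 2-script-create-motion-chart-input-data.R; the column names follow the description above, while the delimiter and the structure of coha_size.txt (a decade column plus a corpus-size column) are assumptions.

library(dplyr)
library(readr)

input_raw <- read_tsv("input_data_raw.txt")   # decade, coll, `BE going to`, will (tab-separated assumed)
coha_size <- read_tsv("coha_size.txt")        # assumed columns: decade, size (corpus tokens per decade)

input_futurate <- input_raw %>%
  left_join(coha_size, by = "decade") %>%
  mutate(across(c(`BE going to`, will), ~ .x / size * 1e6))   # normalised frequency per million words

write_tsv(input_futurate, "input_data_futurate.txt")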
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Input dataset for R code (first sheet), and BOLD spreadsheet downloaded on April 11, 2022 (next sheets) for "Facing the Infinity".
Trends in nutrient fluxes and streamflow for selected tributaries in the Lake Erie watershed were calculated using monitoring data at 10 locations. Trends in flow-normalized nutrient fluxes were determined by applying a weighted regression approach called WRTDS (Weighted Regression on Time, Discharge, and Season). Site information and streamflow and water-quality records are contained in 3 zipped files named as follows: INFO (site information), Daily (daily streamflow records), and Sample (water-quality records). The INFO, Daily (flow), and Sample files contain the input data, organized by water-quality parameter and by site as .csv files, used to run the trend analyses. These files were generated by the EGRET (Exploration and Graphics for RivEr Trends, version 2.5.1) R package (Hirsch and De Cicco, 2015) running under R version 3.1.2, and they can be used directly as input to run graphical procedures and WRTDS trend analyses with the EGRET R software. The .csv files are identified according to water-quality parameter (TP, SRP, TN, NO23, and TKN) and site reference number (e.g. TPfiles.1.INFO.csv, SRPfiles.1.INFO.csv, TPfiles.2.INFO.csv, etc.). Water-quality parameter abbreviations and site reference numbers are defined in the file "Site-summary_table.csv" on the landing page, where there is also a site-location map ("Site_map.pdf"). Parameter information details, including abbreviation definitions, appear in the abstract on the landing page. SRP data records were available at only 6 of the 10 trend sites, which are identified in the file "Site-summary_table.csv" (see landing page) as monitored by the organization NCWQR (National Center for Water Quality Research). The SRP sites are: RAIS, MAUW, SAND, HONE, ROCK, and CUYA.
The model-input dataset is presented in 3 parts:
1. INFO.zip (site information)
2. Daily.zip (daily streamflow records)
3. Sample.zip (water-quality records)
Reference: Hirsch, R.M., and De Cicco, L.A., 2015 (revised), User Guide to Exploration and Graphics for RivEr Trends (EGRET) and dataRetrieval: R Packages for Hydrologic Data, Version 2.0: U.S. Geological Survey Techniques and Methods, book 4, chap. A10, Reston, VA, 93 p., http://dx.doi.org/10.3133/tm4A10.
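A minimal sketch, assuming the standard EGRET user-data workflow, of how one extracted set of .csv files could be loaded to rerun a WRTDS analysis; the INFO file name follows the convention described above, while the corresponding Daily and Sample file names are assumed to follow the same pattern.

library(EGRET)

# Total phosphorus (TP) at site reference number 1, assuming the three zip files
# have been extracted into the folders INFO/, Daily/ and Sample/.
INFO   <- readUserInfo("INFO", "TPfiles.1.INFO.csv")
Daily  <- readUserDaily("Daily", "TPfiles.1.Daily.csv")     # assumed file name
Sample <- readUserSample("Sample", "TPfiles.1.Sample.csv")  # assumed file name
eList  <- mergeReport(INFO, Daily, Sample)

eList <- modelEstimation(eList)   # fits the WRTDS model
plotFluxHist(eList)               # flow-normalized flux history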
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This folder contains the input files required for the R code used to analyse data for the Patterns and prevalence of food allergy in adulthood in the UK (PAFA) project. This includes:
pafa_data_dictionary_anonymised.csv: The data dictionary describing each column in the anonymised PAFA dataset. "snomed_field_name" lists all column names in the dataset; "field_name_extended" lists the original column name in the REDCap data download, which was then recoded to include SNOMED and FoodEx2 codes for future analyses; "variable_field_name" denotes the corresponding coded field name in the REDCap form; "field_type" denotes the type of REDCap field; "field_label" describes the field name in plain language; "choices_calculations_or_slider_labels" describes the choices provided to the participant for that question.
foodex2_codes_with_other.csv: A CSV file with key-value pairs for identifying foods coded in the dataset.
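A minimal sketch (not part of the PAFA code) of how the two files could be read and the food codes attached to the dataset; the column names in foodex2_codes_with_other.csv, the key column, and the name of the anonymised data file are assumptions.

library(readr)
library(dplyr)

data_dict  <- read_csv("pafa_data_dictionary_anonymised.csv")
food_codes <- read_csv("foodex2_codes_with_other.csv")    # assumed columns: foodex2_code, food_label

pafa <- read_csv("pafa_dataset_anonymised.csv")           # hypothetical file name for the dataset itself
pafa_labelled <- pafa %>%
  left_join(food_codes, by = c("food_reported" = "foodex2_code"))   # hypothetical key column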
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Input data for R script.
A machine learning streamflow (MLFLOW) model was developed in R (the model code is in the Rscripts folder) for modeling monthly streamflow from 2012 to 2017 in three watersheds on the Wyoming Range in the upper Green River basin. Geospatial information for 125 site features (vector data in the Sites.shp file), discrete streamflow observations, and environmental predictor data were used in fitting the MLFLOW model and predicting with the fitted model. Tabular calibration and validation data are in the Model_Fitting_Site_Data.csv file, totaling 971 discrete observations and predictions of monthly streamflow. Geospatial information for 17,518 stream grid cells (raster data in the Streams.tif file) and environmental predictor data were used for continuous streamflow predictions with the MLFLOW model. Tabular prediction data for the whole study area (17,518 stream grid cells) and study period (72 months; 2012–17) are in the Model_Prediction_Stream_Data.csv file, totaling 1,261,296 predictions of spatially and temporally continuous monthly streamflow. Additional information about the datasets is in the metadata included in the four zipped dataset files; additional information about the MLFLOW model is in the readme included in the zipped model archive folder.
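A minimal sketch (assumptions noted) for loading the tabular MLFLOW data in R; the actual model code is in the Rscripts folder of the model archive, and no column names are assumed here.

fit_dat  <- read.csv("Model_Fitting_Site_Data.csv")       # 971 calibration/validation records
pred_dat <- read.csv("Model_Prediction_Stream_Data.csv")  # continuous monthly predictions
# 17,518 stream grid cells x 72 months (2012-2017) = 1,261,296 prediction rows
nrow(pred_dat) == 17518 * 72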
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data to run the IMACLIM-R France code
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
AWC to 60cm is one of 18 soil attributes chosen to underpin the land suitability assessment of the Roper River Water Resource Assessment (ROWRA) through the digital soil mapping (DSM) process. AWC (available water capacity) indicates the ability of a soil to retain and supply water for plant growth. This AWC raster data represents a modelled dataset of AWC to 60cm (mm of water to 60cm of soil depth) and is derived from analysed site data, spline calculations and environmental covariates. AWC is a parameter used in land suitability assessments for rainfed cropping and for water use efficiency in irrigated land uses. This raster data provides improved soil information used to underpin and identify opportunities and promote detailed investigation for a range of sustainable regional development options, and it was created within the 'Land Suitability' activity of the CSIRO ROWRA. A companion dataset and statistics reflecting the reliability of this data are also provided and are described in the lineage section of this metadata record. Processing information is supplied in the ranger R scripts, and the attributes were modelled using a Random Forest approach. The DSM process is described in the CSIRO ROWRA published report 'Soils and land suitability for the Roper catchment, Northern Territory', a technical report from the CSIRO Roper River Water Resource Assessment to the Government of Australia. The Roper River Water Resource Assessment provides a comprehensive overview and integrated evaluation of the feasibility of aquaculture and agriculture development in the Roper catchment (NT) as well as the ecological, social and cultural (Indigenous water values, rights and aspirations) impacts of development.
Lineage: This AWC to 60cm dataset has been generated from a range of inputs and processing steps; an overview follows. For more information refer to the CSIRO ROWRA published reports, in particular 'Soils and land suitability for the Roper catchment, Northern Territory'.
1. Collated existing data (relating to soils, climate, topography, natural resources and remote sensing, in various formats: reports, spatial vector, spatial raster etc.).
2. Selected additional soil and land attribute site data locations by a conditioned Latin hypercube statistical sampling method applied across the covariate data space.
3. Carried out fieldwork to collect new attribute data and soil samples for analysis and to build an understanding of geomorphology and landscape processes.
4. Performed database analysis to extract the data meeting the specific selection criteria required for the attribute to be modelled.
5. Used the R statistical programming environment for the attribute computing. Models were built from selected input data and covariate data using predictive learning with a Random Forest approach implemented in the ranger R package.
6. Created the AWC to 60cm Digital Soil Mapping (DSM) attribute raster dataset. DSM data is a geo-referenced dataset, generated from field observations and laboratory data coupled with environmental covariate data through quantitative relationships. It applies pedometrics: the use of mathematical and statistical models that combine information from soil observations with information contained in correlated environmental variables, remote sensing images and some geophysical measurements.
7. Produced companion predicted reliability data from the 500 individual Random Forest attribute models created.
8. Quality assessment (QA) of this DSM attribute data was conducted by three methods.
Method 1: Statistical (quantitative) assessment of the model and input data. Testing the quality of the DSM models was carried out using data withheld from model computations and expressed as OOB and R squared results, giving an estimate of the reliability of the model predictions. These results are supplied.
Method 2: Statistical (quantitative) assessment of the spatial attribute output data, presented as a raster of the attribute's "reliability". This used the 500 individual trees of the attribute's RF model to generate 500 datasets of the attribute and so estimate model reliability. For continuous attributes the method for estimating reliability is the coefficient of variation. This data is supplied.
Method 3: Collecting independent external validation site data combined with on-ground expert (qualitative) examination of outputs during validation field trips. Across each of the study areas a two-week validation field trip was conducted using a new validation site set produced by a random sampling design based on conditioned Latin hypercube sampling using the reliability data of the attribute. The modelled DSM attribute value was assessed against the actual on-ground value. These results are published in the report cited in this metadata record.
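A minimal sketch, not the project's own scripts, of the ranger-based Random Forest modelling and the coefficient-of-variation reliability estimate described above; all file and column names are hypothetical.

library(ranger)

sites <- read.csv("awc60_site_data.csv")   # hypothetical: AWC to 60cm plus covariate values at sites
rf <- ranger(awc60 ~ ., data = sites, num.trees = 500)
rf$prediction.error   # OOB error, as reported for QA Method 1
rf$r.squared          # OOB R squared, as reported for QA Method 1

newdat   <- read.csv("awc60_grid_covariates.csv")   # hypothetical: covariates at prediction cells
per_tree <- predict(rf, data = newdat, predict.all = TRUE)$predictions   # cells x 500 trees
awc_pred <- rowMeans(per_tree)                  # DSM attribute value
awc_cv   <- apply(per_tree, 1, sd) / awc_pred   # reliability as the coefficient of variation (QA Method 2)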
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data and code archive provides all the data and code for replicating the empirical analysis that is presented in the journal article "A Ray-Based Input Distance Function to Model Zero-Valued Output Quantities: Derivation and an Empirical Application" authored by Juan José Price and Arne Henningsen and published in the Journal of Productivity Analysis (DOI: 10.1007/s11123-023-00684-1).
We conducted the empirical analysis with the "R" statistical software (version 4.3.0) using the add-on packages "combinat" (version 0.0.8), "miscTools" (version 0.6.28), "quadprog" (version 1.5.8), "sfaR" (version 1.0.0), "stargazer" (version 5.2.3), and "xtable" (version 1.8.4), all of which are available on CRAN. We created the R package "micEconDistRay", which provides the functions for empirical analyses with ray-based input distance functions that we developed for the above-mentioned paper. This R package is also available on CRAN (https://cran.r-project.org/package=micEconDistRay).
This replication package contains the following files and folders:
README This file
MuseumsDk.csv The original data obtained from the Danish Ministry of Culture and from Statistics Denmark. It includes the following variables:
museum: Name of the museum.
type: Type of museum (Kulturhistorisk museum = cultural history museum; Kunstmuseer = arts museum; Naturhistorisk museum = natural history museum; Blandet museum = mixed museum).
munic: Municipality, in which the museum is located.
yr: Year of the observation.
units: Number of visit sites.
resp: Whether or not the museum has special responsibilities (0 = no special responsibilities; 1 = at least one special responsibility).
vis: Number of (physical) visitors.
aarc: Number of articles published (archeology).
ach: Number of articles published (cultural history).
aah: Number of articles published (art history).
anh: Number of articles published (natural history).
exh: Number of temporary exhibitions.
edu: Number of primary school classes on educational visits to the museum.
ev: Number of events other than exhibitions.
ftesc: Scientific labor (full-time equivalents).
ftensc: Non-scientific labor (full-time equivalents).
expProperty: Running and maintenance costs [1,000 DKK].
expCons: Conservation expenditure [1,000 DKK].
ipc: Consumer Price Index in Denmark (the value for year 2014 is set to 1).
prepare_data.R This R script imports the data set MuseumsDk.csv, prepares it for the empirical analysis (e.g., removing unsuitable observations, preparing variables), and saves the resulting data set as DataPrepared.csv.
DataPrepared.csv This data set is prepared and saved by the R script prepare_data.R. It is used for the empirical analysis.
make_table_descriptive.R This R script imports the data set DataPrepared.csv and creates the LaTeX table /tables/table_descriptive.tex, which provides summary statistics of the variables that are used in the empirical analysis.
IO_Ray.R This R script imports the data set DataPrepared.csv, estimates a ray-based Translog input distance function with the 'optimal' ordering of outputs, imposes monotonicity on this distance function, creates the LaTeX table /tables/idfRes.tex that presents the estimated parameters of this function, and creates several figures in the folder /figures/ that illustrate the results.
IO_Ray_ordering_outputs.R This R script imports the data set DataPrepared.csv, estimates a ray-based Translog input distance function and imposes monotonicity for each of the 720 possible orderings of the outputs (see the sketch after this file list), and saves all the estimation results as a (huge) R object, allOrderings.rds.
allOrderings.rds (not included in the ZIP file, uploaded separately) This is a saved R object created by the R script IO_Ray_ordering_outputs.R that contains the estimated ray-based Translog input distance functions (with and without monotonicity imposed) for each of the 720 possible orderings.
IO_Ray_model_averaging.R This R script loads the R object allOrderings.rds that contains the estimated ray-based Translog input distance functions for each of the 720 possible orderings, does model averaging, and creates several figures in the folder /figures/ that illustrate the results.
/tables/ This folder contains the two LaTeX tables table_descriptive.tex and idfRes.tex (created by R scripts make_table_descriptive.R and IO_Ray.R, respectively) that provide summary statistics of the data set and the estimated parameters (without and with monotonicity imposed) for the 'optimal' ordering of outputs.
/figures/ This folder contains 48 figures (created by the R scripts IO_Ray.R and IO_Ray_model_averaging.R) that illustrate the results obtained with the 'optimal' ordering of outputs and the model-averaged results and that compare these two sets of results.
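A minimal sketch (not part of the replication package) of how the 720 output orderings handled by IO_Ray_ordering_outputs.R can be enumerated; the output names are placeholders and the estimation step is only indicated by a comment.

library(combinat)

outputs   <- paste0("y", 1:6)          # placeholder names for the six outputs (6! = 720 orderings)
orderings <- permn(length(outputs))    # list of all 720 permutations of 1:6
length(orderings)                      # 720

results <- lapply(orderings, function(ord) {
  outs <- outputs[ord]
  # ... estimate the ray-based Translog input distance function with the outputs
  # ordered as `outs` (done in the package with micEconDistRay), with and
  # without monotonicity imposed ...
  outs
})
saveRDS(results, "allOrderings.rds")   # the replication package stores one large object like this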
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The model.zip file contains input data and code supporting the cod_v2 population estimates. The file modelData.RData provides the input data to the JAGS model and the file modelCode.R contains the source code for the model in the JAGS language. The files can be used to run the model for further assessments and as a starting point for further model development. The data and the model were developed using the statistical software R version 4.0.2 (https://cran.r-project.org/bin/windows/base/old/4.0.2) and JAGS 4.3.0 (https://mcmc-jags.sourceforge.io), a program for analysis of Bayesian graphical models using Gibbs sampling, through the R package runjags 2.2.0 (https://cran.r-project.org/web/packages/runjags).
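A minimal sketch (assumptions noted) of running the model with runjags; the names of the objects inside modelData.RData and the monitored parameters are hypothetical.

library(runjags)

load("modelData.RData")                       # assumed to provide the JAGS data list (name assumed: jagsData)
model_string <- paste(readLines("modelCode.R"), collapse = "\n")

fit <- run.jags(model    = model_string,
                data     = jagsData,          # hypothetical object name
                monitor  = c("N", "F"),       # hypothetical monitored parameters
                n.chains = 3,
                sample   = 10000)
summary(fit)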
This data release contains the input-data files and R scripts associated with the analysis presented in [citation of manuscript]. The spatial extent of the data is the contiguous U.S. The input-data files include one comma-separated value (csv) file of county-level data and one csv file of city-level data. The county-level csv ("county_data.csv") contains data for 3,109 counties. These data include two measures of water use, descriptive information about each county, three grouping variables (climate region, urban class, and economic dependency), and 18 explanatory variables: proportion of population growth from 2000-2010, fraction of withdrawals from surface water, average daily water yield, mean annual maximum temperature from 1970-2010, 2005-2010 maximum temperature departure from the 40-year maximum, mean annual precipitation from 1970-2010, 2005-2010 mean precipitation departure from the 40-year mean, Gini income disparity index, percent of county population with at least some college education, Cook Partisan Voting Index, housing density, median household income, average number of people per household, median age of structures, percent of renters, percent of single family homes, percent apartments, and a numeric version of urban class. The city-level csv ("city_data.csv") contains data for 83 cities. These data include descriptive information for each city, water-use measures, one grouping variable (climate region), and 6 explanatory variables: type of water bill (increasing block rate, decreasing block rate, or uniform), average price of water bill, number of requirement-oriented water conservation policies, number of rebate-oriented water conservation policies, aridity index, and regional price parity.
The R scripts construct fixed-effects and Bayesian hierarchical regression models. The primary difference between these models relates to how they handle possible clustering in the observations that define unique water-use settings. Fixed-effects models address possible clustering in one of two ways. In a "fully pooled" fixed-effects model, any clustering by group is ignored, and a single, fixed estimate of the coefficient for each covariate is developed using all of the observations. Conversely, in an unpooled fixed-effects model, separate coefficient estimates are developed using only the observations in each group. A hierarchical model provides a compromise between these two extremes. Hierarchical models extend single-level regression to data with a nested structure, whereby the model parameters vary at different levels in the model, including a lower level that describes the actual data and an upper level that influences the values taken by parameters in the lower level. The county-level models were compared using the Watanabe-Akaike information criterion (WAIC), which is derived from the log pointwise predictive density of the models and can be shown to approximate out-of-sample predictive performance.
All script files are intended to be used with R statistical software (R Core Team (2017). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org) and Stan probabilistic modeling software (Stan Development Team. 2017. RStan: the R interface to Stan. R package version 2.16.2. http://mc-stan.org).
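A minimal sketch, not the release's own scripts, of how WAIC can be computed from the log pointwise predictive density of a Stan fit, as described above; the Stan model file, the data-list columns, and the presence of a generated-quantities log_lik matrix are assumptions.

library(rstan)
library(loo)

county   <- read.csv("county_data.csv")
stan_dat <- list(N      = nrow(county),
                 y      = county$water_use,                           # hypothetical column name
                 region = as.integer(factor(county$climate_region)))  # hypothetical column name
fit <- stan("county_hierarchical.stan", data = stan_dat,              # hypothetical model file
            chains = 4, iter = 2000)

log_lik <- extract_log_lik(fit, parameter_name = "log_lik")  # assumes a generated quantities block named log_lik
waic(log_lik)                                                # WAIC used to compare the county-level models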
https://spdx.org/licenses/CC0-1.0.html
Pathogen diversity resulting in quasispecies can enable persistence and adaptation to host defenses and therapies. However, accurate quasispecies characterization can be impeded by errors introduced during sample handling and sequencing which can require extensive optimizations to overcome. We present complete laboratory and bioinformatics workflows to overcome many of these hurdles. The Pacific Biosciences single molecule real-time platform was used to sequence PCR amplicons derived from cDNA templates tagged with universal molecular identifiers (SMRT-UMI). Optimized laboratory protocols were developed through extensive testing of different sample preparation conditions to minimize between-template recombination during PCR and the use of UMI allowed accurate template quantitation as well as removal of point mutations introduced during PCR and sequencing to produce a highly accurate consensus sequence from each template. Handling of the large datasets produced from SMRT-UMI sequencing was facilitated by a novel bioinformatic pipeline, Probabilistic Offspring Resolver for Primer IDs (PORPIDpipeline), that automatically filters and parses reads by sample, identifies and discards reads with UMIs likely created from PCR and sequencing errors, generates consensus sequences, checks for contamination within the dataset, and removes any sequence with evidence of PCR recombination or early cycle PCR errors, resulting in highly accurate sequence datasets. The optimized SMRT-UMI sequencing method presented here represents a highly adaptable and established starting point for accurate sequencing of diverse pathogens. These methods are illustrated through characterization of human immunodeficiency virus (HIV) quasispecies.
Methods
This serves as an overview of the analysis performed on PacBio sequence data that is summarized in Analysis Flowchart.pdf and was used as primary data for the paper by Westfall et al. "Optimized SMRT-UMI protocol produces highly accurate sequence datasets from diverse populations – application to HIV-1 quasispecies"
Five different PacBio sequencing datasets were used for this analysis: M027, M2199, M1567, M004, and M005
For the datasets which were indexed (M027, M2199), CCS reads from PacBio sequencing files and the chunked_demux_config files were used as input for the chunked_demux pipeline. Each config file lists the different Index primers added during PCR to each sample. The pipeline produces one fastq file for each Index primer combination in the config. For example, in dataset M027 there were 3–4 samples using each Index combination. The fastq files from each demultiplexed read set were moved to the sUMI_dUMI_comparison pipeline fastq folder for further demultiplexing by sample and consensus generation with that pipeline. More information about the chunked_demux pipeline can be found in the README.md file on GitHub.
The demultiplexed read collections from the chunked_demux pipeline or CCS read files from datasets which were not indexed (M1567, M004, M005) were each used as input for the sUMI_dUMI_comparison pipeline along with each dataset's config file. Each config file contains the primer sequences for each sample (including the sample ID block in the cDNA primer) and further demultiplexes the reads to prepare data tables summarizing all of the UMI sequences and counts for each family (tagged.tar.gz) as well as consensus sequences from each sUMI and rank 1 dUMI family (consensus.tar.gz). More information about the sUMI_dUMI_comparison pipeline can be found in the paper and the README.md file on GitHub.
The consensus.tar.gz and tagged.tar.gz files were moved from sUMI_dUMI_comparison pipeline directory on the server to the Pipeline_Outputs folder in this analysis directory for each dataset and appended with the dataset name (e.g. consensus_M027.tar.gz). Also in this analysis directory is a Sample_Info_Table.csv containing information about how each of the samples was prepared, such as purification methods and number of PCRs. There are also three other folders: Sequence_Analysis, Indentifying_Recombinant_Reads, and Figures. Each has an .Rmd file with the same name inside which is used to collect, summarize, and analyze the data. All of these collections of code were written and executed in RStudio to track notes and summarize results.
Sequence_Analysis.Rmd has instructions to decompress all of the consensus.tar.gz files, combine them, and create two fasta files, one with all sUMI and one with all dUMI sequences. Using these as input, two data tables were created that summarize all sequences and read counts for each sample that pass various criteria. These are used to help create Table 2 and as input for Indentifying_Recombinant_Reads.Rmd and Figures.Rmd. Next, two fasta files containing all of the rank 1 dUMI sequences and the matching sUMI sequences were created. These were used as input for the Python script compare_seqs.py, which identifies any matched sequences that are different between the sUMI and dUMI read collections. This information was also used to help create Table 2. Finally, to populate the table with the number of sequences and bases in each sequence subset of interest, different sequence collections were saved and viewed in the Geneious program.
To investigate the cause of sequences where the sUMI and dUMI sequences do not match, tagged.tar.gz was decompressed, and for each family with discordant sUMI and dUMI sequences the reads from the UMI1_keeping directory were aligned using Geneious. Reads from dUMI families failing the 0.7 filter were also aligned in Geneious. The uncompressed tagged folder was then removed to save space. These read collections contain all of the reads in a UMI1 family and still include the UMI2 sequence. By examining the alignment and specifically the UMI2 sequences, the site of the discordance and its cause were identified for each family as described in the paper. These alignments were saved as "Sequence Alignments.geneious". The counts of how many families were the result of PCR recombination were used in the body of the paper.
Using Identifying_Recombinant_Reads.Rmd, the dUMI_ranked.csv file from each sample was extracted from all of the tagged.tar.gz files, combined and used as input to create a single dataset containing all UMI information from all samples. This file dUMI_df.csv was used as input for Figures.Rmd.
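A minimal sketch (not the repository's .Rmd code) of the step described above: collecting each sample's dUMI_ranked.csv from the decompressed tagged outputs into one dataset; the directory layout and column names are assumptions.

library(dplyr)
library(readr)

files <- list.files("Pipeline_Outputs/tagged", pattern = "dUMI_ranked\\.csv$",
                    recursive = TRUE, full.names = TRUE)
dUMI_df <- files %>%
  lapply(function(f) read_csv(f) %>% mutate(sample = basename(dirname(f)))) %>%
  bind_rows()
write_csv(dUMI_df, "dUMI_df.csv")   # used as input for Figures.Rmd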
Figures.Rmd used dUMI_df.csv, sequence_counts.csv, and read_counts.csv as input to create draft figures and then individual datasets for each figure. These were copied into Prism software to create the final figures for the paper.
Open Access
Introduction
A new method for estimating carbon dioxide emissions from drained peatland forest soils was developed for the Greenhouse Gas Inventory of Finland (GHG inventory). The method is based on a set of models (Ojanen et al., 2014; Tuomi et al., 2009) that dynamically compile all relevant carbon inputs and outputs into a time series of soil CO2 emission. A complete description of the method is given in Alm et al. (2023). Here we present the input data and R scripts (R Core Team, 2020) for computing the time series, from 1990 to 2022, of CO2 emission from soil in forest land on drained organic soil, as reported by the Finnish GHG inventory (Statistics Finland, 2023).
Time series data
The source of forest and area data is the Finnish National Forest Inventory (NFI), as a part of Luke Statutory Services. The NFI standing forest data in the data files include annual country-wide estimates of mean basal area and standing biomass of Scots pine (Pinus sylvestris L.), Norway spruce (Picea abies (L.) H. Karst) and all broadleaved forest trees combined. The data concern forest land on drained organic soil only (class FRA 1 according to the FAO forest land definition). The NFI data for each year have been averaged by drained peatland forest site type (FTYPE) and by the inventory regions of southern and northern Finland. The areas and proportions of FTYPEs of all drained peatland "forests remaining forests" (i.e., forests that have not undergone another change in land use in the past 20 years) in southern and northern Finland (Alm et al., 2023) were derived from NFI12 (2014–2018).
Annual litter input from harvest residues was estimated using statistics of harvested stem volumes by species, collected and published by Luke (Luke statistics). The stem volumes were converted to whole-tree biomass and further to litter fractions. The share of residues remaining in the forest was estimated by subtracting the amount of logging residues collected for energy use, with the data obtained from Luke energy statistics. The biomass of live trees, annual litterfall from live trees above ground and root litter below ground are derived from the National Forest Inventory of Finland (inventory rounds NFI8 to NFI13). The R code also includes the calculation of annual litter production from the harvesting residues.
The regression-based transfer models implemented in the R code also need meteorological time series inputs. The soil organic matter decomposition model (Ojanen et al., 2014) uses the May-October mean temperature. The decomposition model Yasso07 (Tuomi et al., 2009), applied for estimating the CO2 release from decomposition of harvesting residues and above-ground litter from natural mortality, is constrained by annual temperature, annual temperature amplitude and annual precipitation. Starting from the original country-wide grid produced by the Finnish Meteorological Institute (FMI), the weather time series were spatially averaged so that the FMI weather grid values were collected from those locations where peatlands representing each FTYPE in southern and northern Finland were observed by the NFI.
The pre-prepared input data are given in the files described in Table 1.
Table 1. Description of input data files.
basal.areas.csv: Time series of years 1990-2022 for annual average basal area (m2 ha-1) by year, by peatland forest site type (peat_type) and by tree species or group (tree_type). Values of peat_type correspond to FTYPE: 1 Herb-rich type, 2 Vaccinium myrtillus type, 4 Vaccinium vitis-idaea type, 6 Dwarf shrub type, 7 Cladina type. Values of tree_type correspond to: 1 Scots pine, 2 Norway spruce, 3 Broadleaved species.
biomass.csv: Time series of years 1990-2022 for annual biomass (biomass, t ha-1 of dry mass) by year, by biomass component, by tree species and by peatland forest site type (tkg). Values of peat_type correspond to FTYPE: 1 Herb-rich type, 2 Vaccinium myrtillus type, 4 Vaccinium vitis-idaea type, 6 Dwarf shrub type, 7 Cladina type.
dead_litter.csv: Time series of years 1990-2022 of annual aboveground litter from dead wood: harvesting residues and natural mortality combined (C, t ha-1 of dry mass; lognat_litter). Values of region correspond to GHG inventory region: south = South Finland, north = North Finland.
ghgi_litter.csv: Time series of years 1990-2022 for litter AWEN fractions (A = acid soluble, W = water soluble, E = ethanol soluble, N = non-soluble; C, t ha-1) by different litter types: above-ground coarse woody litter (coarse_woody_litter), fine woody litter (fine_woody_litter) and non-woody litter (non_woody_litter), by litter source and deposition type and by region. "org" denotes organic soil. Values of region correspond to GHG inventory region: south = South Finland, north = North Finland. Values of ground correspond to litter deposition environment: above = above-ground litter, below = below-ground litter.
lognat_decomp.csv: Time series of years 1990-2022 for C (t ha-1 of dry mass) decomposed from logging residues and natural mortality, by region. Values of region correspond to GHG inventory region: south = South Finland, north = North Finland.
logyasso_weather_data.csv: Time series of years 1990-2022 for regional (region) precipitation sum (mm, sum_P), average annual temperature (°C, mean_T) and amplitude of the annual temperature (°C, ampli_T). Values of region correspond to GHG inventory region: south = South Finland, north = North Finland.
total_area.csv: Areas (ha) of drained peatland forests remaining forest land, by region and peat_type. Values of region correspond to GHG inventory region: south = South Finland, north = North Finland. Values of peat_type correspond to FTYPE: 1 Herb-rich type, 2 Vaccinium myrtillus type, 4 Vaccinium vitis-idaea type, 6 Dwarf shrub type, 7 Cladina type.
weather_data.csv: Time series of years 1990-2022 for the 30-year rolling mean temperature for the May-October period (roll_T) used by the soil decomposition models. The values are calculated for each FTYPE (peat_type) using their spatial distributions (see details in Alm et al., 2023). Values of region correspond to GHG inventory region: south = South Finland, north = North Finland. Values of peat_type correspond to FTYPE: 1 Herb-rich type, 2 Vaccinium myrtillus type, 4 Vaccinium vitis-idaea type, 6 Dwarf shrub type, 7 Cladina type.
The R scripts
The scripts are an excerpt from the Finnish greenhouse gas inventory code set, applying the necessary pre-processed input data and producing the soil CO2 emissions for each FTYPE separately. The necessary R packages (R Core Team, 2020) are managed in the script LIBRARIES.R. Guidance for running the R scripts is given in the README.txt.
References
Alm, J., Wall, A., Myllykangas, J-P., Ojanen, P., Heikkinen, J., Henttonen, H. M., Laiho, R., Minkkinen, K., Tuomainen, T. and Mikola, J.: A new method for estimating carbon dioxide emissions from drained peatland forest soils for the greenhouse gas inventory of Finland,
Biogeosciences, https://doi.org/10.5194/bg-20-1-2023, 2023.
Luke statistics: https://www.luke.fi/en/statistics/total-roundwood-removals-and-drain and https://www.luke.fi/en/statistics/commercial-fellings/commercial-fellings-72023, last accessed 8.12.2022.
Statistics Finland, 2023: https://unfccc.int/documents/627718, last accessed 13.9.2023.
Ojanen, P., Lehtonen, A., Heikkinen, J., Penttilä, T., and Minkkinen, K.: Soil CO2 balance and its uncertainty in forestry drained peatlands in Finland, Forest Ecol. Manage., 325, 60–73, 2014.
R Core Team: R: A language and environment for statistical computing, R Foundation for Statistical Computing, Vienna, Austria, https://www.R-project.org, 2020.
Tuomi, M., Thum, T., Järvinen, H., Fronzek, S., Berg, B., Harmon, M., Trofymow, J.A., Sevanto, S. and Liski, J.: Leaf litter decomposition - Estimates of global variability based on Yasso07 model, Ecol. Modell., 220(23), 3362–3371, 2009.
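A minimal sketch (not part of the inventory code set) showing how the pre-prepared input files listed in Table 1 above can be read before running the scripts; only the file names come from the description, and the aggregate call assumes the stated roll_T, region and peat_type columns.

basal   <- read.csv("basal.areas.csv")   # basal area by year, peat_type and tree_type
biomass <- read.csv("biomass.csv")       # biomass by component, tree species and site type
weather <- read.csv("weather_data.csv")  # 30-year rolling May-October mean temperature (roll_T)

# Mean rolling temperature by inventory region and site type, as used by the
# Ojanen et al. (2014) soil organic matter decomposition model:
aggregate(roll_T ~ region + peat_type, data = weather, FUN = mean)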
http://guides.library.uq.edu.au/deposit_your_data/terms_and_conditions
The files contained herein include BSFG output containing posterior samples for parameters of interest, and R code and other input files used to generate the main manipulations of the data and output presented in the corresponding manuscript.
https://creativecommons.org/publicdomain/zero/1.0/
Problem description
Pizza
The pizza is represented as a rectangular, 2-dimensional grid of R rows and C columns. The cells within the grid are referenced using a pair of 0-based coordinates [r, c], denoting respectively the row and the column of the cell.
Each cell of the pizza contains either:
mushroom, represented in the input file as M
tomato, represented in the input file as T
Slice
A slice of pizza is a rectangular section of the pizza delimited by two rows and two columns, without holes. The slices we want to cut out must contain at least L cells of each ingredient (that is, at least L cells of mushroom and at least L cells of tomato) and at most H cells of any kind in total - surprising as it is, there is such a thing as too much pizza in one slice. The slices being cut out cannot overlap. The slices being cut do not need to cover the entire pizza.
Goal
The goal is to cut correct slices out of the pizza, maximizing the total number of cells in all slices.
Input data set
The input data is provided as a data set file: a plain text file containing exclusively ASCII characters, with lines terminated with a single '\n' character at the end of each line (UNIX-style line endings).
File format
The file consists of:
one line containing the following natural numbers separated by single spaces:
R (1 ≤ R ≤ 1000) is the number of rows
C (1 ≤ C ≤ 1000) is the number of columns
L (1 ≤ L ≤ 1000) is the minimum number of each ingredient cells in a slice
H (1 ≤ H ≤ 1000) is the maximum total number of cells of a slice
R lines describing the rows of the pizza (one after another). Each of these lines contains C characters describing the ingredients in the cells of the row (one cell after another). Each character is either ‘M’ (for mushroom) or ‘T’ (for tomato).
Example
3 5 1 6
TTTTT
TMMMT
TTTTT
3 rows, 5 columns, min 1 of each ingredient per slice, max 6 cells per slice
Example input file.
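A minimal sketch (not part of the problem statement) of reading an input file of this format in R; the file name is hypothetical.

read_pizza <- function(path = "example.in") {
  lines  <- readLines(path)
  header <- as.integer(strsplit(lines[1], " ")[[1]])
  R <- header[1]; C <- header[2]; L <- header[3]; H <- header[4]
  grid <- do.call(rbind, strsplit(lines[2:(R + 1)], ""))  # R x C character matrix of "M"/"T"
  list(R = R, C = C, L = L, H = H, grid = grid)
}
pizza <- read_pizza("example.in")
sum(pizza$grid == "T")   # number of tomato cells in the example: 12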
Submissions
File format
The file must consist of:
one line containing a single natural number S (0 ≤ S ≤ R × C), representing the total number of slices to be cut,
S lines describing the slices. Each of these lines must contain the following natural numbers separated by single spaces:
r1, c1, r2, c2 (0 ≤ r1, r2 < R; 0 ≤ c1, c2 < C) describe a slice of pizza delimited by the rows r1 and r2 and the columns c1 and c2, including the cells of the delimiting rows and columns. The rows (r1 and r2) can be given in any order. The columns (c1 and c2) can be given in any order too.
Example
3
0 0 2 1
0 2 2 2
0 3 2 4
3 slices.
First slice between rows (0,2) and columns (0,1).
Second slice between rows (0,2) and columns (2,2).
Third slice between rows (0,2) and columns (3,4).
Example submission file.
(Figure: slices described in the example submission file, marked in green, orange and purple.)
Validation
For the solution to be accepted:
the format of the file must match the description above,
each cell of the pizza must be included in at most one slice,
each slice must contain at least L cells of mushroom,
each slice must contain at least L cells of tomato,
the total area of each slice must be at most H.
Scoring
The submission gets a score equal to the total number of cells in all slices. Note that there are multiple data sets representing separate instances of the problem. The final score for your team is the sum of your best scores on the individual data sets.
Scoring example
The example submission file given above cuts the slices of 6, 3 and 6 cells, earning 6 + 3 + 6 = 15 points.
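A minimal sketch (not part of the problem statement) that checks a submission against the rules above and computes its score; it reuses read_pizza() from the earlier sketch, and the file names are hypothetical.

score_submission <- function(pizza, path = "example.out") {
  lines   <- readLines(path)
  S       <- as.integer(lines[1])
  covered <- matrix(FALSE, pizza$R, pizza$C)
  total   <- 0
  for (i in seq_len(S)) {
    v <- as.integer(strsplit(lines[i + 1], " ")[[1]]) + 1   # r1 c1 r2 c2, shifted to 1-based
    rows <- min(v[1], v[3]):max(v[1], v[3])
    cols <- min(v[2], v[4]):max(v[2], v[4])
    cells <- pizza$grid[rows, cols, drop = FALSE]
    stopifnot(sum(cells == "M") >= pizza$L,   # at least L cells of mushroom
              sum(cells == "T") >= pizza$L,   # at least L cells of tomato
              length(cells) <= pizza$H,       # at most H cells in total
              !any(covered[rows, cols]))      # no overlap with earlier slices
    covered[rows, cols] <- TRUE
    total <- total + length(cells)
  }
  total
}
score_submission(pizza)   # 6 + 3 + 6 = 15 for the example files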
https://spdx.org/licenses/CC0-1.0.html
Objective: To develop a clinical informatics pipeline designed to capture large-scale structured EHR data for a national patient registry.
Materials and Methods: The EHR-R-REDCap pipeline is implemented using R-statistical software to remap and import structured EHR data into the REDCap-based multi-institutional Merkel Cell Carcinoma (MCC) Patient Registry using an adaptable data dictionary.
Results: Clinical laboratory data were extracted from EPIC Clarity across several participating institutions. Labs were transformed, remapped and imported into the MCC registry using the EHR labs abstraction (eLAB) pipeline. Forty-nine clinical tests encompassing 482,450 results were imported into the registry for 1,109 enrolled MCC patients. Data-quality assessment revealed highly accurate, valid labs. Univariate modeling was performed for labs at baseline on overall survival (N=176) using this clinical informatics pipeline.
Conclusion: We demonstrate feasibility of the facile eLAB workflow. EHR data is successfully transformed, and bulk-loaded/imported into a REDCap-based national registry to execute real-world data analysis and interoperability.
Methods
eLAB Development and Source Code (R statistical software):
eLAB is written in R (version 4.0.3), and utilizes the following packages for processing: DescTools, REDCapR, reshape2, splitstackshape, readxl, survival, survminer, and tidyverse. Source code for eLAB can be downloaded directly (https://github.com/TheMillerLab/eLAB).
eLAB reformats EHR data abstracted for an identified population of patients (e.g. medical record numbers (MRN)/name list) under an Institutional Review Board (IRB)-approved protocol. The MCCPR does not host MRNs/names and eLAB converts these to MCCPR assigned record identification numbers (record_id) before import for de-identification.
Functions were written to remap EHR bulk lab data pulls/queries from several sources, including Clarity/Crystal reports or institutional EDWs such as the Research Patient Data Registry (RPDR) at MGB. The input, a csv/delimited file of labs for user-defined patients, may vary. Thus, users may need to adapt the initial data wrangling script based on the data input format. However, the downstream transformation, code-lab lookup tables, outcomes analysis, and LOINC remapping are standard for use with the provided REDCap Data Dictionary, DataDictionary_eLAB.csv. The available R-markdown (https://github.com/TheMillerLab/eLAB) provides suggestions and instructions on where or when upfront script modifications may be necessary to accommodate input variability.
The eLAB pipeline takes several inputs. For example, the input for use with the ‘ehr_format(dt)’ single-line command is non-tabular data assigned as R object ‘dt’ with 4 columns: 1) Patient Name (MRN), 2) Collection Date, 3) Collection Time, and 4) Lab Results wherein several lab panels are in one data frame cell. A mock dataset in this ‘untidy-format’ is provided for demonstration purposes (https://github.com/TheMillerLab/eLAB).
Bulk lab data pulls often result in subtypes of the same lab. For example, potassium labs are reported as “Potassium,” “Potassium-External,” “Potassium(POC),” “Potassium,whole-bld,” “Potassium-Level-External,” “Potassium,venous,” and “Potassium-whole-bld/plasma.” eLAB utilizes a key-value lookup table with ~300 lab subtypes for remapping labs to the Data Dictionary (DD) code. eLAB reformats/accepts only those lab units pre-defined by the registry DD. The lab lookup table is provided for direct use or may be re-configured/updated to meet end-user specifications. eLAB is designed to remap, transform, and filter/adjust value units of semi-structured/structured bulk laboratory values data pulls from the EHR to align with the pre-defined code of the DD.
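A minimal sketch (assumptions noted) of the inputs described above: the mock 'untidy' 4-column data frame passed to ehr_format() and a key-value lookup of the kind eLAB uses to remap lab subtypes; the example values and subtype-to-code pairs are illustrative only.

dt <- data.frame(
  `Patient Name (MRN)` = c("0001", "0001"),
  `Collection Date`    = c("2020-01-05", "2020-02-10"),
  `Collection Time`    = c("08:15", "09:40"),
  `Lab Results`        = c("Potassium 4.1 mmol/L; Sodium 139 mmol/L",
                           "Potassium(POC) 3.8 mmol/L"),
  check.names = FALSE
)
# ehr_format(dt)   # the eLAB single-line command described above (package from GitHub)

lab_lookup <- c("Potassium"                  = "potassium",
                "Potassium(POC)"             = "potassium",
                "Potassium-whole-bld/plasma" = "potassium")   # illustrative key-value pairs
lab_lookup["Potassium(POC)"]   # remaps a lab subtype to its DD code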
Data Dictionary (DD)
EHR clinical laboratory data is captured in REDCap using the ‘Labs’ repeating instrument (Supplemental Figures 1-2). The DD is provided for use by researchers at REDCap-participating institutions and is optimized to accommodate the same lab-type captured more than once on the same day for the same patient. The instrument captures 35 clinical lab types. The DD serves several major purposes in the eLAB pipeline. First, it defines every lab type of interest and associated lab unit of interest with a set field/variable name. It also restricts/defines the type of data allowed for entry for each data field, such as a string or numerics. The DD is uploaded into REDCap by every participating site/collaborator and ensures each site collects and codes the data the same way. Automation pipelines, such as eLAB, are designed to remap/clean and reformat data/units utilizing key-value look-up tables that filter and select only the labs/units of interest. eLAB ensures the data pulled from the EHR contains the correct unit and format pre-configured by the DD. The use of the same DD at every participating site ensures that the data field code, format, and relationships in the database are uniform across each site to allow for the simple aggregation of the multi-site data. For example, since every site in the MCCPR uses the same DD, aggregation is efficient and different site csv files are simply combined.
Study Cohort
This study was approved by the MGB IRB. A search of the EHR was performed to identify patients diagnosed with MCC between 1975 and 2021 (N=1,109) for inclusion in the MCCPR. Subjects diagnosed with primary cutaneous MCC between 2016 and 2019 (N=176) were included in the test cohort for exploratory studies of lab result associations with overall survival (OS) using eLAB.
Statistical Analysis
OS is defined as the time from date of MCC diagnosis to date of death. Data was censored at the date of the last follow-up visit if no death event occurred. Univariable Cox proportional hazard modeling was performed among all lab predictors. Due to the hypothesis-generating nature of the work, p-values were exploratory and Bonferroni corrections were not applied.
U.S. Government Works: https://www.usa.gov/government-works
License information was derived automatically
This USGS data release represents the input data, R script, and output data for WRTDS analyses used to identify trends in suspended sediment loads of Coastal Plain streams and rivers in the eastern United States.
The dataset was compiled by the Bioregional Assessment Programme from multiple sources referenced within the dataset and/or metadata. The processes undertaken to compile this dataset are described in the History field in this metadata statement.
Namoi AWRA-R (restricted input data implementation)
This dataset was supplied to the Bioregional Assessment Programme by DPI Water (NSW Government). Metadata was not provided and has been compiled by the Bioregional Assessment Programme based on known details at the time of acquisition. The metadata within the dataset contains the restricted input data to implement the Namoi AWRA-R model for model calibration or simulation. The restricted input contains simulated time-series extracted from the Namoi Integrated Quantity and Quality Model (IQQM) including: irrigation and other diversions (town water supply, mining), reservoir information (volumes, inflows, releases) and allocation information.
Each sub-folder in the associated data has a readme file indicating folder contents and providing general instructions about the use of the data and how it was sourced.
Detailed documentation of the AWRA-R model is provided in: https://publications.csiro.au/rpr/download?pid=csiro:EP154523&dsid=DS2
Documentation about the implementation of AWRA-R in the Namoi bioregion is provided in BA NAM 2.6.1.3 and 2.6.1.4 products.
The resource is used in the development of river system models.
This dataset was supplied to the Bioregional Assessment Programme by DPI Water (NSW Government). The data was extracted from the IQQM interface and formatted accordingly. It is considered a source dataset because the IQQM model cannot be registered as it was provided under a formal agreement between CSIRO and DPI Water with confidentiality clauses.
Bioregional Assessment Programme (2017) Namoi AWRA-R (restricted input data implementation). Bioregional Assessment Source Dataset. Viewed 12 March 2019, http://data.bioregionalassessments.gov.au/dataset/04fc0b56-ba1d-4981-aaf2-ca6c8eaae609.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The core data in this publication are empirical data on two coupled network layers inferred from cross-industrial citation links (patent citation network) and input-output flows among industries (input-output (IO) network). These data are available in quinquennial time steps for the years 1977, 1982, 1987, 1992, 1997, 2002 and 2006. The data are available at the 6-digit level. The analyses in the paper mainly refer to 4-digit level results of a balanced panel of industries, i.e. industries for which both patent and IO data are available for the full time horizon. This publication also contains a sample of panel data on industry characteristics (mainly industry size by patent stock and output, and network indicators). The data are available in RData format and supplemented by the R scripts used to compile and analyze the data.
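A minimal sketch of how the two layers described above could be loaded and inspected in R; every file, object and column name here is hypothetical, since the publication's own scripts define the actual names.

library(igraph)

load("panel_4digit.RData")   # hypothetical; assumed to provide data frames cite_edges
                             # (citing, cited, year) and io_edges (supplier, user, year)
g_pat <- graph_from_data_frame(subset(cite_edges, year == 1997)[, c("citing", "cited")],
                               directed = TRUE)
g_io  <- graph_from_data_frame(subset(io_edges, year == 1997)[, c("supplier", "user")],
                               directed = TRUE)
edge_density(g_pat)          # the patent layer is described as becoming denser over time
edge_density(g_io)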
[Forthcoming. If you use the data, please cite the most recent version of the paper.]
The paper also offers a description of the data and its compilation.
Demand-pull and technology-push are linked to an empirical two-layer network based on coupled cross-industrial input-output (IO) and patent-citation links among 155 4-digit (NAICS) US industries in 1976-2006. I study the evolution of industry hierarchies and link formation. Both layers co-evolve, but differently: the patent network became denser and increasingly skewed, while market hierarchies are balanced and sluggish in change. Industries became more similar by patent citations, but less by input use. Having R&D capabilities similar to those of other big industries is beneficial for innovation, as it provides access to knowledge, but relying on the same market inputs is unfavorable if it intensifies competition. This may incite industries to explore other technological pathways. Growth in the market is constrained by scarcity and competition, but knowledge as an innovation input is non-rival, leading to increasing returns and a skewed distribution. This may strengthen existing R&D trajectories, while market pressure may trigger a re-direction in both layers. This work is limited by its reliance on endogenously evolving classifications.
The code and data should be used in the following order:
(1) CREATING THE DATA:
(a) The patent data can not be fully reconstructed from the data that are available in this data
publication because one of the intermediate steps relies on proprietary data that can not be
provided here.
For the remainder: You can compile parts of the patent and the IO data from the raw data. To
do so, please use the code and raw data provided in the folders io_data_R_files and
patent_data_R_files. Further detail is provided below.
(b) The folder R_scripts_both provides all code needed to create the merged panel data that is
used in the analysis.
(2) REPRODUCING THE ANALYSES: The folder R_scripts_both provides all code needed to reproduce the figures, tables, descriptive statistics and regression analyses. Further detail is provided below.
This data publication also provides additional results on the analyses at different levels of data aggregation. You will find them in the folder statistical_output, but you can also produce additional results by running the code provided.
(1) patent_data_R_files (2) io_data_R_files (3) R_scripts_both (4) data_combined (5) statistical_output
(1) patent_data_R_files
This folder contains 2 subfolders: code, data
code: This subfolder contains the R-scripts of all single steps executed to process the patent raw data. These steps are explained in detail in the Supplementary Material of the paper Hötte, K (2021): "Demand-pull and technology-push [forthcoming]"
data: This subfolder contains the processed data at different levels of aggregation and a folder where to put the source files, i.e. the original NBER patent data used in this analysis that need to be downloaded from https://sites.google.com/site/patentdataproject/Home [accessed on Mar 17, 2021]. To us