https://spdx.org/licenses/CC0-1.0.html
Objective: To develop a clinical informatics pipeline designed to capture large-scale structured EHR data for a national patient registry.
Materials and Methods: The EHR-R-REDCap pipeline is implemented using R-statistical software to remap and import structured EHR data into the REDCap-based multi-institutional Merkel Cell Carcinoma (MCC) Patient Registry using an adaptable data dictionary.
Results: Clinical laboratory data were extracted from EPIC Clarity across several participating institutions. Labs were transformed, remapped and imported into the MCC registry using the EHR labs abstraction (eLAB) pipeline. Forty-nine clinical tests encompassing 482,450 results were imported into the registry for 1,109 enrolled MCC patients. Data-quality assessment revealed highly accurate, valid labs. Univariate modeling was performed for labs at baseline on overall survival (N=176) using this clinical informatics pipeline.
Conclusion: We demonstrate the feasibility of the facile eLAB workflow. EHR data are successfully transformed and bulk-loaded/imported into a REDCap-based national registry, enabling real-world data analysis and interoperability.
Methods eLAB Development and Source Code (R statistical software):
eLAB is written in R (version 4.0.3), and utilizes the following packages for processing: DescTools, REDCapR, reshape2, splitstackshape, readxl, survival, survminer, and tidyverse. Source code for eLAB can be downloaded directly (https://github.com/TheMillerLab/eLAB).
eLAB reformats EHR data abstracted for an identified population of patients (e.g., a medical record number (MRN)/name list) under an Institutional Review Board (IRB)-approved protocol. The MCCPR does not host MRNs/names; eLAB converts these to MCCPR-assigned record identification numbers (record_id) before import, de-identifying the data.
Functions were written to remap EHR bulk lab data pulls/queries from several sources, including Clarity/Crystal reports or institutional enterprise data warehouses (EDWs) such as the Research Patient Data Registry (RPDR) at MGB. The input, a csv/delimited file of labs for user-defined patients, may vary. Thus, users may need to adapt the initial data-wrangling script to the input format. However, the downstream transformation, code-lab lookup tables, outcomes analysis, and LOINC remapping are standard for use with the provided REDCap Data Dictionary, DataDictionary_eLAB.csv. The available R Markdown (https://github.com/TheMillerLab/eLAB) provides suggestions and instructions on where or when upfront script modifications may be necessary to accommodate input variability.
The eLAB pipeline takes several inputs. For example, the input for use with the ‘ehr_format(dt)’ single-line command is non-tabular data assigned as R object ‘dt’ with 4 columns: 1) Patient Name (MRN), 2) Collection Date, 3) Collection Time, and 4) Lab Results wherein several lab panels are in one data frame cell. A mock dataset in this ‘untidy-format’ is provided for demonstration purposes (https://github.com/TheMillerLab/eLAB).
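A minimal sketch of this 'untidy' input is shown below; the column names and values are illustrative only, and the mock dataset on GitHub is the authoritative example of the expected format.

```r
library(tibble)

# Illustrative 'untidy' input for ehr_format(): one row per collection, several
# lab panels packed into a single Lab Results cell. Column names are assumptions.
dt <- tibble(
  patient_name    = c("DOE,JANE (1234567)", "DOE,JANE (1234567)"),
  collection_date = c("2021-01-04", "2021-02-15"),
  collection_time = c("08:12", "09:47"),
  lab_results     = c("Sodium 140 mmol/L; Potassium 4.1 mmol/L; Creatinine 0.9 mg/dL",
                      "Sodium 138 mmol/L; Potassium 4.4 mmol/L; Creatinine 1.0 mg/dL")
)

# After sourcing the eLAB scripts from the repository:
# labs_long <- ehr_format(dt)
```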
Bulk lab data pulls often result in subtypes of the same lab. For example, potassium labs are reported as “Potassium,” “Potassium-External,” “Potassium(POC),” “Potassium,whole-bld,” “Potassium-Level-External,” “Potassium,venous,” and “Potassium-whole-bld/plasma.” eLAB utilizes a key-value lookup table with ~300 lab subtypes for remapping labs to the Data Dictionary (DD) code. eLAB reformats/accepts only those lab units pre-defined by the registry DD. The lab lookup table is provided for direct use or may be re-configured/updated to meet end-user specifications. eLAB is designed to remap, transform, and filter/adjust value units of semi-structured/structured bulk laboratory values data pulls from the EHR to align with the pre-defined code of the DD.
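The remapping step can be pictured as a key-value join followed by a unit filter. The sketch below is a simplified stand-in using dplyr, with a hypothetical excerpt of the lookup table and a toy lab pull rather than the ~300-row table shipped with eLAB.

```r
library(tibble)
library(dplyr)

# Hypothetical excerpt of the key-value lookup table (the shipped table has ~300 subtypes)
lab_lookup <- tribble(
  ~ehr_lab_name,              ~dd_code,    ~dd_unit,
  "Potassium",                "potassium", "mmol/L",
  "Potassium(POC)",           "potassium", "mmol/L",
  "Potassium,whole-bld",      "potassium", "mmol/L",
  "Potassium-Level-External", "potassium", "mmol/L"
)

# Toy long-format lab pull (one row per result) standing in for the reformatted EHR data
labs_raw <- tribble(
  ~record_id, ~lab_name,        ~result, ~result_unit,
  1001,       "Potassium(POC)", 4.1,     "mmol/L",
  1001,       "Chloride-POC",   101,     "mmol/L"   # not in the lookup table, so dropped
)

labs_remapped <- labs_raw %>%
  inner_join(lab_lookup, by = c("lab_name" = "ehr_lab_name")) %>%  # keep only DD-coded labs
  filter(result_unit == dd_unit)                                   # keep only DD-approved units
```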
Data Dictionary (DD)
EHR clinical laboratory data is captured in REDCap using the ‘Labs’ repeating instrument (Supplemental Figures 1-2). The DD is provided for use by researchers at REDCap-participating institutions and is optimized to accommodate the same lab type captured more than once on the same day for the same patient. The instrument captures 35 clinical lab types. The DD serves several major purposes in the eLAB pipeline. First, it defines every lab type of interest and associated lab unit of interest with a set field/variable name. It also restricts/defines the type of data allowed for entry for each data field, such as strings or numeric values. The DD is uploaded into REDCap by every participating site/collaborator and ensures each site collects and codes the data the same way. Automation pipelines, such as eLAB, are designed to remap/clean and reformat data/units utilizing key-value look-up tables that filter and select only the labs/units of interest. eLAB ensures the data pulled from the EHR contains the correct unit and format pre-configured by the DD. The use of the same DD at every participating site ensures that the data field code, format, and relationships in the database are uniform across each site to allow for the simple aggregation of the multi-site data. For example, since every site in the MCCPR uses the same DD, aggregation is efficient and the CSV files from different sites are simply combined.
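Because every site exports against the same DD field names and types, pooling reduces to a row bind. A minimal sketch with hypothetical file names:

```r
library(dplyr)
library(readr)

# Hypothetical per-site exports produced against the shared Data Dictionary
site_files <- c("site_A_labs.csv", "site_B_labs.csv", "site_C_labs.csv")

# Identical column names/types across sites make aggregation a simple row bind
mcc_labs <- site_files %>%
  lapply(read_csv, col_types = cols(.default = col_character())) %>%
  bind_rows(.id = "site")
```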
Study Cohort
This study was approved by the MGB IRB. A search of the EHR was performed to identify patients diagnosed with MCC between 1975 and 2021 (N=1,109) for inclusion in the MCCPR. Subjects diagnosed with primary cutaneous MCC between 2016 and 2019 (N=176) were included in the test cohort for exploratory studies of lab result associations with overall survival (OS) using eLAB.
Statistical Analysis
OS is defined as the time from the date of MCC diagnosis to the date of death. Data were censored at the date of the last follow-up visit if no death event occurred. Univariable Cox proportional hazards modeling was performed for all lab predictors. Due to the hypothesis-generating nature of the work, p-values were exploratory and Bonferroni corrections were not applied.
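A sketch of one such univariable fit is shown below using the survival package; the data frame and column names are assumptions standing in for the baseline-labs export, and the data are simulated purely for illustration.

```r
library(survival)

# Toy baseline-labs data frame; in the registry these columns come from the eLAB export
set.seed(1)
baseline_labs <- data.frame(
  os_months          = rexp(176, rate = 1/36),   # time from MCC diagnosis (months)
  os_event           = rbinom(176, 1, 0.4),      # 1 = death, 0 = censored at last follow-up
  baseline_potassium = rnorm(176, mean = 4.2, sd = 0.4)
)

# One univariable Cox proportional hazards fit per lab predictor
fit <- coxph(Surv(os_months, os_event) ~ baseline_potassium, data = baseline_labs)
summary(fit)   # hazard ratio, 95% CI and exploratory (uncorrected) p-value
```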
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is an archive of the data contained in the "Transformations" section in PubChem for integration into patRoon and other workflows.
For further details see the ECI GitLab site: README and main "tps" folder.
Credits:
Concepts: E Schymanski, E Bolton, J Zhang, T Cheng;
Code (in R): E Schymanski, R Helmus, P Thiessen
Transformations: E Schymanski, J Zhang, T Cheng and many contributors to various lists!
PubChem infrastructure: PubChem team
Reaction InChI (RInChI) calculations (v1.0): Gerd Blanke (previous versions of these files)
Acknowledgements: ECI team who contributed to related efforts, especially: J. Krier, A. Lai, M. Narayanan, T. Kondic, P. Chirsir, E. Palm. All contributors to the NORMAN-SLE transformations!
This data package is associated with the publication “Meta-metabolome ecology reveals that geochemistry and microbial functional potential are linked to organic matter development across seven rivers” submitted to Science of the Total Environment. This data package includes the data necessary to replicate the analyses presented within the manuscript to investigate dissolved organic matter (DOM) development across broad spatial distances and within divergent biomes. Specifically, we included the Fourier transform ion cyclotron resonance mass spectrometry (FTICR-MS) data, geochemistry data, annotated metagenomic data, and results from ecological null modeling analyses in this data package. Additionally, we included the scripts necessary to generate the figures from the manuscript. Complete metagenomic data associated with this data package can be found at the National Center for Biotechnology Information (NCBI) under Bioproject PRJNA946291.

This dataset consists of (1) four folders; (2) a file-level metadata (flmd) file; (3) a data dictionary (dd) file; (4) a factor sheet describing samples; and (5) a readme. The FTICR Data folder contains (1) the processed FTICR-MS data; (2) a transformation-weighted characteristics dendrogram generated from the FTICR-MS data; and (3) the script used to generate all FTICR-MS related figures. The Geochemical Data folder contains (1) the single geochemistry data file and (2) the R script responsible for generating associated figures. The Metagenomic Data folder contains (1) annotation information across different levels; (2) carbohydrate-active enzyme (CAZyme) information from the dbCAN database (Yin et al., 2012); (3) phylogenetic tree data (FASTAs, alignments, and tree file); and (4) the scripts necessary to analyze all of these data and generate figures. The Null Modeling Data folder contains (1) data generated during null modeling for each river and all rivers combined and (2) the R scripts necessary to process the data. All files are .csv, .pdf, .tsv, .tre, .faa, .afa, .tree, or .R.
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
Abstract
This dataset was created within the Bioregional Assessment Programme. Data has not been derived from any source datasets. Metadata has been compiled by the Bioregional Assessment Programme. This dataset contains a set of generic R scripts that are used in the propagation of uncertainty through numerical models.
Dataset History
The dataset contains a set of R scripts that are loaded as a library. The R scripts are used to carry out the propagation of uncertainty through numerical models. The scripts contain the functions to create the statistical emulators and do the necessary data transformations and back-transformations. The scripts are self-documenting and were created by Dan Pagendam (CSIRO) and Warren Jin (CSIRO).
Dataset Citation
Bioregional Assessment Programme (2016) R-scripts for uncertainty analysis v01. Bioregional Assessment Source Dataset. Viewed 13 March 2019, http://data.bioregionalassessments.gov.au/dataset/322c38ef-272f-4e77-964c-a14259abe9cf.
This data collection consists of archived Geostationary Operational Environmental Satellite-R (GOES-R) Series Geostationary Lightning Mapper (GLM) Level 0 data from the GOES-East and GOES-West satellites in the operational (OPS) and the post-launch test (PLT) phases. The GOES-R Series provides continuity of the GOES mission through 2035 and improvements in geostationary satellite observational data. GOES-16, the first GOES-R satellite, began operating as GOES-East on December 18, 2017. GOES-17 began operating as GOES-West on February 12, 2019. GOES-T launched on March 1, 2022, and was renamed to GOES-18 on March 14, 2022. GOES-U, the final satellite in the series, is scheduled to launch in 2024. GLM is a near-infrared optical transient detector observing the Western Hemisphere. The GLM Level 0 data are composed of Consultative Committee for Space Data Systems (CCSDS) packets containing the science, housekeeping, engineering, and diagnostic telemetry data downlinked from the instrument. The Level 0 data files also contain orbit and attitude/angular rate packets generated by the GOES spacecraft. Each CCSDS packet contains a unique Application Process Identifier (APID) in the primary header that identifies the specific type of packet and is used to support interpretation of its contents. Users may refer to the GOES-R Series Product Definition and Users’ Guide (PUG) Volume 1 (Main) and Volume 2 (Level 0 Products) for Level 0 data documentation. Related instrument calibration data and Level 1b processing information are archived and available for order at the NOAA CLASS website. The GLM Level 0 data files are delivered in a netCDF-4 file format; however, the constituent CCSDS packets are stored in a byte array, making the data opaque to standard netCDF reader applications. The GLM Level 0 data files are packaged in hourly tar files (data bundles) by satellite for the archive. Recently ingested archive tar files are available for 14 days on an anonymous FTP server for users to download. Data archived on offline tape may be requested from NCEI.
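Although the packet payloads are opaque to generic netCDF tools, the container itself can still be opened and inspected from R. A minimal sketch, assuming a locally downloaded hourly tar bundle (file names are placeholders) and the ncdf4 package:

```r
library(ncdf4)

# Placeholder names: an hourly GLM Level 0 tar bundle retrieved from the archive/FTP server
untar("glm_l0_hourly_bundle.tar", exdir = "glm_l0")
nc_files <- list.files("glm_l0", pattern = "\\.nc$", recursive = TRUE, full.names = TRUE)

# The CCSDS packets sit in an opaque byte array, so a generic reader can only inspect the
# container; interpreting packet contents requires the APID definitions in the PUG volumes.
nc <- nc_open(nc_files[1])
print(names(nc$var))   # variables present in the Level 0 container
nc_close(nc)
```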
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
R code and tutorial for downloading and processing agrometeorological data from API client sources. Last update on March 18, 2022.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract: This repository provides different artifacts developed in and used for the evaluation of the dissertation "Building Transformation Networks for Consistent Evolution of Interrelated Models". It serves as a reproduction package for the contributions and evaluations of that thesis. The artifacts comprise an approach to evaluate compatibility of QVT-R transformations, evaluations of interoperability issues in transformation networks and approaches to avoid them, a language to define consistency between multiple models, and an evaluation of this language. The package contains a prepared execution environment for the different artifacts. In addition, it provides a script to run the environment for some of the artifacts and automatically resolve all dependencies based on Docker.
Technical Remarks: Instructions on how to use the data can be found within the repository.
* Differential Coexpression Script (differentialCoexpression.r): This script contains the use of previously normalized data to execute the DiffCoEx computational pipeline on an experiment with four treatment groups.
* Normalized Transformed Expression Count Data (Expression_Data.zip): Normalized, transformed expression count data of Medicago truncatula and mycorrhizal fungi is given as an R data frame where the columns denote different genes and rows denote different samples. This data is used for downstream differential coexpression analyses.
* Normalization and Transformation of Raw Count Data Script (dataPrep.r): Raw count data is transformed and normalized with available R packages and RNA-Seq best practices.
* Raw_Count_Data_Mycorrhizal_Fungi: Raw count data from HtSeq for mycorrhizal fungi reads are later transformed and normalized for use in differential coexpression analysis. 'R+' indicates that the sample was obtained from a plant grown in the presence of both mycorrhizal fungi and rhizobia. 'R-' indicate...
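The DiffCoEx pipeline itself is run by differentialCoexpression.r; purely as a conceptual illustration of what differential coexpression measures, the sketch below contrasts gene-gene correlation matrices between two treatment groups on simulated data (not the Medicago counts, and not the DiffCoEx code).

```r
# Conceptual illustration of differential coexpression on toy data:
# compare gene-gene correlation matrices between two treatment groups.
set.seed(42)
n_genes <- 20
expr_group1 <- matrix(rnorm(30 * n_genes), nrow = 30,
                      dimnames = list(NULL, paste0("gene", seq_len(n_genes))))
expr_group2 <- matrix(rnorm(30 * n_genes), nrow = 30,
                      dimnames = list(NULL, paste0("gene", seq_len(n_genes))))

cor1 <- cor(expr_group1, method = "spearman")   # coexpression in group 1
cor2 <- cor(expr_group2, method = "spearman")   # coexpression in group 2

# Gene pairs with the largest change in coexpression between groups
delta <- abs(cor1 - cor2)
head(sort(delta[upper.tri(delta)], decreasing = TRUE))
```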
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Author: Andrew J. Felton
Date: 11/15/2024
This R project contains the primary code and data (following pre-processing in Python) used for data production, manipulation, visualization, analysis, and figure production for the study entitled:
"Global estimates of the storage and transit time of water through vegetation"
Please note that 'turnover' and 'transit' are used interchangeably. Also please note that this R project has been updated multiple times as the analysis was revised throughout the peer review process.
#Data information:
The data folder contains key data sets used for analysis. In particular:
"data/turnover_from_python/updated/august_2024_lc/" contains the core datasets used in this study including global arrays summarizing five year (2016-2020) averages of mean (annual) and minimum (monthly) transit time, storage, canopy transpiration, and number of months of data able as both an array (.nc) or data table (.csv). These data were produced in python using the python scripts found in the "supporting_code" folder. The remaining files in the "data" and "data/supporting_data" folder primarily contain ground-based estimates of storage and transit found in public databases or through a literature search, but have been extensively processed and filtered here. The "supporting_data"" folder also contains annual (2016-2020) MODIS land cover data used in the analysis and contains separate filters containing the original data (.hdf) and then the final process (filtered) data in .nc format. The resulting annual land cover distributions were used in the pre-processing of data in python.
#Code information
Python scripts can be found in the "supporting_code" folder.
Each R script in this project has a role:
"01_start.R": This script sets the working directory, loads in the tidyverse package (the remaining packages in this project are called using the `::` operator), and can run two other scripts: one that loads the customized functions (02_functions.R) and one for importing and processing the key dataset for this analysis (03_import_data.R).
"02_functions.R": This script contains custom functions. Load this using the `source()` function in the 01_start.R script.
"03_import_data.R": This script imports and processes the .csv transit data. It joins the mean (annual) transit time data with the minimum (monthly) transit data to generate one dataset for analysis: annual_turnover_2. Load this using the
`source()` function in the 01_start.R script.
"04_figures_tables.R": This is the main workhouse for figure/table production and supporting analyses. This script generates the key figures and summary statistics used in the study that then get saved in the "manuscript_figures" folder. Note that all maps were produced using Python code found in the "supporting_code"" folder. Also note that within the "manuscript_figures" folder there is an "extended_data" folder, which contains tables of the summary statistics (e.g., quartiles and sample sizes) behind figures containing box plots or depicting regression coefficients.
"supporting_generate_data.R": This script processes supporting data used in the analysis, primarily the varying ground-based datasets of leaf water content.
"supporting_process_land_cover.R": This takes annual MODIS land cover distributions and processes them through a multi-step filtering process so that they can be used in preprocessing of datasets in python.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset collects a raw dataset and a processed dataset derived from the raw dataset. There is a document containing the analytical code for statistical analysis of the processed dataset in .Rmd format and .html format.
The study examined some aspects of mechanical performance of solid wood composites. We were interested in certain properties of solid wood composites made using different adhesives with different grain orientations at the bondline, then treated at different temperatures prior to testing.
Performance was tested by assessing fracture energy and critical fracture energy, lap shear strength, and compression strength of the composites. This document concerns only the fracture properties, which are the focus of the related paper.
Notes:
* the raw data is provided in this upload, but the processing is not addressed here.
* the authors of this document are a subset of the authors of the related paper.
* this document and the related data files were uploaded at the time of submission for review. An update providing the doi of the related paper will be provided when it is available.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Mean skewness and kurtosis for simulated data scenarios.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
To get the consumption model from Section 3.1, one needs to execute the file consumption_data.R. It loads the data for the three phases (./data/CONSUMPTION/PL1.csv, PL2.csv, PL3.csv), transforms the data and builds the model (starting at line 225). The final consumption data can be found in one file for each year in ./data/CONSUMPTION/MEGA_CONS_list.Rdata.
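A minimal sketch of this workflow is shown below; the actual transformation and model-building code lives in consumption_data.R, so the middle step is only a placeholder mirroring the input/output files named above.

```r
# Sketch of the consumption workflow in consumption_data.R; the transformation and
# model-building code starts around line 225 of that script, this only mirrors the I/O.
phases <- c("PL1", "PL2", "PL3")
cons   <- lapply(phases, function(p) read.csv(file.path("data", "CONSUMPTION", paste0(p, ".csv"))))
names(cons) <- phases

# ... per-phase transformation and consumption-model building happen here ...

MEGA_CONS_list <- cons   # placeholder for the final per-year consumption list
save(MEGA_CONS_list, file = "./data/CONSUMPTION/MEGA_CONS_list.Rdata")
```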
To get the results for the optimization problem, one needs to execute the file analyze_data.R. It provides the functions to compare production and consumption data, and to optimize for the different values (PV, MBC, …).
To reproduce the figures one needs to execute the file visualize_results.R. It provides the functions to reproduce the figures.
To calculate the solar radiation that is needed in the Section Production Data, follow the file calculate_total_radiation.R.
To reproduce the radiation data from ERA5 that can be found in data.zip, do the following steps:
1. ERA5: download the reanalysis datasets as GRIB files. For FDIR select "Total sky direct solar radiation at surface", for GHI select "Surface solar radiation downwards", and for ALBEDO select "Forecast albedo".
2. Convert GRIB to csv with the file era5toGRID.sh.
3. Convert the csv file to the data that is used in this paper with the file convert_year_to_grid.R.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The above data include consented patients' individual preprocessed fMRI images: NC, control group (n=15); Stroke, stroke group (n=13).
Image Acquisition: All participants were scanned using a 3.0T MR scanner (Philips Achieva Magnetom Avanto, Amsterdam, Netherlands). T2*-weighted R-fMRI data were acquired using an echo-planar imaging pulse sequence: 33 axial slices; repetition time (TR) = 2000 ms; echo time (TE) = 30 ms; slice thickness = 3.5 mm; gap = 0.7 mm; flip angle (FA) = 90°; matrix = 64×64; field of view (FOV) = 200 × 200 mm². During data acquisition, participants were asked to lie quietly in the scanner with their eyes closed. A total of 240 volumes were obtained for each participant.
Data Processing: R-fMRI data preprocessing was performed with the GRETNA toolbox (http://www.nitrc.org/projects/gretna/) based on SPM8 (http://www.fil.ion.ucl.ac.uk/spm/software/spm8/). After removal of the first 5 volumes, the functional images were first corrected for time offsets between slices and geometrical displacements due to head movement. The corrected functional data were then normalized to Montreal Neurological Institute space using an optimum 12-parameter affine transformation and non-linear deformations, and then resampled to a 3-mm isotropic resolution. The resulting images further underwent spatial smoothing (Gaussian kernel with a full width at half maximum of 6 mm) and linear detrending, ensuring the comparability of our results with those of the existing literature.
This project is a collection of files to allow users to reproduce the model development and benchmarking in "Dawnn: single-cell differential abundance with neural networks" (Hall and Castellano, under review). Dawnn is a tool for detecting differential abundance in single-cell RNAseq datasets. It is available as an R package here. Please contact us if you are unable to reproduce any of the analysis in our paper. The files in this collection correspond to the benchmarking dataset based on single-cell RNAseq of mouse embryo cells.

FILES:

Input data
Dataset from: "A single-cell molecular map of mouse gastrulation and early organogenesis". Nature 566, pp. 490–495 (2019). The input data is loaded from the MouseGastrulationData R package. We upload here the RDS file generated by loading the dataset in process_mouse_cells.R in case the R package becomes unavailable.
* MouseGastrulationData_loaded_dataset.RDS: Dataset loaded from the MouseGastrulationData R package in process_mouse_cells.R (in the call to the EmbryoAtlasData function).

Data processing code
* process_mouse_cells.R: Generates the benchmarking dataset from the input data (loads the input data; runs the standard single-cell RNAseq pipeline). Follows Dann et al. The resulting dataset is saved as mouse_gastrulation_data_regen.RDS.
* simulate_mouse_pc1_Rscript.R: R code to simulate P(Condition_1)s for benchmarking.
* simulate_mouse_pc1_bash.sh: Bash script to execute simulate_mouse_pc1_Rscript.R. Outputs stored in benchmark_dataset_mouse_pc1s_regen.csv.
* simulate_mouse_labels_Rscript.R: R code to simulate labels for benchmarking.
* simulate_mouse_labels_bash.sh: Bash script to execute simulate_mouse_labels_Rscript.R. Outputs stored in benchmark_dataset_mouse.csv.

Resulting datasets
* mouse_gastrulation_data_regen.RDS: Seurat dataset generated by process_mouse_cells.R.
* benchmark_dataset_mouse.csv: Cell labels generated by simulate_mouse_labels_bash.sh.
* benchmark_dataset_mouse_pc1s_regen.csv: P(Condition_1)s generated by simulate_mouse_pc1_bash.sh.
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
The 2015-16 Budget is officially available at budget.gov.au as the authoritative source of Budget Papers and Portfolio Budget Statement (PBS) documents. This dataset is a collection of data sources from the 2015-16 Budget, including:

* The Portfolio Budget Statement Excel spreadsheets – available after PBSs are tabled in the Senate (~8.30pm Budget night).
* A machine-readable CSV of all PBS Excel spreadsheet line items – available after PBSs are tabled in the Senate and translated (~8.30pm Budget night).

Data from the 2015-16 Budget are provided to assist those who wish to analyse, visualise and programmatically access the 2015-16 Budget.

Data users should refer to footnotes and memoranda in the original files as these are not usually captured in machine readable CSVs.

We welcome your feedback and comments below.

This dataset was prepared by the Department of Finance and the Department of the Treasury.
The PBS Excel files published should include the following financial tables with headings and footnotes. Only the line item data (Table 2.2) is available in CSV at this stage. Much of the other data is also available in Budget Papers 1 and 4 in aggregate form:

* Table 1.1: Entity Resource Statement;
* Table 1.2: Entity 2015-16 Budget Measures;
* Table 2.1: Budgeted Expenses for Outcome X;
* Table 2.2: Programme Expenses and Programme Components;
* Table 3.1.1: Movement of Administered Funds Between Years;
* Table 3.1.2: Estimates of Special Account Flows and Balances;
* Table 3.1.3: Australian Government Indigenous Expenditure (AGIE);
* Tables 3.2.1 to 3.2.6: Departmental Budgeted Financial Statements; and
* Tables 3.2.7 to 3.2.11: Administered Budgeted Financial Statements.

Please note, total expenses reported in the CSV file ‘2015-16 PBS line items dataset’ were prepared from individual entity programme expense tables. Totalling these figures does not produce the total expense figure in ‘Table 1: Estimates of General Government Expenses’ (Statement 6, Budget Paper 1).

Differences relate to:

1. Intra-entity charging for services which are eliminated for the reporting of general government financial statements;
2. Entity expenses that involve revaluation of assets and liabilities are reported as other economic flows in general government financial statements; and
3. Additional entities’ expenses are included in general government sector expenses (e.g. Australian Strategic Policy Institute Limited and other entities), noting that only entities that are directly government funded are required to prepare a PBS.

The original PBS Excel files and published documents include sub-totals and totals by entity and appropriation type which are not included in the line item CSV. These can be calculated programmatically. Where modifications are identified they will be updated as required.

If a corrigendum to an entity's PBS is issued after budget night, tables will be updated as necessary.

The structure of the line item CSV is:

* Portfolio
* Department/Entity
* Outcome
* Program
* Expense type
* Appropriation type
* Description
* 2014-15
* 2015-16
* 2016-17
* 2017-18
* 2018-19
* Source document
* Source table
* URL

The data transformation is expected to be complete by midday 13 May. We may put up an incomplete CSV which will continue to be updated as additional PBSs are transformed into data form.

The following portfolios are included in the line item CSV:

* Agriculture
* Attorney General's
* Communications
* Defence
* Education and Training
* Employment
* Environment
* Finance
* Foreign Affairs and Trade
* Health
* Human Services
* Immigration and Border Protection
* Industry and Science
* Infrastructure and Regional Development
* Parliamentary Departments
* Prime Minister and Cabinet
* Social Services
* Treasury
* Veterans' Affairs
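Given that column structure, the line-item CSV can be summarised directly in R; a minimal sketch with a placeholder file name, assuming the year columns are exported as text with thousands separators:

```r
# Sketch: total programme expenses by portfolio for 2015-16 from the line-item CSV.
# The file name is a placeholder; column names follow the structure listed above.
line_items <- read.csv("2015-16-pbs-line-items.csv", check.names = FALSE,
                       stringsAsFactors = FALSE)

# Strip thousands separators in case the figures are exported as formatted text
line_items$`2015-16` <- as.numeric(gsub(",", "", line_items$`2015-16`))

by_portfolio <- aggregate(`2015-16` ~ Portfolio, data = line_items, FUN = sum)
head(by_portfolio[order(-by_portfolio$`2015-16`), ])
```

As noted above, totals computed this way will not reconcile exactly with the aggregate figures in Budget Paper 1.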
We have made a number of data tables from the Budget Papers available in Excel and CSV formats.

Below is the list of the tables published and whether we’ve translated them into CSV form this year:

* Budget Paper 1: Appendix A1 - Estimates of expenses by function and sub-function
* Budget Paper 1: Overview - Appendix C Major Initiatives
* Budget Paper 1: Overview - Appendix D Major Savings
* Budget Paper 1: Statement 3 - Table 7: Reconciliation of underlying cash balance estimates
* Budget Paper 1: Statement 4 - Table 1: Australian Government general government receipts
* Budget Paper 1: Statement 4 - Table 7: Australian Government general government (cash) receipts
* Budget Paper 1: Statement 4 - Table 10: Reconciliation of 2015-16 general government (accrual) revenue
* Budget Paper 1: Statement 4 - Supplementary table 3: Australian Government (accrual) revenue
* Budget Paper 1: Statement 10 - Table 1: Australian Government general government sector receipts, payments, net Future Fund earnings and underlying cash balance
* Budget Paper 1: Statement 10 - Table 4: Australian Government general government sector taxation receipts, non-taxation receipts and total receipts
* Budget Paper 1: Statement 10 - Table 5: Australian Government general government sector net debt and net interest payments
* Budget Paper 4 Table 1.1 – Agency Resourcing
* Budget Paper 4 Table 1.2 – Special Appropriations
* Budget Paper 4 Table 1.3 – Special Accounts
* Budget Paper 4 Table 2.2 – Staffing Tables
* Budget Paper 4 Table 3.1 – Departmental Expenses
* Budget Paper 4 Table 3.2 – Net Capital Investment
The dataset was derived by the Bioregional Assessment Programme from multiple source datasets. The source datasets are identified in the Lineage field in this metadata statement. The processes undertaken to produce this derived dataset are described in the History field in this metadata statement.
This dataset contains all the scripts used to carry out the uncertainty analysis for the maximum drawdown and time to maximum drawdown at the groundwater receptors in the Hunter bioregion and all the resulting posterior predictions. This is described in product 2.6.2 Groundwater numerical modelling (Herron et al. 2016). See History for a detailed explanation of the dataset contents.
References:
Herron N, Crosbie R, Peeters L, Marvanek S, Ramage A and Wilkins A (2016) Groundwater numerical modelling for the Hunter subregion. Product 2.6.2 for the Hunter subregion from the Northern Sydney Basin Bioregional Assessment. Department of the Environment, Bureau of Meteorology, CSIRO and Geoscience Australia, Australia.
This dataset uses the results of the design of experiment runs of the groundwater model of the Hunter subregion to train emulators to (a) constrain the prior parameter ensembles into the posterior parameter ensembles and to (b) generate the predictive posterior ensembles of maximum drawdown and time to maximum drawdown. This is described in product 2.6.2 Groundwater numerical modelling (Herron et al. 2016).
A flow chart of the way the various files and scripts interact is provided in HUN_GW_UA_Flowchart.png (editable version in HUN_GW_UA_Flowchart.gliffy).
R-script HUN_DoE_Parameters.R creates the set of parameters for the design of experiment in HUN_DoE_Parameters.csv. Each of these parameter combinations is evaluated with the groundwater model (dataset HUN GW Model v01). Associated with this spreadsheet is file HUN_GW_Parameters.csv. This file contains, for each parameter, whether it is included in the sensitivity analysis or tied to another parameter, the initial value and range, the transformation, and the type of prior distribution with its mean and covariance structure.
The results of the design of experiment model runs are summarised in files HUN_GW_dmax_DoE_Predictions.csv, HUN_GW_tmax_DoE_Predictions.csv, HUN_GW_DoE_Observations.csv, HUN_GW_DoE_mean_BL_BF_hist.csv which have the maximum additional drawdown, the time to maximum additional drawdown for each receptor and the simulated equivalents to observed groundwater levels and SW-GW fluxes respectively. These are generated with post-processing scripts in dataset HUN GW Model v01 from the output (as exemplified in dataset HUN GW Model simulate ua999 pawsey v01).
Spreadsheets HUN_GW_dmax_Predictions.csv and HUN_GW_tmax_Predictions.csv capture additional information on each prediction; the name of the prediction, transformation, min, max and median of design of experiment, a boolean to indicate the prediction is to be included in the uncertainty analysis, the layer it is assigned to and which objective function to use to constrain the prediction.
Spreadsheet HUN_GW_Observations.csv has additional information on each observation; the name of the observation, a boolean to indicate to use the observation, the min and max of the design of experiment, a metadata statement describing the observation, the spatial coordinates, the observed value and the number of observations at this location (from dataset HUN bores v01). Further it has the distance of each bore to the nearest blue line network and the distance to each prediction (both in km). Spreadsheet HUN_GW_mean_BL_BF_hist.csv has similar information, but on the SW-GW flux. The observed values are from dataset HUN Groundwater Flowrate Time Series v01
These files are used in script HUN_GW_SI.py to generate sensitivity indices (based on the Plischke et al. (2013) method) for each group of observations and predictions. These indices are saved in spreadsheets HUN_GW_dmax_SI.csv, HUN_GW_tmax_SI.csv, HUN_GW_hobs_SI.py, HUN_GW_mean_BF_hist_SI.csv
Script HUN_GW_dmax_ObjFun.py calculates the objective function values for the design of experiment runs. Each prediction has a tailored objective function which is a weighted sum of the residuals between observations and predictions with weights based on the distance between observation and prediction. In addition to that there is an objective function for the baseflow rates. The results are stored in HUN_GW_DoE_ObjFun.csv and HUN_GW_ObjFun.csv.
The latter files are used in scripts HUN_GW_dmax_CreatePosteriorParameters.R to carry out the Monte Carlo sampling of the prior parameter distributions with the Approximate Bayesian Computation methodology as described in Herron et al (2016) by generating and applying emulators for each objective function. The scripts use the scripts in dataset R-scripts for uncertainty analysis v01. These files are run on the high performance computation cluster machines with batch file HUN_GW_dmax_CreatePosterior.slurm. These scripts result in posterior parameter combinations for each objective function, stored in directory PosteriorParameters, with filename convention HUN_GW_dmax_Posterior_Parameters_OO_$OFName$.csv where $OFName$ is the name of the objective function. Python script HUN_GW_PosteriorParameters_Percentiles.py summarizes these posterior parameter combinations and stores the results in HUN_GW_PosteriorParameters_Percentiles.csv.
The same set of spreadsheets is used to test convergence of the emulator performance with script HUN_GW_emulator_convergence.R and batch file HUN_GW_emulator_convergence.slurm to produce spreadsheet HUN_GW_convergence_objfun_BF.csv.
The posterior parameter distributions are sampled with scripts HUN_GW_dmax_tmax_MCsampler.R and associated .slurm batch file. The scripts create and apply an emulator for each prediction. The emulator and results are stored in directory Emulators. This directory is not part of this dataset but can be regenerated by running the scripts on the high performance computation clusters. A single emulator and associated output is included for illustrative purposes.
Script HUN_GW_collate_predictions.csv collates all posterior predictive distributions in spreadsheets HUN_GW_dmax_PosteriorPredictions.csv and HUN_GW_tmax_PosteriorPredictions.csv. These files are further summarised in spreadsheet HUN_GW_dmax_tmax_excprob.csv with script HUN_GW_exc_prob. This spreadsheet contains for all predictions the coordinates, layer, number of samples in the posterior parameter distribution and the 5th, 50th and 95th percentile of dmax and tmax, the probability of exceeding 1 cm and 20 cm drawdown, the maximum dmax value from the design of experiment and the threshold of the objective function and the acceptance rate.
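Purely as an illustration of the kind of summary stored in HUN_GW_dmax_tmax_excprob.csv, the sketch below computes percentiles and exceedance probabilities from a posterior prediction table; the layout (one column per receptor, one row per posterior sample) and the drawdown unit (metres) are assumptions, and the authoritative summary is produced by HUN_GW_exc_prob.

```r
# Assumed layout: one column per prediction (receptor), one row per posterior sample,
# drawdown in metres. The real summary is generated by the dataset's own scripts.
dmax_post <- read.csv("HUN_GW_dmax_PosteriorPredictions.csv")

summarise_prediction <- function(x) {
  c(p05          = unname(quantile(x, 0.05)),
    p50          = unname(quantile(x, 0.50)),
    p95          = unname(quantile(x, 0.95)),
    prob_gt_1cm  = mean(x > 0.01),   # probability of exceeding 1 cm drawdown
    prob_gt_20cm = mean(x > 0.20))   # probability of exceeding 20 cm drawdown
}

t(apply(dmax_post, 2, summarise_prediction))
```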
The script HUN_GW_dmax_tmax_MCsampler.R is also used to evaluate parameter distributions HUN_GW_dmax_Posterior_Parameters_HUN_OF_probe439.csv and HUN_GW_dmax_Posterior_Parameters_Mackie_OF_probe439.csv. These are, for one prediction, different parameter distributions, in which the latter represents local information. The corresponding dmax values are stored in HUN_GW_dmax_probe439_HUN.csv and HUN_GW_dmax_probe439_Mackie.csv.
Bioregional Assessment Programme (XXXX) HUN GW Uncertainty Analysis v01. Bioregional Assessment Derived Dataset. Viewed 09 October 2018, http://data.bioregionalassessments.gov.au/dataset/c25db039-5082-4dd6-bb9d-de7c37f6949a.
Derived From HUN GW Model code v01
Derived From NSW Office of Water Surface Water Entitlements Locations v1_Oct2013
Derived From NSW Office of Water - National Groundwater Information System 20140701
Derived From Travelling Stock Route Conservation Values
Derived From HUN GW Model v01
Derived From NSW Wetlands
Derived From Climate Change Corridors Coastal North East NSW
Derived From Communities of National Environmental Significance Database - RESTRICTED - Metadata only
Derived From Climate Change Corridors for Nandewar and New England Tablelands
Derived From National Groundwater Dependent Ecosystems (GDE) Atlas
Derived From R-scripts for uncertainty analysis v01
Derived From Asset database for the Hunter subregion on 27 August 2015
Derived From Birds Australia - Important Bird Areas (IBA) 2009
Derived From Estuarine Macrophytes of Hunter Subregion NSW DPI Hunter 2004
Derived From Hunter CMA GDEs (DRAFT DPI pre-release)
Derived From Camerons Gorge Grassy White Box Endangered Ecological Community (EEC) 2008
Derived From Atlas of Living Australia NSW ALA Portal 20140613
Derived From Spatial Threatened Species and Communities (TESC) NSW
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Reddit contents and complementary data regarding the r/The_Donald community and its main moderation interventions, used for the corresponding article indicated in the title.
An accompanying R notebook can be found in: https://github.com/amauryt/make_reddit_great_again
If you use this dataset please cite the related article.
The dataset timeframe of the Reddit contents (submissions and comments) spans from 30 weeks before Quarantine (2018-11-28) to 30 weeks after Restriction (2020-09-23). The original Reddit content was collected from the Pushshift monthly data files, transformed, and loaded into two SQLite databases.
The first database, the_donald.sqlite, contains all the available content from r/The_Donald created during the dataset timeframe, with the last content being posted several weeks before the timeframe upper limit. It only has two tables: submissions and comments. It should be noted that the IDs of contents are in base 10 (numeric integers), unlike the original base 36 (alphanumeric) IDs used on Reddit and Pushshift. This is for efficient storage and processing. If necessary, many programming languages or libraries can easily convert IDs from one base to another, as the helper below illustrates.
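A small R helper for the conversion (the sum is accumulated as a double because longer IDs overflow the 32-bit integers returned by strtoi(); the example ID is illustrative only):

```r
# Convert a Reddit/Pushshift base-36 ID to the base-10 integer used in the databases.
base36_to_base10 <- function(id) {
  digits <- match(strsplit(tolower(id), "")[[1]], c(as.character(0:9), letters)) - 1
  sum(digits * 36 ^ rev(seq_along(digits) - 1))   # double, avoids 32-bit overflow
}

base36_to_base10("gx57e9a")   # illustrative comment ID
```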
The second database, core_the_donald.sqlite, contains all the available content from core users of r/The_Donald made platform-wise (i.e., within and without the subreddit) during the dataset timeframe. Core users are defined as those who authored a submission or a comment per week in r/The_Donald during the 30 weeks prior to the subreddit's Quarantine. The database has four tables: submissions, comments, subreddits, and perspective_scores. The subreddits table contains the names of the subreddits to which submissions and comments were made (their IDs are also in base 10). The perspective_scores table contains comment toxicity scores.
The Perspective API was used to score comments based on the attributes toxicity and severe_toxicity. It should be noted that not all of the comments in core_the_donald have a score because the comment body was blank or because the Perspective API returned a request error (after three tries). However, the percentage of missing scores is minuscule.
A third file, mbfc_scores.csv, contains the bias and factual reporting accuracy collected in October 2021 from Media Bias / Fact Check (MBFC). Both attributes are scored in a Likert-like manner. One can associate submissions with MBFC scores by joining on the domain column.
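A minimal sketch of that join from R, assuming the submissions table exposes id and domain columns as described above (column names beyond that are not guaranteed):

```r
library(DBI)
library(RSQLite)

con <- dbConnect(SQLite(), "core_the_donald.sqlite")
# Selected columns are assumptions based on the description above
submissions <- dbGetQuery(con, "SELECT id, domain FROM submissions")
dbDisconnect(con)

mbfc <- read.csv("mbfc_scores.csv", stringsAsFactors = FALSE)

# Associate each submission with MBFC bias / factual-reporting scores via the domain column
submissions_scored <- merge(submissions, mbfc, by = "domain")
```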
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
LScDC Word-Category RIG Matrix
April 2020, by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk / suzenneslihan@hotmail.com)
Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes

Getting Started
This file describes the Word-Category RIG Matrix for the Leicester Scientific Corpus (LSC) [1], the procedure to build the matrix, and introduces the Leicester Scientific Thesaurus (LScT) with the construction process. The Word-Category RIG Matrix is a 103,998 by 252 matrix, where rows correspond to words of the Leicester Scientific Dictionary-Core (LScDC) [2] and columns correspond to 252 Web of Science (WoS) categories [3, 4, 5]. Each entry in the matrix corresponds to a pair (category, word). Its value for the pair shows the Relative Information Gain (RIG) on the belonging of a text from the LSC to the category from observing the word in this text. The CSV file of the Word-Category RIG Matrix in the published archive is presented with two additional columns: the sum of RIGs in categories and the maximum of RIGs over categories (the last two columns of the matrix). So, the file ‘Word-Category RIG Matrix.csv’ contains a total of 254 columns.

This matrix is created to be used in future research on quantifying meaning in scientific texts, under the assumption that words have scientifically specific meanings in subject categories and that the meaning can be estimated by information gains from word to categories.

LScT (Leicester Scientific Thesaurus) is a scientific thesaurus of English. The thesaurus includes a list of 5,000 words from the LScDC. We consider ordering the words of the LScDC by the sum of their RIGs in categories; that is, words are arranged by their informativeness in the scientific corpus LSC. Therefore, the meaningfulness of words is evaluated by the words’ average informativeness in the categories. We have decided to include the most informative 5,000 words in the scientific thesaurus.

Words as a Vector of Frequencies in WoS Categories
Each word of the LScDC is represented as a vector of frequencies in WoS categories. Given the collection of the LSC texts, each entry of the vector consists of the number of texts containing the word in the corresponding category. It is noteworthy that texts in a corpus do not necessarily belong to a single category, as they are likely to correspond to multidisciplinary studies, specifically in a corpus of scientific texts. In other words, categories may not be exclusive. There are 252 WoS categories and a text can be assigned to at least 1 and at most 6 categories in the LSC. Using the binary calculation of frequencies, we introduce the presence of a word in a category. We create a vector of frequencies for each word, where dimensions are categories in the corpus. The collection of vectors, with all words and categories in the entire corpus, can be shown in a table, where each entry corresponds to a pair (word, category). This table is built for the LScDC with 252 WoS categories and presented in the published archive with this file. The value of each entry in the table shows how many times a word of the LScDC appears in a WoS category. The occurrence of a word in a category is determined by counting the number of the LSC texts containing the word in a category.
Words as a Vector of Relative Information Gains Extracted for Categories
In this section, we introduce our approach to the representation of a word as a vector of relative information gains for categories, under the assumption that the meaning of a word can be quantified by the information it gains for categories.

For each category, a function is defined on texts that takes the value 1 if the text belongs to the category, and 0 otherwise. For each word, a function is defined on texts that takes the value 1 if the word belongs to the text, and 0 otherwise. Consider the LSC as a probabilistic sample space (the space of equally probable elementary outcomes). For these Boolean random variables, the joint probability distribution, the entropy and the information gains are defined.

The information gain about the category from the word is the amount of information on the belonging of a text from the LSC to the category from observing the word in the text [6]. We used the Relative Information Gain (RIG), which provides a normalised measure of the information gain; this allows information gains for different categories to be compared. The calculations of entropy, Information Gains and Relative Information Gains can be found in the README file in the published archive.

Given a word, we created a vector where each component of the vector corresponds to a category. Therefore, each word is represented as a vector of relative information gains. It is obvious that the dimension of the vector for each word is the number of categories. The set of vectors is used to form the Word-Category RIG Matrix, in which each column corresponds to a category, each row corresponds to a word and each component is the relative information gain from the word to the category. In the Word-Category RIG Matrix, a row vector represents the corresponding word as a vector of RIGs in categories. We note that in the matrix, a column vector represents the RIGs of all words in an individual category. If we choose an arbitrary category, words can be ordered by their RIGs from the most informative to the least informative for the category. As well as ordering words in each category, words can be ordered by two criteria: the sum and the maximum of RIGs in categories. The top n words in this list can be considered as the most informative words in the scientific texts. For a given word, the sum and maximum of RIGs are calculated from the Word-Category RIG Matrix.

RIGs for each word of the LScDC in 252 categories are calculated and vectors of words are formed. We then form the Word-Category RIG Matrix for the LSC. For each word, the sum (S) and maximum (M) of RIGs in categories are calculated and appended at the end of the matrix (the last two columns of the matrix). The Word-Category RIG Matrix for the LScDC with 252 categories, the sum of RIGs in categories and the maximum of RIGs over categories can be found in the database.

Leicester Scientific Thesaurus (LScT)
Leicester Scientific Thesaurus (LScT) is a list of 5,000 words from the LScDC [2]. Words of the LScDC are sorted in descending order by the sum (S) of RIGs in categories and the top 5,000 words are selected to be included in the LScT. We consider these 5,000 words as the most meaningful words in the scientific corpus. In other words, the meaningfulness of words is evaluated by the words’ average informativeness in the categories, and the list of these words is considered as a ‘thesaurus’ for science. The LScT with the value of the sum can be found as a CSV file in the published archive.
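In symbols, writing C for the Boolean category indicator and W for the Boolean word indicator defined above, a standard formulation consistent with this description (the exact calculations are given in the README of the archive) is:

```latex
\mathrm{IG}(C;W) = H(C) - H(C \mid W), \qquad
\mathrm{RIG}(C;W) = \frac{\mathrm{IG}(C;W)}{H(C)} = \frac{H(C) - H(C \mid W)}{H(C)},
```

where H(C) is the entropy of the category indicator and H(C | W) the conditional entropy, both estimated from the joint distribution of (C, W) over the LSC texts; RIG is therefore 0 when the word carries no information about the category and 1 when it determines it completely.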
The published archive contains the following files:
1) Word_Category_RIG_Matrix.csv: A 103,998 by 254 matrix where columns are 252 WoS categories plus the sum (S) and the maximum (M) of RIGs in categories (the last two columns of the matrix), and rows are words of the LScDC. Each entry in the first 252 columns is the RIG from the word to the category. Words are ordered as in the LScDC.
2) Word_Category_Frequency_Matrix.csv: A 103,998 by 252 matrix where columns are 252 WoS categories and rows are words of the LScDC. Each entry of the matrix is the number of texts containing the word in the corresponding category. Words are ordered as in the LScDC.
3) LScT.csv: List of words of the LScT with sum (S) values.
4) Text_No_in_Cat.csv: The number of texts in categories.
5) Categories_in_Documents.csv: List of WoS categories for each document of the LSC.
6) README.txt: Description of the Word-Category RIG Matrix, the Word-Category Frequency Matrix and the LScT, and the forming procedures.
7) README.pdf (same as 6, in PDF format)

References
[1] Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v2
[2] Suzen, Neslihan (2019): LScDC (Leicester Scientific Dictionary-Core). figshare. Dataset. https://doi.org/10.25392/leicester.data.9896579.v3
[3] Web of Science. (15 July). Available: https://apps.webofknowledge.com/
[4] WoS Subject Categories. Available: https://images.webofknowledge.com/WOKRS56B5/help/WOS/hp_subject_category_terms_tasca.html
[5] Suzen, N., Mirkes, E. M., & Gorban, A. N. (2019). LScDC-new large scientific dictionary. arXiv preprint arXiv:1912.06858.
[6] Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27(3), 379-423.
Abstract
The dataset was derived by the Bioregional Assessment Programme from multiple source datasets. The source datasets are identified in the Lineage field in this metadata statement. The processes undertaken to produce this derived dataset are described in the History field in this metadata statement. This dataset contains all the scripts used to conduct the uncertainty analysis for the maximum drawdown and time to maximum drawdown at the groundwater receptors in the Clarence-Moreton bioregion and all the resulting posterior predictions. This is described in product 2.6.2 Groundwater numerical modelling (Cui et al. 2016). See History for a detailed explanation of the dataset contents.

Dataset History
This dataset uses the results of the design of experiment runs of the MODFLOW groundwater model of the Clarence-Moreton subregion to train emulators to (a) constrain the prior parameter ensembles into the posterior parameter ensembles and to (b) generate the predictive posterior ensembles of maximum drawdown and time to maximum drawdown. This is described in product 2.6.2 Groundwater numerical modelling (Cui et al. 2016). A flow chart of the way the various files and scripts interact is provided in CLM_MF_dmax_v02_Flowchart.png (editable version in CLM_MF_dmax_v02_Flowchart.gliffy).

R-script CLM_DoE_Parameters.R creates the set of parameters for the design of experiment in CLM_DoE_Parameters.csv. Each of these parameter combinations is evaluated with the groundwater model (dataset CLM groundwater model V1). Associated with this spreadsheet is file CLM_MF_Parameters.csv. This file contains, for each parameter, whether it is included in the sensitivity analysis or tied to another parameter, the initial value and range, the transformation, and the type of prior distribution with its mean and covariance structure.

The results of the design of experiment model runs are summarised in files CLM_MF_dmax_DoE_Predictions.csv, CLM_MF_tmax_DoE_Predictions.csv and CLM_MF_DoE_Observations.csv, which have the maximum additional drawdown, the time to maximum additional drawdown for each receptor and the simulated equivalents to observations respectively. The first two are generated with post-processing scripts in dataset groundwater model V1, while for the last file, additional script CLM_MF_postprocess_riverflux.py is used to summarise the simulated equivalents to the surface water - groundwater exchange flux.

Spreadsheets CLM_MF_dmax_Predictions.csv and CLM_MF_tmax_Predictions.csv capture additional information on each prediction: the name of the prediction, transformation, min, max and median of the design of experiment, a boolean to indicate the prediction is to be included in the uncertainty analysis, the layer it is assigned to and which objective function to use to constrain the prediction.

Spreadsheet CLM_MF_dmax_Observations.csv has additional information on each observation: the name of the observation, a boolean to indicate whether to use the observation, the min and max of the design of experiment, a metadata statement describing whether the observation is steady state (SS) or transient (TR) and the source of the spatial coordinates (from dataset CLM - Bore water level NSW). Further, it has the distance of each bore to the nearest blue line network and the distance to each prediction (both in km).
These files are used in script CLM_MF_SI.py to generate sensitivity indices (based on the Plischke et al. (2013) method) for each group of observations and predictions. These indices are saved in spreadsheets CLM_MF_SI_dmaxL1.csv, CLM_MF_SI_dmaxL2.csv, CLM_MF_SI_dmaxL3.csv, CLM_MF_SI_dmaxL4.csv, CLM_MF_SI_dmaxL6.csv, CLM_MF_SI_hobs.csv, CLM_MF_SI_Qcsg.csv and CLM_MF_SI_objfun.csv.

Script CLM_MF_dmax_ObjFun.py calculates the objective function values for the design of experiment runs. Each prediction in layer 1 has a tailored objective function which is a weighted sum of the residuals between observations and predictions, with weights based on the distance between observation and prediction. In addition to that, there is an objective function for the baseflow and CSG water production rates. The results are stored in CLM_MF_DoE_ObjFun.csv and CLM_MF_ObjFun.csv.

The latter files are used in scripts CLM_MF_dmax_CreatePosteriorParameters_oo.R and CLM_MF_dmax_CreatePosteriorParameters_gen.R to carry out the Markov Chain Monte Carlo sampling of the prior parameter distributions with the Approximate Bayesian Computation methodology, as described in Cui et al (2016), by generating and applying emulators for each objective function. The scripts use the scripts in dataset R-scripts for uncertainty analysis v01. These files are run on the high performance computation cluster machines with batch file CLM_MF_dmax_CreatePosterior.slurm. These scripts result in posterior parameter combinations for each objective function, stored in directory PosteriorParameters, with filename convention CLM_MF_dmax_Posterior_Parameters_OO_%i_batch.csv % 1-982. The general posterior parameter distribution (i.e. without the distance-weighted groundwater level observations) is stored in CLM_MF_dmax_Posterior_Parameters_gen_batch1.csv.

The same set of spreadsheets is used to test convergence of the emulator performance with script CLM_MF_emulator_convergence.R and batch file CLM_MF_emulator_convergence.slurm to produce spreadsheet CLM_MF_convergence_objfun_qriv.csv.

The posterior parameter distributions are sampled with scripts CLM_MF_dmax_MCsampler_OO_i.R, CLM_MF_dmax_MCsampler_gen_i.R, CLM_MF_tmax_MCsampler_OO_i.R, CLM_MF_tmax_MCsampler_gen_i.R and associated .slurm batch files. Files ending in OO_i.R sample predictions that have a groundwater level observation-constrained objective function; files ending in gen_i.R sample the predictions that have the general objective function. The scripts create and apply an emulator for each prediction. The emulator and results are stored in directory Emulators. This directory is not part of this dataset but can be regenerated by running the scripts on the high performance computation clusters.

Script CLM_MF_collate_predictions.csv collates all posterior predictive distributions in spreadsheets CLM_MF_dmax_PosteriorPredictions.csv and CLM_MF_tmax_PosteriorPredictions.csv. These files are further summarised in spreadsheet CLM_MF_dmax_tmax_excprob.csv with script CLM_MF_exc_prob. This spreadsheet contains, for all predictions, the coordinates, layer, number of samples in the posterior parameter distribution and the 5th, 50th and 95th percentile of dmax and tmax, the probability of exceeding 1 cm and 20 cm drawdown, the maximum dmax value from the design of experiment and, for the predictions in layer 1, the threshold of the objective function and the acceptance rate.

Dataset Citation
Bioregional Assessment Programme (2016) CLM MODFLOW Uncertainty Analysis.
Dataset Citation
Bioregional Assessment Programme (2016) CLM MODFLOW Uncertainty Analysis. Bioregional Assessment Derived Dataset. Viewed 10 July 2017, http://data.bioregionalassessments.gov.au/dataset/25e01e3c-7b87-4200-9ef2-5c5405627130.
Dataset Ancestors
Derived From QLD Dept of Natural Resources and Mines, Groundwater Entitlements 20131204
Derived From Qld 100K mapsheets - Mount Lindsay
Derived From Qld 100K mapsheets - Helidon
Derived From Qld 100K mapsheets - Ipswich
Derived From CLM - Woogaroo Subgroup extent
Derived From CLM - Interpolated surfaces of Alluvium depth
Derived From CLM - Extent of Logan and Albert river alluvial systems
Derived From CLM - Bore allocations NSW v02
Derived From CLM - Bore allocations NSW
Derived From CLM - Bore assignments NSW and QLD summary tables
Derived From CLM - Geology NSW & Qld combined v02
Derived From CLM - Orara-Bungawalbin bedrock
Derived From CLM16gwl NSW Office of Water_GW licence extract linked to spatial locations_CLM_v3_13032014
Derived From CLM groundwater model hydraulic property data
Derived From CLM - Koukandowie FM bedrock
Derived From GEODATA TOPO 250K Series 3, File Geodatabase format (.gdb)
Derived From NSW Office of Water - National Groundwater Information System 20140701
Derived From CLM - Gatton Sandstone extent
Derived From CLM16gwl NSW Office of Water, GW licence extract linked to spatial locations in CLM v2 28022014
Derived From Bioregional Assessment areas v03
Derived From NSW Geological Survey - geological units DRAFT line work.
Derived From Mean Annual Climate Data of Australia 1981 to 2012
Derived From CLM Preliminary Assessment Extent Definition & Report (CLM PAE)
Derived From Qld 100K mapsheets - Caboolture
Derived From CLM - AWRA Calibration Gauges SubCatchments
Derived From CLM - NSW Office of Water Gauge Data for Tweed, Richmond & Clarence rivers. Extract 20140901
Derived From Qld 100k mapsheets - Murwillumbah
Derived From AHGFContractedCatchment - V2.1 - Bremer-Warrill
Derived From Bioregional Assessment areas v01
Derived From Bioregional Assessment areas v02
Derived From QLD Current Exploration Permits for Minerals (EPM) in Queensland 6/3/2013
Derived From QLD Department of Natural Resources and Mining Groundwater Database Extract 20131111
Derived From CLM - Bore water level NSW
Derived From Climate model 0.05x0.05 cells and cell centroids
Derived From CLM - New South Wales Department of Trade and Investment 3D geological model layers
Derived From CLM - Metgasco 3D geological model formation top grids
Derived From R-scripts for uncertainty analysis v01
Derived From State Transmissivity Estimates for Hydrogeology Cross-Cutting Project
Derived From CLM - Extent of Bremer river and Warrill creek alluvial systems
Derived From NSW Catchment Management Authority Boundaries 20130917
Derived From Pilot points for prediction interpolation of layer 1 in CLM groundwater model
Derived From Qld 100K mapsheets - Esk
Derived From
The Biogeoclimatic Ecosystem Classification (BEC) system is the ecosystem classification adopted for forest management in British Columbia, based on vegetation, soil, and climate characteristics; the Site Series is the smallest unit of the system. The Ministry of Forests, Lands, Natural Resource Operations and Rural Development of the Government of British Columbia ("the Ministry") developed a web-based tool, BEC Map, for maintaining and sharing BEC system information, but Site Series information was not included in the tool because of its quantity and complexity. To allow users to explore and interact with this information, this project aimed to develop a web-based tool for the Site Series classes, with high data quality and user flexibility, using the "Shiny" and "Leaflet" packages in R. The project started with data classification and pre-processing of the raster images and attribute tables, including identification of client requirements, spatial database design, and data cleaning. After data transformation, spatial relationships among these data were established for code development, which included setting up the web map and interactive tools to improve user friendliness and flexibility. The code was further tested and refined to meet the requirements of the Ministry. The web-based tool provided an efficient and effective platform for presenting the complex Site Series features, using a Web Map Service (WMS) for map rendering. Four interactive tools were developed to allow users to examine and interact with the information. The study also found that the mode filter performed well in data preservation and noise minimization but suffered from long processing times and the creation of tiny sliver polygons.
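The tool described above combines Shiny and Leaflet and renders layers through WMS. A minimal R sketch of that pattern follows; the WMS endpoint, layer name, and map extent are placeholders, not the Ministry's actual service.

# Minimal Shiny + Leaflet sketch: render a basemap and overlay a WMS layer.
# The WMS endpoint and layer name below are placeholders for illustration only.
library(shiny)
library(leaflet)

ui <- fluidPage(
  titlePanel("Site Series viewer (sketch)"),
  leafletOutput("map", height = 600)
)

server <- function(input, output, session) {
  output$map <- renderLeaflet({
    leaflet() %>%
      addTiles() %>%                                     # OpenStreetMap basemap
      setView(lng = -123.1, lat = 54.0, zoom = 5) %>%    # roughly centred on British Columbia
      addWMSTiles(
        baseUrl = "https://example.org/geoserver/wms",   # placeholder WMS endpoint
        layers  = "bec:site_series",                     # placeholder layer name
        options = WMSTileOptions(format = "image/png", transparent = TRUE)
      )
  })
}

shinyApp(ui, server)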
Data Dictionary (DD)
EHR clinical laboratory data are captured in REDCap using the ‘Labs’ repeating instrument (Supplemental Figures 1-2). The DD is provided for use by researchers at REDCap-participating institutions and is optimized to accommodate the same lab type being captured more than once on the same day for the same patient. The instrument captures 35 clinical lab types. The DD serves several major purposes in the eLAB pipeline. First, it defines every lab type of interest and its associated lab unit of interest with a set field/variable name. It also restricts/defines the type of data allowed for entry in each data field, such as string or numeric values. The DD is uploaded into REDCap by every participating site/collaborator and ensures each site collects and codes the data the same way. Automation pipelines, such as eLAB, are designed to remap/clean and reformat data/units utilizing key-value look-up tables that filter and select only the labs/units of interest. eLAB ensures the data pulled from the EHR contain the correct unit and format pre-configured by the DD. The use of the same DD at every participating site ensures that the data field codes, formats, and relationships in the database are uniform across sites, allowing simple aggregation of the multi-site data. For example, since every site in the MCCPR uses the same DD, aggregation is efficient and CSV files from different sites are simply combined.
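To illustrate the key-value remapping and multi-site aggregation described above, a minimal R sketch follows; the lookup entries, column names, and file names are hypothetical and are not taken from DataDictionary_eLAB.csv.

# Minimal sketch: remap raw EHR lab names to data dictionary (DD) codes via a
# key-value lookup table, keep only labs/units defined by the DD, and combine
# CSV exports from multiple sites. All names below are hypothetical.
library(dplyr)

lab_lookup <- data.frame(
  ehr_lab_name = c("Potassium", "Potassium,whole-bld", "Potassium(POC)"),
  dd_code      = c("potassium", "potassium", "potassium"),
  dd_unit      = c("mmol/L", "mmol/L", "mmol/L"),
  stringsAsFactors = FALSE
)

remap_labs <- function(site_csv) {
  read.csv(site_csv, stringsAsFactors = FALSE) %>%
    inner_join(lab_lookup, by = c("lab_name" = "ehr_lab_name")) %>%  # keep only labs of interest
    filter(result_unit == dd_unit) %>%                               # accept only DD-defined units
    select(record_id, dd_code, collection_date, result_value, dd_unit)
}

# Because every site uses the same DD, per-site outputs can simply be stacked
all_sites <- bind_rows(lapply(c("site_a_labs.csv", "site_b_labs.csv"), remap_labs))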
Study Cohort
This study was approved by the MGB IRB. A search of the EHR was performed to identify patients diagnosed with MCC between 1975 and 2021 (N=1,109) for inclusion in the MCCPR. Subjects diagnosed with primary cutaneous MCC between 2016 and 2019 (N=176) were included in the test cohort for exploratory studies of lab result associations with overall survival (OS) using eLAB.
Statistical Analysis
OS was defined as the time from the date of MCC diagnosis to the date of death. Data were censored at the date of the last follow-up visit if no death event occurred. Univariable Cox proportional hazards modeling was performed for each lab predictor. Given the hypothesis-generating nature of the work, p-values were exploratory and Bonferroni corrections were not applied.
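A minimal sketch of this univariable analysis using the survival package (one of eLAB's stated dependencies) is shown below; the variable names and values are hypothetical and do not reflect the registry's actual field names.

# Minimal sketch: univariable Cox proportional hazards model of overall survival
# on a single baseline lab predictor. Variable names and values are hypothetical.
library(survival)

# os_months: time from MCC diagnosis to death or last follow-up (months)
# death:     1 if the patient died, 0 if censored at last follow-up
# baseline_potassium: example baseline lab value
df <- data.frame(
  os_months          = c(12, 30, 7, 45, 22, 60),
  death              = c(1, 0, 1, 0, 1, 0),
  baseline_potassium = c(4.1, 3.8, 5.2, 4.4, 4.9, 3.9)
)

fit <- coxph(Surv(os_months, death) ~ baseline_potassium, data = df)
summary(fit)   # hazard ratio, confidence interval, and exploratory p-value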