https://spdx.org/licenses/CC0-1.0.html
Objective: To develop a clinical informatics pipeline designed to capture large-scale structured EHR data for a national patient registry.
Materials and Methods: The EHR-R-REDCap pipeline is implemented using R-statistical software to remap and import structured EHR data into the REDCap-based multi-institutional Merkel Cell Carcinoma (MCC) Patient Registry using an adaptable data dictionary.
Results: Clinical laboratory data were extracted from EPIC Clarity across several participating institutions. Labs were transformed, remapped and imported into the MCC registry using the EHR labs abstraction (eLAB) pipeline. Forty-nine clinical tests encompassing 482,450 results were imported into the registry for 1,109 enrolled MCC patients. Data-quality assessment revealed highly accurate, valid labs. Univariate modeling was performed for labs at baseline on overall survival (N=176) using this clinical informatics pipeline.
Conclusion: We demonstrate the feasibility of the facile eLAB workflow. EHR data are successfully transformed and bulk-loaded/imported into a REDCap-based national registry to enable real-world data analysis and interoperability.
Methods
eLAB Development and Source Code (R statistical software):
eLAB is written in R (version 4.0.3), and utilizes the following packages for processing: DescTools, REDCapR, reshape2, splitstackshape, readxl, survival, survminer, and tidyverse. Source code for eLAB can be downloaded directly (https://github.com/TheMillerLab/eLAB).
eLAB reformats EHR data abstracted for an identified population of patients (e.g. medical record numbers (MRN)/name list) under an Institutional Review Board (IRB)-approved protocol. The MCCPR does not host MRNs/names and eLAB converts these to MCCPR assigned record identification numbers (record_id) before import for de-identification.
Functions were written to remap EHR bulk lab data pulls/queries from several sources, including Clarity/Crystal reports or an institutional enterprise data warehouse (EDW) such as the Research Patient Data Registry (RPDR) at MGB. The input, a csv/delimited file of labs for user-defined patients, may vary. Thus, users may need to adapt the initial data wrangling script based on the data input format. However, the downstream transformation, code-lab lookup tables, outcomes analysis, and LOINC remapping are standard for use with the provided REDCap Data Dictionary, DataDictionary_eLAB.csv. The available R-markdown (https://github.com/TheMillerLab/eLAB) provides suggestions and instructions on where or when upfront script modifications may be necessary to accommodate input variability.
The eLAB pipeline takes several inputs. For example, the input for use with the ‘ehr_format(dt)’ single-line command is non-tabular data assigned as R object ‘dt’ with 4 columns: 1) Patient Name (MRN), 2) Collection Date, 3) Collection Time, and 4) Lab Results wherein several lab panels are in one data frame cell. A mock dataset in this ‘untidy-format’ is provided for demonstration purposes (https://github.com/TheMillerLab/eLAB).
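A minimal R sketch of assembling such an input and calling ehr_format(), assuming the eLAB functions have been loaded (e.g. by sourcing the scripts from https://github.com/TheMillerLab/eLAB); the column names and example values below are illustrative only, and the mock dataset in the repository should be preferred for real testing.

```r
# Illustrative 'untidy-format' input: one row per collection, with several lab
# panel results packed into a single data-frame cell.
dt <- data.frame(
  patient_name    = "DOE,JANE (0123456)",                                    # Patient Name (MRN)
  collection_date = "2021-01-15",                                            # Collection Date
  collection_time = "08:42",                                                 # Collection Time
  lab_results     = "Sodium 140 mmol/L; Potassium 4.1 mmol/L; Creatinine 0.9 mg/dL",
  stringsAsFactors = FALSE
)

# Single-line command described in the text: reshape the pull so each lab result
# occupies its own row, ready for downstream remapping against the DD.
labs_long <- ehr_format(dt)
```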
Bulk lab data pulls often result in subtypes of the same lab. For example, potassium labs are reported as “Potassium,” “Potassium-External,” “Potassium(POC),” “Potassium,whole-bld,” “Potassium-Level-External,” “Potassium,venous,” and “Potassium-whole-bld/plasma.” eLAB utilizes a key-value lookup table with ~300 lab subtypes for remapping labs to the Data Dictionary (DD) code. eLAB reformats/accepts only those lab units pre-defined by the registry DD. The lab lookup table is provided for direct use or may be re-configured/updated to meet end-user specifications. eLAB is designed to remap, transform, and filter/adjust value units of semi-structured/structured bulk laboratory values data pulls from the EHR to align with the pre-defined code of the DD.
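The remapping step can be pictured as a key-value lookup join. The sketch below uses dplyr rather than the eLAB code itself; the lookup entries, the DD code/unit values, and the labs_long column names (lab_name, unit) are illustrative assumptions, while the full ~300-entry lookup table ships with the repository.

```r
library(dplyr)
library(tibble)

# Illustrative slice of a key-value lookup table mapping EHR lab subtypes to DD codes/units
lab_lookup <- tribble(
  ~ehr_lab_name,          ~dd_code,    ~dd_unit,
  "Potassium",            "potassium", "mmol/L",
  "Potassium(POC)",       "potassium", "mmol/L",
  "Potassium,whole-bld",  "potassium", "mmol/L",
  "Potassium-External",   "potassium", "mmol/L"
)

labs_remapped <- labs_long %>%                                    # output of ehr_format() above
  inner_join(lab_lookup, by = c("lab_name" = "ehr_lab_name")) %>% # keep only labs defined in the DD
  filter(unit == dd_unit)                                         # drop results whose units differ from the DD
```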
Data Dictionary (DD)
EHR clinical laboratory data is captured in REDCap using the ‘Labs’ repeating instrument (Supplemental Figures 1-2). The DD is provided for use by researchers at REDCap-participating institutions and is optimized to accommodate the same lab-type captured more than once on the same day for the same patient. The instrument captures 35 clinical lab types. The DD serves several major purposes in the eLAB pipeline. First, it defines every lab type of interest and associated lab unit of interest with a set field/variable name. It also restricts/defines the type of data allowed for entry for each data field, such as a string or numerics. The DD is uploaded into REDCap by every participating site/collaborator and ensures each site collects and codes the data the same way. Automation pipelines, such as eLAB, are designed to remap/clean and reformat data/units utilizing key-value look-up tables that filter and select only the labs/units of interest. eLAB ensures the data pulled from the EHR contains the correct unit and format pre-configured by the DD. The use of the same DD at every participating site ensures that the data field code, format, and relationships in the database are uniform across each site to allow for the simple aggregation of the multi-site data. For example, since every site in the MCCPR uses the same DD, aggregation is efficient and different site csv files are simply combined.
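As a minimal sketch of that aggregation step (file names are hypothetical), per-site exports that share the same DD-defined columns can simply be stacked:

```r
library(dplyr)
library(readr)

site_files <- c("site_A_labs.csv", "site_B_labs.csv", "site_C_labs.csv")  # hypothetical exports

registry_labs <- site_files %>%
  lapply(read_csv) %>%        # identical columns at every site, thanks to the shared DD
  bind_rows(.id = "site")     # adds a column identifying which file each row came from
```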
Study Cohort
This study was approved by the MGB IRB. Search of the EHR was performed to identify patients diagnosed with MCC between 1975-2021 (N=1,109) for inclusion in the MCCPR. Subjects diagnosed with primary cutaneous MCC between 2016-2019 (N= 176) were included in the test cohort for exploratory studies of lab result associations with overall survival (OS) using eLAB.
Statistical Analysis
OS is defined as the time from date of MCC diagnosis to date of death. Data was censored at the date of the last follow-up visit if no death event occurred. Univariable Cox proportional hazard modeling was performed among all lab predictors. Due to the hypothesis-generating nature of the work, p-values were exploratory and Bonferroni corrections were not applied.
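A minimal sketch of such a univariable screen with the survival package; the data frame cohort and the lab and outcome variable names are illustrative, not the registry's actual field names.

```r
library(survival)

# 'cohort': one row per patient, with overall-survival time in months, an event
# indicator (1 = death, 0 = censored at last follow-up), and baseline lab values.
lab_vars <- c("sodium", "potassium", "creatinine")   # hypothetical baseline labs

univariable_fits <- lapply(lab_vars, function(v) {
  coxph(as.formula(paste("Surv(os_months, death_event) ~", v)), data = cohort)
})
names(univariable_fits) <- lab_vars

# Exploratory hazard ratios and p-values (no Bonferroni correction applied)
lapply(univariable_fits, function(f) summary(f)$coefficients)
```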
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Note: none of the data sets published here contain actual data, they are for testing purposes only.
This data repository contains graph datasets, where each graph is represented by two CSV files: one for node information and another for edge details. To link the files to the same graph, their names include a common identifier based on the number of nodes. For example:
- dataset_30_nodes_interactions.csv: contains 30 rows (nodes).
- dataset_30_edges_interactions.csv: contains 47 rows (edges).
- The common prefix dataset_30 indicates that both files refer to the same graph.

Each node file contains the following columns:
| Name of the Column | Type | Description |
| --- | --- | --- |
| UniProt ID | string | protein identification |
| label | string | protein label (type of node) |
| properties | string | a dictionary containing properties related to the protein |

Each edge file contains the following columns:

| Name of the Column | Type | Description |
| --- | --- | --- |
| Relationship ID | string | relationship identification |
| Source ID | string | identification of the source protein in the relationship |
| Target ID | string | identification of the target protein in the relationship |
| label | string | relationship label (type of relationship) |
| properties | string | a dictionary containing properties related to the relationship |
| Graph | Number of Nodes | Number of Edges | Sparse graph |
| --- | --- | --- | --- |
| dataset_30* | 30 | 47 | Y |
| dataset_60* | 60 | 181 | Y |
| dataset_120* | 120 | 689 | Y |
| dataset_240* | 240 | 2819 | Y |
| dataset_300* | 300 | 4658 | Y |
| dataset_600* | 600 | 18004 | Y |
| dataset_1200* | 1200 | 71785 | Y |
| dataset_2400* | 2400 | 288600 | Y |
| dataset_3000* | 3000 | 449727 | Y |
| dataset_6000* | 6000 | 1799413 | Y |
| dataset_12000* | 12000 | 7199863 | Y |
| dataset_24000* | 24000 | 28792361 | Y |
| dataset_30000* | 30000 | 44991744 | Y |
This repository also includes two additional tiny graph datasets to experiment with before dealing with the larger datasets.

Each node file of these tiny datasets contains the following columns:

| Name of the Column | Type | Description |
| --- | --- | --- |
| ID | string | node identification |
| label | string | node label (type of node) |
| properties | string | a dictionary containing properties related to the node |

Each edge file contains the following columns:

| Name of the Column | Type | Description |
| --- | --- | --- |
| ID | string | relationship identification |
| source | string | identification of the source node in the relationship |
| target | string | identification of the target node in the relationship |
| label | string | relationship label (type of relationship) |
| properties | string | a dictionary containing properties related to the relationship |
| Graph | Number of Nodes | Number of Edges | Sparse graph |
| --- | --- | --- | --- |
| dataset_dummy* | 3 | 6 | N |
| dataset_dummy2* | 3 | 6 | N |
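A minimal R sketch of loading one node/edge file pair by its shared identifier, assuming the column headers match the tables above (read.csv converts spaces in headers to dots, so "UniProt ID" becomes UniProt.ID); the properties column is a string-encoded dictionary, and parsing it is left to the user.

```r
# Load the node and edge files of the dataset_30 graph
nodes <- read.csv("dataset_30_nodes_interactions.csv", stringsAsFactors = FALSE)
edges <- read.csv("dataset_30_edges_interactions.csv", stringsAsFactors = FALSE)

nrow(nodes)  # expected: 30
nrow(edges)  # expected: 47

# Attach node information to each edge via the source protein identifier
edges_with_source <- merge(edges, nodes, by.x = "Source.ID", by.y = "UniProt.ID")
```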
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains >800K CSV files behind the GitTables 1M corpus.
For more information about the GitTables corpus, see the GitTables project website.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This simulated dataset constitutes two files (after decompression): sim_ergo_1600.csv and sim_pat_1600.csv.
1. sim_ergo_1600.csv contains heart-rate time-series data for 1,600 patients' ergometric tests. For each patient, 20 different ergometric tests were simulated. Each row in this file constitutes three field values: Ergo_ID, Heart Rate (BPM), and timestamp.
2. sim_pat_1600.csv contains only four sample readings from each of the patient's 20 ergometric tests. Each row contains three values: patient_ID, Heart Rate, and timestamp.
The goal is to link patients (identified by their patient_ID in sim_pat_1600.csv) to their corresponding ergometric tests (identified by their Ergo_ID in sim_ergo_1600.csv), based solely on matching the timestamp-value pairs from both files. This time-series record linkage task is efficiently accomplished by the proposed tslink2 algorithm. tslink2 is implemented in C++ and is publicly available at https://github.com/ahmsoliman/tslink2.
Data are simulated such that correctly linked/matched identifiers follow the formula |Ergo_ID - patient_ID| mod 104 == 0. This formula is useful for evaluating the linkage algorithm's performance.
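As a minimal sketch (not part of tslink2 itself), the ground-truth rule above can be checked in R for any set of candidate links; the column names Ergo_ID and patient_ID follow the description above, and the example pairs are hypothetical.

```r
# Check whether candidate (patient_ID, Ergo_ID) pairs satisfy the simulated
# ground-truth rule |Ergo_ID - patient_ID| mod 104 == 0.
is_true_link <- function(ergo_id, patient_id) {
  abs(ergo_id - patient_id) %% 104 == 0
}

# Hypothetical predicted links produced by a linkage algorithm
predicted <- data.frame(patient_ID = c(1, 2, 3),
                        Ergo_ID    = c(105, 314, 200))
predicted$correct <- is_true_link(predicted$Ergo_ID, predicted$patient_ID)
mean(predicted$correct)  # fraction of predicted links that are true links
```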
https://research.csiro.au/dap/licences/csiro-data-licence/
A csv file containing the tidal frequencies used for statistical analyses in the paper "Estimating Freshwater Flows From Tidally-Affected Hydrographic Data" by Dan Pagendam and Don Percival.
http://www.gnu.org/licenses/lgpl-3.0.html
The dataset provided here is a rich compilation of various data files gathered to support diverse analytical challenges and education in data science. It is especially curated to provide researchers, data enthusiasts, and students with real-world data across different domains, including biostatistics, travel, real estate, sports, media viewership, and more.
Below is a brief overview of what each CSV file contains:
- Addresses: Practical examples of string manipulation and address data formatting in CSV.
- Air Travel: Historical dataset suitable for analyzing trends in air travel over a period of three years.
- Biostats: A dataset of office workers' biometrics, ideal for introductory statistics and biology.
- Cities: Geographic and administrative data for urban analysis or socio-demographic studies.
- Car Crashes in Catalonia: Weekly traffic accident data from Catalonia, providing a base for public policy research.
- De Niro's Film Ratings: Analyze trends in film ratings over time with this entertainment-focused dataset.
- Ford Escort Sales: Pre-owned vehicle sales data, perfect for regression analysis or price prediction models.
- Old Faithful Geyser: Geological data for pattern recognition and prediction in natural phenomena.
- Freshman Year Weights and BMIs: Dataset depicting weight and BMI changes for health and lifestyle studies.
- Grades: Education performance data which can be correlated with demographics or study patterns.
- Home Sales: A dataset reflecting the housing market dynamics, useful for economic analysis or real estate appraisal.
- Hooke's Law Demonstration: Physics data illustrating the classic principle of elasticity in springs.
- Hurricanes and Storm Data: Climate data on hurricane and storm frequency for environmental risk assessments.
- Height and Weight Measurements: Public health research dataset on anthropometric data.
- Lead Shot Specs: Detailed engineering data for material sciences and manufacturing studies.
- Alphabet Letter Frequency: Text analysis dataset for frequency distribution studies in large text samples.
- MLB Player Statistics: Comprehensive athletic data set for analysis of performance metrics in sports.
- MLB Teams' Seasonal Performance: A dataset combining financial and sports performance data from the 2012 MLB season.
- TV News Viewership: Media consumption data which can be used to analyze viewing patterns and trends.
- Historical Nile Flood Data: A unique environmental dataset for historical trend analysis in flood levels.
- Oscar Winner Ages: A dataset to explore age trends among Oscar-winning actors and actresses.
- Snakes and Ladders Statistics: Data from the game outcomes useful in studying probability and game theory.
- Tallahassee Cab Fares: Price modeling data from the real-world pricing of taxi services.
- Taxable Goods Data: A snapshot of economic data concerning taxation impact on prices.
- Tree Measurements: Ecological and environmental science data related to tree growth and forest management.
- Real Estate Prices from Zillow: Market analysis dataset for those interested in housing price determinants.
The enclosed data respect the comma-separated values (CSV) file format standards, ensuring compatibility with most data processing libraries in Python, R, and other languages. The datasets are ready for import into Jupyter notebooks, RStudio, or any other integrated development environment (IDE) used for data science.
The data is pre-checked for common issues such as missing values, duplicate records, and inconsistent entries, offering a clean and reliable dataset for various analytical exercises. With initial header lines in some CSV files, users can easily identify dataset fields and start their analysis without additional data cleaning for headers.
The dataset adheres to the GNU LGPL license, making it freely available for modification and distribution, provided that the original source is cited. This opens up possibilities for educators to integrate real-world data into curricula, researchers to validate models against diverse datasets, and practitioners to refine their analytical skills with hands-on data.
This dataset has been compiled from https://people.sc.fsu.edu/~jburkardt/data/csv/csv.html, with gratitude to the authors and maintainers for their dedication to providing open data resources for educational and research purposes.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset is an Electronic Health Record prediction dataset collected from a private hospital in Indonesia. It contains patients' laboratory test results, which are used to determine the next treatment: whether the patient is treated as an in-care or an out-care patient. The task embedded in the dataset is classification prediction.
The first dataset (Oesophageal Cancer Clinical.csv) has clinical data of Oesophageal Carcinoma patients.
The second dataset (Oesophageal Cancer Protein.csv) has the protein expression data for the same set of patients.
The two datasets contain information on the same patients. However, the clinical dataset contains a greater number of patient records than have corresponding protein expression data in the second dataset. The clinical dataset has patient_barcode as the unique identifier, whereas the protein expression dataset uses Sample_ID. In both datasets, the patient_barcode can be derived as "TCGA" + "-" + tissue_source_site + "-" + patient_id, e.g. TCGA-2H-A9GF.
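For illustration, a minimal R sketch of the barcode derivation described above; the helper name derive_barcode is hypothetical.

```r
# Hypothetical helper: build the patient_barcode from its components.
derive_barcode <- function(tissue_source_site, patient_id) {
  paste("TCGA", tissue_source_site, patient_id, sep = "-")
}

derive_barcode("2H", "A9GF")
# [1] "TCGA-2H-A9GF"

# Once both tables carry a patient_barcode column, they can be joined, e.g.
# merge(clinical, protein, by = "patient_barcode").
```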
There is a large number of columns in these datasets (83 clinical and 223 protein), giving you great scope to derive interesting findings about oesophageal cancer from this real dataset.
This dataset can be used to analyse genes, mutations and interesting findings relating to this type of cancer.
[IMPORTANT NOTE: Sample file posted on Datarade is not the complete dataset, as Datarade permits only a single CSV file. Visit https://www.careprecise.com/healthcare-provider-data-sample.htm for more complete samples.] CarePrecise developed the AHD, which is updated every month, to provide a comprehensive database of U.S. hospital information. Extracted from the CarePrecise master provider database, which holds information on all of the 6.3 million HIPAA-covered US healthcare providers, and supplemented with additional sources, the Authoritative Hospital Database (AHD) contains records for all HIPAA-covered hospitals. In this database of hospitals we include bed counts, patient satisfaction data, hospital system ownership, hospital charges and cases by Zip Code®, and more. Most records include a cabinet-level or director-level contact. A PlaceKey is provided where available.
The AHD includes bed counts for 95% of hospitals, full contact information on 85%, and fax numbers for 62%. We include detailed patient satisfaction data, employee counts, and medical procedure volumes.
The AHD integrates directly with our extended provider data product to bring you the physicians and practice groups affiliated with the hospitals. This combination of data is the only commercially available hospital dataset of this depth.
NEW: Hospital NPI to CCN Rollup A CarePrecise Exclusive. Using advanced record-linkage technology, the AHD now includes a new file that makes it possible to mine the vast hospital information available in the National Provider Identifier registry database. Hospitals may have dozens of NPI records, each with its own information about a unit, listing facility type and/or medical specialties practiced, as well as separate contact names. To wield the power of this new feature, you'll need the CarePrecise Master Bundle, which contains all of the publicly available NPI registry data. These data are available in other CarePrecise data products.
Counts are approximate due to ongoing updates. Please review the current AHD information here: https://www.careprecise.com/detail_authoritative_hospital_database.htm
The AHD is sold as-is and no warranty is offered regarding accuracy, timeliness, completeness, or fitness for any purpose.
[doc] formats - csv - 3
This dataset contains one csv file at the root:
data.csv
col1|col2
dog|woof
cat|meow
pokemon|pika
human|hello
We define the config name in the YAML config, as well as the exact location of the file, the separator as "|", the names of the columns, and the number of rows to ignore (row #1 is a row of column headers that will be replaced by the names option, and row #0 is ignored). The reference for the options is the documentation… See the full description on the dataset page: https://huggingface.co/datasets/datasets-examples/doc-formats-csv-3.
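For users working locally rather than through the Hugging Face loader, a minimal R sketch applying equivalent options (separator "|", supplied column names, a skipped header row) is shown below; the skip count assumes the file content shown above and may need adjusting to the actual file.

```r
# Read data.csv with "|" as the separator, skip the original header row,
# and supply the column names explicitly (mirroring the options described above).
df <- read.delim("data.csv", sep = "|", skip = 1, header = FALSE,
                 col.names = c("col1", "col2"), stringsAsFactors = FALSE)
df
```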
This data description contains code (written in the R programming language), as well as processed data and results presented in a research article (see references). No raw data are provided and the data that are made available cannot be linked to study participants. The sample consists of 180 of 308 eligible participants (adult primary care patients in Sweden, living with chronic illness) who responded to a Swedish web-based questionnaire at two time points. Using a confirmatory factor analysis, we calculated latent factor scores for 9 constructs, based on 34 questionnaire items. In this dataset, we share the latent factor scores and the latent profile analysis results. Although raw data are not shared, we provide the questionnaire item, including response scales. The code that was used to produce the latent factor scores and latent profile analysis results is also provided.
The study was performed as part of a research project exploring how the use of eHealth services in chronic care influences interaction and collaboration between patients and healthcare. The purpose of the study was to identify subgroups of primary care patients who are similar with respect to their experiences of co-care, as measured by the DoCCA scale (von Thiele Schwarz, 2021). Baseline data were collected after patients had been introduced to an eHealth service that aimed to support them in their self-care and digital communication with healthcare; follow-up data were collected 7 months later. All patients were treated at the same primary care center, located in the Stockholm Region in Sweden.
Cited reference: von Thiele Schwarz U, Roczniewska M, Pukk Härenstam K, Karlgren K, Hasson H, Menczel S, Wannheden C. The work of having a chronic condition: Development and psychometric evaluation of the Distribution of Co-Care Activities (DoCCA) Scale. BMC Health Services Research (2021) 21:480. doi: 10.1186/s12913-021-06455-8
The DATASET consists of two files: factorscores_docca.csv and latent-profile-analysis-results_docca.csv.
factorscores_docca.csv: This file contains 18 variables (columns) and 180 cases (rows). The variables represent latent factors (measured at two time points, T1 and T2) and the values are latent factor scores. The questionnaire data that were used to produce the latent factor scores consist of 20 items that measure experiences of collaboration with healthcare, based on the DoCCA scale. These items were included in the latent profile analysis. Additionally, latent factor scores reflecting perceived self-efficacy in self-care (6 items), satisfaction with healthcare (2 items), self-rated health (2 items), and perceived impact of e-health (4 items) were calculated. These items were used to make comparisons between profiles resulting from the latent profile analysis. Variable definitions are provided in a separate file (see below).
latent-profile-analysis-results_docca.csv: This file contains 14 variables (columns) and 180 cases (rows). The variables represent profile classifications (numbers and labels) and posterior classification probabilities for each of the identified profiles, 4 profiles at T1 and 5 profiles at T2. Transition probabilities (from T1 to T2 profiles) were not calculated due to lacking configural similarity of profiles at T1 and T2; hence no transition probabilities are provided.
The ASSOCIATED DOCUMENTATION consists of one file with variable definitions in English and Swedish, and four script files (written in the R programming language):
variable-definitions_swe-eng.xlsx: This file consists of four sheets. Sheet 1 (scale-items_original_swedish) specifies the questionnaire items (in Swedish) that were used to calculate the latent factor scores; response scales are included. Sheet 2 (scale-items_translated_english) provides an English translation of the questionnaire items and response scales provided in Sheet 1. Sheet 3 (factorscores_docca) defines the variables in the factorscores_docca.csv dataset. Sheet 4 (latent-profile-analysis-results) defines the variables in the latent-profile-analysis-results_docca.csv dataset.
R-script_Step-0_Factor-scores.R: R script file with the code that was used to calculate the latent factor scores. This script can only be run with access to the raw data file which is not publicly shared due to ethical constraints. Hence, the purpose of the script file is code transparency. Also, the script shows the model specification that was used in the confirmatory factor analysis (CFA). Missingness in data was accounted for by using Full Information Maximum Likelihood (FIML).
R-script_Step-1_Latent-profile-analysis.R: R script file with the code that was used to run the latent profile analyses at T1 and T2 and produce profile plots. This code can be run with the provided dataset factorscores_docca.csv. Note that the script generates the results that are provided in the latent-profile-analysis-results_docca.csv dataset.
R-script_Step-2_Non-parametric-tests.R: R script file with the code that was used to run non-parametric tests for comparing exogenous variables between profiles at T1 and T2. This script uses the following datasets: factorscores_docca.csv and latent-profile-analysis-results_docca.csv.
R-script_Step-3_Class-transitions.R: R script file with the code that was used to create a sankey diagram for illustrating class transitions. This script uses the following dataset: latent-profile-analysis-results_docca.csv.
Software requirements: To run the code, the R software environment and R packages specified in the script files need to be installed (open source). The scripts were produced in R version 4.2.1.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Patients Table:
This table stores information about individual patients, including their names and contact details.
Doctors Table:
This table contains details about healthcare providers, including their names, specializations, and contact information.
Appointments Table:
This table records scheduled appointments, linking patients to doctors.
MedicalProcedure Table:
This table stores details about medical procedures associated with specific appointments.
Billing Table:
This table maintains records of billing transactions, associating them with specific patients.
demo Table:
This table appears to be a demonstration or testing table, possibly unrelated to the healthcare management system.
This dataset schema is designed to capture comprehensive information about patients, doctors, appointments, medical procedures, and billing transactions in a healthcare management system. Adjustments can be made based on specific requirements, and additional attributes can be included as needed.
The Heart Disease Data Set is provided by the Cleveland Clinic Foundation for Heart Disease. It is a CSV file with 303 rows. Each row contains information about a patient (a sample), and each column describes an attribute of the patient (a feature). We use the features to predict whether a patient has heart disease (binary classification).
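A minimal R sketch of the binary-classification task described above, assuming a hypothetical local copy named heart.csv with a hypothetical binary outcome column named target; the actual file and column names in the release may differ.

```r
heart <- read.csv("heart.csv")   # hypothetical local copy of the 303-row CSV

# Simple logistic-regression baseline: predict heart disease from all other columns
fit <- glm(target ~ ., data = heart, family = binomial())
summary(fit)

# Predicted probabilities of heart disease for the first few patients
head(predict(fit, type = "response"))
```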
This dataset is comprised of a collection of example DMPs from a wide array of fields; obtained from a number of different sources outlined below. Data included/extracted from the examples include the discipline and field of study, author, institutional affiliation and funding information, location, date created, title, research and data-type, description of project, link to the DMP, and where possible external links to related publications or grant pages. This CSV document serves as the content for a McMaster Data Management Plan (DMP) Database as part of the Research Data Management (RDM) Services website, located at https://u.mcmaster.ca/dmps. Other universities and organizations are encouraged to link to the DMP Database or use this dataset as the content for their own DMP Database. This dataset will be updated regularly to include new additions and will be versioned as such. We are gathering submissions at https://u.mcmaster.ca/submit-a-dmp to continue to expand the collection.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A dataset containing the deidentified patient and clinical factor data used in the associated simulation study. The CSV file contains patient data used to inform arrival distributions and historical summary statistics for model calibration, covering an eight-month period in 2019. The JSON file contains clinical factor weights for each general surgery procedure included in the study, with associated clinical factor sub-levels. The DOCX file contains supplementary material used in conjunction with the clinical factor selection forms for each procedure to provide examples of considerations that may be used for certain clinical criteria.
Data-specific information for: patientData.csv

Patient data used to inform arrival distributions and historical summary statistics for model calibration, covering 8 months in 2019.
Number of columns: 17
Number of rows: 846

Column headings:
- INDEX = numerical index
- ID = assigned patient ID for the purposes of this study
- HISTORICAL_ASSIGNED_CATEGORY = the urgency category the patient was historically treated under
- HISTORICAL_PLACED_ON_LIST_DATE_EPOCH = the number of days from an epoch that the patient was placed on the list historically
- HISTORICAL_OPERATION_DATE_EPOCH = the number of days from an epoch that the patient was admitted to surgery historically
- DAYS_WAITING = the difference between HISTORICAL_PLACED_ON_LIST_DATE_EPOCH and HISTORICAL_OPERATION_DATE_EPOCH in days
- PROCEDURE = the surgical procedure for which the patient was historically placed on the waiting list
- OR_TIME = the total operating theatre duration in minutes (sum of PROCEDURE_TIME and CHANGEOVER_TIME)
- PROCEDURE_TIME = the duration of the respective procedure in minutes
- CHANGEOVER_TIME = the duration of the required changeover in minutes
- C1 = clinical factor score 1 weight for the respective procedure and sub-level (from clinicalFactorWeights.json)
- C2 = clinical factor score 2 weight for the respective procedure and sub-level (from clinicalFactorWeights.json)
- C3 = clinical factor score 3 weight for the respective procedure and sub-level (from clinicalFactorWeights.json)
- C4 = clinical factor score 4 weight for the respective procedure and sub-level (from clinicalFactorWeights.json)
- C5 = clinical factor score 5 weight for the respective procedure and sub-level (from clinicalFactorWeights.json)
- FACTORS_SUM = the sum of C1..C5
- ASSIGNED_CATEGORY = the revised urgency category assigned to the patient upon clinician review of case files when collecting clinical factor data

Data-specific information for: clinicalFactorWeights.json

Clinical factor weights for each procedure and associated clinical factor sub-levels.
Number of variables: 10
General form of the data structure (CX = clinical factor X; CX_LY = clinical factor X, level Y; CX_LY_weight = weight of clinical factor X, level Y):

{
  "Procedure 1": {
    "C1": { "C1_L1": "C1_L1_weight", "C1_L2": "C1_L2_weight", "C1_L3": "C1_L3_weight" },
    "C2": { "C2_L1": "C2_L1_weight", "C2_L2": "C2_L2_weight", "C2_L3": "C2_L3_weight" },
    "C3": { "C3_L1": "C3_L1_weight", "C3_L2": "C3_L2_weight", "C3_L3": "C3_L3_weight" },
    "C4": { "C4_L1": "C4_L1_weight", "C4_L2": "C4_L2_weight", "C4_L3": "C4_L3_weight" },
    "C5": { "C5_L1": "C5_L1_weight", "C5_L2": "C5_L2_weight", "C5_L3": "C5_L3_weight" }
  }
}

Data-specific information for: Clinical Factor Selection Form Supplementary Information and Definitions.docx

Supplementary material provided to be used in conjunction with the clinical factor selection forms for each procedure, to provide examples of considerations that may be used for specific clinical criteria.
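A minimal R sketch, using only the column headings listed above, that recomputes the derived fields in patientData.csv and checks them against the stored values:

```r
pat <- read.csv("patientData.csv")

days_waiting <- pat$HISTORICAL_OPERATION_DATE_EPOCH - pat$HISTORICAL_PLACED_ON_LIST_DATE_EPOCH
or_time      <- pat$PROCEDURE_TIME + pat$CHANGEOVER_TIME
factors_sum  <- unname(rowSums(pat[, c("C1", "C2", "C3", "C4", "C5")]))

# Each comparison should return TRUE if the derived columns are internally consistent
all.equal(days_waiting, pat$DAYS_WAITING)
all.equal(or_time,      pat$OR_TIME)
all.equal(factors_sum,  pat$FACTORS_SUM)
```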
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These four labeled data sets are targeted at ordinal quantification. The goal of quantification is not to predict the label of each individual instance, but the distribution of labels in unlabeled sets of data.
With the scripts provided, you can extract CSV files from the UCI machine learning repository and from OpenML. The ordinal class labels stem from a binning of a continuous regression label.
We complement this data set with the indices of data items that appear in each sample of our evaluation. Hence, you can precisely replicate our samples by drawing the specified data items. The indices stem from two evaluation protocols that are well suited for ordinal quantification. To this end, each row in the files app_val_indices.csv, app_tst_indices.csv, app-oq_val_indices.csv, and app-oq_tst_indices.csv represents one sample.
Our first protocol is the artificial prevalence protocol (APP), where all possible distributions of labels are drawn with an equal probability. The second protocol, APP-OQ, is a variant thereof, where only the smoothest 20% of all APP samples are considered. This variant is targeted at ordinal quantification tasks, where classes are ordered and a similarity of neighboring classes can be assumed.
Usage
You can extract four CSV files through the provided script extract-oq.jl, which is conveniently wrapped in a Makefile. The Project.toml and Manifest.toml specify the Julia package dependencies, similar to a requirements file in Python.
Preliminaries: You have to have a working Julia installation. We have used Julia v1.6.5 in our experiments.
Data Extraction: In your terminal, you can call either
make
(recommended), or
julia --project="." --eval "using Pkg; Pkg.instantiate()"
julia --project="." extract-oq.jl
Outcome: The first row in each CSV file is the header. The first column, named "class_label", is the ordinal class.
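A minimal R sketch of how the index files can be used to reconstruct one evaluation sample, assuming a hypothetical extracted data file data.csv with the class_label column described above, and assuming the index files have no header row; whether the stored indices are zero- or one-based should be checked against the extraction scripts.

```r
data    <- read.csv("data.csv")                        # one of the extracted CSV files
indices <- read.csv("app_val_indices.csv", header = FALSE)

# Each row of the index file lists the data items of one APP validation sample.
sample_1 <- data[unlist(indices[1, ]), ]               # add 1 to the indices if they are zero-based

# Label distribution of that sample: the quantity a quantifier must estimate
prop.table(table(sample_1$class_label))
```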
Further Reading
Implementation of our experiments: https://github.com/mirkobunse/regularized-oq
This data release provides data in support of an assessment of water quality and discharge in the Herring River at the Chequessett Neck Road dike in Wellfleet, Massachusetts, from November 2015 to September 2017. The assessment was a cooperative project among the U.S. Geological Survey, National Park Service, Cape Cod National Seashore, and the Friends of Herring River to characterize environmental conditions prior to a future removal of the dike. It is described in the U.S. Geological Survey (USGS) Scientific Investigations Report "Assessment of Water Quality and Discharge in the Herring River, Wellfleet, Massachusetts, November 2015 – September 2017."

This data release is structured as a set of comma-separated values (CSV) files, each of which contains information on data source (or laboratory used for analysis), USGS site identification (ID) number, beginning date and time of observation or sampling, ending date and time of observation or sampling, and data such as flow rate and analytical results.

The CSV files include calculated tidal daily flows (Flood_Tide_Tidal_Day.csv and Ebb_Tide_Tidal_Day.csv) that were used in Huntington and others (2020) for estimation of nutrient loads. Tidal daily flows are the estimated mean daily discharges for two consecutive flood and ebb tide cycles (average duration: 24 hours, 48 minutes). The associated date is the day on which most of the flow occurred.

CSV files contain quality assurance data for water-quality samples, including blanks (Blanks.csv), replicates (Replicates.csv), standard reference materials (Standard_Reference_Material.csv), and atmospheric ammonium contamination (NH4_Atmospheric_Contamination.csv). One CSV file (EWI_vs_ISCO.csv) contains data comparing composite samples collected by an automatic sampler (ISCO) at a fixed point with depth-integrated samples collected at equal width increments (EWI). One CSV file (Cross_Section_Field_Parameters.csv) contains field parameter data (specific conductance, temperature, pH, and dissolved oxygen) collected at a fixed location and data collected along the cross sections at variable water depths and horizontal distances across the openings of the culverts at the Chequessett Neck Road dike. One CSV file (LOADEST_Bias_Statistics.csv) contains data that include estimated natural log of load, model residuals, Z-scores, and seasonal model residuals for winter (December, January, and February); spring (March, April, and May); summer (June, July, and August); and fall (September, October, and November).

The data release also includes a data dictionary (Data_Dictionary.csv) that provides detailed descriptions of each field in each CSV file, including: data filename; laboratory or data source; U.S. Geological Survey site ID numbers; data types; constituent (analyte) U.S. Geological Survey parameter codes; descriptions of parameters; units; methods; minimum reporting limits; limits of quantitation, if appropriate; method reference citations; and minimum, maximum, median, and average values for each analyte. An abbreviations file (Abbreviations.pdf) defines all the abbreviations in the data dictionary and CSV files.

Note that the USGS site ID includes a leading zero (011058798) and some of the parameter codes contain leading zeros, so care must be taken in opening and subsequently saving these files in other formats where leading zeros may be dropped.
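Because the site ID and some parameter codes carry leading zeros, a minimal R sketch of reading one of the files (Flood_Tide_Tidal_Day.csv) with all columns forced to character is shown below; numeric columns can be converted afterwards once the identifier columns are safe, and column names should be checked against Data_Dictionary.csv.

```r
# Read every column as character so leading zeros in the USGS site ID and
# parameter codes are preserved (e.g. "011058798").
flows <- read.csv("Flood_Tide_Tidal_Day.csv", colClasses = "character")
str(flows)

# Convert measured values back to numeric as needed, e.g.:
# flows$flow <- as.numeric(flows$flow)   # hypothetical column name
```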
Raw Data in .csv format for use with the R data wrangling scripts.
https://doi.org/10.23668/psycharchives.4988
The aim of this research was to examine the concurrent relationships between positive and negative experiences of German adults, as well as their changes over three weeks in 2020. Owing to German federalism, we expected these changes to differ between German states. Research data set: Sample 2.
Various population statistics, including structured demographics data.