100+ datasets found
  1. Data from: Generalizable EHR-R-REDCap pipeline for a national multi-institutional rare tumor patient registry

    • data.niaid.nih.gov
    • explore.openaire.eu
    • +2 more
    zip
    Updated Jan 9, 2022
    Cite
    Sophia Shalhout; Farees Saqlain; Kayla Wright; Oladayo Akinyemi; David Miller (2022). Generalizable EHR-R-REDCap pipeline for a national multi-institutional rare tumor patient registry [Dataset]. http://doi.org/10.5061/dryad.rjdfn2zcm
    Explore at:
    zip (available download formats)
    Dataset updated
    Jan 9, 2022
    Dataset provided by
    Harvard Medical School
    Massachusetts General Hospital
    Authors
    Sophia Shalhout; Farees Saqlain; Kayla Wright; Oladayo Akinyemi; David Miller
    License

    CC0 1.0: https://spdx.org/licenses/CC0-1.0.html

    Description

    Objective: To develop a clinical informatics pipeline designed to capture large-scale structured EHR data for a national patient registry.

    Materials and Methods: The EHR-R-REDCap pipeline is implemented using R-statistical software to remap and import structured EHR data into the REDCap-based multi-institutional Merkel Cell Carcinoma (MCC) Patient Registry using an adaptable data dictionary.

    Results: Clinical laboratory data were extracted from EPIC Clarity across several participating institutions. Labs were transformed, remapped and imported into the MCC registry using the EHR labs abstraction (eLAB) pipeline. Forty-nine clinical tests encompassing 482,450 results were imported into the registry for 1,109 enrolled MCC patients. Data-quality assessment revealed highly accurate, valid labs. Univariate modeling was performed for labs at baseline on overall survival (N=176) using this clinical informatics pipeline.

    Conclusion: We demonstrate feasibility of the facile eLAB workflow. EHR data are successfully transformed and bulk-loaded/imported into a REDCap-based national registry, enabling real-world data analysis and interoperability.

    Methods: eLAB Development and Source Code (R statistical software)

    eLAB is written in R (version 4.0.3) and uses the following packages for processing: DescTools, REDCapR, reshape2, splitstackshape, readxl, survival, survminer, and tidyverse. Source code for eLAB can be downloaded directly from https://github.com/TheMillerLab/eLAB.

    eLAB reformats EHR data abstracted for an identified population of patients (e.g., a medical record number (MRN)/name list) under an Institutional Review Board (IRB)-approved protocol. The MCCPR does not host MRNs/names; eLAB converts these to MCCPR-assigned record identification numbers (record_id) before import, for de-identification.

    Functions were written to remap EHR bulk lab data pulls/queries from several sources, including Clarity/Crystal reports or an institutional enterprise data warehouse (EDW) such as the Research Patient Data Registry (RPDR) at MGB. The input, a csv/delimited file of labs for user-defined patients, may vary. Thus, users may need to adapt the initial data wrangling script based on the data input format. However, the downstream transformation, code-lab lookup tables, outcomes analysis, and LOINC remapping are standard for use with the provided REDCap Data Dictionary, DataDictionary_eLAB.csv. The available R markdown (https://github.com/TheMillerLab/eLAB) provides suggestions and instructions on where or when upfront script modifications may be necessary to accommodate input variability.

    The eLAB pipeline takes several inputs. For example, the input for the ‘ehr_format(dt)’ single-line command is non-tabular data assigned to the R object ‘dt’ with 4 columns: 1) Patient Name (MRN), 2) Collection Date, 3) Collection Time, and 4) Lab Results, wherein several lab panels are in one data frame cell. A mock dataset in this ‘untidy’ format is provided for demonstration purposes (https://github.com/TheMillerLab/eLAB).

    Bulk lab data pulls often result in subtypes of the same lab. For example, potassium labs are reported as “Potassium,” “Potassium-External,” “Potassium(POC),” “Potassium,whole-bld,” “Potassium-Level-External,” “Potassium,venous,” and “Potassium-whole-bld/plasma.” eLAB utilizes a key-value lookup table with ~300 lab subtypes for remapping labs to the Data Dictionary (DD) code. eLAB reformats/accepts only those lab units pre-defined by the registry DD. The lab lookup table is provided for direct use or may be re-configured/updated to meet end-user specifications. eLAB is designed to remap, transform, and filter/adjust value units of semi-structured/structured bulk laboratory values data pulls from the EHR to align with the pre-defined code of the DD.
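    As a concrete illustration of this remapping step, here is a minimal R sketch (not eLAB's actual source); the DD code "k" and the column names are hypothetical:

    # Toy bulk pull with potassium subtypes and one lab the DD does not define
    labs <- data.frame(
      lab_name = c("Potassium", "Potassium(POC)", "Potassium,venous", "SomeUnmappedLab"),
      value    = c(4.1, 3.9, 4.4, 12.0)
    )

    # Key-value lookup table: EHR lab subtype -> DD code (hypothetical code "k")
    lab_lookup <- c(
      "Potassium"                  = "k",
      "Potassium-External"         = "k",
      "Potassium(POC)"             = "k",
      "Potassium,whole-bld"        = "k",
      "Potassium-Level-External"   = "k",
      "Potassium,venous"           = "k",
      "Potassium-whole-bld/plasma" = "k"
    )

    labs$dd_code <- unname(lab_lookup[labs$lab_name])  # remap subtype to DD code
    labs <- labs[!is.na(labs$dd_code), ]               # keep only DD-defined labs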

    Data Dictionary (DD)

    EHR clinical laboratory data are captured in REDCap using the ‘Labs’ repeating instrument (Supplemental Figures 1-2). The DD is provided for use by researchers at REDCap-participating institutions and is optimized to accommodate the same lab type captured more than once on the same day for the same patient. The instrument captures 35 clinical lab types. The DD serves several major purposes in the eLAB pipeline. First, it defines every lab type of interest and its associated lab unit of interest with a set field/variable name. It also restricts/defines the type of data allowed for entry in each data field, such as string or numeric. The DD is uploaded into REDCap by every participating site/collaborator and ensures each site collects and codes the data the same way. Automation pipelines, such as eLAB, are designed to remap/clean and reformat data/units utilizing key-value look-up tables that filter and select only the labs/units of interest. eLAB ensures the data pulled from the EHR contain the correct unit and format pre-configured by the DD. The use of the same DD at every participating site ensures that the data field code, format, and relationships in the database are uniform across sites, allowing for simple aggregation of the multi-site data. For example, since every site in the MCCPR uses the same DD, aggregation is efficient and the different site csv files are simply combined, as sketched below.
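    Because every site exports against the same DD, that aggregation step can be as small as the following R sketch (file names are illustrative):

    site_files <- c("site_A_labs.csv", "site_B_labs.csv")
    mcc_labs <- do.call(rbind, lapply(site_files, read.csv))  # identical columns, so rbind suffices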

    Study Cohort

    This study was approved by the MGB IRB. A search of the EHR was performed to identify patients diagnosed with MCC between 1975 and 2021 (N=1,109) for inclusion in the MCCPR. Subjects diagnosed with primary cutaneous MCC between 2016 and 2019 (N=176) were included in the test cohort for exploratory studies of lab result associations with overall survival (OS) using eLAB.

    Statistical Analysis

    OS is defined as the time from the date of MCC diagnosis to the date of death. Data were censored at the date of the last follow-up visit if no death event occurred. Univariable Cox proportional hazards modeling was performed for all lab predictors. Due to the hypothesis-generating nature of the work, p-values were exploratory and Bonferroni corrections were not applied.
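    A hedged sketch of that univariable modeling with the survival package named in the Methods; the toy cohort and variable names below are illustrative, not study data:

    library(survival)

    cohort <- data.frame(
      os_months    = c(12, 30, 7, 45, 22),       # time from diagnosis to death or censoring
      death_event  = c(1, 0, 1, 0, 1),           # 1 = death, 0 = censored at last follow-up
      baseline_lab = c(4.1, 3.9, 5.2, 4.0, 4.8)  # one baseline lab predictor
    )

    fit <- coxph(Surv(os_months, death_event) ~ baseline_lab, data = cohort)
    summary(fit)  # hazard ratio and exploratory (uncorrected) p-value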

  2. Sample Graph Datasets in CSV Format

    • zenodo.org
    csv
    Updated Dec 9, 2024
    Cite
    Edwin Carreño (2024). Sample Graph Datasets in CSV Format [Dataset]. http://doi.org/10.5281/zenodo.14335015
    Explore at:
    csv (available download formats)
    Dataset updated
    Dec 9, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Edwin Carreño
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Sample Graph Datasets in CSV Format

    Note: none of the datasets published here contain actual data; they are for testing purposes only.

    Description

    This data repository contains graph datasets, where each graph is represented by two CSV files: one for node information and another for edge details. To link the files to the same graph, their names include a common identifier based on the number of nodes (a loading sketch follows this list). For example:

    • dataset_30_nodes_interactions.csv: contains 30 rows (nodes).
    • dataset_30_edges_interactions.csv: contains 47 rows (edges).
    • the common identifier dataset_30 refers to the same graph.
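    A pair of files can be loaded by their common identifier; a minimal R sketch, assuming the naming pattern above:

    graph_id <- "dataset_30"
    nodes <- read.csv(paste0(graph_id, "_nodes_interactions.csv"))
    edges <- read.csv(paste0(graph_id, "_edges_interactions.csv"))
    nrow(nodes)  # expected: 30
    nrow(edges)  # expected: 47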

    CSV nodes

    Each dataset contains the following columns:

    Name of the Column | Type | Description
    UniProt ID | string | protein identification
    label | string | protein label (type of node)
    properties | string | a dictionary containing properties related to the protein

    CSV edges

    Each dataset contains the following columns:

    Name of the Column | Type | Description
    Relationship ID | string | relationship identification
    Source ID | string | identification of the source protein in the relationship
    Target ID | string | identification of the target protein in the relationship
    label | string | relationship label (type of relationship)
    properties | string | a dictionary containing properties related to the relationship

    Metadata

    Graph | Number of Nodes | Number of Edges | Sparse graph
    dataset_30* | 30 | 47 | Y
    dataset_60* | 60 | 181 | Y
    dataset_120* | 120 | 689 | Y
    dataset_240* | 240 | 2819 | Y
    dataset_300* | 300 | 4658 | Y
    dataset_600* | 600 | 18004 | Y
    dataset_1200* | 1200 | 71785 | Y
    dataset_2400* | 2400 | 288600 | Y
    dataset_3000* | 3000 | 449727 | Y
    dataset_6000* | 6000 | 1799413 | Y
    dataset_12000* | 12000 | 7199863 | Y
    dataset_24000* | 24000 | 28792361 | Y
    dataset_30000* | 30000 | 44991744 | Y

    This repository includes two additional tiny graph datasets for experimenting with before dealing with the larger datasets.

    CSV nodes (tiny graphs)

    Each dataset contains the following columns:

    Name of the Column | Type | Description
    ID | string | node identification
    label | string | node label (type of node)
    properties | string | a dictionary containing properties related to the node

    CSV edges (tiny graphs)

    Each dataset contains the following columns:

    Name of the Column | Type | Description
    ID | string | relationship identification
    source | string | identification of the source node in the relationship
    target | string | identification of the target node in the relationship
    label | string | relationship label (type of relationship)
    properties | string | a dictionary containing properties related to the relationship

    Metadata (tiny graphs)

    Graph | Number of Nodes | Number of Edges | Sparse graph
    dataset_dummy* | 3 | 6 | N
    dataset_dummy2* | 3 | 6 | N
  3. GitTables 1M - CSV files

    • zenodo.org
    zip
    Updated Jun 6, 2022
    Cite
    Madelon Hulsebos; Çağatay Demiralp; Paul Groth (2022). GitTables 1M - CSV files [Dataset]. http://doi.org/10.5281/zenodo.6515973
    Explore at:
    zip (available download formats)
    Dataset updated
    Jun 6, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Madelon Hulsebos; Çağatay Demiralp; Paul Groth
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset contains >800K CSV files behind the GitTables 1M corpus.

    For more information about the GitTables corpus, visit:

    - our website for GitTables, or

    - the main GitTables download page on Zenodo.

  4. Simulated Data for Patient Time Series Record Linkage

    • figshare.com
    zip
    Updated Jun 3, 2023
    Cite
    Ahmed Soliman (2023). Simulated Data for Patient Time Series Record Linkage [Dataset]. http://doi.org/10.6084/m9.figshare.19224786.v1
    Explore at:
    zip (available download formats)
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Ahmed Soliman
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This simulated dataset comprises two files (after decompression): sim_ergo_1600.csv and sim_pat_1600.csv.

    1. ergo.csv contains heart rate time-series data for 1600 patients' ergometric tests. For each patient, 20 different ergometric tests were simulated. Each row in this file has three field values: Ergo_ID, Heart Rate (BPM), and timestamp.
    2. pat.csv contains only four sample readings from each of a patient's 20 ergometric tests. Each row contains three values: patient_ID, Heart Rate, and timestamp.

    The goal is to link patients (identified by patient_ID in pat.csv) to their corresponding ergometric tests (identified by Ergo_ID in ergo.csv), based solely on matching the timestamp-value pairs from both files. This time-series record linkage task is efficiently accomplished by the proposed tslink2 algorithm. tslink2 is implemented in C++ and is publicly available at https://github.com/ahmsoliman/tslink2

    Data are simulated such that correctly linked/matched identifiers satisfy:

    |Ergo_ID - patient_ID| mod 104 == 0

    This formula is useful in evaluating the linkage algorithm's performance.
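    The rule above can be checked in a few lines of R; a minimal sketch, with illustrative candidate pairs rather than the dataset's real IDs:

    # A proposed (patient_ID, Ergo_ID) pair is a correct link iff
    # |Ergo_ID - patient_ID| mod 104 == 0.
    is_correct_link <- function(patient_id, ergo_id) {
      abs(ergo_id - patient_id) %% 104 == 0
    }

    patient_ids <- c(1, 2, 3)     # illustrative candidate links
    ergo_ids    <- c(105, 2, 55)
    mean(is_correct_link(patient_ids, ergo_ids))  # fraction of correct links (here 2/3)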

  5. CSV file used in statistical analyses

    • data.csiro.au
    • researchdata.edu.au
    • +1 more
    Updated Oct 13, 2014
    + more versions
    Cite
    CSIRO (2014). CSV file used in statistical analyses [Dataset]. http://doi.org/10.4225/08/543B4B4CA92E6
    Explore at:
    Dataset updated
    Oct 13, 2014
    Dataset authored and provided by
    CSIRO (http://www.csiro.au/)
    License

    CSIRO Data Licence: https://research.csiro.au/dap/licences/csiro-data-licence/

    Time period covered
    Mar 14, 2008 - Jun 9, 2009
    Dataset funded by
    CSIRO (http://www.csiro.au/)
    Description

    A csv file containing the tidal frequencies used for statistical analyses in the paper "Estimating Freshwater Flows From Tidally-Affected Hydrographic Data" by Dan Pagendam and Don Percival.

  6. 🔍 Diverse CSV Dataset Samples

    • kaggle.com
    Updated Nov 6, 2023
    Cite
    Samy Baladram (2023). 🔍 Diverse CSV Dataset Samples [Dataset]. https://www.kaggle.com/datasets/samybaladram/multidisciplinary-csv-datasets-collection/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 6, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Samy Baladram
    License

    GNU LGPL v3.0: http://www.gnu.org/licenses/lgpl-3.0.html

    Description


    Overview

    The dataset provided here is a rich compilation of various data files gathered to support diverse analytical challenges and education in data science. It is especially curated to provide researchers, data enthusiasts, and students with real-world data across different domains, including biostatistics, travel, real estate, sports, media viewership, and more.

    Files

    Below is a brief overview of what each CSV file contains:

    • Addresses: Practical examples of string manipulation and address data formatting in CSV.
    • Air Travel: Historical dataset suitable for analyzing trends in air travel over a period of three years.
    • Biostats: A dataset of office workers' biometrics, ideal for introductory statistics and biology.
    • Cities: Geographic and administrative data for urban analysis or socio-demographic studies.
    • Car Crashes in Catalonia: Weekly traffic accident data from Catalonia, providing a base for public policy research.
    • De Niro's Film Ratings: Analyze trends in film ratings over time with this entertainment-focused dataset.
    • Ford Escort Sales: Pre-owned vehicle sales data, perfect for regression analysis or price prediction models.
    • Old Faithful Geyser: Geological data for pattern recognition and prediction in natural phenomena.
    • Freshman Year Weights and BMIs: Dataset depicting weight and BMI changes for health and lifestyle studies.
    • Grades: Education performance data which can be correlated with demographics or study patterns.
    • Home Sales: A dataset reflecting housing market dynamics, useful for economic analysis or real estate appraisal.
    • Hooke's Law Demonstration: Physics data illustrating the classic principle of elasticity in springs.
    • Hurricanes and Storm Data: Climate data on hurricane and storm frequency for environmental risk assessments.
    • Height and Weight Measurements: Public health research dataset on anthropometric data.
    • Lead Shot Specs: Detailed engineering data for material sciences and manufacturing studies.
    • Alphabet Letter Frequency: Text analysis dataset for frequency distribution studies in large text samples.
    • MLB Player Statistics: Comprehensive athletic data set for analysis of performance metrics in sports.
    • MLB Teams' Seasonal Performance: A dataset combining financial and sports performance data from the 2012 MLB season.
    • TV News Viewership: Media consumption data which can be used to analyze viewing patterns and trends.
    • Historical Nile Flood Data: A unique environmental dataset for historical trend analysis in flood levels.
    • Oscar Winner Ages: A dataset to explore age trends among Oscar-winning actors and actresses.
    • Snakes and Ladders Statistics: Data from game outcomes useful in studying probability and game theory.
    • Tallahassee Cab Fares: Price modeling data from the real-world pricing of taxi services.
    • Taxable Goods Data: A snapshot of economic data concerning taxation impact on prices.
    • Tree Measurements: Ecological and environmental science data related to tree growth and forest management.
    • Real Estate Prices from Zillow: Market analysis dataset for those interested in housing price determinants.

    Format

    The enclosed data respect the comma-separated values (CSV) file format standards, ensuring compatibility with most data processing libraries in Python, R, and other languages. The datasets are ready for import into Jupyter notebooks, RStudio, or any other integrated development environment (IDE) used for data science.

    Quality Assurance

    The data is pre-checked for common issues such as missing values, duplicate records, and inconsistent entries, offering a clean and reliable dataset for various analytical exercises. With initial header lines in some CSV files, users can easily identify dataset fields and start their analysis without additional header cleaning.

    Acknowledgements

    The dataset adheres to the GNU LGPL license, making it freely available for modification and distribution, provided that the original source is cited. This opens up possibilities for educators to integrate real-world data into curricula, researchers to validate models against diverse datasets, and practitioners to refine their analytical skills with hands-on data.

    This dataset has been compiled from https://people.sc.fsu.edu/~jburkardt/data/csv/csv.html, with gratitude to the authors and maintainers for their dedication to providing open data resources for educational and research purposes.

  7. EHR Dataset for Patient Treatment Classification

    • data.mendeley.com
    • paperswithcode.com
    Updated May 10, 2020
    Cite
    Mujiono Sadikin (2020). EHR Dataset for Patient Treatment Classification [Dataset]. http://doi.org/10.17632/7kv3rctx7m.1
    Explore at:
    Dataset updated
    May 10, 2020
    Authors
    Mujiono Sadikin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset is an Electronic Health Record collection from a private hospital in Indonesia. It contains patients' laboratory test results, which are used to determine the next patient treatment: in-care or out-care. The task embedded in the dataset is classification prediction.

  8. Real Oesophageal Cancer Datasets

    • kaggle.com
    Updated Aug 28, 2021
    Cite
    AM (2021). Real Oesophageal Cancer Datasets [Dataset]. https://www.kaggle.com/amandam1/oesophageal-cancer-patient-data/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 28, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    AM
    Description

    Context and Content

    The first dataset (Oesophageal Cancer Clinical.csv) has clinical data of Oesophageal Carcinoma patients.

    The second dataset (Oesophageal Cancer Protein.csv) has the protein expression data for the same set of patients.

    The two datasets contain information on the same patients. However, the clinical dataset contains a greater number of patient records than there are corresponding protein expression data in the second dataset. The clinical dataset has patient_barcode as the unique identifier, whereas the protein expression dataset uses Sample_ID. In both datasets, the patient_barcode can be derived as "TCGA" + "-" + tissue_source_site + "-" + patient_id, e.g. TCGA-2H-A9GF.
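    In R, a barcode of this form can be assembled as a join key; a minimal sketch, assuming the components are available as separate fields:

    tissue_source_site <- "2H"
    patient_id         <- "A9GF"
    patient_barcode <- paste("TCGA", tissue_source_site, patient_id, sep = "-")
    patient_barcode  # "TCGA-2H-A9GF"; usable to match clinical and protein records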

    Inspiration

    There is a huge number of columns in these datasets (83 clinical and 223 protein), giving you great scope to derive interesting findings about oesophageal cancer from this real dataset.

    This dataset can be used to analyse genes, mutations and interesting findings relating to this type of cancer.

  9. CarePrecise Authoritative Hospital Database (AHD)

    • datarade.ai
    .csv, .xls
    Updated Aug 27, 2021
    Cite
    CarePrecise (2021). CarePrecise Authoritative Hospital Database (AHD) [Dataset]. https://datarade.ai/data-products/careprecise-authoritative-hospital-database-ahd-careprecise
    Explore at:
    .csv, .xls (available download formats)
    Dataset updated
    Aug 27, 2021
    Dataset authored and provided by
    CarePrecise
    Area covered
    United States of America
    Description

    [IMPORTANT NOTE: Sample file posted on Datarade is not the complete dataset, as Datarade permits only a single CSV file. Visit https://www.careprecise.com/healthcare-provider-data-sample.htm for more complete samples.] Updated every month, the AHD was developed by CarePrecise to provide a comprehensive database of U.S. hospital information. Extracted from the CarePrecise master provider database, with information on all of the 6.3 million HIPAA-covered US healthcare providers, and from additional sources, the Authoritative Hospital Database (AHD) contains records for all HIPAA-covered hospitals. In this database of hospitals we include bed counts, patient satisfaction data, hospital system ownership, hospital charges and cases by Zip Code®, and more. Most records include a cabinet-level or director-level contact. A PlaceKey is provided where available.

    The AHD includes bed counts for 95% of hospitals, full contact information on 85%, and fax numbers for 62%. We include detailed patient satisfaction data, employee counts, and medical procedure volumes.

    The AHD integrates directly with our extended provider data product to bring you the physicians and practice groups affiliated with the hospitals. This combination of data is the only commercially available hospital dataset of this depth.

    NEW: Hospital NPI to CCN Rollup A CarePrecise Exclusive. Using advanced record-linkage technology, the AHD now includes a new file that makes it possible to mine the vast hospital information available in the National Provider Identifier registry database. Hospitals may have dozens of NPI records, each with its own information about a unit, listing facility type and/or medical specialties practiced, as well as separate contact names. To wield the power of this new feature, you'll need the CarePrecise Master Bundle, which contains all of the publicly available NPI registry data. These data are available in other CarePrecise data products.

    Counts are approximate due to ongoing updates. Please review the current AHD information here: https://www.careprecise.com/detail_authoritative_hospital_database.htm

    The AHD is sold as-is and no warranty is offered regarding accuracy, timeliness, completeness, or fitness for any purpose.

  10. doc-formats-csv-3

    • huggingface.co
    Updated Nov 23, 2023
    Cite
    Datasets examples (2023). doc-formats-csv-3 [Dataset]. https://huggingface.co/datasets/datasets-examples/doc-formats-csv-3
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 23, 2023
    Dataset authored and provided by
    Datasets examples
    Description

    [doc] formats - csv - 3

    This dataset contains one csv file at the root:

    data.csv

    ignored comment
    col1|col2
    dog|woof
    cat|meow
    pokemon|pika
    human|hello

    We define the config name in the YAML config, as well as the exact location of the file, the separator ("|"), the names of the columns, and the number of rows to ignore (row #1 is a row of column headers that will be replaced by the names option, and row #0 is ignored). The reference for the options is the documentation… See the full description on the dataset page: https://huggingface.co/datasets/datasets-examples/doc-formats-csv-3.
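    Outside the Hub's own loader, a file laid out like this can be read with base R; a minimal sketch in which the column names stand in for the card's names option:

    df <- read.delim(
      "data.csv",
      sep       = "|",
      skip      = 2,                    # drop the ignored comment row and the original header row
      header    = FALSE,
      col.names = c("animal", "sound")  # stand-ins for the names option
    )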

  11. Data and code for: Better self-care through co-care? A latent profile analysis of primary care patients’ experiences of e-health-supported chronic care management

    • researchdata.se
    • demo.researchdata.se
    Updated Aug 19, 2024
    Cite
    Carolina Wannheden; Marta Roczniewska; Henna Hasson; Klas Karlgren; Ulrica von Thiele Schwarz (2024). Data and code for: Better self-care through co-care? A latent profile analysis of primary care patients’ experiences of e-health-supported chronic care management [Dataset]. http://doi.org/10.48723/kzja-5k21
    Explore at:
    (19302), (61550), (4737), (3482), (12552), (38887), (3082) (available download formats)
    Dataset updated
    Aug 19, 2024
    Dataset provided by
    Karolinska Institutet
    Authors
    Carolina Wannheden; Marta Roczniewska; Henna Hasson; Klas Karlgren; Ulrica von Thiele Schwarz
    Time period covered
    Oct 2018 - Jun 2019
    Area covered
    Stockholm County
    Description

    This data publication contains code (written in the R programming language), as well as processed data and results presented in a research article (see references). No raw data are provided, and the data that are made available cannot be linked to study participants. The sample consists of 180 of 308 eligible participants (adult primary care patients in Sweden, living with chronic illness) who responded to a Swedish web-based questionnaire at two time points. Using a confirmatory factor analysis, we calculated latent factor scores for 9 constructs, based on 34 questionnaire items. In this dataset, we share the latent factor scores and the latent profile analysis results. Although raw data are not shared, we provide the questionnaire items, including response scales. The code that was used to produce the latent factor scores and latent profile analysis results is also provided.

    The study was performed as part of a research project exploring how the use of eHealth services in chronic care influences interaction and collaboration between patients and healthcare providers. The purpose of the study was to identify subgroups of primary care patients who are similar with respect to their experiences of co-care, as measured by the DoCCA scale (von Thiele Schwarz, 2021). Baseline data were collected after patients had been introduced to an eHealth service that aimed to support them in their self-care and digital communication with healthcare; follow-up data were collected 7 months later. All patients were treated at the same primary care center, located in the Stockholm Region in Sweden.

    Cited reference: von Thiele Schwarz U, Roczniewska M, Pukk Härenstam K, Karlgren K, Hasson H, Menczel S, Wannheden C. The work of having a chronic condition: Development and psychometric evaluation of the Distribution of Co-Care Activities (DoCCA) Scale. BMC Health Services Research (2021) 21:480. doi: 10.1186/s12913-021-06455-8

    The DATASET consists of two files: factorscores_docca.csv and latent-profile-analysis-results_docca.csv.

    • factorscores_docca.csv: This file contains 18 variables (columns) and 180 cases (rows). The variables represent latent factors (measured at two time points, T1 and T2) and the values are latent factor scores. The questionnaire data that were used to produce the latent factor scores consist of 20 items that measure experiences of collaboration with healthcare, based on the DoCCA scale. These items were included in the latent profile analysis. Additionally, latent factor scores reflecting perceived self-efficacy in self-care (6 items), satisfaction with healthcare (2 items), self-rated health (2 items), and perceived impact of e-health (4 items) were calculated. These items were used to make comparisons between profiles resulting from the latent profile analysis. Variable definitions are provided in a separate file (see below).

    • latent-profile-analysis-results_docca.csv: This file contains 14 variables (columns) and 180 cases (rows). The variables represent profile classifications (numbers and labels) and posterior classification probabilities for each of the identified profiles, 4 profiles at T1 and 5 profiles at T2. Transition probabilities (from T1 to T2 profiles) were not calculated due to lacking configural similarity of profiles at T1 and T2; hence no transition probabilities are provided.

    The ASSOCIATED DOCUMENTATION consists of one file with variable definitions in English and Swedish, and four script files (written in the R programming language):

    • variable-definitions_swe-eng.xlsx: This file consists of four sheets. Sheet 1 (scale-items_original_swedish) specifies the questionnaire items (in Swedish) that were used to calculate the latent factor scores; response scales are included. Sheet 2 (scale-items_translated_english) provides an English translation of the questionnaire items and response scales provided in Sheet 1. Sheet 3 (factorscores_docca) defines the variables in the factorscores_docca.csv dataset. Sheet 4 (latent-profile-analysis-results) defines the variables in the latent-profile-analysis-results_docca.csv dataset.

    • R-script_Step-0_Factor-scores.R: R script file with the code that was used to calculate the latent factor scores. This script can only be run with access to the raw data file which is not publicly shared due to ethical constraints. Hence, the purpose of the script file is code transparency. Also, the script shows the model specification that was used in the confirmatory factor analysis (CFA). Missingness in data was accounted for by using Full Information Maximum Likelihood (FIML).

    • R-script_Step-1_Latent-profile-analysis.R: R script file with the code that was used to run the latent profile analyses at T1 and T2 and produce profile plots. This code can be run with the provided dataset factorscores_docca.csv. Note that the script generates the results that are provided in the latent-profile-analysis-results_docca.csv dataset.

    • R-script_Step-2_Non-parametric-tests.R: R script file with the code that was used to run non-parametric tests for comparing exogenous variables between profiles at T1 and T2. This script uses the following datasets: factorscores_docca.csv and latent-profile-analysis-results_docca.csv.

    • R-script_Step-3_Class-transitions.R: R script file with the code that was used to create a sankey diagram for illustrating class transitions. This script uses the following dataset: latent-profile-analysis-results_docca.csv.

    Software requirements: To run the code, the R software environment and R packages specified in the script files need to be installed (open source). The scripts were produced in R version 4.2.1.

  12. Healthcare Management System

    • kaggle.com
    Updated Dec 23, 2023
    Cite
    Anouska Abhisikta (2023). Healthcare Management System [Dataset]. https://www.kaggle.com/datasets/anouskaabhisikta/healthcare-management-system
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 23, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Anouska Abhisikta
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Patients Table:

    • PatientID: Unique identifier for each patient.
    • firstname: First name of the patient.
    • lastname: Last name of the patient.
    • email: Email address of the patient.

    This table stores information about individual patients, including their names and contact details.

    Doctors Table:

    • DoctorID: Unique identifier for each doctor.
    • DoctorName: Full name of the doctor.
    • Specialization: Area of medical specialization.
    • DoctorContact: Contact details of the doctor.

    This table contains details about healthcare providers, including their names, specializations, and contact information.

    Appointments Table:

    • AppointmentID: Unique identifier for each appointment.
    • Date: Date of the appointment.
    • Time: Time of the appointment.
    • PatientID: Foreign key referencing the Patients table, indicating the patient for the appointment.
    • DoctorID: Foreign key referencing the Doctors table, indicating the doctor for the appointment.

    This table records scheduled appointments, linking patients to doctors.

    MedicalProcedure Table:

    • ProcedureID: Unique identifier for each medical procedure.
    • ProcedureName: Name or description of the medical procedure.
    • AppointmentID: Foreign key referencing the Appointments table, indicating the appointment associated with the procedure.

    This table stores details about medical procedures associated with specific appointments.

    Billing Table:

    • InvoiceID: Unique identifier for each billing transaction.
    • PatientID: Foreign key referencing the Patients table, indicating the patient for the billing transaction.
    • Items: Description of items or services billed.
    • Amount: Amount charged for the billing transaction.

    This table maintains records of billing transactions, associating them with specific patients.

    demo Table:

    • ID: Primary key, serves as a unique identifier for each record.
    • Name: Name of the entity.
    • Hint: Additional information or hint about the entity.

    This table appears to be a demonstration or testing table, possibly unrelated to the healthcare management system.

    This dataset schema is designed to capture comprehensive information about patients, doctors, appointments, medical procedures, and billing transactions in a healthcare management system. Adjustments can be made based on specific requirements, and additional attributes can be included as needed.
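    As a sketch of how the foreign keys tie these tables together, assuming each table is exported as a CSV named after itself (the file names are assumptions):

    patients     <- read.csv("Patients.csv")
    doctors      <- read.csv("Doctors.csv")
    appointments <- read.csv("Appointments.csv")

    # One row per appointment, annotated with patient and doctor details
    appt_full <- merge(
      merge(appointments, patients, by = "PatientID"),
      doctors,
      by = "DoctorID"
    )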

  13. heart-disease

    • huggingface.co
    Updated Jun 10, 2020
    Cite
    Marco Buiani (2020). heart-disease [Dataset]. https://huggingface.co/datasets/buio/heart-disease
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 10, 2020
    Authors
    Marco Buiani
    Description

    The Heart Disease Data Set is provided by the Cleveland Clinic Foundation for Heart Disease. It's a CSV file with 303 rows. Each row contains information about a patient (a sample), and each column describes an attribute of the patient (a feature). We use the features to predict whether a patient has heart disease (binary classification). It originates from the UCI Machine Learning Repository.
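    A hedged sketch of that classification task in R, assuming a local copy heart.csv with a 0/1 target column (common in copies of this dataset; both names are assumptions):

    heart <- read.csv("heart.csv")  # assumed local copy of the 303-row CSV
    fit <- glm(target ~ ., data = heart, family = binomial)  # logistic regression
    summary(fit)  # per-feature effects on the odds of heart disease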

  14. Data Management Plan Examples Database

    • search.dataone.org
    • borealisdata.ca
    Updated Sep 4, 2024
    Cite
    Evering, Danica; Acharya, Shrey; Pratt, Isaac; Behal, Sarthak (2024). Data Management Plan Examples Database [Dataset]. http://doi.org/10.5683/SP3/SDITUG
    Explore at:
    Dataset updated
    Sep 4, 2024
    Dataset provided by
    Borealis
    Authors
    Evering, Danica; Acharya, Shrey; Pratt, Isaac; Behal, Sarthak
    Time period covered
    Jan 1, 2011 - Jan 1, 2023
    Description

    This dataset comprises a collection of example DMPs from a wide array of fields, obtained from a number of different sources outlined below. Data included/extracted from the examples include the discipline and field of study, author, institutional affiliation and funding information, location, date created, title, research and data type, description of project, link to the DMP, and where possible external links to related publications or grant pages. This CSV document serves as the content for a McMaster Data Management Plan (DMP) Database as part of the Research Data Management (RDM) Services website, located at https://u.mcmaster.ca/dmps. Other universities and organizations are encouraged to link to the DMP Database or use this dataset as the content for their own DMP Database. This dataset will be updated regularly to include new additions and will be versioned as such. We are gathering submissions at https://u.mcmaster.ca/submit-a-dmp to continue to expand the collection.

  15. Data from: Managing surgical waiting lists through dynamic priority scoring - Datasets

    • figshare.com
    docx
    Updated May 28, 2023
    Cite
    Jack Powers; James M. McGree; David Grieve; Ratna Aseervatham; Suzanne Ryan; Paul Corry (2023). Managing surgical waiting lists through dynamic priority scoring - Datasets [Dataset]. http://doi.org/10.6084/m9.figshare.19771738.v1
    Explore at:
    docx (available download formats)
    Dataset updated
    May 28, 2023
    Dataset provided by
    figshare
    Authors
    Jack Powers; James M. McGree; David Grieve; Ratna Aseervatham; Suzanne Ryan; Paul Corry
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A dataset containing the deidentified patient and clinical factor data used in the associated simulation study. The CSV file contains patient data used to inform arrival distributions and historical summary statistics for model calibration, covering an eight-month period in 2019. The JSON file contains clinical factor weights for each general surgery procedure included in the study, with associated clinical factor sub-levels. The DOCX file contains supplementary material used in conjunction with the clinical factor selection forms for each procedure to provide examples of considerations that may be used for certain clinical criteria.

    Data-specific information for patientData.csv

    Patient data used to inform arrival distributions and historical summary statistics for model calibration, covering 8 months in 2019. Number of columns: 17. Number of rows: 846. Column headings:

    • INDEX = numerical index
    • ID = assigned patient ID for purposes of this study
    • HISTORICAL_ASSIGNED_CATEGORY = the urgency category the patient was historically treated under
    • HISTORICAL_PLACED_ON_LIST_DATE_EPOCH = the number of days from an epoch a patient was placed on the list historically
    • HISTORICAL_OPERATION_DATE_EPOCH = the number of days from an epoch a patient was admitted to surgery historically
    • DAYS_WAITING = the difference between HISTORICAL_PLACED_ON_LIST_DATE_EPOCH and HISTORICAL_OPERATION_DATE_EPOCH in days
    • PROCEDURE = the surgical procedure for which the patient was historically placed on the waiting list
    • OR_TIME = the total operating theatre duration in minutes (sum of PROCEDURE_TIME and CHANGEOVER_TIME)
    • PROCEDURE_TIME = the duration of the respective procedure in minutes
    • CHANGEOVER_TIME = the duration of the required changeover in minutes
    • C1-C5 = clinical factor score 1-5 weights for the respective procedure and sub-level (from clinicalFactorWeights.json)
    • FACTORS_SUM = the sum of C1..C5
    • ASSIGNED_CATEGORY = the revised urgency category assigned to the patient upon clinician review of case files when collecting clinical factor data

    Data-specific information for clinicalFactorWeights.json

    Clinical factor weights for each procedure and associated clinical factor sub-levels. Number of variables: 10. General form of the data structure, where CX = clinical factor X, CX_LY = clinical factor X level Y, and CX_LY_weight = weight of clinical factor X level Y:

    {
      "Procedure 1": {
        "C1": { "C1_L1": "C1_L1_weight", "C1_L2": "C1_L2_weight", "C1_L3": "C1_L3_weight" },
        "C2": { "C2_L1": "C2_L1_weight", "C2_L2": "C2_L2_weight", "C2_L3": "C2_L3_weight" },
        "C3": { "C3_L1": "C3_L1_weight", "C3_L2": "C3_L2_weight", "C3_L3": "C3_L3_weight" },
        "C4": { "C4_L1": "C4_L1_weight", "C4_L2": "C4_L2_weight", "C4_L3": "C4_L3_weight" },
        "C5": { "C5_L1": "C5_L1_weight", "C5_L2": "C5_L2_weight", "C5_L3": "C5_L3_weight" }
      }
    }

    Data-specific information for Clinical Factor Selection Form Supplementary Information and Definitions.docx

    Supplementary material is provided to be used in conjunction with the clinical factor selection forms for each procedure, to provide examples of considerations that may be used for specific clinical criteria.

  16. UCI and OpenML Data Sets for Ordinal Quantification

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 25, 2023
    Cite
    Moreo, Alejandro (2023). UCI and OpenML Data Sets for Ordinal Quantification [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8177301
    Explore at:
    Dataset updated
    Jul 25, 2023
    Dataset provided by
    Moreo, Alejandro
    Bunse, Mirko
    Senz, Martin
    Sebastiani, Fabrizio
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    These four labeled data sets are targeted at ordinal quantification. The goal of quantification is not to predict the label of each individual instance, but to predict the distribution of labels in unlabeled sets of data.

    With the scripts provided, you can extract CSV files from the UCI machine learning repository and from OpenML. The ordinal class labels stem from a binning of a continuous regression label.

    We complement this data set with the indices of data items that appear in each sample of our evaluation. Hence, you can precisely replicate our samples by drawing the specified data items. The indices stem from two evaluation protocols that are well suited for ordinal quantification. To this end, each row in the files app_val_indices.csv, app_tst_indices.csv, app-oq_val_indices.csv, and app-oq_tst_indices.csv represents one sample.

    Our first protocol is the artificial prevalence protocol (APP), where all possible distributions of labels are drawn with an equal probability. The second protocol, APP-OQ, is a variant thereof, where only the smoothest 20% of all APP samples are considered. This variant is targeted at ordinal quantification tasks, where classes are ordered and a similarity of neighboring classes can be assumed.
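    For intuition, "drawn with an equal probability" corresponds to sampling class distributions uniformly from the probability simplex. A minimal R sketch of one such draw (our illustration, not the authors' implementation):

    n_classes <- 5
    p <- rexp(n_classes)  # i.i.d. Exp(1) draws...
    p <- p / sum(p)       # ...normalized: a uniform Dirichlet(1, ..., 1) point on the simplex
    round(p, 3)           # one APP-style label distribution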

    Usage

    You can extract four CSV files through the provided script extract-oq.jl, which is conveniently wrapped in a Makefile. The Project.toml and Manifest.toml specify the Julia package dependencies, similar to a requirements file in Python.

    Preliminaries: You have to have a working Julia installation. We have used Julia v1.6.5 in our experiments.

    Data Extraction: In your terminal, you can call either

    make

    (recommended), or

    julia --project="." --eval "using Pkg; Pkg.instantiate()"
    julia --project="." extract-oq.jl

    Outcome: The first row in each CSV file is the header. The first column, named "class_label", is the ordinal class.

    Further Reading

    Implementation of our experiments: https://github.com/mirkobunse/regularized-oq

  17. Tidal Daily Discharge and Quality Assurance Data Supporting an Assessment of Water Quality and Discharge in the Herring River, Wellfleet, Massachusetts, November 2015–September 2017

    • catalog.data.gov
    • data.usgs.gov
    • +1 more
    Updated Jul 6, 2024
    + more versions
    Cite
    U.S. Geological Survey (2024). Tidal Daily Discharge and Quality Assurance Data Supporting an Assessment of Water Quality and Discharge in the Herring River, Wellfleet, Massachusetts, November 2015–September 2017 [Dataset]. https://catalog.data.gov/dataset/tidal-daily-discharge-and-quality-assurance-data-supporting-an-assessment-of-water-quality
    Explore at:
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Area covered
    Wellfleet, Herring River, Massachusetts
    Description

    This data release provides data in support of an assessment of water quality and discharge in the Herring River at the Chequessett Neck Road dike in Wellfleet, Massachusetts, from November 2015 to September 2017. The assessment was a cooperative project among the U.S. Geological Survey, National Park Service, Cape Cod National Seashore, and the Friends of Herring River to characterize environmental conditions prior to a future removal of the dike. It is described in the U.S. Geological Survey (USGS) Scientific Investigations Report "Assessment of Water Quality and Discharge in the Herring River, Wellfleet, Massachusetts, November 2015 – September 2017."

    This data release is structured as a set of comma-separated values (CSV) files, each of which contains information on the data source (or laboratory used for analysis), USGS site identification (ID) number, beginning date and time of observation or sampling, ending date and time of observation or sampling, and data such as flow rate and analytical results. The files include:

    • Calculated tidal daily flows (Flood_Tide_Tidal_Day.csv and Ebb_Tide_Tidal_Day.csv), which were used in Huntington and others (2020) for estimation of nutrient loads. Tidal daily flows are the estimated mean daily discharges for two consecutive flood and ebb tide cycles (average duration: 24 hours, 48 minutes); the associated date is the day on which most of the flow occurred.
    • Quality assurance data for water-quality samples, including blanks (Blanks.csv), replicates (Replicates.csv), standard reference materials (Standard_Reference_Material.csv), and atmospheric ammonium contamination (NH4_Atmospheric_Contamination.csv).
    • EWI_vs_ISCO.csv, comparing composite samples collected by an automatic sampler (ISCO) at a fixed point with depth-integrated samples collected at equal width increments (EWI).
    • Cross_Section_Field_Parameters.csv, containing field parameter data (specific conductance, temperature, pH, and dissolved oxygen) collected at a fixed location, plus data collected along the cross sections at variable water depths and horizontal distances across the openings of the culverts at the Chequessett Neck Road dike.
    • LOADEST_Bias_Statistics.csv, containing the estimated natural log of load, model residuals, Z-scores, and seasonal model residuals for winter (December, January, and February), spring (March, April, and May), summer (June, July, and August), and fall (September, October, and November).
    • Data_Dictionary.csv, providing detailed descriptions of each field in each CSV file, including data filename; laboratory or data source; USGS site ID numbers; data types; constituent (analyte) USGS parameter codes; descriptions of parameters; units; methods; minimum reporting limits; limits of quantitation, if appropriate; method reference citations; and minimum, maximum, median, and average values for each analyte.
    • Abbreviations.pdf, defining all the abbreviations in the data dictionary and CSV files.

    Note that the USGS site ID includes a leading zero (011058798) and some of the parameter codes contain leading zeros, so care must be taken when opening and subsequently saving these files in other formats where leading zeros may be dropped.
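    Because of that caveat, a cautious way to read these files in R is to load every column as character first and convert measurement columns explicitly afterwards; a minimal sketch using one of the file names above:

    # Leading zeros in site IDs (e.g., "011058798") and parameter codes survive
    # because nothing is coerced to numeric on read.
    flood <- read.csv("Flood_Tide_Tidal_Day.csv", colClasses = "character")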

  18. Raw Data - CSV Files

    • osf.io
    Updated Apr 27, 2020
    Cite
    Katelyn Conn (2020). Raw Data - CSV Files [Dataset]. https://osf.io/h5wbt
    Explore at:
    Dataset updated
    Apr 27, 2020
    Dataset provided by
    Center for Open Science (https://cos.io/)
    Authors
    Katelyn Conn
    Description

    Raw Data in .csv format for use with the R data wrangling scripts.

  19. MHeCor studies Sample 2 data set.csv

    • psycharchives.org
    Updated Jan 7, 2023
    + more versions
    Cite
    (2023). MHeCor studies Sample 2 data set.csv [Dataset]. https://www.psycharchives.org/en/item/2712257d-4098-4b9d-8ba0-ead118a68123
    Explore at:
    Dataset updated
    Jan 7, 2023
    License

    https://doi.org/10.23668/psycharchives.4988

    Description

    The aim of this research was to examine the concurrent relationships between positive and negative experiences of German adults, as well as their changes over three weeks in 2020. Owing to German federalism, we expected these changes to differ between German states. Research data set: Sample 2.

  20. Demographics

    • health.google.com
    Updated Oct 7, 2021
    + more versions
    Cite
    (2021). Demographics [Dataset]. https://health.google.com/covid-19/open-data/raw-data
    Explore at:
    Dataset updated
    Oct 7, 2021
    Variables measured
    key, population, population_male, rural_population, urban_population, population_female, population_density, clustered_population, population_age_00_09, population_age_10_19, and 11 more
    Description

    Various population statistics, including structured demographics data.
