https://spdx.org/licenses/CC0-1.0.html
Objective: To develop a clinical informatics pipeline designed to capture large-scale structured EHR data for a national patient registry.
Materials and Methods: The EHR-R-REDCap pipeline is implemented using R-statistical software to remap and import structured EHR data into the REDCap-based multi-institutional Merkel Cell Carcinoma (MCC) Patient Registry using an adaptable data dictionary.
Results: Clinical laboratory data were extracted from EPIC Clarity across several participating institutions. Labs were transformed, remapped and imported into the MCC registry using the EHR labs abstraction (eLAB) pipeline. Forty-nine clinical tests encompassing 482,450 results were imported into the registry for 1,109 enrolled MCC patients. Data-quality assessment revealed highly accurate, valid labs. Univariate modeling was performed for labs at baseline on overall survival (N=176) using this clinical informatics pipeline.
Conclusion: We demonstrate the feasibility of the facile eLAB workflow. EHR data are successfully transformed and bulk-loaded/imported into a REDCap-based national registry to enable real-world data analysis and interoperability.
Methods
eLAB Development and Source Code (R statistical software):
eLAB is written in R (version 4.0.3), and utilizes the following packages for processing: DescTools, REDCapR, reshape2, splitstackshape, readxl, survival, survminer, and tidyverse. Source code for eLAB can be downloaded directly (https://github.com/TheMillerLab/eLAB).
eLAB reformats EHR data abstracted for an identified population of patients (e.g. medical record numbers (MRN)/name list) under an Institutional Review Board (IRB)-approved protocol. The MCCPR does not host MRNs/names and eLAB converts these to MCCPR assigned record identification numbers (record_id) before import for de-identification.
Functions were written to remap EHR bulk lab data pulls/queries from several sources, including Clarity/Crystal reports or an institutional enterprise data warehouse (EDW) such as the Research Patient Data Registry (RPDR) at MGB. The input, a csv/delimited file of labs for user-defined patients, may vary. Thus, users may need to adapt the initial data wrangling script based on the data input format. However, the downstream transformation, code-lab lookup tables, outcomes analysis, and LOINC remapping are standard for use with the provided REDCap Data Dictionary, DataDictionary_eLAB.csv. The available R-markdown (https://github.com/TheMillerLab/eLAB) provides suggestions and instructions on where or when upfront script modifications may be necessary to accommodate input variability.
The eLAB pipeline takes several inputs. For example, the input for use with the ‘ehr_format(dt)’ single-line command is non-tabular data assigned as R object ‘dt’ with 4 columns: 1) Patient Name (MRN), 2) Collection Date, 3) Collection Time, and 4) Lab Results wherein several lab panels are in one data frame cell. A mock dataset in this ‘untidy-format’ is provided for demonstration purposes (https://github.com/TheMillerLab/eLAB).
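A minimal R sketch of assembling such an input and calling ehr_format(), assuming the eLAB functions have been loaded (e.g. by sourcing the scripts from https://github.com/TheMillerLab/eLAB); the column names and example values below are illustrative only, and the mock dataset in the repository should be preferred for real testing.

```r
# Illustrative 'untidy-format' input: one row per collection, with several lab
# panel results packed into a single data-frame cell.
dt <- data.frame(
  patient_name    = "DOE,JANE (0123456)",                                    # Patient Name (MRN)
  collection_date = "2021-01-15",                                            # Collection Date
  collection_time = "08:42",                                                 # Collection Time
  lab_results     = "Sodium 140 mmol/L; Potassium 4.1 mmol/L; Creatinine 0.9 mg/dL",
  stringsAsFactors = FALSE
)

# Single-line command described in the text: reshape the pull so each lab result
# occupies its own row, ready for downstream remapping against the DD.
labs_long <- ehr_format(dt)
```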
Bulk lab data pulls often result in subtypes of the same lab. For example, potassium labs are reported as “Potassium,” “Potassium-External,” “Potassium(POC),” “Potassium,whole-bld,” “Potassium-Level-External,” “Potassium,venous,” and “Potassium-whole-bld/plasma.” eLAB utilizes a key-value lookup table with ~300 lab subtypes for remapping labs to the Data Dictionary (DD) code. eLAB reformats/accepts only those lab units pre-defined by the registry DD. The lab lookup table is provided for direct use or may be re-configured/updated to meet end-user specifications. eLAB is designed to remap, transform, and filter/adjust value units of semi-structured/structured bulk laboratory values data pulls from the EHR to align with the pre-defined code of the DD.
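The remapping step can be pictured as a key-value lookup join. The sketch below uses dplyr rather than the eLAB code itself; the lookup entries, the DD code/unit values, and the labs_long column names (lab_name, unit) are illustrative assumptions, while the full ~300-entry lookup table ships with the repository.

```r
library(dplyr)
library(tibble)

# Illustrative slice of a key-value lookup table mapping EHR lab subtypes to DD codes/units
lab_lookup <- tribble(
  ~ehr_lab_name,          ~dd_code,    ~dd_unit,
  "Potassium",            "potassium", "mmol/L",
  "Potassium(POC)",       "potassium", "mmol/L",
  "Potassium,whole-bld",  "potassium", "mmol/L",
  "Potassium-External",   "potassium", "mmol/L"
)

labs_remapped <- labs_long %>%                                    # output of ehr_format() above
  inner_join(lab_lookup, by = c("lab_name" = "ehr_lab_name")) %>% # keep only labs defined in the DD
  filter(unit == dd_unit)                                         # drop results whose units differ from the DD
```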
Data Dictionary (DD)
EHR clinical laboratory data is captured in REDCap using the ‘Labs’ repeating instrument (Supplemental Figures 1-2). The DD is provided for use by researchers at REDCap-participating institutions and is optimized to accommodate the same lab-type captured more than once on the same day for the same patient. The instrument captures 35 clinical lab types. The DD serves several major purposes in the eLAB pipeline. First, it defines every lab type of interest and associated lab unit of interest with a set field/variable name. It also restricts/defines the type of data allowed for entry for each data field, such as a string or numerics. The DD is uploaded into REDCap by every participating site/collaborator and ensures each site collects and codes the data the same way. Automation pipelines, such as eLAB, are designed to remap/clean and reformat data/units utilizing key-value look-up tables that filter and select only the labs/units of interest. eLAB ensures the data pulled from the EHR contains the correct unit and format pre-configured by the DD. The use of the same DD at every participating site ensures that the data field code, format, and relationships in the database are uniform across each site to allow for the simple aggregation of the multi-site data. For example, since every site in the MCCPR uses the same DD, aggregation is efficient and different site csv files are simply combined.
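As a minimal sketch of that aggregation step (file names are hypothetical), per-site exports that share the same DD-defined columns can simply be stacked:

```r
library(dplyr)
library(readr)

site_files <- c("site_A_labs.csv", "site_B_labs.csv", "site_C_labs.csv")  # hypothetical exports

registry_labs <- site_files %>%
  lapply(read_csv) %>%        # identical columns at every site, thanks to the shared DD
  bind_rows(.id = "site")     # adds a column identifying which file each row came from
```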
Study Cohort
This study was approved by the MGB IRB. Search of the EHR was performed to identify patients diagnosed with MCC between 1975-2021 (N=1,109) for inclusion in the MCCPR. Subjects diagnosed with primary cutaneous MCC between 2016-2019 (N= 176) were included in the test cohort for exploratory studies of lab result associations with overall survival (OS) using eLAB.
Statistical Analysis
OS is defined as the time from date of MCC diagnosis to date of death. Data was censored at the date of the last follow-up visit if no death event occurred. Univariable Cox proportional hazard modeling was performed among all lab predictors. Due to the hypothesis-generating nature of the work, p-values were exploratory and Bonferroni corrections were not applied.
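A minimal sketch of such a univariable screen with the survival package; the data frame cohort and the lab and outcome variable names are illustrative, not the registry's actual field names.

```r
library(survival)

# 'cohort': one row per patient, with overall-survival time in months, an event
# indicator (1 = death, 0 = censored at last follow-up), and baseline lab values.
lab_vars <- c("sodium", "potassium", "creatinine")   # hypothetical baseline labs

univariable_fits <- lapply(lab_vars, function(v) {
  coxph(as.formula(paste("Surv(os_months, death_event) ~", v)), data = cohort)
})
names(univariable_fits) <- lab_vars

# Exploratory hazard ratios and p-values (no Bonferroni correction applied)
lapply(univariable_fits, function(f) summary(f)$coefficients)
```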
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Note: none of the data sets published here contain actual data, they are for testing purposes only.
This data repository contains graph datasets, where each graph is represented by two CSV files: one for node information and another for edge details. To link the files to the same graph, their names include a common identifier based on the number of nodes. For example:
- dataset_30_nodes_interactions.csv: contains 30 rows (nodes).
- dataset_30_edges_interactions.csv: contains 47 rows (edges).
- The common prefix dataset_30 indicates that both files refer to the same graph.

Each node file contains the following columns:
| Name of the Column | Type | Description |
| --- | --- | --- |
| UniProt ID | string | protein identification |
| label | string | protein label (type of node) |
| properties | string | a dictionary containing properties related to the protein |

Each edge file contains the following columns:

| Name of the Column | Type | Description |
| --- | --- | --- |
| Relationship ID | string | relationship identification |
| Source ID | string | identification of the source protein in the relationship |
| Target ID | string | identification of the target protein in the relationship |
| label | string | relationship label (type of relationship) |
| properties | string | a dictionary containing properties related to the relationship |
| Graph | Number of Nodes | Number of Edges | Sparse graph |
| --- | --- | --- | --- |
| dataset_30* | 30 | 47 | Y |
| dataset_60* | 60 | 181 | Y |
| dataset_120* | 120 | 689 | Y |
| dataset_240* | 240 | 2819 | Y |
| dataset_300* | 300 | 4658 | Y |
| dataset_600* | 600 | 18004 | Y |
| dataset_1200* | 1200 | 71785 | Y |
| dataset_2400* | 2400 | 288600 | Y |
| dataset_3000* | 3000 | 449727 | Y |
| dataset_6000* | 6000 | 1799413 | Y |
| dataset_12000* | 12000 | 7199863 | Y |
| dataset_24000* | 24000 | 28792361 | Y |
| dataset_30000* | 30000 | 44991744 | Y |
This repository also includes two additional tiny graph datasets to experiment with before dealing with the larger datasets.

Each node file of these tiny datasets contains the following columns:

| Name of the Column | Type | Description |
| --- | --- | --- |
| ID | string | node identification |
| label | string | node label (type of node) |
| properties | string | a dictionary containing properties related to the node |

Each edge file contains the following columns:

| Name of the Column | Type | Description |
| --- | --- | --- |
| ID | string | relationship identification |
| source | string | identification of the source node in the relationship |
| target | string | identification of the target node in the relationship |
| label | string | relationship label (type of relationship) |
| properties | string | a dictionary containing properties related to the relationship |
| Graph | Number of Nodes | Number of Edges | Sparse graph |
| --- | --- | --- | --- |
| dataset_dummy* | 3 | 6 | N |
| dataset_dummy2* | 3 | 6 | N |
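A minimal R sketch of loading one node/edge file pair by its shared identifier, assuming the column headers match the tables above (read.csv converts spaces in headers to dots, so "UniProt ID" becomes UniProt.ID); the properties column is a string-encoded dictionary, and parsing it is left to the user.

```r
# Load the node and edge files of the dataset_30 graph
nodes <- read.csv("dataset_30_nodes_interactions.csv", stringsAsFactors = FALSE)
edges <- read.csv("dataset_30_edges_interactions.csv", stringsAsFactors = FALSE)

nrow(nodes)  # expected: 30
nrow(edges)  # expected: 47

# Attach node information to each edge via the source protein identifier
edges_with_source <- merge(edges, nodes, by.x = "Source.ID", by.y = "UniProt.ID")
```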
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset contains >800K CSV files behind the GitTables 1M corpus.
For more information about the GitTables corpus, see the GitTables project website.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This simulated dataset constitutes two files (after decompression): sim_ergo_1600.csv and sim_pat_1600.csv.
1. sim_ergo_1600.csv contains heart-rate time-series data for 1,600 patients' ergometric tests. For each patient, 20 different ergometric tests were simulated. Each row in this file constitutes three field values: Ergo_ID, Heart Rate (BPM), and timestamp.
2. sim_pat_1600.csv contains only four sample readings from each of the patient's 20 ergometric tests. Each row contains three values: patient_ID, Heart Rate, and timestamp.
The goal is to link patients (identified by their patient_ID in sim_pat_1600.csv) to their corresponding ergometric tests (identified by their Ergo_ID in sim_ergo_1600.csv), based solely on matching the timestamp-value pairs from both files. This time-series record linkage task is efficiently accomplished by the proposed tslink2 algorithm. tslink2 is implemented in C++ and is publicly available at https://github.com/ahmsoliman/tslink2.
Data are simulated such that correctly linked/matched identifiers follow the formula |Ergo_ID - patient_ID| mod 104 == 0. This formula is useful for evaluating the linkage algorithm's performance.
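As a minimal sketch (not part of tslink2 itself), the ground-truth rule above can be checked in R for any set of candidate links; the column names Ergo_ID and patient_ID follow the description above, and the example pairs are hypothetical.

```r
# Check whether candidate (patient_ID, Ergo_ID) pairs satisfy the simulated
# ground-truth rule |Ergo_ID - patient_ID| mod 104 == 0.
is_true_link <- function(ergo_id, patient_id) {
  abs(ergo_id - patient_id) %% 104 == 0
}

# Hypothetical predicted links produced by a linkage algorithm
predicted <- data.frame(patient_ID = c(1, 2, 3),
                        Ergo_ID    = c(105, 314, 200))
predicted$correct <- is_true_link(predicted$Ergo_ID, predicted$patient_ID)
mean(predicted$correct)  # fraction of predicted links that are true links
```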
https://research.csiro.au/dap/licences/csiro-data-licence/
A csv file containing the tidal frequencies used for statistical analyses in the paper "Estimating Freshwater Flows From Tidally-Affected Hydrographic Data" by Dan Pagendam and Don Percival.
http://www.gnu.org/licenses/lgpl-3.0.html
The dataset provided here is a rich compilation of various data files gathered to support diverse analytical challenges and education in data science. It is especially curated to provide researchers, data enthusiasts, and students with real-world data across different domains, including biostatistics, travel, real estate, sports, media viewership, and more.
Below is a brief overview of what each CSV file contains:
- Addresses: Practical examples of string manipulation and address data formatting in CSV.
- Air Travel: Historical dataset suitable for analyzing trends in air travel over a period of three years.
- Biostats: A dataset of office workers' biometrics, ideal for introductory statistics and biology.
- Cities: Geographic and administrative data for urban analysis or socio-demographic studies.
- Car Crashes in Catalonia: Weekly traffic accident data from Catalonia, providing a base for public policy research.
- De Niro's Film Ratings: Analyze trends in film ratings over time with this entertainment-focused dataset.
- Ford Escort Sales: Pre-owned vehicle sales data, perfect for regression analysis or price prediction models.
- Old Faithful Geyser: Geological data for pattern recognition and prediction in natural phenomena.
- Freshman Year Weights and BMIs: Dataset depicting weight and BMI changes for health and lifestyle studies.
- Grades: Education performance data which can be correlated with demographics or study patterns.
- Home Sales: A dataset reflecting the housing market dynamics, useful for economic analysis or real estate appraisal.
- Hooke's Law Demonstration: Physics data illustrating the classic principle of elasticity in springs.
- Hurricanes and Storm Data: Climate data on hurricane and storm frequency for environmental risk assessments.
- Height and Weight Measurements: Public health research dataset on anthropometric data.
- Lead Shot Specs: Detailed engineering data for material sciences and manufacturing studies.
- Alphabet Letter Frequency: Text analysis dataset for frequency distribution studies in large text samples.
- MLB Player Statistics: Comprehensive athletic data set for analysis of performance metrics in sports.
- MLB Teams' Seasonal Performance: A dataset combining financial and sports performance data from the 2012 MLB season.
- TV News Viewership: Media consumption data which can be used to analyze viewing patterns and trends.
- Historical Nile Flood Data: A unique environmental dataset for historical trend analysis in flood levels.
- Oscar Winner Ages: A dataset to explore age trends among Oscar-winning actors and actresses.
- Snakes and Ladders Statistics: Data from the game outcomes useful in studying probability and game theory.
- Tallahassee Cab Fares: Price modeling data from the real-world pricing of taxi services.
- Taxable Goods Data: A snapshot of economic data concerning taxation impact on prices.
- Tree Measurements: Ecological and environmental science data related to tree growth and forest management.
- Real Estate Prices from Zillow: Market analysis dataset for those interested in housing price determinants.
The enclosed data respect the comma-separated values (CSV) file format standards, ensuring compatibility with most data processing libraries in Python, R, and other languages. The datasets are ready for import into Jupyter notebooks, RStudio, or any other integrated development environment (IDE) used for data science.
The data is pre-checked for common issues such as missing values, duplicate records, and inconsistent entries, offering a clean and reliable dataset for various analytical exercises. With initial header lines in some CSV files, users can easily identify dataset fields and start their analysis without additional data cleaning for headers.
The dataset adheres to the GNU LGPL license, making it freely available for modification and distribution, provided that the original source is cited. This opens up possibilities for educators to integrate real-world data into curricula, researchers to validate models against diverse datasets, and practitioners to refine their analytical skills with hands-on data.
This dataset has been compiled from https://people.sc.fsu.edu/~jburkardt/data/csv/csv.html, with gratitude to the authors and maintainers for their dedication to providing open data resources for educational and research purposes.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset is an Electronic Health Record prediction dataset collected from a private hospital in Indonesia. It contains patients' laboratory test results, which are used to determine the next treatment: whether the patient is treated as an in-care or an out-care patient. The task embedded in the dataset is classification prediction.
The first dataset (Oesophageal Cancer Clinical.csv) has clinical data of Oesophageal Carcinoma patients.
The second dataset (Oesophageal Cancer Protein.csv) has the protein expression data for the same set of patients.
The two datasets contain information on the same patients. However, the clinical dataset contains a greater number of patient records than have corresponding protein expression data in the second dataset. The clinical dataset has patient_barcode as the unique identifier, whereas the protein expression dataset uses Sample_ID. In both datasets, the patient_barcode can be derived as "TCGA" + "-" + tissue_source_site + "-" + patient_id, e.g. TCGA-2H-A9GF.
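For illustration, a minimal R sketch of the barcode derivation described above; the helper name derive_barcode is hypothetical.

```r
# Hypothetical helper: build the patient_barcode from its components.
derive_barcode <- function(tissue_source_site, patient_id) {
  paste("TCGA", tissue_source_site, patient_id, sep = "-")
}

derive_barcode("2H", "A9GF")
# [1] "TCGA-2H-A9GF"

# Once both tables carry a patient_barcode column, they can be joined, e.g.
# merge(clinical, protein, by = "patient_barcode").
```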
There is a large number of columns in these datasets (83 clinical and 223 protein), giving you great scope to derive interesting findings about oesophageal cancer from this real dataset.
This dataset can be used to analyse genes, mutations and interesting findings relating to this type of cancer.
[IMPORTANT NOTE: Sample file posted on Datarade is not the complete dataset, as Datarade permits only a single CSV file. Visit https://www.careprecise.com/healthcare-provider-data-sample.htm for more complete samples.] CarePrecise developed the AHD, which is updated every month, to provide a comprehensive database of U.S. hospital information. Extracted from the CarePrecise master provider database, which holds information on all of the 6.3 million HIPAA-covered US healthcare providers, and supplemented with additional sources, the Authoritative Hospital Database (AHD) contains records for all HIPAA-covered hospitals. In this database of hospitals we include bed counts, patient satisfaction data, hospital system ownership, hospital charges and cases by Zip Code®, and more. Most records include a cabinet-level or director-level contact. A PlaceKey is provided where available.
The AHD includes bed counts for 95% of hospitals, full contact information on 85%, and fax numbers for 62%. We include detailed patient satisfaction data, employee counts, and medical procedure volumes.
The AHD integrates directly with our extended provider data product to bring you the physicians and practice groups affiliated with the hospitals. This combination of data is the only commercially available hospital dataset of this depth.
NEW: Hospital NPI to CCN Rollup A CarePrecise Exclusive. Using advanced record-linkage technology, the AHD now includes a new file that makes it possible to mine the vast hospital information available in the National Provider Identifier registry database. Hospitals may have dozens of NPI records, each with its own information about a unit, listing facility type and/or medical specialties practiced, as well as separate contact names. To wield the power of this new feature, you'll need the CarePrecise Master Bundle, which contains all of the publicly available NPI registry data. These data are available in other CarePrecise data products.
Counts are approximate due to ongoing updates. Please review the current AHD information here: https://www.careprecise.com/detail_authoritative_hospital_database.htm
The AHD is sold as-is and no warranty is offered regarding accuracy, timeliness, completeness, or fitness for any purpose.
[doc] formats - csv - 3
This dataset contains one csv file at the root:
data.csv
col1|col2
dog|woof
cat|meow
pokemon|pika
human|hello
We define the config name in the YAML config, as well as the exact location of the file, the separator as "|", the names of the columns, and the number of rows to ignore (row #1 is a row of column headers that will be replaced by the names option, and row #0 is ignored). The reference for the options is the documentation… See the full description on the dataset page: https://huggingface.co/datasets/datasets-examples/doc-formats-csv-3.
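For users working locally rather than through the Hugging Face loader, a minimal R sketch applying equivalent options (separator "|", supplied column names, a skipped header row) is shown below; the skip count assumes the file content shown above and may need adjusting to the actual file.

```r
# Read data.csv with "|" as the separator, skip the original header row,
# and supply the column names explicitly (mirroring the options described above).
df <- read.delim("data.csv", sep = "|", skip = 1, header = FALSE,
                 col.names = c("col1", "col2"), stringsAsFactors = FALSE)
df
```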
This data description contains code (written in the R programming language), as well as processed data and results presented in a research article (see references). No raw data are provided and the data that are made available cannot be linked to study participants. The sample consists of 180 of 308 eligible participants (adult primary care patients in Sweden, living with chronic illness) who responded to a Swedish web-based questionnaire at two time points. Using a confirmatory factor analysis, we calculated latent factor scores for 9 constructs, based on 34 questionnaire items. In this dataset, we share the latent factor scores and the latent profile analysis results. Although raw data are not shared, we provide the questionnaire item, including response scales. The code that was used to produce the latent factor scores and latent profile analysis results is also provided.
The study was performed as part of a research project exploring how the use of eHealth services in chronic care influences interaction and collaboration between patients and healthcare. The purpose of the study was to identify subgroups of primary care patients who are similar with respect to their experiences of co-care, as measured by the DoCCA scale (von Thiele Schwarz, 2021). Baseline data were collected after patients had been introduced to an eHealth service that aimed to support them in their self-care and digital communication with healthcare; follow-up data were collected 7 months later. All patients were treated at the same primary care center, located in the Stockholm Region in Sweden.
Cited reference: von Thiele Schwarz U, Roczniewska M, Pukk Härenstam K, Karlgren K, Hasson H, Menczel S, Wannheden C. The work of having a chronic condition: Development and psychometric evaluation of the Distribution of Co-Care Activities (DoCCA) Scale. BMC Health Services Research (2021) 21:480. doi: 10.1186/s12913-021-06455-8
The DATASET consists of two files: factorscores_docca.csv and latent-profile-analysis-results_docca.csv.
factorscores_docca.csv: This file contains 18 variables (columns) and 180 cases (rows). The variables represent latent factors (measured at two time points, T1 and T2) and the values are latent factor scores. The questionnaire data that were used to produce the latent factor scores consist of 20 items that measure experiences of collaboration with healthcare, based on the DoCCA scale. These items were included in the latent profile analysis. Additionally, latent factor scores reflecting perceived self-efficacy in self-care (6 items), satisfaction with healthcare (2 items), self-rated health (2 items), and perceived impact of e-health (4 items) were calculated. These items were used to make comparisons between profiles resulting from the latent profile analysis. Variable definitions are provided in a separate file (see below).
latent-profile-analysis-results_docca.csv: This file contains 14 variables (columns) and 180 cases (rows). The variables represent profile classifications (numbers and labels) and posterior classification probabilities for each of the identified profiles, 4 profiles at T1 and 5 profiles at T2. Transition probabilities (from T1 to T2 profiles) were not calculated due to lacking configural similarity of profiles at T1 and T2; hence no transition probabilities are provided.
The ASSOCIATED DOCUMENTATION consists of one file with variable definitions in English and Swedish, and four script files (written in the R programming language):
variable-definitions_swe-eng.xlsx: This file consists of four sheets. Sheet 1 (scale-items_original_swedish) specifies the questionnaire items (in Swedish) that were used to calculate the latent factor scores; response scales are included. Sheet 2 (scale-items_translated_english) provides an English translation of the questionnaire items and response scales provided in Sheet 1. Sheet 3 (factorscores_docca) defines the variables in the factorscores_docca.csv dataset. Sheet 4 (latent-profile-analysis-results) defines the variables in the latent-profile-analysis-results_docca.csv dataset.
R-script_Step-0_Factor-scores.R: R script file with the code that was used to calculate the latent factor scores. This script can only be run with access to the raw data file which is not publicly shared due to ethical constraints. Hence, the purpose of the script file is code transparency. Also, the script shows the model specification that was used in the confirmatory factor analysis (CFA). Missingness in data was accounted for by using Full Information Maximum Likelihood (FIML).
R-script_Step-1_Latent-profile-analysis.R: R script file with the code that was used to run the latent profile analyses at T1 and T2 and produce profile plots. This code can be run with the provided dataset factorscores_docca.csv. Note that the script generates the results that are provided in the latent-profile-analysis-results_docca.csv dataset.
R-script_Step-2_Non-parametric-tests.R: R script file with the code that was used to run non-parametric tests for comparing exogenous variables between profiles at T1 and T2. This script uses the following datasets: factorscores_docca.csv and latent-profile-analysis-results_docca.csv.
R-script_Step-3_Class-transitions.R: R script file with the code that was used to create a sankey diagram for illustrating class transitions. This script uses the following dataset: latent-profile-analysis-results_docca.csv.
Software requirements: To run the code, the R software environment and R packages specified in the script files need to be installed (open source). The scripts were produced in R version 4.2.1.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Patients Table:
This table stores information about individual patients, including their names and contact details.
Doctors Table:
This table contains details about healthcare providers, including their names, specializations, and contact information.
Appointments Table:
This table records scheduled appointments, linking patients to doctors.
MedicalProcedure Table:
This table stores details about medical procedures associated with specific appointments.
Billing Table:
This table maintains records of billing transactions, associating them with specific patients.
demo Table:
This table appears to be a demonstration or testing table, possibly unrelated to the healthcare management system.
This dataset schema is designed to capture comprehensive information about patients, doctors, appointments, medical procedures, and billing transactions in a healthcare management system. Adjustments can be made based on specific requirements, and additional attributes can be included as needed.
The Heart Disease Data Set is provided by the Cleveland Clinic Foundation for Heart Disease. It is a CSV file with 303 rows. Each row contains information about a patient (a sample), and each column describes an attribute of the patient (a feature). We use the features to predict whether a patient has heart disease (binary classification).
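A minimal R sketch of the binary-classification task described above, assuming a hypothetical local copy named heart.csv with a hypothetical binary outcome column named target; the actual file and column names in the release may differ.

```r
heart <- read.csv("heart.csv")   # hypothetical local copy of the 303-row CSV

# Simple logistic-regression baseline: predict heart disease from all other columns
fit <- glm(target ~ ., data = heart, family = binomial())
summary(fit)

# Predicted probabilities of heart disease for the first few patients
head(predict(fit, type = "response"))
```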
This dataset is comprised of a collection of example DMPs from a wide array of fields; obtained from a number of different sources outlined below. Data included/extracted from the examples include the discipline and field of study, author, institutional affiliation and funding information, location, date created, title, research and data-type, description of project, link to the DMP, and where possible external links to related publications or grant pages. This CSV document serves as the content for a McMaster Data Management Plan (DMP) Database as part of the Research Data Management (RDM) Services website, located at https://u.mcmaster.ca/dmps. Other universities and organizations are encouraged to link to the DMP Database or use this dataset as the content for their own DMP Database. This dataset will be updated regularly to include new additions and will be versioned as such. We are gathering submissions at https://u.mcmaster.ca/submit-a-dmp to continue to expand the collection.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A dataset containing the deidentified patient and clinical factor data used in the associated simulation study. The CSV file contains patient data used to inform arrival distributions and historical summary statistics for model calibration, covering an eight-month period in 2019. The JSON file contains clinical factor weights for each general surgery procedure included in the study, with associated clinical factor sub-levels. The DOCX file contains supplementary material used in conjunction with the clinical factor selection forms for each procedure to provide examples of considerations that may be used for certain clinical criteria.
Data-specific information for: patientData.csv

Patient data used to inform arrival distributions and historical summary statistics for model calibration, covering 8 months in 2019.
Number of columns: 17
Number of rows: 846

Column headings:
- INDEX = numerical index
- ID = assigned patient ID for the purposes of this study
- HISTORICAL_ASSIGNED_CATEGORY = the urgency category the patient was historically treated under
- HISTORICAL_PLACED_ON_LIST_DATE_EPOCH = the number of days from an epoch that the patient was placed on the list historically
- HISTORICAL_OPERATION_DATE_EPOCH = the number of days from an epoch that the patient was admitted to surgery historically
- DAYS_WAITING = the difference between HISTORICAL_PLACED_ON_LIST_DATE_EPOCH and HISTORICAL_OPERATION_DATE_EPOCH in days
- PROCEDURE = the surgical procedure for which the patient was historically placed on the waiting list
- OR_TIME = the total operating theatre duration in minutes (sum of PROCEDURE_TIME and CHANGEOVER_TIME)
- PROCEDURE_TIME = the duration of the respective procedure in minutes
- CHANGEOVER_TIME = the duration of the required changeover in minutes
- C1 = clinical factor score 1 weight for the respective procedure and sub-level (from clinicalFactorWeights.json)
- C2 = clinical factor score 2 weight for the respective procedure and sub-level (from clinicalFactorWeights.json)
- C3 = clinical factor score 3 weight for the respective procedure and sub-level (from clinicalFactorWeights.json)
- C4 = clinical factor score 4 weight for the respective procedure and sub-level (from clinicalFactorWeights.json)
- C5 = clinical factor score 5 weight for the respective procedure and sub-level (from clinicalFactorWeights.json)
- FACTORS_SUM = the sum of C1..C5
- ASSIGNED_CATEGORY = the revised urgency category assigned to the patient upon clinician review of case files when collecting clinical factor data

Data-specific information for: clinicalFactorWeights.json

Clinical factor weights for each procedure and associated clinical factor sub-levels.
Number of variables: 10
General form of the data structure (CX = clinical factor X; CX_LY = clinical factor X, level Y; CX_LY_weight = weight of clinical factor X, level Y):

{
  "Procedure 1": {
    "C1": { "C1_L1": "C1_L1_weight", "C1_L2": "C1_L2_weight", "C1_L3": "C1_L3_weight" },
    "C2": { "C2_L1": "C2_L1_weight", "C2_L2": "C2_L2_weight", "C2_L3": "C2_L3_weight" },
    "C3": { "C3_L1": "C3_L1_weight", "C3_L2": "C3_L2_weight", "C3_L3": "C3_L3_weight" },
    "C4": { "C4_L1": "C4_L1_weight", "C4_L2": "C4_L2_weight", "C4_L3": "C4_L3_weight" },
    "C5": { "C5_L1": "C5_L1_weight", "C5_L2": "C5_L2_weight", "C5_L3": "C5_L3_weight" }
  }
}

Data-specific information for: Clinical Factor Selection Form Supplementary Information and Definitions.docx

Supplementary material provided to be used in conjunction with the clinical factor selection forms for each procedure, to provide examples of considerations that may be used for specific clinical criteria.
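A minimal R sketch, using only the column headings listed above, that recomputes the derived fields in patientData.csv and checks them against the stored values:

```r
pat <- read.csv("patientData.csv")

days_waiting <- pat$HISTORICAL_OPERATION_DATE_EPOCH - pat$HISTORICAL_PLACED_ON_LIST_DATE_EPOCH
or_time      <- pat$PROCEDURE_TIME + pat$CHANGEOVER_TIME
factors_sum  <- unname(rowSums(pat[, c("C1", "C2", "C3", "C4", "C5")]))

# Each comparison should return TRUE if the derived columns are internally consistent
all.equal(days_waiting, pat$DAYS_WAITING)
all.equal(or_time,      pat$OR_TIME)
all.equal(factors_sum,  pat$FACTORS_SUM)
```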
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These four labeled data sets are targeted at ordinal quantification. The goal of quantification is not to predict the label of each individual instance, but the distribution of labels in unlabeled sets of data.
With the scripts provided, you can extract CSV files from the UCI machine learning repository and from OpenML. The ordinal class labels stem from a binning of a continuous regression label.
We complement this data set with the indices of data items that appear in each sample of our evaluation. Hence, you can precisely replicate our samples by drawing the specified data items. The indices stem from two evaluation protocols that are well suited for ordinal quantification. To this end, each row in the files app_val_indices.csv, app_tst_indices.csv, app-oq_val_indices.csv, and app-oq_tst_indices.csv represents one sample.
Our first protocol is the artificial prevalence protocol (APP), where all possible distributions of labels are drawn with an equal probability. The second protocol, APP-OQ, is a variant thereof, where only the smoothest 20% of all APP samples are considered. This variant is targeted at ordinal quantification tasks, where classes are ordered and a similarity of neighboring classes can be assumed.
Usage
You can extract four CSV files through the provided script extract-oq.jl, which is conveniently wrapped in a Makefile. The Project.toml and Manifest.toml specify the Julia package dependencies, similar to a requirements file in Python.
Preliminaries: You have to have a working Julia installation. We have used Julia v1.6.5 in our experiments.
Data Extraction: In your terminal, you can call either
make
(recommended), or
julia --project="." --eval "using Pkg; Pkg.instantiate()"
julia --project="." extract-oq.jl
Outcome: The first row in each CSV file is the header. The first column, named "class_label", is the ordinal class.
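A minimal R sketch of how the index files can be used to reconstruct one evaluation sample, assuming a hypothetical extracted data file data.csv with the class_label column described above, and assuming the index files have no header row; whether the stored indices are zero- or one-based should be checked against the extraction scripts.

```r
data    <- read.csv("data.csv")                        # one of the extracted CSV files
indices <- read.csv("app_val_indices.csv", header = FALSE)

# Each row of the index file lists the data items of one APP validation sample.
sample_1 <- data[unlist(indices[1, ]), ]               # add 1 to the indices if they are zero-based

# Label distribution of that sample: the quantity a quantifier must estimate
prop.table(table(sample_1$class_label))
```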
Further Reading
Implementation of our experiments: https://github.com/mirkobunse/regularized-oq
This data release provides data in support of an assessment of water quality and discharge in the Herring River at the Chequessett Neck Road dike in Wellfleet, Massachusetts, from November 2015 to September 2017. The assessment was a cooperative project among the U.S. Geological Survey, National Park Service, Cape Cod National Seashore, and the Friends of Herring River to characterize environmental conditions prior to a future removal of the dike. It is described in the U.S. Geological Survey (USGS) Scientific Investigations Report "Assessment of Water Quality and Discharge in the Herring River, Wellfleet, Massachusetts, November 2015 – September 2017."

This data release is structured as a set of comma-separated values (CSV) files, each of which contains information on data source (or laboratory used for analysis), USGS site identification (ID) number, beginning date and time of observation or sampling, ending date and time of observation or sampling, and data such as flow rate and analytical results.

The CSV files include calculated tidal daily flows (Flood_Tide_Tidal_Day.csv and Ebb_Tide_Tidal_Day.csv) that were used in Huntington and others (2020) for estimation of nutrient loads. Tidal daily flows are the estimated mean daily discharges for two consecutive flood and ebb tide cycles (average duration: 24 hours, 48 minutes). The associated date is the day on which most of the flow occurred.

CSV files contain quality assurance data for water-quality samples, including blanks (Blanks.csv), replicates (Replicates.csv), standard reference materials (Standard_Reference_Material.csv), and atmospheric ammonium contamination (NH4_Atmospheric_Contamination.csv). One CSV file (EWI_vs_ISCO.csv) contains data comparing composite samples collected by an automatic sampler (ISCO) at a fixed point with depth-integrated samples collected at equal width increments (EWI). One CSV file (Cross_Section_Field_Parameters.csv) contains field parameter data (specific conductance, temperature, pH, and dissolved oxygen) collected at a fixed location and data collected along the cross sections at variable water depths and horizontal distances across the openings of the culverts at the Chequessett Neck Road dike. One CSV file (LOADEST_Bias_Statistics.csv) contains data that include estimated natural log of load, model residuals, Z-scores, and seasonal model residuals for winter (December, January, and February); spring (March, April, and May); summer (June, July, and August); and fall (September, October, and November).

The data release also includes a data dictionary (Data_Dictionary.csv) that provides detailed descriptions of each field in each CSV file, including: data filename; laboratory or data source; U.S. Geological Survey site ID numbers; data types; constituent (analyte) U.S. Geological Survey parameter codes; descriptions of parameters; units; methods; minimum reporting limits; limits of quantitation, if appropriate; method reference citations; and minimum, maximum, median, and average values for each analyte. An abbreviations file (Abbreviations.pdf) defines all the abbreviations in the data dictionary and CSV files.

Note that the USGS site ID includes a leading zero (011058798) and some of the parameter codes contain leading zeros, so care must be taken in opening and subsequently saving these files in other formats where leading zeros may be dropped.
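Because the site ID and some parameter codes carry leading zeros, a minimal R sketch of reading one of the files (Flood_Tide_Tidal_Day.csv) with all columns forced to character is shown below; numeric columns can be converted afterwards once the identifier columns are safe, and column names should be checked against Data_Dictionary.csv.

```r
# Read every column as character so leading zeros in the USGS site ID and
# parameter codes are preserved (e.g. "011058798").
flows <- read.csv("Flood_Tide_Tidal_Day.csv", colClasses = "character")
str(flows)

# Convert measured values back to numeric as needed, e.g.:
# flows$flow <- as.numeric(flows$flow)   # hypothetical column name
```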
Raw Data in .csv format for use with the R data wrangling scripts.
https://doi.org/10.23668/psycharchives.4988
The aim of this research was to examine the concurrent relationships between positive and negative experiences of German adults, as well as their changes over three weeks in 2020. Owing to German federalism, we expected these changes to differ between German states. Research data set: Sample 2.
Various population statistics, including structured demographics data.