54 datasets found

The MultiCaRe Dataset: A Multimodal Case Report Dataset with Clinical Cases,...
zenodo.org
bin, csv, zip
Updated Jan 5, 2024
Cite
Mauro Nievas Offidani; Claudio Delrieux (2024). The MultiCaRe Dataset: A Multimodal Case Report Dataset with Clinical Cases, Labeled Images and Captions from Open Access PMC Articles [Dataset]. http://doi.org/10.5281/zenodo.10079370
Available download formats: zip, bin, csv
Unique identifier
https://doi.org/10.5281/zenodo.10079370
Dataset updated
Jan 5, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Mauro Nievas Offidani; Claudio Delrieux
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The dataset contains multi-modal data from over 75,000 open-access, de-identified case reports, including metadata, clinical cases, image captions and more than 130,000 images. Images and clinical cases belong to different medical specialties, such as oncology, cardiology, surgery and pathology. The structure of the dataset makes it easy to map each image to its corresponding article metadata, clinical case, captions and image labels. Details of the data structure can be found in the file data_dictionary.csv.

Almost 100,000 patients and almost 400,000 medical doctors and researchers were involved in the creation of the articles included in this dataset. The citation data of each article can be found in the metadata.parquet file.

Refer to the examples showcased in this GitHub repository to understand how to optimize the use of this dataset.

For more detail on the contents of this dataset, please refer to the data article published in Data in Brief.
mimic-iii-clinical-database-demo-1.4
kaggle.com
Updated Apr 1, 2025
Cite
Montassar bellah (2025). mimic-iii-clinical-database-demo-1.4 [Dataset]. https://www.kaggle.com/datasets/montassarba/mimic-iii-clinical-database-demo-1-4
Available in Croissant, a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Apr 1, 2025
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Montassar bellah
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
Abstract
MIMIC-III is a large, freely-available database comprising deidentified health-related data associated with over 40,000 patients who stayed in critical care units of the Beth Israel Deaconess Medical Center between 2001 and 2012 [1]. The MIMIC-III Clinical Database is available on PhysioNet (doi: 10.13026/C2XW26). Though deidentified, MIMIC-III contains detailed information regarding the care of real patients, and as such requires credentialing before access. To allow researchers to ascertain whether the database is suitable for their work, we have manually curated a demo subset, which contains information for 100 patients also present in the MIMIC-III Clinical Database. Notably, the demo dataset does not include free-text notes.

Background
In recent years there has been a concerted move towards the adoption of digital health record systems in hospitals. Despite this advance, interoperability of digital systems remains an open issue, leading to challenges in data integration. As a result, the potential that hospital data offers in terms of understanding and improving care is yet to be fully realized.

MIMIC-III integrates deidentified, comprehensive clinical data of patients admitted to the Beth Israel Deaconess Medical Center in Boston, Massachusetts, and makes it widely accessible to researchers internationally under a data use agreement. The open nature of the data allows clinical studies to be reproduced and improved in ways that would not otherwise be possible.

The MIMIC-III database was populated with data that had been acquired during routine hospital care, so there was no associated burden on caregivers and no interference with their workflow. For more information on the collection of the data, see the MIMIC-III Clinical Database page.

Methods
The demo dataset contains all intensive care unit (ICU) stays for 100 patients. These patients were selected randomly from the subset of patients in the dataset who eventually die. Consequently, all patients will have a date of death (DOD). However, patients do not necessarily die during an individual hospital admission or ICU stay.

This project was approved by the Institutional Review Boards of Beth Israel Deaconess Medical Center (Boston, MA) and the Massachusetts Institute of Technology (Cambridge, MA). Requirement for individual patient consent was waived because the project did not impact clinical care and all protected health information was deidentified.

Data Description
MIMIC-III is a relational database consisting of 26 tables. For a detailed description of the database structure, see the MIMIC-III Clinical Database page. The demo shares an identical schema, except all rows in the NOTEEVENTS table have been removed.

The data files are distributed in comma separated value (CSV) format following the RFC 4180 standard. Notably, string fields which contain commas, newlines, and/or double quotes are encapsulated by double quotes ("). Actual double quotes in the data are escaped using an additional double quote. For example, the string she said "the patient was notified at 6pm" would be stored in the CSV as "she said ""the patient was notified at 6pm""". More detail is provided on the RFC 4180 description page: https://tools.ietf.org/html/rfc4180
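
The quoting rule above can be checked with a short sketch. This one uses Python's standard csv module with its default dialect, which doubles embedded double quotes in the way described; it is an illustration only, not part of the MIMIC-III tooling.

```python
import csv
import io

# The example string from the paragraph above.
text = 'she said "the patient was notified at 6pm"'

# Write a single-field row with the csv module's default dialect,
# which wraps the field in double quotes and doubles any embedded ones.
buffer = io.StringIO()
csv.writer(buffer).writerow([text])
encoded = buffer.getvalue().rstrip("\r\n")

# Read the encoded line back to confirm the round trip.
decoded = next(csv.reader(io.StringIO(encoded)))[0]

print(encoded)
print(decoded == text)
```

Reading the files with a CSV-aware parser, rather than splitting lines on commas, avoids breaking fields that contain commas or newlines.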

Usage Notes
The MIMIC-III demo provides researchers with an opportunity to review the structure and content of MIMIC-III before deciding whether or not to carry out an analysis on the full dataset.

CSV files can be opened natively using any text editor or spreadsheet program. However, some tables are large, and it may be preferable to navigate the data stored in a relational database. One alternative is to create an SQLite database using the CSV files. SQLite is a lightweight database format which stores all constituent tables in a single file, and SQLite databases interoperate well with a number of software tools.

DB Browser for SQLite is a high quality, visual, open source tool to create, design, and edit database files compatible with SQLite. We have found this tool to be useful for navigating SQLite files. Information regarding installation of the software and creation of the database can be found online: https://sqlitebrowser.org/
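
As a rough sketch of the CSV-to-SQLite step, the following uses only Python's standard library. The helper name load_csv, the table name, and the sample rows are made up for illustration and are not taken from the demo files; every column is stored as untyped text, which a real import would likely refine.

```python
import csv
import io
import sqlite3


def load_csv(conn, table, csv_file):
    """Create `table` from the CSV header row and insert every data row."""
    reader = csv.reader(csv_file)
    header = next(reader)
    columns = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', reader)
    conn.commit()


# Made-up stand-in for one CSV file; real files would be opened with
# open(path, newline="") instead.
sample = io.StringIO("row_id,subject_id,gender\n1,10006,F\n2,10011,F\n")

conn = sqlite3.connect(":memory:")  # pass a file path to keep the database
load_csv(conn, "patients", sample)
row_count = conn.execute('SELECT COUNT(*) FROM "patients"').fetchone()[0]
print(row_count)
```

Passing a file path instead of ":memory:" produces a single database file that a viewer such as DB Browser for SQLite can open.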

Release Notes
Release notes for the demo follow the release notes for the MIMIC-III database.

Acknowledgements
This research and development was supported by grants NIH-R01-EB017205, NIH-R01-EB001659, and NIH-R01-GM104987 from the National Institutes of Health. The authors would also like to thank Philips Healthcare and staff at the Beth Israel Deaconess Medical Center, Boston, for supporting database development, and Ken Pierce for providing ongoing support for the MIMIC research community.

Conflicts of Interest
The authors declare no competing financial interests.

References
Johnson, A. E. W., Pollard, T. J., Shen, L., Lehman, L. H., Feng, M., Ghassemi, M., Mo...
MIMIC-IV
physionet.org
Updated Oct 11, 2024
Cite
Alistair Johnson; Lucas Bulgarelli; Tom Pollard; Brian Gow; Benjamin Moody; Steven Horng; Leo Anthony Celi; Roger Mark (2024). MIMIC-IV [Dataset]. http://doi.org/10.13026/kpb9-mt58
Unique identifier
https://doi.org/10.13026/kpb9-mt58
Dataset updated
Oct 11, 2024
Authors
Alistair Johnson; Lucas Bulgarelli; Tom Pollard; Brian Gow; Benjamin Moody; Steven Horng; Leo Anthony Celi; Roger Mark
License
https://github.com/MIT-LCP/license-and-dua/tree/master/drafts
Description
Retrospectively collected medical data has the opportunity to improve patient care through knowledge discovery and algorithm development. Broad reuse of medical data is desirable for the greatest public good, but data sharing must be done in a manner which protects patient privacy. Here we present Medical Information Mart for Intensive Care (MIMIC)-IV, a large deidentified dataset of patients admitted to the emergency department or an intensive care unit at the Beth Israel Deaconess Medical Center in Boston, MA. MIMIC-IV contains data for over 65,000 patients admitted to an ICU and over 200,000 patients admitted to the emergency department. MIMIC-IV incorporates contemporary data and adopts a modular approach to data organization, highlighting data provenance and facilitating both individual and combined use of disparate data sources. MIMIC-IV is intended to carry on the success of MIMIC-III and support a broad set of applications within healthcare.
2017Female
health.data.ny.gov
application/rdfxml +5
Updated Nov 24, 2020
Cite
New York State Department of Health (2020). 2017Female [Dataset]. https://health.data.ny.gov/dataset/2017Female/4ij4-sjp2
Available download formats: csv, tsv, json, application/rdfxml, xml, application/rssxml
Dataset updated
Nov 24, 2020
Authors
New York State Department of Health
Description
The Statewide Planning and Research Cooperative System (SPARCS) Inpatient De-identified File contains discharge level detail on patient characteristics, diagnoses, treatments, services, and charges. This data file contains basic record level detail for the discharge. The de-identified data file does not contain data that is protected health information (PHI) under HIPAA. The health information is not individually identifiable; all data elements considered identifiable have been redacted. For example, the direct identifiers regarding a date have the day and month portion of the date removed.
Data from: Generalizable EHR-R-REDCap pipeline for a national...
data.niaid.nih.gov
explore.openaire.eu
+2 more
zip
Updated Jan 9, 2022
Cite
Sophia Shalhout; Farees Saqlain; Kayla Wright; Oladayo Akinyemi; David Miller (2022). Generalizable EHR-R-REDCap pipeline for a national multi-institutional rare tumor patient registry [Dataset]. http://doi.org/10.5061/dryad.rjdfn2zcm
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.rjdfn2zcm
Dataset updated
Jan 9, 2022
Dataset provided by
Harvard Medical School
Massachusetts General Hospital
Authors
Sophia Shalhout; Farees Saqlain; Kayla Wright; Oladayo Akinyemi; David Miller
License
https://spdx.org/licenses/CC0-1.0.html
Description
Objective: To develop a clinical informatics pipeline designed to capture large-scale structured EHR data for a national patient registry.

Materials and Methods: The EHR-R-REDCap pipeline is implemented using R-statistical software to remap and import structured EHR data into the REDCap-based multi-institutional Merkel Cell Carcinoma (MCC) Patient Registry using an adaptable data dictionary.

Results: Clinical laboratory data were extracted from EPIC Clarity across several participating institutions. Labs were transformed, remapped and imported into the MCC registry using the EHR labs abstraction (eLAB) pipeline. Forty-nine clinical tests encompassing 482,450 results were imported into the registry for 1,109 enrolled MCC patients. Data-quality assessment revealed highly accurate, valid labs. Univariate modeling was performed for labs at baseline on overall survival (N=176) using this clinical informatics pipeline.

Conclusion: We demonstrate feasibility of the facile eLAB workflow. EHR data is successfully transformed, and bulk-loaded/imported into a REDCap-based national registry to execute real-world data analysis and interoperability.

Methods
eLAB Development and Source Code (R statistical software):

eLAB is written in R (version 4.0.3), and utilizes the following packages for processing: DescTools, REDCapR, reshape2, splitstackshape, readxl, survival, survminer, and tidyverse. Source code for eLAB can be downloaded directly (https://github.com/TheMillerLab/eLAB).

eLAB reformats EHR data abstracted for an identified population of patients (e.g. medical record numbers (MRN)/name list) under an Institutional Review Board (IRB)-approved protocol. The MCCPR does not host MRNs/names and eLAB converts these to MCCPR assigned record identification numbers (record_id) before import for de-identification.

Functions were written to remap EHR bulk lab data pulls/queries from several sources including Clarity/Crystal reports or institutional EDW including Research Patient Data Registry (RPDR) at MGB. The input, a csv/delimited file of labs for user-defined patients, may vary. Thus, users may need to adapt the initial data wrangling script based on the data input format. However, the downstream transformation, code-lab lookup tables, outcomes analysis, and LOINC remapping are standard for use with the provided REDCap Data Dictionary, DataDictionary_eLAB.csv. The available R-markdown (https://github.com/TheMillerLab/eLAB) provides suggestions and instructions on where or when upfront script modifications may be necessary to accommodate input variability.

The eLAB pipeline takes several inputs. For example, the input for use with the ‘ehr_format(dt)’ single-line command is non-tabular data assigned as R object ‘dt’ with 4 columns: 1) Patient Name (MRN), 2) Collection Date, 3) Collection Time, and 4) Lab Results wherein several lab panels are in one data frame cell. A mock dataset in this ‘untidy-format’ is provided for demonstration purposes (https://github.com/TheMillerLab/eLAB).

Bulk lab data pulls often result in subtypes of the same lab. For example, potassium labs are reported as “Potassium,” “Potassium-External,” “Potassium(POC),” “Potassium,whole-bld,” “Potassium-Level-External,” “Potassium,venous,” and “Potassium-whole-bld/plasma.” eLAB utilizes a key-value lookup table with ~300 lab subtypes for remapping labs to the Data Dictionary (DD) code. eLAB reformats/accepts only those lab units pre-defined by the registry DD. The lab lookup table is provided for direct use or may be re-configured/updated to meet end-user specifications. eLAB is designed to remap, transform, and filter/adjust value units of semi-structured/structured bulk laboratory values data pulls from the EHR to align with the pre-defined code of the DD.
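
The key-value remapping idea can be illustrated with a minimal sketch. eLAB itself is written in R, so this Python version is only an analogy: the subtype names are the ones listed above, while the target code "potassium", the function name remap_labs, and the sample values are invented placeholders rather than entries from the actual lookup table or Data Dictionary.

```python
# Hypothetical lookup table: keys are the lab subtypes named in the text,
# values are an invented placeholder code (not a real DD field name).
LAB_LOOKUP = {
    "Potassium": "potassium",
    "Potassium-External": "potassium",
    "Potassium(POC)": "potassium",
    "Potassium,whole-bld": "potassium",
    "Potassium-Level-External": "potassium",
    "Potassium,venous": "potassium",
    "Potassium-whole-bld/plasma": "potassium",
}


def remap_labs(records, lookup=LAB_LOOKUP):
    """Rename known lab subtypes to their code and drop unlisted labs."""
    remapped = []
    for name, value in records:
        code = lookup.get(name)
        if code is not None:
            remapped.append((code, value))
    return remapped


# Made-up input rows: (lab name as reported, numeric result).
raw = [("Potassium(POC)", 4.1), ("Potassium,venous", 3.9), ("Unlisted lab", 1.0)]
clean = remap_labs(raw)
print(clean)
```

Keeping the mapping in a plain table rather than in code is what lets end users reconfigure it without touching the transformation logic.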

Data Dictionary (DD)

EHR clinical laboratory data is captured in REDCap using the ‘Labs’ repeating instrument (Supplemental Figures 1-2). The DD is provided for use by researchers at REDCap-participating institutions and is optimized to accommodate the same lab-type captured more than once on the same day for the same patient. The instrument captures 35 clinical lab types. The DD serves several major purposes in the eLAB pipeline. First, it defines every lab type of interest and associated lab unit of interest with a set field/variable name. It also restricts/defines the type of data allowed for entry for each data field, such as a string or numerics. The DD is uploaded into REDCap by every participating site/collaborator and ensures each site collects and codes the data the same way. Automation pipelines, such as eLAB, are designed to remap/clean and reformat data/units utilizing key-value look-up tables that filter and select only the labs/units of interest. eLAB ensures the data pulled from the EHR contains the correct unit and format pre-configured by the DD. The use of the same DD at every participating site ensures that the data field code, format, and relationships in the database are uniform across each site to allow for the simple aggregation of the multi-site data. For example, since every site in the MCCPR uses the same DD, aggregation is efficient and different site csv files are simply combined.
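
Because every site shares one DD, combining site exports reduces to concatenating rows once the headers are confirmed to match. The sketch below shows that idea in Python under stated assumptions: the function name combine_sites and the column names record_id and potassium are hypothetical, and the registry's own aggregation code may work differently.

```python
import csv
import io


def combine_sites(csv_files):
    """Concatenate rows from CSV files that all share the same header."""
    combined, header = [], None
    for csv_file in csv_files:
        reader = csv.DictReader(csv_file)
        if header is None:
            header = reader.fieldnames
        elif reader.fieldnames != header:
            raise ValueError("site files do not share the same columns")
        combined.extend(reader)
    return header, combined


# Made-up exports from two sites using the same (hypothetical) columns.
site_a = io.StringIO("record_id,potassium\n1,4.1\n")
site_b = io.StringIO("record_id,potassium\n2,3.9\n")

header, rows = combine_sites([site_a, site_b])
print(header, len(rows))
```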

Study Cohort

This study was approved by the MGB IRB. Search of the EHR was performed to identify patients diagnosed with MCC between 1975 and 2021 (N=1,109) for inclusion in the MCCPR. Subjects diagnosed with primary cutaneous MCC between 2016 and 2019 (N=176) were included in the test cohort for exploratory studies of lab result associations with overall survival (OS) using eLAB.

Statistical Analysis

OS is defined as the time from date of MCC diagnosis to date of death. Data was censored at the date of the last follow-up visit if no death event occurred. Univariable Cox proportional hazard modeling was performed among all lab predictors. Due to the hypothesis-generating nature of the work, p-values were exploratory and Bonferroni corrections were not applied.
Hospital Inpatient Discharges (SPARCS De-Identified): 2010
healthdata.gov
health.data.ny.gov
+1 more
application/rdfxml +5
Updated Apr 8, 2025
Cite
health.data.ny.gov (2025). Hospital Inpatient Discharges (SPARCS De-Identified): 2010 [Dataset]. https://healthdata.gov/State/Hospital-Inpatient-Discharges-SPARCS-De-Identified/2adj-zbc9/data
Available download formats: csv, tsv, application/rssxml, application/rdfxml, json, xml
Dataset updated
Apr 8, 2025
Dataset provided by
health.data.ny.gov
Description
The Statewide Planning and Research Cooperative System (SPARCS) Inpatient De-identified dataset contains discharge level detail on patient characteristics, diagnoses, treatments, services, charges, and costs. This data contains basic record level detail regarding the discharge; however the data does not contain protected health information (PHI) under Health Insurance Portability and Accountability Act (HIPAA). The health information is not individually identifiable; all data elements considered identifiable have been redacted. For example, the direct identifiers regarding a date have the day and month portion of the date removed.
Hospital Inpatient Discharges (SPARCS De-Identified): 2022
health.data.ny.gov
healthdata.gov
application/rdfxml +5
Updated Feb 13, 2024
Cite
New York State Department of Health (2024). Hospital Inpatient Discharges (SPARCS De-Identified): 2022 [Dataset]. https://health.data.ny.gov/Health/Hospital-Inpatient-Discharges-SPARCS-De-Identified/5dtw-tffi
Available download formats: csv, tsv, xml, application/rssxml, application/rdfxml, json
Dataset updated
Feb 13, 2024
Dataset authored and provided by
New York State Department of Health
Description
The Statewide Planning and Research Cooperative System (SPARCS) Inpatient De-identified File contains discharge level detail on patient characteristics, diagnoses, treatments, services, and charges.

This data file contains basic record level detail for the discharge. The de-identified data file does not contain data that is protected health information (PHI) under HIPAA. The health information is not individually identifiable; all data elements considered identifiable have been redacted. For example, the direct identifiers regarding a date have the day and month portion of the date removed.

For more information visit: https://www.health.ny.gov/statistics/sparcs/
mimic-iv-clinical-database-demo-2.2
kaggle.com
Updated Apr 1, 2025
Cite
Montassar bellah (2025). mimic-iv-clinical-database-demo-2.2 [Dataset]. https://www.kaggle.com/datasets/montassarba/mimic-iv-clinical-database-demo-2-2/data
Available in Croissant, a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Apr 1, 2025
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Montassar bellah
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
Abstract
The Medical Information Mart for Intensive Care (MIMIC)-IV database is comprised of deidentified electronic health records for patients admitted to the Beth Israel Deaconess Medical Center. Access to MIMIC-IV is limited to credentialed users. Here, we have provided an openly-available demo of MIMIC-IV containing a subset of 100 patients. The dataset includes similar content to MIMIC-IV, but excludes free-text clinical notes. The demo may be useful for running workshops and for assessing whether the MIMIC-IV is appropriate for a study before making an access request.

Background
The increasing adoption of digital electronic health records has led to the existence of large datasets that could be used to carry out important research across many areas of medicine. Research progress has been limited, however, due to limitations in the way that the datasets are curated and made available for research. The MIMIC datasets allow credentialed researchers around the world unprecedented access to real world clinical data, helping to reduce the barriers to conducting important medical research. The public availability of the data allows studies to be reproduced and collaboratively improved in ways that would not otherwise be possible.

Methods
First, the set of individuals to include in the demo was chosen. Each person in MIMIC-IV is assigned a unique subject_id. As the subject_id is randomly generated, ordering by subject_id results in a random subset of individuals. We only considered individuals with an anchor_year_group value of 2011 - 2013 or 2014 - 2016 to ensure overlap with MIMIC-CXR v2.0.0. The first 100 subject_id who satisfied the anchor_year_group criteria were selected for the demo dataset.
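
The selection rule just described (filter on anchor_year_group, order by subject_id, take the first 100) can be sketched as follows. The function name select_demo_subjects and the sample records are made up for illustration; only the column names and the two year groups come from the text, and this is not the authors' code.

```python
# The two anchor_year_group values named in the text.
ALLOWED_GROUPS = {"2011 - 2013", "2014 - 2016"}


def select_demo_subjects(patients, allowed_groups=ALLOWED_GROUPS, limit=100):
    """Return the first `limit` subject_id values, ordered by subject_id,
    among patients whose anchor_year_group is allowed."""
    eligible = [p for p in patients if p["anchor_year_group"] in allowed_groups]
    eligible.sort(key=lambda p: p["subject_id"])
    return [p["subject_id"] for p in eligible[:limit]]


# Made-up patient records; real identifiers and groups will differ.
patients = [
    {"subject_id": 30, "anchor_year_group": "2014 - 2016"},
    {"subject_id": 10, "anchor_year_group": "2008 - 2010"},
    {"subject_id": 20, "anchor_year_group": "2011 - 2013"},
    {"subject_id": 40, "anchor_year_group": "2011 - 2013"},
]

selected = select_demo_subjects(patients, limit=2)
print(selected)
```

The small limit is used here only so the toy data shows the cutoff; the demo itself used the first 100 identifiers.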

All tables from MIMIC-IV were included in the demo dataset. Tables containing patient information, such as emar or labevents, were filtered using the list of selected subject_id. Tables which do not contain patient level information were included in their entirety (e.g. d_items or d_labitems). Note that all tables which do not contain patient level information are prefixed with the characters 'd_'.
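
A minimal sketch of this filtering step, assuming each table is held in memory as a list of row dictionaries: tables whose name starts with 'd_' are copied whole, and all others are restricted to the selected subject_id values. The function name build_demo and the sample rows are hypothetical.

```python
def build_demo(tables, demo_ids):
    """Filter patient-level tables by subject_id; keep 'd_' tables whole."""
    demo = {}
    for name, rows in tables.items():
        if name.startswith("d_"):
            # Dictionary tables carry no patient-level information.
            demo[name] = list(rows)
        else:
            demo[name] = [row for row in rows if row["subject_id"] in demo_ids]
    return demo


# Made-up rows; the table names labevents and d_labitems come from the text.
tables = {
    "labevents": [
        {"subject_id": 20, "itemid": 1},
        {"subject_id": 99, "itemid": 1},
    ],
    "d_labitems": [{"itemid": 1, "label": "example"}],
}

demo = build_demo(tables, demo_ids={20, 30})
print(len(demo["labevents"]), len(demo["d_labitems"]))
```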

Deidentification was performed following the same approach as the MIMIC-IV database. Protected health information (PHI) as listed in the HIPAA Safe Harbor provision was removed. Patient identifiers were replaced using a random cipher, resulting in deidentified integer identifiers for patients, hospitalizations, and ICU stays. Stringent rules were applied to structured columns based on the data type. Dates were shifted consistently using a random integer removing seasonality, day of the week, and year information. Text fields were filtered by manually curated allow and block lists, as well as context-specific regular expressions. For example, columns containing dose values were filtered to only contain numeric values. If necessary, a free-text deidentification algorithm was applied to remove PHI from free-text. Results of this algorithm were manually reviewed and verified to remove identified PHI.
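
To illustrate what shifting dates "consistently" can mean, the sketch below draws one random day offset per patient and applies it to all of that patient's dates, so intervals between events are preserved while the calendar dates change. The offset range, the per-patient keying, and the names make_shifter and shift are assumptions made for this example; the text does not specify how MIMIC-IV implements the shift.

```python
import random
from datetime import date, timedelta


def make_shifter(subject_ids, rng, max_days=36500):
    """Draw one random day offset per subject and return a shift function.
    The offset range is an assumption made for this sketch."""
    offsets = {sid: rng.randint(1, max_days) for sid in subject_ids}

    def shift(subject_id, original):
        # The same offset is reused for every date of a given subject.
        return original + timedelta(days=offsets[subject_id])

    return shift


rng = random.Random(0)  # seeded only to make the example repeatable
shift = make_shifter([20], rng)

# Made-up admission and discharge dates for one hypothetical subject.
admit, discharge = date(2012, 3, 1), date(2012, 3, 8)
shifted_admit = shift(20, admit)
shifted_discharge = shift(20, discharge)
print(shifted_discharge - shifted_admit)
```

Because the offset is constant within a patient, the length of stay computed from the shifted dates equals the original one.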

Data Description
MIMIC-IV is a relational database consisting of 26 tables. For a detailed description of the database structure, see the MIMIC-IV Clinical Database page [1] or the MIMIC-IV online documentation [2]. The demo shares an identical schema and structure to the equivalent version of MIMIC-IV.

Data files are distributed in comma separated value (CSV) format following the RFC 4180 standard [3]. The dataset is also made available on Google BigQuery. Instructions to accessing the dataset on BigQuery are provided on the online MIMIC-IV documentation, under the cloud page [2].

An additional file is included: demo_subject_id.csv. This is a list of the subject_id used to filter MIMIC-IV to the demo subset.

Usage Notes
The MIMIC-IV demo provides researchers with the opportunity to better understand MIMIC-IV data.

CSV files can be opened natively using any text editor or spreadsheet program. However, as some tables are large, it may be preferable to navigate the data via a relational database. We suggest either working with the data in Google BigQuery (see the "Files" section for access details) or creating an SQLite database using the CSV files. SQLite is a lightweight database format which stores all constituent tables in a single file, and SQLite databases interoperate well with a number of software tools.

Code is made available for use with MIMIC-IV on the MIMIC-IV code repository [4]. Code provided includes derivation of clinical concepts, tutorials, and reproducible analyses.

Release Notes
Release notes for the demo follow the release notes for the MIMIC-IV database.

Ethics
This project was approved by the Institutional Review Boards of Beth Israel Deaconess Medical Center (Boston, MA) and the Massachusetts Institute of Technology (Cambridge, MA). Requirement for individual patient consent was waived because the pr...
Data from: Alcohol-related
health.data.ny.gov
application/rdfxml +5
Updated Nov 14, 2016
Cite
New York State Department of Health (2016). Alcohol-related [Dataset]. https://health.data.ny.gov/Health/Alcohol-related/iw8p-6qdm
Available download formats: json, application/rdfxml, csv, application/rssxml, xml, tsv
Dataset updated
Nov 14, 2016
Authors
New York State Department of Health
Description
The Statewide Planning and Research Cooperative System (SPARCS) Inpatient De-identified File contains discharge level detail on patient characteristics, diagnoses, treatments, services and charges. This data file contains basic record level detail for the discharge. The de-identified data file does not contain data that is protected health information (PHI) under HIPAA. The health information is not individually identifiable; all data elements considered identifiable have been redacted. For example, the direct identifiers regarding a date have the day and month portion of the date removed. A downloadable file of this dataset is available at: https://health.data.ny.gov/Health/Hospital-Inpatient-Discharges-SPARCS-De-Identified/mpue-vn67. For more information, including changes to the data from previous years, please visit http://www.health.ny.gov/statistics/sparcs/access/. The "About" tab contains additional details concerning this dataset.
Data in Support of the MIDI-B Challenge (MIDI-B-Synthetic-Validation,...
cancerimagingarchive.net
csv, dicom, n/a +1
Updated May 2, 2025
Cite
The Cancer Imaging Archive (2025). Data in Support of the MIDI-B Challenge (MIDI-B-Synthetic-Validation, MIDI-B-Curated-Validation, MIDI-B-Synthetic-Test, MIDI-B-Curated-Test) [Dataset]. http://doi.org/10.7937/cf2p-aw56
Available download formats: sqlite and zip, dicom, csv, n/a
Unique identifier
https://doi.org/10.7937/cf2p-aw56
Dataset updated
May 2, 2025
Dataset authored and provided by
The Cancer Imaging Archive
License
https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/
Time period covered
May 2, 2025
Dataset funded by
National Cancer Institute (http://www.cancer.gov/)
Description
Abstract
These resources comprise a large and diverse collection of multi-site, multi-modality, and multi-cancer clinical DICOM images from 538 subjects infused with synthetic PHI/PII in areas encountered by TCIA curation teams. Also provided is a TCIA-curated version of the synthetic dataset, along with mapping files for mapping identifiers between the two.
This new MIDI data resource includes DICOM datasets used in the Medical Image De-Identification Benchmark (MIDI-B) challenge at MICCAI 2024. They are accompanied by ground truth answer keys and a validation script for evaluating the effectiveness of medical image de-identification workflows. The validation script systematically assesses de-identified data against an answer key outlining appropriate actions and values for proper de-identification of medical images, promoting safer and more consistent medical image sharing.
Introduction
Medical imaging research increasingly relies on large-scale data sharing. However, reliable de-identification of DICOM images still presents significant challenges due to the wide variety of DICOM header elements and pixel data where identifiable information may be embedded. To address this, we have developed an openly accessible synthetic dataset containing artificially generated protected health information (PHI) and personally identifiable information (PII).
These resources complement our earlier work (Pseudo-PHI-DICOM-data) hosted on The Cancer Imaging Archive. As an example of its use, we also provide a version curated by The Cancer Imaging Archive (TCIA) curation team. This resource builds upon best practices emphasized by the MIDI Task Group, who underscore the importance of transparency, documentation, and reproducibility in de-identification workflows, themes also featured at recent conferences (Synapse:syn53065760) and workshops (2024 MIDI-B Challenge Workshop).
This framework enables objective benchmarking of de-identification performance, promotes transparency in compliance with regulatory standards, and supports the establishment of consistent best practices for sharing clinical imaging data. We encourage the research community to use these resources to enhance and standardize their medical image de-identification workflows.
Methods
Subject Inclusion and Exclusion Criteria
The source data were selected from imaging already hosted in de-identified form on TCIA. Imaging containing faces was excluded, and no new human studies were performed for this project.
Data Acquisition
To build the synthetic dataset, image series were selected from TCIA’s curated datasets to represent a broad range of imaging modalities (CR, CT, DX, MG, MR, PT, SR, US), manufacturers (including GE, Siemens, Varian, Confirma, Agfa, Eigen, Elekta, Hologic, KONICA MINOLTA, and others), scan parameters, and regions of the body. These were processed to inject the synthetic PHI/PII as described.
Data Analysis
Synthetic pools of PHI, like subject and scanning institution information, were generated using the Python package Faker (https://pypi.org/project/Faker/8.10.3/). These were inserted into DICOM metadata of selected imaging files using a system of inheritable rule-based templates outlining re-identification functions for data insertion and logging for answer key creation. Text was also burned into the pixel data of a number of images. By systematically embedding realistic synthetic PHI into image headers and pixel data, accompanied by a detailed ground-truth answer key, our framework gives users transparency, documentation, and reproducibility in de-identification practices, aligned with the HIPAA Safe Harbor method, DICOM PS3.15 Confidentiality Profiles, and TCIA best practices.
Usage Notes
This DICOM collection is split into two datasets, synthetic and curated. The synthetic dataset is the PHI/PII infused DICOM collection accompanied by a validation script and answer keys for testing, refining and benchmarking medical image de-identification pipelines. The curated dataset is a version of the synthetic dataset curated and de-identified by members of The Cancer Imaging Archive curation team. It can be used as a guide, an example of medical image curation best practices. For the purposes of the De-Identification challenge at MICCAI 2024, the synthetic and curated datasets each contain two subsets, a portion for Validation and the other for Testing.
To link a curated dataset to the original synthetic dataset and answer keys, a mapping between the unique identifiers (UIDs) and patient IDs must be provided in CSV format to the evaluation software. We include the mapping files associated with the TCIA-curated set as an example. Lastly, for both the Validation and Testing datasets, an answer key in sqlite.db format is provided. These components are for use with the Python validation script linked below (4). Combining these components, a user developing or evaluating de-identification methods can ensure they meet a specification for successfully de-identifying medical image data.
Hospital Inpatient Discharges (SPARCS De-Identified): 2021
health.data.ny.gov
healthdata.gov
application/rdfxml +5
Updated May 8, 2024
Cite
New York State Department of Health (2024). Hospital Inpatient Discharges (SPARCS De-Identified): 2021 [Dataset]. https://health.data.ny.gov/Health/Hospital-Inpatient-Discharges-SPARCS-De-Identified/tg3i-cinn
Explore at:
application/rdfxml, csv, json, application/rssxml, tsv, xml (available download formats)
Dataset updated
May 8, 2024
Dataset authored and provided by
New York State Department of Health
Description
The Statewide Planning and Research Cooperative System (SPARCS) Inpatient De-identified File contains discharge level detail on patient characteristics, diagnoses, treatments, services, and charges. This data file contains basic record level detail for the discharge. The de-identified data file does not contain data that is protected health information (PHI) under HIPAA. The health information is not individually identifiable; all data elements considered identifiable have been redacted. For example, the direct identifiers regarding a date have the day and month portion of the date removed.

Note: The full dataset may be downloaded in a smaller, compressed file format from the attachments section.
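The date redaction described above (removing the day and month portion of a date) can be pictured with a short sketch. The field names `discharge_date`, `discharge_year` and `length_of_stay` are hypothetical and are not taken from the SPARCS file layout; this only illustrates the general idea, not the Department of Health's actual procedure.

```python
from datetime import date


def redact_to_year(d):
    """Keep only the year of a date, dropping the day and month."""
    return d.year


# Hypothetical identifiable record.
record = {"discharge_date": date(2021, 3, 17), "length_of_stay": 4}

# De-identified view: the full date is replaced by its year.
deidentified = {
    "discharge_year": redact_to_year(record["discharge_date"]),
    "length_of_stay": record["length_of_stay"],
}
print(deidentified)
```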
2012 NY Discharges
health.data.ny.gov
application/rdfxml +5
Updated Sep 9, 2019
+ more versions
Cite
New York State Department of Health (2019). 2012 NY Discharges [Dataset]. https://health.data.ny.gov/Health/2012-NY-Discharges/u4wj-heze
Explore at:
application/rdfxml, csv, xml, json, application/rssxml, tsv (available download formats)
Dataset updated
Sep 9, 2019
Authors
New York State Department of Health
Area covered
New York
Description
The Statewide Planning and Research Cooperative System (SPARCS) Inpatient De-identified dataset contains discharge level detail on patient characteristics, diagnoses, treatments, services, and charges. This data contains basic record level detail regarding the discharge; however, the data does not contain protected health information (PHI) under the Health Insurance Portability and Accountability Act (HIPAA). The health information is not individually identifiable; all data elements considered identifiable have been redacted. For example, the direct identifiers regarding a date have the day and month portion of the date removed. A downloadable file with this data is available at: https://health.data.ny.gov/Health/Hospital-Inpatient-Discharges-SPARCS-De-Identified/3m9u-ws8e. For more information check out: http://www.health.ny.gov/statistics/sparcs/ or go to the “About” tab.
2015 de-identified NY inpatient discharge (SPARCS)
kaggle.com
Updated Jan 24, 2018
Cite
Jonas Almeida (2018). 2015 de-identified NY inpatient discharge (SPARCS) [Dataset]. https://www.kaggle.com/datasets/jonasalmeida/2015-deidentified-ny-inpatient-discharge-sparcs/data
Explore at:
Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jan 24, 2018
Dataset provided by
Kaggle
Authors
Jonas Almeida
License
https://creativecommons.org/publicdomain/zero/1.0/
Area covered
New York
Description
Public Health Data

This is the public dataset made available at https://health.data.ny.gov/Health/Hospital-Inpatient-Discharges-SPARCS-De-Identified/82xm-y6g8 by the Dept of Health of New York state. The following description can be found at that page:

The Statewide Planning and Research Cooperative System (SPARCS) Inpatient De-identified File contains discharge level detail on patient characteristics, diagnoses, treatments, services, and charges. This data file contains basic record level detail for the discharge. The de-identified data file does not contain data that is protected health information (PHI) under HIPAA. The health information is not individually identifiable; all data elements considered identifiable have been redacted. For example, the direct identifiers regarding a date have the day and month portion of the date removed.

It would be nice to ...

... for example, be able to predict length of stay in the hospital using the parameters likely to be available when the patient is admitted.
Our Future Health Linked Health Records Data
healthdatagateway.org
unknown
Updated Jun 19, 2024
+ more versions
Cite
Our Future Health (2024). Our Future Health Linked Health Records Data [Dataset]. https://healthdatagateway.org/dataset/889
Explore at:
unknown (available download formats)
Dataset updated
Jun 19, 2024
Dataset authored and provided by
Our Future Health
License
https://research.ourfuturehealth.org.uk/apply-to-access-the-data/
Description
Our Future Health is a prospective, observational cohort study of the general adult population of the United Kingdom (UK). The programme aims to support a wide range of observational health research. We gather personal, health and lifestyle information from each participant through a self-completed baseline health questionnaire and at an in-person clinic visit. We will further link this data to other health-related data sets. Participants have also given consent for us to recontact them, for example to invite them to take part in further or repeat data collections, or other embedded studies such as clinical trials.

The Our Future Health programme is currently open to all adults (18 years and older) living in the UK. In July 2022, we started recruiting participants in England and will continue to expand across the rest of the UK. The data we’ve gathered so far (June 2025) includes linked NHS England clinical data on 1,527,723 participants.

Additional linked datasets are available:
- ‘Baseline Health Questionnaire Data’, which contains baseline demographic information and responses to our health questionnaire from 1,781,891 participants.
- ‘Genotype Array Data’, which includes genotype array data on 707,522 variants from a subset of 650,979 participants.
- ‘Clinical Measurements Data’, which contains clinical data from 1,324,884 participants.

The data is stored in the Our Future Health Trusted Research Environment. We de-identify all participant data we gather before it’s available for use. All researchers will need to become registered researchers at Our Future Health and have an approved research study before they're given access to the data.

We aim to collect a variety of data types from up to 5 million adult participants from across the UK. We hope to make more data types available on a quarterly basis.
Extended Lombardy's Neonatal Screening Dataset
zenodo.org
Updated Apr 8, 2025
Cite
Zenodo (2025). Extended Lombardy's Neonatal Screening Dataset [Dataset]. http://doi.org/10.5281/zenodo.15118680
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.15118680
Dataset updated
Apr 8, 2025
Dataset provided by
Zenodo (http://zenodo.org/)
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
The Neonatal Screening dataset comprises de-identified medical records collected from routine newborn screening programs. It includes data from [insert number] neonates screened within 2012-2021, across more than 100 hospitals. The dataset captures early-life biological and demographic information, including gestational age, birth weight, sex, and time of sample collection, as well as biochemical markers obtained from dried blood spot samples typically collected within 24–72 hours after birth.

The dataset also contains confirmatory diagnostic information for screen-positive cases, allowing for the evaluation of screening accuracy and follow-up outcomes.

This dataset supports epidemiological analysis, algorithm development for early disease detection, and assessments of screening program performance.
Data from: Medical data formatting to improve physician interpretation speed...
explore.openaire.eu
data.niaid.nih.gov
+1more
Updated Jun 13, 2022
Cite
Jacob Peterson (2022). Medical data formatting to improve physician interpretation speed in the military healthcare system [Dataset]. http://doi.org/10.5061/dryad.mkkwh712w
Explore at:
Unique identifier
https://doi.org/10.5061/dryad.mkkwh712w
Dataset updated
Jun 13, 2022
Authors
Jacob Peterson
Description
Study Design: De-identified chemistry and hematology results were presented to participants using the two data formats (tabular and fishbone diagram) along with questionnaires requesting the identification of individual values and trends. Participants completed the two questionnaires in a balanced crossover experiment. After completing both questionnaires, participants were asked to complete a 3-question survey rating perceived ease of use and indicating an overall preference for one of the data formats.

Participants: A total of 35 participants were recruited at a daily internal medicine residency didactic session. Participants were asked to abstain if they were unfamiliar with either data format.

Patient Cases: Each laboratory data format was applied to a pair of basic metabolic panels (BMP) and a pair of complete blood counts (CBC) labeled as being from sequential days (one CBC and BMP for each day). The laboratory data were identical in quantity and type of information, but the individual result values used for each data format differed.

Procedure: Before the study, every participant was informed about the project and confirmed familiarity with both data formats. Participants were each given both questionnaires (one for each data format) and a survey with the lab data hidden by a cover sheet. Participants were informed they would have 60 seconds to answer as many questions as possible about the data set provided and then would answer a set of questions about a set of data. The questions were designed so that each questionnaire requested identical cognitive tasks in the same order. For example, question three asked to identify a trend on both questionnaires, but one questionnaire asked about anemia, the other about renal dysfunction. The study materials were distributed randomly but were prepared such that 50% of participants had the questionnaire with data formatted using a table as the first questionnaire. The remaining 50% started with the questionnaire with data formatted using fishbone diagrams. Participants completed the two questionnaires in the assigned order and then completed a three-question survey.

Outcome Measures: Responses were graded manually, with incorrect or partially correct answers both counted as erroneous interpretations. Omitted questions, which were rare, were not considered to have undergone interpretation and were counted neither towards total interpretations nor as erroneous. For each questionnaire, the number of questions answered and the number of errors committed were recorded. For the survey results, the ratings for ease of use (1-5 on a Likert scale with 5 being easy) were recorded for each data format. The data format preference of each participant was also recorded.

Objective: The purpose of this project was to improve the ease and speed of physician comprehension when interpreting daily laboratory data for patients admitted within the Military Healthcare System (MHS).

Materials and Methods: A JavaScript program was created to convert the laboratory data obtained via the outpatient electronic medical record (EMR) into a “fishbone diagram” format that is familiar to most physicians. Using a balanced crossover design, 35 internal medicine trainees and staff were asked to complete timed comprehension tests for laboratory data sets formatted in the outpatient EMR’s format and in fishbone diagram format. The number of responses per second and error rate per response were measured for each format. Participants were asked to rate relative ease of use for each format and indicate which format they preferred.

Results: Comprehension speed increased 37% (6.28 seconds per interpretation) with the fishbone diagram format with no observed increase in errors. Using a Likert scale of 1 to 5 (1 being hard, 5 easy), participants indicated the new format was easier to use (4.14 for fishbone vs 2.14 for table) with 89% expressing a preference for the new format.

Discussion: The publicly available web application that converts tabular lab data to fishbone diagram format is currently used 10,000-12,000 times per month across the MHS, delivering significant benefit to the enterprise in terms of time saved and improved physician experience.

Conclusions: This study supports the use of fishbone diagram formatting for laboratory data for inpatients within the MHS.

Microsoft Excel or similar spreadsheet software.
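The outcome measures described above can be sketched in a few lines. This is only an illustration of the stated grading rules (partially correct answers count as errors, omitted questions are excluded, 60-second limit); the function name, the grade labels and the sample data are hypothetical and do not come from the study materials.

```python
def score_questionnaire(grades, time_limit_s=60):
    """Summarize one questionnaire.

    grades: list of 'correct', 'partial', 'incorrect' or 'omitted'.
    Omitted questions count neither as interpretations nor as errors;
    partial and incorrect answers both count as errors.
    """
    interpreted = [g for g in grades if g != "omitted"]
    errors = sum(1 for g in interpreted if g != "correct")
    n = len(interpreted)
    return {
        "interpretations": n,
        "errors": errors,
        "responses_per_second": n / time_limit_s,
        "error_rate": errors / n if n else 0.0,
    }


# Hypothetical graded responses for a single participant and format.
result = score_questionnaire(
    ["correct", "partial", "omitted", "correct", "incorrect"]
)
print(result)
```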
MIMIC-III Dataset
paperswithcode.com
opendatalab.com
Updated Apr 20, 2022
+ more versions
Cite
Alistair E.W. Johnson; Tom J. Pollard; Lu Shen; Li-wei H. Lehman; Mengling Feng; Mohammad Ghassemi; Benjamin Moody; Peter Szolovits; Leo Anthony Celi; Roger G. Mark (2023). MIMIC-III Dataset [Dataset]. https://paperswithcode.com/dataset/mimic-iii
Explore at:
Dataset updated
Apr 20, 2022
Authors
Alistair E.W. Johnson; Tom J. Pollard; Lu Shen; Li-wei H. Lehman; Mengling Feng; Mohammad Ghassemi; Benjamin Moody; Peter Szolovits; Leo Anthony Celi; Roger G. Mark
Description
The Medical Information Mart for Intensive Care III (MIMIC-III) dataset is a large, de-identified and publicly available collection of medical records. Each record in the dataset includes ICD-9 codes, which identify diagnoses and procedures performed. Each code is partitioned into sub-codes, which often include specific circumstantial details. The dataset consists of 112,000 clinical report records (average length 709.3 tokens) and 1,159 top-level ICD-9 codes. Each report is assigned 7.6 codes on average. Data includes vital signs, medications, laboratory measurements, observations and notes charted by care providers, fluid balance, procedure codes, diagnostic codes, imaging reports, hospital length of stay, survival data, and more.

The database supports applications including academic and industrial research, quality improvement initiatives, and higher education coursework.
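To make the relationship between top-level codes and sub-codes concrete, the sketch below reduces an ICD-9-CM diagnosis code to its category. It assumes diagnosis codes only (procedure codes use a two-digit category and are not handled), and the helper name `top_level_icd9` is made up for this example rather than taken from any MIMIC-III tooling.

```python
def top_level_icd9(code):
    """Return the category (top-level) part of an ICD-9-CM diagnosis code.

    Accepts codes written with or without the decimal point.
    """
    code = code.strip().upper().replace(".", "")
    # E codes have a four-character category (E plus three digits);
    # numeric and V codes have a three-character category.
    width = 4 if code.startswith("E") else 3
    return code[:width]


codes = ["428.0", "4280", "V45.81", "E885.9"]
categories = [top_level_icd9(c) for c in codes]
print(categories)
```

Grouping records by the returned category is one simple way to work at the level of top-level codes instead of full sub-codes.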
Antibiotic Resistance Microbiology Dataset (ARMD): A de-identified resource...
datadryad.org
search.dataone.org
+1more
zip
Updated Jan 23, 2025
Cite
Fateme Nateghi Haredasht; Fatemeh Amrollahi; Manoj Maddali; Nicholas Marshall; Stephen Ma; Amy Chang; Niaz Banaei; Stanley Deresinski; Steven Asch; Mary Goldstein; Jonathan Chen (2025). Antibiotic Resistance Microbiology Dataset (ARMD): A de-identified resource for studying antimicrobial resistance using electronic health records [Dataset]. http://doi.org/10.5061/dryad.jq2bvq8kp
Explore at:
zip (available download formats)
Unique identifier
https://doi.org/10.5061/dryad.jq2bvq8kp
Dataset updated
Jan 23, 2025
Dataset provided by
Dryad
Authors
Fateme Nateghi Haredasht; Fatemeh Amrollahi; Manoj Maddali; Nicholas Marshall; Stephen Ma; Amy Chang; Niaz Banaei; Stanley Deresinski; Steven Asch; Mary Goldstein; Jonathan Chen
Description
The Antibiotic Resistance Microbiology Dataset (ARMD) is a structured and de-identified resource developed using electronic health records (EHR) from Stanford Healthcare. It provides a comprehensive overview of microbiological cultures including urine, respiratory, and blood cultures. This dataset includes 283,715 unique adult patients and features detailed information on culture results, identified organisms, antibiotic susceptibility, and associated demographic and clinical data. The dataset was meticulously constructed through a multi-step process designed to enhance data quality and relevance. By enabling the study of antimicrobial resistance patterns and supporting antimicrobial stewardship efforts, ARMD offers a valuable resource for researchers and clinicians seeking to improve the management of infectious diseases and combat the growing threat of antimicrobial resistance.
Thyweill
health.data.ny.gov
application/rdfxml +5
Updated Sep 9, 2019
Cite
New York State Department of Health (2019). Thyweill [Dataset]. https://health.data.ny.gov/Health/Thyweill/uqw2-z9tw
Explore at:
xml, json, csv, application/rdfxml, application/rssxml, tsv (available download formats)
Dataset updated
Sep 9, 2019
Authors
New York State Department of Health
Description
The Statewide Planning and Research Cooperative System (SPARCS) Inpatient De-identified dataset contains discharge level detail on patient characteristics, diagnoses, treatments, services, and charges. This data contains basic record level detail regarding the discharge; however, the data does not contain protected health information (PHI) under the Health Insurance Portability and Accountability Act (HIPAA). The health information is not individually identifiable; all data elements considered identifiable have been redacted. For example, the direct identifiers regarding a date have the day and month portion of the date removed. A downloadable file with this data is available at: https://health.data.ny.gov/Health/Hospital-Inpatient-Discharges-SPARCS-De-Identified/3m9u-ws8e. For more information check out: http://www.health.ny.gov/statistics/sparcs/ or go to the “About” tab.
Discharge Abstract Database (DAD) Research Analytic Files
search.dataone.org
Updated Dec 28, 2023
Cite
Canadian Institute for Health Information (CIHI) (2023). Discharge Abstract Database (DAD) Research Analytic Files [Dataset]. http://doi.org/10.5683/SP3/NHMHRL
Explore at:
Unique identifier
https://doi.org/10.5683/SP3/NHMHRL
Dataset updated
Dec 28, 2023
Dataset provided by
Borealis
Authors
Canadian Institute for Health Information (CIHI)
Description
During the webinar, a senior analyst from CIHI presented the Discharge Abstract Database (DAD) Research Analytic Files. This database captures administrative, clinical and demographic information on hospital discharges, including deaths, sign-outs and transfers. There are two files in the DLI that relate to the Discharge Abstract Database. The files are de-identified samples containing record-level data from fiscal years 2009-2010 and 2010-2011. One file contains clinical data and the other geographic data. Both files are available in English and French. In particular, this webinar focuses on using the documentation provided, as well as a few illustrative examples of how to best use the DAD Research Analytic Files.
