64 datasets found
  1. i2b2 De-identification Dataset

    • paperswithcode.com
    Updated May 6, 2022
    Cite
    (2022). i2b2 De-identification Dataset Dataset [Dataset]. https://paperswithcode.com/dataset/i2b2-de-identification-dataset
    Dataset updated
    May 6, 2022
    Description

    This dataset contains 1304 de-identified longitudinal medical records describing 296 patients.

  2. The MultiCaRe Dataset: A Multimodal Case Report Dataset with Clinical Cases,...

    • zenodo.org
    bin, csv, zip
    Updated Jan 5, 2024
    + more versions
    Cite
    Mauro Nievas Offidani; Claudio Delrieux (2024). The MultiCaRe Dataset: A Multimodal Case Report Dataset with Clinical Cases, Labeled Images and Captions from Open Access PMC Articles [Dataset]. http://doi.org/10.5281/zenodo.10079370
    Available download formats: zip, bin, csv
    Dataset updated
    Jan 5, 2024
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Mauro Nievas Offidani; Claudio Delrieux
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset contains multi-modal data from over 75,000 open access and de-identified case reports, including metadata, clinical cases, image captions and more than 130,000 images. Images and clinical cases belong to different medical specialties, such as oncology, cardiology, surgery and pathology. The structure of the dataset makes it easy to map images to their corresponding article metadata, clinical case, captions and image labels. Details of the data structure can be found in the file data_dictionary.csv.

    Almost 100,000 patients and almost 400,000 medical doctors and researchers were involved in the creation of the articles included in this dataset. The citation data of each article can be found in the metadata.parquet file.

    Refer to the examples showcased in this GitHub repository to understand how to optimize the use of this dataset.

    For a detailed insight about the contents of this dataset, please refer to this data article published in Data In Brief.

  3. MIMIC-III Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Feb 9, 2021
    Cite
    Alistair E.W. Johnson; Tom J. Pollard; Lu Shen; Li-wei H. Lehman; Mengling Feng; Mohammad Ghassemi; Benjamin Moody; Peter Szolovits; Leo Anthony Celi; Roger G. Mark (2023). MIMIC-III Dataset [Dataset]. https://paperswithcode.com/dataset/mimic-iii
    Dataset updated
    Feb 9, 2021
    Authors
    Alistair E.W. Johnson; Tom J. Pollard; Lu Shen; Li-wei H. Lehman; Mengling Feng; Mohammad Ghassemi; Benjamin Moody; Peter Szolovits; Leo Anthony Celi; Roger G. Mark
    Description

    The Medical Information Mart for Intensive Care III (MIMIC-III) dataset is a large, de-identified and publicly available collection of medical records. Each record in the dataset includes ICD-9 codes, which identify diagnoses and procedures performed. Each code is partitioned into sub-codes, which often include specific circumstantial details. The dataset consists of 112,000 clinical report records (average length 709.3 tokens) and 1,159 top-level ICD-9 codes. Each report is assigned 7.6 codes on average. Data includes vital signs, medications, laboratory measurements, observations and notes charted by care providers, fluid balance, procedure codes, diagnostic codes, imaging reports, hospital length of stay, survival data, and more.

    The database supports applications including academic and industrial research, quality improvement initiatives, and higher education coursework.
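
    As an illustration of the split between top-level ICD-9 codes and sub-codes described above, the following is a minimal sketch. It assumes codes are written in dotted form (for example `428.0`), which may not match how a given MIMIC-III table stores them; the `report_codes` mapping and the function name are hypothetical and made up for the example.

```python
from collections import Counter


def top_level_code(icd9_code: str) -> str:
    """Return the part of a dotted ICD-9 code before the decimal point."""
    return icd9_code.strip().split(".", 1)[0]


# Hypothetical report -> assigned codes mapping, for illustration only.
report_codes = {
    "report_a": ["428.0", "427.31", "584.9"],
    "report_b": ["428.22", "V45.81"],
}

# Count how often each top-level code appears across all reports.
top_level_counts = Counter(
    top_level_code(code)
    for codes in report_codes.values()
    for code in codes
)

# Average number of codes assigned per report.
mean_codes_per_report = (
    sum(len(codes) for codes in report_codes.values()) / len(report_codes)
)
```

    Grouping by the prefix before the decimal point is one simple way to collapse sub-codes into their parent category; other conventions exist for codes stored without a dot.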

  4. De-identification - anonymization

    • figshare.com
    txt
    Updated Jun 2, 2023
    Cite
    Francisco H C Felix (2023). De-identification - anonymization [Dataset]. http://doi.org/10.6084/m9.figshare.3545471.v1
    Available download formats: txt
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    figshare
    Authors
    Francisco H C Felix
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    De-identification, anonymization, pseudoanonymization, re-identification

    National Institute of Standards and Technology (NIST) documentation states that the use of these terms is still unclear. The words de-identification, anonymization and pseudoanonymization are sometimes interchangeable and sometimes carry subtly different meanings. To mitigate ambiguity, NIST uses definitions from ISO/TS 25237:2008:

    • de-identification: “general term for any process of removing the association between a set of identifying data and the data subject.” [p. 3]
    • anonymization: “process that removes the association between the identifying dataset and the data subject.” [p. 2]
    • pseudonymization: “particular type of anonymization that both removes the association with a data subject and adds an association between a particular set of characteristics relating to the data subject and one or more pseudonyms.” [p. 5]

    Brazilian Portuguese literature largely lacks this terminology, and these terms are more often used in law or information technology. Their use in health care and research has a specific conceptualization. HIPAA (Health Insurance Portability and Accountability Act), the US regulation on health data privacy protection, establishes standards for the handling of patient personal information (protected health information, PHI) by health care providers (covered entities).

  5. Metadata record for: A DICOM dataset for evaluation of medical image...

    • springernature.figshare.com
    txt
    Updated May 31, 2023
    Cite
    Scientific Data Curation Team (2023). Metadata record for: A DICOM dataset for evaluation of medical image de-identification [Dataset]. http://doi.org/10.6084/m9.figshare.14802774.v1
    Available download formats: txt
    Dataset updated
    May 31, 2023
    Dataset provided by
    Figshare: http://figshare.com/
    Authors
    Scientific Data Curation Team
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset contains key characteristics about the data described in the Data Descriptor A DICOM dataset for evaluation of medical image de-identification. Contents:

        1. human-readable metadata summary table in CSV format
        2. machine-readable metadata file in JSON format
    
  6. Hospital Inpatient Discharges (SPARCS De-Identified): 2018

    • healthdata.gov
    • health.data.ny.gov
    application/rdfxml +5
    Updated Apr 8, 2025
    + more versions
    Cite
    health.data.ny.gov (2025). Hospital Inpatient Discharges (SPARCS De-Identified): 2018 [Dataset]. https://healthdata.gov/State/Hospital-Inpatient-Discharges-SPARCS-De-Identified/pw9x-uv3q
    Available download formats: csv, json, tsv, xml, application/rssxml, application/rdfxml
    Dataset updated
    Apr 8, 2025
    Dataset provided by
    health.data.ny.gov
    Description

    The Statewide Planning and Research Cooperative System (SPARCS) Inpatient De-identified File contains discharge level detail on patient characteristics, diagnoses, treatments, services, and charges. This data file contains basic record level detail for the discharge. The de-identified data file does not contain data that is protected health information (PHI) under HIPAA. The health information is not individually identifiable; all data elements considered identifiable have been redacted. For example, the direct identifiers regarding a date have the day and month portion of the date removed. Note: This dataset may be downloaded from the attachments section of this page in a smaller, compressed format.
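
    The date redaction mentioned above (keeping only the year) can be sketched as follows. This is an illustrative example only, not the SPARCS processing code; the record layout and field names such as `discharge_date` are hypothetical.

```python
from datetime import date


def generalize_to_year(d: date) -> int:
    """Drop the day and month portion of a date, keeping only the year."""
    return d.year


# Hypothetical discharge record, for illustration only.
record = {"discharge_date": date(2018, 3, 14), "length_of_stay": 4}

# Build a de-identified copy that carries the year instead of the full date.
deidentified = {
    "discharge_year": generalize_to_year(record["discharge_date"]),
    "length_of_stay": record["length_of_stay"],
}
```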

  7. INSPECT EHR

    • redivis.com
    application/jsonl +7
    Updated Apr 19, 2025
    Cite
    Shah Lab (2025). INSPECT EHR [Dataset]. http://doi.org/10.57761/ak51-d519
    Available download formats: parquet, sas, csv, arrow, stata, avro, application/jsonl, spss
    Dataset updated
    Apr 19, 2025
    Dataset provided by
    Redivis Inc.
    Authors
    Shah Lab
    Description

    Abstract

    The INSPECT dataset (Integrating Numerous Sources for Prognostic Evaluation of Clinical Timelines) contains de-identified longitudinal electronic health records (EHRs) from a large cohort of pulmonary embolism (PE) patients, along with ground truth labels for multiple outcomes. It includes 19,390 patients' EHRs linked to 23,248 CTPA studies with paired radiology impressions.

    Methodology


    1. Overview

    INSPECT is a large-scale 3D multimodal medical imaging dataset:

    • 19,390 patients
    • 23,248 CT scans
    • 225+ million clinical events
    • 3 linked modalities


    2. CT Scans + Radiology Impression Notes

    Imaging data are available for download from the Stanford AIMI Center.

    3. EHR Data

    EHR data is sourced from Stanford’s STARR-OMOP database. Data are standardized in the OMOP CDM schema and are fully de-identified. Complete technical details are included in the paper, but key highlights:

    • Dates are jittered within patient to conceal real dates (but preserve deltas between dates)
    • Data for patients >= 90 years old are removed
    • Data for minors <18 are removed
    • Unstructured text fields not mappable to OMOP standard concepts are redacted
    • All clinical note text is redacted
    • HIV test results are redacted
    • Provider names and NPIs are redacted
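
    The first three rules in the list above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the INSPECT pipeline: the function names, the `(patient_id, date)` event layout, the shift range and the seed are all hypothetical.

```python
import random
from datetime import date, timedelta


def jitter_dates(events, max_shift_days=365, seed=0):
    """Shift all dates of a patient by one random per-patient offset.

    Because every event of a patient moves by the same number of days,
    the deltas between that patient's dates are preserved.
    """
    rng = random.Random(seed)
    offsets = {}
    shifted = []
    for patient_id, event_date in events:
        if patient_id not in offsets:
            days = rng.randint(-max_shift_days, max_shift_days)
            offsets[patient_id] = timedelta(days=days)
        shifted.append((patient_id, event_date + offsets[patient_id]))
    return shifted


def keep_patient(age_years: int) -> bool:
    """Keep adults under 90; minors and patients aged 90 or more are dropped."""
    return 18 <= age_years < 90


# Hypothetical events, for illustration only.
events = [
    ("p1", date(2020, 1, 1)),
    ("p1", date(2020, 1, 15)),
    ("p2", date(2021, 6, 1)),
]
shifted = jitter_dates(events)
```

    Using a single offset per patient, rather than a fresh offset per event, is what keeps intervals such as time between admission and imaging intact.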


    Please see our GitHub repo to obtain code for loading the dataset, including a full data preprocessing pipeline for reproducibility, and for running a set of pretrained baseline models.

    Usage

    Access to the INSPECT dataset requires the following:

    • Verified Affiliation (Academic, Government, Industry Research Lab). Please use your verified email address when applying; do not use Gmail or other personal email addresses.
    • Encryption Verification / Attestation for Data Storage
    • Signing the terms of the INSPECT Data Set License 1.0
    • Providing a short description of your intended research use of INSPECT
    • CITI Training


    **These data must remain on your encrypted machine. Redistribution of data is FORBIDDEN and will result in immediate termination of access privileges.**

    IMPORTANT NOTES:

    • Our policy on derived works aligns with PhysioNet's guidelines, requiring that these artifacts be hosted on Redivis. If you create derived research artifacts based on INSPECT EHR (such as additional annotations or synthetic data), please contact us to discuss hosting arrangements.
    • Sending INSPECT data over a non-HIPAA-compliant API is a violation of the DUA.


    Please allow 7-10 business days to process applications.

  8. New York State Hospital De-Identified Data Data Package

    • johnsnowlabs.com
    csv
    Updated Jan 20, 2021
    Cite
    John Snow Labs (2021). New York State Hospital De-Identified Data Data Package [Dataset]. https://www.johnsnowlabs.com/marketplace/new-york-state-hospital-de-identified-data-data-package/
    Available download formats: csv
    Dataset updated
    Jan 20, 2021
    Dataset authored and provided by
    John Snow Labs
    Area covered
    New York
    Description

    This data package provides patient-level hospital discharge data with basic record details. It contains no protected health information (PHI) and has been made non-identifiable. The data is classified by Health Service Area and county.

  9. CARMEN-I: A resource of anonymized electronic health records in Spanish and...

    • physionet.org
    Updated Apr 20, 2024
    Cite
    Eulalia Farre Maduell; Salvador Lima-Lopez; Santiago Andres Frid; Artur Conesa; Elisa Asensio; Antonio Lopez-Rueda; Helena Arino; Elena Calvo; Maria Jesús Bertran; Maria Angeles Marcos; Montserrat Nofre Maiz; Laura Tañá Velasco; Antonia Marti; Ricardo Farreres; Xavier Pastor; Xavier Borrat Frigola; Martin Krallinger (2024). CARMEN-I: A resource of anonymized electronic health records in Spanish and Catalan for training and testing NLP tools [Dataset]. http://doi.org/10.13026/x7ed-9r91
    Dataset updated
    Apr 20, 2024
    Authors
    Eulalia Farre Maduell; Salvador Lima-Lopez; Santiago Andres Frid; Artur Conesa; Elisa Asensio; Antonio Lopez-Rueda; Helena Arino; Elena Calvo; Maria Jesús Bertran; Maria Angeles Marcos; Montserrat Nofre Maiz; Laura Tañá Velasco; Antonia Marti; Ricardo Farreres; Xavier Pastor; Xavier Borrat Frigola; Martin Krallinger
    License

    https://github.com/MIT-LCP/license-and-dua/tree/master/drafts

    Description

    The CARMEN-I corpus comprises 2,000 clinical records, encompassing discharge letters, referrals, and radiology reports from Hospital Clínic of Barcelona between March 2020 and March 2022. These reports, primarily in Spanish with some Catalan sections, cover COVID-19 patients with diverse comorbidities like kidney failure, cardiovascular diseases, malignancies, and immunosuppression. The corpus underwent thorough anonymization, validation, and expert annotation, replacing sensitive data with synthetic equivalents. A subset of the corpus features annotations of medical concepts by specialists, encompassing symptoms, diseases, procedures, medications, species, and humans (including family members). CARMEN-I serves as a valuable resource for training and assessing clinical NLP techniques and language models, aiding tasks like de-identification, concept detection, linguistic modifier extraction, document classification, and more. It also facilitates training researchers in clinical NLP and is a collaborative effort involving Barcelona Supercomputing Center's NLP4BIA team, Hospital Clínic, and Universitat de Barcelona's CLiC group.

  10. EHRSHOT

    • redivis.com
    • stanford.redivis.com
    application/jsonl +7
    Updated Feb 13, 2025
    Cite
    Shah Lab (2025). EHRSHOT [Dataset]. http://doi.org/10.57761/0gv9-nd83
    Available download formats: csv, application/jsonl, sas, parquet, stata, spss, arrow, avro
    Dataset updated
    Feb 13, 2025
    Dataset provided by
    Redivis Inc.
    Authors
    Shah Lab
    Description

    Abstract

    👂💉 EHRSHOT is a dataset for benchmarking the few-shot performance of foundation models for clinical prediction tasks. EHRSHOT contains de-identified structured data (e.g., diagnosis and procedure codes, medications, lab values) from the electronic health records (EHRs) of 6,739 Stanford Medicine patients and includes 15 prediction tasks. Unlike MIMIC-III/IV and other popular EHR datasets, EHRSHOT is longitudinal and includes data beyond ICU and emergency department patients.

    ⚡️ Quickstart

    1. To recreate the original EHRSHOT paper, download the EHRSHOT_ASSETS.zip file from the "Files" tab
    2. To work with OMOP CDM formatted data, download all the tables in the "Tables" tab

    ⚙️ Please see the "Methodology" section below for details on the dataset and downloadable files.

    Methodology

    1. 📖 Overview

    EHRSHOT is a benchmark for evaluating models on few-shot learning for patient classification tasks. The dataset contains:

    • **6,739** patients
    • 41.6 million clinical events
    • 921,499 visits
    • 15 prediction tasks


    2. 💽 Dataset

    EHRSHOT is sourced from Stanford’s STARR-OMOP database.

    • Data follows the OMOP CDM and is fully de-identified.
    • Unlike most other EHR research datasets, EHRSHOT is not restricted to ED/ICU visits and instead includes longitudinal patient data for all hospital encounter types.
    • EHRSHOT does not contain clinical notes or images.


    We provide two versions of the dataset:

    • EHRSHOT-Original is the same exact dataset used in the original EHRSHOT paper.
    • EHRSHOT-OMOP is a more complete version of the EHRSHOT dataset which includes all OMOP CDM tables and additional OMOP metadata.


    To access the raw data, please see the "Tables" and "Files" tabs above:

    3. 💽 Data Files and Formats

    We provide EHRSHOT in two file formats:

    • OMOP CDM v5.4
    • Medical Event Data Standard (MEDS)


    Within the "Tables" tab...

    1. EHRSHOT-OMOP

    * Dataset Version: EHRSHOT-OMOP

    * Notes: Contains all OMOP CDM tables for the EHRSHOT patients. Note that this dataset is slightly different than the original EHRSHOT dataset, as these tables contain the full OMOP schema rather than a filtered subset.

    Within the "Files" tab...

    1. EHRSHOT_ASSETS.zip

    * Dataset Version: EHRSHOT-Original

    * Data Format: FEMR 0.1.16

    * Notes: The original EHRSHOT dataset as detailed in the paper. Also includes model weights.

    2. EHRSHOT_MEDS.zip

    * Dataset Version: EHRSHOT-Original

    * Data Format: MEDS 0.3.3

    * Notes: The original EHRSHOT dataset as detailed in the paper. It does not include any models.

    3. EHRSHOT_OMOP_MEDS.zip

    * Dataset Version: EHRSHOT-OMOP

    * Data Format: MEDS 0.3.3 + MEDS-ETL 0.3.8

    * Notes: Converts the dataset from EHRSHOT-OMOP into MEDS format via the `meds_etl_omop` command from MEDS-ETL.

    4. EHRSHOT_OMOP_MEDS_Reader.zip

    * Dataset Version: EHRSHOT-OMOP

    * Data Format: MEDS Reader 0.1.9 + MEDS 0.3.3 + MEDS-ETL 0.3.8

    * Notes: Same data as EHRSHOT_OMOP_MEDS.zip, but converted into a MEDS-Reader database for faster reads.

    4. 🤖 Model

    We also release the full weights of **CLMBR-T-base**, a 141M parameter clinical foundation model pretrained on the structured EHR data of 2.57M patients. Please download from https://huggingface.co/StanfordShahLab/clmbr-t-base

    5. 🧑‍💻 Code

    Please see our GitHub repo to obtain code for loading the dataset and running a set of pretrained baseline models: https://github.com/som-shahlab/ehrshot-benchmark/

    Usage

    **NOTE: You must authenticate to Redivis using your formal affiliation's email address. If you use Gmail or other personal email addresses, you will not be granted access.**

    Access to the EHRSHOT dataset requires the following:

    • Verified Affiliation with an **Academic, Government, **o
  11. Hospital Inpatient Discharges (SPARCS De-Identified): 2013

    • healthdata.gov
    • health.data.ny.gov
    application/rdfxml +5
    Updated Apr 8, 2025
    + more versions
    Cite
    health.data.ny.gov (2025). Hospital Inpatient Discharges (SPARCS De-Identified): 2013 [Dataset]. https://healthdata.gov/State/Hospital-Inpatient-Discharges-SPARCS-De-Identified/gbzd-5nff
    Available download formats: application/rdfxml, csv, json, application/rssxml, xml, tsv
    Dataset updated
    Apr 8, 2025
    Dataset provided by
    health.data.ny.gov
    Description

    The Statewide Planning and Research Cooperative System (SPARCS) Inpatient De-identified File contains discharge level detail on patient characteristics, diagnoses, treatments, services, and charges. This data file contains basic record level detail for the discharge. The de-identified data file does not contain data that is protected health information (PHI) under HIPAA. The health information is not individually identifiable; all data elements considered identifiable have been redacted. For example, the direct identifiers regarding a date have the day and month portion of the date removed.

  12. Data in Support of the MIDI-B Challenge (MIDI-B-Synthetic-Validation,...

    • cancerimagingarchive.net
    csv, dicom, n/a +1
    Updated May 2, 2025
    Cite
    The Cancer Imaging Archive (2025). Data in Support of the MIDI-B Challenge (MIDI-B-Synthetic-Validation, MIDI-B-Curated-Validation, MIDI-B-Synthetic-Test, MIDI-B-Curated-Test) [Dataset]. http://doi.org/10.7937/cf2p-aw56
    Available download formats: sqlite and zip, dicom, csv, n/a
    Dataset updated
    May 2, 2025
    Dataset authored and provided by
    The Cancer Imaging Archive
    License

    https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/

    Time period covered
    May 2, 2025
    Dataset funded by
    National Cancer Institute: http://www.cancer.gov/
    Description

    Abstract

    These resources comprise a large and diverse collection of multi-site, multi-modality, and multi-cancer clinical DICOM images from 538 subjects infused with synthetic PHI/PII in areas encountered by TCIA curation teams. Also provided is a TCIA-curated version of the synthetic dataset, along with mapping files for mapping identifiers between the two.

    This new MIDI data resource includes DICOM datasets used in the Medical Image De-Identification Benchmark (MIDI-B) challenge at MICCAI 2024. They are accompanied by ground truth answer keys and a validation script for evaluating the effectiveness of medical image de-identification workflows. The validation script systematically assesses de-identified data against an answer key outlining appropriate actions and values for proper de-identification of medical images, promoting safer and more consistent medical image sharing.

    Introduction

    Medical imaging research increasingly relies on large-scale data sharing. However, reliable de-identification of DICOM images still presents significant challenges due to the wide variety of DICOM header elements and pixel data where identifiable information may be embedded. To address this, we have developed an openly accessible synthetic dataset containing artificially generated protected health information (PHI) and personally identifiable information (PII).

    These resources complement our earlier work (Pseudo-PHI-DICOM-data) hosted on The Cancer Imaging Archive. As an example of its use, we also provide a version curated by The Cancer Imaging Archive (TCIA) curation team. This resource builds upon best practices emphasized by the MIDI Task Group, which underscores the importance of transparency, documentation, and reproducibility in de-identification workflows, themes featured at recent conferences (Synapse:syn53065760) and workshops (2024 MIDI-B Challenge Workshop).

    This framework enables objective benchmarking of de-identification performance, promotes transparency in compliance with regulatory standards, and supports the establishment of consistent best practices for sharing clinical imaging data. We encourage the research community to use these resources to enhance and standardize their medical image de-identification workflows.

    Methods

    Subject Inclusion and Exclusion Criteria

    The source data were selected from imaging already hosted in de-identified form on TCIA. Images containing faces were excluded, and no new human studies were performed for this project.

    Data Acquisition

    To build the synthetic dataset, image series were selected from TCIA’s curated datasets to represent a broad range of imaging modalities (CR, CT, DX, MG, MR, PT, SR, US), manufacturers (including GE, Siemens, Varian, Confirma, Agfa, Eigen, Elekta, Hologic, KONICA MINOLTA, and others), scan parameters, and regions of the body. These were processed to inject the synthetic PHI/PII as described.

    Data Analysis

    Synthetic pools of PHI, like subject and scanning institution information, were generated using the Python package Faker (https://pypi.org/project/Faker/8.10.3/). These were inserted into DICOM metadata of selected imaging files using a system of inheritable rule-based templates outlining re-identification functions for data insertion and logging for answer key creation. Text was also burned into the pixel data of a number of images. By systematically embedding realistic synthetic PHI into image headers and pixel data, accompanied by a detailed ground-truth answer key, our framework supports transparency, documentation, and reproducibility in de-identification practices, aligned with the HIPAA Safe Harbor method, DICOM PS3.15 Confidentiality Profiles, and TCIA best practices.

    Usage Notes

    This DICOM collection is split into two datasets, synthetic and curated. The synthetic dataset is the PHI/PII infused DICOM collection accompanied by a validation script and answer keys for testing, refining and benchmarking medical image de-identification pipelines. The curated dataset is a version of the synthetic dataset curated and de-identified by members of The Cancer Imaging Archive curation team. It can be used as a guide, an example of medical image curation best practices. For the purposes of the De-Identification challenge at MICCAI 2024, the synthetic and curated datasets each contain two subsets, a portion for Validation and the other for Testing.

    To link a curated dataset to the original synthetic dataset and answer keys, a mapping between the unique identifiers (UIDs) and patient IDs must be provided in CSV format to the evaluation software. We include the mapping files associated with the TCIA-curated set as an example. Lastly, for both the Validation and Testing datasets, an answer key in sqlite.db format is provided. These components are for use with the Python validation script linked below (4). Combining these components, a user developing or evaluating de-identification methods can ensure they meet a specification for successfully de-identifying medical image data.

  13. Optimum Patient Care Research Database (OPCRD)

    • healthdatagateway.org
    unknown
    Updated Sep 12, 2024
    Cite
    Optimum Patient Care (OPC) (2024). Optimum Patient Care Research Database (OPCRD) [Dataset]. http://doi.org/10.2147/POR.S395632
    Available download formats: unknown
    Dataset updated
    Sep 12, 2024
    Dataset provided by
    Optimum Patient Care Limited
    Authors
    Optimum Patient Care (OPC)
    License

    https://opcrd.co.uk/our-database/data-requests/

    Description

    About OPCRD

    Optimum Patient Care Research Database (OPCRD) is a real-world, longitudinal, research database that provides anonymised data to support scientific, medical, public health and exploratory research. OPCRD is established, funded and maintained by Optimum Patient Care Limited (OPC) – which is a not-for-profit social enterprise that has been providing quality improvement programmes and research support services to general practices across the UK since 2005.

    Key Features of OPCRD

    OPCRD has been purposefully designed to facilitate real-world data collection and address the growing demand for observational and pragmatic medical research, both in the UK and internationally. Data held in OPCRD is representative of routine clinical care and thus enables the study of ‘real-world’ effectiveness and health care utilisation patterns for chronic health conditions.

    OPCRD's unique qualities, which set it apart from other research data resources:

    • De-identified electronic medical records of more than 24.9 million patients
    • OPCRD covers all major UK primary care clinical systems
    • OPCRD covers approximately 35% of the UK population
    • One of the biggest primary care research networks in the world, with over 1,175 practices
    • Linked patient reported outcomes for over 68,000 patients including Covid-19 patient reported data
    • Linkage to secondary care data sources including Hospital Episode Statistics (HES)

    Data Available in OPCRD

    OPCRD has received data contributions from over 1,175 practices and currently holds de-identified research ready data for over 24.9 million patients or data subjects. This includes longitudinal primary care patient data and any data relevant to the management of patients in primary care, and thus covers all conditions. The data is derived from both electronic health records (EHR) data and patient reported data from patient questionnaires delivered as part of quality improvement. OPCRD currently holds over 68,000 patient reported questionnaire data on Covid-19, asthma, COPD and rare diseases.

    Approvals and Governance

    OPCRD has held NHS research ethics committee (REC) approval to provide anonymised data for scientific and medical research since 2010, with its most recent approval in 2020 (NHS HRA REC ref: 20/EM/0148). OPCRD is governed by the Anonymised Data Ethics and Protocols Transparency committee (ADEPT). All research conducted using anonymised data from OPCRD must gain prior approval from ADEPT. Proceeds from OPCRD data access fees and detailed feasibility assessments are re-invested into OPC services for the continued free provision of patient quality improvement programmes for contributing practices and patients.

    For more information on OPCRD please visit: https://opcrd.co.uk/

  14. Data from: Generalizable EHR-R-REDCap pipeline for a national...

    • datadryad.org
    • explore.openaire.eu
    • +2 more
    zip
    Updated Jan 9, 2022
    Cite
    Sophia Shalhout; Farees Saqlain; Kayla Wright; Oladayo Akinyemi; David Miller (2022). Generalizable EHR-R-REDCap pipeline for a national multi-institutional rare tumor patient registry [Dataset]. http://doi.org/10.5061/dryad.rjdfn2zcm
    Available download formats: zip
    Dataset updated
    Jan 9, 2022
    Dataset provided by
    Dryad
    Authors
    Sophia Shalhout; Farees Saqlain; Kayla Wright; Oladayo Akinyemi; David Miller
    Time period covered
    2021
    Description

    Objective: To develop a clinical informatics pipeline designed to capture large-scale structured EHR data for a national patient registry.

    Materials and Methods: The EHR-R-REDCap pipeline is implemented using R-statistical software to remap and import structured EHR data into the REDCap-based multi-institutional Merkel Cell Carcinoma (MCC) Patient Registry using an adaptable data dictionary.

    Results: Clinical laboratory data were extracted from EPIC Clarity across several participating institutions. Labs were transformed, remapped and imported into the MCC registry using the EHR labs abstraction (eLAB) pipeline. Forty-nine clinical tests encompassing 482,450 results were imported into the registry for 1,109 enrolled MCC patients. Data-quality assessment revealed highly accurate, valid labs. Univariate modeling was performed for labs at baseline on overall survival (N=176) using this clinical informatics pipeline.

    Conclusion: We demonstrate feasibility of the facile eLAB workflow. EHR...

  15. Data from: Learning relevance models for patient cohort retrieval

    • data.niaid.nih.gov
    • search.dataone.org
    • +1more
    zip
    Updated Mar 27, 2019
    Cite
    Travis R. Goodwin; Sanda M. Harabagiu (2019). Learning relevance models for patient cohort retrieval [Dataset]. http://doi.org/10.5061/dryad.pq0cs6h
    Available download formats: zip
    Dataset updated
    Mar 27, 2019
    Dataset provided by
    The University of Texas at Dallas
    Authors
    Travis R. Goodwin; Sanda M. Harabagiu
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    OBJECTIVE: We explored how judgements provided by physicians can be used to learn relevance models that enhance the quality of patient cohorts retrieved from Electronic Health Records (EHR) collections.

    METHODS: A very large number of features were extracted from patient cohort descriptions as well as electronic health record collections. Specifically, we investigated retrieving (1) neurology-specific patient cohorts from the Temple University Hospital EEG Corpus as well as (2) the more general cohorts evaluated in the TREC Medical Records Track (TRECMed) from the de-identified hospital records provided by the University of Pittsburgh Medical Center. The features informed a Learning Relevance Model (LRM) that took advantage of relevance judgements provided by physicians. The LRM implements a pairwise learning-to-rank framework, which enables our learning patient cohort retrieval (L-PCR) system to learn from physicians’ feedback.

    RESULTS AND DISCUSSION: We evaluated the L-PCR system against state-of-the-art traditional patient cohort retrieval systems, and observed a 27% improvement when operating on EEGs and a 53% improvement when operating on TRECMed EHRs, showing the promise of the L-PCR system. We also performed extensive feature analyses to reveal the most effective strategies for representing cohort descriptions as queries, encoding EHRs, and measuring relevance.

    CONCLUSION: The learning patient cohort retrieval system has significant promise for reliably retrieving patient cohorts from EHRs in multiple settings when trained with relevance judgments. When provided with additional cohort descriptions, the L-PCR will continue to learn, thus offering a potential solution to the performance barriers of current cohort retrieval systems.

  16. Hospital Inpatient Discharges (SPARCS De-Identified): 2023

    • healthdata.gov
    application/rdfxml +5
    Updated Jun 5, 2025
    Cite
    health.data.ny.gov (2025). Hospital Inpatient Discharges (SPARCS De-Identified): 2023 [Dataset]. https://healthdata.gov/d/rwh3-2k63
    Available download formats: application/rdfxml, csv, application/rssxml, tsv, xml, json
    Dataset updated
    Jun 5, 2025
    Dataset provided by
    health.data.ny.gov
    Description

    The Statewide Planning and Research Cooperative System (SPARCS) Inpatient De-identified File contains discharge level detail on patient characteristics, diagnoses, treatments, services, and charges.

    This data file contains basic record level detail for the discharge. The de-identified data file does not contain data that is protected health information (PHI) under HIPAA. The health information is not individually identifiable; all data elements considered identifiable have been redacted. For example, the direct identifiers regarding a date have the day and month portion of the date removed.
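    The date handling described above can be illustrated with a minimal Python sketch. This is not the SPARCS implementation; the function name `redact_to_year` and the record fields `discharge_date` and `length_of_stay` are hypothetical and chosen only for illustration. It shows one way to drop the day and month portions of a date so that only the year remains.

```python
from datetime import date


def redact_to_year(d: date) -> str:
    """Keep only the year of a date, dropping the month and day."""
    return f"{d.year:04d}"


# Hypothetical discharge record; the field names are assumptions, not SPARCS columns.
record = {"discharge_date": date(2023, 7, 14), "length_of_stay": 3}

# Build a de-identified copy in which the full date is replaced by its year.
deidentified = {**record, "discharge_date": redact_to_year(record["discharge_date"])}

print(deidentified)
```

    Building a new dictionary rather than mutating the record in place keeps the original available for any upstream checks before it is discarded.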

    For more information visit: https://www.health.ny.gov/statistics/sparcs/

  17. De-identified data.

    • plos.figshare.com
    • figshare.com
    xlsx
    Updated Feb 8, 2024
    Cite
    Sophie Wennemann; Bbuye Mudarshiru; Stella Zawedde-Muyanja; Trishul Siddharthan; Peter D. Jackson (2024). De-identified data. [Dataset]. http://doi.org/10.1371/journal.pgph.0002892.s002
    Available download formats: xlsx
    Dataset updated
    Feb 8, 2024
    Dataset provided by
    PLOS Global Public Health
    Authors
    Sophie Wennemann; Bbuye Mudarshiru; Stella Zawedde-Muyanja; Trishul Siddharthan; Peter D. Jackson
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    More than half the global population burns biomass fuels for cooking and home heating, especially in low- and middle-income countries. This practice is a prominent source of indoor air pollution and has been linked to the development of a variety of cardiopulmonary diseases, including tuberculosis (TB). The purpose of this cross-sectional study was to investigate the association between current biomass smoke exposure and self-reported quality of life scores in a cohort of previous TB patients in Uganda. We reviewed medical records from six TB clinics from 9/2019-9/2020 and conducted phone interviews to obtain information about biomass smoke exposure. A random sample of these patients was asked to complete three validated quality-of-life surveys: the St. George's Respiratory Questionnaire (SGRQ), the EuroQol 5 Dimension 3 Level system (EQ-5D-3L), which includes the EuroQol Visual Analog Scale (EQ-VAS), and the Patient Health Questionnaire 9 (PHQ-9). The cohort was divided into 3 levels based on years of smoke exposure: no reported smoke exposure (0 years), light exposure (1–19 years), and heavy exposure (20+ years). Independent-samples Kruskal-Wallis testing was performed with post-hoc pairwise comparison and the Bonferroni correction. The results of this testing indicated significant increases in survey scores for patients with current biomass exposure and a heavy smoke exposure history (20+ years) compared to no reported smoke exposure in the SGRQ activity scores (adj. p = 0.018) and EQ-5D-3L usual activity scores (adj. p = 0.002), indicating worse activity-related symptoms. There was a decrease in EQ-VAS scores for heavy (adj. p = 0.007) and light (adj. p = 0.017) exposure groups compared to no reported exposure, indicating lower perceptions of overall health. These results may suggest worse outcomes or baseline health for TB patients exposed to biomass smoke at the time of treatment and recovery; however, further research is needed to characterize the effect of indoor air pollution on TB treatment outcomes.

  18. MultiCaRe: An open-source clinical case dataset for medical image...

    • zenodo.org
    bin, csv, zip
    Updated Dec 20, 2024
    Cite
    Mauro Nievas Offidani (2024). MultiCaRe: An open-source clinical case dataset for medical image classification and multimodal AI applications [Dataset]. http://doi.org/10.5281/zenodo.13936721
    Available download formats: zip, bin, csv
    Dataset updated
    Dec 20, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Mauro Nievas Offidani
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset contains multi-modal data from over 85,000 open access and de-identified case reports, including metadata, clinical cases, image captions and more than 160,000 images. Images and clinical cases belong to different medical specialties, such as oncology, cardiology, surgery and pathology. The structure of the dataset makes it easy to map images to their corresponding article metadata, clinical case, captions and image labels. Details of the data structure can be found in the file data_dictionary.csv.

    More than 110,000 patients and 300,000 medical doctors and researchers were involved in the creation of the articles included in this dataset. The citation data of each article can be found in the metadata.parquet file.

    Refer to the examples showcased in this GitHub repository to understand how to optimize the use of this dataset.

  19. Antibiotic Resistance Microbiology Dataset (ARMD): A de-identified resource...

    • search.dataone.org
    • datadryad.org
    Updated Apr 12, 2025
    Cite
    Fateme Nateghi Haredasht; Fatemeh Amrollahi; Manoj Maddali; Nicholas Marshall; Stephen Ma; Amy Chang; Niaz Banaei; Stanley Deresinski; Steven Asch; Mary Goldstein; Jonathan Chen (2025). Antibiotic Resistance Microbiology Dataset (ARMD): A de-identified resource for studying antimicrobial resistance using electronic health records [Dataset]. http://doi.org/10.5061/dryad.jq2bvq8kp
    Dataset updated
    Apr 12, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Fateme Nateghi Haredasht; Fatemeh Amrollahi; Manoj Maddali; Nicholas Marshall; Stephen Ma; Amy Chang; Niaz Banaei; Stanley Deresinski; Steven Asch; Mary Goldstein; Jonathan Chen
    Description

    The Antibiotic Resistance Microbiology Dataset (ARMD) is a structured and de-identified resource developed using electronic health records (EHR) from Stanford Healthcare. It provides a comprehensive overview of microbiological cultures including urine, respiratory, and blood cultures. This dataset includes 283,715 unique adult patients and features detailed information on culture results, identified organisms, antibiotic susceptibility, and associated demographic and clinical data. The dataset was meticulously constructed through a multi-step process designed to enhance data quality and relevance. By enabling the study of antimicrobial resistance patterns and supporting antimicrobial stewardship efforts, ARMD offers a valuable resource for researchers and clinicians seeking to improve the management of infectious diseases and combat the growing threat of antimicrobial resistance.

    Cohort Selection

    The ARMD was created using de-identified EHR data from Stanford Healthcare to address this need. This dataset provides microbiological cultures from adult patients (≥18 years old) and includes key clinical data points relevant to studying antimicrobial resistance. The cohort construction involved the following features and processes:

    Culture Types: Microbiological cultures were included, specifically urine, respiratory, and blood cultures.

    Temporal Adjustment: The timing of culture orders was adjusted for data privacy through jittering, ensuring patient confidentiality while retaining meaningful temporal relationships.

    Culture Positivity: Each culture is flagged as either positive or negative, indicating whether an organism was identified. Cultures flagged as negative are represented by a null value in the susceptibility field.

    Organism Identification and Susceptibility: For positive cultures, the identified organism and its antibiotic susceptibility are recorde...

    Antibiotic Resistance Microbiology Dataset (ARMD): A de-identified resource for studying antimicrobial resistance using electronic health records

    Background

    Antimicrobial resistance (AMR) represents a pressing global health challenge, exacerbated by the overuse and misuse of antibiotics. Efforts to mitigate AMR require high-quality datasets to analyze trends in microbial susceptibility, guide clinical decision-making, and inform stewardship programs. Electronic health records (EHR) are a rich source of real-world data that can be leveraged to study antimicrobial use and resistance patterns. However, constructing meaningful datasets from EHR data requires rigorous curation and preprocessing to ensure accuracy, relevance, and usability. ARMD aims to facilitate research in antimicrobial stewardship, with applications in identifying resistance patterns, evaluating treatment practices, and informing public health interventions. By leveraging de-identified EHR data from Stanford Healt...

  20. OLD-INSPECT: A Multimodal Dataset for Pulmonary Embolism Diagnosis and...

    • stanfordaimi.azurewebsites.net
    Updated May 30, 2024
    + more versions
    Cite
    Microsoft Research (2024). OLD-INSPECT: A Multimodal Dataset for Pulmonary Embolism Diagnosis and Prognosis [Dataset]. https://stanfordaimi.azurewebsites.net/datasets/318f3464-c4b6-4006-9856-6f48ba40ad67
    Dataset updated
    May 30, 2024
    Dataset authored and provided by
    Microsoft Research
    License

    https://aimistanford-web-api.azurewebsites.net/licenses/f1f352a6-243f-4905-8e00-389edbca9e83/view

    Description

    Synthesizing information from various data sources plays a crucial role in the practice of modern medicine. Current applications of artificial intelligence in medicine often focus on single-modality data due to a lack of publicly available, multimodal medical datasets. To address this limitation, we introduce INSPECT, which contains de-identified longitudinal records from a large cohort of pulmonary embolism (PE) patients, along with ground truth labels for multiple outcomes. INSPECT contains data from 19,438 patients, including CT images, sections of radiology reports, and structured electronic health record (EHR) data (including demographics, diagnoses, procedures, and vitals). Using our provided dataset, we develop and release a benchmark for evaluating several baseline modeling approaches on a variety of important PE-related tasks. We evaluate image-only, EHR-only, and fused models. Trained models and the de-identified dataset are made available for non-commercial use under a data use agreement. To the best of our knowledge, INSPECT is the largest multimodal dataset for enabling reproducible research on strategies for integrating 3D medical imaging and EHR data. NOTE: This is the first part of the release, due to PHI review. This release has 20,078 CT scans and 21,266 impression sections; the EHR modality data will be uploaded to the Stanford Redivis website (https://redivis.com/Stanford).

Cite
(2022). i2b2 De-identification Dataset Dataset [Dataset]. https://paperswithcode.com/dataset/i2b2-de-identification-dataset

i2b2 De-identification Dataset Dataset

Informatics for Integrating Biology and the Bedside (i2b2) Project — De-identification Dataset

Explore at:
15 scholarly articles cite this dataset (View in Google Scholar)
Dataset updated
May 6, 2022
Description

This dataset contains 1304 de-identified longitudinal medical records describing 296 patients.
