100+ datasets found
  1. De-identification - anonymization

    • figshare.com
    txt
    Updated Jun 2, 2023
    Cite
    Francisco H C Felix (2023). De-identification - anonymization [Dataset]. http://doi.org/10.6084/m9.figshare.3545471.v1
    Explore at:
    Available download formats: txt
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    figshare
    Authors
    Francisco H C Felix
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    De-identification, anonymization, pseudonymization, re-identification

    National Institute of Standards and Technology (NIST) documentation states that the use of these terms is still unclear. The words de-identification, anonymization and pseudonymization are sometimes used interchangeably and sometimes carry subtly different meanings. To mitigate ambiguity, NIST uses the definitions from ISO/TS 25237:2008:

    • de-identification: “general term for any process of removing the association between a set of identifying data and the data subject.” [p. 3]
    • anonymization: “process that removes the association between the identifying dataset and the data subject.” [p. 2]
    • pseudonymization: “particular type of anonymization that both removes the association with a data subject and adds an association between a particular set of characteristics relating to the data subject and one or more pseudonyms.” [p. 5]

    Brazilian Portuguese literature largely lacks this terminology; the terms are more often used in law or information technology. In health care and research these concepts carry a specific meaning. HIPAA (Health Insurance Portability and Accountability Act), the US regulation on health data privacy, establishes standards for how health care providers (covered entities) handle patients' personal information (protected health information, PHI).
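    The difference between the anonymization and pseudonymization definitions quoted above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the records, the field name `name`, and the pseudonym format are made up, and dropping a single field is not sufficient for real anonymization, because other fields can still act as quasi-identifiers.

```python
def pseudonymize(records, id_field):
    """Replace each distinct identifier with a pseudonym.

    Returns the new records plus the key (identifier -> pseudonym).
    Whoever holds the key can re-identify the data subjects.
    """
    key = {}
    output = []
    for record in records:
        original = record[id_field]
        if original not in key:
            # Hypothetical pseudonym format: P0001, P0002, ...
            key[original] = f"P{len(key) + 1:04d}"
        cleaned = dict(record)
        cleaned[id_field] = key[original]
        output.append(cleaned)
    return output, key


def anonymize(records, id_field):
    """Drop the identifying field and keep no key at all."""
    return [{k: v for k, v in r.items() if k != id_field} for r in records]


# Made-up example records.
records = [
    {"name": "Alice Example", "diagnosis": "asthma"},
    {"name": "Bob Example", "diagnosis": "diabetes"},
    {"name": "Alice Example", "diagnosis": "influenza"},
]

pseudo, key = pseudonymize(records, "name")
anon = anonymize(records, "name")
```

    In the pseudonymized output the two records for the same person still share a pseudonym, so they can be linked to each other; in the anonymized output that link is gone along with the name.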

  2. Open Data Training Video: A proposed data de-identification framework for...

    • dataone.org
    • borealisdata.ca
    Updated Dec 28, 2023
    Cite
    Mawji, Alishah; Longstaff, Holly; Trawin, Jessica; Komugisha, Clare; Novakowski, Stefanie K.; Wiens, Matt; Akech, Samuel; Tagoola, Abner; Kissoon, Niranjan; Ansermino, Mark J. (2023). Open Data Training Video: A proposed data de-identification framework for clinical research [Dataset]. http://doi.org/10.5683/SP3/7XYZVC
    Explore at:
    Dataset updated
    Dec 28, 2023
    Dataset provided by
    Borealis
    Authors
    Mawji, Alishah; Longstaff, Holly; Trawin, Jessica; Komugisha, Clare; Novakowski, Stefanie K.; Wiens, Matt; Akech, Samuel; Tagoola, Abner; Kissoon, Niranjan; Ansermino, Mark J.
    Description

    Objective(s): Data sharing has enormous potential to accelerate and improve the accuracy of research, strengthen collaborations, and restore trust in the clinical research enterprise. Nevertheless, there remains reluctance to openly share raw data sets, in part due to concerns regarding research participant confidentiality and privacy. We provide an instructional video to describe a standardized de-identification framework that can be adapted and refined based on specific context and risks. Data Description: Training video, presentation slides. Related Resources: The data de-identification algorithm, dataset, and data dictionary that correspond with this training video are available through the Smart Triage sub-Dataverse. NOTE for restricted files: If you are not yet a CoLab member, please complete our membership application survey to gain access to restricted files within 2 business days. Some files may remain restricted to CoLab members. These files are deemed more sensitive by the file owner and are meant to be shared on a case-by-case basis. Please contact the CoLab coordinator on this page under "collaborate with the pediatric sepsis colab."

  3. De-identified Health Data Market Report

    • archivemarketresearch.com
    doc, pdf, ppt
    Updated Jan 22, 2025
    Cite
    Archive Market Research (2025). De-identified Health Data Market Report [Dataset]. https://www.archivemarketresearch.com/reports/de-identified-health-data-market-9104
    Explore at:
    Available download formats: ppt, pdf, doc
    Dataset updated
    Jan 22, 2025
    Dataset authored and provided by
    Archive Market Research
    License

    https://www.archivemarketresearch.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    global
    Variables measured
    Market Size
    Description

    Recent developments include:

    • In February 2024, Veradigm published its first Veradigm Insights Report: Cardiovascular Conditions in 2024, analyzing de-identified real-world data from 53 million cardiovascular patients. The report assesses the prevalence of cardiovascular disease (CVD) and related conditions across all U.S. states, with demographic breakdowns based on age, ethnicity, and sex.
    • In July 2021, Verana Health and Komodo Health partnered to integrate Komodo’s Healthcare Map into Verana’s de-identified EHR datasets, spanning over 325 million patient journeys. This collaboration aims to provide life sciences researchers with detailed insights into patient pathways, encompassing treatment histories, hospitalizations, and socioeconomic factors. The partnership is expected to enhance research efforts in ophthalmology, neurology, and urology by combining clinical outcomes with real-world patient data, supporting more informed treatment development.
    • In September 2024, ICON announced a collaboration with Intel to utilize de-identified data from its clinical research platform alongside Intel's AI technology. This partnership enhances patient recruitment and streamlines clinical trial processes by deriving insights from de-identified patient data. The initiative aims to advance precision medicine and improve efficiencies in drug development and outcomes by integrating ICON's clinical trial expertise with Intel's AI capabilities.

  4. National Database for Clinical Trials Related to Mental Illness (NDCT)

    • catalog.data.gov
    • healthdata.gov
    • +3 more
    Updated Jul 16, 2025
    Cite
    National Institutes of Health (NIH) (2025). National Database for Clinical Trials Related to Mental Illness (NDCT) [Dataset]. https://catalog.data.gov/dataset/national-database-for-clinical-trials-related-to-mental-illness-ndct
    Explore at:
    Dataset updated
    Jul 16, 2025
    Dataset provided by
    National Institutes of Health (NIH)
    Description

    The National Database for Clinical Trials Related to Mental Illness (NDCT) is an extensible informatics platform for relevant data at all levels of biological and behavioral organization (molecules, genes, neural tissue, behavioral, social and environmental interactions) and for all data types (text, numeric, image, time series, etc.) related to clinical trials funded by the National Institute of Mental Health. Sharing data, associated tools, methodologies and results, rather than just summaries or interpretations, accelerates research progress. Community-wide sharing requires common data definitions and standards, as well as comprehensive and coherent informatics approaches for the sharing of de-identified human subject research data. Built on the National Database for Autism Research (NDAR) informatics platform, NDCT provides a comprehensive data sharing platform for NIMH grantees supporting clinical trials.

  5. De-identified Data from the PArTNER Study: A Pragmatic Clinical Trial to...

    • indigo.uic.edu
    csv
    Updated May 5, 2025
    Cite
    Jerry Krishnan; Sai Dheeraj Illendula; Lynn Gerald; Jun Lu (2025). De-identified Data from the PArTNER Study: A Pragmatic Clinical Trial to Improve Patient Experience During Transitions from Hospital to Home [Dataset]. http://doi.org/10.25417/uic.28889918.v1
    Explore at:
    Available download formats: csv
    Dataset updated
    May 5, 2025
    Dataset provided by
    University of Illinois Chicago
    Authors
    Jerry Krishnan; Sai Dheeraj Illendula; Lynn Gerald; Jun Lu
    License

    http://rightsstatements.org/vocab/InC/1.0/

    Description

    The PArTNER study was a single-center pragmatic randomized clinical trial conducted at a minority-serving hospital in Chicago. It evaluated whether a Navigator intervention, delivered by community health workers and peer coaches, could improve patient experience, health outcomes, and healthcare utilization during the transition from hospital to home among adults hospitalized with heart failure, pneumonia, myocardial infarction (MI), chronic obstructive pulmonary disease (COPD), or sickle cell disease. A total of 1,029 adults, predominantly non-Hispanic Black, participated. The intervention included in-hospital visits, a home visit, and follow-up telephone coaching. The primary outcomes were changes in anxiety and informational support at 30 days post-discharge. The study found no significant overall improvements compared to usual care, although exploratory analyses suggested potential benefits for certain subgroups.

    Data Description: The dataset includes de-identified information on participant demographics, clinical characteristics, social determinants of health, Patient-Reported Outcomes Measurement Information System (PROMIS) scores (e.g., anxiety, informational support), healthcare utilization outcomes (e.g., hospital readmissions, emergency department visits), and intervention engagement. Data were collected through baseline hospital assessments, telephone follow-up surveys at 30 and 60 days post-discharge, and electronic health record reviews.

    Publications related to data:

    • LaBedz, Stephanie L., et al. "Pragmatic clinical trial to improve patient experience among adults during transitions from hospital to home: the PArTNER study." Journal of General Internal Medicine 37.16 (2022): 4103-4111.
    • Prieto-Centurion, Valentin, et al. "Design of the patient navigator to Reduce Readmissions (PArTNER) study: a pragmatic clinical effectiveness trial." Contemporary Clinical Trials Communications 15 (2019): 100420.

  6. Data from: Patient Consent to Publication and Data Sharing in Industry and...

    • datacatalog.hshsl.umaryland.edu
    Updated Mar 27, 2024
    Cite
    O'Mareen Spence; Richie Onwuchekwa Uba; Seongbin Shin; Peter Doshi (2024). Patient Consent to Publication and Data Sharing in Industry and NIH-Funded Clinical Trials [Dataset]. http://doi.org/10.5281/zenodo.1231072
    Explore at:
    Dataset updated
    Mar 27, 2024
    Dataset provided by
    HS/HSL
    Authors
    O'Mareen Spence; Richie Onwuchekwa Uba; Seongbin Shin; Peter Doshi
    Time period covered
    Jan 1, 1983 - Dec 31, 2013
    Description

    Clinical trial participants are often motivated by the altruistic assumption that study results will contribute to medical knowledge. Additionally, the sharing of research data is rapidly developing into an ethical standard. An evaluation of 144 blank (sample) informed consent forms (ICF) was undertaken to determine the extent to which clinical trial participants were apprised of researchers’ intent to publish results, share de-identified data, and the overall benefit to medical knowledge. This dataset consists of 98 ICFs from industry-funded trials from the European Medicines Agency (EMA) and 46 ICFs from publicly-funded trials listed in the National Heart, Lung and Blood Institute (NHLBI) Biologic Specimen and Data Repository Information Coordinating Center (BioLINCC). The documents were reviewed for identification and extraction of stated or implied language for the following 5 aspects of each study: publication of results, sharing de-identified data, data ownership, confidentiality of identifiable data, and whether the trial will produce knowledge that offers public benefit. Results indicate that investigators rarely disclose intent to share de-identified data or commitment to publish. All ICFs are available via 2 zip files, one for the industry-funded trials and the other for the trials in BioLINCC. Also included is the study extraction sheet.

  7. REVAMP Clinical Trial Dataset

    • dataone.org
    • search.dataone.org
    Updated Nov 12, 2023
    Cite
    Siedner, Mark (2023). REVAMP Clinical Trial Dataset [Dataset]. http://doi.org/10.7910/DVN/SKUUOP
    Explore at:
    Dataset updated
    Nov 12, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Siedner, Mark
    Description

    Data are derived from the Resistance Testing for Management of HIV Virologic Failure in Sub-Saharan Africa (REVAMP) clinical trial. The de-identified dataset includes randomization allocation, baseline participant characteristics, and primary and secondary outcomes.

  8. NIDA Data Share

    • neuinfo.org
    • rrid.site
    • +2 more
    Updated Jan 29, 2022
    Cite
    (2022). NIDA Data Share [Dataset]. http://identifiers.org/RRID:SCR_002002
    Explore at:
    Dataset updated
    Jan 29, 2022
    Description

    Website which allows data from completed clinical trials to be distributed to investigators and public. Researchers can download de-identified data from completed NIDA clinical trial studies to conduct analyses that improve quality of drug abuse treatment. Incorporates data from Division of Therapeutics and Medical Consequences and Center for Clinical Trials Network.

  9. De-Identified Health Data Market Size, Share, Trend Analysis by 2033

    • emergenresearch.com
    pdf,excel,csv,ppt
    Updated Dec 8, 2024
    Cite
    Emergen Research (2024). De-Identified Health Data Market Size, Share, Trend Analysis by 2033 [Dataset]. https://www.emergenresearch.com/industry-report/de-identified-health-data-market
    Explore at:
    Available download formats: pdf, excel, csv, ppt
    Dataset updated
    Dec 8, 2024
    Dataset authored and provided by
    Emergen Research
    License

    https://www.emergenresearch.com/privacy-policy

    Area covered
    Global
    Variables measured
    Base Year, No. of Pages, Growth Drivers, Forecast Period, Segments covered, Historical Data for, Pitfalls Challenges, 2033 Value Projection, Tables, Charts, and Figures, Forecast Period 2024 - 2033 CAGR, and 1 more
    Description

    The De-Identified Health Data Market size is expected to reach a valuation of USD 17.23 billion in 2033 growing at a CAGR of 9.50%. The De-Identified Health Data market research report classifies market by share, trend, demand, forecast and based on segmentation.
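    The relationship between the 2033 valuation and the CAGR quoted above can be worked through with the standard compound-growth formula, end = start × (1 + rate)^years. The sketch below assumes nine compounding years (2024 to 2033), which is an interpretation of the listed forecast period rather than something the report summary states; the function names are illustrative.

```python
def implied_start_value(end_value, cagr, years):
    """Value that grows to end_value after `years` of compounding at `cagr`."""
    return end_value / (1.0 + cagr) ** years


def projected_value(start_value, cagr, years):
    """Value reached after compounding start_value at `cagr` for `years`."""
    return start_value * (1.0 + cagr) ** years


END_VALUE_2033 = 17.23  # USD billion, from the description above
CAGR = 0.095            # 9.50%, from the description above
YEARS = 9               # assumption: compounding from 2024 to 2033

start_2024 = implied_start_value(END_VALUE_2033, CAGR, YEARS)
print(f"Implied 2024 market size: USD {start_2024:.2f} billion")
```

    Changing `YEARS` shows how sensitive the implied starting value is to the choice of base year.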

  10. Data from: Sharing of clinical trial data among trialists: a cross sectional...

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated Dec 19, 2012
    Cite
    Vinay Rathi; Kristina Dzara; Cary P. Gross; Iain Hrynaszkiewicz; Steven Joffe; Harlan M. Krumholz; Kelly M. Strait; Joseph S. Ross (2012). Sharing of clinical trial data among trialists: a cross sectional survey [Dataset]. http://doi.org/10.5061/dryad.6544v
    Explore at:
    Available download formats: zip
    Dataset updated
    Dec 19, 2012
    Dataset provided by
    BioMed Central (http://www.biomedcentral.com/)
    Yale School of Medicine
    Yale New Haven Hospital
    Boston Children's Hospital
    Authors
    Vinay Rathi; Kristina Dzara; Cary P. Gross; Iain Hrynaszkiewicz; Steven Joffe; Harlan M. Krumholz; Kelly M. Strait; Joseph S. Ross
    License

    https://spdx.org/licenses/CC0-1.0.html

    Area covered
    United States, Other, Western Europe
    Description

    Objective: To investigate clinical trialists’ opinions and experiences of sharing of clinical trial data with investigators who are not directly collaborating with the research team. Design and setting: Cross sectional, web based survey. Participants: Clinical trialists who were corresponding authors of clinical trials published in 2010 or 2011 in one of six general medical journals with the highest impact factor in 2011. Main outcome measures: Support for and prevalence of data sharing through data repositories and in response to individual requests, concerns with data sharing through repositories, and reasons for granting or denying requests. Results: Of 683 potential respondents, 317 completed the survey (response rate 46%). In principle, 236 (74%) thought that sharing de-identified data through data repositories should be required, and 229 (72%) thought that investigators should be required to share de-identified data in response to individual requests. In practice, only 56 (18%) indicated that they were required by the trial funder to deposit the trial data in a repository; of these 32 (57%) had done so. In all, 149 respondents (47%) had received an individual request to share their clinical trial data; of these, 115 (77%) had granted and 56 (38%) had denied at least one request. Respondents’ most common concerns about data sharing were related to appropriate data use, investigator or funder interests, and protection of research subjects. Conclusions: We found strong support for sharing clinical trial data among corresponding authors of recently published trials in high impact general medical journals who responded to our survey, including a willingness to share data, although several practical concerns were identified.
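    The percentages reported in the results can be recomputed from the counts given in the abstract. The short check below is a sketch; the labels and the `pct` helper are ours, and the denominators are inferred from the wording ("of these").

```python
def pct(part, whole):
    """Percentage rounded to the nearest whole number."""
    return round(100 * part / whole)


# (label, numerator, denominator, percentage reported in the abstract)
reported = [
    ("response rate", 317, 683, 46),
    ("support required repository sharing", 236, 317, 74),
    ("support sharing on individual request", 229, 317, 72),
    ("required by funder to deposit", 56, 317, 18),
    ("deposited, among those required", 32, 56, 57),
    ("received an individual request", 149, 317, 47),
    ("granted at least one request", 115, 149, 77),
    ("denied at least one request", 56, 149, 38),
]

for label, part, whole, expected in reported:
    status = "ok" if pct(part, whole) == expected else "MISMATCH"
    print(f"{label}: {part}/{whole} = {pct(part, whole)}% ({status})")

# Granted (115) plus denied (56) exceeds 149 because a respondent
# can have granted one request and denied another.
```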

  11. Clean data from survey of statisticians on Adverse Event analysis practices...

    • figshare.com
    bin
    Updated May 31, 2023
    Cite
    Rachel Phillips; Victoria Cornelius (2023). Clean data from survey of statisticians on Adverse Event analysis practices in RCTs [Dataset]. http://doi.org/10.6084/m9.figshare.12436574.v1
    Explore at:
    Available download formats: bin
    Dataset updated
    May 31, 2023
    Dataset provided by
    figshare (http://figshare.com/)
    Authors
    Rachel Phillips; Victoria Cornelius
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset (Stata v15.1) containing responses from a survey of UK Clinical Research Collaboration registered clinical trial units (CTUs) and industry statisticians from both pharmaceutical companies and clinical research organisations (http://dx.doi.org/10.1136/bmjopen-2020-036875). Data is de-identified. The dataset contains descriptive variables describing participants' experience, as well as responses to questions on current adverse event analysis practices, awareness of specialist methods for adverse event analysis, and the priorities, concerns and barriers participants experience when analysing adverse event data.

  12. Data cleaning using unstructured data

    • zenodo.org
    zip
    Updated Jul 30, 2024
    Cite
    Rihem Nasfi; Antoon Bronselaer (2024). Data cleaning using unstructured data [Dataset]. http://doi.org/10.5281/zenodo.13135983
    Explore at:
    Available download formats: zip
    Dataset updated
    Jul 30, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Rihem Nasfi; Antoon Bronselaer
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In this project, we work on repairing three datasets:

    • Trials design: This dataset was obtained from the European Union Drug Regulating Authorities Clinical Trials Database (EudraCT) register, and the ground truth was created from external registries. In the dataset, multiple countries, identified by the attribute country_protocol_code, conduct the same clinical trial, which is identified by eudract_number. Each clinical trial has a title that can help find informative details about the design of the trial.
    • Trials population: This dataset delineates the demographic origins of participants in clinical trials primarily conducted across European countries. It includes structured attributes indicating whether the trial pertains to a specific gender, age group or healthy volunteers. Each of these categories is labeled '1' or '0', denoting whether or not it is included in the trial. It is important to note that the population category should remain consistent across all countries conducting the same clinical trial, identified by a eudract_number. The ground truth samples in the dataset were established by aligning information about the trial populations provided by external registries, specifically the CT.gov database and the German Trials database. Additionally, the dataset comprises unstructured attributes that describe the inclusion criteria for trial participants, such as inclusion.
    • Allergens: This dataset contains information about products and their allergens. The data was collected from the German version of 'Alnatura' (access date: 24 November 2020), from 'Open Food Facts', a free database of food products from around the world, and from the websites 'Migipedia', 'Piccantino', and 'Das Ist Drin'. There may be overlapping products across these websites. Each product in the dataset is identified by a unique code. Samples with the same code represent the same product but are extracted from a different source. An allergen is marked '2' if it is present, '1' if there are traces of it, and '0' if it is absent from a product. The dataset also includes information on the ingredients in the products. Overall, the dataset comprises categorical structured data describing the presence, trace, or absence of specific allergens, and unstructured text describing ingredients.
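    The consistency rule stated for the trials population dataset (every country running the same eudract_number should report the same population flags) can be checked with a few lines of Python. This is a sketch under assumptions: the column name `healthy_volunteers` and the sample rows, including the trial numbers, are made up for illustration; only eudract_number and country_protocol_code come from the description.

```python
from collections import defaultdict


def inconsistent_trials(rows, flag):
    """Return eudract_numbers whose rows disagree on the given flag column."""
    seen = defaultdict(set)
    for row in rows:
        seen[row["eudract_number"]].add(row[flag])
    return sorted(trial for trial, values in seen.items() if len(values) > 1)


# Made-up rows; real data would be read from the dataset's CSV files.
rows = [
    {"eudract_number": "2010-000001-10", "country_protocol_code": "DE", "healthy_volunteers": "0"},
    {"eudract_number": "2010-000001-10", "country_protocol_code": "FR", "healthy_volunteers": "1"},
    {"eudract_number": "2011-000002-20", "country_protocol_code": "BE", "healthy_volunteers": "0"},
    {"eudract_number": "2011-000002-20", "country_protocol_code": "NL", "healthy_volunteers": "0"},
]

conflicts = inconsistent_trials(rows, "healthy_volunteers")
print(conflicts)
```

    Trials flagged this way are candidates for repair, since at least one country's value must be wrong if the rule holds.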

    N.B.: Each '.zip' file contains a set of 5 '.csv' files which are part of the aforementioned datasets:

    • "{dataset_name}_train.csv": samples used for the ML-model training. (e.g. "allergens_train.csv")
    • "{dataset_name}_test.csv": samples used to test the ML-model performance. (e.g. "allergens_test.csv")
    • "{dataset_name}_golden_standard.csv": samples that represent the ground truth of the test samples. (e.g. "allergens_golden_standard.csv")
    • "{dataset_name}_parker_train.csv": samples repaired using Parker Engine, used for the ML-model training. (e.g. "allergens_parker_train.csv")
    • "{dataset_name}_parker_test.csv": samples repaired using Parker Engine, used to test the ML-model performance. (e.g. "allergens_parker_test.csv")
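    The file naming scheme listed above can be generated programmatically, which is handy for checking that an extracted archive is complete. The sketch below follows the example file names given in the list (ending in "allergens_parker_test.csv"); the helper names are ours and nothing is assumed about the archive layout beyond the file names.

```python
SUFFIXES = ["train", "test", "golden_standard", "parker_train", "parker_test"]


def expected_files(dataset_name):
    """File names a dataset archive should contain, per the naming scheme."""
    return [f"{dataset_name}_{suffix}.csv" for suffix in SUFFIXES]


def missing_files(present, dataset_name):
    """Expected file names not found in `present` (directories are ignored)."""
    base_names = {name.rsplit("/", 1)[-1] for name in present}
    return [f for f in expected_files(dataset_name) if f not in base_names]


files = expected_files("allergens")

# `present` could come from zipfile.ZipFile(path).namelist(); here it is made up.
present = ["allergens/allergens_train.csv", "allergens/allergens_test.csv"]
print(missing_files(present, "allergens"))
```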

  13. GRDR

    • neuinfo.org
    • dknet.org
    • +1 more
    Updated Nov 12, 2024
    Cite
    (2024). GRDR [Dataset]. http://identifiers.org/RRID:SCR_008978
    Explore at:
    Dataset updated
    Nov 12, 2024
    Description

    Data repository of de-identified patient data, aggregated in a standardized manner, to enable analyses across many rare diseases and to facilitate various research projects, clinical studies, and clinical trials. The aim is to facilitate drug and therapeutics development, and to improve the quality of life for the many millions of people who are suffering from rare diseases. The goal of GRDR is to enable analyses of data across many rare diseases and to facilitate clinical trials and other studies. During the two-year pilot program, a web-based template will be developed to allow any patient organization to establish a rare disease patient registry. At the conclusion of the program, guidance will be available to patient groups to establish a registry and to contribute de-identified patient data to the GRDR repository. A Request for Information (RFI) was released on February 10, 2012 requesting information from patient groups about their interest in participating in a GRDR pilot project. ORDR selected 30 patient organizations to participate in this pilot program to test the different functionalities of the GRDR: fifteen (15) organizations with established registries and 15 organizations that do not have a patient registry. The 15 patient groups without a registry were selected to assist in testing the implementation of the ORDR Common Data Elements (CDEs) in the newly developed registry infrastructure. These organizations will participate in the development and promotion of a new patient registry for their rare disease. The GRDR program will fund the development and hosting of the registry during the pilot program. Thereafter, the patient registry is expected to be self-sustaining. The 15 established patient registries were selected to integrate their de-identified data into the GRDR to evaluate the data mapping and data import/export processes. The GRDR team will assist these organizations in mapping their existing registry data to the CDEs.

    Participating registries must have a means to export their de-identified registry data into a specified data format that will facilitate loading the data into the GRDR repository on a regular basis. The GRDR will also develop the capability to link patients' data and medical information to donated biospecimens by using a Voluntary Global Unique Patient Identifier (GUID). The identifier will enable the creation of an interface between the patient registries that are linked to biorepositories and the Rare Disease Human Biospecimens/Biorepositories (RD-HUB) http://biospecimens.ordr.info.nih.gov/.

  14. Data in Support of the MIDI-B Challenge (MIDI-B-Synthetic-Validation,...

    • cancerimagingarchive.net
    csv, dicom, n/a +1
    Updated May 2, 2025
    Cite
    The Cancer Imaging Archive (2025). Data in Support of the MIDI-B Challenge (MIDI-B-Synthetic-Validation, MIDI-B-Curated-Validation, MIDI-B-Synthetic-Test, MIDI-B-Curated-Test) [Dataset]. http://doi.org/10.7937/cf2p-aw56
    Explore at:
    Available download formats: sqlite and zip, dicom, csv, n/a
    Dataset updated
    May 2, 2025
    Dataset authored and provided by
    The Cancer Imaging Archive
    License

    https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/

    Time period covered
    May 2, 2025
    Dataset funded by
    National Cancer Institute (http://www.cancer.gov/)
    Description

    Abstract

    These resources comprise a large and diverse collection of multi-site, multi-modality, and multi-cancer clinical DICOM images from 538 subjects infused with synthetic PHI/PII in areas encountered by TCIA curation teams. Also provided is a TCIA-curated version of the synthetic dataset, along with mapping files for mapping identifiers between the two.

    This new MIDI data resource includes DICOM datasets used in the Medical Image De-Identification Benchmark (MIDI-B) challenge at MICCAI 2024. They are accompanied by ground truth answer keys and a validation script for evaluating the effectiveness of medical image de-identification workflows. The validation script systematically assesses de-identified data against an answer key outlining appropriate actions and values for proper de-identification of medical images, promoting safer and more consistent medical image sharing.

    Introduction

    Medical imaging research increasingly relies on large-scale data sharing. However, reliable de-identification of DICOM images still presents significant challenges due to the wide variety of DICOM header elements and pixel data where identifiable information may be embedded. To address this, we have developed an openly accessible synthetic dataset containing artificially generated protected health information (PHI) and personally identifiable information (PII).

    These resources complement our earlier work (Pseudo-PHI-DICOM-data) hosted on The Cancer Imaging Archive. As an example of its use, we also provide a version curated by The Cancer Imaging Archive (TCIA) curation team. This resource builds upon best practices emphasized by the MIDI Task Group, which underscores the importance of transparency, documentation, and reproducibility in de-identification workflows, themes also featured at recent conferences (Synapse:syn53065760) and workshops (2024 MIDI-B Challenge Workshop).

    This framework enables objective benchmarking of de-identification performance, promotes transparency in compliance with regulatory standards, and supports the establishment of consistent best practices for sharing clinical imaging data. We encourage the research community to use these resources to enhance and standardize their medical image de-identification workflows.

    Methods

    Subject Inclusion and Exclusion Criteria

    The source data were selected from imaging already hosted in de-identified form on TCIA. Images containing faces were excluded, and no new human studies were performed for this project.

    Data Acquisition

    To build the synthetic dataset, image series were selected from TCIA’s curated datasets to represent a broad range of imaging modalities (CR, CT, DX, MG, MR, PT, SR, US), manufacturers (including GE, Siemens, Varian, Confirma, Agfa, Eigen, Elekta, Hologic, KONICA MINOLTA, and others), scan parameters, and regions of the body. These were processed to inject the synthetic PHI/PII as described.

    Data Analysis

    Synthetic pools of PHI, like subject and scanning institution information, were generated using the Python package Faker (https://pypi.org/project/Faker/8.10.3/). These were inserted into DICOM metadata of selected imaging files using a system of inheritable rule-based templates outlining re-identification functions for data insertion and logging for answer key creation. Text was also burned into the pixel data of a number of images. By systematically embedding realistic synthetic PHI into image headers and pixel data, accompanied by a detailed ground-truth answer key, our framework supports transparency, documentation, and reproducibility in de-identification practices, aligned with the HIPAA Safe Harbor method, DICOM PS3.15 Confidentiality Profiles, and TCIA best practices.
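    The idea of inserting synthetic PHI into headers while logging an answer key can be sketched as follows. This is not the authors' pipeline: it uses the standard library's `random` module in place of Faker, a plain dictionary in place of a real DICOM file, and made-up name and institution pools. The attribute keywords (PatientName, PatientID, InstitutionName, Modality) are standard DICOM keywords; the function names and the answer-key layout are assumptions.

```python
import random

# Made-up pools standing in for Faker-generated values.
GIVEN_NAMES = ["JANE", "JOHN", "MARIA", "WEI"]
FAMILY_NAMES = ["DOE", "SMITH", "GARCIA", "CHEN"]
INSTITUTIONS = ["Example General Hospital", "Sample Imaging Center"]


def make_synthetic_phi(rng):
    """Generate one set of synthetic identifying values."""
    return {
        # DICOM person names separate family and given name with '^'.
        "PatientName": f"{rng.choice(FAMILY_NAMES)}^{rng.choice(GIVEN_NAMES)}",
        "PatientID": f"{rng.randrange(10**7):07d}",
        "InstitutionName": rng.choice(INSTITUTIONS),
    }


def inject_phi(header, phi, answer_key):
    """Return a copy of the header with PHI inserted, logging each insertion."""
    infused = dict(header)
    for tag, value in phi.items():
        answer_key.append({"tag": tag, "inserted": value})
        infused[tag] = value
    return infused


rng = random.Random(0)  # fixed seed so the run is reproducible
clean_header = {"Modality": "CT", "BodyPartExamined": "CHEST"}
answer_key = []
infused_header = inject_phi(clean_header, make_synthetic_phi(rng), answer_key)
```

    A de-identification tool run on `infused_header` can then be scored against `answer_key`, since every inserted value is known in advance.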

    Usage Notes

    This DICOM collection is split into two datasets, synthetic and curated. The synthetic dataset is the PHI/PII infused DICOM collection accompanied by a validation script and answer keys for testing, refining and benchmarking medical image de-identification pipelines. The curated dataset is a version of the synthetic dataset curated and de-identified by members of The Cancer Imaging Archive curation team. It can be used as a guide, an example of medical image curation best practices. For the purposes of the De-Identification challenge at MICCAI 2024, the synthetic and curated datasets each contain two subsets, a portion for Validation and the other for Testing.

    To link a curated dataset to the original synthetic dataset and answer keys, a mapping between the unique identifiers (UIDs) and patient IDs must be provided in CSV format to the evaluation software. We include the mapping files associated with the TCIA-curated set as an example. Lastly, for both the Validation and Testing datasets, an answer key in sqlite.db format is provided. These components are for use with the Python validation script linked below (4). Combining these components, a user developing or evaluating de-identification methods can ensure they meet a specification for successfully de-identifying medical image data.
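
How the three components fit together (mapping CSV, SQLite answer key, curated output) can be sketched as follows. This is an illustration only and not the linked validation script: the CSV column names (`id_old`, `id_new`), the `answer_key` table schema, and the function names are all assumptions, and an in-memory database stands in for the provided answer key file.

```python
import csv
import io
import sqlite3

# Hypothetical mapping file: original (synthetic) identifiers -> identifiers after curation.
MAPPING_CSV = """id_old,id_new
PATIENT-001,CURATED-A
1.2.3.4,9.8.7.6
"""


def load_mapping(text):
    """Read the CSV into a dict keyed by the curated (new) identifier."""
    return {row["id_new"]: row["id_old"] for row in csv.DictReader(io.StringIO(text))}


def build_answer_key():
    """Create an in-memory stand-in for the SQLite answer key (schema is assumed)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE answer_key (sop_uid TEXT, attribute TEXT, inserted_value TEXT)"
    )
    conn.executemany(
        "INSERT INTO answer_key VALUES (?, ?, ?)",
        [
            ("1.2.3.4", "PatientName", "DOE^JANE"),
            ("1.2.3.4", "InstitutionName", "Lakeview General Hospital"),
        ],
    )
    return conn


def leaked_attributes(curated_header, mapping, conn):
    """List attributes whose injected synthetic value survives in the curated header."""
    original_uid = mapping[curated_header["SOPInstanceUID"]]
    rows = conn.execute(
        "SELECT attribute, inserted_value FROM answer_key WHERE sop_uid = ?",
        (original_uid,),
    )
    return [attr for attr, value in rows if curated_header.get(attr) == value]


mapping = load_mapping(MAPPING_CSV)
conn = build_answer_key()
curated = {
    "SOPInstanceUID": "9.8.7.6",
    "PatientName": "CURATED-A",
    "InstitutionName": "Lakeview General Hospital",  # deliberately left in place
}
print(leaked_attributes(curated, mapping, conn))  # prints ['InstitutionName']
```

The mapping is needed because de-identification replaces the UIDs themselves, so the curated file can no longer be matched to its answer key entries by identifier alone.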

  15. INSIGHT Clinical Research Network

    • datacatalog.med.nyu.edu
    Updated Jun 17, 2025
    Cite
    (2025). INSIGHT Clinical Research Network [Dataset]. https://datacatalog.med.nyu.edu/dataset/10133
    Dataset updated
    Jun 17, 2025
    Area covered
    New York
    Description

    INSIGHT Clinical Research Network is a project funded by the Patient-Centered Outcomes Research Institute (PCORI) and is part of the PCORnet program. The INSIGHT longitudinal datasets bring together New York City organizations including medical schools, medical centers, research support organizations, and practice-based research networks. Over 160 million patient encounters and 365 million diagnoses have been recorded in the central data repository. The INSIGHT datasets include longitudinally collected clinical, patient-reported, and patient-generated information, as well as claims data. Datasets are available at the de-identified patient level, identifiable patient level, and patient cohort level. COVID-19 data are available.

  16. Medical Imaging De-Identification Software Market Research Report 2033

    • growthmarketreports.com
    csv, pdf, pptx
    Updated Jul 5, 2025
    Cite
    Growth Market Reports (2025). Medical Imaging De-Identification Software Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/medical-imaging-de-identification-software-market
    Available download formats: pdf, csv, pptx
    Dataset updated
    Jul 5, 2025
    Dataset authored and provided by
    Growth Market Reports
    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Medical Imaging De-Identification Software Market Outlook

    According to our latest research, the global medical imaging de-identification software market size reached USD 315 million in 2024, driven by the increasing adoption of digital healthcare solutions and stringent regulatory requirements for patient data privacy. The market is expected to grow at a robust CAGR of 13.2% during the forecast period, reaching approximately USD 858 million by 2033. The primary growth factor fueling this expansion is the rising volume of medical imaging data and the escalating need to ensure compliance with data protection laws such as HIPAA, GDPR, and other regional regulations.
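
The relationship between a start value, an end value and a compound annual growth rate can be checked with a short calculation. The sketch below uses the figures quoted above (USD millions); the number of compounding years is an assumption, since this excerpt does not state the base year of the forecast period, and the function names are illustrative.

```python
def project(start_value, cagr, years):
    """Compound `start_value` forward by `cagr` for `years` periods."""
    return start_value * (1 + cagr) ** years


def implied_cagr(start_value, end_value, years):
    """Solve end = start * (1 + r) ** years for r."""
    return (end_value / start_value) ** (1 / years) - 1


# Figures quoted in the report; the period count (8 or 9 years) is an assumption.
start, end = 315, 858
for years in (8, 9):
    rate = implied_cagr(start, end, years)
    projected = project(start, 0.132, years)
    print(f"{years} years: implied CAGR {rate:.2%}, projection at 13.2% = {projected:.0f}")
```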

    The growth trajectory of the medical imaging de-identification software market is underpinned by the exponential increase in digital imaging procedures across healthcare facilities worldwide. As advanced imaging modalities like MRI, CT, and PET scans become standard in diagnostic workflows, the volume of data generated has surged. This data often contains sensitive patient information, making it imperative for healthcare organizations to adopt robust de-identification solutions. The proliferation of health information exchanges and the increasing emphasis on interoperability have further heightened the need for secure and compliant data sharing. These factors collectively foster a conducive environment for the adoption of de-identification software, as organizations seek to balance data utility with stringent privacy requirements.

    Another major driver is the evolving regulatory landscape that mandates strict adherence to patient confidentiality and data protection standards. Regulatory frameworks such as the Health Insurance Portability and Accountability Act (HIPAA) in the United States, the General Data Protection Regulation (GDPR) in Europe, and similar regulations in Asia Pacific and other regions are compelling healthcare providers and research institutions to implement advanced de-identification solutions. These regulations impose hefty penalties for non-compliance, further incentivizing investments in software that can automate and streamline the de-identification process. Moreover, the growing trend of collaborative research and data sharing among healthcare entities necessitates reliable de-identification tools to facilitate secure and lawful data exchange.

    Technological advancements in artificial intelligence and machine learning are also playing a pivotal role in shaping the medical imaging de-identification software market. Modern solutions leverage AI-driven algorithms to enhance the accuracy and efficiency of de-identification processes, reducing the risk of inadvertent data leaks. These innovations are particularly valuable in large-scale research projects, where massive datasets must be anonymized rapidly and without compromising data integrity. Furthermore, the integration of de-identification software with existing healthcare IT infrastructure, such as PACS and EHR systems, is becoming increasingly seamless, making adoption easier for end-users. This technological evolution is expected to drive further market growth over the next decade.

    From a regional perspective, North America currently dominates the medical imaging de-identification software market, accounting for the largest share in 2024. The region’s leadership is attributed to the presence of advanced healthcare infrastructure, high adoption rates of digital health technologies, and stringent regulatory frameworks. Europe follows closely, propelled by GDPR compliance and increasing investments in healthcare IT. The Asia Pacific region is experiencing the fastest growth, fueled by expanding healthcare access, rapid digitalization, and rising awareness of data privacy. Latin America and the Middle East & Africa are also witnessing gradual adoption, supported by ongoing healthcare modernization initiatives and regulatory developments.

    Component Analysis

    The component segment of the medical imaging de-i

  17. Clinical Notes Image Matching

    • gomask.ai
    csv
    Updated Jul 12, 2025
    Cite
    GoMask.ai (2025). Clinical Notes Image Matching [Dataset]. https://gomask.ai/marketplace/datasets/clinical-notes-image-matching
    Available download formats: csv (Unknown)
    Dataset updated
    Jul 12, 2025
    Dataset provided by
    GoMask.ai
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    note_id, image_id, modality, body_part, diagnosis, note_date, note_text, image_date, image_type, patient_id, and 3 more
    Description

    This dataset provides a comprehensive mapping between de-identified clinical notes and their corresponding diagnostic images, enabling advanced research in multi-modal AI for healthcare. Each entry includes rich metadata for both text and imaging, supporting tasks such as automated diagnosis, cross-modal retrieval, and explainable AI in clinical settings.

  18. Optimum Patient Care Research Database (OPCRD)

    • healthdatagateway.org
    unknown
    Updated Aug 10, 2024
    Cite
    Optimum Patient Care (OPC) (2024). Optimum Patient Care Research Database (OPCRD) [Dataset]. http://doi.org/10.2147/POR.S395632
    Available download formats: unknown
    Dataset updated
    Aug 10, 2024
    Dataset provided by
    Optimum Patient Care Limited
    Authors
    Optimum Patient Care (OPC)
    License

    https://opcrd.co.uk/our-database/data-requests/

    Description

    About OPCRD

    Optimum Patient Care Research Database (OPCRD) is a real-world, longitudinal, research database that provides anonymised data to support scientific, medical, public health and exploratory research. OPCRD is established, funded and maintained by Optimum Patient Care Limited (OPC) – which is a not-for-profit social enterprise that has been providing quality improvement programmes and research support services to general practices across the UK since 2005.

    Key Features of OPCRD

    OPCRD has been purposefully designed to facilitate real-world data collection and address the growing demand for observational and pragmatic medical research, both in the UK and internationally. Data held in OPCRD is representative of routine clinical care and thus enables the study of ‘real-world’ effectiveness and health care utilisation patterns for chronic health conditions.

    OPCRD's unique qualities, which set it apart from other research data resources:

    • De-identified electronic medical records of more than 24.9 million patients
    • OPCRD covers all major UK primary care clinical systems
    • OPCRD covers approximately 35% of the UK population
    • One of the biggest primary care research networks in the world, with over 1,175 practices
    • Linked patient-reported outcomes for over 68,000 patients, including Covid-19 patient-reported data
    • Linkage to secondary care data sources, including Hospital Episode Statistics (HES)

    Data Available in OPCRD

    OPCRD has received data contributions from over 1,175 practices and currently holds de-identified, research-ready data for over 24.9 million patients or data subjects. This includes longitudinal primary care patient data and any data relevant to the management of patients in primary care, and thus covers all conditions. The data is derived from both electronic health records (EHR) and patient-reported data from questionnaires delivered as part of quality improvement. OPCRD currently holds patient-reported questionnaire data from over 68,000 patients on Covid-19, asthma, COPD and rare diseases.

    Approvals and Governance

    OPCRD has had NHS research ethics committee (REC) approval to provide anonymised data for scientific and medical research since 2010, with its most recent approval in 2020 (NHS HRA REC ref: 20/EM/0148). OPCRD is governed by the Anonymised Data Ethics and Protocols Transparency committee (ADEPT). All research conducted using anonymised data from OPCRD must gain prior approval from ADEPT. Proceeds from OPCRD data access fees and detailed feasibility assessments are re-invested into OPC services for the continued free provision of patient quality improvement programmes for contributing practices and patients.

    For more information on OPCRD please visit: https://opcrd.co.uk/

  19. Summary of quasi-identifiers requiring further de-identification.

    • plos.figshare.com
    • figshare.com
    xls
    Updated Jun 16, 2023
    Cite
    Alishah Mawji; Holly Longstaff; Jessica Trawin; Dustin Dunsmuir; Clare Komugisha; Stefanie K. Novakowski; Matthew O. Wiens; Samuel Akech; Abner Tagoola; Niranjan Kissoon; J. Mark Ansermino (2023). Summary of quasi-identifiers requiring further de-identification. [Dataset]. http://doi.org/10.1371/journal.pdig.0000027.t004
    Available download formats: xls
    Dataset updated
    Jun 16, 2023
    Dataset provided by
    PLOS Digital Health
    Authors
    Alishah Mawji; Holly Longstaff; Jessica Trawin; Dustin Dunsmuir; Clare Komugisha; Stefanie K. Novakowski; Matthew O. Wiens; Samuel Akech; Abner Tagoola; Niranjan Kissoon; J. Mark Ansermino
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Summary of quasi-identifiers requiring further de-identification.

  20. Data from: A DICOM dataset for evaluation of medical image de-identification...

    • cancerimagingarchive.net
    • dev.cancerimagingarchive.net
    csv, dicom, n/a
    Updated Jan 31, 2021
    Cite
    The Cancer Imaging Archive (2021). A DICOM dataset for evaluation of medical image de-identification [Dataset]. http://doi.org/10.7937/s17z-r072
    Available download formats: dicom, csv, n/a
    Dataset updated
    Jan 31, 2021
    Dataset authored and provided by
    The Cancer Imaging Archive
    License

    https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/

    Time period covered
    Apr 7, 2021
    Dataset funded by
    National Cancer Institute http://www.cancer.gov/
    Description

    Open access or shared research data must comply with HIPAA patient privacy regulations. These regulations require the de-identification of datasets before they can be placed in the public domain. The process of image de-identification is time-consuming, requires significant human resources, and is prone to human error. Automated image de-identification algorithms have been developed, but the research community requires some method of evaluation before such tools can be widely accepted. This evaluation requires a robust dataset that can be used as part of an evaluation process for de-identification algorithms.

    We developed a DICOM dataset that can be used to evaluate the performance of de-identification algorithms. DICOM image information objects were selected from datasets published in TCIA. Synthetic Protected Health Information (PHI) was generated and inserted into selected DICOM data elements to mimic typical clinical imaging exams. The evaluation dataset was de-identified by a TCIA curation team using standard TCIA tools and procedures. We are publishing the evaluation dataset (containing synthetic PHI) and de-identified evaluation dataset (result of TCIA curation) in advance of a potential competition, sponsored by the National Cancer Institute (NCI), for de-identification algorithm evaluation, and de-identification of medical image datasets. The evaluation dataset published here is a subset of a larger evaluation dataset that was created under contract for the National Cancer Institute. This subset is being published to allow researchers to test their de-identification algorithms and promote standardized procedures for validating automated de-identification.

Cite
Francisco H C Felix (2023). De-identification - anonymization [Dataset]. http://doi.org/10.6084/m9.figshare.3545471.v1

De-identification - anonymization

268 scholarly articles cite this dataset
Available download formats: txt
Dataset updated
Jun 2, 2023
Dataset provided by
figshare
Authors
Francisco H C Felix
License

Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

De-identification, anonymization, pseudoanonymization, re-identification

National Institute of Standards and Technology (NIST) documentation declares that the use of these terms is still unclear. The words de-identification, anonymization and pseudoanonymization are sometimes interchangeable and sometimes carry subtly different meanings. To mitigate ambiguity, NIST uses definitions from ISO/TS 25237:2008:

de-identification: “general term for any process of removing the association between a set of identifying data and the data subject.” [p. 3]
anonymization: “process that removes the association between the identifying dataset and the data subject.” [p. 2]
pseudonymization: “particular type of anonymization that both removes the association with a data subject and adds an association between a particular set of characteristics relating to the data subject and one or more pseudonyms.” [p. 5]

Brazilian Portuguese literature largely lacks this terminology, and the terms are more often used in law or information technology. The use of these concepts in health care and research has a specific conceptualization. HIPAA (Health Insurance Portability and Accountability Act), the US regulation of health data privacy protection, establishes standards for the handling of patient personal information (protected health information, PHI) by health care providers (covered entities).
