12 datasets found
  1. De-identification - anonymization

    • figshare.com
    txt
    Updated Jun 2, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Francisco H C Felix (2023). De-identification - anonymization [Dataset]. http://doi.org/10.6084/m9.figshare.3545471.v1
    Explore at:
    txtAvailable download formats
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    figshare
    Authors
    Francisco H C Felix
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    De-identification, anonymization, pseudoanonymization, re-identificationNational Institute of Standards and Technology (NIST) documentation declares that the use of these terms is still unclear. Words de-identification, anonymizatio_ and pseudoanonymization are sometimes interchangeable, sometimes carrying subtle different meanings. To mitigate ambiguity, NIST use definitions from ISO/TS 25237:2008:> de-identification: “general term for any process of removing the association between a set of identifying data and the data subject.” [p. 3] anonymization: “process that removes the association between the identifying dataset and the data subject.” [p. 2] pseudonymization: “particular type of anonymization that both removes the association with a data subject and adds an association between a particular set of characteristics relating to the data subject and one or more pseudonyms.”1 [p. 5]Brazilian portuguese literature largely lacks this terminology, and they are more often used in law or information technology. The utilization of these concepts in health care and research has a specific conceptualization. HIPAA (Health Insurance Portability and Accountability Act), US regulation of health data privacy protection, establishes standards for patient personal information (protected health information - PHI) handling by health care providers (covered entities).

  2. Data De-Identification or Pseudonymity Software Market Report | Global...

    • dataintelo.com
    csv, pdf, pptx
    Updated Jan 7, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Dataintelo (2025). Data De-Identification or Pseudonymity Software Market Report | Global Forecast From 2025 To 2033 [Dataset]. https://dataintelo.com/report/global-data-de-identification-or-pseudonymity-software-market
    Explore at:
    pdf, pptx, csvAvailable download formats
    Dataset updated
    Jan 7, 2025
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policyhttps://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Data De-Identification or Pseudonymity Software Market Outlook



    As of 2023, the global Data De-Identification or Pseudonymity Software market is valued at approximately USD 1.5 billion and is projected to grow at a robust CAGR of 18% from 2024 to 2032, driven by increasing data privacy concerns and stringent regulatory requirements.



    The growth of the Data De-Identification or Pseudonymity Software market is primarily fueled by the exponential increase in data generation across industries. With the advent of IoT, AI, and digital transformation strategies, the volume of data generated has seen an unprecedented spike. Organizations are now more aware of the need to protect sensitive information to comply with global data privacy regulations such as GDPR in Europe and CCPA in California. The need to ensure that personal data is anonymized or de-identified before analysis or sharing has escalated, pushing the demand for these software solutions.



    Another significant growth factor is the rising number of cyber-attacks and data breaches. As data becomes more valuable, it also becomes a prime target for cybercriminals. In response, companies are investing heavily in data privacy and security measures, including de-identification and pseudonymity solutions, to mitigate risks associated with data breaches. This trend is more prevalent in sectors dealing with highly sensitive information like healthcare, finance, and government. Ensuring that data remains secure and private while being useful for analytics is a key driver for the adoption of these technologies.



    Moreover, the evolution of Big Data analytics and cloud computing is also spurring growth in this market. As organizations move their operations to the cloud and leverage big data for decision-making, the importance of maintaining data privacy while utilizing large datasets for analytics cannot be overstated. Cloud-based de-identification solutions offer scalability, flexibility, and cost-effectiveness, making them increasingly popular among enterprises of all sizes. This shift towards cloud deployments is expected to further boost market growth.



    Regionally, North America holds the largest market share due to its advanced technological infrastructure and stringent data protection laws. The presence of major technology companies and a high rate of adoption of advanced solutions in the U.S. and Canada contribute significantly to regional market growth. Europe follows closely, driven by rigorous GDPR compliance requirements. The Asia Pacific region is anticipated to witness the fastest growth, attributed to the increasing digitization and growing awareness about data privacy in countries like India and China.



    As organizations increasingly seek to protect their sensitive data, the concept of Data Protection on Demand is gaining traction. This model allows businesses to access data protection services as and when needed, providing flexibility and scalability. By leveraging cloud-based platforms, companies can implement robust data protection measures without the need for significant upfront investments in infrastructure. This approach not only ensures compliance with data privacy regulations but also offers a cost-effective solution for managing data security. As the demand for on-demand services continues to rise, Data Protection on Demand is poised to become a critical component of data management strategies across various industries.



    Component Analysis



    The Data De-Identification or Pseudonymity Software market by component is segmented into software and services. The software segment dominates the market, driven by the increasing need for automated solutions that ensure data privacy. These software solutions come with a variety of tools and features designed to anonymize or pseudonymize data efficiently, making them essential for organizations managing large volumes of sensitive information. The software market is expanding rapidly, with new innovations and improvements constantly being introduced to enhance functionality and user experience.



    The services segment, though smaller compared to software, plays a crucial role in the market. Services include consulting, implementation, and maintenance, which are essential for the successful deployment and operation of de-identification software. These services help organizations tailor the software to their specific needs, ensuring compliance with regional and industry-specific data protection regulations.

  3. D

    Data De-identification & Pseudonymity Software Market Report | Global...

    • dataintelo.com
    csv, pdf, pptx
    Updated Jan 7, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Dataintelo (2025). Data De-identification & Pseudonymity Software Market Report | Global Forecast From 2025 To 2033 [Dataset]. https://dataintelo.com/report/global-data-de-identification-pseudonymity-software-market
    Explore at:
    pdf, csv, pptxAvailable download formats
    Dataset updated
    Jan 7, 2025
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policyhttps://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Data De-identification & Pseudonymity Software Market Outlook




    The global Data De-identification & Pseudonymity Software Market is projected to reach USD 3.5 billion by 2032, growing at a CAGR of 15.2% from 2024 to 2032. The rise in data privacy regulations and the increasing need for securing sensitive information are key factors driving this growth.




    The accelerating pace of digital transformation across various industries has led to an unprecedented surge in data generation. This voluminous data often contains sensitive information that needs robust protection. The growing awareness regarding data privacy and stringent regulations like GDPR in Europe, CCPA in California, and other data protection laws worldwide are compelling organizations to adopt advanced data de-identification and pseudonymity software. These solutions ensure that sensitive data is anonymized or pseudonymized, thus mitigating the risk of data breaches and ensuring compliance with regulations. Consequently, the adoption of data de-identification and pseudonymity software is rapidly increasing.




    Another significant growth factor is the increased focus on data security by industries such as healthcare, finance, and government. In healthcare, the protection of patient data is paramount, making the industry a significant consumer of de-identification software. Similarly, in the finance sector, protecting customer information is crucial to maintain trust and comply with regulatory requirements. Government agencies dealing with citizen data are also increasingly investing in these technologies to prevent unauthorized access and misuse of sensitive information. The demand for data de-identification and pseudonymity software is thus witnessing a steady rise across these critical sectors.




    Technological advancements and innovation in data security solutions are further propelling market growth. The integration of artificial intelligence and machine learning into de-identification and pseudonymity software has enhanced their effectiveness and efficiency. These advanced technologies enable more accurate and faster processing of large datasets, thereby offering robust data protection. Additionally, the rise of cloud computing and the increasing adoption of cloud-based solutions provide scalable and cost-effective options for organizations, further driving the market.



    In this context, the role of Identity Information Protection Service becomes increasingly crucial. As organizations strive to safeguard sensitive data, these services provide an essential layer of security by ensuring that identity-related information is protected from unauthorized access and misuse. Identity Information Protection Service helps organizations comply with data privacy regulations by offering robust solutions that secure personal identifiers, thus reducing the risk of identity theft and data breaches. By integrating these services, companies can enhance their data protection strategies, ensuring that identity information remains confidential and secure across various platforms and applications.




    Regionally, North America holds the largest market share, driven by stringent data protection regulations and high adoption rates of advanced technologies. Europe follows, with significant contributions from countries like Germany, the UK, and France, driven by GDPR compliance requirements. The Asia Pacific region is expected to witness the highest growth rate due to the rapid digitalization of economies like China and India, coupled with increasing awareness about data privacy. Latin America and the Middle East & Africa regions are also showing promising growth, albeit from a smaller base.



    Component Analysis




    The Data De-identification & Pseudonymity Software Market by component is segmented into software and services. The software segment includes standalone software solutions designed to de-identify or pseudonymize data. This segment is witnessing substantial growth due to the increasing demand for automated and scalable data protection solutions. The software solutions are enhanced with advanced algorithms and AI capabilities, providing accurate de-identification and pseudonymization of large datasets, which is crucial for organizations dealing with massive amounts of sensitive data.




  4. Medical Imaging De-Identification Software Market Research Report 2033

    • growthmarketreports.com
    csv, pdf, pptx
    Updated Jul 5, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Growth Market Reports (2025). Medical Imaging De-Identification Software Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/medical-imaging-de-identification-software-market
    Explore at:
    pdf, csv, pptxAvailable download formats
    Dataset updated
    Jul 5, 2025
    Dataset authored and provided by
    Growth Market Reports
    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Medical Imaging De-Identification Software Market Outlook




    According to our latest research, the global medical imaging de-identification software market size reached USD 315 million in 2024, driven by the increasing adoption of digital healthcare solutions and stringent regulatory requirements for patient data privacy. The market is expected to grow at a robust CAGR of 13.2% during the forecast period, reaching approximately USD 858 million by 2033. The primary growth factor fueling this expansion is the rising volume of medical imaging data and the escalating need to ensure compliance with data protection laws such as HIPAA, GDPR, and other regional regulations.




    The growth trajectory of the medical imaging de-identification software market is underpinned by the exponential increase in digital imaging procedures across healthcare facilities worldwide. As advanced imaging modalities like MRI, CT, and PET scans become standard in diagnostic workflows, the volume of data generated has surged. This data often contains sensitive patient information, making it imperative for healthcare organizations to adopt robust de-identification solutions. The proliferation of health information exchanges and the increasing emphasis on interoperability have further heightened the need for secure and compliant data sharing. These factors collectively foster a conducive environment for the adoption of de-identification software, as organizations seek to balance data utility with stringent privacy requirements.




    Another major driver is the evolving regulatory landscape that mandates strict adherence to patient confidentiality and data protection standards. Regulatory frameworks such as the Health Insurance Portability and Accountability Act (HIPAA) in the United States, the General Data Protection Regulation (GDPR) in Europe, and similar regulations in Asia Pacific and other regions are compelling healthcare providers and research institutions to implement advanced de-identification solutions. These regulations impose hefty penalties for non-compliance, further incentivizing investments in software that can automate and streamline the de-identification process. Moreover, the growing trend of collaborative research and data sharing among healthcare entities necessitates reliable de-identification tools to facilitate secure and lawful data exchange.




    Technological advancements in artificial intelligence and machine learning are also playing a pivotal role in shaping the medical imaging de-identification software market. Modern solutions leverage AI-driven algorithms to enhance the accuracy and efficiency of de-identification processes, reducing the risk of inadvertent data leaks. These innovations are particularly valuable in large-scale research projects, where massive datasets must be anonymized rapidly and without compromising data integrity. Furthermore, the integration of de-identification software with existing healthcare IT infrastructure, such as PACS and EHR systems, is becoming increasingly seamless, making adoption easier for end-users. This technological evolution is expected to drive further market growth over the next decade.




    From a regional perspective, North America currently dominates the medical imaging de-identification software market, accounting for the largest share in 2024. The region’s leadership is attributed to the presence of advanced healthcare infrastructure, high adoption rates of digital health technologies, and stringent regulatory frameworks. Europe follows closely, propelled by GDPR compliance and increasing investments in healthcare IT. The Asia Pacific region is experiencing the fastest growth, fueled by expanding healthcare access, rapid digitalization, and rising awareness of data privacy. Latin America and the Middle East & Africa are also witnessing gradual adoption, supported by ongoing healthcare modernization initiatives and regulatory developments.





    Component Analysis




    The component segment of the medical imaging de-i

  5. p

    CARMEN-I: A resource of anonymized electronic health records in Spanish and...

    • physionet.org
    Updated Apr 20, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Eulalia Farre Maduell; Salvador Lima-Lopez; Santiago Andres Frid; Artur Conesa; Elisa Asensio; Antonio Lopez-Rueda; Helena Arino; Elena Calvo; Maria Jesús Bertran; Maria Angeles Marcos; Montserrat Nofre Maiz; Laura Tañá Velasco; Antonia Marti; Ricardo Farreres; Xavier Pastor; Xavier Borrat Frigola; Martin Krallinger (2024). CARMEN-I: A resource of anonymized electronic health records in Spanish and Catalan for training and testing NLP tools [Dataset]. http://doi.org/10.13026/x7ed-9r91
    Explore at:
    Dataset updated
    Apr 20, 2024
    Authors
    Eulalia Farre Maduell; Salvador Lima-Lopez; Santiago Andres Frid; Artur Conesa; Elisa Asensio; Antonio Lopez-Rueda; Helena Arino; Elena Calvo; Maria Jesús Bertran; Maria Angeles Marcos; Montserrat Nofre Maiz; Laura Tañá Velasco; Antonia Marti; Ricardo Farreres; Xavier Pastor; Xavier Borrat Frigola; Martin Krallinger
    License

    https://github.com/MIT-LCP/license-and-dua/tree/master/draftshttps://github.com/MIT-LCP/license-and-dua/tree/master/drafts

    Description

    The CARMEN-I corpus comprises 2,000 clinical records, encompassing discharge letters, referrals, and radiology reports from Hospital Clínic of Barcelona between March 2020 and March 2022. These reports, primarily in Spanish with some Catalan sections, cover COVID-19 patients with diverse comorbidities like kidney failure, cardiovascular diseases, malignancies, and immunosuppression. The corpus underwent thorough anonymization, validation, and expert annotation, replacing sensitive data with synthetic equivalents. A subset of the corpus features annotations of medical concepts by specialists, encompassing symptoms, diseases, procedures, medications, species, and humans (including family members). CARMEN-I serves as a valuable resource for training and assessing clinical NLP techniques and language models, aiding tasks like de-identification, concept detection, linguistic modifier extraction, document classification, and more. It also facilitates training researchers in clinical NLP and is a collaborative effort involving Barcelona Supercomputing Center's NLP4BIA team, Hospital Clínic, and Universitat de Barcelona's CLiC group.

  6. f

    DataSheet_1_Segmentation stability of human head and neck cancer medical...

    • frontiersin.figshare.com
    pdf
    Updated Jun 21, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Jaakko Sahlsten; Kareem A. Wahid; Enrico Glerean; Joel Jaskari; Mohamed A. Naser; Renjie He; Benjamin H. Kann; Antti Mäkitie; Clifton D. Fuller; Kimmo Kaski (2023). DataSheet_1_Segmentation stability of human head and neck cancer medical images for radiotherapy applications under de-identification conditions: Benchmarking data sharing and artificial intelligence use-cases.pdf [Dataset]. http://doi.org/10.3389/fonc.2023.1120392.s001
    Explore at:
    pdfAvailable download formats
    Dataset updated
    Jun 21, 2023
    Dataset provided by
    Frontiers
    Authors
    Jaakko Sahlsten; Kareem A. Wahid; Enrico Glerean; Joel Jaskari; Mohamed A. Naser; Renjie He; Benjamin H. Kann; Antti Mäkitie; Clifton D. Fuller; Kimmo Kaski
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    BackgroundDemand for head and neck cancer (HNC) radiotherapy data in algorithmic development has prompted increased image dataset sharing. Medical images must comply with data protection requirements so that re-use is enabled without disclosing patient identifiers. Defacing, i.e., the removal of facial features from images, is often considered a reasonable compromise between data protection and re-usability for neuroimaging data. While defacing tools have been developed by the neuroimaging community, their acceptability for radiotherapy applications have not been explored. Therefore, this study systematically investigated the impact of available defacing algorithms on HNC organs at risk (OARs).MethodsA publicly available dataset of magnetic resonance imaging scans for 55 HNC patients with eight segmented OARs (bilateral submandibular glands, parotid glands, level II neck lymph nodes, level III neck lymph nodes) was utilized. Eight publicly available defacing algorithms were investigated: afni_refacer, DeepDefacer, defacer, fsl_deface, mask_face, mri_deface, pydeface, and quickshear. Using a subset of scans where defacing succeeded (N=29), a 5-fold cross-validation 3D U-net based OAR auto-segmentation model was utilized to perform two main experiments: 1.) comparing original and defaced data for training when evaluated on original data; 2.) using original data for training and comparing the model evaluation on original and defaced data. Models were primarily assessed using the Dice similarity coefficient (DSC).ResultsMost defacing methods were unable to produce any usable images for evaluation, while mask_face, fsl_deface, and pydeface were unable to remove the face for 29%, 18%, and 24% of subjects, respectively. When using the original data for evaluation, the composite OAR DSC was statistically higher (p ≤ 0.05) for the model trained with the original data with a DSC of 0.760 compared to the mask_face, fsl_deface, and pydeface models with DSCs of 0.742, 0.736, and 0.449, respectively. Moreover, the model trained with original data had decreased performance (p ≤ 0.05) when evaluated on the defaced data with DSCs of 0.673, 0.693, and 0.406 for mask_face, fsl_deface, and pydeface, respectively.ConclusionDefacing algorithms may have a significant impact on HNC OAR auto-segmentation model training and testing. This work highlights the need for further development of HNC-specific image anonymization methods.

  7. Anonymized DICOM Dataset from 5T Cardiac T1 Mapping Study

    • zenodo.org
    zip
    Updated May 16, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    linqi Ge; linqi Ge (2025). Anonymized DICOM Dataset from 5T Cardiac T1 Mapping Study [Dataset]. http://doi.org/10.5281/zenodo.15438025
    Explore at:
    zipAvailable download formats
    Dataset updated
    May 16, 2025
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    linqi Ge; linqi Ge
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains anonymized DICOM images acquired as part of a cardiac T1 mapping study using a 5T MRI system. All personal identifiers have been removed in compliance with DICOM de-identification standards and institutional ethics approval. The dataset includes pre- and post-contrast MOLLI sequences from healthy volunteers and patients. It is made publicly available for academic and non-commercial research purposes.

  8. D

    Updated PTSS dataset for the FORAS project

    • dataverse.nl
    csv, docx, xlsx
    Updated Feb 5, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Bruno Coimbra; Bruno Coimbra; Rutger Neeleman; Rutger Neeleman; Elizabeth Grandfield; Elizabeth Grandfield; Mirjam van Zuiden; Mirjam van Zuiden; Rens van de Schoot; Rens van de Schoot (2025). Updated PTSS dataset for the FORAS project [Dataset]. http://doi.org/10.34894/CRE6ZC
    Explore at:
    docx(48426), xlsx(9398219), csv(21840732), xlsx(1199186)Available download formats
    Dataset updated
    Feb 5, 2025
    Dataset provided by
    DataverseNL
    Authors
    Bruno Coimbra; Bruno Coimbra; Rutger Neeleman; Rutger Neeleman; Elizabeth Grandfield; Elizabeth Grandfield; Mirjam van Zuiden; Mirjam van Zuiden; Rens van de Schoot; Rens van de Schoot
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Dataset funded by
    Dutch Research Council
    Description

    This updated labeled dataset builds upon the initial systematic review by van de Schoot et al. (2018; DOI: 10.1080/00273171.2017.1412293), which included studies on post-traumatic stress symptom (PTSS) trajectories up to 2016, sourced from the Open Science Framework (OSF). As part of the FORAS project - Framework for PTSS trajectORies: Analysis and Synthesis (funded by the Dutch Research Council, grant no. 406.22.GO.048 and pre-registered at PROSPERO under ID CRD42023494027), we extended this dataset to include publications between 2016 and 2023. In total, the search identified 10,594 de-duplicated records obtained via different search methods, each published with their own search query and result: Exact replication of the initial search: OSF.IO/QABW3 Comprehensive database search: OSF.IO/D3UV5 Snowballing: OSF.IO/M32TS Full-text search via Dimensions data: OSF.IO/7EXC5 Semantic search via OpenAlex: OSF.IO/M32TS Humans (BC, RN) and AI (Bron et al., 2024) have screened the records, and disagreements have been solved (MvZ, BG, RvdS). Each record was screened separately for Title, Abstract, and Full-text inclusion and per inclusion criteria. A detailed screening logbook is available at OSF.IO/B9GD3, and the entire process is described in https://doi.org/10.31234/osf.io/p4xm5. A description of all columns/variables and full methodological details is available in the accompanying codebook. Important Notes: Duplicates: To maintain consistency and transparency, duplicates are left in the dataset and are labeled with the same classification as the original records. A filter is provided to allow users to exclude these duplicates as needed. Anonymized Data: The dataset "...._anonymous" excludes DOIs, OpenAlex IDs, titles, and abstracts to ensure data anonymization during the review process. The complete dataset, including all identifiers, is uploaded under embargo and will be publicly available on 01-10-2025. This dataset serves not only as a valuable resource for researchers interested in systematic reviews of PTSS trajectories and facilitates reproducibility and transparency in the research process but also for data scientists who would like to mimic the screening process using different machine learning and AI models.

  9. P

    Data from: RadCases Dataset

    • paperswithcode.com
    • huggingface.co
    Updated Sep 26, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Michael S. Yao; Allison Chae; Charles E. Kahn Jr.; Walter R. Witschey; James C. Gee; Hersh Sagreiya; Osbert Bastani (2024). RadCases Dataset [Dataset]. https://paperswithcode.com/dataset/radcases
    Explore at:
    Dataset updated
    Sep 26, 2024
    Authors
    Michael S. Yao; Allison Chae; Charles E. Kahn Jr.; Walter R. Witschey; James C. Gee; Hersh Sagreiya; Osbert Bastani
    Description

    RadCases Dataset This HuggingFace (HF) dataset contains the raw case labels for input patient "one-liner" case summaries according to the ACR Appropriateness Criteria. Because many of the sources of data used to construct the RadCases dataset require credentialed access, we cannot publicly release the input patient case summaries. Instead, the "cases" included in this publicly available dataset are the cryptographically secure SHA-512 hashes of the original, "human-readable" cases. In this way, the hashes cannot be used to reconstruct the original RadCases dataset, but can instead be used as a lookup key to determine the ground-truth label for the dataset.

    Setup Prior to using this dataset, you need to download the raw source of patient one-liners first in compliance with each of the source-specific licenses and data usage agreements. The setup process is different for each of the different dataset sources:

    Synthetic: The Synthetic dataset is composed of patient one-liners synthetically generated by OpenAI's ChatGPT. You can find the raw dataset at this GitHub link. No additional setup steps are required for the Synthetic RadCases dataset. USMLE: The USMLE dataset is comprised of practice USMLE Step- 2 and 3 cases from Medbullets that are made available by Chen et al. (2024). The dataset is made publicly available by the cited authors at this GitHub link - we extract the first sentence of each question stem to use as an input patient one-liner in the RadCases dataset. JAMA: The JAMA dataset is comprised of challenging patient one-liners derived from the JAMA Clinical Challenges from the Journal of the American Medical Association (JAMA). Please follow the instructions from @HanjieChen here to first download the dataset. We extract the first sentence of each clinical challenge to use as the input patient one-liner in the RadCases dataset. NEJM: The NEJM dataset is comprised of challenging patient one-liners derived from the NEJM Case Records of the Massachusetts General Hospital from the New England Journal of Medicine (NEJM). We provide a script build_nejm_dataset.py to scrape the case records from the DOIs listed here, which are the same as those used by Savage et al. (2024).. The resulting nejm.jsonl file generated by the script should then be added to the radGPT home directory. BIDMC: The Beth Israel Deaconess Medical Center (BIDMC) dataset is comprised of real anonymized, de-identified patient one-liners derived from the MIMIC-IV Dataset. Please request access to the MIMIC-IV dataset here. The discharge.csv.gz file should then be added to the radGPT/radgpt/data directory.

    Dataset Structure Each row of the dataset is a (SHA-512 hash of a) patient "one-liner" case mapping to an ACR Appropriateness Criteria topic, and also the parent panel of that topic.

    case: the SHA-512 hash of the patient one-liner panel: the ACR Appropriateness Criteria panel label of the patient one-liner topic: the ACR Appropriateness Criteria topic label of the patient one-liner

    Retrieving A Label To retrieve a ground-truth ACR label from this dataset, you can use the following source code:

    import hashlib
    
    prompt = input("Patient One-Liner Case: ")
    hash_gen = hashlib.sha512()
    hash_gen.update(prompt.encode())
    hash_val = str(hash_gen.hexdigest())
    

    The corresponding hash_val variable can then be used to lookup the corresponding panel or topic by matching hash_val with the case value in the RadCases dataset.

    Direct Dataset Usage You can download the contents of this dataset using the following terminal command:

    git clone https://huggingface.co/datasets/michaelsyao/RadCases

  10. c

    University of Missouri Post-operative Glioma Dataset

    • cancerimagingarchive.net
    n/a, nifti, xlsx
    Updated Mar 21, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    The Cancer Imaging Archive (2025). University of Missouri Post-operative Glioma Dataset [Dataset]. http://doi.org/10.7937/7k9k-3c83
    Explore at:
    xlsx, n/a, niftiAvailable download formats
    Dataset updated
    Mar 21, 2025
    Dataset authored and provided by
    The Cancer Imaging Archive
    License

    https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/https://www.cancerimagingarchive.net/data-usage-policies-and-restrictions/

    Time period covered
    Mar 21, 2025
    Dataset funded by
    National Cancer Institutehttp://www.cancer.gov/
    Description

    Abstract

    This dataset includes MR imaging from 203 glioma patients with 617 different post-treatment MR time points, and tumor segmentations. Clinical data includes patient demographics, genomics, and treatment details. Preprocessing of MR images followed a standardized pipeline with automatic tumor segmentation based on nnUNet deep learning approach. The automatic tumor segmentations were manually validated and refined by neuroradiologists.

    The heterogeneity of glioma imaging characteristics and management strategies contributes to a lack of reliable findings when evaluating treatment outcomes with conventional MRI, and the overlapping imaging features of radiation necrosis and tumor progression post-treatment can be particularly challenging for radiologists. This robust dataset should contribute to the development of AI models to improve evaluation of treatment outcomes.

    Introduction

    The dataset consists of institutional review board-approved retrospective analysis of pathologically proven glioma patients at University Hospital of The University of Missouri - Anatomic Pathology CoPathPlus database was used to collect glioma cases over the last 10 years.

    Sharing segmented postoperative glioma data with clinical information significantly accelerates research and improves clinical practice by providing a comprehensive, readily available dataset. This eliminates the time-consuming burden of manual segmentation, enhances the accuracy and consistency of tumor delineation, and allows researchers to focus on analysis and interpretation, ultimately driving the development of more accurate segmentation algorithms, predictive models for personalized treatment strategies, and improved patient outcome predictions. Standardized longitudinal follow-up and benchmarking capabilities further facilitate multi-center studies and objective evaluation of treatment efficacy, leading to advancements in glioma biology and personalized patient care.

    Methods

    The following subsections provide information about how the data were selected, acquired, and prepared for publication.

    Subject Inclusion and Exclusion Criteria

    The selection criteria for the CoPath Natural Language II Search included accession dates ranging from 01/01/2021 to 02/20/2024. To ensure all relevant diagnoses for this study were included; three separate keyword searches were performed using "glioma", "astrocytoma", and "glioblastoma". The search only included keyword results that were present in the Final Diagnoses. "Glioma" returned 85 cases; "Astrocytoma" returned 67 cases; and "Glioblastoma" returned 215 cases. Following the exclusion of duplicate cases, those missing any of the four requisite MR imaging sequences, and cases that failed processing through our pipeline, our final cohort comprised 203 patients.

    Data Acquisition

    Radiology: MRI studies on our McKesson Radiology 12.2 Picture archiving and communication system (PACS) (Change Healthcare Radiology Solutions, Nashville, Tennessee, U.S) were exported. The image exportation process involved multiple personnels of varying ranks, including medical graduates, radiology residents, neuroradiology fellows, and neuroradiologists. Our team exported the four basic conventional MR sequences including T1, T1 with IV gadolinium-based contrast agent administration, T2, and Fluid Attenuated Inversion Recovery (FLAIR) into a HIPPA compliant MU secured research server.

    For each patient, the images were thoroughly checked for including up to six post-treatment images as available. The post-treatment images were captured on different dates, though not all patients had the maximum number of follow-up images; some had as few as one post-treatment follow-up MRI. For patients with more frequent follow-up MRIs, the immediate post-operative scan, at least one time point of progression and another follow-up study. The MR images were comprehensively reviewed to exclude significantly motion degraded or suboptimal studies.

    The majority of the studies were conducted using Siemens MRI machines 97.47%, n=579 with a smaller proportion performed on MRI machines from other vendors: GE (2.02%, n=12) and Philips (0.51%, n=3). Table 1 shows the distribution of studies across different Siemens MR machines. Regarding the magnetic field strength, 1.5T MRIs accounted for 48.14% (n=1,126), 3T MRIs accounted for 45.08% (n=318), and 3T MRIs accounted for 45.08% (n=261). Table 2 summarizes the MRI parameters of each MR sequence.

    Our team made efforts to obtain 3D sequences whenever available. Scans were performed using 3D acquisition methods in 40.28% of cases (n=975) and 2D acquisition methods in 59.82% of cases (n=1,419). In cases where 3D images were not available, 2D images were utilized instead. Table 3 summarizes the counts and percentage of studies performed with 2D vs 3D acquisition across different MR sequences.

    Clinical: Basic demographic data, clinical data points, and tumor pathology were obtained through review of the electronic medical record (EMR). Clinical data points included the date of diagnosis, date of first surgery or treatment, date and characterization of first and/or subsequent disease progression and/or recurrence, and date of any follow-up resections. Survival information included the date of death and, if that was unknown, the date of last known contact while alive. Disease progression and/or recurrence was characterized as imaging only, clinical only, or both based on information obtained through review of each patient’s clinical notes, brain imaging, and clinical impression as documented by the primary care team. Brief summaries of the reasoning behind each characterization were also included. Patients with no further clinical contact beyond their primary treatment were documented as “lost to follow-up.” Pathological information was obtained through review of the initial pathology note and any subsequent addenda for each tumor sample and included final tumor diagnosis, grade, and any identified genetic mutations. This information was then compiled into a spreadsheet for analysis.

    Data Analysis

    The image data underwent preprocessing using the Federated Tumor Segmentation (FeTS) tool. The pipeline began with converting DICOM files to the Neuroimaging Informatics Technology Initiative (NIfTI) format, ensuring the removal of any remaining PHI not eliminated by the anonymization/de-identification tool. The converted NIfTI images were then resampled to an isotropic 1mm³ resolution and co-registered to the standard anatomical human brain atlas, SRI24. A deep learning brain extraction method was applied to strip the skull and extracranial tissues, thereby mitigating any potential facial reconstruction or recognition risks.

    The preprocessed images were segmented using a deep network based on nnU-Net, resulting in four distinct labels that correspond to different components of each tumor:

    • Label 1: Non-enhancing Tumor Core (NETC). This label identifies non-enhancing components within the tumor, such as cystic, necrotic, or hemorrhagic portions.
    • Label 2: Surrounding Non-enhancing FLAIR Hyperintensity (SNFH). This label represents both non-enhancing infiltrative tumor components and peritumoral vasogenic edema.
    • Label 3: Enhancing Tissue (ET). This label highlights the viable nodular-enhancing components of the tumor.
    • Label 4: Resection Cavity (RC). This label covers post-surgical changes, including recent changes like blood products and air foci, as well as chronic changes with materials isointense to CSF signal.

    A spreadsheet is also provided that includes tumor volumes and signal intensity of different tumor components across various MR sequences.

    Usage Notes

    Each scan was manually exported using the built-in McKesson DICOM export tool into separate folders labeled as post-treatment 1, post-treatment 2, etc. In a subsequent step, a subset of the data was selected to contribute for the development of FeTS 2 toolbox. Consequently, the naming convention was updated to replace "post-treatment" with "timepoint" (e.g., post-treatment 1 became timepoint 1) to adhere to the instructions of the FeTS development team. Each sequence was saved in its own folder within these categories to a HIPPA compliant and secured server within the University of Missouri network. Exportation was conducted in DICOM format, maintaining the original image compression settings to preserve quality. To ensure patient privacy and HIPPA compliance, all images were anonymized and all protected health information (PHI) e.g. patient name, MRN, accession number, etc. were deleted from the metadata DICOM headers.

    The folders are labeled in the following structure:

    • Main folder: PatientID_XXXX
    • Subfolders: Timepoint_X, Timepoint_X
    • Each time point folder has the NIfTI images associated with the respective timepoints.

  11. f

    Patient exit interview dataset.

    • plos.figshare.com
    bin
    Updated Jun 5, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Everlyn Waweru; Tom Smekens; Joanna Orne-Gliemann; Freddie Ssengooba; Jacqueline Broerse; Bart Criel (2023). Patient exit interview dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0236524.s004
    Explore at:
    binAvailable download formats
    Dataset updated
    Jun 5, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Everlyn Waweru; Tom Smekens; Joanna Orne-Gliemann; Freddie Ssengooba; Jacqueline Broerse; Bart Criel
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A STATA dataset with anonymized and de-identified patient data and responses. (DTA)

  12. d

    In-vitro efficacy of fluoroquinolones and carbapenems against...

    • datadryad.org
    zip
    Updated May 6, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Aakash Parajuli; Shila Shrestha; Shiba Kumar Rai (2025). In-vitro efficacy of fluoroquinolones and carbapenems against biofilm-forming and non-forming non-fermenting gram-negative bacteria isolated from clinical specimens [Dataset]. http://doi.org/10.5061/dryad.ghx3ffc1f
    Explore at:
    zipAvailable download formats
    Dataset updated
    May 6, 2025
    Dataset provided by
    Dryad
    Authors
    Aakash Parajuli; Shila Shrestha; Shiba Kumar Rai
    Description

    In-vitro efficacy of fluoroquinolones and carbapenems against biofilm-forming and non-forming non-fermenting gram-negative bacteria isolated from clinical specimens

    Dataset DOI: 10.5061/dryad.ghx3ffc1f

    Description of the data and file structure

    Comparative In-Vitro Efficacy of Fluoroquinolones and Carbapenems among Biofilm-Forming and Non-Forming Non-Fermenters Isolated from Clinical Specimens

    The dataset is of hospital-visiting individuals with infection due to non-fermenter bacteria, i.e., Acinetobacter calcoaceticus-baumanii complex and Pseudomonas aeruginosa.

    The dataset comprises of single sheet. The sheet details for demographic information, such as age group and gender of the infected patients; clinical information, including clinical samples; microbiological findings comprising bacterial genera, antimicrobial resistance patterns, biofilm formers or non-formers, inhibitory concentrations of fluoroquinolones (norfloxacin, ciprofloxacin, ofl...

  13. Not seeing a result you expected?
    Learn how you can add new datasets to our index.

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
Francisco H C Felix (2023). De-identification - anonymization [Dataset]. http://doi.org/10.6084/m9.figshare.3545471.v1
Organization logo

De-identification - anonymization

Explore at:
270 scholarly articles cite this dataset (View in Google Scholar)
txtAvailable download formats
Dataset updated
Jun 2, 2023
Dataset provided by
figshare
Authors
Francisco H C Felix
License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

De-identification, anonymization, pseudoanonymization, re-identificationNational Institute of Standards and Technology (NIST) documentation declares that the use of these terms is still unclear. Words de-identification, anonymizatio_ and pseudoanonymization are sometimes interchangeable, sometimes carrying subtle different meanings. To mitigate ambiguity, NIST use definitions from ISO/TS 25237:2008:> de-identification: “general term for any process of removing the association between a set of identifying data and the data subject.” [p. 3] anonymization: “process that removes the association between the identifying dataset and the data subject.” [p. 2] pseudonymization: “particular type of anonymization that both removes the association with a data subject and adds an association between a particular set of characteristics relating to the data subject and one or more pseudonyms.”1 [p. 5]Brazilian portuguese literature largely lacks this terminology, and they are more often used in law or information technology. The utilization of these concepts in health care and research has a specific conceptualization. HIPAA (Health Insurance Portability and Accountability Act), US regulation of health data privacy protection, establishes standards for patient personal information (protected health information - PHI) handling by health care providers (covered entities).

Search
Clear search
Close search
Google apps
Main menu