100+ datasets found

Clinical Trial Adverse Events and Safety Data
gomask.ai
csv, json
Updated Nov 2, 2025
Cite
GoMask.ai (2025). Clinical Trial Adverse Events and Safety Data [Dataset]. https://gomask.ai/marketplace/datasets/clinical-trial-adverse-events-and-safety-data
Explore at:
Available download formats: csv (10 MB), json
Dataset updated
Nov 2, 2025
Dataset provided by
GoMask.ai
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Time period covered
2024 - 2025
Area covered
Global
Variables measured
outcome, subject_id, meddra_code, meddra_term, report_date, action_taken, reporter_role, event_end_date, severity_grade, adverse_event_id, and 6 more
Description
This dataset provides comprehensive, standardized reporting of adverse events and safety data from clinical trials, including event details, severity, regulatory coding, and pharmacovigilance notes. It enables robust safety monitoring, regulatory submissions, and data-driven risk assessments for investigational drugs.
Data from: Clinical Dataset
kaggle.com
zip
Updated Oct 5, 2023
Cite
Mohamadreza Momeni (2023). Clinical Dataset [Dataset]. https://www.kaggle.com/datasets/imtkaggleteam/clinical-dataset
Explore at:
Available download formats: zip (16220 bytes)
Dataset updated
Oct 5, 2023
Authors
Mohamadreza Momeni
Description
The purest type of electronic clinical data which is obtained at the point of care at a medical facility, hospital, clinic or practice. Often referred to as the electronic medical record (EMR), the EMR is generally not available to outside researchers. The data collected includes administrative and demographic information, diagnosis, treatment, prescription drugs, laboratory tests, physiologic monitoring data, hospitalization, patient insurance, etc.

Individual organizations such as hospitals or health systems may provide access to internal staff. Larger collaborations, such as the NIH Collaboratory Distributed Research Network, provide mediated or collaborative access to clinical data repositories for eligible researchers. Additionally, the UW De-identified Clinical Data Repository (DCDR) and the Stanford Center for Clinical Informatics allow for initial cohort identification.

About Dataset:

333 scholarly articles cite this dataset.

Unique identifier: DOI

Dataset updated: 2023

Authors: Haoyang Mi

This dataset contains two datasets:

1- Clinical Data_Discovery_Cohort: Name of columns: Patient ID Specimen date Dead or Alive Date of Death Date of last Follow Sex Race Stage Event Time

2- Clinical_Data_Validation_Cohort Name of columns: Patient ID Survival time (days) Event Tumor size Grade Stage Age Sex Cigarette Pack per year Type Adjuvant Batch EGFR KRAS

Feel free to share your thoughts and analysis of these datasets in a notebook. You can also build interesting and valuable ML projects on them. Thanks for your attention.
Characteristics of randomised controlled clinical trials included in the...
plos.figshare.com
datasetcatalog.nlm.nih.gov
xls
Updated Jun 20, 2023
Cite
Pei-Yuan Zuo; Xing-Lin Chen; Yu-Wei Liu; Chang-Liang Xiao; Cheng-Yun Liu (2023). Characteristics of randomised controlled clinical trials included in the meta-analysis. [Dataset]. http://doi.org/10.1371/journal.pone.0102484.t001
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0102484.t001
Dataset updated
Jun 20, 2023
Dataset provided by
PLOS ONE
Authors
Pei-Yuan Zuo; Xing-Lin Chen; Yu-Wei Liu; Chang-Liang Xiao; Cheng-Yun Liu
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Abbreviations and notes: NA, data not available; CNS, central nervous system. Funding sources: three trials were sponsored by Genentech [12], [14], [23]; eleven trials were supported by the National Cancer Institute and National Institutes of Health [9], [15]–[22], [24], [25]; one trial was supported by Roche Australia [10]; one trial was supported by Cancer Research UK [11]; one trial was supported by F. Hoffmann–La Roche [8]. (a) The dose schedule was converted from mg/kg per schedule.
Data associated with: Study to Understand Fall Reduction and Vitamin D in...
archive.data.jhu.edu
Updated May 28, 2025
Cite
Lawrence J. Appel; Erin D. Michos; Edgar R. Miller III (2025). Data associated with: Study to Understand Fall Reduction and Vitamin D in You (STURDY) randomized clinical trial [Dataset]. http://doi.org/10.7281/T1/PXEROL
Explore at:
Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.7281/T1/PXEROL
Dataset updated
May 28, 2025
Dataset provided by
Johns Hopkins Research Data Repository
Authors
Lawrence J. Appel; Erin D. Michos; Edgar R. Miller III
License
https://archive.data.jhu.edu/api/datasets/:persistentId/versions/1.1/customlicense?persistentId=doi:10.7281/T1/PXEROL
Dataset funded by
Johns Hopkins Institute for Clinical and Translation Research
National Institutes of Health
Mid-Atlantic Nutrition Obesity Research Center
Description
This is the limited access database for the Study to Understand Fall Reduction and Vitamin D in You (STURDY) randomized response-adaptive clinical trial. The database includes baseline, treatment and post randomization data. This Database includes a set of files pertaining to the full study population (688 randomized participants plus screenees who were not randomized) and a set of files pertaining to the burn-in cohort (the 406 participants randomized prior to the first adjustment of the randomization probabilities). The Database also includes files that support the analyses included in the primary outcome paper published by the Annals of Internal Medicine (2021;174:(2):145-156). Each data file in the Database corresponds to a specific data collection form or type of data. This documentation notebook includes a SAS PROC CONTENTS listing for each SAS file and a copy of the relevant form if applicable. Each variable on each SAS data file has an associated SAS label. Several STURDY documents, including the final versions of the screening and trial consent statements, the Protocol, and the Manual of Procedures, are included with this documentation notebook to assist with understanding and navigation of STURDY data. Notes on analysis questions and issues are also included, as is a list of STURDY publications.
mimic-iii-clinical-database-demo-1.4
kaggle.com
zip
Updated Apr 1, 2025
Cite
Montassar bellah (2025). mimic-iii-clinical-database-demo-1.4 [Dataset]. https://www.kaggle.com/datasets/montassarba/mimic-iii-clinical-database-demo-1-4
Explore at:
Available download formats: zip (11100065 bytes)
Dataset updated
Apr 1, 2025
Authors
Montassar bellah
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
Abstract MIMIC-III is a large, freely-available database comprising deidentified health-related data associated with over 40,000 patients who stayed in critical care units of the Beth Israel Deaconess Medical Center between 2001 and 2012 [1]. The MIMIC-III Clinical Database is available on PhysioNet (doi: 10.13026/C2XW26). Though deidentified, MIMIC-III contains detailed information regarding the care of real patients, and as such requires credentialing before access. To allow researchers to ascertain whether the database is suitable for their work, we have manually curated a demo subset, which contains information for 100 patients also present in the MIMIC-III Clinical Database. Notably, the demo dataset does not include free-text notes.

Background In recent years there has been a concerted move towards the adoption of digital health record systems in hospitals. Despite this advance, interoperability of digital systems remains an open issue, leading to challenges in data integration. As a result, the potential that hospital data offers in terms of understanding and improving care is yet to be fully realized.

MIMIC-III integrates deidentified, comprehensive clinical data of patients admitted to the Beth Israel Deaconess Medical Center in Boston, Massachusetts, and makes it widely accessible to researchers internationally under a data use agreement. The open nature of the data allows clinical studies to be reproduced and improved in ways that would not otherwise be possible.

The MIMIC-III database was populated with data that had been acquired during routine hospital care, so there was no associated burden on caregivers and no interference with their workflow. For more information on the collection of the data, see the MIMIC-III Clinical Database page.

Methods The demo dataset contains all intensive care unit (ICU) stays for 100 patients. These patients were selected randomly from the subset of patients in the dataset who eventually die. Consequently, all patients will have a date of death (DOD). However, patients do not necessarily die during an individual hospital admission or ICU stay.

This project was approved by the Institutional Review Boards of Beth Israel Deaconess Medical Center (Boston, MA) and the Massachusetts Institute of Technology (Cambridge, MA). Requirement for individual patient consent was waived because the project did not impact clinical care and all protected health information was deidentified.

Data Description MIMIC-III is a relational database consisting of 26 tables. For a detailed description of the database structure, see the MIMIC-III Clinical Database page. The demo shares an identical schema, except all rows in the NOTEEVENTS table have been removed.

The data files are distributed in comma separated value (CSV) format following the RFC 4180 standard. Notably, string fields which contain commas, newlines, and/or double quotes are encapsulated by double quotes ("). Actual double quotes in the data are escaped using an additional double quote. For example, the string she said "the patient was notified at 6pm" would be stored in the CSV as "she said ""the patient was notified at 6pm""". More detail is provided on the RFC 4180 description page: https://tools.ietf.org/html/rfc4180
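
The quoting rule above can be checked with a short sketch. It uses Python's standard `csv` module, which doubles embedded double quotes by default; the second field (`plain`) is an illustrative addition, not part of the dataset.

```python
import csv
import io

# The example string from the text, containing embedded double quotes.
original = 'she said "the patient was notified at 6pm"'

# Write one row. QUOTE_MINIMAL quotes only the fields that need it and
# escapes each embedded double quote by doubling it.
buffer = io.StringIO()
writer = csv.writer(buffer, quoting=csv.QUOTE_MINIMAL, lineterminator="\r\n")
writer.writerow([original, "plain"])
encoded = buffer.getvalue()

# Read the row back; the reader undoes the doubling.
decoded = next(csv.reader(io.StringIO(encoded)))

print(encoded)
print(decoded)
```

The first field is written exactly as described in the text, and reading it back recovers the original string.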

Usage Notes The MIMIC-III demo provides researchers with an opportunity to review the structure and content of MIMIC-III before deciding whether or not to carry out an analysis on the full dataset.

CSV files can be opened natively using any text editor or spreadsheet program. However, some tables are large, and it may be preferable to navigate the data stored in a relational database. One alternative is to create an SQLite database using the CSV files. SQLite is a lightweight database format which stores all constituent tables in a single file, and SQLite databases interoperate well with a number of software tools.

DB Browser for SQLite is a high quality, visual, open source tool to create, design, and edit database files compatible with SQLite. We have found this tool to be useful for navigating SQLite files. Information regarding installation of the software and creation of the database can be found online: https://sqlitebrowser.org/
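
As a rough illustration of building an SQLite database from CSV files, the sketch below loads one CSV into a table using only the Python standard library. The helper `load_csv`, the table name `patients_sample`, and the sample rows are hypothetical; every column is stored as TEXT for simplicity, which a real import would likely refine.

```python
import csv
import io
import sqlite3


def load_csv(conn, table_name, csv_file):
    """Create table_name from a CSV file object with a header row.

    All columns are created as TEXT; this is a simplification.
    """
    reader = csv.reader(csv_file)
    header = next(reader)
    columns = ", ".join(f'"{name}" TEXT' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table_name}" ({columns})')
    conn.executemany(
        f'INSERT INTO "{table_name}" VALUES ({placeholders})', reader
    )
    conn.commit()


# Made-up rows standing in for one of the demo CSV files.
sample = io.StringIO("subject_id,gender\n1,F\n2,M\n3,F\n")

conn = sqlite3.connect(":memory:")  # use a file path to persist the database
load_csv(conn, "patients_sample", sample)

total = conn.execute("SELECT COUNT(*) FROM patients_sample").fetchone()[0]
female_count = conn.execute(
    "SELECT COUNT(*) FROM patients_sample WHERE gender = ?", ("F",)
).fetchone()[0]
print(total, female_count)
```

Replacing `":memory:"` with a file name would produce a single-file database that tools such as DB Browser for SQLite can open.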

Release Notes Release notes for the demo follow the release notes for the MIMIC-III database.

Acknowledgements This research and development was supported by grants NIH-R01-EB017205, NIH-R01-EB001659, and NIH-R01-GM104987 from the National Institutes of Health. The authors would also like to thank Philips Healthcare and staff at the Beth Israel Deaconess Medical Center, Boston, for supporting database development, and Ken Pierce for providing ongoing support for the MIMIC research community.

Conflicts of Interest The authors declare no competing financial interests.

References Johnson, A. E. W., Pollard, T. J., Shen, L., Lehman, L. H., Feng, M., Ghassemi, M., Mo...
Data from: MIMIC-IV-Ext-BHC: Labeled Clinical Notes Dataset for Hospital...
physionet.org
Updated Feb 3, 2025
Cite
Asad Aali; Dave Van Veen; Yamin Arefeen; Jason Hom; Christian Bluethgen; Eduardo Pontes Reis; Sergios Gatidis; Namuun Clifford; Joseph Daws; Arash Tehrani; Jangwon Kim; Akshay Chaudhari (2025). MIMIC-IV-Ext-BHC: Labeled Clinical Notes Dataset for Hospital Course Summarization [Dataset]. http://doi.org/10.13026/5gte-bv70
Unique identifier
https://doi.org/10.13026/5gte-bv70
Dataset updated
Feb 3, 2025
Authors
Asad Aali; Dave Van Veen; Yamin Arefeen; Jason Hom; Christian Bluethgen; Eduardo Pontes Reis; Sergios Gatidis; Namuun Clifford; Joseph Daws; Arash Tehrani; Jangwon Kim; Akshay Chaudhari
License
https://github.com/MIT-LCP/license-and-dua/tree/master/drafts
Description
This dataset presents a curated collection of preprocessed and labeled clinical notes derived from the MIMIC-IV-Note database. The primary aim of this resource is to facilitate the development and training of machine learning models focused on summarizing brief hospital courses (BHC) from clinical discharge notes.

The dataset contains 270,033 meticulously cleaned and standardized clinical notes containing an average token length of 2,267, ensuring usability for machine learning (ML) applications. Each clinical note is paired with a corresponding BHC summary, providing a robust foundation for supervised learning tasks. The preprocessing pipeline employed uses regular expressions to address common issues in the raw clinical text, such as special characters, extraneous whitespace, inconsistent formatting, and irrelevant text, to produce a high-quality, structured dataset with separated clinical note sections through appropriate headings.
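
To make the kind of regex-based cleanup described above concrete, here is a minimal sketch. It is not the dataset authors' pipeline: the function `clean_note`, the specific patterns, and the sample text are assumptions chosen only to illustrate handling of special characters and extraneous whitespace.

```python
import re


def clean_note(text: str) -> str:
    """Apply a few illustrative regex cleanup steps to raw note text."""
    # Replace control and non-ASCII characters (tabs and newlines are kept).
    text = re.sub(r"[^\x20-\x7E\n\t]", " ", text)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Trim spaces around line breaks.
    text = re.sub(r" *\n *", "\n", text)
    # Allow at most one blank line between sections.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


# Made-up text; not taken from any clinical note.
raw = "Chief  Complaint:\tchest pain \n\n\n\nHistory:\x0c stable  overnight  "
cleaned = clean_note(raw)
print(cleaned)
```

Dropping all non-ASCII characters is a blunt choice that would also remove legitimate symbols, so a real pipeline would probably be more selective.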

By offering this resource, we aim to support healthcare professionals and researchers in their efforts to enhance patient care through the automation of BHC summarization. This dataset is ideal for exploring various NLP techniques, developing predictive models, and improving the efficiency and accuracy of clinical documentation practices. We invite the research community to utilize this dataset to advance the field of medical informatics and contribute to better health outcomes.
Automated Clinical Note Redaction Services Market Research Report 2033
growthmarketreports.com
csv, pdf, pptx
Updated Oct 3, 2025
Cite
Growth Market Reports (2025). Automated Clinical Note Redaction Services Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/automated-clinical-note-redaction-services-market
Explore at:
Available download formats: csv, pptx, pdf
Dataset updated
Oct 3, 2025
Dataset authored and provided by
Growth Market Reports
Time period covered
2024 - 2032
Area covered
Global
Description
Automated Clinical Note Redaction Services Market Outlook

According to our latest research, the global Automated Clinical Note Redaction Services market size in 2024 stands at USD 1.26 billion, demonstrating robust adoption across healthcare and research sectors. The market is set to expand at a CAGR of 19.2% from 2025 to 2033, reaching a projected value of USD 5.86 billion by 2033. This remarkable growth is driven by increasing regulatory demands for patient data privacy, the proliferation of electronic health records, and the rising need for secure data sharing in healthcare environments.
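
The figures above are related by the standard compound-growth formula, end = start × (1 + rate)^years. The sketch below applies it to the quoted numbers. The choice of nine compounding periods (2024 to 2033) is an assumption, since the report does not say which year serves as the base, so the computed values need not match the quoted ones exactly.

```python
def project_value(start: float, annual_rate: float, years: int) -> float:
    """Value after compounding annual_rate for the given number of years."""
    return start * (1.0 + annual_rate) ** years


def implied_cagr(start: float, end: float, years: int) -> float:
    """Constant annual growth rate that takes start to end in `years` years."""
    return (end / start) ** (1.0 / years) - 1.0


# Figures quoted in the report (USD billions); nine periods is an assumption.
projected_2033 = project_value(1.26, 0.192, 9)
implied_rate = implied_cagr(1.26, 5.86, 9)

print(f"projected 2033 value: {projected_2033:.2f}")
print(f"implied CAGR: {implied_rate:.1%}")
```

Comparing the two printed values against the quoted USD 5.86 billion and 19.2% gives a quick consistency check on the report's stated base year and horizon.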

One of the primary growth factors for the Automated Clinical Note Redaction Services market is the intensifying focus on data privacy and security within the healthcare industry. With the global shift towards digital health records and telemedicine, healthcare organizations are handling unprecedented volumes of sensitive patient data. Stringent regulations such as HIPAA in the United States, GDPR in Europe, and similar frameworks across other regions have made it imperative for healthcare providers to ensure that patient-identifiable information is thoroughly protected. Automated redaction solutions offer a scalable and efficient way to de-identify clinical notes, minimizing the risk of data breaches and ensuring compliance with evolving privacy laws. This is particularly crucial as cyber threats targeting healthcare data continue to rise, prompting organizations to invest in advanced redaction technologies to safeguard their information assets.

Another significant driver propelling market growth is the rapid adoption of artificial intelligence (AI) and machine learning (ML) technologies in healthcare workflows. Automated Clinical Note Redaction Services leverage AI-powered natural language processing (NLP) algorithms to accurately and swiftly identify and redact sensitive data from unstructured clinical notes, pathology reports, and physician documentation. This not only enhances operational efficiency but also reduces manual workload and the potential for human error. As healthcare providers increasingly seek to streamline administrative processes and focus more on patient care, the demand for intelligent automation solutions that can handle large-scale data redaction is expected to surge. Furthermore, the integration of these services with electronic health record (EHR) systems and cloud platforms is making deployment more accessible and scalable for organizations of all sizes.

The expanding scope of data-driven research and analytics in healthcare is also contributing to the market's upward trajectory. Research institutions and health information exchanges are leveraging Automated Clinical Note Redaction Services to facilitate secure data sharing for population health studies, clinical trials, and AI model training, all while maintaining patient anonymity. The ability to extract valuable insights from vast repositories of clinical data without compromising privacy is a key enabler for medical innovation and evidence-based decision-making. As precision medicine and personalized healthcare initiatives gain momentum, the need for compliant, efficient, and automated redaction solutions will become even more pronounced, further fueling market expansion over the coming years.

From a regional perspective, North America dominates the Automated Clinical Note Redaction Services market, accounting for the largest revenue share in 2024, followed closely by Europe and the Asia Pacific. The United States leads the adoption curve due to its advanced healthcare infrastructure, strict regulatory environment, and early integration of digital health technologies. Meanwhile, Europe benefits from robust data protection laws and increasing investments in healthcare IT, while the Asia Pacific region is experiencing rapid growth driven by expanding healthcare access, digitalization initiatives, and rising awareness of data security. Latin America and the Middle East & Africa are also showing promising growth trajectories, albeit from a smaller base, as governments and private players ramp up investments in healthcare modernization and data governance.
Inclusion_Criteria_Annotation
databank.illinois.edu
Updated Dec 14, 2018
Cite
Xiaoru Dong; Jingyi Xie; Hoang Linh (2018). Inclusion_Criteria_Annotation [Dataset]. http://doi.org/10.13012/B2IDB-5958960_V1
Unique identifier
https://doi.org/10.13012/B2IDB-5958960_V1
Dataset updated
Dec 14, 2018
Authors
Xiaoru Dong; Jingyi Xie; Hoang Linh
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Dataset funded by
U.S. National Institutes of Health (NIH)
Description
File Name: Inclusion_Criteria_Annotation.csv

Data Preparation: Xiaoru Dong

Date of Preparation: 2018-12-14

Data Contributions: Jingyi Xie, Xiaoru Dong, Linh Hoang

Data Source: Cochrane systematic reviews published up to January 3, 2018 by 52 different Cochrane groups in 8 Cochrane group networks.

Associated Manuscript authors: Xiaoru Dong, Jingyi Xie, Linh Hoang, and Jodi Schneider.

Associated Manuscript, Working title: Machine classification of inclusion criteria from Cochrane systematic reviews.

Description: The file contains lists of inclusion criteria of Cochrane Systematic Reviews and the manual annotation results. 5420 inclusion criteria were annotated, out of 7158 inclusion criteria available. Annotations are either "Only RCTs" or "Others". There are 2 columns in the file:

- "Inclusion Criteria": Content of inclusion criteria of Cochrane Systematic Reviews.
- "Only RCTs": Manual annotation results. An "x" means the inclusion criteria are classified as "Only RCTs". A blank means the inclusion criteria are classified as "Others".

Notes:

1. "RCT" stands for Randomized Controlled Trial, which, by definition, is "a work that reports on a clinical trial that involves at least one test treatment and one control treatment, concurrent enrollment and follow-up of the test- and control-treated groups, and in which the treatments to be administered are selected by a random process, such as the use of a random-numbers table." [Randomized Controlled Trial publication type definition from https://www.nlm.nih.gov/mesh/pubtypes.html].

2. To reproduce the data related to this file, get the project code published on GitHub at https://github.com/XiaoruDong/InclusionCriteria and run it following the instructions provided.
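
The two-column layout described above (an "x" marks "Only RCTs", a blank marks "Others") can be read with a few lines of Python. This is a minimal sketch: the helper `load_labels` and the two sample rows are made up for illustration and are not taken from the actual file.

```python
import csv
import io


def load_labels(csv_file):
    """Return (criteria_text, label) pairs from the two-column layout."""
    rows = []
    for row in csv.DictReader(csv_file):
        marker = (row["Only RCTs"] or "").strip().lower()
        label = "Only RCTs" if marker == "x" else "Others"
        rows.append((row["Inclusion Criteria"], label))
    return rows


# Made-up rows that follow the documented column names.
sample_csv = io.StringIO(
    "Inclusion Criteria,Only RCTs\n"
    '"Randomised controlled trials, published or unpublished",x\n'
    '"Cohort studies and case series",\n'
)

labelled = load_labels(sample_csv)
for text, label in labelled:
    print(label, "|", text)
```

To use it on the real file, pass an open file handle for Inclusion_Criteria_Annotation.csv instead of the in-memory sample.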
MIMIC-IV-Note: Deidentified free-text clinical notes
physionet.org
Updated Jan 5, 2023
Cite
Alistair Johnson; Tom Pollard; Steven Horng; Leo Anthony Celi; Roger Mark (2023). MIMIC-IV-Note: Deidentified free-text clinical notes [Dataset]. http://doi.org/10.13026/0p14-t007
Unique identifier
https://doi.org/10.13026/0p14-t007
Dataset updated
Jan 5, 2023
Authors
Alistair Johnson; Tom Pollard; Steven Horng; Leo Anthony Celi; Roger Mark
License
https://github.com/MIT-LCP/license-and-dua/tree/master/drafts
Description
The advent of large, open access text databases has driven advances in state-of-the-art model performance in natural language processing (NLP). The relatively limited amount of clinical data available for NLP has been cited as a significant barrier to the field's progress. Here we describe MIMIC-IV-Note: a collection of deidentified free-text clinical notes for patients included in the MIMIC-IV clinical database. MIMIC-IV-Note contains 357,289 deidentified discharge summaries from 161,403 patients admitted to the hospital and emergency department at the Beth Israel Deaconess Medical Center in Boston, MA, USA. The database also contains 2,471,881 deidentified radiology reports for 256,400 patients. All notes have had protected health information removed in accordance with the Health Insurance Portability and Accountability Act (HIPAA) Safe Harbor provision. All notes are linkable to MIMIC-IV providing important context to the clinical data therein. The database is intended to stimulate research in clinical natural language processing and associated areas.
Data from: Clinical Dataset
kaggle.com
Updated Aug 22, 2025
Cite
Laksika Tharmalingam (2025). Clinical Dataset [Dataset]. https://www.kaggle.com/datasets/uom190346a/synthetic-clinical-tabular-dataset
Explore at:
Croissant: a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Aug 22, 2025
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Laksika Tharmalingam
License
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Description
🏥 Clinical Tabular Dataset (Non-PII, Realistic, Research-Ready)

Subtitle

Dataset mimicking real-world patient records for AI research.

Overview

This dataset is a synthetically generated clinical tabular dataset designed to closely mimic real-world patient health records while ensuring zero personally identifiable information (PII). It was created using statistical distributions, clinical guidelines, and publicly available medical references to replicate patterns typically observed in hospital and outpatient settings.

Unlike real EHR datasets, this synthetic dataset is free from privacy restrictions, making it safe to use for AI/ML model training, benchmarking, academic research, and prototyping healthcare applications.

Dataset Details

Number of Records: 10000 patients (scalable to millions)

Data Type: Structured, tabular

Format: CSV (Comma-separated values)

Domain: Healthcare / Clinical informatics

PII Free: ✅ (No names, IDs, or sensitive personal details)

🔍 Columns & Clinical Context

Demographics: Age, Sex, BMI

Vitals: Systolic/Diastolic BP, Glucose, Cholesterol, Creatinine

Comorbidities: Diabetes, Hypertension

Diagnosis: Normal, Pneumonia, Heart Failure, Sepsis

Outcomes: 30-day Readmission, Mortality

Applications

This dataset can be used for:

Machine Learning: Classification, clustering, regression models

Healthcare AI: Predictive modeling for risk factors and disease detection

Data Science Education: Hands-on exercises for students

Synthetic Data Research: Benchmarking synthetic data generation approaches

Fairness & Bias Testing: Evaluating ML models across age, gender, and lifestyle groups

Why This Dataset?

Realistic: Matches clinical ranges and distributions found in actual healthcare data

Safe to Share: 100% synthetic, no HIPAA/GDPR concerns

Flexible: Can be scaled, modified, or extended with more medical variables

High Impact: Fills a major gap in openly available clinical tabular datasets

Disclaimer

This dataset is synthetic and for research/educational purposes only. It should not be used for medical decision-making or clinical care.

Citation

If you use this dataset, please cite as:

Synthetic Clinical Tabular Dataset (2025). Generated for ML research and benchmarking.
Characteristics of 722 notes which are manually evaluated, and their...
plos.figshare.com
figshare.com
xls
Updated Dec 11, 2024
Cite
Hao Zhang; Neil Jethani; Simon Jones; Nicholas Genes; Vincent J. Major; Ian S. Jaffe; Anthony B. Cardillo; Noah Heilenbach; Nadia Fazal Ali; Luke J. Bonanni; Andrew J. Clayburn; Zain Khera; Erica C. Sadler; Jaideep Prasad; Jamie Schlacter; Kevin Liu; Benjamin Silva; Sophie Montgomery; Eric J. Kim; Jacob Lester; Theodore M. Hill; Alba Avoricani; Ethan Chervonski; James Davydov; William Small; Eesha Chakravartty; Himanshu Grover; John A. Dodson; Abraham A. Brody; Yindalon Aphinyanaphongs; Arjun Masurkar; Narges Razavian (2024). Characteristics of 722 notes which are manually evaluated, and their corresponding patients. [Dataset]. http://doi.org/10.1371/journal.pdig.0000685.t001
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pdig.0000685.t001
Dataset updated
Dec 11, 2024
Dataset provided by
PLOS Digital Health
Authors
Hao Zhang; Neil Jethani; Simon Jones; Nicholas Genes; Vincent J. Major; Ian S. Jaffe; Anthony B. Cardillo; Noah Heilenbach; Nadia Fazal Ali; Luke J. Bonanni; Andrew J. Clayburn; Zain Khera; Erica C. Sadler; Jaideep Prasad; Jamie Schlacter; Kevin Liu; Benjamin Silva; Sophie Montgomery; Eric J. Kim; Jacob Lester; Theodore M. Hill; Alba Avoricani; Ethan Chervonski; James Davydov; William Small; Eesha Chakravartty; Himanshu Grover; John A. Dodson; Abraham A. Brody; Yindalon Aphinyanaphongs; Arjun Masurkar; Narges Razavian
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Characteristics of 722 notes which are manually evaluated, and their corresponding patients.
FactEHR Stanford
redivis.com
application/jsonl +7
Updated Jan 30, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Shah Lab (2025). FactEHR Stanford [Dataset]. http://doi.org/10.57761/7hh9-gd47
Explore at:
Available download formats: stata, parquet, spss, arrow, sas, application/jsonl, csv, avro
Unique identifier
https://doi.org/10.57761/7hh9-gd47
Dataset updated
Jan 30, 2025
Dataset provided by
Redivis Inc.
Authors
Shah Lab
Description
Abstract

FactEHR is a dataset for verifying facts in clinical text, containing fact decompositions of 2,168 clinical notes from three hospital systems generated by four language models: GPT4o, o1-mini, Gemini-1.5-Pro, and Llama3-8B. It includes 3,504 textual entailment pairs labeled by 7 clinicians, supporting advanced research in clinical fact verification.

Methodology

1. Overview

FactEHR is a fact decomposition dataset for clinical notes. FactEHR is sampled from three source datasets: MIMIC-III, UCSF's CORAL, and Stanford's MedAlign. For governance reasons, MIMIC and UCSF data must be downloaded from PhysioNet (FactEHR PhysioNet) and MedAlign from Redivis (FactEHR Stanford).

2. FactEHR Stanford

all_human_model_entailment_labels.csv – This file contains entailment pairs sampled from entailment_pairs.csv, labeled by clinical experts to serve as ground truth for entailment tasks.

combined_notes.csv – This file contains all clinical notes from which FactEHR is derived, including their full text, estimated token count, note type, and dataset source.

entailment_pairs.csv – This file consists of entailment pairs derived from fact decompositions and their source clinical notes, with entailment predictions generated by models.

fact_decompositions.csv – This file stores fact decompositions generated by LLMs, linking decomposed facts to clinical notes using unique identifiers.

factehr_dev_set.csv – This file contains entailment pairs sampled from the same data sources as FactEHR (but different notes), intended for entailment model development.

precision_hypotheses.csv – This file contains hypotheses for precision entailment pairs, where the premise is a source clinical note and each hypothesis is a fact from its corresponding decomposition.

recall_hypotheses.csv – This file contains hypotheses for recall entailment pairs, where the premise is a fact decomposition and each hypothesis is a sentence from its original clinical note.
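
The premise/hypothesis roles described for the precision and recall files can be sketched as follows. This is only an illustration of the pairing logic: the `EntailmentPair` class, both helper functions, the choice to join facts with spaces, and the sample note are assumptions, and none of them reflect the actual column layout of the CSV files.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class EntailmentPair:
    premise: str
    hypothesis: str
    kind: str  # "precision" or "recall"


def precision_pairs(note: str, facts: List[str]) -> List[EntailmentPair]:
    """Premise is the source note; each hypothesis is one decomposed fact."""
    return [EntailmentPair(note, fact, "precision") for fact in facts]


def recall_pairs(note_sentences: List[str], facts: List[str]) -> List[EntailmentPair]:
    """Premise is the whole fact decomposition; each hypothesis is a note sentence."""
    decomposition = " ".join(facts)  # joining with spaces is an assumption
    return [EntailmentPair(decomposition, s, "recall") for s in note_sentences]


# Made-up note and facts, for illustration only.
sentences = ["Patient reports chest pain.", "No fever was recorded."]
note = " ".join(sentences)
facts = ["The patient has chest pain.", "The patient has no fever."]

p_pairs = precision_pairs(note, facts)
r_pairs = recall_pairs(sentences, facts)
print(len(p_pairs), len(r_pairs))
```

Precision pairs ask whether each generated fact is supported by the note, while recall pairs ask whether each note sentence is covered by the generated facts.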


Usage

Access to **FactEHR Stanford** requires the following:

Verified Affiliation (Academic, Government, Industry Research Lab). Please use your verified email address when applying; **do not use Gmail or personal email addresses.** Applications using personal, unverified email addresses will be rejected.

Encryption Verification / Attestation for Data Storage

Signing the terms of the MedAlign Data Set License 1.0

Providing a short description of your intended research use of MedAlign

CITI Training


**These data must remain on your encrypted machine. Redistribution of data is FORBIDDEN and will result in immediate termination of access privileges.**

IMPORTANT NOTE: Our policy on derived works aligns with PhysioNet's guidelines, requiring that these artifacts be hosted on Redivis. If you create derived research artifacts based on the dataset (such as additional annotations or synthetic data), please contact us to discuss hosting arrangements.

Please allow 7-10 business days to process applications.
Fleiss’ kappa inter-rater-agreement metric between reviewers (2-way) and...
plos.figshare.com
xls
Updated Dec 11, 2024
Cite
Hao Zhang; Neil Jethani; Simon Jones; Nicholas Genes; Vincent J. Major; Ian S. Jaffe; Anthony B. Cardillo; Noah Heilenbach; Nadia Fazal Ali; Luke J. Bonanni; Andrew J. Clayburn; Zain Khera; Erica C. Sadler; Jaideep Prasad; Jamie Schlacter; Kevin Liu; Benjamin Silva; Sophie Montgomery; Eric J. Kim; Jacob Lester; Theodore M. Hill; Alba Avoricani; Ethan Chervonski; James Davydov; William Small; Eesha Chakravartty; Himanshu Grover; John A. Dodson; Abraham A. Brody; Yindalon Aphinyanaphongs; Arjun Masurkar; Narges Razavian (2024). Fleiss’ kappa inter-rater-agreement metric between reviewers (2-way) and reviewers and ChatGPT (3-way) over the double-reviewed notes. [Dataset]. http://doi.org/10.1371/journal.pdig.0000685.t003
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pdig.0000685.t003
Dataset updated
Dec 11, 2024
Dataset provided by
PLOS Digital Health
Authors
Hao Zhang; Neil Jethani; Simon Jones; Nicholas Genes; Vincent J. Major; Ian S. Jaffe; Anthony B. Cardillo; Noah Heilenbach; Nadia Fazal Ali; Luke J. Bonanni; Andrew J. Clayburn; Zain Khera; Erica C. Sadler; Jaideep Prasad; Jamie Schlacter; Kevin Liu; Benjamin Silva; Sophie Montgomery; Eric J. Kim; Jacob Lester; Theodore M. Hill; Alba Avoricani; Ethan Chervonski; James Davydov; William Small; Eesha Chakravartty; Himanshu Grover; John A. Dodson; Abraham A. Brody; Yindalon Aphinyanaphongs; Arjun Masurkar; Narges Razavian
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Fleiss’ kappa inter-rater-agreement metric between reviewers (2-way) and reviewers and ChatGPT (3-way) over the double-reviewed notes.
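
For readers unfamiliar with the metric, here is a minimal sketch of the standard Fleiss' kappa formula in Python. It is not the authors' code, and the toy rating counts are made up; the input layout (one row per note, one column per category, each cell holding the number of raters who chose that category) is an assumption for illustration.

```python
from typing import List


def fleiss_kappa(counts: List[List[int]]) -> float:
    """Fleiss' kappa for a table of shape (items x categories).

    counts[i][j] is the number of raters who assigned item i to category j.
    Every item must be rated by the same number of raters (at least two).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    n_categories = len(counts[0])

    # Observed agreement per item, then averaged over items.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement from the overall category proportions.
    proportions = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_expected = sum(p * p for p in proportions)
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when all ratings share one category")

    return (p_observed - p_expected) / (1.0 - p_expected)


# Toy example: three raters, two categories, perfect agreement on every note.
kappa_perfect = fleiss_kappa([[3, 0], [0, 3], [3, 0], [0, 3]])
# Toy example: two raters who agree on only half of the notes.
kappa_chance = fleiss_kappa([[2, 0], [1, 1], [0, 2], [1, 1]])
```

In the "2-way" setting each row would sum to two ratings; in the "3-way" setting a third rating (here, ChatGPT's) is added to each row.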
Automated Clinical Note Redaction Services Market Research Report 2033
dataintelo.com
csv, pdf, pptx
Updated Oct 1, 2025
Cite
Dataintelo (2025). Automated Clinical Note Redaction Services Market Research Report 2033 [Dataset]. https://dataintelo.com/report/automated-clinical-note-redaction-services-market
Explore at:
Available download formats: pdf, csv, pptx
Dataset updated
Oct 1, 2025
Dataset authored and provided by
Dataintelo
License
https://dataintelo.com/privacy-and-policy
Time period covered
2024 - 2032
Area covered
Global
Description
Automated Clinical Note Redaction Services Market Outlook

According to our latest research, the global Automated Clinical Note Redaction Services market size was valued at USD 1.18 billion in 2024, and is projected to reach USD 4.67 billion by 2033 at a robust CAGR of 16.4% during the forecast period. The market growth is primarily driven by the escalating demand for advanced healthcare data privacy solutions and the increasing adoption of electronic health records (EHRs) across healthcare organizations worldwide. As per our latest findings, the growing regulatory scrutiny and the need for efficient, scalable, and accurate redaction services are further fueling market expansion.
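
The relationship between the quoted figures can be checked with the usual compound-growth formula. The sketch below assumes annual compounding over the nine years from 2024 to 2033; the report does not state its exact base year, so this is an approximation rather than the publisher's calculation, and the function names are illustrative.

```python
def project_value(start_value: float, cagr: float, years: int) -> float:
    """Value after compounding `start_value` at rate `cagr` for `years` years."""
    return start_value * (1.0 + cagr) ** years


def implied_cagr(start_value: float, end_value: float, years: int) -> float:
    """Constant annual growth rate that takes start_value to end_value."""
    return (end_value / start_value) ** (1.0 / years) - 1.0


years = 2033 - 2024  # assumed compounding window
projected_2033 = project_value(1.18, 0.164, years)  # USD billion
implied_rate = implied_cagr(1.18, 4.67, years)
```

Under these assumptions the projection comes out at about USD 4.6 billion and the implied rate at roughly 16.5%, both close to the quoted USD 4.67 billion and 16.4%.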

One of the major growth drivers for the Automated Clinical Note Redaction Services market is the intensifying focus on data privacy compliance within the healthcare sector. With the proliferation of digital health data, healthcare providers are under immense pressure to ensure that sensitive patient information is not inadvertently exposed or misused. Stringent regulations such as the Health Insurance Portability and Accountability Act (HIPAA) in the United States, the General Data Protection Regulation (GDPR) in Europe, and similar legislative frameworks in other regions have made it imperative for healthcare organizations to adopt automated solutions that can efficiently redact personally identifiable information (PII) and protected health information (PHI) from clinical notes. This regulatory landscape is not only driving the adoption of automated redaction services but is also prompting vendors to innovate and enhance their offerings to stay compliant with evolving standards.

Another significant factor propelling the market is the rapid digitization of healthcare records and the increasing reliance on electronic health records (EHRs) for clinical, administrative, and research purposes. The surge in digital documentation has led to a massive influx of unstructured data, making manual redaction both impractical and error-prone. Automated clinical note redaction services leverage artificial intelligence (AI) and natural language processing (NLP) technologies to accurately identify and remove sensitive information at scale, thereby streamlining workflows and reducing operational costs. As healthcare organizations continue to modernize their IT infrastructure, the demand for such sophisticated, automated solutions is expected to soar, further accelerating market growth.

Furthermore, the growing emphasis on clinical research and data sharing is amplifying the need for secure and compliant data management solutions. Research institutions, pharmaceutical companies, and healthcare payers increasingly require access to vast troves of clinical data for analytics, drug development, and population health studies. Automated redaction services enable these stakeholders to access valuable information without compromising patient privacy, facilitating collaboration while maintaining regulatory compliance. The ability of these solutions to support large-scale, secure data sharing is becoming a critical differentiator in the market, attracting significant investments and driving innovation.

From a regional perspective, North America currently dominates the Automated Clinical Note Redaction Services market, accounting for over 42% of the global revenue in 2024. This leadership position is attributed to the region's advanced healthcare infrastructure, high adoption of EHRs, and strict regulatory requirements. Europe follows closely, driven by robust data protection laws and increasing digital transformation initiatives in healthcare. The Asia Pacific region is anticipated to witness the fastest growth, fueled by expanding healthcare IT investments, rising awareness about data privacy, and government-led digital health programs. Latin America and the Middle East & Africa are also experiencing steady growth, albeit from a smaller base, as healthcare providers in these regions gradually embrace digitalization and compliance-driven solutions.

Component Analysis

The Automated Clinical Note Redaction Services market is segmented by component into Software and Services. The software segment holds the largest share, primarily due to the widespread adoption of AI-powered redaction platforms that can be seamlessly integrated into existing healthcare IT sys
nbme-score-clinical-patient-notes-correction
kaggle.com
zip
Updated Apr 26, 2022
Cite
hengck23 (2022). nbme-score-clinical-patient-notes-correction [Dataset]. https://www.kaggle.com/datasets/hengck23/nbme-score-clinical-patient-notes-correction
Explore at:
Available download formats: zip (61680 bytes)
Dataset updated
Apr 26, 2022
Authors
hengck23
Description
Manual correction of suspected missing annotations in the data from https://www.kaggle.com/competitions/nbme-score-clinical-patient-notes


How the dataset is constructed:

- We trained a model on the original annotated training data using k folds.
- We applied the trained model to the validation data and selected the "false positive errors".
- These "false positive errors" were verified by human inspection to determine whether they are real errors or missing ground-truth annotations. About 50% of the predicted "false positive errors" are actually missing annotations.
CK4Gen, High Utility Synthetic Survival Datasets
figshare.com
zip
Updated Nov 5, 2024
Cite
Nicholas Kuo (2024). CK4Gen, High Utility Synthetic Survival Datasets [Dataset]. http://doi.org/10.6084/m9.figshare.27611388.v1
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.6084/m9.figshare.27611388.v1
Dataset updated
Nov 5, 2024
Dataset provided by
Figshare (http://figshare.com/)
Authors
Nicholas Kuo
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Overview:

This repository provides high-utility synthetic survival datasets generated using the CK4Gen framework, optimised to retain critical clinical characteristics for use in research and educational settings. Each dataset is based on a carefully curated ground truth dataset, processed with standardised variable definitions and analytical approaches, ensuring a consistent baseline for survival analysis.

Description:

The repository includes synthetic versions of four widely utilised and publicly accessible survival analysis datasets, each anchored in foundational studies and aligned with established ground truth variations to support robust clinical research and training.

GBSG2: Based on Schumacher et al. [1]. The study evaluated the effects of hormonal treatment and chemotherapy duration in node-positive breast cancer patients, tracking recurrence-free and overall survival among 686 women over a median of 5 years. Our synthetic version is derived from a variation of the GBSG2 dataset available in the lifelines package [2], formatted to match the descriptions in Sauerbrei et al. [3], which we treat as the ground truth.

ACTG320: Based on Hammer et al. [4]. The study investigates the impact of adding the protease inhibitor indinavir to a standard two-drug regimen for HIV-1 treatment. The original clinical trial involved 1,151 patients with prior zidovudine exposure and low CD4 cell counts, tracking outcomes over a median follow-up of 38 weeks. Our synthetic dataset is derived from a variation of the ACTG320 dataset available in the sksurv package [5], which we treat as the ground truth dataset.

WHAS500: Based on Goldberg et al. [6]. The study follows 500 patients to investigate survival rates following acute myocardial infarction (MI), capturing a range of factors influencing MI incidence and outcomes. Our synthetic data replicates a ground truth variation from the sksurv package, which we treat as the ground truth dataset.

FLChain: Based on Dispenzieri et al. [7]. The study assesses the prognostic relevance of serum immunoglobulin free light chains (FLCs) for overall survival in a large cohort of 15,859 participants. Our synthetic version is based on a variation available in the sksurv package, which we treat as the ground truth dataset.

Notes:

Please find an in-depth discussion of these datasets, as well as their generation process, in our paper: https://arxiv.org/abs/2410.16872

Kuo, et al. "CK4Gen: A Knowledge Distillation Framework for Generating High-Utility Synthetic Survival Datasets in Healthcare." arXiv preprint arXiv:2410.16872 (2024).

References:

[1]: Schumacher, et al. “Randomized 2 x 2 trial evaluating hormonal treatment and the duration of chemotherapy in node-positive breast cancer patients. German breast cancer study group.”, Journal of Clinical Oncology, 1994.

[2]: Davidson-Pilon “lifelines: Survival Analysis in Python”, Journal of Open Source Software, 2019.

[3]: Sauerbrei, et al. “Modelling the effects of standard prognostic factors in node-positive breast cancer”, British Journal of Cancer, 1999.

[4]: Hammer, et al. “A controlled trial of two nucleoside analogues plus indinavir in persons with human immunodeficiency virus infection and cd4 cell counts of 200 per cubic millimeter or less”, New England Journal of Medicine, 1997.

[5]: Pölsterl “scikit-survival: A library for time-to-event analysis built on top of scikit-learn”, Journal of Machine Learning Research, 2020.

[6]: Goldberg, et al. “Incidence and case fatality rates of acute myocardial infarction (1975–1984): the Worcester heart attack study”, American Heart Journal, 1988.

[7]: Dispenzieri, et al. “Use of nonclonal serum immunoglobulin free light chains to predict overall survival in the general population”, in Mayo Clinic Proceedings, 2012.
Medical Writing Market Analysis North America, Asia, Europe, Rest of World...
technavio.com
pdf
Updated Jun 3, 2024
Cite
Technavio (2024). Medical Writing Market Analysis North America, Asia, Europe, Rest of World (ROW) - US, Germany, UK, China, India - Size and Forecast 2024-2028 [Dataset]. https://www.technavio.com/report/medical-writing-market-industry-analysis
Explore at:
Available download formats: pdf
Dataset updated
Jun 3, 2024
Dataset provided by
TechNavio
Authors
Technavio
License
https://www.technavio.com/content/privacy-notice
Time period covered
2024 - 2028
Area covered
Germany, United States
Description

Medical Writing Market Size 2024-2028

The medical writing market size is forecast to increase by USD 1.18 billion, at a CAGR of 6.45% between 2023 and 2028.

The market growth depends on key drivers such as the increase in the number of clinical trials. The medical writing market plays a crucial role in scientific data analysis, regulatory submissions, and the creation of educational materials. As the healthcare industry invests heavily in evidence-based medicine, skilled medical writers are in demand to communicate complex scientific information effectively. A significant trend shaping the market is the increasing adoption of AI in medical writing, which enhances efficiency and accuracy in document creation. However, a key challenge affecting the market growth is data security and privacy concerns associated with medical writing, especially when handling sensitive patient and clinical trial information.

What will be the Size of the Market During the Forecast Period?


The market encompasses various sectors, including patient information leaflets, scientific manuscripts, educational materials, regulatory writing, clinical writing, and medical writing sessions. These materials are essential for physicians and healthcare professionals to effectively communicate complex medical information to patients and peers. The market is significantly influenced by advancements in genetic engineering and bioinformatics, which require precise and accurate documentation. Clinical data management is another critical area that relies on medical writing for the collection, analysis, and reporting of clinical trial data. The market for medical writing continues to grow as the demand for clear and concise communication in the medical field increases.

How is this market segmented and which is the largest segment?

The market research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD billion' for the period 2024-2028, as well as historical data from 2018-2022 for the following segments.

Type: Clinical writing, Regulatory writing, Others

End-user: Pharmaceutical biotech companies, Contract research organization, Others

Geography: North America (US), Asia (China, India), Europe (Germany, UK), Rest of World (ROW)

By Type Insights

The clinical writing segment is estimated to witness significant growth during the forecast period.

Clinical writing refers to the type of writing that healthcare professionals engage in regularly. Examples of clinical writing include documenting progress or treatment notes in medical records, updating patient charts, preparing referral and consultation letters, and completing various administrative forms. This form of writing communicates essential, accurate, and detailed information regarding a patient's condition, diagnostic tests, treatment plans, and prognosis. Unlike other forms of medical writing, clinical writing directly affects patient care. Additionally, it carries legal implications and may be used as evidence in malpractice or negligence lawsuits.


The clinical writing segment was valued at USD 1.48 billion in 2018 and showed a gradual increase during the forecast period.

Regional Analysis

North America is estimated to contribute 36% to the growth of the global market during the forecast period.

Technavio's analysts have elaborately explained the regional trends and drivers that shape the market during the forecast period.


The market is thriving due to the region's emphasis on evidence-based medicine and the substantial healthcare expenditure. With the increasing prevalence of diseases worldwide, there is a growing demand for high-quality scientific data and patient information leaflets. This need is met through the production of scientific manuscripts, educational materials, and regulatory submissions. Skilled medical writers play a crucial role in transforming complex scientific research into clear and concise language for various audiences, including physicians, patients, and regulatory bodies. The market encompasses a wide range of applications, including research articles, conference papers, and documentation for drug-related information, medical device regulations, and study protocols.

Moreover, advancements in medical technologies, such as genetic engineering, bioinformatics, and agriculture biotechnology, necessitate the need for comprehensive clinical data management and medical writing sessions. The internship forum provides opportunities for aspiring medical writers to gain valuable experience and contribute to the development of medication innovations and medical apparatus regulations. The internet h
Clinical Note Summarization Software Market Research Report 2033
growthmarketreports.com
csv, pdf, pptx
Updated Aug 29, 2025
Cite
Growth Market Reports (2025). Clinical Note Summarization Software Market Research Report 2033 [Dataset]. https://growthmarketreports.com/report/clinical-note-summarization-software-market
Explore at:
Available download formats: pdf, csv, pptx
Dataset updated
Aug 29, 2025
Dataset authored and provided by
Growth Market Reports
Time period covered
2024 - 2032
Area covered
Global
Description
Clinical Note Summarization Software Market Outlook

According to our latest research, the global clinical note summarization software market size reached USD 1.42 billion in 2024. The market is experiencing robust momentum, registering a CAGR of 19.7% from 2025 to 2033. By the end of 2033, the market is projected to attain a value of USD 6.98 billion. This exceptional growth is primarily driven by the increasing adoption of artificial intelligence (AI) and natural language processing (NLP) technologies in healthcare, which are transforming how clinical data is managed and utilized across various healthcare settings.

One of the primary growth factors propelling the clinical note summarization software market is the exponential rise in healthcare data volume and complexity. Healthcare providers are inundated with unstructured data from electronic health records (EHRs), physician notes, and patient reports, making manual summarization both time-consuming and error-prone. The deployment of clinical note summarization software, equipped with advanced NLP and machine learning algorithms, automates the extraction of critical information from vast volumes of clinical notes. This not only enhances clinical workflow efficiency but also improves the accuracy of patient care by ensuring that key insights are not overlooked. As regulatory pressures mount for accurate documentation and reporting, the demand for robust summarization solutions continues to escalate, further fueling market expansion.

Another significant driver is the increasing emphasis on value-based healthcare and patient-centric care models. Clinical note summarization software enables healthcare organizations to streamline documentation processes, reduce administrative burdens, and allocate more time for direct patient interaction. By automating the summarization of clinical notes, providers can rapidly access relevant patient histories, diagnoses, and treatment plans, facilitating quicker and more informed clinical decisions. This capability is particularly vital in acute care settings and during patient transitions between care teams, where timely and accurate information exchange is critical. Furthermore, the integration of summarization tools with existing EHR systems enhances interoperability and data accessibility, supporting broader digital transformation initiatives within the healthcare sector.

The market's growth is also buoyed by the rising demand for data-driven insights in healthcare research and population health management. Clinical note summarization software is increasingly being adopted by research institutes and payers to extract actionable insights from large-scale clinical datasets. This not only aids in identifying disease trends and treatment outcomes but also supports the development of predictive analytics and personalized medicine initiatives. The growing prevalence of chronic diseases, coupled with the need for efficient documentation in value-based reimbursement models, is prompting healthcare organizations to invest in advanced summarization tools. As a result, the clinical note summarization software market is poised for significant expansion across diverse healthcare applications over the forecast period.

The advent of AI-Generated Clinical Discharge Summary systems is revolutionizing the way healthcare providers manage patient information post-discharge. These systems utilize advanced AI algorithms to automatically generate comprehensive discharge summaries, ensuring that critical patient information is accurately captured and communicated to both patients and subsequent care providers. By reducing the manual effort required to compile these summaries, healthcare professionals can focus more on direct patient care and less on administrative tasks. This technology not only enhances the continuity of care but also minimizes the risk of information loss during patient transitions. Moreover, AI-generated summaries are increasingly being integrated with EHR systems, providing seamless access to patient data and supporting informed decision-making across care teams. As the demand for efficient and accurate documentation grows, AI-Generated Clinical Discharge Summary systems are poised to become a staple in healthcare settings worldwide.

Regionally, North America dominates the clinical note summarization software market,
Data from: Transformer models trained on MIMIC-III to generate synthetic...
physionet.org
Updated May 27, 2020
Cite
Ali Amin-Nejad; Julia Ive; Sumithra Velupillai (2020). Transformer models trained on MIMIC-III to generate synthetic patient notes [Dataset]. http://doi.org/10.13026/m34x-fq90
Explore at:
Unique identifier
https://doi.org/10.13026/m34x-fq90
Dataset updated
May 27, 2020
Authors
Ali Amin-Nejad; Julia Ive; Sumithra Velupillai
License
https://github.com/MIT-LCP/license-and-dua/tree/master/drafts
Description
Natural Language Processing can help to unlock knowledge in the vast troves of unstructured clinical data that are collected during patient care. Patient confidentiality presents a barrier to the sharing and analysis of such data, however, meaning that only small, fragmented and sequestered datasets are available for research. To help side-step this roadblock, we explore the use of Transformer models for the generation of synthetic notes. We demonstrate how models trained on notes from the MIMIC-III clinical database can be used to generate synthetic data with potential to support downstream research studies. We release these trained models to the research community to stimulate further research in this area.
MedAlign
redivis.com
stanford.redivis.com
application/jsonl +7
Updated Mar 30, 2025
Cite
Shah Lab (2025). MedAlign [Dataset]. http://doi.org/10.57761/5b7c-pm72
Explore at:
Available download formats: avro, arrow, sas, parquet, csv, stata, application/jsonl, spss
Unique identifier
https://doi.org/10.57761/5b7c-pm72
Dataset updated
Mar 30, 2025
Dataset provided by
Redivis Inc.
Authors
Shah Lab
Description
Abstract

MedAlign is a benchmark dataset of 983 clinician-curated natural language instructions for EHR data, grounded by 275 longitudinal EHRs. It includes reference responses for 303 instructions and supports evaluation of LLMs on healthcare-specific tasks.

Methodology

**IMPORTANT USAGE NOTE:** MedAlign only includes test set examples. No training examples are provided for fine-tuning models.

1. Overview

MedAlign is a longitudinal EHR benchmark for instruction-following with LLMs. The dataset includes:

275 patients

46,252 clinical notes

128 clinical note types

3.6 million clinical events


2. EHR Data

EHR data is sourced from Stanford’s STARR-OMOP database. Data are standardized in the OMOP CDM schema and are scrubbed of identifying PHI. Complete technical details are included in the paper; key highlights:

Dates are jittered within patient to conceal real dates (but preserve deltas between dates)

Data for patients >= 90 years old are removed

Unstructured text fields not mappable to OMOP standard concepts are redacted

All clinical note text has been scrubbed of PHI using hiding-in-plain-sight (HIPS; Carrell et al. 2013).

HIV test results are redacted.

Provider names and NPIs are redacted
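
The date jittering listed above (shifting each patient's dates while preserving the gaps between them) can be sketched as follows. This is an illustration only: the offset range, the seeding, and the function name are assumptions, and the paper documents the procedure actually used.

```python
import random
from datetime import date, timedelta
from typing import Dict, List


def jitter_dates(
    events: Dict[str, List[date]], max_shift_days: int = 30, seed: int = 0
) -> Dict[str, List[date]]:
    """Shift all dates of a patient by one random offset drawn per patient."""
    rng = random.Random(seed)
    shifted: Dict[str, List[date]] = {}
    for patient_id, dates in events.items():
        # A single offset per patient hides real dates but keeps deltas intact.
        offset = timedelta(days=rng.randint(-max_shift_days, max_shift_days))
        shifted[patient_id] = [d + offset for d in dates]
    return shifted


# Toy, made-up timelines (hypothetical patient identifiers).
events = {
    "patient_a": [date(2020, 1, 1), date(2020, 1, 15)],
    "patient_b": [date(2021, 6, 1), date(2021, 6, 4)],
}
shifted = jitter_dates(events)
```

Because every date for a given patient moves by the same amount, intervals such as "14 days between visits" survive the shift even though the absolute dates do not.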


3. Instruction Following Benchmark

See "medalign_instructions_responses_v1_2.zip" for instructions, responses, and EHR text timelines.

Please see our Github repo to obtain code for loading the dataset.

Usage

Access to the MedAlign dataset requires the following:

Verified Affiliation (Academic, Government, Industry Research Lab). Please use your verified email address when applying; **do not use Gmail or personal emails.** Applications using personal, unverified email addresses will be rejected.

Encryption Verification / Attestation for Data Storage

Signing the terms of the MedAlign Data Set License 1.0

Providing a short description of your intended research use of MedAlign

CITI Training


**These data must remain on your encrypted machine. Redistribution of data is FORBIDDEN and will result in immediate termination of access privileges.**

IMPORTANT NOTES:

Our policy on derived works aligns with PhysioNet's guidelines, requiring that these artifacts be hosted on Redivis. If you create derived research artifacts based on MedAlign (such as additional annotations or synthetic data), please contact us to discuss hosting arrangements.

Sending MedAlign data over a non-HIPAA-compliant API is a violation of the DUA.


Please allow 7-10 business days to process applications.
