https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created by Daniel Ansted
Released under CC0: Public Domain
https://creativecommons.org/publicdomain/zero/1.0/
This is synthetic patient-level clinical trial data, re-created based on data from a clinical trial for corticosteroids and antiviral agents as treatment for Bell's Palsy: https://www.nejm.org/doi/full/10.1056/nejmoa072006#
Bell's Palsy is a sudden, temporary weakness or paralysis of the muscles on one side of the face. The exact cause is unknown, but it's believed to occur due to swelling and inflammation of the nerve that controls the muscles on one side of the face, which can be triggered by a viral infection.
The authors conducted a double-blind, placebo-controlled, randomized, factorial trial involving patients with Bell's Palsy who were recruited within 72 hours after the onset of symptoms. Patients were randomly assigned to receive 10 days of treatment with prednisolone, acyclovir, both agents, or placebo. The primary outcome was recovery of facial function, as rated on the House–Brackmann scale.
House–Brackmann scale image: https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F4146319%2Ffcf4a8da1954977c5b94ca11caee6079%2Fhouse_brackmann_scale.jpg?generation=1693081683813509&alt=media
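The factorial design described above crosses two factors (prednisolone yes/no and acyclovir yes/no), so each drug's effect can be estimated by pooling over the other drug. The following is a minimal sketch of that idea. The record layout and the toy values are illustrative assumptions and are not taken from the dataset or from the trial.

```python
from collections import defaultdict

# Hypothetical toy records: (prednisolone, acyclovir, recovered).
# The layout and the values are assumptions made for illustration only.
records = [
    (1, 0, 1), (1, 0, 1), (1, 1, 1), (1, 1, 0),
    (0, 1, 1), (0, 1, 0), (0, 0, 0), (0, 0, 1),
]

ARM_NAMES = {
    (1, 0): "prednisolone + placebo",
    (0, 1): "acyclovir + placebo",
    (1, 1): "prednisolone + acyclovir",
    (0, 0): "double placebo",
}


def recovery_rate(rows):
    """Proportion of rows whose outcome flag is 1."""
    return sum(r[2] for r in rows) / len(rows)


def arm_rates(rows):
    """Recovery rate for each of the four factorial arms."""
    arms = defaultdict(list)
    for row in rows:
        arms[ARM_NAMES[(row[0], row[1])]].append(row)
    return {name: recovery_rate(group) for name, group in arms.items()}


def main_effect(rows, factor_index):
    """Difference in recovery rate between patients who did and did not
    receive one drug, pooling over the other drug (the factorial margin)."""
    treated = [r for r in rows if r[factor_index] == 1]
    untreated = [r for r in rows if r[factor_index] == 0]
    return recovery_rate(treated) - recovery_rate(untreated)


prednisolone_effect = main_effect(records, 0)
acyclovir_effect = main_effect(records, 1)
```

A real analysis would also test for an interaction between the two drugs before relying on the pooled margins.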
This dataset was created by Priyanshu
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This synthetic dataset simulates a Phase III randomized controlled clinical trial evaluating CardioX (Drug A) versus an active comparator (Drug B) and a placebo for treating hypertension. It is designed for clinical data analysis, anomaly detection, and risk-based monitoring (RBM) applications.
The dataset includes 1,000 patients across 50 trial sites, with realistic patient demographics, blood pressure readings, cholesterol levels, dropout rates, and adverse event reporting. Several anomalies have been embedded to simulate real-world data quality issues commonly encountered in clinical trials.
This dataset is ideal for data quality assessments, statistical anomaly detection (Z-scores, IQR, clustering), and risk-based monitoring (RBM) in clinical research.
🔹 Clinical Trial Data Analysis – Investigate treatment efficacy and safety trends.
🔹 Anomaly Detection – Apply Z-scores, IQR, and ML-based clustering methods to identify outliers.
🔹 Risk-Based Monitoring (RBM) – Detect potential site-level risks and data inconsistencies.
🔹 Machine Learning Applications – Train models for adverse event prediction or dropout risk estimation.
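As a rough illustration of the Z-score and IQR checks mentioned above, the sketch below flags outliers in a list of systolic blood pressure readings (the Systolic_BP column in the table that follows). The readings are made-up toy values, and the thresholds (3 standard deviations, 1.5 × IQR) are common conventions assumed here rather than settings documented for this dataset.

```python
import statistics


def zscore_outliers(values, threshold=3.0):
    """Indices of values more than `threshold` sample standard deviations
    away from the mean."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    if sd == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mean) / sd > threshold]


def iqr_outliers(values, k=1.5):
    """Indices of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < low or v > high]


# Toy readings in mmHg (assumed values); the last one is an injected anomaly.
systolic_bp = [118, 122, 125, 130, 127, 121, 119, 133, 128, 124,
               126, 123, 129, 120, 131, 117, 132, 125, 124, 260]

z_flags = zscore_outliers(systolic_bp)
iqr_flags = iqr_outliers(systolic_bp)
```

With small samples a single extreme value inflates the standard deviation, so the Z-score check can miss outliers that the IQR check catches; running both is a cheap cross-check.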
| Column Name | Description |
|---|---|
| Patient_ID | Unique identifier for each trial participant. |
| Site_ID | Site where the patient was enrolled (1-50). |
| Age | Patient age (in years). |
| Gender | Male or Female. |
| Enrollment_Date | Date when the patient was enrolled in the study. |
| Treatment_Group | Assigned treatment: Placebo, Drug A (CardioX), or Drug B (Active Comparator). |
| Adverse_Events | Number of adverse events (AEs) reported by the patient. |
| Dropout | Whether the patient dropped out of the study (1 = Yes, 0 = No). |
| Systolic_BP | Systolic Blood Pressure (mmHg). |
| Diastolic_BP | Diastolic Blood Pressure (mmHg). |
| Cholesterol_Level | Total cholesterol level (mg/dL). |
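For the risk-based monitoring use case, one simple site-level signal is a dropout rate that is far above the trial-wide rate. The sketch below uses the Site_ID and Dropout columns from the table above. The toy rows, the 2× ratio, and the minimum site size are assumptions chosen for illustration, not rules defined by the dataset.

```python
from collections import defaultdict


def site_dropout_rates(rows):
    """Map each Site_ID to (dropout rate, number of patients)."""
    totals = defaultdict(lambda: [0, 0])  # site -> [dropouts, patients]
    for row in rows:
        counts = totals[row["Site_ID"]]
        counts[0] += row["Dropout"]
        counts[1] += 1
    return {site: (d / n, n) for site, (d, n) in totals.items()}


def flag_sites(rows, ratio=2.0, min_patients=5):
    """Sites whose dropout rate exceeds `ratio` times the overall rate.
    Sites with fewer than `min_patients` patients are skipped as too noisy."""
    overall = sum(r["Dropout"] for r in rows) / len(rows)
    rates = site_dropout_rates(rows)
    return sorted(
        site
        for site, (rate, n) in rates.items()
        if n >= min_patients and rate > ratio * overall
    )


# Toy data (assumed): sites 1 and 2 lose 1 of 10 patients, site 3 loses 7 of 10.
rows = (
    [{"Site_ID": 1, "Dropout": d} for d in [1] + [0] * 9]
    + [{"Site_ID": 2, "Dropout": d} for d in [1] + [0] * 9]
    + [{"Site_ID": 3, "Dropout": d} for d in [1] * 7 + [0] * 3]
)

flagged = flag_sites(rows)
```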
This dataset is fully synthetic and does not contain real patient data. It is created for educational, analytical, and research purposes in clinical data science and biostatistics.
🔗 If you use this dataset, tag me! Let’s discuss insights & findings! 🚀
This is the purest type of electronic clinical data, obtained at the point of care at a medical facility, hospital, clinic, or practice. Often referred to as the electronic medical record (EMR), it is generally not available to outside researchers. The data collected includes administrative and demographic information, diagnosis, treatment, prescription drugs, laboratory tests, physiologic monitoring data, hospitalization, patient insurance, etc.
Individual organizations such as hospitals or health systems may provide access to internal staff. Larger collaborations, such as the NIH Collaboratory Distributed Research Network, provide mediated or collaborative access to clinical data repositories for eligible researchers. Additionally, the UW De-identified Clinical Data Repository (DCDR) and the Stanford Center for Clinical Informatics allow for initial cohort identification.
About Dataset:
333 scholarly articles cite this dataset.
Unique identifier: DOI
Dataset updated: 2023
Authors: Haoyang Mi
This dataset contains two datasets:
1- Clinical Data_Discovery_Cohort: Name of columns: Patient ID Specimen date Dead or Alive Date of Death Date of last Follow Sex Race Stage Event Time
2- Clinical_Data_Validation_Cohort Name of columns: Patient ID Survival time (days) Event Tumor size Grade Stage Age Sex Cigarette Pack per year Type Adjuvant Batch EGFR KRAS
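The validation cohort's "Survival time (days)" and "Event" columns are the inputs a Kaplan-Meier survival estimate needs. The following is a minimal sketch of that estimator in plain Python. It assumes Event is coded 1 for an observed event and 0 for censoring, which the listing does not state, and the toy values are made up.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time.

    events[i] is 1 if the event was observed at times[i] and 0 if the
    patient was censored at that time (an assumed coding).
    """
    pairs = sorted(zip(times, events))
    at_risk = len(pairs)
    survival = 1.0
    curve = []
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        deaths = 0
        removed = 0
        # Group all patients who share the same follow-up time.
        while i < len(pairs) and pairs[i][0] == t:
            deaths += pairs[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed
    return curve


# Toy follow-up data in days (assumed values, not from the dataset).
times = [5, 8, 8, 12, 15, 20]
events = [1, 1, 0, 1, 0, 1]
curve = kaplan_meier(times, events)
```

Censored patients leave the risk set without lowering the curve, which is what distinguishes this estimate from a plain proportion of survivors.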
Feel free to share your thoughts and analysis in a notebook for these datasets. You can also create some interesting and valuable ML projects for this case. Thanks for your attention.
http://opendatacommons.org/licenses/dbcl/1.0/
Overview: The AIDS Clinical Trials Group Study 175 Dataset, initially published in 1996, is a comprehensive collection of healthcare statistics and categorical information about patients diagnosed with AIDS. This dataset was created with the primary purpose of comparing four AIDS treatment regimens: zidovudine (AZT) alone, didanosine (ddI) alone, AZT plus ddI, and AZT plus zalcitabine (ddC). The prediction task associated with this dataset involves determining whether each patient died within a specified time window.
Dataset Details:
- Number of rows: 2139
- Number of columns: 24
Purpose of Dataset Creation: The dataset was created to evaluate the efficacy and safety of various AIDS treatments, specifically comparing the performance of AZT, ddI, and ddC in preventing disease progression in HIV-infected patients with CD4 counts ranging from 200 to 500 cells/mm3. This intervention trial aimed to contribute insights into the effectiveness of monotherapy versus combination therapy with nucleoside analogs.
Funding Sources: The creation of this dataset was funded by:
- AIDS Clinical Trials Group of the National Institute of Allergy and Infectious Diseases
- General Research Center units funded by the National Center for Research Resources
Instance Representation: Each instance in the dataset represents a health record of a patient diagnosed with AIDS in the United States. These records encompass crucial categorical information and healthcare statistics related to the patient's condition.
Study Design:
- Study Type: Interventional (Clinical Trial)
- Enrollment: 2100 participants
- Masking: Double-Blind
- Primary Purpose: Treatment
- Official Title: A Randomized, Double-Blind Phase II/III Trial of Monotherapy vs. Combination Therapy With Nucleoside Analogs in HIV-Infected Persons With CD4 Cells of 200-500/mm3
- Study Completion Date: November 1995
Study Objectives: To determine the effectiveness and safety of different AIDS treatments, including AZT, ddI, and ddC, in preventing disease progression among HIV-infected patients with specific CD4 cell counts.
Additional Information: The dataset provides valuable insights into the HIV-related clinical trials conducted by the AIDS Clinical Trials Group, contributing to the understanding of treatment outcomes and informing future research in the field.
Attributes Description:
Censoring Indicator (label): Binary indicator (1 = failure, 0 = censoring) denoting patient status.
Temporal Information:
Time to Event (time): Integer representing time to failure or censoring.
Treatment Features:
Baseline Health Metrics:
Age (age): Patient's age in years at baseline.
Weight (wtkg): Continuous feature representing weight in kilograms at baseline.
Hemophilia (hemo): Binary indicator of hemophilia status (0 = no, 1 = yes).
Sexual Orientation (homo): Binary indicator of homosexual activity (0 = no, 1 = yes).
IV Drug Use History (drugs): Binary indicator of history of IV drug use (0 = no, 1 = yes).
Karnofsky Score (karnof): Integer on a scale of 0-100 indicating the patient's functional status.
Antiretroviral Therapy History:
Non-ZDV Antiretroviral Therapy Pre-175 (oprior): Binary indicator of non-ZDV antiretroviral therapy pre-Study 175 (0 = no, 1 = yes).
ZDV in the 30 Days Prior to 175 (z30): Binary indicator of ZDV use in the 30 days prior to Study 175 (0 = no, 1 = yes).
ZDV Prior to 175 (zprior): Binary indicator of ZDV use prior to Study 175 (0 = no, 1 = yes).
Days Pre-175 Anti-Retroviral Therapy (preanti): Integer representing the number of days of pre-Study 175 anti-retroviral therapy.
Demographic Information:
Race (race): Integer denoting race (0 = White, 1 = non-white).
Gender (gender): Binary indicator of gender (0 = Female, 1 = Male).
Treatment History:
Antiretroviral History (str2): Binary indicator of antiretroviral history (0 = naive, 1 = experienced).
Antiretroviral History Stratification (strat): Integer representing antiretroviral history stratification.
Symptomatic Information:
Symptomatic Indicator (symptom): Binary indicator of symptomatic status (0 = asymptomatic, 1 = symptomatic).
Additional Treatment Attributes:
Treatment Indicator (treat): Binary indicator of treatment (0 = ZDV only, 1 = others).
Off-Treatment Indicator (offtrt): Binary indicator of being off-treatment before 96+/-5 weeks (0 = no, 1 = yes).
Immunological Metrics:
CD4 Counts (cd40, cd420): Integer values representing CD4 counts at baseline and 20+/-5 weeks.
CD8 Counts (cd80, cd820): Integer values representing CD8 counts at baseline and 20+/-5 weeks.
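To show how the attributes above fit together, the sketch below groups records by the treat indicator and reports the share of failures (label) and the mean change in CD4 count from baseline to week 20 (cd420 minus cd40). The attribute names come from the list above; the rows are made-up toy values, and the failure share ignores the time column, so it is only a rough descriptive summary and not a survival analysis.

```python
from statistics import fmean

# Toy records using the attribute names listed above (values are assumed).
rows = [
    {"treat": 0, "label": 1, "cd40": 350, "cd420": 300},
    {"treat": 0, "label": 0, "cd40": 400, "cd420": 380},
    {"treat": 1, "label": 0, "cd40": 320, "cd420": 390},
    {"treat": 1, "label": 0, "cd40": 410, "cd420": 450},
    {"treat": 1, "label": 1, "cd40": 280, "cd420": 270},
]


def summarize_by_treat(records):
    """Per-arm count, failure share, and mean CD4 change (week 20 - baseline)."""
    summary = {}
    for arm in sorted({r["treat"] for r in records}):
        group = [r for r in records if r["treat"] == arm]
        summary[arm] = {
            "n": len(group),
            "failure_share": fmean([r["label"] for r in group]),
            "mean_cd4_change": fmean([r["cd420"] - r["cd40"] for r in group]),
        }
    return summary


summary = summarize_by_treat(rows)
```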
Original Dataset Website: [h...
This dataset was created by Jashwanth Reddy Kadaru
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
🧪 Covid-19 Clinical Trials Dataset (Raw + Cleaned)
This dataset offers a deep look into the global clinical research landscape during the Covid-19 pandemic. Sourced directly from ClinicalTrials.gov, it provides structured and semi-structured information on registered Covid-19-related clinical trials across countries, sponsors, and phases.
📁 What’s Included
• COVID_clinical_trials.csv: Raw dataset as obtained from ClinicalTrials.gov
• Covid-19_cleaned_dataset.csv: Preprocessed version for direct use in data analysis and visualization tasks
🎯 Use Case & Learning Goals
This dataset is ideal for:
• Practicing data cleaning, preprocessing, and wrangling
• Performing exploratory data analysis (EDA)
• Building interactive dashboards (e.g., with Tableau or Plotly)
• Training ML models for classification or forecasting (e.g., predicting trial outcomes)
• Exploring trends in clinical trial research during global health emergencies
🔍 Key Features
Each row represents a registered clinical trial and includes fields such as:
• NCT Number (unique ID)
• Study Title
• Start Date and Completion Date
• Phase
• Study Type (Interventional/Observational)
• Enrollment Size
• Country, Sponsor, and Intervention Type
• Study Status (Recruiting, Completed, Withdrawn, etc.)
✅ Cleaned Dataset
The cleaned version includes:
• Standardized column naming
• Filled missing values where possible
• Removed duplicates and a few columns
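The cleaning steps listed above (standardized column names, filled missing values, removed duplicates) could look roughly like the sketch below. It is not the author's actual pipeline: the naming rule, the "Unknown" fill value, the column names, and the placeholder NCT identifiers are all assumptions made for illustration.

```python
import re


def standardize_name(name):
    """Lowercase, trim, and replace runs of non-alphanumerics with '_'."""
    return re.sub(r"[^0-9a-z]+", "_", name.strip().lower()).strip("_")


def clean_rows(rows, fill_value="Unknown"):
    """Standardize keys, fill empty values, and drop exact duplicate rows."""
    cleaned, seen = [], set()
    for row in rows:
        new_row = {
            standardize_name(key): (value.strip() if value and value.strip() else fill_value)
            for key, value in row.items()
        }
        fingerprint = tuple(sorted(new_row.items()))
        if fingerprint not in seen:
            seen.add(fingerprint)
            cleaned.append(new_row)
    return cleaned


# Hypothetical raw rows; identifiers and column names are placeholders.
raw_rows = [
    {"NCT Number": "NCT-EXAMPLE-1", "Study Type": "Interventional", "Phases": " Phase 2 "},
    {"NCT Number": "NCT-EXAMPLE-1", "Study Type": "Interventional", "Phases": "Phase 2"},
    {"NCT Number": "NCT-EXAMPLE-2", "Study Type": "Observational", "Phases": ""},
]

cleaned = clean_rows(raw_rows)
```

Filling with a sentinel such as "Unknown" keeps rows available for counting, but numeric columns would usually need a different strategy (for example leaving them empty or imputing).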
📊 Example Applications
• Country-wise contribution analysis
• Sponsor landscape visualization
• Trial timeline and phase progression charts
• Predictive modeling of trial duration or status
🙏 Acknowledgments
Thanks to ClinicalTrials.gov for providing public access to this critical data.
By Aero Data Lab [source]
This dataset contains information on clinical trials conducted by sponsors. Each row represents a clinical trial, and the columns represent various attributes of the trial, such as the National Clinical Trial Number, the sponsor of the trial, the title of the trial, and so on.
The purpose of this dataset is to provide a bird's-eye view of the clinical trial landscape. By understanding which sponsors are conducting which trials and for what conditions, we can get a better sense of where research is headed and what new treatments may be on the horizon.
- NCT is a unique identifier for clinical trials. It stands for National Clinical Trial Number.
- Sponsor is the organization that is funding the clinical trial.
- Title is the name of the clinical trial.
- Summary is a brief summary of the clinical trial.
- Start Year is the year that the clinical trial started.
- Start Month is the month that the clinical trial started.
- Phase is the stage of development of the investigational drug or device, which can be one of four types: I, II, III, or IV.
- Enrollment is the number of participants in the clinical trial.
- Status is the status of enrollment in the study, which can be Recruiting; Not yet recruiting; Active, not recruiting; Completed; Suspended; or Terminated.
- Condition indicates what medical condition(s) are being studied in this particular NCT record.
- Identify patterns in clinical trials to improve the development process
- Understand how different sponsors fund clinical trials
License
License: Dataset copyright by authors
- You are free to:
  - Share: copy and redistribute the material in any medium or format for any purpose, even commercially.
  - Adapt: remix, transform, and build upon the material for any purpose, even commercially.
- You must:
  - Give appropriate credit: provide a link to the license, and indicate if changes were made.
  - ShareAlike: you must distribute your contributions under the same license as the original.
  - Keep intact: all notices that refer to this license, including copyright notices.
File: AERO-BirdsEye-Data.csv
| Column name | Description |
|:---|:---|
| NCT | National Clinical Trial number. (String) |
| Sponsor | Name of the sponsor conducting the clinical trial. (String) |
| Title | Title of the clinical trial. (String) |
| Summary | Brief summary of the clinical trial. (String) |
| Start_Year | Year the clinical trial started. (Integer) |
| Start_Month | Month the clinical trial started. (String) |
| Phase | Phase of the clinical trial. (String) |
| Enrollment | Number of participants enrolled in the clinical trial. (Integer) |
| Status | Status of the clinical trial. (String) |
| Condition | Condition being tested in the clinical trial. (String) |
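A quick way to get the "bird's-eye view" from these columns is to count trials per sponsor and phase and to total enrollment per sponsor. The sketch below reads an inline sample with `csv.DictReader` so it runs on its own. The sample rows, the sponsor names, the identifier placeholders, and the "Phase 3" value format are assumptions; only the column names come from the table above.

```python
import csv
import io
from collections import Counter

# Inline stand-in for AERO-BirdsEye-Data.csv (rows are made up).
SAMPLE_CSV = """NCT,Sponsor,Phase,Enrollment,Status
NCT-EXAMPLE-1,Sponsor A,Phase 3,120,Completed
NCT-EXAMPLE-2,Sponsor A,Phase 3,80,Recruiting
NCT-EXAMPLE-3,Sponsor B,Phase 2,40,Completed
"""


def load_rows(handle):
    """Read CSV rows into dictionaries keyed by the header names."""
    return list(csv.DictReader(handle))


def trials_by_sponsor_and_phase(rows):
    """Count trials for each (Sponsor, Phase) pair."""
    return Counter((row["Sponsor"], row["Phase"]) for row in rows)


def enrollment_by_sponsor(rows):
    """Total Enrollment per Sponsor (Enrollment is parsed as an integer)."""
    totals = Counter()
    for row in rows:
        totals[row["Sponsor"]] += int(row["Enrollment"])
    return totals


rows = load_rows(io.StringIO(SAMPLE_CSV))
phase_counts = trials_by_sponsor_and_phase(rows)
enrollment = enrollment_by_sponsor(rows)
```

To run this against the real file, the `io.StringIO(SAMPLE_CSV)` argument would be replaced by an open file handle, for example `open("AERO-BirdsEye-Data.csv", newline="", encoding="utf-8")`.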
If you use this dataset in your research, please credit Aero Data Lab [source]
https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains synthetic clinical trial data for research and analysis purposes. It is designed for practicing data analysis, machine learning, and visualization techniques in a healthcare setting.
The dataset includes four sheets:
patients → Patient demographics and physical attributes (sex, birthdate, height, weight, BMI, location, contact)
treatments → Records of treatments given to patients during the trial (treatment name, date, dosage, outcome)
treatments_cut → Filtered treatment records (e.g., last treatment per patient or specific conditions)
adverse_reactions → Reported side effects or complications related to treatments
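The treatments_cut sheet is described as a filtered view such as the last treatment per patient. A minimal sketch of that filter is shown below. The field names (`patient_id`, `treatment_date`, `treatment_name`) and the toy records are hypothetical, since the listing does not give the exact column names.

```python
from datetime import date


def last_treatment_per_patient(treatments):
    """Keep, for each patient, the record with the latest treatment date."""
    latest = {}
    for record in treatments:
        pid = record["patient_id"]
        if pid not in latest or record["treatment_date"] > latest[pid]["treatment_date"]:
            latest[pid] = record
    return latest


# Hypothetical treatment records (field names and values are assumptions).
treatments = [
    {"patient_id": 1, "treatment_date": date(2023, 1, 5), "treatment_name": "Treatment X"},
    {"patient_id": 1, "treatment_date": date(2023, 3, 1), "treatment_name": "Treatment Y"},
    {"patient_id": 2, "treatment_date": date(2023, 2, 10), "treatment_name": "Treatment X"},
]

latest = last_treatment_per_patient(treatments)
```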
This dataset is fictional and anonymized — no real patient data is used.
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
Last Updated: 8th May 2020
This dataset contains information on more than 338k clinical trials. The data is in the form of XML files.
For COVID-19 related clinical trials please check this dataset by user Ali Panahi
Please read the Terms and conditions from the source website (clinicaltrials.gov)
To highlight the main points:
- ClinicalTrials.gov data carries an international copyright
- Emails extracted from this data cannot be used for marketing/promotional activities
Splash image is taken from Wiki Commons: https://commons.wikimedia.org/wiki/File:Figure_1_Food_and_Drug_Administration%E2%80%99s_(FDA)_Typical_Drug_Development_and_Approval_Process_(35856478702).jpg
Original source: https://www.gao.gov/products/GAO-17-564
Thanks to @savannareid for pointing out this resource.
A central repository of all ongoing clinical trials provides a one-stop database for creative querying.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Drug development is costly and uncertain, with success rates varying widely across therapeutic areas and phases. Predicting the Probability of Trial Success (PTS) can guide better R&D investment, pipeline prioritization, and business development decisions. The challenge is to develop a machine learning model that predicts the PTS for ongoing (active) Phase-3 clinical trials, based on learnings from historical trials.
Data Provided
http://opendatacommons.org/licenses/dbcl/1.0/
The data can be relevant alone or as supplementary information on the COVID-19 outbreak.
The data contains selected attributes such as sex, age, and phase, as well as enrollment and inclusion/exclusion criteria, for clinical trials relating to COVID-19. It was acquired by scraping clinicaltrials.gov with Python. Scraping is particularly helpful for obtaining the inclusion/exclusion criteria without downloading and processing the XML study file for each study. The data was collected on 28th March 2020.
All data is part of a US government database and is downloaded from clinicaltrials.gov.
Which clinical trials are already in place? Do inclusion/exclusion conditions and other attributes impact the feasibility of the clinical trials?
https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains a CSV file of clinical trial data used to develop an interactive visualization published by Aero Data Lab, titled “A Bird’s Eye View of Research Landscape.” The data offers insights into pharmaceutical research and development trends and serves as a valuable resource for exploring the structure and scope of clinical trials. Published in 2019, the dataset can support studies in medical innovation, trial phases, and therapeutic focus areas.
http://opendatacommons.org/licenses/dbcl/1.0/
This dataset was created by Nauman Ali Shah
Released under Database: Open Database, Contents: Database Contents
http://opendatacommons.org/licenses/dbcl/1.0/
ClinicalTrials.gov is a database of privately and publicly funded clinical studies conducted around the world. It is maintained by the National Institutes of Health. All data is publicly available, and the site provides a direct download feature which makes it super easy to use relevant data for analysis.
This dataset consists of clinical trials related to COVID-19 studies presented on the site.
The dataset consists of XML files, where each XML file corresponds to one study. The filename is the NCT number, which is a unique identifier of a study in the ClinicalTrials repository. Additionally, a CSV file has been provided, which might not contain as much information as the XML files but does give sufficient information.
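Since each study is one XML file, a small parser can pull a few fields into a flat record. The sketch below uses Python's standard `xml.etree.ElementTree`. The sample document and its element names (`clinical_study`, `id_info/nct_id`, `brief_title`, `overall_status`, `phase`) are assumptions for illustration; the actual files should be inspected before relying on these paths.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; element names and values are assumptions.
SAMPLE_XML = """<clinical_study>
  <id_info><nct_id>NCT-EXAMPLE-1</nct_id></id_info>
  <brief_title>Example study title</brief_title>
  <overall_status>Recruiting</overall_status>
  <phase>Phase 2</phase>
</clinical_study>"""


def summarize_study(xml_text):
    """Extract a few fields from one study document.

    Missing elements come back as None instead of raising an error.
    """
    root = ET.fromstring(xml_text)
    return {
        "nct_id": root.findtext("id_info/nct_id"),
        "title": root.findtext("brief_title"),
        "status": root.findtext("overall_status"),
        "phase": root.findtext("phase"),
        "enrollment": root.findtext("enrollment"),
    }


study = summarize_study(SAMPLE_XML)
```

For files on disk, `ET.parse(path).getroot()` would replace `ET.fromstring(xml_text)`, and the resulting dictionaries can be written out with `csv.DictWriter`.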
Please refer to this notebook for details on the dataset : https://www.kaggle.com/parulpandey/eda-on-covid-19-clinical-trials
ClinicalTrials.gov is a resource provided by the U.S. National Library of Medicine.
Listing a study does not mean it has been evaluated by the U.S. Federal Government. Read our disclaimer for details. Before participating in a study, talk to your health care provider and learn about the risks and potential benefits.
https://creativecommons.org/publicdomain/zero/1.0/
Description: The dataset contains information on ongoing clinical trials of medicinal products in Estonia, covering studies under both Directive 2001/20/EC and Regulation 536/2014. The dataset contains, among other things, information on the sponsor, title, specialty and date of authorisation of the study.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset contains a 1,000-patient synthetic clinical trial sample (5 scheduled visits per patient) and a full technical, quality, and regulatory documentation suite. Full 50,000-patient dataset is available by licensing. Contact: meddataresearch.hungary@protonmail.com
This dataset was created by ankit bhandari
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset was created by Gabriel Moraga
Released under Attribution 4.0 International (CC BY 4.0)
https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created by Daniel Ansted
Released under CC0: Public Domain