Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset was created by Gabriel Moraga
Released under Attribution 4.0 International (CC BY 4.0)
The goal of the Clinical Trials track is to focus research on the clinical trials matching problem: given a free text summary of a patient health record, find suitable clinical trials for that patient.
By Aero Data Lab [source]
This dataset contains information on clinical trials conducted by sponsors. Each row represents a clinical trial, and the columns represent various attributes of the trial, such as the National Clinical Trial Number, the sponsor of the trial, the title of the trial, and so on.
The purpose of this dataset is to provide a bird's-eye view of the clinical trial landscape. By understanding which sponsors are conducting which trials and for what conditions, we can get a better sense of where research is headed and what new treatments may be on the horizon.
- NCT is a unique identifier for clinical trials. It stands for National Clinical Trial Number.
- Sponsor is the organization that is funding the clinical trial.
- Title is the name of the clinical trial.
- Summary is a brief summary of the clinical trial.
- Start Year is the year that the clinical trial started.
- Start Month is the month that the clinical trial started.
- Phase is the stage of development of the investigative drug or device, which can be one of four phases: I, II, III, or IV.
- Enrollment is the number of participants in the clinical trial.
- Status is the status of enrollment in the study, which can be Recruiting, Not yet recruiting, Active, not recruiting, Completed, Suspended, or Terminated.
- Condition indicates what medical condition(s) are being studied in this particular NCT record.
- Identify patterns in clinical trials to improve the development process
- Understand how different sponsors fund clinical trials
By Aero Data Lab [source]
License
License: Dataset copyright by authors
- You are free to:
  - Share - copy and redistribute the material in any medium or format for any purpose, even commercially.
  - Adapt - remix, transform, and build upon the material for any purpose, even commercially.
- You must:
  - Give appropriate credit - provide a link to the license, and indicate if changes were made.
  - ShareAlike - distribute your contributions under the same license as the original.
  - Keep intact - all notices that refer to this license, including copyright notices.
File: AERO-BirdsEye-Data.csv

| Column name | Description |
|:------------|:------------|
| NCT | National Clinical Trial number. (String) |
| Sponsor | Name of the sponsor conducting the clinical trial. (String) |
| Title | Title of the clinical trial. (String) |
| Summary | Brief summary of the clinical trial. (String) |
| Start_Year | Year the clinical trial started. (Integer) |
| Start_Month | Month the clinical trial started. (String) |
| Phase | Phase of the clinical trial. (String) |
| Enrollment | Number of participants enrolled in the clinical trial. (Integer) |
| Status | Status of the clinical trial. (String) |
| Condition | Condition being tested in the clinical trial. (String) |
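As a quick illustration of how these columns can support the exploratory questions above, here is a minimal pandas sketch; it assumes the CSV sits in the working directory and that the column names match the table:

```python
# Minimal sketch: load the AERO dataset and summarize trials per sponsor/phase.
# Assumes AERO-BirdsEye-Data.csv is in the working directory and columns match the table.
import pandas as pd

df = pd.read_csv("AERO-BirdsEye-Data.csv")

# Count distinct trials per sponsor and the distribution of phases
# to get a bird's-eye view of the clinical trial landscape.
trials_per_sponsor = df.groupby("Sponsor")["NCT"].nunique().sort_values(ascending=False)
trials_per_phase = df["Phase"].value_counts()

print(trials_per_sponsor.head(10))
print(trials_per_phase)
```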
If you use this dataset in your research, please credit By Aero Data Lab [source]
This dataset was created by DEBRAH DOMINIC ATUAHENE
Part of Janatahack Hackathon in Analytics Vidhya
The healthcare sector has long been an early adopter of and benefited greatly from technological advances. These days, machine learning plays a key role in many health-related realms, including the development of new medical procedures, the handling of patient data, health camps and records, and the treatment of chronic diseases.
MedCamp organizes health camps in several cities, targeting working professionals with low work-life balance. They reach out to working people and ask them to register for these health camps. Those who attend can undergo health checks or increase their awareness by visiting various stalls (depending on the format of the camp).
MedCamp has conducted 65 such events over a period of 4 years, and they see a high drop-off between registration and the number of people taking tests at the camps. Over the last 4 years, they have stored data on ~110,000 registrations.
One of the major costs in arranging these camps is the amount of inventory that needs to be carried. Carrying more inventory than required incurs unnecessarily high costs; carrying less than required for conducting the medical checks leaves attendees with a bad experience.
The Process:
MedCamp employees / volunteers reach out to people and drive registrations.
During the camp, people who show up either undergo the medical tests or visit stalls, depending on the format of the health camp.
Other things to note:
Since this is a completely voluntary activity for the working professionals, MedCamp usually has little profile information about these people.
For a few camps, there was hardware failure, so some information about date and time of registration is lost.
MedCamp runs 3 formats of these camps. The first and second formats provide people with an instantaneous health score. The third format provides information about several health issues through various awareness stalls.
Favorable outcome:
For the first 2 formats, a favourable outcome is defined as getting a health_score, while in the third format it is defined as visiting at least one stall.
You need to predict the chances (probability) of having a favourable outcome.
Train / Test split:
Camps that started on or before 31st March 2006 are included in the training data.
Camps conducted on or after 1st April 2006 make up the test data.
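A hedged sketch of the modelling task follows. The actual MedCamp schema is not given here, so every name below (medcamp_registrations.csv, camp_start_date, outcome, and the feature list) is a hypothetical placeholder used only to show the date-based split and probability output:

```python
# Hypothetical sketch of the MedCamp task: split camps by start date and predict
# the probability of a favourable outcome. File and column names are placeholders,
# not the actual MedCamp schema.
import pandas as pd
from sklearn.linear_model import LogisticRegression

data = pd.read_csv("medcamp_registrations.csv", parse_dates=["camp_start_date"])

cutoff = pd.Timestamp("2006-04-01")
train = data[data["camp_start_date"] < cutoff]    # camps on or before 31 March 2006
test = data[data["camp_start_date"] >= cutoff]    # camps on or after 1 April 2006

features = ["age", "city", "camp_format"]         # placeholder feature names
X_train = pd.get_dummies(train[features])
X_test = pd.get_dummies(test[features]).reindex(columns=X_train.columns, fill_value=0)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, train["outcome"])              # outcome: 1 = favourable, 0 = not

# Predicted probability of a favourable outcome for each test registration.
test_probs = model.predict_proba(X_test)[:, 1]
```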
Credits to AV
Shared with the data science community to jump-start their journey in healthcare analytics.
If you are interested in joining Kaggle University Club, please e-mail Jessica Li at lijessica@google.com
This Hackathon is open to all undergraduate, master, and PhD students who are part of the Kaggle University Club program. The Hackathon provides students with a chance to build capacity via hands-on ML, learn from one another, and engage in a self-defined project that is meaningful to their careers.
Teams must register via Google Form to be eligible for the Hackathon. The Hackathon starts on Monday, November 12, 2018 and ends on Monday, December 10, 2018. Teams have one month to work on a team submission. Teams must do all work within the Kernel editor and set Kernel(s) to public at all times.
The freestyle format of hackathons has time and again stimulated groundbreaking and innovative data insights and technologies. The Kaggle University Club Hackathon recreates this environment virtually on our platform. We challenge you to build a meaningful project around the UCI Machine Learning - Drug Review Dataset. Teams are free to let their creativity run and propose methods to analyze this dataset and form interesting machine learning models.
Machine learning has permeated nearly all fields and disciplines of study. One hot topic is using natural language processing and sentiment analysis to identify, extract, and make use of subjective information. The UCI ML Drug Review dataset provides patient reviews on specific drugs along with related conditions and a 10-star patient rating system reflecting overall patient satisfaction. The data was obtained by crawling online pharmaceutical review sites. It was published in a study on sentiment analysis of drug experience over multiple facets, e.g., sentiments learned on specific aspects such as effectiveness and side effects (see the acknowledgments section to learn more).
The sky's the limit here in terms of what your team can do! Teams are free to add supplementary datasets in conjunction with the drug review dataset in their Kernel. Discussion is highly encouraged within the forum and Slack so everyone can learn from their peers.
Here are just a couple ideas as to what you could do with the data:
There is no one correct answer to this Hackathon, and teams are free to define the direction of their own project. That being said, there are certain core elements generally found across all outstanding Kernels on the Kaggle platform. The best Kernels are:
Teams with top submissions have a chance to receive exclusive Kaggle University Club swag and be featured on our official blog and across social media.
IMPORTANT: Teams must set all Kernels to public at all times. This is so we can track each team's progression, but more importantly it encourages collaboration, productive discussion, and healthy inspiration to all teams. It is not so that teams can simply copycat good ideas. If a team's Kernel isn't their own organic work, it will not be considered a top submission. Teams must come up with a project on their own.
The final Kernel submission for the Hackathon must contain the following information:
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Explore the intricacies of medical costs and healthcare expenses with our meticulously curated Medical Cost Dataset. This dataset offers valuable insights into the factors influencing medical charges, enabling researchers, analysts, and healthcare professionals to gain a deeper understanding of the dynamics within the healthcare industry.
Columns:
1. ID: A unique identifier assigned to each individual record, facilitating efficient data management and analysis.
2. Age: The age of the patient, providing a crucial demographic factor that often correlates with medical expenses.
3. Sex: The gender of the patient, offering insights into potential cost variations based on biological differences.
4. BMI: The Body Mass Index (BMI) of the patient, indicating the relative weight status and its potential impact on healthcare costs.
5. Children: The number of children or dependents covered under the medical insurance, influencing family-related medical expenses.
6. Smoker: A binary indicator of whether the patient is a smoker or not, as smoking habits can significantly impact healthcare costs.
7. Region: The geographic region of the patient, helping to understand regional disparities in healthcare expenditure.
8. Charges: The medical charges incurred by the patient, serving as the target variable for analysis and predictions.
Whether you're aiming to uncover patterns in medical billing, predict future healthcare costs, or explore the relationships between different variables and charges, our Medical Cost Dataset provides a robust foundation for your research. Researchers can utilize this dataset to develop data-driven models that enhance the efficiency of healthcare resource allocation, insurers can refine pricing strategies, and policymakers can make informed decisions to improve the overall healthcare system.
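As a small, non-authoritative example of the kind of prediction described above, the sketch below regresses charges on the other columns; the file name medical_cost.csv and the lower-case column spellings are assumptions:

```python
# Hedged sketch: regress charges on the remaining columns described above.
# The file name "medical_cost.csv" and exact column casing are assumptions.
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("medical_cost.csv")

# One-hot encode categorical columns; keep numeric columns as-is.
X = pd.get_dummies(df[["age", "sex", "bmi", "children", "smoker", "region"]],
                   drop_first=True)
y = df["charges"]

model = LinearRegression().fit(X, y)
print(dict(zip(X.columns, model.coef_.round(2))))  # per-feature effect on charges
```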
Unlock the potential of healthcare data with our comprehensive Medical Cost Dataset. Gain insights, make informed decisions, and contribute to the advancement of healthcare economics and policy. Start your analysis today and pave the way for a healthier future.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Summary of the internal and external validation datasets used to evaluate the models. “Origin” refers to the country where the data was collected. “Lesion” refers to the number of images in the dataset with lesion annotations. The Kaggle dataset (first row, shaded in gray) is the internal dataset used to evaluate the model, while the other datasets were used for external validation to assess the generalization properties of the trained model.
https://bioregistry.io/spdx:https://uts.nlm.nih.gov/uts/assets/LicenseAgreement.pdf
RxNorm provides normalized names for clinical drugs and links its names to many of the drug vocabularies commonly used in pharmacy management and drug interaction software, including those of First Databank, Micromedex, and Gold Standard Drug Database. By providing links between these vocabularies, RxNorm can mediate messages between systems not using the same software and vocabulary.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Accidental Drug Related Deaths in Connecticut’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/yamqwe/accidental-drug-related-deaths-in-connecticute on 13 February 2022.
--- Dataset description provided by original source is as follows ---
This dataset contains the list of each accidental death associated with a drug overdose in the state of Connecticut from 2012 to 2018. Deaths are grouped by age, race, ethnicity, and gender and by the types of drugs detected post-death.
COMMERCIAL LICENSE
For subscribing to a commercial license for John Snow Labs Data Library which includes all datasets curated and maintained by John Snow Labs please visit https://www.johnsnowlabs.com/marketplace.
This dataset was created by John and contains records with County, City, technical information, and other features such as: Is Hydrocodone, Age, and more.
- Analyze Race in relation to Is Methadone
- Study the influence of Is Heroin on Is Oxymorphone
- More datasets
If you use this dataset in your research, please credit John
--- Original source retains full ownership of the source dataset ---
Authors of the Dataset:
Pratik Bhowal (B.E., Dept of Electronics and Instrumentation Engineering, Jadavpur University, Kolkata, India) [LinkedIn], [Github]
Subhankar Sen (B.Tech, Dept of Computer Science Engineering, Manipal University Jaipur, India) [LinkedIn], [Github], [Google Scholar]
Jin Hee Yoon (Faculty of the Dept. of Mathematics and Statistics, Sejong University, Seoul, South Korea) [LinkedIn], [Google Scholar]
Zong Woo Geem (Faculty of the College of IT Convergence, Gachon University, South Korea) [LinkedIn], [Google Scholar]
Ram Sarkar (Professor, Dept. of Computer Science Engineering, Jadavpur University, Kolkata, India) [LinkedIn], [Google Scholar]
Overview
The authors have created a new dataset, the Novel COVID-19 Chestxray Repository, by fusing publicly available chest X-ray image repositories. In creating this combined dataset, three different datasets obtained from the Github and Kaggle databases, created by the authors of other research studies in this field, were utilized. In our study, frontal and lateral chest X-ray images are used, since this view of radiography is widely used by radiologists in clinical diagnosis. In the following section, the authors summarize how this dataset was created.
COVID-19 Radiography Database: The first release of this dataset reports 219 COVID-19, 1345 viral pneumonia, and 1341 normal radiographic chest X-ray images. This dataset was created by a team of researchers from Qatar University, Doha, Qatar, and the University of Dhaka, Bangladesh, in collaboration with medical doctors and specialists from Pakistan and Malaysia. This database is regularly updated as new cases of COVID-19 patients emerge worldwide. Related Paper: https://arxiv.org/abs/2003.13145
COVID-Chestxray set: Joseph Paul Cohen, Paul Morrison, and Lan Dao have created a public image repository on Github which contains both CT scans and digital chest X-rays. The data was collected mainly from retrospective cohorts of pediatric patients from Guangzhou Women and Children's Medical Center. With the aid of the metadata provided along with the dataset, we were able to extract 521 COVID-19 positive images; 239 viral and bacterial pneumonias, which fall into the following three broad categories: Middle East Respiratory Syndrome (MERS), Severe Acute Respiratory Syndrome (SARS), and Acute Respiratory Distress Syndrome (ARDS); and 218 normal radiographic chest X-ray images of varying image resolutions. Related Paper: https://arxiv.org/abs/2006.11988
Actualmed COVID chestxray dataset: The Actualmed-COVID-chestxray-dataset comprises 12 COVID-19 positive and 80 normal radiographic chest X-ray images.
The combined dataset includes chest X-ray images of the COVID-19, Pneumonia, and Normal (healthy) classes, with a total of 752, 1584, and 1639 images respectively. Information about the Novel COVID-19 Chestxray Database and its parent image repositories is provided in Table 1.
Table 1: Dataset Description

| Dataset | COVID-19 | Pneumonia | Normal |
|---------|----------|-----------|--------|
| COVID Chestxray set | 521 | 239 | 218 |
| COVID-19 Radiography Database (first release) | 219 | 1345 | 1341 |
| Actualmed COVID chestxray dataset | 12 | 0 | 80 |
| Total | 752 | 1584 | 1639 |
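One practical consequence of the counts in Table 1 is class imbalance: COVID-19 images are roughly half as frequent as the other two classes. A minimal sketch of inverse-frequency class weighting, computed directly from the Table 1 totals, is shown below:

```python
# Sketch: derive class weights from the totals in Table 1 to counter the
# imbalance between COVID-19 (752), Pneumonia (1584), and Normal (1639) images.
counts = {"COVID-19": 752, "Pneumonia": 1584, "Normal": 1639}
total = sum(counts.values())

# Inverse-frequency weighting: rarer classes receive larger weights.
class_weights = {label: total / (len(counts) * n) for label, n in counts.items()}
print(class_weights)  # COVID-19 ends up with roughly twice the weight of Normal
```

Such weights could then be passed to a classifier's loss function, although the original study may have handled imbalance differently.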
DATA ACCESS AND USE: Academic/Non-Commercial Use. Dataset License: Database: Open Database; Contents: Database Contents.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset provides a curated collection of accurate and reliable medical translation data. It is an invaluable resource designed for medical professionals, researchers, and language experts. The data encompasses a wide array of medical topics, including diagnoses, treatment plans, clinical research findings, and pharmaceutical information [1]. It supports various languages spoken across the globe, facilitating cross-cultural comparisons and analysis. Each translation has been meticulously crafted by professional translators with specialist knowledge in the medical domain to ensure authenticity and fidelity to the original source text [1]. This dataset aims to improve understanding and communication within the healthcare sector globally, enhancing accessibility to vital medical information regardless of language barriers and ensuring precision in patient care [1].
The data is provided in a CSV file format (specifically, train.csv) [1]. The dataset contains 13,149 records [2].
This dataset offers various ideal applications and use cases:
* Natural Language Processing (NLP) Research: Suitable for training and evaluating NLP models specifically for medical translation tasks, aiding in the development of new algorithms and techniques to enhance accuracy and efficiency [1].
* Machine Learning in Healthcare: Can be used to train machine learning algorithms for automatic translation of medical documents or text, thereby speeding up processes and providing healthcare professionals with timely access to essential information [1].
* Development of Medical Translation Applications: Its accurate translations are beneficial for creating mobile or web-based applications that offer instant translation services for healthcare providers, patients, or anyone seeking reliable medical content translations [1].
* Enhanced Global Communication: Supports improved communication with patients who speak different languages and facilitates the accurate transfer of vital medical information across borders [1].
The dataset covers various languages spoken worldwide, enabling cross-cultural analysis and supporting global healthcare communications among diverse populations [1]. The region of coverage is Global [3].
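Because the schema is not documented beyond the file name and record count, a sensible first step is simply to inspect the file. A minimal, assumption-light sketch:

```python
# Sketch: inspect train.csv without assuming its column names, since only the
# file name and record count are documented.
import pandas as pd

df = pd.read_csv("train.csv")
print(df.shape)                # expected to report 13,149 rows
print(df.columns.tolist())     # discover the actual source/target columns
print(df.head())
```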
CC0
Original Data Source: Accurate Medical Translation Data
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Summary of the classification performance with confidence intervals (CIs) computed at 95% using bootstrapping (n=1000). “AUC” refers to the area under the receiver-operating characteristic curve. “Loc Bag” and “Loc GBP” respectively refer to the localization precision of the sparse BagNet and of Guided Backpropagation on ResNet-50 in localizing lesions on annotated images. For each dataset, the first row shows the performance of the interpretable sparse BagNet model, while the second row shows the performance of the baseline black-box ResNet-50 model. The Kaggle dataset (first row) is the internal dataset used to train and evaluate the model, while the other datasets were used for external validation to assess the generalization properties of the trained model. The low classification performance on the FCM-UNA and FGA-DR datasets can be explained by the relatively low quality of most images in the FCM-UNA dataset and the large intensity variation of the FGA-DR dataset (S
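For readers unfamiliar with the procedure, the sketch below reproduces the general bootstrap recipe described in the caption (95% CIs from n=1000 resamples of the AUC); the labels and scores are placeholder arrays, not the study's data:

```python
# Sketch of bootstrapped 95% confidence intervals for AUC (n=1000 resamples).
# y_true and y_score are placeholder arrays, not data from the study.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=500)   # placeholder binary labels
y_score = rng.random(500)               # placeholder model scores

aucs = []
for _ in range(1000):
    idx = rng.integers(0, len(y_true), len(y_true))   # resample with replacement
    if len(np.unique(y_true[idx])) < 2:               # skip single-class resamples
        continue
    aucs.append(roc_auc_score(y_true[idx], y_score[idx]))

low, high = np.percentile(aucs, [2.5, 97.5])
print(f"AUC = {np.mean(aucs):.3f} (95% CI {low:.3f} to {high:.3f})")
```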
The Medical Information Mart for Intensive Care III (MIMIC-III) dataset is a large, de-identified, and publicly available collection of medical records. Each record in the dataset includes ICD-9 codes, which identify diagnoses and procedures performed. Each code is partitioned into sub-codes, which often include specific circumstantial details. The dataset consists of 112,000 clinical report records (average length 709.3 tokens) and 1,159 top-level ICD-9 codes. Each report is assigned 7.6 codes on average. Data includes vital signs, medications, laboratory measurements, observations and notes charted by care providers, fluid balance, procedure codes, diagnostic codes, imaging reports, hospital length of stay, survival data, and more.
The database supports applications including academic and industrial research, quality improvement initiatives, and higher education coursework.
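The multi-label structure described above, with each report carrying several top-level ICD-9 codes, is commonly encoded as a binary indicator matrix before modelling. A small sketch with made-up codes (MIMIC-III itself requires credentialed access):

```python
# Sketch: binarize per-report ICD-9 code lists into a multi-label matrix.
# The example records and codes are invented for illustration only.
from sklearn.preprocessing import MultiLabelBinarizer

report_codes = [
    ["401", "250", "428"],   # hypothetical top-level ICD-9 codes for one report
    ["428", "584"],
    ["250"],
]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(report_codes)   # shape: (n_reports, n_distinct_codes)
print(mlb.classes_)
print(Y)
```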
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This heart disease dataset is curated by combining 3 popular heart disease datasets. The first dataset (collected from Kaggle) contains 70,000 records with 11 independent features, which makes it the largest heart disease dataset available so far for research purposes. These data were collected at the moment of medical examination, together with information given by the patient. The second and third datasets contain 303 and 293 instances respectively, with 13 common features. The three datasets used for its curation are: Cardio Data (Kaggle Dataset)
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Description:
This dataset comprises transcriptions of conversations between doctors and patients, providing valuable insights into the dynamics of medical consultations. It includes a wide range of interactions, covering various medical conditions, patient concerns, and treatment discussions. The data is structured to capture both the questions and concerns raised by patients, as well as the medical advice, diagnoses, and explanations provided by doctors.
Key Features:
Potential Use Cases:
This dataset is a valuable resource for researchers, data scientists, and healthcare professionals interested in the intersection of technology and medicine, aiming to improve healthcare communication through data-driven approaches.
Retrospectively collected medical data has the opportunity to improve patient care through knowledge discovery and algorithm development. Broad reuse of medical data is desirable for the greatest public good, but data sharing must be done in a manner which protects patient privacy.
The Medical Information Mart for Intensive Care (MIMIC)-III database provided critical care data for over 40,000 patients admitted to intensive care units at the Beth Israel Deaconess Medical Center (BIDMC). Importantly, MIMIC-III was deidentified, and patient identifiers were removed according to the Health Insurance Portability and Accountability Act (HIPAA) Safe Harbor provision. MIMIC-III has been integral in driving large amounts of research in clinical informatics, epidemiology, and machine learning. Here we present MIMIC-IV, an update to MIMIC-III, which incorporates contemporary data and improves on numerous aspects of MIMIC-III. MIMIC-IV adopts a modular approach to data organization, highlighting data provenance and facilitating both individual and combined use of disparate data sources. MIMIC-IV is intended to carry on the success of MIMIC-III and support a broad set of applications within healthcare.
Alzheimer's Disease Neuroimaging Initiative (ADNI) is a multisite study that aims to improve clinical trials for the prevention and treatment of Alzheimer's disease (AD).[1] This cooperative study combines expertise and funding from the private and public sector to study subjects with AD, as well as those who may develop AD and controls with no signs of cognitive impairment.[2] Researchers at 63 sites in the US and Canada track the progression of AD in the human brain with neuroimaging, biochemical, and genetic biological markers.[2][3] This knowledge helps to design better clinical trials for the prevention and treatment of AD. ADNI has made a global impact,[4] firstly by developing a set of standardized protocols to allow the comparison of results from multiple centers,[4] and secondly by its data-sharing policy, which makes all of the data available without embargo to qualified researchers worldwide.[5] To date, over 1000 scientific publications have used ADNI data.[6] A number of other initiatives related to AD and other diseases have been designed and implemented using ADNI as a model.[4] ADNI has been running since 2004 and is currently funded until 2021.[7]
Source: Wikipedia, https://en.wikipedia.org/wiki/Alzheimer%27s_Disease_Neuroimaging_Initiative
The WUSTL-EHMS-2020 dataset was created using a real-time Enhanced Healthcare Monitoring System (EHMS) testbed [1]. The testbed collects both network flow metrics and patients' biometrics, addressing the scarcity of datasets that combine the two.
This dataset is designed to facilitate the creation of question-answer models, specifically tailored for the CORD-19 dataset. It provides a focused collection of high-quality articles, emphasising detected study designs. This makes it an invaluable resource for natural language processing (NLP) research and applications related to the coronavirus, aiding in the development of intelligent health information systems.
The dataset primarily includes files structured for question, context, and answer combinations. For instance, the CSV file contains the following key columns:
* category: Represents the category of the question.
* question: The actual text of the question.
* context: The contextual passage from which the answer can be extracted.
* answer: The specific answer text relevant to the question and context.
The dataset is provided in multiple formats, including a line-by-line export of CORD-19 data in cord19.txt, a CSV file (cord19-qa.csv) containing question, context, answer combinations, and a JSON file (cord19-qa.json) formatted for SQuAD 2.0. Specific numbers for rows or records are not detailed in the available information. The current version of this dataset is 1.0 and it is listed as globally available.
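A minimal sketch for working with the CSV export is shown below; the column names follow the description above, and the relative file path is an assumption:

```python
# Sketch: load cord19-qa.csv and peek at one question/context/answer triple.
# Column names follow the dataset description; the file path is assumed.
import pandas as pd

qa = pd.read_csv("cord19-qa.csv")
row = qa.iloc[0]
print(row["category"])
print(row["question"])
print(str(row["context"])[:200], "...")   # truncate long passages for display
print(row["answer"])
```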
This dataset is ideally suited for building and fine-tuning transformer models for language modelling and SQuAD 2.0 tasks. It serves as a foundational resource for developing advanced question-answering systems in the medical and healthcare domain, particularly those focused on coronavirus-related information. It is highly valuable for researchers and developers working on AI and large language model (LLM) applications.
The dataset's focus is global in its applicability, concentrating on high-quality articles with detected study designs related to the CORD-19 research. While a specific time range for the included articles is not provided, the dataset itself was listed on 17th June 2025, indicating its recent availability on the platform. The primary scope is medical and healthcare information, specifically concerning the coronavirus.
CC BY-SA
This dataset is intended for a broad range of users, including:
* AI/ML Engineers and Data Scientists: For training and evaluating question-answering models and other NLP tasks.
* Healthcare Researchers: To develop tools for quickly extracting information from a vast corpus of medical literature.
* Academic Institutions: For research and educational purposes in the fields of AI, NLP, and medical informatics.
* Start-ups and Enterprises: Developing innovative health information systems or AI-powered medical assistants.
Original Data Source: CORD-19 QA
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset was created by Gabriel Moraga
Released under Attribution 4.0 International (CC BY 4.0)