Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset was created by Gabriel Moraga
Released under Attribution 4.0 International (CC BY 4.0)
The goal of the Clinical Trials track is to focus research on the clinical trials matching problem: given a free text summary of a patient health record, find suitable clinical trials for that patient.
By Aero Data Lab [source]
This dataset contains information on clinical trials conducted by sponsors. Each row represents a clinical trial, and the columns represent various attributes of the trial, such as the National Clinical Trial Number, the sponsor of the trial, the title of the trial, and so on.
The purpose of this dataset is to provide a bird's-eye view of the clinical trial landscape. By understanding which sponsors are conducting which trials and for what conditions, we can get a better sense of where research is headed and what new treatments may be on the horizon.
- NCT is a unique identifier for clinical trials. It stands for National Clinical Trial Number.
- Sponsor is the organization that is funding the clinical trial.
- Title is the name of the clinical trial.
- Summary is a brief summary of the clinical trial.
- Start Year is the year that the clinical trial started.
- Start Month is the month that the clinical trial started.
- Phase is the stage of development of the investigative drug or device, which can be one of four phases: I, II, III, or IV.
- Enrollment is the number of participants in the clinical trial.
- Status is the status of enrollment in the study, which can be Recruiting, Not yet recruiting, Active, not recruiting, Completed, Suspended, or Terminated.
- Condition indicates what medical condition(s) are being studied in this particular NCT record.
- Identify patterns in clinical trials to improve the development process
- Understand how different sponsors fund clinical trials
By Aero Data Lab [source]
License
License: Dataset copyright by authors
- You are free to:
  - Share - copy and redistribute the material in any medium or format for any purpose, even commercially.
  - Adapt - remix, transform, and build upon the material for any purpose, even commercially.
- You must:
  - Give appropriate credit - provide a link to the license, and indicate if changes were made.
  - ShareAlike - distribute your contributions under the same license as the original.
  - Keep intact - all notices that refer to this license, including copyright notices.
File: AERO-BirdsEye-Data.csv

| Column name | Description |
|:------------|:------------|
| NCT | National Clinical Trial number. (String) |
| Sponsor | Name of the sponsor conducting the clinical trial. (String) |
| Title | Title of the clinical trial. (String) |
| Summary | Brief summary of the clinical trial. (String) |
| Start_Year | Year the clinical trial started. (Integer) |
| Start_Month | Month the clinical trial started. (String) |
| Phase | Phase of the clinical trial. (String) |
| Enrollment | Number of participants enrolled in the clinical trial. (Integer) |
| Status | Status of the clinical trial. (String) |
| Condition | Condition being tested in the clinical trial. (String) |
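As a quick illustration of how these columns can support the exploratory questions above, here is a minimal pandas sketch; it assumes the CSV sits in the working directory and that the column names match the table:

```python
# Minimal sketch: load the AERO dataset and summarize trials per sponsor/phase.
# Assumes AERO-BirdsEye-Data.csv is in the working directory and columns match the table.
import pandas as pd

df = pd.read_csv("AERO-BirdsEye-Data.csv")

# Count distinct trials per sponsor and the distribution of phases
# to get a bird's-eye view of the clinical trial landscape.
trials_per_sponsor = df.groupby("Sponsor")["NCT"].nunique().sort_values(ascending=False)
trials_per_phase = df["Phase"].value_counts()

print(trials_per_sponsor.head(10))
print(trials_per_phase)
```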
If you use this dataset in your research, please credit By Aero Data Lab [source]
This dataset was created by DEBRAH DOMINIC ATUAHENE
Part of Janatahack Hackathon in Analytics Vidhya
The healthcare sector has long been an early adopter of and benefited greatly from technological advances. These days, machine learning plays a key role in many health-related realms, including the development of new medical procedures, the handling of patient data, health camps and records, and the treatment of chronic diseases.
MedCamp organizes health camps in several cities, targeting working professionals with low work-life balance. They reach out to working people and ask them to register for these health camps. Those who attend can undergo health checks or increase their awareness by visiting various stalls (depending on the format of the camp).
MedCamp has conducted 65 such events over a period of 4 years, and they see a high drop-off between registration and the number of people taking tests at the camps. Over the last 4 years, they have stored data on ~110,000 registrations.
One of the major costs in arranging these camps is the amount of inventory that needs to be carried. Carrying more inventory than required incurs unnecessarily high costs; carrying less than required for conducting the medical checks leaves attendees with a bad experience.
The Process:
MedCamp employees / volunteers reach out to people and drive registrations.
During the camp, people who show up either undergo the medical tests or visit stalls, depending on the format of the health camp.
Other things to note:
Since this is a completely voluntary activity for the working professionals, MedCamp usually has little profile information about these people.
For a few camps, there was hardware failure, so some information about date and time of registration is lost.
MedCamp runs 3 formats of these camps. The first and second formats provide people with an instantaneous health score. The third format provides information about several health issues through various awareness stalls.
Favorable outcome:
For the first 2 formats, a favourable outcome is defined as getting a health_score, while in the third format it is defined as visiting at least one stall.
You need to predict the chances (probability) of having a favourable outcome.
Train / Test split:
Camps that started on or before 31st March 2006 are included in the training data.
Camps conducted on or after 1st April 2006 make up the test data.
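A hedged sketch of the modelling task follows. The actual MedCamp schema is not given here, so every name below (medcamp_registrations.csv, camp_start_date, outcome, and the feature list) is a hypothetical placeholder used only to show the date-based split and probability output:

```python
# Hypothetical sketch of the MedCamp task: split camps by start date and predict
# the probability of a favourable outcome. File and column names are placeholders,
# not the actual MedCamp schema.
import pandas as pd
from sklearn.linear_model import LogisticRegression

data = pd.read_csv("medcamp_registrations.csv", parse_dates=["camp_start_date"])

cutoff = pd.Timestamp("2006-04-01")
train = data[data["camp_start_date"] < cutoff]    # camps on or before 31 March 2006
test = data[data["camp_start_date"] >= cutoff]    # camps on or after 1 April 2006

features = ["age", "city", "camp_format"]         # placeholder feature names
X_train = pd.get_dummies(train[features])
X_test = pd.get_dummies(test[features]).reindex(columns=X_train.columns, fill_value=0)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, train["outcome"])              # outcome: 1 = favourable, 0 = not

# Predicted probability of a favourable outcome for each test registration.
test_probs = model.predict_proba(X_test)[:, 1]
```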
Credits to AV
Shared with the data science community to jump-start their journey in healthcare analytics.
If you are interested in joining Kaggle University Club, please e-mail Jessica Li at lijessica@google.com
This Hackathon is open to all undergraduate, master, and PhD students who are part of the Kaggle University Club program. The Hackathon provides students with a chance to build capacity via hands-on ML, learn from one another, and engage in a self-defined project that is meaningful to their careers.
Teams must register via Google Form to be eligible for the Hackathon. The Hackathon starts on Monday, November 12, 2018 and ends on Monday, December 10, 2018. Teams have one month to work on a team submission. Teams must do all work within the Kernel editor and set Kernel(s) to public at all times.
The freestyle format of hackathons has time and again stimulated groundbreaking and innovative data insights and technologies. The Kaggle University Club Hackathon recreates this environment virtually on our platform. We challenge you to build a meaningful project around the UCI Machine Learning - Drug Review Dataset. Teams are free to let their creativity run and propose methods to analyze this dataset and form interesting machine learning models.
Machine learning has permeated nearly all fields and disciplines of study. One hot topic is using natural language processing and sentiment analysis to identify, extract, and make use of subjective information. The UCI ML Drug Review dataset provides patient reviews on specific drugs along with related conditions and a 10-star patient rating system reflecting overall patient satisfaction. The data was obtained by crawling online pharmaceutical review sites. It was published in a study on sentiment analysis of drug experience over multiple facets, e.g., sentiments learned on specific aspects such as effectiveness and side effects (see the acknowledgments section to learn more).
The sky's the limit here in terms of what your team can do! Teams are free to add supplementary datasets in conjunction with the drug review dataset in their Kernel. Discussion is highly encouraged within the forum and Slack so everyone can learn from their peers.
Here are just a couple ideas as to what you could do with the data:
There is no one correct answer to this Hackathon, and teams are free to define the direction of their own project. That being said, there are certain core elements generally found across all outstanding Kernels on the Kaggle platform. The best Kernels are:
Teams with top submissions have a chance to receive exclusive Kaggle University Club swag and be featured on our official blog and across social media.
IMPORTANT: Teams must set all Kernels to public at all times. This is so we can track each team's progression, but more importantly it encourages collaboration, productive discussion, and healthy inspiration to all teams. It is not so that teams can simply copycat good ideas. If a team's Kernel isn't their own organic work, it will not be considered a top submission. Teams must come up with a project on their own.
The final Kernel submission for the Hackathon must contain the following information:
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Explore the intricacies of medical costs and healthcare expenses with our meticulously curated Medical Cost Dataset. This dataset offers valuable insights into the factors influencing medical charges, enabling researchers, analysts, and healthcare professionals to gain a deeper understanding of the dynamics within the healthcare industry.
Columns:
1. ID: A unique identifier assigned to each individual record, facilitating efficient data management and analysis.
2. Age: The age of the patient, providing a crucial demographic factor that often correlates with medical expenses.
3. Sex: The gender of the patient, offering insights into potential cost variations based on biological differences.
4. BMI: The Body Mass Index (BMI) of the patient, indicating the relative weight status and its potential impact on healthcare costs.
5. Children: The number of children or dependents covered under the medical insurance, influencing family-related medical expenses.
6. Smoker: A binary indicator of whether the patient is a smoker or not, as smoking habits can significantly impact healthcare costs.
7. Region: The geographic region of the patient, helping to understand regional disparities in healthcare expenditure.
8. Charges: The medical charges incurred by the patient, serving as the target variable for analysis and predictions.
Whether you're aiming to uncover patterns in medical billing, predict future healthcare costs, or explore the relationships between different variables and charges, our Medical Cost Dataset provides a robust foundation for your research. Researchers can utilize this dataset to develop data-driven models that enhance the efficiency of healthcare resource allocation, insurers can refine pricing strategies, and policymakers can make informed decisions to improve the overall healthcare system.
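As a small, non-authoritative example of the kind of prediction described above, the sketch below regresses charges on the other columns; the file name medical_cost.csv and the lower-case column spellings are assumptions:

```python
# Hedged sketch: regress charges on the remaining columns described above.
# The file name "medical_cost.csv" and exact column casing are assumptions.
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("medical_cost.csv")

# One-hot encode categorical columns; keep numeric columns as-is.
X = pd.get_dummies(df[["age", "sex", "bmi", "children", "smoker", "region"]],
                   drop_first=True)
y = df["charges"]

model = LinearRegression().fit(X, y)
print(dict(zip(X.columns, model.coef_.round(2))))  # per-feature effect on charges
```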
Unlock the potential of healthcare data with our comprehensive Medical Cost Dataset. Gain insights, make informed decisions, and contribute to the advancement of healthcare economics and policy. Start your analysis today and pave the way for a healthier future.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Summary of the internal and external validation datasets used to evaluate the models. “Origin” refers to the country where the data was collected. “Lesion” refers to the number of images in the dataset with lesion annotations. The Kaggle dataset (first row, shaded in gray) is the internal dataset used to evaluate the model, while the other datasets were used for external validation to assess the generalization properties of the trained model.
https://bioregistry.io/spdx:https://uts.nlm.nih.gov/uts/assets/LicenseAgreement.pdf
RxNorm provides normalized names for clinical drugs and links its names to many of the drug vocabularies commonly used in pharmacy management and drug interaction software, including those of First Databank, Micromedex, and Gold Standard Drug Database. By providing links between these vocabularies, RxNorm can mediate messages between systems not using the same software and vocabulary.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Accidental Drug Related Deaths in Connecticut’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/yamqwe/accidental-drug-related-deaths-in-connecticute on 13 February 2022.
--- Dataset description provided by original source is as follows ---
This dataset contains the list of each accidental death associated with a drug overdose in the state of Connecticut from 2012 to 2018. Deaths are grouped by age, race, ethnicity, and gender and by the types of drugs detected post-death.
COMMERCIAL LICENSE
For subscribing to a commercial license for John Snow Labs Data Library which includes all datasets curated and maintained by John Snow Labs please visit https://www.johnsnowlabs.com/marketplace.
This dataset was created by John and contains records with County, City, technical information, and other features such as: Is Hydrocodone, Age, and more.
- Analyze Race in relation to Is Methadone
- Study the influence of Is Heroin on Is Oxymorphone
- More datasets
If you use this dataset in your research, please credit John
--- Original source retains full ownership of the source dataset ---
Authors of the Dataset:
Pratik Bhowal (B.E., Dept of Electronics and Instrumentation Engineering, Jadavpur University, Kolkata, India) [LinkedIn], [Github]
Subhankar Sen (B.Tech, Dept of Computer Science Engineering, Manipal University Jaipur, India) [LinkedIn], [Github], [Google Scholar]
Jin Hee Yoon (Faculty of the Dept. of Mathematics and Statistics, Sejong University, Seoul, South Korea) [LinkedIn], [Google Scholar]
Zong Woo Geem (Faculty of the College of IT Convergence, Gachon University, South Korea) [LinkedIn], [Google Scholar]
Ram Sarkar (Professor, Dept. of Computer Science Engineering, Jadavpur University, Kolkata, India) [LinkedIn], [Google Scholar]
Overview
The authors have created a new dataset, the Novel COVID-19 Chestxray Repository, by fusing publicly available chest X-ray image repositories. In creating this combined dataset, three different datasets obtained from the Github and Kaggle databases, created by the authors of other research studies in this field, were utilized. In our study, frontal and lateral chest X-ray images are used, since this view of radiography is widely used by radiologists in clinical diagnosis. In the following section, the authors summarize how this dataset was created.
COVID-19 Radiography Database: The first release of this dataset reports 219 COVID-19, 1345 viral pneumonia, and 1341 normal radiographic chest X-ray images. This dataset was created by a team of researchers from Qatar University, Doha, Qatar, and the University of Dhaka, Bangladesh, in collaboration with medical doctors and specialists from Pakistan and Malaysia. This database is regularly updated as new cases of COVID-19 patients emerge worldwide. Related Paper: https://arxiv.org/abs/2003.13145
COVID-Chestxray set: Joseph Paul Cohen, Paul Morrison, and Lan Dao have created a public image repository on Github which contains both CT scans and digital chest X-rays. The data was collected mainly from retrospective cohorts of pediatric patients from Guangzhou Women and Children's Medical Center. With the aid of the metadata provided along with the dataset, we were able to extract 521 COVID-19 positive images; 239 viral and bacterial pneumonias, which fall into the following three broad categories: Middle East Respiratory Syndrome (MERS), Severe Acute Respiratory Syndrome (SARS), and Acute Respiratory Distress Syndrome (ARDS); and 218 normal radiographic chest X-ray images of varying image resolutions. Related Paper: https://arxiv.org/abs/2006.11988
Actualmed COVID chestxray dataset: The Actualmed-COVID-chestxray-dataset comprises 12 COVID-19 positive and 80 normal radiographic chest X-ray images.
The combined dataset includes chest X-ray images of the COVID-19, Pneumonia, and Normal (healthy) classes, with a total of 752, 1584, and 1639 images respectively. Information about the Novel COVID-19 Chestxray Database and its parent image repositories is provided in Table 1.
Table 1: Dataset Description

| Dataset | COVID-19 | Pneumonia | Normal |
|---------|----------|-----------|--------|
| COVID Chestxray set | 521 | 239 | 218 |
| COVID-19 Radiography Database (first release) | 219 | 1345 | 1341 |
| Actualmed COVID chestxray dataset | 12 | 0 | 80 |
| Total | 752 | 1584 | 1639 |
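One practical consequence of the counts in Table 1 is class imbalance: COVID-19 images are roughly half as frequent as the other two classes. A minimal sketch of inverse-frequency class weighting, computed directly from the Table 1 totals, is shown below:

```python
# Sketch: derive class weights from the totals in Table 1 to counter the
# imbalance between COVID-19 (752), Pneumonia (1584), and Normal (1639) images.
counts = {"COVID-19": 752, "Pneumonia": 1584, "Normal": 1639}
total = sum(counts.values())

# Inverse-frequency weighting: rarer classes receive larger weights.
class_weights = {label: total / (len(counts) * n) for label, n in counts.items()}
print(class_weights)  # COVID-19 ends up with roughly twice the weight of Normal
```

Such weights could then be passed to a classifier's loss function, although the original study may have handled imbalance differently.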
DATA ACCESS AND USE: Academic/Non-Commercial Use. Dataset License: Database: Open Database; Contents: Database Contents.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset provides a curated collection of accurate and reliable medical translation data. It is an invaluable resource designed for medical professionals, researchers, and language experts. The data encompasses a wide array of medical topics, including diagnoses, treatment plans, clinical research findings, and pharmaceutical information [1]. It supports various languages spoken across the globe, facilitating cross-cultural comparisons and analysis. Each translation has been meticulously crafted by professional translators with specialist knowledge in the medical domain to ensure authenticity and fidelity to the original source text [1]. This dataset aims to improve understanding and communication within the healthcare sector globally, enhancing accessibility to vital medical information regardless of language barriers and ensuring precision in patient care [1].
The data is provided in a CSV file format (specifically, train.csv) [1]. The dataset contains 13,149 records [2].
This dataset offers various ideal applications and use cases:
* Natural Language Processing (NLP) Research: Suitable for training and evaluating NLP models specifically for medical translation tasks, aiding in the development of new algorithms and techniques to enhance accuracy and efficiency [1].
* Machine Learning in Healthcare: Can be used to train machine learning algorithms for automatic translation of medical documents or text, thereby speeding up processes and providing healthcare professionals with timely access to essential information [1].
* Development of Medical Translation Applications: Its accurate translations are beneficial for creating mobile or web-based applications that offer instant translation services for healthcare providers, patients, or anyone seeking reliable medical content translations [1].
* Enhanced Global Communication: Supports improved communication with patients who speak different languages and facilitates the accurate transfer of vital medical information across borders [1].
The dataset covers various languages spoken worldwide, enabling cross-cultural analysis and supporting global healthcare communications among diverse populations [1]. The region of coverage is Global [3].
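Because the schema is not documented beyond the file name and record count, a sensible first step is simply to inspect the file. A minimal, assumption-light sketch:

```python
# Sketch: inspect train.csv without assuming its column names, since only the
# file name and record count are documented.
import pandas as pd

df = pd.read_csv("train.csv")
print(df.shape)                # expected to report 13,149 rows
print(df.columns.tolist())     # discover the actual source/target columns
print(df.head())
```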
CC0
Original Data Source: Accurate Medical Translation Data
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Summary of the classification performance with confidence intervals (CIs) computed at 95% using bootstrapping (n=1000). “AUC” refers to the area under the receiver-operating characteristic curve. “Loc Bag” and “Loc GBP” respectively refer to the localization precision of the sparse BagNet and of Guided Backpropagation on ResNet-50 in localizing lesions on annotated images. For each dataset, the first row shows the performance of the interpretable sparse BagNet model, while the second row shows the performance of the baseline black-box ResNet-50 model. The Kaggle dataset (first row) is the internal dataset used to train and evaluate the model, while the other datasets were used for external validation to assess the generalization properties of the trained model. The low classification performance on the FCM-UNA and FGA-DR datasets can be explained by the relatively low quality of most images in the FCM-UNA dataset and the large intensity variation of the FGA-DR dataset (S
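For readers unfamiliar with the procedure, the sketch below reproduces the general bootstrap recipe described in the caption (95% CIs from n=1000 resamples of the AUC); the labels and scores are placeholder arrays, not the study's data:

```python
# Sketch of bootstrapped 95% confidence intervals for AUC (n=1000 resamples).
# y_true and y_score are placeholder arrays, not data from the study.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=500)   # placeholder binary labels
y_score = rng.random(500)               # placeholder model scores

aucs = []
for _ in range(1000):
    idx = rng.integers(0, len(y_true), len(y_true))   # resample with replacement
    if len(np.unique(y_true[idx])) < 2:               # skip single-class resamples
        continue
    aucs.append(roc_auc_score(y_true[idx], y_score[idx]))

low, high = np.percentile(aucs, [2.5, 97.5])
print(f"AUC = {np.mean(aucs):.3f} (95% CI {low:.3f} to {high:.3f})")
```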
The Medical Information Mart for Intensive Care III (MIMIC-III) dataset is a large, de-identified, and publicly available collection of medical records. Each record in the dataset includes ICD-9 codes, which identify diagnoses and procedures performed. Each code is partitioned into sub-codes, which often include specific circumstantial details. The dataset consists of 112,000 clinical report records (average length 709.3 tokens) and 1,159 top-level ICD-9 codes. Each report is assigned 7.6 codes on average. Data includes vital signs, medications, laboratory measurements, observations and notes charted by care providers, fluid balance, procedure codes, diagnostic codes, imaging reports, hospital length of stay, survival data, and more.
The database supports applications including academic and industrial research, quality improvement initiatives, and higher education coursework.
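The multi-label structure described above, with each report carrying several top-level ICD-9 codes, is commonly encoded as a binary indicator matrix before modelling. A small sketch with made-up codes (MIMIC-III itself requires credentialed access):

```python
# Sketch: binarize per-report ICD-9 code lists into a multi-label matrix.
# The example records and codes are invented for illustration only.
from sklearn.preprocessing import MultiLabelBinarizer

report_codes = [
    ["401", "250", "428"],   # hypothetical top-level ICD-9 codes for one report
    ["428", "584"],
    ["250"],
]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(report_codes)   # shape: (n_reports, n_distinct_codes)
print(mlb.classes_)
print(Y)
```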
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This heart disease dataset is curated by combining 3 popular heart disease datasets. The first dataset (collected from Kaggle) contains 70,000 records with 11 independent features, which makes it the largest heart disease dataset available so far for research purposes. These data were collected at the moment of medical examination, together with information given by the patient. The second and third datasets contain 303 and 293 instances respectively, with 13 common features. The three datasets used for its curation are: Cardio Data (Kaggle Dataset)
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Description:
This dataset comprises transcriptions of conversations between doctors and patients, providing valuable insights into the dynamics of medical consultations. It includes a wide range of interactions, covering various medical conditions, patient concerns, and treatment discussions. The data is structured to capture both the questions and concerns raised by patients, as well as the medical advice, diagnoses, and explanations provided by doctors.
Key Features:
Potential Use Cases:
This dataset is a valuable resource for researchers, data scientists, and healthcare professionals interested in the intersection of technology and medicine, aiming to improve healthcare communication through data-driven approaches.
Retrospectively collected medical data has the opportunity to improve patient care through knowledge discovery and algorithm development. Broad reuse of medical data is desirable for the greatest public good, but data sharing must be done in a manner which protects patient privacy.
The Medical Information Mart for Intensive Care (MIMIC)-III database provided critical care data for over 40,000 patients admitted to intensive care units at the Beth Israel Deaconess Medical Center (BIDMC). Importantly, MIMIC-III was deidentified, and patient identifiers were removed according to the Health Insurance Portability and Accountability Act (HIPAA) Safe Harbor provision. MIMIC-III has been integral in driving large amounts of research in clinical informatics, epidemiology, and machine learning. Here we present MIMIC-IV, an update to MIMIC-III, which incorporates contemporary data and improves on numerous aspects of MIMIC-III. MIMIC-IV adopts a modular approach to data organization, highlighting data provenance and facilitating both individual and combined use of disparate data sources. MIMIC-IV is intended to carry on the success of MIMIC-III and support a broad set of applications within healthcare.
Alzheimer's Disease Neuroimaging Initiative (ADNI) is a multisite study that aims to improve clinical trials for the prevention and treatment of Alzheimer's disease (AD).[1] This cooperative study combines expertise and funding from the private and public sector to study subjects with AD, as well as those who may develop AD and controls with no signs of cognitive impairment.[2] Researchers at 63 sites in the US and Canada track the progression of AD in the human brain with neuroimaging, biochemical, and genetic biological markers.[2][3] This knowledge helps to design better clinical trials for the prevention and treatment of AD. ADNI has made a global impact,[4] firstly by developing a set of standardized protocols to allow the comparison of results from multiple centers,[4] and secondly by its data-sharing policy, which makes all of the data available without embargo to qualified researchers worldwide.[5] To date, over 1000 scientific publications have used ADNI data.[6] A number of other initiatives related to AD and other diseases have been designed and implemented using ADNI as a model.[4] ADNI has been running since 2004 and is currently funded until 2021.[7]
Source: Wikipedia, https://en.wikipedia.org/wiki/Alzheimer%27s_Disease_Neuroimaging_Initiative
The WUSTL-EHMS-2020 dataset was created using a real-time Enhanced Healthcare Monitoring System (EHMS) testbed [1]. The testbed collects both network flow metrics and patients' biometrics, addressing the scarcity of datasets that combine the two.
This dataset is designed to facilitate the creation of question-answer models, specifically tailored for the CORD-19 dataset. It provides a focused collection of high-quality articles, emphasising detected study designs. This makes it an invaluable resource for natural language processing (NLP) research and applications related to the coronavirus, aiding in the development of intelligent health information systems.
The dataset primarily includes files structured for question, context, and answer combinations. For instance, the CSV file contains the following key columns:
* category: Represents the category of the question.
* question: The actual text of the question.
* context: The contextual passage from which the answer can be extracted.
* answer: The specific answer text relevant to the question and context.
The dataset is provided in multiple formats, including a line-by-line export of CORD-19 data in cord19.txt, a CSV file (cord19-qa.csv) containing question, context, answer combinations, and a JSON file (cord19-qa.json) formatted for SQuAD 2.0. Specific numbers for rows or records are not detailed in the available information. The current version of this dataset is 1.0 and it is listed as globally available.
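A minimal sketch for working with the CSV export is shown below; the column names follow the description above, and the relative file path is an assumption:

```python
# Sketch: load cord19-qa.csv and peek at one question/context/answer triple.
# Column names follow the dataset description; the file path is assumed.
import pandas as pd

qa = pd.read_csv("cord19-qa.csv")
row = qa.iloc[0]
print(row["category"])
print(row["question"])
print(str(row["context"])[:200], "...")   # truncate long passages for display
print(row["answer"])
```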
This dataset is ideally suited for building and fine-tuning transformer models for language modelling and SQuAD 2.0 tasks. It serves as a foundational resource for developing advanced question-answering systems in the medical and healthcare domain, particularly those focused on coronavirus-related information. It is highly valuable for researchers and developers working on AI and large language model (LLM) applications.
The dataset's focus is global in its applicability, concentrating on high-quality articles with detected study designs related to the CORD-19 research. While a specific time range for the included articles is not provided, the dataset itself was listed on 17th June 2025, indicating its recent availability on the platform. The primary scope is medical and healthcare information, specifically concerning the coronavirus.
CC BY-SA
This dataset is intended for a broad range of users, including:
* AI/ML Engineers and Data Scientists: For training and evaluating question-answering models and other NLP tasks.
* Healthcare Researchers: To develop tools for quickly extracting information from a vast corpus of medical literature.
* Academic Institutions: For research and educational purposes in the fields of AI, NLP, and medical informatics.
* Start-ups and Enterprises: Developing innovative health information systems or AI-powered medical assistants.
Original Data Source: CORD-19 QA
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset was created by Gabriel Moraga
Released under Attribution 4.0 International (CC BY 4.0)