72 datasets found
  1. Clinical Trials

    • kaggle.com
    Updated Nov 30, 2024
    Cite
    Gabriel Moraga (2024). Clinical Trials [Dataset]. https://www.kaggle.com/datasets/gabrielmoraga/clinical-trials/suggestions
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Nov 30, 2024
    Dataset provided by
    Kaggle
    Authors
    Gabriel Moraga
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset

    This dataset was created by Gabriel Moraga

    Released under Attribution 4.0 International (CC BY 4.0)

    Contents

  2. TREC 2022 Clinical Trials Dataset

    • catalog.data.gov
    • s.cnmilf.com
    Updated Sep 11, 2024
    Cite
    National Institute of Standards and Technology (2024). TREC 2022 Clinical Trials Dataset [Dataset]. https://catalog.data.gov/dataset/trec-2022-clinical-trials-dataset
    Explore at:
    Dataset updated
    Sep 11, 2024
    Dataset provided by
    National Institute of Standards and Technology (http://www.nist.gov/)
    Description

    The goal of the Clinical Trials track is to focus research on the clinical trials matching problem: given a free text summary of a patient health record, find suitable clinical trials for that patient.

  3. Clinical Trials

    • kaggle.com
    Updated Nov 25, 2022
    Cite
    The Devastator (2022). Clinical Trials [Dataset]. https://www.kaggle.com/datasets/thedevastator/a-quick-overview-of-clinical-trials/discussion
    Explore at:
    Dataset updated
    Nov 25, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    The Devastator
    Description

    Clinical Trials

    Clinical trials over the years along with start / end dates, outcome and more.

    By Aero Data Lab [source]

    About this dataset

    This dataset contains information on clinical trials conducted by sponsors. Each row represents a clinical trial, and the columns represent various attributes of the trial, such as the National Clinical Trial Number, the sponsor of the trial, the title of the trial, and so on.

    The purpose of this dataset is to provide a bird's-eye view of the clinical trial landscape. By understanding which sponsors are conducting which trials and for what conditions, we can get a better sense of where research is headed and what new treatments may be on the horizon.

    How to use the dataset

    • NCT is a unique identifier for clinical trials. It stands for National Clinical Trial Number.
    • Sponsor is the organization that is funding the clinical trial.
    • Title is the name of the clinical trial.
    • Summary is a brief summary of the clinical trial.
    • Start Year is the year that the clinical trial started.
    • Start Month is the month that the clinical trial started.
    • Phase is the stage of development of the investigational drug or device, which can be one of four phases: I, II, III, or IV.
    • Enrollment is the number of participants in the clinical trial.
    • Status is the status of enrollment in the study, which can be "Recruiting", "Not yet recruiting", "Active, not recruiting", "Completed", "Suspended", or "Terminated".
    • Condition indicates which medical condition(s) are being studied in this particular NCT record.

    Research Ideas

    • Identify patterns in clinical trials to improve the development process
    • Understand how different sponsors fund clinical trials

    Acknowledgements

    By Aero Data Lab [source]

    License

    License: Dataset copyright by authors

    You are free to:
    • Share: copy and redistribute the material in any medium or format for any purpose, even commercially.
    • Adapt: remix, transform, and build upon the material for any purpose, even commercially.

    You must:
    • Give appropriate credit: provide a link to the license, and indicate if changes were made.
    • ShareAlike: you must distribute your contributions under the same license as the original.
    • Keep intact: all notices that refer to this license, including copyright notices.

    Columns

    File: AERO-BirdsEye-Data.csv

    | Column name | Description                                                      |
    |:------------|:-----------------------------------------------------------------|
    | NCT         | National Clinical Trial number. (String)                         |
    | Sponsor     | Name of the sponsor conducting the clinical trial. (String)      |
    | Title       | Title of the clinical trial. (String)                            |
    | Summary     | Brief summary of the clinical trial. (String)                    |
    | Start_Year  | Year the clinical trial started. (Integer)                       |
    | Start_Month | Month the clinical trial started. (String)                       |
    | Phase       | Phase of the clinical trial. (String)                            |
    | Enrollment  | Number of participants enrolled in the clinical trial. (Integer) |
    | Status      | Status of the clinical trial. (String)                           |
    | Condition   | Condition being tested in the clinical trial. (String)           |
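    The column listing for AERO-BirdsEye-Data.csv maps naturally onto a typed record. The following is a minimal sketch, assuming the file has a header row with exactly these column names and that the two Integer columns are always populated; the `Trial` class, the `parse_trials` helper, and the sample row are hypothetical and invented for illustration, not taken from the dataset.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Trial:
    """One row of the file, with the types given in the column listing."""
    nct: str
    sponsor: str
    title: str
    summary: str
    start_year: int
    start_month: str
    phase: str
    enrollment: int
    status: str
    condition: str


def parse_trials(handle):
    """Yield one Trial per CSV row, converting the two Integer columns."""
    for row in csv.DictReader(handle):
        yield Trial(
            nct=row["NCT"],
            sponsor=row["Sponsor"],
            title=row["Title"],
            summary=row["Summary"],
            start_year=int(row["Start_Year"]),
            start_month=row["Start_Month"],
            phase=row["Phase"],
            enrollment=int(row["Enrollment"]),
            status=row["Status"],
            condition=row["Condition"],
        )


# Hypothetical sample row, invented for illustration only.
SAMPLE = io.StringIO(
    "NCT,Sponsor,Title,Summary,Start_Year,Start_Month,Phase,Enrollment,Status,Condition\n"
    "NCT00000000,Example Sponsor,Example title,Example summary,"
    "2015,June,Phase 2,120,Completed,Example condition\n"
)

trials = list(parse_trials(SAMPLE))
```

    With the real file, `parse_trials(open("AERO-BirdsEye-Data.csv", newline=""))` would be the equivalent call. Rows with a missing year or enrollment would raise `ValueError` in `int()`, so real data may need a more forgiving conversion.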

    Acknowledgements

    If you use this dataset in your research, please credit Aero Data Lab [source]

  4. Clinical_trials

    • kaggle.com
    Updated Oct 23, 2021
    Cite
    DEBRAH DOMINIC ATUAHENE (2021). Clinical_trials [Dataset]. https://www.kaggle.com/dominique2001gh/clinical-trials/activity
    Explore at:
    Dataset updated
    Oct 23, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    DEBRAH DOMINIC ATUAHENE
    Description

    Dataset

    This dataset was created by DEBRAH DOMINIC ATUAHENE

    Contents

  5. Health Care Analytics

    • kaggle.com
    Updated Jan 10, 2022
    Cite
    Abishek Sudarshan (2022). Health Care Analytics [Dataset]. https://www.kaggle.com/datasets/abisheksudarshan/health-care-analytics
    Explore at:
    Dataset updated
    Jan 10, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Abishek Sudarshan
    Description

    Context

    Part of Janatahack Hackathon in Analytics Vidhya

    Content

    The healthcare sector has long been an early adopter of and benefited greatly from technological advances. These days, machine learning plays a key role in many health-related realms, including the development of new medical procedures, the handling of patient data, health camps and records, and the treatment of chronic diseases.

    MedCamp organizes health camps in several cities with low work-life balance. They reach out to working people and ask them to register for these health camps. For those who attend, MedCamp provides the facility to undergo health checks or increase awareness by visiting various stalls (depending on the format of the camp).

    MedCamp has conducted 65 such events over a period of 4 years, and they see a high drop-off between “Registration” and the number of people taking tests at the camps. In the last 4 years, they have stored data on the ~110,000 registrations they have done.

    One of the huge costs in arranging these camps is the amount of inventory you need to carry. If you carry more inventory than required, you incur unnecessarily high costs. On the other hand, if you carry less inventory than required for conducting these medical checks, people end up having a bad experience.

    The Process:

    MedCamp employees / volunteers reach out to people and drive registrations.
    During the camp, people who “ShowUp” either undergo the medical tests or visit stalls, depending on the format of the health camp.
    

    Other things to note:

    Since this is a completely voluntary activity for the working professionals, MedCamp usually has little profile information about these people.
    For a few camps, there was hardware failure, so some information about date and time of registration is lost.
    MedCamp runs 3 formats of these camps. The first and second formats provide people with an instantaneous health score. The third format provides information about several health issues through various awareness stalls.
    

    Favorable outcome:

    For the first 2 formats, a favourable outcome is defined as getting a health_score, while in the third format it is defined as visiting at least one stall.
    You need to predict the chances (probability) of having a favourable outcome.
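
    The favourable-outcome rule above can be written as a small labelling function. This is only a sketch under stated assumptions: the parameter names `camp_format`, `health_score`, and `stalls_visited` are hypothetical, as is the convention that a missing health score is represented by `None`; the actual column names in the competition files may differ.

```python
from typing import Optional


def favourable_outcome(
    camp_format: int,
    health_score: Optional[float],
    stalls_visited: int,
) -> bool:
    """Return True when an attendee counts as a favourable outcome.

    Formats 1 and 2: the attendee received a health score.
    Format 3: the attendee visited at least one stall.
    All parameter names are hypothetical, chosen for illustration.
    """
    if camp_format in (1, 2):
        return health_score is not None
    if camp_format == 3:
        return stalls_visited >= 1
    raise ValueError(f"unknown camp format: {camp_format}")
```

    A model for this task would then predict the probability that this label is true for each registration.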
    

    Train / Test split:

    Camps started on or before 31st March 2006 are considered in Train
    Test data is for all camps conducted on or after 1st April 2006.
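
    The split described above is purely date-based, so it can be sketched in a few lines. The cutoff date comes from the description; the record layout (a dict with `camp_id` and `start` keys) and the sample camps are hypothetical and exist only to make the example runnable.

```python
from datetime import date

# Cutoff taken from the description: camps started on or before
# 31 March 2006 go to train, later camps go to test.
TRAIN_CUTOFF = date(2006, 3, 31)


def split_camps(camps):
    """Partition camp records into (train, test) by their start date."""
    train = [c for c in camps if c["start"] <= TRAIN_CUTOFF]
    test = [c for c in camps if c["start"] > TRAIN_CUTOFF]
    return train, test


# Hypothetical camps, invented for illustration only.
camps = [
    {"camp_id": "A", "start": date(2005, 11, 2)},
    {"camp_id": "B", "start": date(2006, 3, 31)},
    {"camp_id": "C", "start": date(2006, 4, 1)},
]

train, test = split_camps(camps)
```

    Splitting on time rather than at random keeps later camps out of training, which matches how the predictions would be used on future camps.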
    

    Acknowledgements

    Credits to AV

    Inspiration

    To share with the data science community to jump start their journey in Healthcare Analytics

  6. UCI ML Drug Review dataset

    • kaggle.com
    Updated Dec 13, 2018
    Cite
    Jessica Li (2018). UCI ML Drug Review dataset [Dataset]. https://www.kaggle.com/jessicali9530/kuc-hackathon-winter-2018/home
    Explore at:
    Dataset updated
    Dec 13, 2018
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Jessica Li
    Description

    This dataset was used for the Winter 2018 Kaggle University Club Hackathon and is now publicly available. See the Acknowledgments section for citation and licensing. Note: The types of data and recommendation-based solutions provided by the contestants are purely for NLP learning purposes. They are not suitable for real-world drug recommendation solutions.

    Welcome to the Kaggle University Club Hackathon!

    If you are interested in joining Kaggle University Club, please e-mail Jessica Li at lijessica@google.com

    This Hackathon is open to all undergraduate, master, and PhD students who are part of the Kaggle University Club program. The Hackathon provides students with a chance to build capacity via hands-on ML, learn from one another, and engage in a self-defined project that is meaningful to their careers.

    Teams must register via Google Form to be eligible for the Hackathon. The Hackathon starts on Monday, November 12, 2018 and ends on Monday, December 10, 2018. Teams have one month to work on a team submission. Teams must do all work within the Kernel editor and set Kernel(s) to public at all times.

    Prompt

    The freestyle format of hackathons has time and again stimulated groundbreaking and innovative data insights and technologies. The Kaggle University Club Hackathon recreates this environment virtually on our platform. We challenge you to build a meaningful project around the UCI Machine Learning - Drug Review Dataset. Teams are free to let their creativity run and propose methods to analyze this dataset and form interesting machine learning models.

    Machine learning has permeated nearly all fields and disciplines of study. One hot topic is using natural language processing and sentiment analysis to identify, extract, and make use of subjective information. The UCI ML Drug Review dataset provides patient reviews on specific drugs along with related conditions and a 10-star patient rating system reflecting overall patient satisfaction. The data was obtained by crawling online pharmaceutical review sites. This data was published in a study on sentiment analysis of drug experience over multiple facets, e.g., sentiments learned on specific aspects such as effectiveness and side effects (see the acknowledgments section to learn more).

    The sky's the limit here in terms of what your team can do! Teams are free to add supplementary datasets in conjunction with the drug review dataset in their Kernel. Discussion is highly encouraged within the forum and Slack so everyone can learn from their peers.

    Here are just a couple of ideas as to what you could do with the data:

    • Classification: Can you predict the patient's condition based on the review?
    • Regression: Can you predict the rating of the drug based on the review?
    • Sentiment analysis: What elements of a review make it more helpful to others? Which patients tend to have more negative reviews? Can you determine if a review is positive, neutral, or negative?
    • Data visualizations: What kind of drugs are there? What sorts of conditions do these patients have?
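
    For the sentiment question in the list above, one common starting point is to derive a coarse label from the star rating and use it as a training target. The sketch below assumes ratings are integers from 1 to 10; the cut points (4 and below negative, 5 to 6 neutral, 7 and above positive) are an arbitrary illustrative choice, not something the dataset or the hackathon specifies.

```python
def rating_to_sentiment(rating: int) -> str:
    """Map a 1-10 star rating to a coarse sentiment label.

    The thresholds are hypothetical and should be tuned or
    replaced for any real analysis.
    """
    if not 1 <= rating <= 10:
        raise ValueError(f"rating out of range: {rating}")
    if rating <= 4:
        return "negative"
    if rating <= 6:
        return "neutral"
    return "positive"


labels = [rating_to_sentiment(r) for r in (1, 5, 9)]
```

    Labels produced this way could serve as targets for a text classifier trained on the review text.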

    Top Submissions

    There is no one correct answer to this Hackathon, and teams are free to define the direction of their own project. That being said, there are certain core elements generally found across all outstanding Kernels on the Kaggle platform. The best Kernels are:

    1. Complex: How many domains of analysis and topics does this Kernel cover? Does it attempt machine learning methods? Does the Kernel offer a variety of unique analyses and interesting conclusions or solutions?
    2. Original: What is the subject matter of this Kernel? Does it have a well-defined and interesting project scope, narrative or problem? Could the results make an impact? Is it thought provoking?
    3. Approachable: How easy is it to understand this Kernel? Are all thought processes clear? Is the code clean, with useful comments? Are visualizations and processes articulated and self-explanatory?

    Teams with top submissions have a chance to receive exclusive Kaggle University Club swag and be featured on our official blog and across social media.

    IMPORTANT: Teams must set all Kernels to public at all times. This is so we can track each team's progression, but more importantly it encourages collaboration, productive discussion, and healthy inspiration to all teams. It is not so that teams can simply copycat good ideas. If a team's Kernel isn't their own organic work, it will not be considered a top submission. Teams must come up with a project on their own.

    Submission Styling

    The final Kernel submission for the Hackathon must contain the following information:

    • All team members added as collaborators to the Kernel
    • Somewhere at the top of your Kernel, find a space to put down all team member names, university name, club name, and team name (as specified whe...
  7. Medical_cost_dataset

    • kaggle.com
    Updated Aug 19, 2023
    Cite
    Nandita Pore (2023). Medical_cost_dataset [Dataset]. https://www.kaggle.com/datasets/nanditapore/medical-cost-dataset
    Explore at:
    Dataset updated
    Aug 19, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Nandita Pore
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Description:

    Explore the intricacies of medical costs and healthcare expenses with our meticulously curated Medical Cost Dataset. This dataset offers valuable insights into the factors influencing medical charges, enabling researchers, analysts, and healthcare professionals to gain a deeper understanding of the dynamics within the healthcare industry.

    Columns:

    1. ID: A unique identifier assigned to each individual record, facilitating efficient data management and analysis.
    2. Age: The age of the patient, providing a crucial demographic factor that often correlates with medical expenses.
    3. Sex: The gender of the patient, offering insights into potential cost variations based on biological differences.
    4. BMI: The Body Mass Index (BMI) of the patient, indicating the relative weight status and its potential impact on healthcare costs.
    5. Children: The number of children or dependents covered under the medical insurance, influencing family-related medical expenses.
    6. Smoker: A binary indicator of whether the patient is a smoker or not, as smoking habits can significantly impact healthcare costs.
    7. Region: The geographic region of the patient, helping to understand regional disparities in healthcare expenditure.
    8. Charges: The medical charges incurred by the patient, serving as the target variable for analysis and predictions.

    Whether you're aiming to uncover patterns in medical billing, predict future healthcare costs, or explore the relationships between different variables and charges, our Medical Cost Dataset provides a robust foundation for your research. Researchers can utilize this dataset to develop data-driven models that enhance the efficiency of healthcare resource allocation, insurers can refine pricing strategies, and policymakers can make informed decisions to improve the overall healthcare system.

    Unlock the potential of healthcare data with our comprehensive Medical Cost Dataset. Gain insights, make informed decisions, and contribute to the advancement of healthcare economics and policy. Start your analysis today and pave the way for a healthier future.

  8. Summary of the internal and external validation datasets used to evaluate...

    • plos.figshare.com
    xls
    Updated May 14, 2025
    Cite
    Summary of the internal and external validation datasets used to evaluate the models. “Origin” refers to the country where the data was collected. “Lesion” refers to the number of images in the dataset with lesion annotations. The Kaggle dataset (first row, shaded in gray) is the internal dataset used to evaluate the model, while the other datasets were used for external validation to assess the generalization properties of the trained model. [Dataset]. https://plos.figshare.com/articles/dataset/Summary_of_the_internal_and_external_validation_datasets_used_to_evaluate_the_models_Origin_refers_to_the_country_where_the_data_was_collected_Lesion_refers_to_the_number_of_images_in_the_dataset_with_lesion_annotations_The_Kaggle_dataset_f/29048391
    Explore at:
    Available download formats: xls
    Dataset updated
    May 14, 2025
    Dataset provided by
    PLOS Digital Health
    Authors
    Kerol Djoumessi; Ziwei Huang; Laura Kühlewein; Annekatrin Rickmann; Natalia Simon; Lisa M. Koch; Philipp Berens
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Summary of the internal and external validation datasets used to evaluate the models. “Origin” refers to the country where the data was collected. “Lesion” refers to the number of images in the dataset with lesion annotations. The Kaggle dataset (first row, shaded in gray) is the internal dataset used to evaluate the model, while the other datasets were used for external validation to assess the generalization properties of the trained model.

  9. Data from: RxNorm

    • bioregistry.io
    • kaggle.com
    Updated May 10, 2021
    Cite
    (2021). RxNorm [Dataset]. http://identifiers.org/biolink:RXNORM
    Explore at:
    Dataset updated
    May 10, 2021
    License

    https://bioregistry.io/spdx:https://uts.nlm.nih.gov/uts/assets/LicenseAgreement.pdf

    Description

    RxNorm provides normalized names for clinical drugs and links its names to many of the drug vocabularies commonly used in pharmacy management and drug interaction software, including those of First Databank, Micromedex, and Gold Standard Drug Database. By providing links between these vocabularies, RxNorm can mediate messages between systems not using the same software and vocabulary.

  10. ‘Accidental Drug Related Deaths in Connecticut’ analyzed by Analyst-2

    • analyst-2.ai
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com), ‘Accidental Drug Related Deaths in Connecticut’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-accidental-drug-related-deaths-in-connecticut-4048/1de1c015/?iid=004-085&v=presentation
    Explore at:
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Connecticut
    Description

    Analysis of ‘Accidental Drug Related Deaths in Connecticut’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/yamqwe/accidental-drug-related-deaths-in-connecticute on 13 February 2022.

    --- Dataset description provided by original source is as follows ---

    About this dataset

    This dataset contains the list of each accidental death associated with a drug overdose in the state of Connecticut from 2012 to 2018. Deaths are grouped by age, race, ethnicity, and gender and by the types of drugs detected post-death.

    COMMERCIAL LICENSE

    For subscribing to a commercial license for John Snow Labs Data Library which includes all datasets curated and maintained by John Snow Labs please visit https://www.johnsnowlabs.com/marketplace.

    This dataset was created by John and contains around 0 samples along with County, City, technical information and other features such as: - Is Hydrocodone - Age - and more.

    How to use this dataset

    • Analyze Race in relation to Is Methadone
    • Study the influence of Is Heroin on Is Oxymorphone
    • More datasets

    Acknowledgements

    If you use this dataset in your research, please credit John

    Start A New Notebook!

    --- Original source retains full ownership of the source dataset ---

  11. Novel COVID-19 Chestxray Repository Dataset

    • paperswithcode.com
    Updated Sep 8, 2021
    Cite
    Pratik Bhowal; Subhankar Sen; Jin Hee Yoon; Zong Woo Geem; Ram Sarkar (2021). Novel COVID-19 Chestxray Repository Dataset [Dataset]. https://paperswithcode.com/dataset/novel-covid-19-chestxray-repository
    Explore at:
    Dataset updated
    Sep 8, 2021
    Authors
    Pratik Bhowal; Subhankar Sen; Jin Hee Yoon; Zong Woo Geem; Ram Sarkar
    Description

    Authors of the Dataset:

    • Pratik Bhowal (B.E., Dept. of Electronics and Instrumentation Engineering, Jadavpur University, Kolkata, India) [LinkedIn], [Github]
    • Subhankar Sen (B.Tech, Dept. of Computer Science Engineering, Manipal University Jaipur, India) [LinkedIn], [Github], [Google Scholar]
    • Jin Hee Yoon (faculty of the Dept. of Mathematics and Statistics at Sejong University, Seoul, South Korea) [LinkedIn], [Google Scholar]
    • Zong Woo Geem (faculty of the College of IT Convergence at Gachon University, South Korea) [LinkedIn], [Google Scholar]
    • Ram Sarkar (Professor at the Dept. of Computer Science Engineering, Jadavpur University, Kolkata, India) [LinkedIn], [Google Scholar]

    Overview

    The authors have created a new dataset known as the Novel COVID-19 Chestxray Repository by fusing publicly available chest X-ray image repositories. In creating this combined dataset, three different datasets obtained from the Github and Kaggle databases, created by the authors of other research studies in this field, were utilized. In our study, frontal and lateral chest X-ray images are used since this view of radiography is widely used by radiologists in clinical diagnosis. In the following section, the authors summarize how this dataset was created.

    COVID-19 Radiography Database: The first release of this dataset reports 219 COVID-19, 1345 viral pneumonia, and 1341 normal radiographic chest X-ray images. This dataset was created by a team of researchers from Qatar University, Doha, Qatar, and the University of Dhaka, Bangladesh, in collaboration with medical doctors and specialists from Pakistan and Malaysia. This database is regularly updated with the emergence of new cases of COVID-19 patients worldwide. Related Paper: https://arxiv.org/abs/2003.13145

    COVID-Chestxray set: Joseph Paul Cohen, Paul Morrison, and Lan Dao have created a public image repository on Github which consists of both CT scans and digital chest X-rays. The data was collected mainly from retrospective cohorts of pediatric patients from Guangzhou Women and Children’s medical center. With the aid of metadata information provided along with the dataset, we were able to extract 521 COVID-19 positive images; 239 viral and bacterial pneumonia images, which are of the following three broad categories: Middle East Respiratory Syndrome (MERS), Severe Acute Respiratory Syndrome (SARS), and Acute Respiratory Distress Syndrome (ARDS); and 218 normal radiographic chest X-ray images of varying image resolutions. Related Paper: https://arxiv.org/abs/2006.11988

    Actualmed COVID chestxray dataset: Actualmed-COVID-chestxray-dataset comprises 12 COVID-19 positive and 80 normal radiographic chest X-ray images.

    The combined dataset includes chest X-ray images of the COVID-19, Pneumonia, and Normal (healthy) classes, with a total of 752, 1584, and 1639 images respectively. Information about the Novel COVID-19 Chestxray Database and its parent image repositories is provided in Table 1.

    Table 1: Dataset Description

    | Dataset                                       | COVID-19 | Pneumonia | Normal |
    | --------------------------------------------- | -------- | --------- | ------ |
    | COVID Chestxray set                           | 521      | 239       | 218    |
    | COVID-19 Radiography Database (first release) | 219      | 1345      | 1341   |
    | Actualmed COVID chestxray dataset             | 12       | 0         | 80     |
    | Total                                         | 752      | 1584      | 1639   |
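
    The totals in Table 1 can be checked by summing each class column over the three parent repositories. The short sketch below does that with the counts copied from the table; the variable names are illustrative only.

```python
# Per-repository image counts from Table 1, as (COVID-19, Pneumonia, Normal).
COUNTS = {
    "COVID Chestxray set": (521, 239, 218),
    "COVID-19 Radiography Database (first release)": (219, 1345, 1341),
    "Actualmed COVID chestxray dataset": (12, 0, 80),
}

# Sum each class column across the three parent repositories.
totals = tuple(sum(column) for column in zip(*COUNTS.values()))
covid, pneumonia, normal = totals

# Size of the combined dataset across all three classes.
combined = sum(totals)
```

    The column sums come out to 752, 1584, and 1639, matching the Total row, for 3975 images overall.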

    DATA ACCESS AND USE: Academic/Non-Commercial Use Dataset License : Database: Open Database, Contents: Database Contents

  12. Multilingual Medical Text Dataset

    • opendatabay.com
    Updated Jul 6, 2025
    Cite
    Datasimple (2025). Multilingual Medical Text Dataset [Dataset]. https://www.opendatabay.com/data/ai-ml/64f1a101-d243-4290-a4fc-af738f8ba252
    Explore at:
    Dataset updated
    Jul 6, 2025
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Healthcare Providers & Services Utilization
    Description

    This dataset provides a curated collection of accurate and reliable medical translation data. It is an invaluable resource designed for medical professionals, researchers, and language experts. The data encompasses a wide array of medical topics, including diagnoses, treatment plans, clinical research findings, and pharmaceutical information [1]. It supports various languages spoken across the globe, facilitating cross-cultural comparisons and analysis. Each translation has been meticulously crafted by professional translators with specialist knowledge in the medical domain to ensure authenticity and fidelity to the original source text [1]. This dataset aims to improve understanding and communication within the healthcare sector globally, enhancing accessibility to vital medical information regardless of language barriers and ensuring precision in patient care [1].

    Columns

    • translation: Contains the original text in a specific language that requires translation [1].
    • translation: Contains the translated text in another language [1].
      • Note: The dataset contains 13,149 unique entries across these translation columns [2].

    Distribution

    The data is provided in a CSV file format (specifically, train.csv) [1]. The dataset contains 13,149 records [2].

    Usage

    This dataset offers various ideal applications and use cases:

    • Natural Language Processing (NLP) Research: Suitable for training and evaluating NLP models specifically for medical translation tasks, aiding in the development of new algorithms and techniques to enhance accuracy and efficiency [1].
    • Machine Learning in Healthcare: Can be used to train machine learning algorithms for automatic translation of medical documents or text, thereby speeding up processes and providing healthcare professionals with timely access to essential information [1].
    • Development of Medical Translation Applications: Its accurate translations are beneficial for creating mobile or web-based applications that offer instant translation services for healthcare providers, patients, or anyone seeking reliable medical content translations [1].
    • Enhanced Global Communication: Supports improved communication with patients who speak different languages and facilitates the accurate transfer of vital medical information across borders [1].

    Coverage

    The dataset covers various languages spoken worldwide, enabling cross-cultural analysis and supporting global healthcare communications among diverse populations [1]. The region of coverage is Global [3].

    License

    CC0

    Who Can Use It

    • Medical Professionals: To enhance communication with patients speaking different languages or facilitate transfer of medical information [1].
    • Researchers: For training machine learning models to automate medical translation or conducting linguistic analyses [1].
    • Language Experts: As a reliable source of accurate medical translations [1].
    • Healthcare Providers: To improve patient care and understanding [1].
    • Individuals: Seeking accurate and reliable translations of medical content [1].

    Dataset Name Suggestions

    • Global Medical Translations
    • Accurate Healthcare Language Data
    • Clinical Translation Corpus
    • Multilingual Medical Text Dataset
    • Healthcare Communication Translations

    Attributes

    Original Data Source: Accurate Medical Translation Data

  13. Summary of the classification performance with confidence intervals (CIs)...

    • plos.figshare.com
    xls
    Updated May 14, 2025
    Cite
    Kerol Djoumessi; Ziwei Huang; Laura Kühlewein; Annekatrin Rickmann; Natalia Simon; Lisa M. Koch; Philipp Berens (2025). Summary of the classification performance with confidence intervals (CIs) computed at 95% using bootstrapping (n=1000). “AUC” refer to the receiver-operating curve. “Loc Bag” and “Loc GBP” respectively refer to the localization precision of the sparse BagNet and Guided Backpropagation on ResNet-50 at localizing lesions from annotated images. For each dataset, the first row shows the performance of the interpretable sparse BagNet model, while the second row shows the performance of the baseline black-box ResNet-50 model. The Kaggle dataset (first row) is the internal dataset used to train and evaluate the model, while the other datasets were used for external validation to assess the generalization properties of the trained model. The low classification performance on the FCM-UNA and FGA-DR datasets can be explained by the relatively low quality of most images in the FCM-UNA dataset and the large intensity variation of the FGA-DR dataset (S [Dataset]. http://doi.org/10.1371/journal.pdig.0000831.t002
    Explore at:
    xls (available download formats)
    Dataset updated
    May 14, 2025
    Dataset provided by
    PLOS Digital Health
    Authors
    Kerol Djoumessi; Ziwei Huang; Laura Kühlewein; Annekatrin Rickmann; Natalia Simon; Lisa M. Koch; Philipp Berens
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Summary of the classification performance with confidence intervals (CIs) computed at 95% using bootstrapping (n=1000). “AUC” refers to the area under the receiver-operating curve. “Loc Bag” and “Loc GBP” refer, respectively, to the localization precision of the sparse BagNet and of Guided Backpropagation on ResNet-50 when localizing lesions in annotated images. For each dataset, the first row shows the performance of the interpretable sparse BagNet model, while the second row shows the performance of the baseline black-box ResNet-50 model. The Kaggle dataset (first row) is the internal dataset used to train and evaluate the model, while the other datasets were used for external validation to assess the generalization properties of the trained model. The low classification performance on the FCM-UNA and FGA-DR datasets can be explained by the relatively low quality of most images in the FCM-UNA dataset and the large intensity variation of the FGA-DR dataset (S
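    The description mentions 95% confidence intervals obtained by bootstrapping with n=1000. As a rough illustration of that idea only, the sketch below computes a percentile bootstrap interval for AUC using the standard library. The percentile method, the function names `auc` and `bootstrap_ci`, and the toy labels and scores are all assumptions made for this example; the listing does not say which bootstrap variant the authors used.

```python
import random


def auc(labels, scores):
    """Area under the ROC curve, computed as the fraction of
    (positive, negative) pairs ranked correctly (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required to compute AUC")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for AUC (assumed variant)."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # skip resamples that contain only one class
        stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Toy data, invented for illustration only.
labels = [0, 0, 1, 1, 0, 1, 1, 0]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.5]
point = auc(labels, scores)
ci_low, ci_high = bootstrap_ci(labels, scores)
```

    Resampling cases with replacement and reading off the 2.5th and 97.5th percentiles is the simplest bootstrap interval; other variants (for example bias-corrected intervals) would give somewhat different bounds.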

  14. MIMIC-III Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Feb 9, 2021
    Cite
    Alistair E.W. Johnson; Tom J. Pollard; Lu Shen; Li-wei H. Lehman; Mengling Feng; Mohammad Ghassemi; Benjamin Moody; Peter Szolovits; Leo Anthony Celi; Roger G. Mark (2023). MIMIC-III Dataset [Dataset]. https://paperswithcode.com/dataset/mimic-iii
    Explore at:
    Dataset updated
    Feb 9, 2021
    Authors
    Alistair E.W. Johnson; Tom J. Pollard; Lu Shen; Li-wei H. Lehman; Mengling Feng; Mohammad Ghassemi; Benjamin Moody; Peter Szolovits; Leo Anthony Celi; Roger G. Mark
    Description

    The Medical Information Mart for Intensive Care III (MIMIC-III) dataset is a large, de-identified, and publicly available collection of medical records. Each record in the dataset includes ICD-9 codes, which identify diagnoses and procedures performed. Each code is partitioned into sub-codes, which often include specific circumstantial details. The dataset consists of 112,000 clinical report records (average length 709.3 tokens) and 1,159 top-level ICD-9 codes. Each report is assigned 7.6 codes on average. Data includes vital signs, medications, laboratory measurements, observations and notes charted by care providers, fluid balance, procedure codes, diagnostic codes, imaging reports, hospital length of stay, survival data, and more.

    The database supports applications including academic and industrial research, quality improvement initiatives, and higher education coursework.

  15. Cardiovascular Disease Dataset

    • ieee-dataport.org
    Updated Oct 25, 2022
    Cite
    Rajib Kumar Halder Halder (2022). Cardiovascular Disease Dataset [Dataset]. https://ieee-dataport.org/documents/cardiovascular-disease-dataset
    Explore at:
    Dataset updated
    Oct 25, 2022
    Authors
    Rajib Kumar Halder Halder
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This heart disease dataset is curated by combining 3 popular heart disease datasets. The first dataset (collected from Kaggle) contains 70,000 records with 11 independent features, which makes it the largest heart disease dataset available so far for research purposes. These data were collected at the moment of medical examination and from information given by the patient. The second and third datasets contain 303 and 293 instances respectively, with 13 common features. The three datasets used for its curation are: Cardio Data (Kaggle Dataset)

  16. AI medical chatbot

    • kaggle.com
    Updated Aug 15, 2024
    Cite
    Yousef Saeedian (2024). AI medical chatbot [Dataset]. https://www.kaggle.com/datasets/yousefsaeedian/ai-medical-chatbot
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Aug 15, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Yousef Saeedian
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Description:

    This dataset comprises transcriptions of conversations between doctors and patients, providing valuable insights into the dynamics of medical consultations. It includes a wide range of interactions, covering various medical conditions, patient concerns, and treatment discussions. The data is structured to capture both the questions and concerns raised by patients, as well as the medical advice, diagnoses, and explanations provided by doctors.

    Key Features:

    • Doctor and Patient Roles: Each conversation is annotated with the role of the speaker (doctor or patient), making it easy to analyze communication patterns.
    • Medical Context: The dataset includes diverse scenarios, from routine check-ups to more complex medical discussions, offering a broad spectrum of healthcare dialogues.
    • Natural Language: The conversations are presented in natural language, allowing for the development and testing of NLP models focused on healthcare communication.
    • Applications: This dataset can be used for various applications, such as building dialogue systems, analyzing communication efficacy, developing medical NLP models, and enhancing patient care through better understanding of doctor-patient interactions.

    Potential Use Cases:

    • NLP Model Training: Train models to understand and generate medical dialogues.
    • Healthcare Communication Studies: Analyze communication strategies between doctors and patients to improve healthcare delivery.
    • Medical Chatbots: Develop intelligent medical chatbots that can simulate doctor-patient conversations.
    • Patient Experience Enhancement: Identify common patient concerns and doctor responses to enhance patient care strategies.

    This dataset is a valuable resource for researchers, data scientists, and healthcare professionals interested in the intersection of technology and medicine, aiming to improve healthcare communication through data-driven approaches.

  17. MIMIC-IV Dataset

    • paperswithcode.com
    • physionet.org
    + more versions
    Cite
    MIMIC-IV Dataset [Dataset]. https://paperswithcode.com/dataset/mimic-iv
    Explore at:
    Description

    Retrospectively collected medical data has the opportunity to improve patient care through knowledge discovery and algorithm development. Broad reuse of medical data is desirable for the greatest public good, but data sharing must be done in a manner which protects patient privacy.

    The Medical Information Mart for Intensive Care (MIMIC)-III database provided critical care data for over 40,000 patients admitted to intensive care units at the Beth Israel Deaconess Medical Center (BIDMC). Importantly, MIMIC-III was deidentified, and patient identifiers were removed according to the Health Insurance Portability and Accountability Act (HIPAA) Safe Harbor provision. MIMIC-III has been integral in driving large amounts of research in clinical informatics, epidemiology, and machine learning. Here we present MIMIC-IV, an update to MIMIC-III, which incorporates contemporary data and improves on numerous aspects of MIMIC-III. MIMIC-IV adopts a modular approach to data organization, highlighting data provenance and facilitating both individual and combined use of disparate data sources. MIMIC-IV is intended to carry on the success of MIMIC-III and support a broad set of applications within healthcare.

  18. Data from: ADNI Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Jun 29, 2021
    + more versions
    Cite
    (2021). ADNI Dataset [Dataset]. https://paperswithcode.com/dataset/adni
    Explore at:
    Dataset updated
    Jun 29, 2021
    Description

    The Alzheimer's Disease Neuroimaging Initiative (ADNI) is a multisite study that aims to improve clinical trials for the prevention and treatment of Alzheimer's disease (AD).[1] This cooperative study combines expertise and funding from the private and public sector to study subjects with AD, as well as those who may develop AD and controls with no signs of cognitive impairment.[2] Researchers at 63 sites in the US and Canada track the progression of AD in the human brain with neuroimaging, biochemical, and genetic biological markers.[2][3] This knowledge helps to find better clinical trials for the prevention and treatment of AD. ADNI has made a global impact,[4] firstly by developing a set of standardized protocols to allow the comparison of results from multiple centers,[4] and secondly by its data-sharing policy, which makes all of the data available without embargo to qualified researchers worldwide.[5] To date, over 1000 scientific publications have used ADNI data.[6] A number of other initiatives related to AD and other diseases have been designed and implemented using ADNI as a model.[4] ADNI has been running since 2004 and is currently funded until 2021.[7]

    Source: Wikipedia, https://en.wikipedia.org/wiki/Alzheimer%27s_Disease_Neuroimaging_Initiative

  19. WUSTL_EHMS_2020 Dataset

    • paperswithcode.com
    Updated Apr 15, 2020
    Cite
    (2020). WUSTL_EHMS_2020 Dataset [Dataset]. https://paperswithcode.com/dataset/wustl-ehms-2020
    Explore at:
    Dataset updated
    Apr 15, 2020
    Description

    The WUSTL-EHMS-2020 dataset was created using a real-time Enhanced Healthcare Monitoring System (EHMS) testbed [1]. The testbed collects both network flow metrics and patients' biometrics, because datasets that combine the two are scarce.

  20. Coronavirus Question-Answer Model Data

    • opendatabay.com
    Updated Jul 6, 2025
    + more versions
    Cite
    Datasimple (2025). Coronavirus Question-Answer Model Data [Dataset]. https://www.opendatabay.com/data/ai-ml/971a1982-4b03-4c7b-8572-818df3e58d5a
    Explore at:
    Dataset updated
    Jul 6, 2025
    Dataset authored and provided by
    Datasimple
    Area covered
    Health Information Systems & Technology
    Description

    This dataset is designed to facilitate the creation of question-answer models, specifically tailored for the CORD-19 dataset. It provides a focused collection of high-quality articles, emphasising detected study designs. This makes it an invaluable resource for natural language processing (NLP) research and applications related to the coronavirus, aiding in the development of intelligent health information systems.

    Columns

    The dataset primarily includes files structured for question, context, and answer combinations. For instance, the CSV file contains the following key columns:

    • category: Represents the category of the question.
    • question: The actual text of the question.
    • context: The contextual passage from which the answer can be extracted.
    • answer: The specific answer text relevant to the question and context.
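    To make the column layout concrete, here is a minimal sketch that reads rows with the four columns named above and checks that each answer can be found in its context. The sample rows are invented for illustration and stand in for cord19-qa.csv; the helper names `load_rows` and `is_extractive` are hypothetical, and the real file may contain additional columns.

```python
import csv
import io

EXPECTED_COLUMNS = ["category", "question", "context", "answer"]

# Invented sample rows; the real cord19-qa.csv content is not shown here.
SAMPLE_CSV = """category,question,context,answer
risk,What is the main risk factor?,Age is the main risk factor in this cohort.,Age
design,What is the study design?,This was a retrospective cohort study.,retrospective cohort study
"""


def load_rows(handle):
    """Read rows as dicts and fail early if a key column is missing."""
    reader = csv.DictReader(handle)
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)


def is_extractive(row):
    """True when the answer text appears verbatim in the context."""
    return row["answer"] in row["context"]


rows = load_rows(io.StringIO(SAMPLE_CSV))
extractive_count = sum(is_extractive(r) for r in rows)
```

    With the real file, `load_rows(open("cord19-qa.csv", newline="", encoding="utf-8"))` would be the equivalent call, assuming the file is UTF-8 encoded.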

    Distribution

    The dataset is provided in multiple formats, including a line-by-line export of CORD-19 data in cord19.txt, a CSV file (cord19-qa.csv) containing question, context, answer combinations, and a JSON file (cord19-qa.json) formatted for SQuAD 2.0. Specific numbers for rows or records are not detailed in the available information. The current version of this dataset is 1.0 and it is listed as globally available.
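    For readers unfamiliar with the SQuAD 2.0 layout mentioned above, the following sketch converts question, context, and answer rows into that nested JSON structure. The field names (`data`, `paragraphs`, `qas`, `answers`, `answer_start`, `is_impossible`) follow the public SQuAD 2.0 format; whether cord19-qa.json matches it field for field is an assumption, and the function name `to_squad`, the ids, and the sample rows are made up for this example.

```python
import json


def to_squad(rows, version="v2.0", title="cord19-qa"):
    """Build a SQuAD 2.0 style dict, one paragraph per input row."""
    paragraphs = []
    for i, row in enumerate(rows):
        # Locate the answer span; -1 means it is absent from the context.
        start = row["context"].find(row["answer"]) if row["answer"] else -1
        impossible = start < 0
        answers = [] if impossible else [
            {"text": row["answer"], "answer_start": start}
        ]
        paragraphs.append({
            "context": row["context"],
            "qas": [{
                "id": f"q{i}",  # hypothetical id scheme
                "question": row["question"],
                "answers": answers,
                "is_impossible": impossible,
            }],
        })
    return {"version": version, "data": [{"title": title, "paragraphs": paragraphs}]}


# Invented rows for illustration only.
example_rows = [
    {"question": "What is the study design?",
     "context": "This was a retrospective cohort study.",
     "answer": "retrospective cohort study"},
    {"question": "What dose was used?",
     "context": "This was a retrospective cohort study.",
     "answer": ""},
]
squad = to_squad(example_rows)
squad_json = json.dumps(squad, indent=2)
```

    Rows whose answer cannot be located in the context are marked `is_impossible`, which is how SQuAD 2.0 represents unanswerable questions.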

    Usage

    This dataset is ideally suited for building and fine-tuning transformer models for language modelling and SQuAD 2.0 tasks. It serves as a foundational resource for developing advanced question-answering systems in the medical and healthcare domain, particularly those focused on coronavirus-related information. It is highly valuable for researchers and developers working on AI and large language model (LLM) applications.

    Coverage

    The dataset's focus is global in its applicability, concentrating on high-quality articles with detected study designs related to the CORD-19 research. While a specific time range for the included articles is not provided, the dataset itself was listed on 17th June 2025, indicating its recent availability on the platform. The primary scope is medical and healthcare information, specifically concerning the coronavirus.

    License

    CC BY-SA

    Who Can Use It

    This dataset is intended for a broad range of users, including:

    • AI/ML Engineers and Data Scientists: For training and evaluating question-answering models and other NLP tasks.
    • Healthcare Researchers: To develop tools for quickly extracting information from a vast corpus of medical literature.
    • Academic Institutions: For research and educational purposes in the fields of AI, NLP, and medical informatics.
    • Start-ups and Enterprises: Developing innovative health information systems or AI-powered medical assistants.

    Dataset Name Suggestions

    • CORD-19 QA Dataset
    • Coronavirus Question-Answer Model Data
    • Medical NLP QA Resource
    • AI Health Q&A Dataset

    Attributes

    Original Data Source: CORD-19 QA
