100+ datasets found
  1. Identifying Diseases Treatments in Healthcare Data

    • kaggle.com
    zip
    Updated Mar 5, 2025
    Cite
    Sagar Maru (2025). Identifying Diseases Treatments in Healthcare Data [Dataset]. https://www.kaggle.com/datasets/marusagar/identifying-diseases-treatments-in-healthcare-data
    Explore at:
    Available download formats: zip (166655 bytes)
    Dataset updated
    Mar 5, 2025
    Authors
    Sagar Maru
    Description

    Identifying Entities (Diseases, Treatments) in Healthcare Data

    Finding diseases and treatments in medical text—because even AI needs a medical degree to understand doctor’s notes! 🩺🤖

    📊 Understanding the Dataset

    In the contemporary healthcare ecosystem, large amounts of unstructured text are generated every day through electronic health records (EHRs), doctors’ notes, prescriptions, and medical literature. The ability to extract meaningful insights from this data is critical for improving patient care, advancing clinical research, and optimizing healthcare services. This dataset contains text-based medical records in which diseases and their corresponding treatments are embedded within unstructured sentences.

    The dataset consists of labeled text samples, divided into:

    • Train Sentences: Sentences containing clinical information, including patient diagnoses and the treatments administered.
    • Train Labels: The corresponding annotations for the train sentences, marking diseases and treatments as named entities.
    • Test Sentences: Similar to the train sentences, but used to evaluate model performance.
    • Test Labels: The ground-truth labels for the test sentences.

    A sample from the dataset may look as follows:

    🔍 Example from Dataset:

    Train Sentences:

    • "The patient was a 62-year-old man with squamous epithelium, who was previously treated successfully with a combination of radiation therapy and chemotherapy."

    Train Labels:

    • Disease: 🦠 lung cancer
    • Treatment: 💉 Radiation therapy, chemotherapy

    This dataset calls for Named Entity Recognition (NER) to extract diseases and map them to their related treatments 💊, turning unstructured medical text into structured data for analytical purposes.

    ⚙️ Dataset Properties

    1. Unstructured medical text: The dataset contains free-form medical notes in which disease and treatment terms are mentioned. Extracting this information without an explicit mapping is a challenge.
    2. Multiple entity types: The dataset contains different named entities such as diseases, treatments, symptoms, and possibly medications.
    3. Contextual dependency: A treatment can apply to many diseases, and the correct mapping depends on context. For example, "radiotherapy" is used for different cancers, which makes contextual understanding important.
    4. Imbalanced data distribution: Some diseases and treatments may appear more often than others; balancing model performance requires techniques such as oversampling, undersampling, or transfer learning.
    5. Domain-specific language: The text is rich in medical terminology, which requires specialized preprocessing using domain-specific NLP techniques and medical ontologies such as UMLS or SNOMED CT.

    🚧 Challenges Working with Dataset

    • Complex medical vocabulary: Medical texts often use jargon, which requires specialized NLP models trained on clinical corpora.

    • Implicit Relationships: Unlike structured datasets, disease-treatment relationships are inferred from context rather than explicitly stated.

    • Synonyms and Abbreviations: Diseases and treatments can be referred to by different names (e.g., ‘myocardial infarction’ vs. ‘heart attack’). Handling such variations is vital.

    • Noise in Data: Unstructured records may also contain irrelevant information, typographical errors, and inconsistencies that affect extraction accuracy.

    🛠️ Approach to Extracting Insights from the Dataset

    To extract diseases and their respective treatments from this dataset, we follow a structured NLP pipeline:

    1. Data Preprocessing 🧹

    • Text Cleaning: Remove unnecessary characters, numbers, and stopwords while preserving clinical terms.
    • Tokenization: Split sentences into words for easier processing.
    • Medical Term Standardization: Use domain-specific libraries like SciSpacy to standardize synonyms and abbreviations.
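
    The cleaning and tokenization steps above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library; the stopword list, the `CLINICAL_TERMS` allow-list, and the function name `preprocess` are placeholders assumed for the example, not part of the dataset. A real pipeline would more likely rely on a domain library such as SciSpacy.

```python
import re

# Hypothetical, deliberately tiny lists used only for illustration.
STOPWORDS = {"the", "a", "an", "was", "with", "of", "and", "who"}
CLINICAL_TERMS = {"radiation", "therapy", "chemotherapy"}


def preprocess(sentence: str) -> list[str]:
    """Lowercase, strip non-letter characters, tokenize, and drop stopwords."""
    text = sentence.lower()
    # Replace anything that is not a letter, hyphen, or whitespace with a space.
    text = re.sub(r"[^a-z\-\s]", " ", text)
    tokens = text.split()
    # Keep a token if it is a protected clinical term or is not a stopword.
    return [t for t in tokens if t in CLINICAL_TERMS or t not in STOPWORDS]


tokens = preprocess("The patient was treated with radiation therapy and chemotherapy.")
```

    The allow-list check comes first so that a clinical term is never dropped even if it also appears in a stopword list.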

    2. Named Entity Recognition (NER) Model Development 🤖

    • Annotation: Ensure accurate labeling of diseases and treatments in the dataset.
    • Model Selection: Train a deep-learning-based model like BioBERT or a rule-based model using spaCy.
    • Training: Use annotated data to train a custom NER model that classifies words as disease or treatment entities.
    • Evaluation: Measure precision, recall, and F1-score to evaluate model performance.
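
    As a rough illustration of the evaluation step, the sketch below computes entity-level precision, recall, and F1 from sets of gold and predicted entities. It assumes exact-match scoring and represents entities as hypothetical `(text, label)` pairs; the description does not specify which scoring scheme is intended.

```python
def precision_recall_f1(gold: set, predicted: set) -> tuple[float, float, float]:
    """Entity-level scores; an entity counts as correct only on an exact match."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical gold and predicted entities as (text, label) pairs.
gold = {
    ("lung cancer", "Disease"),
    ("radiation therapy", "Treatment"),
    ("chemotherapy", "Treatment"),
}
pred = {
    ("lung cancer", "Disease"),
    ("chemotherapy", "Treatment"),
    ("patient", "Disease"),
}
p, r, f1 = precision_recall_f1(gold, pred)
```

    Here two of the three predictions are correct and two of the three gold entities are found, so all three scores come out to two thirds.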

    3. Mapping Diseases to Treatments 🔄

    • Contextual Relationship Extraction: Identify which treatment corresponds to which disease using dependency parsing and relation extraction.
    • Dictionary or Tabular Output: Store extracted mappings in a structured format.

    Example Output:

    | 🦠 Disease | 💉 Treatments | |----------|--------------------...
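
    One simple way to produce a disease-to-treatment dictionary is sketched below. It assumes each token carries a label of "D" (disease), "T" (treatment), or "O" (other), which is an assumption about the label format rather than something stated above. It also links entities by same-sentence co-occurrence, a much cruder heuristic than the dependency parsing and relation extraction the pipeline describes.

```python
from collections import defaultdict
from itertools import groupby


def extract_entities(tokens, labels):
    """Group consecutive tokens sharing a non-'O' label into entity strings."""
    entities = defaultdict(list)
    for label, group in groupby(zip(tokens, labels), key=lambda pair: pair[1]):
        if label != "O":
            entities[label].append(" ".join(tok for tok, _ in group))
    return entities


def map_diseases_to_treatments(sentences):
    """Link each disease to every treatment found in the same sentence."""
    mapping = defaultdict(set)
    for tokens, labels in sentences:
        entities = extract_entities(tokens, labels)
        for disease in entities["D"]:
            mapping[disease].update(entities["T"])
    return {disease: sorted(treatments) for disease, treatments in mapping.items()}


tokens = ["lung", "cancer", "treated", "with", "radiation", "therapy", "and", "chemotherapy"]
labels = ["D", "D", "O", "O", "T", "T", "O", "T"]
result = map_diseases_to_treatments([(tokens, labels)])
```

    Note that two adjacent entities with the same label would be merged into one by this grouping; a BIO-style tagging scheme avoids that at the cost of a slightly longer decoder.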

  2. A granular assessment of the day-to-day variation in emergency presentations...

    • healthdatagateway.org
    unknown
    Updated Oct 8, 2024
    Cite
    This publication uses data from PIONEER, an ethically approved database and analytical environment (East Midlands Derby Research Ethics 20/EM/0158) (2024). A granular assessment of the day-to-day variation in emergency presentations [Dataset]. https://healthdatagateway.org/en/dataset/175
    Explore at:
    Available download formats: unknown
    Dataset updated
    Oct 8, 2024
    Dataset authored and provided by
    This publication uses data from PIONEER, an ethically approved database and analytical environment (East Midlands Derby Research Ethics 20/EM/0158)
    License

    https://www.pioneerdatahub.co.uk/data/data-request-process/

    Description

    The acute-care pathway (from the emergency department (ED) through acute medical units or ambulatory care and on to wards) is the most visible aspect of the hospital health-care system to most patients. Acute hospital admissions are increasing yearly and overcrowded emergency departments and high bed occupancy rates are associated with a range of adverse patient outcomes. Predicted growth in demand for acute care driven by an ageing population and increasing multimorbidity is likely to exacerbate these problems in the absence of innovation to improve the processes of care.

    Key targets for Emergency Medicine services are changing, moving away from previous 4-hour targets. This will likely impact the assessment of patients admitted to hospital through Emergency Departments.

    This data set provides highly granular patient level information, showing the day-to-day variation in case mix and acuity. The data includes detailed demography, co-morbidity, symptoms, longitudinal acuity scores, physiology and laboratory results, all investigations, prescriptions, diagnoses and outcomes. It could be used to develop new pathways or understand the prevalence or severity of specific disease presentations.

    PIONEER geography: The West Midlands (WM) has a population of 5.9 million & includes a diverse ethnic & socio-economic mix.

    Electronic Health Record: University Hospital Birmingham is one of the largest NHS Trusts in England, providing direct acute services & specialist care across four hospital sites, with 2.2 million patient episodes per year, 2750 beds & an expanded 250 ITU bed capacity during COVID. UHB runs a fully electronic healthcare record (EHR) (PICS; Birmingham Systems), a shared primary & secondary care record (Your Care Connected) & a patient portal “My Health”.

    Scope: All patients with a medical emergency admitted to hospital, flowing through the acute medical unit. Longitudinal & individually linked, so that the preceding & subsequent health journey can be mapped & healthcare utilisation prior to & after admission understood. The dataset includes patient demographics, co-morbidities taken from ICD-10 & SNOMED-CT codes. Serial, structured data pertaining to process of care (timings, admissions, wards and readmissions), physiology readings (NEWS2 score and clinical frailty scale), Charlson comorbidity index and time dimensions.

    Available supplementary data: Matched controls; ambulance data, OMOP data, synthetic data.

    Available supplementary support: Analytics, Model build, validation & refinement; A.I.; Data partner support for ETL (extract, transform & load) process, Clinical expertise, Patient & end-user access, Purchaser access, Regulatory requirements, Data-driven trials, “fast screen” services.

  3. Table_1_Operational Challenges in the Use of Structured Secondary Data for...

    • frontiersin.figshare.com
    • datasetcatalog.nlm.nih.gov
    docx
    Updated May 30, 2023
    + more versions
    Cite
    Kelsy N. Areco; Tulio Konstantyner; Paulo Bandiera-Paiva; Rita C. X. Balda; Daniela T. Costa-Nobre; Adriana Sanudo; Carlos Roberto V. Kiffer; Mandira D. Kawakami; Milton H. Miyoshi; Ana Sílvia Scavacini Marinonio; Rosa M. V. Freitas; Liliam C. C. Morais; Monica L. P. Teixeira; Bernadette Waldvogel; Maria Fernanda B. Almeida; Ruth Guinsburg (2023). Table_1_Operational Challenges in the Use of Structured Secondary Data for Health Research.DOCX [Dataset]. http://doi.org/10.3389/fpubh.2021.642163.s001
    Explore at:
    Available download formats: docx
    Dataset updated
    May 30, 2023
    Dataset provided by
    Frontiers
    Authors
    Kelsy N. Areco; Tulio Konstantyner; Paulo Bandiera-Paiva; Rita C. X. Balda; Daniela T. Costa-Nobre; Adriana Sanudo; Carlos Roberto V. Kiffer; Mandira D. Kawakami; Milton H. Miyoshi; Ana Sílvia Scavacini Marinonio; Rosa M. V. Freitas; Liliam C. C. Morais; Monica L. P. Teixeira; Bernadette Waldvogel; Maria Fernanda B. Almeida; Ruth Guinsburg
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: In Brazil, secondary data for epidemiology are largely available. However, they are insufficiently prepared for use in research, even when it comes to structured data since they were often designed for other purposes. To date, few publications focus on the process of preparing secondary data. The present findings can help in orienting future research projects that are based on secondary data.

    Objective: Describe the steps in the process of ensuring the adequacy of a secondary data set for a specific use and to identify the challenges of this process.

    Methods: The present study is qualitative and reports methodological issues about secondary data use. The study material was comprised of 6,059,454 live births and 73,735 infant death records from 2004 to 2013 of children whose mothers resided in the State of São Paulo - Brazil. The challenges and description of the procedures to ensure data adequacy were undertaken in 6 steps: (1) problem understanding, (2) resource planning, (3) data understanding, (4) data preparation, (5) data validation and (6) data distribution. For each step, procedures, and challenges encountered, and the actions to cope with them and partial results were described. To identify the most labor-intensive tasks in this process, the steps were assessed by adding the number of procedures, challenges, and coping actions. The highest values were assumed to indicate the most critical steps.

    Results: In total, 22 procedures and 23 actions were needed to deal with the 27 challenges encountered along the process of ensuring the adequacy of the study material for the intended use. The final product was an organized database for a historical cohort study suitable for the intended use. Data understanding and data preparation were identified as the most critical steps, accounting for about 70% of the challenges observed for data using.

    Conclusion: Significant challenges were encountered in the process of ensuring the adequacy of secondary health data for research use, mainly in the data understanding and data preparation steps. The use of the described steps to approach structured secondary data and the knowledge of the potential challenges along the process may contribute to planning health research.

  4. Data from: MIMIC-IV-Ext-BHC: Labeled Clinical Notes Dataset for Hospital...

    • physionet.org
    Updated Feb 3, 2025
    Cite
    Asad Aali; Dave Van Veen; Yamin Arefeen; Jason Hom; Christian Bluethgen; Eduardo Pontes Reis; Sergios Gatidis; Namuun Clifford; Joseph Daws; Arash Tehrani; Jangwon Kim; Akshay Chaudhari (2025). MIMIC-IV-Ext-BHC: Labeled Clinical Notes Dataset for Hospital Course Summarization [Dataset]. http://doi.org/10.13026/5gte-bv70
    Explore at:
    Dataset updated
    Feb 3, 2025
    Authors
    Asad Aali; Dave Van Veen; Yamin Arefeen; Jason Hom; Christian Bluethgen; Eduardo Pontes Reis; Sergios Gatidis; Namuun Clifford; Joseph Daws; Arash Tehrani; Jangwon Kim; Akshay Chaudhari
    License

    https://github.com/MIT-LCP/license-and-dua/tree/master/drafts

    Description

    This dataset presents a curated collection of preprocessed and labeled clinical notes derived from the MIMIC-IV-Note database. The primary aim of this resource is to facilitate the development and training of machine learning models focused on summarizing brief hospital courses (BHC) from clinical discharge notes.

    The dataset contains 270,033 meticulously cleaned and standardized clinical notes with an average length of 2,267 tokens, ensuring usability for machine learning (ML) applications. Each clinical note is paired with a corresponding BHC summary, providing a robust foundation for supervised learning tasks. The preprocessing pipeline uses regular expressions to address common issues in the raw clinical text, such as special characters, extraneous whitespace, inconsistent formatting, and irrelevant text, producing a high-quality, structured dataset in which clinical note sections are separated by appropriate headings.
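
    To give a flavor of regex-based clean-up of this kind, here is a minimal sketch. It is not the authors' pipeline: the retained character set, the `Heading:` line format, and the function names `clean_note` and `split_sections` are all assumptions made for illustration.

```python
import re


def clean_note(raw: str) -> str:
    """Apply a few regex clean-ups of the kind the description mentions."""
    text = raw.replace("\r\n", "\n")
    # Drop characters outside a conservative set (an assumed choice).
    text = re.sub(r"[^\w\s.,;:()/%+\-]", "", text)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Collapse three or more newlines into a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    # Trim leading and trailing spaces on every line.
    text = "\n".join(line.strip() for line in text.split("\n"))
    return text.strip()


def split_sections(text: str) -> dict[str, str]:
    """Split on lines that look like 'Heading: ...' (hypothetical format)."""
    sections: dict[str, str] = {}
    current = "PREAMBLE"
    for line in text.split("\n"):
        match = re.fullmatch(r"([A-Za-z ]+):\s*(.*)", line)
        if match:
            current = match.group(1).strip().upper()
            sections[current] = match.group(2)
        else:
            sections[current] = (sections.get(current, "") + " " + line).strip()
    return sections


raw_note = "Chief Complaint:   chest   pain ***\n\n\n\nHistory of Present Illness:\nPatient is a 62 y/o   male."
sections = split_sections(clean_note(raw_note))
```

    The heading pattern is intentionally naive; any line of letters followed by a colon is treated as a section heading, so real notes would need a curated list of known headings.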

    By offering this resource, we aim to support healthcare professionals and researchers in their efforts to enhance patient care through the automation of BHC summarization. This dataset is ideal for exploring various NLP techniques, developing predictive models, and improving the efficiency and accuracy of clinical documentation practices. We invite the research community to utilize this dataset to advance the field of medical informatics and contribute to better health outcomes.

  5. Prognostics Design Solutions in Structural Health Monitoring Systems

    • catalog.data.gov
    • s.cnmilf.com
    • +2more
    Updated Apr 11, 2025
    Cite
    Dashlink (2025). Prognostics Design Solutions in Structural Health Monitoring Systems [Dataset]. https://catalog.data.gov/dataset/prognostics-design-solutions-in-structural-health-monitoring-systems
    Explore at:
    Dataset updated
    Apr 11, 2025
    Dataset provided by
    Dashlink
    Description

    The chapter describes the application of prognostic techniques to the domain of structural health and demonstrates the efficacy of the methods using fatigue data from a graphite-epoxy composite coupon. Prognostics denotes the in-situ assessment of the health of a component and the repeated estimation of remaining life, conditional on anticipated future usage. The methods shown here use a physics-based modeling approach whereby the behavior of the damaged component is encapsulated via mathematical equations that describe the characteristics of the component as it experiences increasing degrees of degradation. Mathematically rigorous techniques are used to extrapolate the remaining life to a failure threshold. Additionally, mathematical tools are used to calculate the uncertainty associated with making predictions. The information stemming from the predictions can be used in an operational context to make go/no-go decisions, to quantify the risk to completing a mission or set of operations, and to decide when to schedule maintenance.
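
    The idea of extrapolating remaining life to a failure threshold can be illustrated with a toy example. The sketch below is not the chapter's physics-based model: it substitutes a plain least-squares line through a hypothetical damage index and uses a crude band of plus or minus two residual standard deviations in place of a proper uncertainty calculation. All names and numbers are assumptions.

```python
from statistics import mean, pstdev


def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def remaining_life(cycles, damage, threshold):
    """Extrapolate the fitted trend to the threshold; return (rul, low, high)."""
    slope, intercept = fit_line(cycles, damage)
    if slope <= 0:
        raise ValueError("damage is not increasing; cannot extrapolate")
    residuals = [y - (slope * x + intercept) for x, y in zip(cycles, damage)]
    sigma = pstdev(residuals)
    now = cycles[-1]

    def cycles_to_threshold(offset):
        # Cycles left until the (shifted) fitted line reaches the threshold.
        return (threshold - (intercept + offset)) / slope - now

    # Crude band: shift the fitted line by +/- 2 sigma of the residuals.
    return (
        cycles_to_threshold(0.0),
        cycles_to_threshold(2 * sigma),
        cycles_to_threshold(-2 * sigma),
    )


# Hypothetical damage index measured at four points in a fatigue test.
rul, low, high = remaining_life([0, 10, 20, 30], [0.0, 0.1, 0.2, 0.3], threshold=1.0)
```

    With perfectly linear data the band collapses to a point; noisier measurements widen the interval around the estimate.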

  6. FAIR Dataset for Disease Prediction in Healthcare Applications

    • test.researchdata.tuwien.ac.at
    bin, csv, json, png
    Updated Apr 14, 2025
    Cite
    Sufyan Yousaf (2025). FAIR Dataset for Disease Prediction in Healthcare Applications [Dataset]. http://doi.org/10.70124/5n77a-dnf02
    Explore at:
    Available download formats: csv, json, bin, png
    Dataset updated
    Apr 14, 2025
    Dataset provided by
    TU Wien
    Authors
    Sufyan Yousaf
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Description

    Context and Methodology

    • Research Domain/Project:
      This dataset was created for a machine learning experiment aimed at developing a classification model to predict outcomes based on a set of features. The primary research domain is disease prediction in patients. The dataset was used in the context of training, validating, and testing.

    • Purpose of the Dataset:
      The purpose of this dataset is to provide training, validation, and testing data for the development of machine learning models. It includes labeled examples that help train classifiers to recognize patterns in the data and make predictions.

    • Dataset Creation:
      Data preprocessing steps involved cleaning, normalization, and splitting the data into training, validation, and test sets. The data was carefully curated to ensure its quality and relevance to the problem at hand. For any missing values or outliers, appropriate handling techniques were applied (e.g., imputation, removal, etc.).
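
    Two of the preprocessing techniques named above, imputation and normalization, can be sketched as follows. The description does not say which variants were actually applied, so mean imputation and min-max scaling are assumptions chosen for simplicity, and the sample column is invented.

```python
from statistics import mean


def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map everything to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


column = [2.0, None, 4.0, 6.0]  # hypothetical feature column with one gap
filled = impute_mean(column)
scaled = min_max_normalize(filled)
```

    When data is split into training, validation, and test sets, the fill value and the min/max are normally computed on the training split only and then reused for the other two, so that no information leaks from held-out data.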

    Technical Details

    • Structure of the Dataset:
      The dataset consists of several files organized into folders by data type:

      • Training Data: Contains the training dataset used to train the machine learning model.

      • Validation Data: Used for hyperparameter tuning and model selection.

      • Test Data: Reserved for final model evaluation.

      Each folder contains files with consistent naming conventions for easy navigation, such as train_data.csv, validation_data.csv, and test_data.csv. Each file follows a tabular format with columns representing features and rows representing individual data points.

    • Software Requirements:
      To open and work with this dataset, you can use an environment such as VS Code or Jupyter, together with tools like:

      • Python (with libraries such as pandas, numpy, scikit-learn, matplotlib, etc.)
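
    The file layout described under "Structure of the Dataset" could be loaded along these lines. This sketch uses only the standard library `csv` module rather than pandas; the file names are the examples given in the description, while the folder names are not stated, so the code searches subfolders. The function name `load_splits` is hypothetical.

```python
import csv
from pathlib import Path

# File names taken from the examples in the description.
SPLIT_FILES = {
    "train": "train_data.csv",
    "validation": "validation_data.csv",
    "test": "test_data.csv",
}


def load_splits(root):
    """Read each split into a list of dict rows keyed by column name."""
    splits = {}
    for name, filename in SPLIT_FILES.items():
        # rglob searches subfolders, since the exact folder names are not stated.
        matches = sorted(Path(root).rglob(filename))
        if not matches:
            raise FileNotFoundError(f"{filename} not found under {root}")
        with matches[0].open(newline="", encoding="utf-8") as handle:
            splits[name] = list(csv.DictReader(handle))
    return splits
```

    Calling `load_splits("path/to/dataset")` returns a dictionary with the keys `train`, `validation`, and `test`; all values are read as strings, so numeric columns still need converting.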

    Further Details

    • Reusability:
      Users of this dataset should be aware that it is designed for machine learning experiments involving classification tasks. The dataset is already split into training, validation, and test subsets. Any model trained with this dataset should be evaluated using the test set to ensure proper validation.

    • Limitations:
      The dataset may not cover all edge cases, and it might have biases depending on the selection of data sources. It's important to consider these limitations when generalizing model results to real-world applications.

  7. Healthcare Ransomware Dataset

    • kaggle.com
    zip
    Updated Feb 21, 2025
    Cite
    River | Datasets for SQL Practice (2025). Healthcare Ransomware Dataset [Dataset]. https://www.kaggle.com/datasets/rivalytics/healthcare-ransomware-dataset
    Explore at:
    Available download formats: zip (221852 bytes)
    Dataset updated
    Feb 21, 2025
    Authors
    River | Datasets for SQL Practice
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    📌 Context of the Dataset

    The Healthcare Ransomware Dataset was created to simulate real-world cyberattacks in the healthcare industry. Hospitals, clinics, and research labs have become prime targets for ransomware due to their reliance on real-time patient data and legacy IT infrastructure. This dataset provides insight into attack patterns, recovery times, and cybersecurity practices across different healthcare organizations.

    Why is this important?

    Ransomware attacks on healthcare organizations can shut down entire hospitals, delay treatments, and put lives at risk. Understanding how different healthcare organizations respond to attacks can help develop better security strategies. The dataset allows cybersecurity analysts, data scientists, and researchers to study patterns in ransomware incidents and explore predictive modeling for risk mitigation.

    📌 Sources and Research Inspiration

    This simulated dataset was inspired by real-world cybersecurity reports and built using insights from official sources, including:

    1️⃣ IBM Cost of a Data Breach Report (2024)

    The healthcare sector had the highest average cost of data breaches ($10.93 million per incident). On average, organizations recovered only 64.8% of their data after paying ransom. Healthcare breaches took 277 days on average to detect and contain.

    2️⃣ Sophos State of Ransomware in Healthcare (2024)

    67% of healthcare organizations were hit by ransomware in 2024, an increase from 60% in 2023. 66% of backup compromise attempts succeeded, making data recovery significantly more difficult. The most common attack vectors included exploited vulnerabilities (34%) and compromised credentials (34%).

    3️⃣ Health & Human Services (HHS) Cybersecurity Reports

    Ransomware incidents in healthcare have doubled since 2016. Organizations that fail to monitor threats frequently experience higher infection rates.

    4️⃣ Cybersecurity & Infrastructure Security Agency (CISA) Alerts

    Identified phishing, unpatched software, and exposed RDP ports as top ransomware entry points. Only 13% of healthcare organizations monitor cyber threats more than once per day, increasing the risk of undetected attacks.

    5️⃣ Emsisoft 2020 Report on Ransomware in Healthcare

    The number of ransomware attacks in healthcare increased by 278% between 2018 and 2023. 560 healthcare facilities were affected in a single year, disrupting patient care and emergency services.

    📌 Why is This a Simulated Dataset?

    This dataset does not contain real patient data or actual ransomware cases. Instead, it was built using probabilistic modeling and structured randomness based on industry benchmarks and cybersecurity reports.

    How It Was Created:

    1️⃣ Defining the Dataset Structure

    The dataset was designed to simulate realistic attack patterns in healthcare, using actual ransomware case studies as inspiration.

    Columns were selected based on what real-world cybersecurity teams track, such as:

    • Attack methods (phishing, RDP exploits, credential theft).
    • Infection rates, recovery time, and backup compromise rates.
    • Organization type (hospitals, clinics, research labs) and monitoring frequency.

    2️⃣ Generating Realistic Data Using ChatGPT & Python

    ChatGPT assisted in defining relationships between attack factors, ensuring that key cybersecurity concepts were accurately reflected. Python’s NumPy and Pandas libraries were used to introduce randomized attack simulations based on real-world statistics. Data was validated against industry research to ensure it aligns with actual ransomware attack trends.

    3️⃣ Ensuring Logical Relationships Between Data Points

    Hospitals take longer to recover due to larger infrastructure and compliance requirements. Organizations that track more cyber threats recover faster because they detect attacks earlier. Backup security significantly impacts recovery time, reflecting the real-world risk of backup encryption attacks.
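
    A simulation with these logical relationships might look something like the sketch below. It is not the dataset author's code: it uses the standard library `random` module instead of NumPy and Pandas, and every number in it (base recovery times, the monitoring discount, the backup penalty, the noise range) is an invented assumption, apart from the 66% backup figure echoed from the Sophos statistic above.

```python
import random

# All numbers below are illustrative assumptions, not values from the dataset.
BASE_RECOVERY_DAYS = {"hospital": 14.0, "clinic": 7.0, "research lab": 9.0}


def expected_recovery_days(org_type, checks_per_day, backup_compromised):
    """Deterministic part of the model encoding the three stated relationships."""
    days = BASE_RECOVERY_DAYS[org_type]
    # More frequent threat monitoring shortens recovery (capped at 50% off).
    days *= 1.0 - min(0.1 * checks_per_day, 0.5)
    # A compromised backup lengthens recovery.
    if backup_compromised:
        days *= 1.5
    return days


def simulate_incident(rng):
    """Draw one synthetic incident record with multiplicative noise."""
    org_type = rng.choice(list(BASE_RECOVERY_DAYS))
    checks = rng.randint(0, 4)
    compromised = rng.random() < 0.66  # loosely echoes the 66% figure cited above
    noise = rng.uniform(0.8, 1.2)
    expected = expected_recovery_days(org_type, checks, compromised)
    return {
        "org_type": org_type,
        "monitoring_checks_per_day": checks,
        "backup_compromised": compromised,
        "recovery_days": round(expected * noise, 1),
    }


rng = random.Random(42)  # fixed seed so the draw is reproducible
records = [simulate_incident(rng) for _ in range(5)]
```

    Keeping the deterministic relationships in their own function makes them easy to check separately from the random noise.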

  8. Hospital Building Data

    • data.chhs.ca.gov
    • data.ca.gov
    • +3more
    csv, zip
    Updated Dec 2, 2025
    Cite
    Department of Health Care Access and Information (2025). Hospital Building Data [Dataset]. https://data.chhs.ca.gov/dataset/hospital-building-data
    Explore at:
    Available download formats: csv (2534), csv (1543128), zip
    Dataset updated
    Dec 2, 2025
    Dataset authored and provided by
    Department of Health Care Access and Information
    Description

    Provides basic information for general acute care hospital buildings such as height, number of stories, the building code used to design the building, and the year it was completed. The data is sorted by counties and cities. Structural Performance Categories (SPC ratings) are also provided. SPC ratings range from 1 to 5 with SPC 1 assigned to buildings that may be at risk of collapse during a strong earthquake and SPC 5 assigned to buildings reasonably capable of providing services to the public following a strong earthquake. Where SPC ratings have not been confirmed by the Department of Health Care Access and Information (HCAI) yet, the rating index is followed by 's'. A URL for the building webpage in HCAI/OSHPD eServices Portal is also provided to view projects related to any building.
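
    The convention of appending 's' to unconfirmed SPC ratings can be handled with a small parser like the sketch below. The exact string format in the CSV files is an assumption (for example "3s" rather than "3 s"), and the function name `parse_spc_rating` is hypothetical.

```python
import re


def parse_spc_rating(raw):
    """Return (rating, confirmed) for strings such as '3' or '3s'."""
    match = re.fullmatch(r"\s*([1-5])\s*(s?)\s*", raw, flags=re.IGNORECASE)
    if not match:
        raise ValueError(f"unrecognized SPC rating: {raw!r}")
    # An 's' suffix marks a rating that has not yet been confirmed by HCAI.
    return int(match.group(1)), match.group(2) == ""
```

    For instance, `parse_spc_rating("3s")` yields a rating of 3 flagged as unconfirmed, while a value outside 1 to 5 raises a `ValueError`.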

  9. AgentDS-Healthcare

    • huggingface.co
    Updated Nov 5, 2025
    Cite
    An Luo (2025). AgentDS-Healthcare [Dataset]. https://huggingface.co/datasets/lainmn/AgentDS-Healthcare
    Explore at:
    Dataset updated
    Nov 5, 2025
    Authors
    An Luo
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    🏥 AgentDS-Healthcare

    This dataset is part of the AgentDS Benchmark — a multi-domain benchmark for evaluating human-AI collaboration in real-world, domain-specific data science. AgentDS-Healthcare includes structured and time-series clinical data for 3 challenges:

    30-day hospital readmission prediction
    Emergency department (ED) cost forecasting
    Discharge readiness prediction

    👉 Files are organized in the Healthcare/ folder and reused across challenges. Refer to the included… See the full description on the dataset page: https://huggingface.co/datasets/lainmn/AgentDS-Healthcare.

  10. SymbiPredict

    • data.mendeley.com
    Updated Apr 2, 2024
    Cite
    Jay Tucker (2024). SymbiPredict [Dataset]. http://doi.org/10.17632/dv5z3v2xyd.1
    Explore at:
    Dataset updated
    Apr 2, 2024
    Authors
    Jay Tucker
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Symptom-Disease Prediction Dataset (SDPD) is a comprehensive collection of structured data linking symptoms to various diseases, meticulously curated to facilitate research and development in predictive healthcare analytics. Inspired by the methodology employed by renowned institutions such as the Centers for Disease Control and Prevention (CDC), this dataset aims to provide a reliable foundation for the development of symptom-based disease prediction models. The dataset encompasses a diverse range of symptoms sourced from reputable medical literature, clinical observations, and expert consensus.

  11. Data from: Public Health Departments

    • nc-onemap-2-nconemap.hub.arcgis.com
    • nconemap.gov
    • +3more
    Updated Feb 19, 2010
    Cite
    NC OneMap / State of North Carolina (2010). Public Health Departments [Dataset]. https://nc-onemap-2-nconemap.hub.arcgis.com/datasets/public-health-departments
    Explore at:
    Dataset updated
    Feb 19, 2010
    Dataset authored and provided by
    NC OneMap / State of North Carolina
    License

    https://www.nconemap.gov/pages/terms

    Description

    State and Local Public Health Departments Governmental public health departments are responsible for creating and maintaining conditions that keep people healthy. A local health department may be locally governed, part of a region or district, be an office or an administrative unit of the state health department, or a hybrid of these. Furthermore, each community has a unique "public health system" comprising individuals and public and private entities that are engaged in activities that affect the public's health. (Excerpted from the Operational Definition of a functional local health department, National Association of County and City Health Officials, November 2005) Please reference http://www.naccho.org/topics/infrastructure/accreditation/upload/OperationalDefinitionBrochure-2.pdf for more information. Facilities involved in direct patient care are intended to be excluded from this dataset; however, some of the entities represented in this dataset serve as both administrative and clinical locations. This dataset only includes the headquarters of Public Health Departments, not their satellite offices. Some health departments encompass multiple counties; therefore, not every county will be represented by an individual record. Also, some areas will appear to have over representation depending on the structure of the health departments in that particular region. Visiting nurses are represented in this dataset if they are contracted through the local government to fulfill the duties and responsibilities of the local health organization. Effort was made by TechniGraphics to verify whether or not each health department tracks statistics on communicable diseases. Records with "-DOD" appended to the end of the [NAME] value are located on a military base, as defined by the Defense Installation Spatial Data Infrastructure (DISDI) military installations and military range boundaries. 
"#" and "*" characters were automatically removed from standard fields populated by TechniGraphics. Double spaces were replaced by single spaces in these same fields. Text fields in this dataset have been set to all upper case to facilitate consistent database engine search results. All diacritics (e.g., the German umlaut or the Spanish tilde) have been replaced with their closest equivalent English character to facilitate use with database systems that may not support diacritics. The currentness of this dataset is indicated by the [CONTDATE] field. Based on this field, the oldest record dates from 11/25/2009 and the newest record dates from 12/28/2009

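    The field cleanup described above (removing "#" and "*", collapsing double spaces, upper-casing, and replacing diacritics) can be approximated in a few lines. This is only a sketch of the idea, not the provider's actual procedure: the function name `normalize_field`, the order of the steps, and the sample value are assumptions made for illustration.

```python
import re
import unicodedata


def normalize_field(value: str) -> str:
    """Approximate the text-field cleanup described in the dataset notes."""
    # Drop "#" and "*" characters.
    value = value.replace("#", "").replace("*", "")
    # Replace diacritics with the closest unaccented character:
    # decompose each character (NFKD), then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", value)
    value = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Collapse runs of two or more spaces into a single space.
    value = re.sub(r" {2,}", " ", value)
    # Upper-case for consistent database searches.
    return value.upper()


# Hypothetical input value, not taken from the dataset.
cleaned = normalize_field("Doña Ana  County *Health Dept #2")
print(cleaned)
```

    Characters that have no decomposition (for example "ø") pass through this sketch unchanged, so a production cleanup would likely need an explicit replacement table as well.
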
  12. Health Care Data Set ( 20+ Tables )

    • kaggle.com
    zip
    Updated Nov 1, 2025
    Moid Ahmed (2025). Health Care Data Set ( 20+ Tables ) [Dataset]. https://www.kaggle.com/datasets/moid1234/health-care-data-set-20-tables
    Available download formats: zip (2540688774 bytes)
    Authors
    Moid Ahmed
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    NOTE: Please Read Text File named "ERD Relationship Text" for Detailed Information.

    This dataset represents a complete healthcare management system modeled as a relational database containing over 20 interlinked tables. It captures the entire lifecycle of healthcare operations from patient registration to diagnosis, treatment, billing, inventory, and vendor management. The data structure is designed to simulate a real-world hospital information system (HIS), enabling advanced analytics, data modeling, and visualization. You can easily visualize and explore the schema using tools like dbdiagram.io by pasting the provided table definitions.

    The dataset covers multiple operational areas of a hospital including patient information, clinical operations, financial transactions, human resources, and logistics.

    Patient Information includes personal, contact, and emergency details, along with identification and insurance. Clinical Operations include visits, appointments, diagnoses, treatments, and medications. Financial Transactions cover bills, payments, and vendor settlements. Human Resources include staff details, departments, and medical teams. Logistics and Inventory include equipment, medicines, supplies, and vendor relationships.

    • Patients (STG_EHP_PATN) are linked to Appointments, Visits, Diagnoses, Treatments, Bills, and Insurance Policies.
    • Medical Teams (STG_EHP_MEDT) connect Staff with Visits and Treatments.
    • Allergies and Patient Allergies tables track patient-specific allergy information.
    • Financial tables (Bills, Payments, Vendor Payments) are interconnected through reference numbers for consistent transaction tracing.
    • Inventory tables record medicine and equipment stock movements, supply receipts, and vendor sourcing.

    This dataset can be used for data modeling and SQL practice for complex joins and normalization, healthcare analytics projects involving cost analysis, treatment efficiency, and patient demographics, visualization projects in Power BI, Tableau, or Domo for operational insights, building ETL pipelines and data warehouse models for healthcare systems, and machine learning applications such as predicting patient readmission, billing anomalies, or treatment outcomes.
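
    As a rough illustration of the join practice mentioned above, the sketch below builds two tiny in-memory tables and aggregates bills per patient. The table names `patients` and `bills`, every column name, and all rows are hypothetical; the real table names (such as STG_EHP_PATN) and their columns are defined in the dataset's ERD file.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical stand-ins for the patient and billing tables.
    CREATE TABLE patients (patient_id INTEGER PRIMARY KEY, full_name TEXT);
    CREATE TABLE bills (
        bill_id INTEGER PRIMARY KEY,
        patient_id INTEGER REFERENCES patients(patient_id),
        amount REAL
    );
    INSERT INTO patients VALUES (1, 'Patient A'), (2, 'Patient B');
    INSERT INTO bills VALUES (10, 1, 120.0), (11, 1, 80.0), (12, 2, 50.0);
""")

# Join bills to patients and total the billed amount per patient.
rows = conn.execute("""
    SELECT p.full_name, COUNT(b.bill_id), SUM(b.amount)
    FROM patients AS p
    JOIN bills AS b ON b.patient_id = p.patient_id
    GROUP BY p.patient_id
    ORDER BY p.patient_id
""").fetchall()

for name, bill_count, total in rows:
    print(name, bill_count, total)
```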

    To explore the data relationships visually, go to dbdiagram.io, paste the entire provided schema code, and press 2 then 1 (or 2 and Enter) to auto-align the diagram. You’ll see an interactive Entity Relationship Diagram (ERD) representing the entire healthcare ecosystem.

    Total Tables: 20+
    Total Columns: 200+
    Primary Focus: Patient Management, Clinical Operations, Billing, and Supply Chain

  13. Data from: A Structural Model Decomposition Framework for Systems Health...

    • catalog.data.gov
    • s.cnmilf.com
    Updated Apr 11, 2025
    Dashlink (2025). A Structural Model Decomposition Framework for Systems Health Management [Dataset]. https://catalog.data.gov/dataset/a-structural-model-decomposition-framework-for-systems-health-management
    Dataset provided by
    Dashlink
    Description

    Systems health management (SHM) is an important set of technologies aimed at increasing system safety and reliability by detecting, isolating, and identifying faults; and predicting when the system reaches end of life (EOL), so that appropriate fault mitigation and recovery actions can be taken. Model-based SHM approaches typically make use of global, monolithic system models for online analysis, which results in a loss of scalability and efficiency for large-scale systems. Improvement in scalability and efficiency can be achieved by decomposing the system model into smaller local submodels and operating on these submodels instead. In this paper, the global system model is analyzed offline and structurally decomposed into local submodels. We define a common model decomposition framework for extracting submodels from the global model. This framework is then used to develop algorithms for solving model decomposition problems for the design of three separate SHM technologies, namely, estimation (which is useful for fault detection and identification), fault isolation, and EOL prediction. We solve these model decomposition problems using a three-tank system as a case study.

  14. The MultiCaRe Dataset: A Multimodal Case Report Dataset with Clinical Cases,...

    • zenodo.org
    bin, csv, zip
    Updated Jan 5, 2024
    Mauro Nievas Offidani; Claudio Delrieux (2024). The MultiCaRe Dataset: A Multimodal Case Report Dataset with Clinical Cases, Labeled Images and Captions from Open Access PMC Articles [Dataset]. http://doi.org/10.5281/zenodo.10079370
    Available download formats: zip, bin, csv
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Mauro Nievas Offidani; Claudio Delrieux
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset contains multi-modal data from over 75,000 open access and de-identified case reports, including metadata, clinical cases, image captions and more than 130,000 images. Images and clinical cases belong to different medical specialties, such as oncology, cardiology, surgery and pathology. The structure of the dataset makes it easy to map images to their corresponding article metadata, clinical case, captions and image labels. Details of the data structure can be found in the file data_dictionary.csv.

    Almost 100,000 patients and almost 400,000 medical doctors and researchers were involved in the creation of the articles included in this dataset. The citation data of each article can be found in the metadata.parquet file.

    Refer to the examples showcased in this GitHub repository to understand how to optimize the use of this dataset.

    For a detailed insight about the contents of this dataset, please refer to this data article published in Data In Brief.

  15. Global AI Training Dataset Market Research Report: By Application (Natural...

    • wiseguyreports.com
    Updated Sep 30, 2025
    (2025). Global AI Training Dataset Market Research Report: By Application (Natural Language Processing, Computer Vision, Speech Recognition, Predictive Analytics), By Data Type (Structured Data, Unstructured Data, Semi-Structured Data), By Industry (Healthcare, Finance, Retail, Automotive, Manufacturing), By Data Acquisition Method (Manual Data Collection, Automated Data Collection, Synthetic Data Generation) and By Regional (North America, Europe, South America, Asia Pacific, Middle East and Africa) - Forecast to 2035 [Dataset]. https://www.wiseguyreports.com/reports/ai-training-dataset-market
    License

    https://www.wiseguyreports.com/pages/privacy-policy

    Time period covered
    Sep 25, 2025
    Area covered
    Global
    Description
    BASE YEAR: 2024
    HISTORICAL DATA: 2019 - 2023
    REGIONS COVERED: North America, Europe, APAC, South America, MEA
    REPORT COVERAGE: Revenue Forecast, Competitive Landscape, Growth Factors, and Trends
    MARKET SIZE 2024: 3.83 (USD Billion)
    MARKET SIZE 2025: 4.62 (USD Billion)
    MARKET SIZE 2035: 30.0 (USD Billion)
    SEGMENTS COVERED: Application, Data Type, Industry, Data Acquisition Method, Regional
    COUNTRIES COVERED: US, Canada, Germany, UK, France, Russia, Italy, Spain, Rest of Europe, China, India, Japan, South Korea, Malaysia, Thailand, Indonesia, Rest of APAC, Brazil, Mexico, Argentina, Rest of South America, GCC, South Africa, Rest of MEA
    KEY MARKET DYNAMICS: growing demand for AI applications, increasing data generation, need for high-quality datasets, advancements in machine learning, regulatory compliance concerns
    MARKET FORECAST UNITS: USD Billion
    KEY COMPANIES PROFILED: IBM, Facebook, Palantir Technologies, OpenAI, NVIDIA, C3.ai, Clarifai, Microsoft, DeepMind, UiPath, Element AI, Amazon, Google, H2O.ai, DataRobot
    MARKET FORECAST PERIOD: 2025 - 2035
    KEY MARKET OPPORTUNITIES: Data privacy and compliance solutions, Customized dataset services for industries, Expansion in emerging markets, Integration with cloud platforms, High-demand for diverse datasets
    COMPOUND ANNUAL GROWTH RATE (CAGR): 20.6% (2025 - 2035)
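
    The stated growth rate can be cross-checked against the 2025 and 2035 market sizes listed above. The short calculation below assumes the standard compound-growth formula, CAGR = (end / start)^(1 / years) - 1; the report itself does not say how its figure was derived.

```python
start_value = 4.62  # market size in 2025, USD billion (from the listing)
end_value = 30.0    # market size in 2035, USD billion (from the listing)
years = 2035 - 2025

# Standard compound annual growth rate formula.
cagr = (end_value / start_value) ** (1 / years) - 1
print(f"Implied CAGR: {cagr:.1%}")
```

    The result comes out at roughly 20.6%, which is consistent with the figure given in the listing.
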
  16. Heart Disease Risk Prediction Dataset

    • kaggle.com
    zip
    Updated Feb 7, 2025
    Mahatir Ahmed Tusher (2025). Heart Disease Risk Prediction Dataset [Dataset]. https://www.kaggle.com/datasets/mahatiratusher/heart-disease-risk-prediction-dataset
    Available download formats: zip (1448235 bytes)
    Authors
    Mahatir Ahmed Tusher
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Heart Disease Risk Prediction Dataset

    Overview

    This synthetic dataset is designed to predict the risk of heart disease based on a combination of symptoms, lifestyle factors, and medical history. Each row in the dataset represents a patient, with binary (Yes/No) indicators for symptoms and risk factors, along with a computed risk label indicating whether the patient is at high or low risk of developing heart disease.

    The dataset contains 70,000 samples, making it suitable for training machine learning models for classification tasks. The goal is to provide researchers, data scientists, and healthcare professionals with a clean and structured dataset to explore predictive modeling for cardiovascular health.

    This dataset is a side project of EarlyMed, developed by students of Vellore Institute of Technology (VIT-AP). EarlyMed aims to leverage data science and machine learning for early detection and prevention of chronic diseases.

    Dataset Features

    Input Features

    Symptoms (Binary - Yes/No)

    1. Chest Pain (chest_pain): Presence of chest pain, a common symptom of heart disease.
    2. Shortness of Breath (shortness_of_breath): Difficulty breathing, often associated with heart conditions.
    3. Unexplained Fatigue (fatigue): Persistent tiredness without an obvious cause.
    4. Palpitations (palpitations): Irregular or rapid heartbeat.
    5. Dizziness/Fainting (dizziness): Episodes of lightheadedness or fainting.
    6. Swelling in Legs/Ankles (swelling): Swelling due to fluid retention, often linked to heart failure.
    7. Pain in Arm/Jaw/Neck/Back (radiating_pain): Radiating pain, a hallmark of angina or heart attacks.
    8. Cold Sweats & Nausea (cold_sweats): Symptoms commonly associated with acute cardiac events.

    Risk Factors (Binary - Yes/No or Continuous)

    1. Age (age): Patient's age in years (continuous variable).
    2. High Blood Pressure (hypertension): History of hypertension (Yes/No).
    3. High Cholesterol (cholesterol_high): Elevated cholesterol levels (Yes/No).
    4. Diabetes (diabetes): Diagnosis of diabetes (Yes/No).
    5. Smoking History (smoker): Whether the patient is a smoker (Yes/No).
    6. Obesity (obesity): Obesity status (Yes/No).
    7. Family History of Heart Disease (family_history): Family history of cardiovascular conditions (Yes/No).

    Output Label

    • Heart Disease Risk (risk_label): Binary label indicating the risk of heart disease:
      • 0: Low risk
      • 1: High risk

    Data Generation Process

    This dataset was synthetically generated using Python libraries such as numpy and pandas. The generation process ensured a balanced distribution of high-risk and low-risk cases while maintaining realistic correlations between features. For example:

    • Patients with multiple risk factors (e.g., smoking, hypertension, and diabetes) were more likely to be labeled as high risk.
    • Symptom patterns were modeled after clinical guidelines and research studies on heart disease.
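
    The authors' generation script is not shown, so the following is only a guess at the general shape of such a process, written with the standard library rather than numpy and pandas. The scoring rule (count of positive indicators plus random noise), the choice of four columns, and the function name `generate_patients` are all assumptions; only the column names come from the feature list above.

```python
import random

# A subset of the binary columns named in the feature list.
FACTORS = ["hypertension", "diabetes", "smoker", "chest_pain"]


def generate_patients(n, seed=0):
    """Generate n synthetic rows with a balanced binary risk_label."""
    rng = random.Random(seed)
    rows = [{name: rng.randint(0, 1) for name in FACTORS} for _ in range(n)]
    # Hypothetical risk score: more positive indicators means higher risk,
    # with a little noise so the label is not a pure function of the count.
    scores = [sum(row.values()) + rng.random() for row in rows]
    # Label the top-scoring half as high risk so both classes are equal in size.
    order = sorted(range(n), key=lambda i: scores[i])
    high_risk = set(order[n // 2:])
    for i, row in enumerate(rows):
        row["risk_label"] = int(i in high_risk)
    return rows


patients = generate_patients(1000)
print(sum(p["risk_label"] for p in patients), "high-risk rows out of", len(patients))
```

    Ranking by score and splitting at the midpoint is one simple way to force class balance; the real dataset may have used a different mechanism.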

    Sources of Inspiration

    The design of this dataset was inspired by the following resources:

    Books

    • "Harrison's Principles of Internal Medicine" by J. Larry Jameson et al.: A comprehensive resource on cardiovascular diseases and their symptoms.
    • "Mayo Clinic Cardiology" by Joseph G. Murphy et al.: Provides insights into heart disease risk factors and diagnostic criteria.

    Research Papers

    • Framingham Heart Study: A landmark study identifying key risk factors for cardiovascular disease.
    • American Heart Association (AHA) Guidelines: Recommendations for diagnosing and managing heart disease.

    Existing Datasets

    • UCI Heart Disease Dataset: A widely used dataset for heart disease prediction.
    • Kaggle’s Heart Disease datasets: Various datasets contributed by the community.

    Clinical Guidelines

    • Centers for Disease Control and Prevention (CDC): Information on heart disease symptoms and risk factors.
    • World Health Organization (WHO): Global statistics and risk factor analysis for cardiovascular diseases.

    Applications

    This dataset can be used for a variety of purposes:

    1. Machine Learning Research:

      • Train classification models (e.g., Logistic Regression, Random Forest, XGBoost) to predict heart disease risk.
      • Experiment with feature engineering, model tuning, and evaluation metrics like Accuracy, Precision, Recall, and ROC-AUC.
    2. Healthcare Analytics:

      • Identify key risk factors contributing to heart disease.
      • Develop decision support systems for early detection of cardiovascular risks.
    3. Educational Purposes:

      • Teach students and practitioners about predictive modeling in healthcare.
      • Demonstrate the importance of feature selection...
  17. medical_records_parsing_validation_set

    • huggingface.co
    Updated Nov 14, 2025
    Eka Care (2025). medical_records_parsing_validation_set [Dataset]. https://huggingface.co/datasets/ekacare/medical_records_parsing_validation_set
    Dataset authored and provided by
    Eka Care
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Medical Records Parsing Validation Set

    Dataset Composition and Clinical Relevance

    The Eka Medical Records Parsing Dataset supports the evaluation of AI systems designed to extract structured information from unstructured medical documents, enabling true digitisation of healthcare data while maintaining clinical accuracy. The dataset comprises 288 carefully selected images of laboratory reports and prescriptions representing diverse formats and templates encountered in Indian… See the full description on the dataset page: https://huggingface.co/datasets/ekacare/medical_records_parsing_validation_set.

  18. Vietnamese Agent-Customer Chat Dataset for Healthcare Domain

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    FutureBee AI (2022). Vietnamese Agent-Customer Chat Dataset for Healthcare Domain [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/vietnamese-healthcare-domain-conversation-text-dataset
    Available download formats: wav
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    The Vietnamese Healthcare Chat Dataset is a rich collection of over 10,000 text-based conversations between customers and call center agents, focused on real-world healthcare interactions. Designed to reflect authentic language use and domain-specific dialogue patterns, this dataset supports the development of conversational AI, chatbots, and NLP models tailored for healthcare applications in Vietnamese-speaking regions.

    Participant & Chat Overview

    Participants: 150+ native Vietnamese speakers from the FutureBeeAI Crowd Community
    Conversation Length: 300–700 words per chat
    Turns per Chat: 50–150 dialogue turns across both participants
    Chat Types: Inbound and outbound
    Sentiment Coverage: Positive, neutral, and negative outcomes included

    Topic Diversity

    The dataset captures a wide spectrum of healthcare-related chat scenarios, ensuring comprehensive coverage for training robust AI systems:

    Inbound Chats (Customer-Initiated): Appointment scheduling, new patient registration, surgery and treatment consultations, diet and lifestyle discussions, insurance claim inquiries, lab result follow-ups
    Outbound Chats (Agent-Initiated): Appointment reminders and confirmations, health and wellness program offers, test result notifications, preventive care and vaccination reminders, subscription renewals, risk assessment and eligibility follow-ups

    This variety helps simulate realistic healthcare support workflows and patient-agent dynamics.

    Language Diversity & Realism

    This dataset reflects the natural flow of Vietnamese healthcare communication and includes:

    Authentic Naming Patterns: Vietnamese personal names, clinic names, and brands
    Localized Contact Elements: Addresses, emails, phone numbers, and clinic locations in regional Vietnamese formats
    Time & Currency References: Use of dates, times, numeric expressions, and currency units aligned with Vietnamese-speaking regions
    Colloquial & Medical Expressions: Local slang, informal speech, and common healthcare-related terminology

    These elements ensure the dataset is contextually relevant and linguistically rich for real-world use cases.

    Conversational Flow & Structure

    Conversations range from simple inquiries to complex advisory sessions, including:

    General inquiries
    Detailed problem-solving
    Routine status updates
    Treatment recommendations
    Support and feedback interactions

    Each conversation typically includes these structural components:

    Greetings and verification
    Information gathering
    Problem definition
    Solution delivery
    Closing messages
    Follow-up and feedback (where applicable)

    This structured flow mirrors actual healthcare support conversations and is ideal for training advanced dialogue systems.

    Data Format & Structure

    Available in JSON, CSV, and TXT formats, each conversation includes:

    Full message history with clear speaker labels
    Participant identifiers
    Metadata (e.g., topic tags, region, sentiment)
    Compatibility with common NLP and ML pipelines

  19. Dataset from structural health monitoring of a steel bridge in Sweden

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    Updated Aug 30, 2023
    Leander, John; Nyman, Jacob; Karoumi, Raid; Rosengren, Peter; Johansson, Gunnar (2023). Dataset from structural health monitoring of a steel bridge in Sweden [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8300494
    Dataset provided by
    KTH Royal Institute of Technology
    IoTBridge AB
    Authors
    Leander, John; Nyman, Jacob; Karoumi, Raid; Rosengren, Peter; Johansson, Gunnar
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Sweden
    Description

    The dataset consists of sensor data from the Vänersborg Bridge in Sweden, comprising 64 bridge opening events registered as accelerations, strains, inclinations, and weather conditions. Data from before, during, and after a verified fracture are provided. The sample of raw data captured the same bridge opening event multiple times over the monitoring duration. Classifying data before and after damage enables the development and verification of routines for novelty detection.
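
    One very simple form of novelty detection is to learn the mean and spread of a signal from pre-damage data and flag later readings that deviate strongly. The sketch below uses a z-score threshold for this. It is not the method used by the dataset's authors, and the strain values, the threshold of 3.0, and the function name `novelty_flags` are made up for illustration.

```python
from statistics import mean, stdev


def novelty_flags(baseline, new_values, threshold=3.0):
    """Flag values lying more than `threshold` standard deviations from the baseline mean."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    return [abs(x - mu) / sigma > threshold for x in new_values]


# Hypothetical peak-strain readings from openings before the fracture.
baseline_strain = [100.2, 99.8, 100.5, 99.6, 100.1, 99.9]
# Hypothetical readings from later openings.
later_strain = [100.3, 104.0]

flags = novelty_flags(baseline_strain, later_strain)
print(flags)
```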

  20. Synthetic Dataset of Emergency Healthcare Services

    • figshare.com
    csv
    Updated Dec 12, 2024
    Marco Ferreira (2024). Synthetic Dataset of Emergency Healthcare Services [Dataset]. http://doi.org/10.6084/m9.figshare.28012784.v1
    Available download formats: csv
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Marco Ferreira
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset was generated using Simio simulation software. The simulations model patient flow in healthcare settings, capturing key metrics such as queue times, length of stay (LOS) for patients, and nurse utilization rates. Each CSV file contains time-series data, with measured variables including patient waiting times, resource utilization percentages, and service durations.

    File Overview

    • CheckBloodPressure.csv (9 KB): Contains blood pressure Server records of patients.
    • CheckPatientType.csv (19 KB): Identifies the type of each patient (e.g., 1 or 3).
    • Fill_Information.csv (2 KB): Fill information records for new patients.
    • MedicalRecord1.csv (10 KB): Medical record dataset for patient type 1.
    • MedicalRecord2.csv (4 KB): Medical record dataset for patient type 2.
    • MedicalRecord3.csv (2 KB): Medical record dataset for patient type 3.
    • MedicalRecord4.csv (13 KB): Medical record dataset for patient type 4.
    • OutPatientDepartment.csv (18 KB): Data related to the satisfaction and length of stay of a given patient.
    • Triage.csv (13 KB): Data related to the triage process.
    • README.txt (4 KB): Documentation of the dataset, including structure, metadata, and usage.

    Common Fields Across Files

    • Patient ID (Integer): Unique identifier for each patient.
    • Patient Type (Integer): Classification of patient (e.g., 1, 4).
    • Medical Records Arrival Time (DateTime): Timestamp of the patient's first arrival in the medical record department.
    • Exiting Time (DateTime): Timestamp when the patient exits a Server.
    • Waiting Time (min) (Real): Total waiting time before being attended to.
    • Resource Used (String): Resource (e.g., Operator) allocated to the patient.
    • Utilization % (Real): Utilization rate of the resource as a percentage.
    • Queue Count Before Processing (Integer): Number of patients in the queue before processing begins.
    • Queue Count After Processing (Integer): Number of patients in the queue after processing ends.
    • Queue Difference (Integer): Difference between the before and after queue counts.
    • Length of Stay (min) (Real): Total time spent in the simulation by the patient.
    • LOS without Queues (min) (Real): Length of stay excluding any queuing time.
    • Satisfaction % (Real): Patient satisfaction rating based on their experience.
    • New Patient? (String): Indicates if this is a new patient or a returning one.
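
    The derived fields in this list can be recomputed from the raw columns as a consistency check. The sketch below makes several assumptions that the description does not confirm: that these columns appear together in one file, that Queue Difference means "before minus after", and that LOS without Queues equals Length of Stay minus Waiting Time. The two sample rows are invented.

```python
import csv
import io

# Invented sample rows using the column names from the field list.
sample = io.StringIO(
    "Patient ID,Queue Count Before Processing,Queue Count After Processing,"
    "Length of Stay (min),Waiting Time (min)\n"
    "1,5,3,42.0,12.5\n"
    "2,3,4,30.0,0.0\n"
)

derived = []
for record in csv.DictReader(sample):
    before = int(record["Queue Count Before Processing"])
    after = int(record["Queue Count After Processing"])
    los = float(record["Length of Stay (min)"])
    wait = float(record["Waiting Time (min)"])
    derived.append({
        "patient_id": int(record["Patient ID"]),
        # Assumed sign convention: before minus after.
        "queue_difference": before - after,
        # Assumed definition: total stay minus time spent waiting.
        "los_without_queues": los - wait,
    })

for row in derived:
    print(row)
```

    To run this against a real file, replace `sample` with an opened CSV file from the dataset and adjust the column names to match its header.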

Sagar Maru (2025). Identifying Diseases Treatments in Healthcare Data [Dataset]. https://www.kaggle.com/datasets/marusagar/identifying-diseases-treatments-in-healthcare-data

Identifying Diseases Treatments in Healthcare Data

Identifying Entities (Diseases and Treatments) in Healthcare Dataset

Available download formats: zip (166655 bytes)
Dataset updated: Mar 5, 2025
Authors
Sagar Maru
Description

Identifying Entities (Diseases, Treatments) in Healthcare Data

Finding diseases and treatments in medical text—because even AI needs a medical degree to understand doctor’s notes! 🩺🤖

📊 Understanding the Dataset

In today's healthcare ecosystem, large amounts of unstructured text are generated every day through electronic health records (EHRs), doctors' notes, prescriptions, and medical literature. The ability to extract meaningful insights from this data is critical for improving patient care, advancing clinical research, and optimizing healthcare services. This dataset contains text-based medical records in which diseases and their corresponding treatments are embedded in unstructured sentences.

The dataset consists of labeled text samples, divided into:

  • Train Sentences: Sentences containing clinical records, including patient diagnoses and the treatments administered.
  • Train Labels: The corresponding annotations for the train sentences, marking diseases and treatments as named entities.
  • Test Sentences: Similar to the train sentences, but used to evaluate model performance.
  • Test Labels: The ground-truth labels for the test sentences.

A sample from the dataset may look as follows:

🔍 Example from Dataset:

Train Sentences:

"The patient was a 62-year-old man with squamous epithelium, who was previously treated with success with a combination of radiation therapy and chemotherapy."

Train Labels:

  • Disease: 🦠 lung cancer
  • Treatment: 💉 Radiation therapy, chemotherapy

This dataset calls for Named Entity Recognition (NER) to extract diseases and map them to their related treatments 💊, turning unstructured medical text into structured data for analysis.

⚙️ Dataset Properties

  1. Unstructured medical text: The dataset contains free-form medical notes in which diseases and treatments are mentioned in running text. Extracting this information without an explicit mapping is a challenge.
  2. Multiple entity types: The dataset contains different named entities such as diseases, treatments, symptoms, and possibly medications.
  3. Contextual dependency: Many treatments apply to multiple diseases, and the correct mapping depends on context. For example, "radiotherapy" is used for different cancers, which makes contextual understanding important.
  4. Imbalanced data distribution: Some diseases and treatments may appear more often than others; balancing model performance requires techniques such as oversampling, undersampling, or transfer learning.
  5. Domain-specific language: The text is rich in medical terminology, which requires specialized preprocessing using domain-specific NLP techniques and medical ontologies such as UMLS or SNOMED CT.

🚧 Challenges Working with Dataset

  • Complex medical vocabulary: Medical texts often use jargon, which requires specialized NLP models trained on clinical corpora.

  • Implicit Relationships: Unlike in structured datasets, disease-treatment relationships are inferred from context rather than explicitly stated.

  • Synonyms and Abbreviations: Diseases and treatments can be referred to by different names (e.g., 'myocardial infarction' vs. 'heart attack'). Handling such variations is vital.

  • Noise in Data: Unstructured records may also contain irrelevant information, typographical errors, and inconsistencies that affect extraction accuracy.

🛠️ Approach to Extracting Insights from the Dataset

To extract diseases and their respective treatments from this dataset, we follow a structured NLP pipeline:

1. Data Preprocessing 🧹

  • Text Cleaning: Remove unnecessary characters, numbers, and stopwords while preserving clinical terms.
  • Tokenization: Split sentences into words for further processing.
  • Medical Term Standardization: Use domain-specific libraries like SciSpacy to standardize synonyms and abbreviations.

2. Named Entity Recognition (NER) Model Development 🤖

  • Annotation: Ensure accurate labeling of diseases and treatments in the dataset.
  • Model Selection: Train a deep-learning-based model like BioBERT or a rule-based model using spaCy.
  • Training: Use annotated data to train a custom NER model that classifies words as disease or treatment entities.
  • Evaluation: Measure precision, recall, and F1-score to evaluate model performance.

3. Mapping Diseases to Treatments 🔄

  • Contextual Relationship Extraction: Identify which treatment corresponds to which disease using dependency parsing and relation extraction.
  • Dictionary or Tabular Output: Store extracted mappings in a structured format.

Example Output:

| 🦠 Disease | 💉 Treatments |
|----------|--------------------...
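
A dictionary output of the kind described in step 3 could be assembled as below. As a deliberately naive stand-in for dependency parsing and relation extraction, this sketch links every disease to every treatment found in the same sentence. The function name and the (label, text) input format are assumptions, and the example entities follow the sample labels shown earlier.

```python
from collections import defaultdict


def map_diseases_to_treatments(sentences):
    """Link each disease to all treatments extracted from the same sentence."""
    mapping = defaultdict(set)
    for entities in sentences:
        diseases = [text for label, text in entities if label == "disease"]
        treatments = [text for label, text in entities if label == "treatment"]
        for disease in diseases:
            mapping[disease].update(treatments)
    # Sort treatments so the output is stable across runs.
    return {disease: sorted(found) for disease, found in mapping.items()}


# One list of (label, text) entities per sentence.
extracted = [
    [
        ("disease", "lung cancer"),
        ("treatment", "radiation therapy"),
        ("treatment", "chemotherapy"),
    ],
]

result = map_diseases_to_treatments(extracted)
print(result)
```

Sentence-level co-occurrence will over-link when a sentence mentions several diseases, which is where the dependency-based relation extraction named above would be needed.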
