100+ datasets found
  1. An Extensive Dataset for the Heart Disease Classification System

    • data.mendeley.com
    Updated Feb 15, 2022
    + more versions
    Cite
    Sozan S. Maghdid (2022). An Extensive Dataset for the Heart Disease Classification System [Dataset]. http://doi.org/10.17632/65gxgy2nmg.1
    Dataset updated
    Feb 15, 2022
    Authors
    Sozan S. Maghdid
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Finding a good data source is the first step toward creating a database. Cardiovascular diseases (CVDs) are the leading cause of death worldwide. CVDs include coronary heart disease, cerebrovascular disease, rheumatic heart disease, and other heart and blood vessel problems. According to the World Health Organization, 17.9 million people die from CVDs each year. Heart attacks and strokes account for more than four out of every five CVD deaths, with one-third of these deaths occurring before the age of 70.

    A comprehensive database of factors that contribute to a heart attack has been constructed. The main purpose is to collect the characteristics of heart attacks, or the factors that contribute to them. A form was created in Microsoft Excel for this purpose. Figure 1 depicts the form, which has nine fields: eight input fields and one output field. The input fields are age, gender, heart rate, systolic BP, diastolic BP, blood sugar, CK-MB, and Test-Troponin. The output field indicates the presence of a heart attack and has two categories: negative (no heart attack) and positive (heart attack present). Table 1 shows detailed information, including the maximum and minimum attribute values, for the 1319 cases in the whole database.

    To confirm the validity of the data, we checked the patient files in the hospital archive and compared them with the data stored in the laboratory system. We also interviewed the patients and specialist doctors. Table 2 shows a sample of 44 cases, out of 1320 in the whole database, together with the factors that lead to a heart attack.

    After collecting the data, we checked it for null (invalid) values and for errors made during data collection. A value is null if it is unknown, and null values require special treatment: null indicates that the target is not a valid data element. When trying to retrieve data that isn't present, you can come across the keyword null in Processing. Arithmetic on a numeric column that contains one or more null values yields null. An example of null value processing is shown in Figure 2.

    The data used in this investigation were scaled between 0 and 1 to guarantee that all inputs and outputs received equal attention and to eliminate their dimensionality. Normalizing data before applying AI models has two major advantages. The first is to prevent attributes in larger numeric ranges from overshadowing attributes in smaller numeric ranges. The second is to avoid numerical problems during the process. After normalization, we split the dataset into two parts, a training set and a test set: 1060 cases were used for training and 259 for testing. Modeling was implemented using the input and output variables.
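    The preprocessing steps described above (null check, scaling to the range 0 to 1, and a 1060/259 split) can be sketched roughly as follows. This is a minimal illustration, not the author's code: the function names, the choice to drop rows containing nulls, and the shuffled split with a fixed seed are all assumptions, since the description does not say how nulls were treated or how cases were assigned to each set.

```python
import random


def drop_null_rows(rows):
    """Keep only rows with no None (null) entries. Dropping is an assumed policy."""
    return [row for row in rows if all(value is not None for value in row)]


def min_max_scale(column):
    """Scale numbers to [0, 1] with (x - min) / (max - min); None stays None."""
    present = [value for value in column if value is not None]
    lo, hi = min(present), max(present)
    if hi == lo:
        # A constant column carries no range information; map it to 0.0.
        return [None if value is None else 0.0 for value in column]
    return [None if value is None else (value - lo) / (hi - lo) for value in column]


def split_rows(rows, n_train, seed=0):
    """Shuffle a copy of the rows and cut it into training and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


if __name__ == "__main__":
    # Placeholder rows standing in for the 1319 cases (values are made up).
    cases = [(age, age % 2) for age in range(1319)]
    train, test = split_rows(drop_null_rows(cases), n_train=1060)
    print(len(train), len(test))
    print(min_max_scale([10, 20, 30]))
```

    Scaling each column separately keeps wide-range attributes from dominating narrow-range ones, which is the first advantage named in the description.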

  2. AI medical chatbot

    • kaggle.com
    Updated Aug 15, 2024
    Cite
    Yousef Saeedian (2024). AI medical chatbot [Dataset]. https://www.kaggle.com/datasets/yousefsaeedian/ai-medical-chatbot
    Dataset updated
    Aug 15, 2024
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    Yousef Saeedian
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Description:

    This dataset comprises transcriptions of conversations between doctors and patients, providing valuable insights into the dynamics of medical consultations. It includes a wide range of interactions, covering various medical conditions, patient concerns, and treatment discussions. The data is structured to capture both the questions and concerns raised by patients, as well as the medical advice, diagnoses, and explanations provided by doctors.

    Key Features:

    • Doctor and Patient Roles: Each conversation is annotated with the role of the speaker (doctor or patient), making it easy to analyze communication patterns.
    • Medical Context: The dataset includes diverse scenarios, from routine check-ups to more complex medical discussions, offering a broad spectrum of healthcare dialogues.
    • Natural Language: The conversations are presented in natural language, allowing for the development and testing of NLP models focused on healthcare communication.
    • Applications: This dataset can be used for various applications, such as building dialogue systems, analyzing communication efficacy, developing medical NLP models, and enhancing patient care through better understanding of doctor-patient interactions.

    Potential Use Cases:

    • NLP Model Training: Train models to understand and generate medical dialogues.
    • Healthcare Communication Studies: Analyze communication strategies between doctors and patients to improve healthcare delivery.
    • Medical Chatbots: Develop intelligent medical chatbots that can simulate doctor-patient conversations.
    • Patient Experience Enhancement: Identify common patient concerns and doctor responses to enhance patient care strategies.

    This dataset is a valuable resource for researchers, data scientists, and healthcare professionals interested in the intersection of technology and medicine, aiming to improve healthcare communication through data-driven approaches.

  3. Public Health Portfolio dataset

    • nihr.opendatasoft.com
    • nihr.aws-ec2-eu-central-1.opendatasoft.com
    csv, excel, json
    Updated May 29, 2025
    Cite
    (2025). Public Health Portfolio dataset [Dataset]. https://nihr.opendatasoft.com/explore/dataset/phof-datase/
    Available download formats: excel, json, csv
    Dataset updated
    May 29, 2025
    License

    Open Government Licence 3.0: http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
    License information was derived automatically

    Description

    The NIHR is one of the main funders of public health research in the UK. Public health research falls within the remit of a range of NIHR Research Programmes, NIHR Centres of Excellence and Facilities, plus the NIHR Academy. NIHR awards from all NIHR Research Programmes and the NIHR Academy that were funded between January 2006 and the present extraction date are eligible for inclusion in this dataset. Agreed inclusion/exclusion criteria are used to categorise awards as public health awards (see below). Following inclusion in the dataset, public health awards are second level coded to one of the four Public Health Outcomes Framework domains. These domains are: (1) wider determinants, (2) health improvement, (3) health protection, and (4) healthcare and premature mortality. More information on the Public Health Outcomes Framework domains can be found here.

    This dataset is updated quarterly to include new NIHR awards categorised as public health awards. Please note that for those Public Health Research Programme projects showing an Award Budget of £0.00, the project is undertaken by an on-call team (for example, PHIRST, the Public Health Review Team, or the Knowledge Mobilisation Team) as part of an ongoing programme of work.

    Inclusion criteria

    The NIHR Public Health Overview project team worked with colleagues across NIHR public health research to define the inclusion criteria for NIHR public health research awards. NIHR awards are categorised as public health awards if they are determined to be ‘investigations of interventions in, or studies of, populations that are anticipated to have an effect on health or on health inequity at a population level.’ This definition of public health is intentionally broad to capture the wide range of NIHR public health awards across prevention, health improvement, health protection, and healthcare services (both within and outside of NHS settings).

    This dataset does not reflect the NIHR’s total investment in public health research. The intention is to showcase a subset of the wider NIHR public health portfolio. This dataset includes NIHR awards categorised as public health awards from NIHR Research Programmes and the NIHR Academy. This dataset does not currently include public health awards or projects funded by any of the three NIHR Research Schools or any of the NIHR Centres of Excellence and Facilities. Therefore, awards from the NIHR Schools for Public Health, Primary Care and Social Care, the NIHR Public Health Policy Research Unit and the NIHR Health Protection Research Units do not feature in this curated portfolio.

    Disclaimers

    Users of this dataset should acknowledge the broad definition of public health that has been used to develop the inclusion criteria for this dataset. This caveat applies to all data within the dataset irrespective of the funding NIHR Research Programme or NIHR Academy award. Please note that this dataset is currently subject to a limited data quality review. We are working to improve our data collection methodologies. Please also note that some awards may also appear in other NIHR curated datasets.

    Further information

    Further information on the individual awards shown in the dataset can be found on the NIHR’s Funding & Awards website here. Further information on individual NIHR Research Programmes’ decision-making processes for funding health and social care research can be found here. Further information on NIHR’s investment in public health research can be found as follows: NIHR School for Public Health here. NIHR Public Health Policy Research Unit here. NIHR Health Protection Research Units here. NIHR Public Health Research Programme Health Determinants Research Collaborations (HDRC) here. NIHR Public Health Research Programme Public Health Intervention Responsive Studies Teams (PHIRST) here.

  4. HHS IDs

    • healthdata.gov
    • data.virginia.gov
    • +5 more
    Updated May 3, 2024
    + more versions
    Cite
    (2024). HHS IDs [Dataset]. https://healthdata.gov/Hospital/HHS-IDs/vz64-k9wr
    Available download formats: xml, csv, application/rdfxml, application/rssxml, tsv, kmz, kml, application/geo+json
    Dataset updated
    May 3, 2024
    License

    https://www.usa.gov/government-works

    Description

    After May 3, 2024, this dataset and webpage will no longer be updated because hospitals are no longer required to report data on COVID-19 hospital admissions, and hospital capacity and occupancy data, to HHS through CDC’s National Healthcare Safety Network. Data voluntarily reported to NHSN after May 1, 2024, will be available starting May 10, 2024, at COVID Data Tracker Hospitalizations.

    This file helps define the HHS_ID column that is published in both the

    'COVID-19 Reported Patient Impact and Hospital Capacity by Facility' found here: https://healthdata.gov/Hospital/COVID-19-Reported-Patient-Impact-and-Hospital-Capa/anag-cw7u

    'COVID-19 Reported Patient Impact and Hospital Capacity by Facility -- RAW' found here: https://healthdata.gov/Hospital/COVID-19-Reported-Patient-Impact-and-Hospital-Capa/uqq2-txqb

    As a part of an effort to improve the granularity of spatial data, unique identifiers (named “HHS IDs” in the datasets) have been assigned to each individual facility. These unique identifiers are provided so data users can reference each individual “brick and mortar” facility that is reporting data to HHS, even in cases when multiple facilities report under the same CMS Certification Number (CCN). Additional datasets and further details related to HHS IDs will be released at a later date.

    With this file, you can associate the reporting facility with its physical location(s).
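    A minimal sketch of that association, assuming both files have been read into lists of dictionaries. Only the HHS ID and CCN concepts come from the description; the key names (`hhs_id`, `ccn`, `address`, `inpatient_beds`) and all sample values are hypothetical placeholders, not the published column names.

```python
def attach_locations(reports, id_rows):
    """Add a location to each report row by looking up its HHS ID.

    `reports` and `id_rows` are lists of dicts; key names are hypothetical.
    Rows whose HHS ID is missing from the lookup get address None.
    """
    location_by_id = {row["hhs_id"]: row["address"] for row in id_rows}
    return [
        {**report, "address": location_by_id.get(report["hhs_id"])}
        for report in reports
    ]


if __name__ == "__main__":
    # Two facilities reporting under one CCN, told apart by HHS ID (made-up values).
    id_rows = [
        {"hhs_id": "ID-0001", "ccn": "000001", "address": "1 Main St"},
        {"hhs_id": "ID-0002", "ccn": "000001", "address": "9 Oak Ave"},
    ]
    reports = [
        {"hhs_id": "ID-0001", "ccn": "000001", "inpatient_beds": 120},
        {"hhs_id": "ID-0002", "ccn": "000001", "inpatient_beds": 45},
    ]
    for row in attach_locations(reports, id_rows):
        print(row["hhs_id"], row["address"])
```

    Keying the lookup on the HHS ID rather than the CCN is what keeps the two facilities distinct, since they share a CCN.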

    As of October 8, 2021, this file includes the HHS IDs for Psychiatric, Rehabilitation and Behavioral hospitals, as well as Ambulatory Surgical Centers and Free Standing Emergency departments, wherever these institutions are reporting under https://www.hhs.gov/sites/default/files/covid-19-faqs-hospitals-hospital-laboratory-acute-care-facility-data-reporting.pdf

    Starting on January 6, 2023, this dataset will no longer be posted on weekends.

  5. Millennium Cohort Study: Linked Health Administrative Data (Scottish Medical...

    • beta.ukdataservice.ac.uk
    Updated 2025
    + more versions
    Cite
    UCL Institute Of Education University College London (2025). Millennium Cohort Study: Linked Health Administrative Data (Scottish Medical Records), Child Health Reviews, 2000-2015: Secure Access [Dataset]. http://doi.org/10.5255/ukda-sn-8709-1
    Dataset updated
    2025
    Dataset provided by
    UK Data Service: https://ukdataservice.ac.uk/
    datacite
    Authors
    UCL Institute Of Education University College London
    Area covered
    Scotland
    Description

    Background:
    The Millennium Cohort Study (MCS) is a large-scale, multi-purpose longitudinal dataset providing information about babies born at the beginning of the 21st century, their progress through life, and the families who are bringing them up, for the four countries of the United Kingdom. The original objectives of the first MCS survey, as laid down in the proposal to the Economic and Social Research Council (ESRC) in March 2000, were:

    • to chart the initial conditions of social, economic and health advantages and disadvantages facing children born at the start of the 21st century, capturing information that the research community of the future will require
    • to provide a basis for comparing patterns of development with the preceding cohorts (the National Child Development Study, held at the UK Data Archive under GN 33004, and the 1970 Birth Cohort Study, held under GN 33229)
    • to collect information on previously neglected topics, such as fathers' involvement in children's care and development
    • to focus on parents as the most immediate elements of the children's 'background', charting their experience as mothers and fathers of newborn babies in the year 2000, recording how they (and any other children in the family) adapted to the newcomer, and what their aspirations for her/his future may be
    • to emphasise intergenerational links including those back to the parents' own childhood
    • to investigate the wider social ecology of the family, including social networks, civic engagement and community facilities and services, splicing in geo-coded data when available
    Additional objectives subsequently included for MCS were:
    • to provide control cases for the national evaluation of Sure Start (a government programme intended to alleviate child poverty and social exclusion)
    • to provide samples of adequate size to analyse and compare the smaller countries of the United Kingdom, and include disadvantaged areas of England

    Further information about the MCS can be found on the Centre for Longitudinal Studies web pages.

    The content of MCS studies, including questions, topics and variables can be explored via the CLOSER Discovery website.

    The first sweep (MCS1) interviewed both mothers and (where resident) fathers (or father-figures) of infants included in the sample when the babies were nine months old, and the second sweep (MCS2) was carried out with the same respondents when the children were three years of age. The third sweep (MCS3) was conducted in 2006, when the children were aged five years old, the fourth sweep (MCS4) in 2008, when they were seven years old, the fifth sweep (MCS5) in 2012-2013, when they were eleven years old, the sixth sweep (MCS6) in 2015, when they were fourteen years old, and the seventh sweep (MCS7) in 2018, when they were seventeen years old.

    End User Licence versions of MCS studies:
    The End User Licence (EUL) versions of MCS1, MCS2, MCS3, MCS4, MCS5, MCS6 and MCS7 are held under UK Data Archive SNs 4683, 5350, 5795, 6411, 7464, 8156 and 8682 respectively. The longitudinal family file is held under SN 8172.

    Sub-sample studies:
    Some studies based on sub-samples of MCS have also been conducted, including a study of MCS respondent mothers who had received assisted fertility treatment, conducted in 2003 (see EUL SN 5559). Also, birth registration and maternity hospital episodes for the MCS respondents are held as a separate dataset (see EUL SN 5614).

    Release of Sweeps 1 to 4 to Long Format (Summer 2020)
    To support longitudinal research and make it easier to compare data from different time points, all data from across all sweeps is now in a consistent format. The update affects the data from sweeps 1 to 4 (from 9 months to 7 years), which are updated from the old/wide to a new/long format to match the format of data of sweeps 5 and 6 (age 11 and 14 sweeps). The old/wide formatted datasets contained one row per family with multiple variables for different respondents. The new/long formatted datasets contain one row per respondent (per parent or per cohort member) for each MCS family. Additional updates have been made to all sweeps to harmonise variable labels and enhance anonymisation.
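    The wide-to-long change described above can be illustrated with a small sketch. It assumes a wide row is a dictionary with one entry per family and respondent-prefixed variables; the key names (`family_id`, `mother_age`, `father_age`) are hypothetical and are not actual MCS variable names.

```python
def wide_to_long(family_rows, respondents, variables):
    """Turn one-row-per-family records into one-row-per-respondent records.

    A wide row holds keys such as "mother_age"; the long output holds a
    "respondent" key plus the unprefixed variable. Respondents with no
    values at all in a family are skipped.
    """
    long_rows = []
    for family in family_rows:
        for respondent in respondents:
            values = {var: family.get(f"{respondent}_{var}") for var in variables}
            if all(value is None for value in values.values()):
                continue  # this respondent is absent from the family
            long_rows.append(
                {"family_id": family["family_id"], "respondent": respondent, **values}
            )
    return long_rows


if __name__ == "__main__":
    wide = [
        {"family_id": 1, "mother_age": 31, "father_age": 33},
        {"family_id": 2, "mother_age": 28},
    ]
    for row in wide_to_long(wide, ["mother", "father"], ["age"]):
        print(row)
```

    In the long layout, a variable has the same name for every respondent, which makes comparisons across sweeps and respondents simpler than with prefixed wide columns.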

    How to access genetic and/or bio-medical sample data from a range of longitudinal surveys:
    For information on how to access biomedical data from MCS that are not held at the UKDS, see the CLS Genetic data and biological samples webpage.

    Secure Access datasets:
    Secure Access versions of the MCS have more restrictive access conditions than versions available under the standard End User Licence or Special Licence (see 'Access data' tab above).

    Secure Access versions of the MCS include:
    • detailed sensitive variables not available under EUL. These have been grouped thematically and are held under SN 8753 (socio-economic, accommodation and occupational data), SN 8754 (self-reported health, behaviour and fertility), SN 8755 (demographics, language and religion) and SN 8756 (exact participation dates). These files replace previously available studies held under SNs 8456 and 8622-8627
    • detailed geographical identifier files which are grouped by sweep held under SN 7758 (MCS1), SN 7759 (MCS2), SN 7760 (MCS3), SN 7761 (MCS4), SN 7762 (MCS5 2001 Census Boundaries), SN 7763 (MCS5 2011 Census Boundaries), SN 8231 (MCS6 2001 Census Boundaries), SN 8232 (MCS6 2011 Census Boundaries), SN 8757 (MCS7), SN 8758 (MCS7 2001 Census Boundaries) and SN 8759 (MCS7 2011 Census Boundaries). These files replace previously available files grouped by geography SN 7049 (Ward level), SN 7050 (Lower Super Output Area level), and SN 7051 (Output Area level)
    • linked education administrative datasets for Key Stages 1, 2, 4 and 5 held under SN 8481 (England). This replaces previously available datasets for Key Stage 1 (SN 6862) and Key Stage 2 (SN 7712)
    • linked education administrative datasets for Key Stage 1 held under SN 7414 (Scotland)
    • linked education administrative dataset for Key Stages 1, 2, 3 and 4 under SN 9085 (Wales)
    • linked NHS Patient Episode Database for Wales (PEDW) for MCS1 – MCS5 held under SN 8302
    • linked Scottish Medical Records data held under SNs 8709, 8710, 8711, 8712, 8713 and 8714;
    • Banded Distances to English Grammar Schools for MCS5 held under SN 8394
    • linked Health Administrative Datasets (Hospital Episode Statistics) for England for years 2000-2019 held under SN 9030
    • linked Health Administrative Datasets (SAIL) for Wales held under SN 9310
    • linked Hospital of Birth data held under SN 5724.
    The linked education administrative datasets held under SNs 8481, 7414 and 9085 may be ordered alongside the MCS detailed geographical identifier files only if sufficient justification is provided in the application.

    Researchers applying for access to the Secure Access MCS datasets should indicate on their ESRC Accredited Researcher application form the EUL dataset(s) that they also wish to access (selected from the MCS Series Access web page).

    The Millennium Cohort Study: Linked Health Administrative Data (Scottish Medical Records), Child Health Reviews, 2000-2015: Secure Access includes data files from the NHS Digital Hospital Episode Statistics database for those cohort members who provided consent to health data linkage in the Age 50 sweep, and had ever lived in Scotland. The Scottish Medical Records database contains information about all hospital admissions in Scotland. This study concerns the Child Health Reviews (CHR) from first visit to school reviews.

    Other datasets are available from the Scottish Medical Records database; these include:

    • Prescribing Information System (PIS) held under SN 8710
    • Scottish Immunisation and Recall System (SIRS) held under SN 8711
    • Scottish Birth Records (SMR11) held under SN 8712
    • Inpatient and Day Care Attendance (SMR01) held under SN 8713
    • Outpatient Attendance (SMR00) held under SN 8714

    Users

  6. Open Database of Healthcare Facilities

    • open.canada.ca
    • catalogue.arctic-sdi.org
    • +1 more
    csv, esri rest +4
    Updated Mar 2, 2022
    + more versions
    Cite
    Statistics Canada (2022). Open Database of Healthcare Facilities [Dataset]. https://open.canada.ca/data/en/dataset/a1bcd4ee-8e57-499b-9c6f-94f6902fdf32
    Available download formats: fgdb/gdb, esri rest, csv, html, pdf, wms
    Dataset updated
    Mar 2, 2022
    Dataset provided by
    Statistics Canada
    License

    Open Government Licence - Canada 2.0: https://open.canada.ca/en/open-government-licence-canada
    License information was derived automatically

    Description

    The Open Database of Healthcare Facilities (ODHF) is a collection of open data containing the names, types, and locations of health facilities across Canada. It is released under the Open Government License - Canada. The ODHF compiles open, publicly available, and directly-provided data on health facilities across Canada. Data sources include regional health authorities, provincial, territorial and municipal governments, and public health and professional healthcare bodies. This database aims to provide enhanced access to a harmonized listing of health facilities across Canada by making them available as open data. This database is a component of the Linkable Open Data Environment (LODE).

  7. HCPCS Level II

    • kaggle.com
    zip
    Updated Feb 12, 2019
    Cite
    Centers for Medicare & Medicaid Services (2019). HCPCS Level II [Dataset]. https://www.kaggle.com/datasets/cms/cms-codes
    Available download formats: zip (0 bytes)
    Dataset updated
    Feb 12, 2019
    Dataset authored and provided by
    Centers for Medicare & Medicaid Services
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    The Healthcare Common Procedure Coding System (HCPCS, often pronounced by its acronym as "hick picks") is a set of health care procedure codes based on the American Medical Association's Current Procedural Terminology (CPT).

    HCPCS includes three levels of codes: Level I consists of the American Medical Association's Current Procedural Terminology (CPT) and is numeric. Level II codes are alphanumeric and primarily include non-physician services such as ambulance services and prosthetic devices, and represent items and supplies and non-physician services, not covered by CPT-4 codes (Level I). Level III codes, also called local codes, were developed by state Medicaid agencies, Medicare contractors, and private insurers for use in specific programs and jurisdictions. The Health Insurance Portability and Accountability Act of 1996 (HIPAA) instructed CMS to adopt standard coding systems for reporting medical transactions. The use of Level III codes was discontinued on December 31, 2003, in order to adhere to consistent coding standards.
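    The numeric versus alphanumeric distinction can be sketched as a simple format check. The exact patterns are assumptions that go beyond the description: Level I codes are taken to be five digits and Level II codes one letter followed by four digits. CPT codes that carry a letter suffix are not handled here and come back as "unknown".

```python
import re

# Assumed formats, not an official validation rule.
LEVEL_I_PATTERN = re.compile(r"\d{5}")       # numeric, e.g. "99213"
LEVEL_II_PATTERN = re.compile(r"[A-Z]\d{4}")  # letter plus digits, e.g. "A0428"


def hcpcs_level(code):
    """Guess the HCPCS level of a code from its format alone."""
    normalized = code.strip().upper()
    if LEVEL_I_PATTERN.fullmatch(normalized):
        return "Level I"
    if LEVEL_II_PATTERN.fullmatch(normalized):
        return "Level II"
    return "unknown"


if __name__ == "__main__":
    for sample in ["99213", "A0428", "e0100", "12AB"]:
        print(sample, hcpcs_level(sample))
```

    A format check like this only separates the two levels; whether a code is actually valid still has to be confirmed against the published code list.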

    Content

    Classification of procedures performed for patients is important for billing and reimbursement in healthcare. The primary classification system used in the United States is the Healthcare Common Procedure Coding System (HCPCS), maintained by the Centers for Medicare and Medicaid Services (CMS). This system is divided into two levels: level I and level II.

    Level I HCPCS codes classify services rendered by physicians. This system is based on Current Procedural Terminology (CPT), a coding system maintained by the American Medical Association (AMA). Level II codes, which are the focus of this public dataset, are used to identify products, supplies, and services not included in level I codes. The level II codes include items such as ambulance services, durable medical goods, prosthetics, orthotics and supplies used outside a physician’s office.

    Given the ubiquity of administrative data in healthcare, HCPCS coding systems are also commonly used in areas of clinical research such as outcomes based research.

    Update Frequency: Yearly

    Fork this kernel to get started.

    Acknowledgements

    https://bigquery.cloud.google.com/table/bigquery-public-data:cms_codes.hcpcs

    https://cloud.google.com/bigquery/public-data/hcpcs-level2

    Dataset Source: Center for Medicare and Medicaid Services. This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source - http://www.data.gov/privacy-policy#data_policy — and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.

    Banner Photo by @rawpixel from Unsplash.

    Inspiration

    What are the descriptions for a set of HCPCS level II codes?

  8. MHS Dashboard Children and Youth Demographic Datasets

    • data.chhs.ca.gov
    • data.ca.gov
    • +1 more
    csv, zip
    Updated Aug 28, 2024
    + more versions
    Cite
    Department of Health Care Services (2024). MHS Dashboard Children and Youth Demographic Datasets [Dataset]. https://data.chhs.ca.gov/dataset/child-youth-ab470-datasets
    Available download formats: csv(32085), csv(268395), csv(35041649), csv(270327), csv(191127), csv(44757018), csv(430905), csv(374496), csv(1358269), csv(2298761), csv(18869990), csv(116973), csv(11599), csv(1324593), csv(1396290), csv(998465), csv(31283542), csv(43150), csv(1072808), csv(461467), zip
    Dataset updated
    Aug 28, 2024
    Dataset provided by
    California Department of Health Care Services: http://www.dhcs.ca.gov/
    Authors
    Department of Health Care Services
    Description

    The following datasets are based on the children and youth (under age 21) beneficiary population and consist of aggregate Mental Health Service data derived from Medi-Cal claims, encounter, and eligibility systems. These datasets were developed in accordance with California Welfare and Institutions Code (WIC) § 14707.5 (added as part of Assembly Bill 470 on 10/7/17). Please contact BHData@dhcs.ca.gov for any questions or to request previous years’ versions of these datasets. Note: The Performance Dashboard AB 470 Report Application Excel tool development has been discontinued. Please see the Behavioral Health reporting data hub at https://behavioralhealth-data.dhcs.ca.gov/ for access to dashboards utilizing these datasets and other behavioral health data.

  9. MIMIC-IV

    • physionet.org
    Updated Oct 11, 2024
    Cite
    Alistair Johnson; Lucas Bulgarelli; Tom Pollard; Brian Gow; Benjamin Moody; Steven Horng; Leo Anthony Celi; Roger Mark (2024). MIMIC-IV [Dataset]. http://doi.org/10.13026/kpb9-mt58
    Dataset updated
    Oct 11, 2024
    Authors
    Alistair Johnson; Lucas Bulgarelli; Tom Pollard; Brian Gow; Benjamin Moody; Steven Horng; Leo Anthony Celi; Roger Mark
    License

    https://github.com/MIT-LCP/license-and-dua/tree/master/drafts

    Description

    Retrospectively collected medical data has the opportunity to improve patient care through knowledge discovery and algorithm development. Broad reuse of medical data is desirable for the greatest public good, but data sharing must be done in a manner which protects patient privacy. Here we present Medical Information Mart for Intensive Care (MIMIC)-IV, a large deidentified dataset of patients admitted to the emergency department or an intensive care unit at the Beth Israel Deaconess Medical Center in Boston, MA. MIMIC-IV contains data for over 65,000 patients admitted to an ICU and over 200,000 patients admitted to the emergency department. MIMIC-IV incorporates contemporary data and adopts a modular approach to data organization, highlighting data provenance and facilitating both individual and combined use of disparate data sources. MIMIC-IV is intended to carry on the success of MIMIC-III and support a broad set of applications within healthcare.

  10. medmcqa

    • huggingface.co
    Updated May 22, 2022
    Cite
    Open Life Science AI (2022). medmcqa [Dataset]. https://huggingface.co/datasets/openlifescienceai/medmcqa
    Dataset updated
    May 22, 2022
    Dataset authored and provided by
    Open Life Science AI
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Card for MedMCQA

      Dataset Summary
    

    MedMCQA is a large-scale, Multiple-Choice Question Answering (MCQA) dataset designed to address real-world medical entrance exam questions. MedMCQA has more than 194k high-quality AIIMS & NEET PG entrance exam MCQs covering 2.4k healthcare topics and 21 medical subjects, with an average token length of 12.77 and high topical diversity. Each sample contains a question, correct answer(s), and other options which require… See the full description on the dataset page: https://huggingface.co/datasets/openlifescienceai/medmcqa.

  11. Hospital Annual Financial Data - Selected Data & Pivot Tables

    • data.chhs.ca.gov
    • data.ca.gov
    • +5more
    csv, data, doc, html +4
    Updated Apr 23, 2025
    Cite
    Department of Health Care Access and Information (2025). Hospital Annual Financial Data - Selected Data & Pivot Tables [Dataset]. https://data.chhs.ca.gov/dataset/hospital-annual-financial-data-selected-data-pivot-tables
    Explore at:
    Available download formats: xlsx, xlsx(770931), xls(44967936), data, xls, html, xls(51554816), xlsx(752914), xls(16002048), xlsx(765216), xls(44933632), xls(14657536), xlsx(750199), xlsx(756356), pdf(303198), pdf(333268), xls(51424256), xls(19650048), xls(18445312), pdf(383996), pdf(121968), xlsx(768036), zip, xlsx(779866), xls(19625472), xlsx(771275), xlsx(758376), xls(19599360), doc, xls(19577856), pdf(310420), xlsx(758089), xls(18301440), xlsx(754073), xlsx(763636), xlsx(14714368), xlsx(769128), xls(920576), csv(205488092), pdf(258239), xlsx(777616), xlsx(782546), xlsx(790979)
    Dataset updated
    Apr 23, 2025
    Dataset authored and provided by
    Department of Health Care Access and Information
    Description

    On an annual basis (individual hospital fiscal year), individual hospitals and hospital systems report detailed facility-level data on services capacity, inpatient/outpatient utilization, patients, revenues and expenses by type and payer, balance sheet and income statement.

    Due to the large size of the complete dataset, a selected set of data representing a wide range of commonly used data items has been created that can be easily managed and downloaded. The selected data file includes general hospital information, utilization data by payer, revenue data by payer, expense data by natural expense category, financial ratios, and labor information.

    There are two groups of data contained in this dataset: 1) Selected Data - Calendar Year: To make it easier to compare hospitals by year, hospital reports with report periods ending within a given calendar year are grouped together. The Pivot Tables for a specific calendar year are also found here. 2) Selected Data - Fiscal Year: Hospital reports with report periods ending within a given fiscal year (July-June) are grouped together.

  12. Data from: COVID-19 Case Surveillance Public Use Data with Geography

    • data.cdc.gov
    • data.virginia.gov
    • +5more
    application/rdfxml +5
    Updated Jul 9, 2024
    + more versions
    Cite
    CDC Data, Analytics and Visualization Task Force (2024). COVID-19 Case Surveillance Public Use Data with Geography [Dataset]. https://data.cdc.gov/Case-Surveillance/COVID-19-Case-Surveillance-Public-Use-Data-with-Ge/n8mc-b4w4
    Explore at:
    Available download formats: application/rssxml, csv, tsv, application/rdfxml, xml, json
    Dataset updated
    Jul 9, 2024
    Dataset provided by
    Centers for Disease Control and Prevention (http://www.cdc.gov/)
    Authors
    CDC Data, Analytics and Visualization Task Force
    License

    https://www.usa.gov/government-works

    Description

    Note: Reporting of new COVID-19 Case Surveillance data will be discontinued July 1, 2024, to align with the process of removing SARS-CoV-2 infections (COVID-19 cases) from the list of nationally notifiable diseases. Although these data will continue to be publicly available, the dataset will no longer be updated.

    Authorizations to collect certain public health data expired at the end of the U.S. public health emergency declaration on May 11, 2023. The following jurisdictions discontinued COVID-19 case notifications to CDC: Iowa (11/8/21), Kansas (5/12/23), Kentucky (1/1/24), Louisiana (10/31/23), New Hampshire (5/23/23), and Oklahoma (5/2/23). Please note that these jurisdictions will not routinely send new case data after the dates indicated. As of 7/13/23, case notifications from Oregon will only include pediatric cases resulting in death.

    This case surveillance public use dataset has 19 elements for all COVID-19 cases shared with CDC and includes demographics, geography (county and state of residence), any exposure history, disease severity indicators and outcomes, and presence of any underlying medical conditions and risk behaviors.

    Currently, CDC provides the public with three versions of COVID-19 case surveillance line-listed data: this 19 data element dataset with geography, a 12 data element public use dataset, and a 33 data element restricted access dataset.

    The following apply to the public use datasets and the restricted access dataset:

    Overview

    The COVID-19 case surveillance database includes individual-level data reported to U.S. states and autonomous reporting entities, including New York City and the District of Columbia (D.C.), as well as U.S. territories and affiliates. On April 5, 2020, COVID-19 was added to the Nationally Notifiable Condition List and classified as “immediately notifiable, urgent (within 24 hours)” by a Council of State and Territorial Epidemiologists (CSTE) Interim Position Statement (Interim-20-ID-01). CSTE updated the position statement on August 5, 2020, to clarify the interpretation of antigen detection tests and serologic test results within the case classification (Interim-20-ID-02). The statement also recommended that all states and territories enact laws to make COVID-19 reportable in their jurisdiction, and that jurisdictions conducting surveillance should submit case notifications to CDC. COVID-19 case surveillance data are collected by jurisdictions and reported voluntarily to CDC.

    For more information: NNDSS Supports the COVID-19 Response | CDC.

    COVID-19 Case Reports

    COVID-19 case reports are routinely submitted to CDC by public health jurisdictions using nationally standardized case reporting forms. On April 5, 2020, CSTE released an Interim Position Statement with national surveillance case definitions for COVID-19. Current versions of these case definitions are available at: https://ndc.services.cdc.gov/case-definitions/coronavirus-disease-2019-2021/. All cases reported on or after were requested to be shared by public health departments to CDC using the standardized case definitions for lab-confirmed or probable cases. On May 5, 2020, the standardized case reporting form was revised. States and territories continue to use this form.

    Data are Considered Provisional

    • The COVID-19 case surveillance data are dynamic; case reports can be modified at any time by the jurisdictions sharing COVID-19 data with CDC. CDC may update prior cases shared with CDC based on any updated information from jurisdictions. For instance, as new information is gathered about previously reported cases, health departments provide updated data to CDC. As more information and data become available, analyses might find changes in surveillance data and trends during a previously reported time window. Data may also be shared late with CDC due to the volume of COVID-19 cases.
    • Annual finalized data: To create the final NNDSS data used in the annual tables, CDC works carefully with the reporting jurisdictions to reconcile the data received during the year until each state or territorial epidemiologist confirms that the data from their area are correct.

    Access Addressing Gaps in Public Health Reporting of Race and Ethnicity for COVID-19, a report from the Council of State and Territorial Epidemiologists, to better understand the challenges in completing race and ethnicity data for COVID-19 and recommendations for improvement.

    Data Limitations

    To learn more about the limitations in using case surveillance data, visit FAQ: COVID-19 Data and Surveillance.

    Data Quality Assurance Procedures

    CDC’s Case Surveillance Section routinely performs data quality assurance procedures (i.e., ongoing corrections and logic checks to address data errors). To date, the following data cleaning steps have been implemented:

    • Questions that have been left unanswered (blank) on the case report form are reclassified to a Missing value, if applicable to the question. For example, in the question "Was the individual hospitalized?" where the possible answer choices include "Yes," "No," or "Unknown," the blank value is recoded to "Missing" because the case report form did not include a response to the question.
    • Logic checks are performed for date data. If an illogical date has been provided, CDC reviews the data with the reporting jurisdiction. For example, if a symptom onset date in the future is reported to CDC, this value is set to null until the reporting jurisdiction updates the date appropriately.
    • Additional data quality processing to recode free text data is ongoing. Data on symptoms, race, ethnicity, and healthcare worker status have been prioritized.
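    The first two cleaning steps above can be illustrated with a minimal Python sketch. The function names and the plain-string/`date` inputs are hypothetical choices made for illustration; this is not CDC's actual cleaning code, only one way the described rules might look.

```python
from datetime import date
from typing import Optional


def recode_blank(answer: Optional[str]) -> str:
    """Recode an unanswered (blank) response to "Missing"; keep real answers.

    Hypothetical helper: "Yes", "No", and "Unknown" pass through unchanged.
    """
    if answer is None or answer.strip() == "":
        return "Missing"
    return answer


def check_onset_date(onset: Optional[date], today: date) -> Optional[date]:
    """Set an illogical (future) symptom onset date to null (None)."""
    if onset is not None and onset > today:
        return None
    return onset
```

    For example, `recode_blank("")` yields `"Missing"`, while an onset date later than the reference date passed as `today` is returned as `None` until a corrected date is supplied.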

    Data Suppression

    To prevent release of data that could be used to identify people, data cells are suppressed for low frequency (<11 COVID-19 case records with a given value). Suppression includes low frequency combinations of case month, geographic characteristics (county and state of residence), and demographic characteristics (sex, age group, race, and ethnicity). Suppressed values are re-coded to the NA answer option; records with data suppression are never removed.
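    As a rough illustration of low-frequency suppression, the sketch below recodes the chosen fields to "NA" whenever their combination occurs in fewer than 11 records, and never drops a record. The function name, the dictionary record layout, and the choice to blank every key field at once are assumptions; CDC's actual procedure is likely more selective about which fields are recoded.

```python
from collections import Counter

SUPPRESSION_THRESHOLD = 11  # combinations seen in fewer than 11 records are suppressed


def suppress_low_frequency(records, keys):
    """Return copies of `records` with rare key combinations recoded to "NA".

    Hypothetical helper: `records` is a list of dicts and `keys` names the
    fields (e.g. county, sex) whose combination is checked. No record is removed.
    """
    counts = Counter(tuple(r[k] for k in keys) for r in records)
    cleaned = []
    for r in records:
        combo = tuple(r[k] for k in keys)
        if counts[combo] < SUPPRESSION_THRESHOLD:
            # Keep the record but mask the identifying fields.
            r = {**r, **{k: "NA" for k in keys}}
        cleaned.append(r)
    return cleaned
```

    The output always has the same number of records as the input, which mirrors the statement that suppressed records are never removed.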

    Additional COVID-19 Data

    COVID-19 data are available to the public as summary or aggregate count files, including total counts of cases and deaths by state and by county. These and other COVID-19 data are available from multiple public locations: COVID Data Tracker; United States COVID-19 Cases and Deaths by State; COVID-19 Vaccination Reporting Data Systems; and COVID-19 Death Data and Resources.

    Notes:

    March 1, 2022: The "COVID-19 Case Surveillance Public Use Data with Geography" will be updated on a monthly basis.

    April 7, 2022: An adjustment was made to CDC’s cleaning algorithm for COVID-19 line level case notification data. An assumption in CDC's algorithm led to misclassifying deaths that were not COVID-19 related. The algorithm has since been revised, and this dataset update reflects corrected individual level information about death status for all cases collected to date.

    June 25, 2024: An adjustment

  13. Heart Attack Dataset

    • data.mendeley.com
    Updated Nov 23, 2022
    Cite
    Tarik A. Rashid (2022). Heart Attack Dataset [Dataset]. http://doi.org/10.17632/wmhctcrt5v.1
    Explore at:
    Dataset updated
    Nov 23, 2022
    Authors
    Tarik A. Rashid
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    The heart attack datasets were collected at Zheen hospital in Erbil, Iraq, from January 2019 to May 2019. The attributes of this dataset are: age, gender, heart rate, systolic blood pressure, diastolic blood pressure, blood sugar, CK-MB, and troponin, with a negative or positive output. According to the provided information, each record is classified as either a heart attack or no heart attack. The gender column in the data is normalized: male is set to 1 and female to 0. The glucose column is set to 1 if it is > 120; otherwise, 0. As for the output, positive is set to 1 and negative to 0.
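    The encoding rules in this description can be sketched in Python as follows. The function name, the field names, and the raw string labels ("male"/"female", "positive"/"negative") are assumptions made for illustration; the published files may already be encoded differently.

```python
GENDER_CODES = {"male": 1, "female": 0}
OUTPUT_CODES = {"positive": 1, "negative": 0}
GLUCOSE_CUTOFF = 120  # glucose is set to 1 only when strictly greater than 120


def encode_record(gender: str, blood_sugar: float, output: str) -> dict:
    """Apply the binary encoding described above to one record.

    Hypothetical helper; raises KeyError on labels outside the expected set.
    """
    return {
        "gender": GENDER_CODES[gender.strip().lower()],
        "glucose": 1 if blood_sugar > GLUCOSE_CUTOFF else 0,
        "output": OUTPUT_CODES[output.strip().lower()],
    }
```

    Note that a blood sugar value of exactly 120 maps to 0, because the stated rule uses a strict "greater than" comparison.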

  14. MIMIC-III Clinical Database

    • physionet.org
    Updated Sep 4, 2016
    + more versions
    Cite
    Alistair Johnson; Tom Pollard; Roger Mark (2016). MIMIC-III Clinical Database [Dataset]. http://doi.org/10.13026/C2XW26
    Explore at:
    Dataset updated
    Sep 4, 2016
    Authors
    Alistair Johnson; Tom Pollard; Roger Mark
    License

    https://github.com/MIT-LCP/license-and-dua/tree/master/drafts

    Description

    MIMIC-III is a large, freely-available database comprising deidentified health-related data associated with over forty thousand patients who stayed in critical care units of the Beth Israel Deaconess Medical Center between 2001 and 2012. The database includes information such as demographics, vital sign measurements made at the bedside (~1 data point per hour), laboratory test results, procedures, medications, caregiver notes, imaging reports, and mortality (including post-hospital discharge). MIMIC supports a diverse range of analytic studies spanning epidemiology, clinical decision-rule improvement, and electronic tool development. It is notable for three factors: it is freely available to researchers worldwide; it encompasses a diverse and very large population of ICU patients; and it contains highly granular data, including vital signs, laboratory results, and medications.

  15. Hospital Quarterly Financial & Utilization Report - Complete Data Set

    • data.chhs.ca.gov
    • healthdata.gov
    • +4more
    aspx, csv, docx, pdf +3
    Updated Jul 30, 2025
    Cite
    Department of Health Care Access and Information (2025). Hospital Quarterly Financial & Utilization Report - Complete Data Set [Dataset]. https://data.chhs.ca.gov/dataset/hospital-quarterly-financial-utilization-report-complete-data-set
    Explore at:
    Available download formats: pdf(429528), aspx, docx, csv, xlsx(370421), xlsx(372005), xlsx(371121), xlsx(375969), csv(306089), xlsx(366794), xlsx(378788), xlsx(380227), xlsx(366051), csv(427425), xlsx(371090), xlsx(371907), csv(425428), xlsx(378446), xlsx(366388), xlsx, csv(423900), xlsx(423343), pdf(479472), xlsx(422435), xls(517632), xlsx(419489), zip, xlsx(373069), csv(304994), xlsx(373128), csv(426414), xlsx(363778), xlsx(373778), csv(426317), xlsx(422675)
    Dataset updated
    Jul 30, 2025
    Dataset authored and provided by
    Department of Health Care Access and Information
    Description

    On a quarterly basis (every three months), individual hospitals and hospital systems report summary facility-level data on services capacity, revenues and expenses by payer, and utilization by payer. The complete database contains all of the data reported by hospitals. Data for the current year are available by individual calendar quarters. Once the 4th quarter of the current year is posted, then the prior year quarters will be rolled into one spreadsheet file which combines all the Quarterly data for that year.

  16. HCUP National Inpatient Database

    • redivis.com
    application/jsonl +7
    Updated May 11, 2024
    Cite
    Stanford Center for Population Health Sciences (2024). HCUP National Inpatient Database [Dataset]. http://doi.org/10.57761/d67b-fz41
    Explore at:
    Available download formats: application/jsonl, csv, avro, arrow, parquet, stata, sas, spss
    Dataset updated
    May 11, 2024
    Dataset provided by
    Redivis Inc.
    Authors
    Stanford Center for Population Health Sciences
    Time period covered
    Jan 1, 2000 - Dec 31, 2021
    Description

    Abstract

    The NIS is the largest publicly available all-payer inpatient healthcare database designed to produce U.S. regional and national estimates of inpatient utilization, access, cost, quality, and outcomes. Unweighted, it contains data from around 7 million hospital stays each year. Weighted, it estimates around 35 million hospitalizations nationally. Developed through a Federal-State-Industry partnership sponsored by the Agency for Healthcare Research and Quality (AHRQ), HCUP data inform decision making at the national, State, and community levels.

    Its large sample size is ideal for developing national and regional estimates and enables analyses of rare conditions, uncommon treatments, and special populations.

    Usage

    IMPORTANT NOTE: Some records are missing from the Severity Measures table for 2017 & 2018, but none are missing from any of the other 2012-2020 data. We are in the process of trying to recover the missing records, and will update this note when we have done so.

    Also DO NOT use this data without referring to the NIS Database Documentation, which includes:

    • Description of NIS Database
    • Restrictions on Use

    • Data Elements
    • Additional Resources for Data Elements
    • ICD-10-CM/PCS Data Included in the NIS Starting with 2015 (More details about this transition available here.)
    • Known Data Issues
    • NIS Supplemental Files
    • HCUP Tools: Labels and Formats
    • Obtaining HCUP Data

    Before Manuscript Submission

    All manuscripts (and other items you'd like to publish) must be submitted to phsdatacore@stanford.edu for approval prior to journal submission.

    We will check your cell sizes and citations.

    For more information about how to cite PHS and PHS datasets, please visit:

    https://phsdocs.developerhub.io/need-help/citing-phs-data-core

    HCUP Online Tutorials

    For additional assistance, AHRQ has created the HCUP Online Tutorial Series, a series of free, interactive courses which provide training on technical methods for conducting research with HCUP data. Topics include an HCUP Overview Course and these tutorials:

    • The HCUP Sampling Design tutorial is designed to help users learn how to account for sample design in their work with HCUP national (nationwide) databases.
    • The Producing National HCUP Estimates tutorial is designed to help users understand how the three national (nationwide) databases – the NIS, Nationwide Emergency Department Sample (NEDS), and Kids' Inpatient Database (KID) – can be used to produce national and regional estimates.
    • The Calculating Standard Errors tutorial shows how to accurately determine the precision of the estimates produced from the HCUP nationwide databases. Users will learn two methods for calculating standard errors for estimates produced from the HCUP national (nationwide) databases.
    • The HCUP Multi-year Analysis tutorial presents solutions that may be necessary when conducting analyses that span multiple years of HCUP data.
    • The HCUP Software Tools Tutorial provides instructions on how to apply the AHRQ software tools to HCUP or other administrative databases.

    New tutorials are added periodically, and existing tutorials are updated when necessary. The Online Tutorial Series is located on the HCUP-US website at www.hcup-us.ahrq.gov/tech_assist/tutorials.jsp.

    Important notes about the 2015 data

    In 2015, AHRQ restructured the data as described here:

    https://hcup-us.ahrq.gov/db/nation/nis/2015HCUPNationalInpatientSample.pdf

    Some key points:

    • For the 2015 data, all diagnosis and procedure data elements, including any data elements derived from diagnoses and procedures, were moved out of the Core File and into the Diagnosis and Procedure Groups Files.
    • Prior to 2015, and for Q1-3 of 2015, the DX1-30 and PR1-15 variables (which use ICD-9 codes) were used, but starting in Q4 of 2015, the I10_DX1-30 and I10_PR1-I10-15 variables (which use ICD-10 codes) were used. The best way to identify discharges for quarters 1-3 or quarter 4 is based on the value of the diagnosis version (DXVER): for quarters 1-3, DXVER has a value of 9, while for quarter 4, DXVER has a value of 10.
    • Some other variables also transitioned in Q4 of 2015. Please refer to the link above for more details.
    • Starting in 2016, the diagnosis and procedure information returned to the Core file. Additional details about the data in 2016 are available here: https://hcup-us.ahrq.gov/db/nation/nis/NISChangesBeginningDataYr2016.pdf
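    The DXVER rule in the key points above can be sketched as a small Python helper that picks the diagnosis column names for a 2015 discharge record. The function name is hypothetical, and expanding "DX1-30" and "I10_DX1-30" into 30 individually numbered columns is an assumption about the file layout; consult the NIS documentation before relying on it.

```python
from typing import List


def diagnosis_columns(dxver: int, n_codes: int = 30) -> List[str]:
    """Return the diagnosis column names implied by the DXVER value.

    DXVER == 9  -> ICD-9 columns  (DX1..DX30), used through Q3 of 2015.
    DXVER == 10 -> ICD-10 columns (I10_DX1..I10_DX30), used from Q4 of 2015.
    """
    if dxver == 9:
        prefix = "DX"
    elif dxver == 10:
        prefix = "I10_DX"
    else:
        raise ValueError(f"unexpected DXVER value: {dxver!r}")
    return [f"{prefix}{i}" for i in range(1, n_codes + 1)]
```

    Branching on DXVER per record, rather than on the discharge quarter, follows the recommendation above that DXVER is the best way to tell the two coding periods apart.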

    NIS Areas of Research and HCUP Publications

  17. DSS Medical Benefit Plan Participation by Month CY 2012-2025

    • catalog.data.gov
    • data.ct.gov
    Updated Jul 19, 2025
    + more versions
    Cite
    data.ct.gov (2025). DSS Medical Benefit Plan Participation by Month CY 2012-2025 [Dataset]. https://catalog.data.gov/dataset/dss-medical-benefit-plan-participation-by-month-cy-2012-2020
    Explore at:
    Dataset updated
    Jul 19, 2025
    Dataset provided by
    data.ct.gov
    Description

    In order to facilitate public review and access, enrollment data published on the Open Data Portal is provided as promptly as possible after the end of each month or year, as applicable to the data set. Due to eligibility policies and operational processes, enrollment can vary slightly after publication. Please be aware of the point-in-time nature of the published data when comparing to other data published or shared by the Department of Social Services, as this data may vary slightly.

    As a general practice, for monthly data sets published on the Open Data Portal, DSS will continue to refresh the monthly enrollment data for three months, after which time it will remain static. For example, when March data is published the data in January and February will be refreshed. When April data is published, February and March data will be refreshed, but January will not change. This allows the Department to account for the most common enrollment variations in published data while also ensuring that data remains as stable as possible over time. In the event of a significant change in enrollment data, the Department may republish reports and will notate such republication dates and reasons accordingly.

    In March 2020, Connecticut opted to add a new Medicaid coverage group: the COVID-19 Testing Coverage for the Uninsured. Enrollment data on this limited-benefit Medicaid coverage group is being incorporated into Medicaid data effective January 1, 2021. Enrollment data for this coverage group prior to January 1, 2021, was listed under State Funded Medical. Effective January 1, 2021, this coverage group has been separated: (1) the COVID-19 Testing Coverage for the Uninsured is now G06-I and is now listed as a limited benefit plan that rolls up into “Program Name” of Medicaid and “Medical Benefit Plan” of HUSKY Limited Benefit; (2) the emergency medical coverage has been separated into G06-II as a limited benefit plan that rolls up into “Program Name” of Emergency Medical and “Medical Benefit Plan” of Other Medical. A historical accounting of enrollment of the specific coverage group starting in calendar year 2020 will also be published separately.

    This data represents the number of active recipients who received benefits under a medical benefit plan in that calendar year and month. A recipient may have received benefits from multiple plans in the same month; if so, that recipient will be included in multiple categories in this dataset (counted more than once). 2021 is a partial year. For privacy considerations, a count of zero is used for counts less than five.

    NOTE: On April 22, 2019 the methodology for determining HUSKY A Newborn recipients changed, which caused an increase of recipients for that benefit starting in October 2016. We now count recipients recorded in the ImpaCT system as well as in the HIX system for that assistance type, instead of using HIX exclusively. Also, corrections in the ImpaCT system for January and February 2019 caused the addition of around 2000 and 3000 recipients respectively, and the counts for many types of assistance (e.g. SNAP) were adjusted upward for those 2 months.

    Also, the methodology for determining the address of the recipients changed:

    1. The address of a recipient in the ImpaCT system is now correctly determined specific to that month instead of using the address of the most recent month. This resulted in some shuffling of the recipients among townships starting in October 2016.
    2. If, in a given month, a recipient has benefit records in both the HIX system and in the ImpaCT system, the address of the recipient is now calculated as follows to resolve conflicts: Use the residential address in ImpaCT if it exists, else use the mailing address in ImpaCT if it exists, else use the address in HIX. This resulted in a reduction in counts for most townships starting in March 2017 because a single address is now used instead of two when the systems do not agree.

    NOTE: On February 14 2019, the enrollment
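    Two rules from this description lend themselves to a short Python sketch: the address fallback order (ImpaCT residential, then ImpaCT mailing, then HIX) and the privacy rule that publishes zero for counts below five. The function names and the use of plain optional strings for addresses are hypothetical; this is not the Department's actual code.

```python
from typing import Optional

PRIVACY_MINIMUM = 5  # counts less than five are published as zero


def resolve_address(
    impact_residential: Optional[str],
    impact_mailing: Optional[str],
    hix_address: Optional[str],
) -> Optional[str]:
    """Pick a single address using the fallback order described above."""
    for candidate in (impact_residential, impact_mailing, hix_address):
        if candidate:  # skip None and empty strings
            return candidate
    return None


def published_count(actual_count: int) -> int:
    """Return the count as it would appear in the published data."""
    return 0 if actual_count < PRIVACY_MINIMUM else actual_count
```

    One consequence of the privacy rule is that a published zero cannot be distinguished from a true count of one to four, which is worth keeping in mind when summing across categories.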

  18. Data from: Exploiting hierarchy in medical concept embedding

    • data.niaid.nih.gov
    • search.dataone.org
    • +1more
    zip
    Updated Oct 27, 2021
    Cite
    Anthony Finch; Alexander Crowell; Mamta Bhatia; Pooja Parameshwarappa; Yung-Chieh Chang; Jose Martinez; Michael Horberg (2021). Exploiting hierarchy in medical concept embedding [Dataset]. http://doi.org/10.5061/dryad.v9s4mw6v0
    Explore at:
    Available download formats: zip
    Dataset updated
    Oct 27, 2021
    Dataset provided by
    Mid-Atlantic Permanente Medical Group
    Mid-Atlantic Permanente Research Institute
    Authors
    Anthony Finch; Alexander Crowell; Mamta Bhatia; Pooja Parameshwarappa; Yung-Chieh Chang; Jose Martinez; Michael Horberg
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Objective

    To construct and publicly release a set of medical concept embeddings for codes following the ICD-10 coding standard which explicitly incorporate hierarchical information from medical codes into the embedding formulation.

    Materials and Methods

    We trained concept embeddings using several new extensions to the Word2Vec algorithm using a dataset of approximately 600,000 patients from a major integrated healthcare organization in the Mid-Atlantic US. Our concept embeddings included additional entities to account for the medical categories assigned to codes by the Clinical Classification Software Revised (CCSR) dataset. We compare these results to sets of publicly-released pretrained embeddings and alternative training methodologies.

    Results

    We found that Word2Vec models which included hierarchical data outperformed ordinary Word2Vec alternatives on tasks which compared naïve clusters to canonical ones provided by CCSR. Our Skip-Gram model with both codes and categories achieved 61.4% Normalized Mutual Information with canonical labels in comparison to 57.5% with traditional Skip-Gram. In models operating on two different outcomes we found that including hierarchical embedding data improved classification performance 96.2% of the time. When controlling for all other variables, we found that co-training embeddings improved classification performance 66.7% of the time. We found that all models outperformed our competitive benchmarks.

    Discussion

    We found significant evidence that our proposed algorithms can express the hierarchical structure of medical codes more fully than ordinary Word2Vec models, and that this improvement carries forward into classification tasks. As part of this publication, we have released several sets of pretrained medical concept embeddings using the ICD-10 standard which significantly outperform other well-known pretrained vectors on our tested outcomes.

    Methods

    This dataset includes trained medical concept embeddings for 5428 ICD-10 codes and 394 Clinical Classification Software (Revised) (CCSR) categories. We include several different sets of concept embeddings, each trained using a slightly different set of hyperparameters and algorithms.

    To train our models, we employed data from the Kaiser Permanente Mid-Atlantic States (KPMAS) medical system. KPMAS is an integrated medical system serving approximately 780,000 members in Maryland, Virginia, and the District of Columbia. KPMAS has a comprehensive Electronic Medical Record system which includes data from all patient interactions with primary or specialty caregivers, from which all data is derived. Our embeddings training set included diagnoses allocated to all adult patients in calendar year 2019.

    For each code, we also recovered an associated category, as assigned by the Clinical Classification Software (Revised).

    We trained 12 sets of embeddings using classical Word2Vec models with settings differing across three parameters. Our first parameter was the selection of training algorithm, where we trained both CBOW and SG models. Each model was trained using dimension k of 10, 50, and 100. Furthermore, each model-dimension combination was trained with categories and codes trained separately and together (referred to hereafter as ‘co-trained embeddings’ or ‘co-embeddings’). Each model was trained for 10 iterations. We employed an arbitrarily large context window (100), since all codes necessarily occurred within a short period (1 year).
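    The count of 12 embedding sets follows from crossing the three parameters: 2 algorithms × 3 dimensions × 2 training modes. The Python sketch below simply enumerates that grid; the dictionary keys and function name are hypothetical, and no model is actually trained here.

```python
from itertools import product

ALGORITHMS = ("CBOW", "SG")   # training algorithm
DIMENSIONS = (10, 50, 100)    # embedding dimension k
CO_TRAINED = (False, True)    # categories and codes trained separately vs together


def embedding_configs():
    """Enumerate the parameter grid described above (illustrative only)."""
    return [
        {
            "algorithm": algorithm,
            "dimension": k,
            "co_trained": co_trained,
            "iterations": 10,  # each model was trained for 10 iterations
            "window": 100,     # arbitrarily large context window
        }
        for algorithm, k, co_trained in product(ALGORITHMS, DIMENSIONS, CO_TRAINED)
    ]
```

    Calling `len(embedding_configs())` gives 12, matching the number of embedding sets reported.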

    We also trained a set of validation embeddings only on ICD-10 codes using the Med2Vec architecture as a comparison. We trained the Med2Vec model on our data using its default settings, including the default vector size (200) and a training regime of 10 epochs. We grouped all codes occurring on the same calendar date as Med2Vec ‘visits.’ Our Med2Vec model benchmark did not include categorical entities or other novel innovations.

    Word2Vec embeddings were generated using the GenSim package in Python. Med2Vec embeddings were generated using the Med2Vec code published by Choi. The JSON files used in this repository were generated using the JSON package in Python.

  19. Medical Text Dataset -Cancer Doc Classification

    • kaggle.com
    Updated Aug 6, 2022
    Cite
    Falgunipatel19 (2022). Medical Text Dataset -Cancer Doc Classification [Dataset]. https://www.kaggle.com/datasets/falgunipatel19/biomedical-text-publication-classification
    Explore at:
    Dataset updated
    Aug 6, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Falgunipatel19
    Description

    For biomedical text document classification, abstracts and full papers (of six pages or fewer) are commonly available and used. This dataset focuses on long research papers of more than six pages. It contains cancer documents to be classified into three categories: 'Thyroid_Cancer', 'Colon_Cancer', and 'Lung_Cancer'. Total publications: 7569. Number of samples in each category: colon cancer = 2579, lung cancer = 2180, thyroid cancer = 2810.

  20. Diabetes Dataset

    • data.mendeley.com
    Updated Jul 18, 2020
    Cite
    Ahlam Rashid (2020). Diabetes Dataset [Dataset]. http://doi.org/10.17632/wj9rwkp9c2.1
    Explore at:
    Dataset updated
    Jul 18, 2020
    Authors
    Ahlam Rashid
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    The construction of the diabetes dataset is explained here. The data were collected from Iraqi society: they were acquired from the laboratory of Medical City Hospital and the Specialized Center for Endocrinology and Diabetes, Al-Kindy Teaching Hospital. Patients' files were taken, and data were extracted from them and entered into the database to construct the diabetes dataset. The data consist of medical information, laboratory analysis, etc. The attributes initially entered into the system are: No. of Patient, Sugar Level Blood, Age, Gender, Creatinine ratio (Cr), Body Mass Index (BMI), Urea, Cholesterol (Chol), Fasting lipid profile (including total, LDL, VLDL, Triglycerides (TG), and HDL Cholesterol), HBA1C, and Class (the patient's diabetes disease class may be Diabetic, Non-Diabetic, or Predict-Diabetic).

Sozan S. Maghdid (2022). An Extensive Dataset for the Heart Disease Classification System [Dataset]. http://doi.org/10.17632/65gxgy2nmg.1

An Extensive Dataset for the Heart Disease Classification System

Dataset updated
Feb 15, 2022
Authors
Sozan S. Maghdid
License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

Finding a good data source is the first step toward creating a database. Cardiovascular diseases (CVDs) are the leading cause of death worldwide. CVDs include coronary heart disease, cerebrovascular disease, rheumatic heart disease, and other heart and blood vessel problems. According to the World Health Organization, 17.9 million people die from CVDs each year. Heart attacks and strokes account for more than four out of every five CVD deaths, with one-third of these deaths occurring before the age of 70.
A comprehensive database of factors that contribute to a heart attack has been constructed. The main purpose is to collect the characteristics of heart attacks or the factors that contribute to them. A form was created in Microsoft Excel to accomplish this. Figure 1 depicts the form, which has nine fields: eight input fields and one output field. The input fields are age, gender, heart rate, systolic BP, diastolic BP, blood sugar, CK-MB, and Test-Troponin. The output field records the presence of a heart attack and is divided into two categories: negative refers to the absence of a heart attack, while positive refers to its presence. Table 1 shows the detailed information and the maximum and minimum attribute values for the 1319 cases in the whole database.
To confirm the validity of the data, we looked at the patient files in the hospital archive and compared them with the data stored in the laboratory system. We also interviewed the patients and specialized doctors. Table 2 is a sample from the 1320 cases in the whole database; it shows 44 cases and the factors that lead to a heart attack.
After collecting the data, we checked whether it contained null (invalid) values or whether any errors had occurred during data collection. A value is null if it is unknown, and null values necessitate special treatment. Null is used to indicate that the target is not a valid data element. When trying to retrieve data that is not present, you can come across the keyword null in Processing. If you perform arithmetic operations on a numeric column with one or more null values, the outcome will be null. An example of null value processing is shown in Figure 2.
The data used in this investigation were scaled between 0 and 1 to guarantee that all inputs and outputs received equal attention and to make them dimensionless. Normalizing data before applying AI models has two major advantages. The first is to prevent attributes with larger numeric ranges from overshadowing those with smaller numeric ranges. The second is to avoid numerical problems during the process. After normalization, we split the dataset into two parts, a training set and a test set: 1060 cases were used for training and 259 for testing. Modeling was then carried out using the input and output variables.
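The scaling and splitting steps described above can be illustrated with a minimal sketch. It assumes plain min-max scaling per column, null values represented as `None` and passed through unchanged, and a seeded random shuffle before the split. The description does not say how nulls were handled or how the 1060/259 split was drawn, so those choices, along with the function names `min_max_scale` and `split_train_test`, are hypothetical.

```python
import random


def min_max_scale(column):
    """Scale a list of numbers to [0, 1]; None (null) entries stay None.

    Assumption: nulls are passed through untouched rather than imputed.
    """
    known = [v for v in column if v is not None]
    if not known:
        return list(column)  # nothing to scale
    lo, hi = min(known), max(known)
    span = hi - lo
    if span == 0:
        # A constant column carries no range information; map it to 0.0.
        return [None if v is None else 0.0 for v in column]
    return [None if v is None else (v - lo) / span for v in column]


def split_train_test(rows, n_train, seed=0):
    """Shuffle rows reproducibly, then cut them into train and test parts.

    Assumption: a random shuffle; the source does not state the method.
    """
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    return rows[:n_train], rows[n_train:]


# Illustrative values only, not taken from the dataset.
heart_rate = [60, None, 90, 120]
scaled = min_max_scale(heart_rate)

# 1319 cases split into 1060 for training and 259 for testing.
train, test = split_train_test(range(1319), n_train=1060)
```

With min-max scaling, each attribute's smallest observed value maps to 0 and its largest to 1, which is what keeps wide-range attributes from dominating narrow-range ones.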
