Sample dataset for training webinar on Tuesday, April 16 2024.
PMC-Patients is a first-of-its-kind dataset consisting of 167k patient summaries extracted from case reports in PubMed Central (PMC), 3.1M patient-article relevance and 293k patient-patient similarity annotations defined by PubMed citation graph.
Dataset Description
- **Homepage:** https://github.com/pmc-patients/pmc-patients
- **Repository:** https://github.com/pmc-patients/pmc-patients
- **Paper:** https://arxiv.org/pdf/2202.13876.pdf
- **Leaderboard:** https://pmc-patients.github.io/
- **Point of Contact:** zhengyun21@mails.tsinghua.edu.cn
Dataset Structure
This file contains all information about patient summaries in PMC-Patients, with the following columns:
Dataset Creation
If you are interested in the collection of PMC-Patients and reproducing our baselines, please refer to [this repository](https://github.com/zhao-zy15/PMC-Patients).
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Dataset Card for PMC-Patients
News
We released PMC-Patients-V2 (in JSON format with the same keys), which is based on 2024 PMC baseline and contains 250,294 patients. The data collection pipeline remains the same except for using more PMC articles.
Dataset Summary
PMC-Patients is a first-of-its-kind dataset consisting of 167k patient summaries extracted from case reports in PubMed Central (PMC), 3.1M patient-article relevance and 293k patient-patient similarity… See the full description on the dataset page: https://huggingface.co/datasets/THUMedInfo/PMC-Patients.
The table PMC patients is part of the dataset PMC patient notes, available at https://stanford.redivis.com/datasets/73ag-4jmwbmba3. It contains 167034 rows across 10 variables.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A JSON file, which is a list of dictionaries with the following keys:
- `patient_id`: string. A continuous id of patients, starting from 0.
- `patient_uid`: string. Unique ID for each patient, with format PMID-x, where PMID is the PubMed Identifier of the source article of the note and x denotes the index of the note in the source article.
- `PMID`: string. PMID of the source article.
- `file_path`: string. File path of the XML file of the source article.
- `title`: string. Source article title.
- `patient`: string. Patient note.
- `age`: list of tuples. Each entry is in the format (value, unit), where value is a float and unit is one of 'year', 'month', 'week', 'day' and 'hour', indicating the age unit. For example, [[1.0, 'year'], [2.0, 'month']] indicates that the patient is a one-year- and two-month-old infant.
- `gender`: 'M' or 'F'. Male or Female.
- `relevant_articles`: dict. The key is the PMID of a relevant article and the corresponding value is its relevance score (2 or 1, as defined in the "Methods" section).
- `similar_patients`: dict. The key is the patient_uid of a similar patient and the corresponding value is its similarity score (2 or 1, as defined in the "Methods" section).
https://www.icpsr.umich.edu/web/ICPSR/studies/34644/terms
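The PMC-Patients record keys described above can be read with a few lines of Python. The following is a minimal sketch, not the authors' loader: the sample record, its values, and the helper names `format_age` and `top_scored` are all hypothetical, and it assumes the scores are stored as integers.

```python
import json

# Hypothetical record shaped like the documented keys; every value is invented.
SAMPLE_JSON = """
[
  {
    "patient_id": "0",
    "patient_uid": "1234567-1",
    "PMID": "1234567",
    "file_path": "path/to/article.xml",
    "title": "Example case report",
    "patient": "Example patient note.",
    "age": [[1.0, "year"], [2.0, "month"]],
    "gender": "F",
    "relevant_articles": {"1111111": 1, "2222222": 2},
    "similar_patients": {"7654321-1": 2, "7654321-2": 1}
  }
]
"""


def format_age(age):
    """Render [[1.0, 'year'], [2.0, 'month']] as '1 year, 2 months'."""
    parts = []
    for value, unit in age:
        # Show whole numbers without a trailing '.0'.
        number = int(value) if float(value).is_integer() else value
        suffix = "" if value == 1 else "s"
        parts.append(f"{number} {unit}{suffix}")
    return ", ".join(parts)


def top_scored(scores, minimum=2):
    """Return the keys whose score is at least `minimum`, sorted for stable output."""
    return sorted(key for key, score in scores.items() if score >= minimum)


patients = json.loads(SAMPLE_JSON)
first = patients[0]
age_text = format_age(first["age"])
strong_articles = top_scored(first["relevant_articles"])
strong_patients = top_scored(first["similar_patients"])
print(first["patient_uid"], age_text, strong_articles, strong_patients)
```

If the real file stores scores as strings rather than integers, the comparison in `top_scored` would need an explicit `int(score)` conversion.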
Overview: The goal of the project was to develop a unique database linking chronic disease clinical data from an electronic medical record (EMR) of a large academic healthcare system to multi-payer claims data. The longitudinal relational database can be used to study clinical effectiveness of many diagnostic and treatment interventions. The population of patients used consisted of those patients who were attributed to the University of Michigan Health System (UMHS) as continuing care patients, who are also in adjudicated and validated chronic disease registries. Data Access: These data are not available from ICPSR. The data are restricted to use by the principal investigator and cannot be shared.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In the U.S., every hospital that receives payments from Medicare and Medicaid is mandated to provide quality data to The Centers for Medicare and Medicaid Services (CMS) annually. This data helps gauge patient satisfaction levels across the country. While overall hospital scores can be influenced by the quality of customer services, there may also be variations in satisfaction based on the type of hospital or its location.
Year: 2016 - 2020
The Star Rating Program, implemented by The Centers for Medicare & Medicaid Services (CMS), employs a five-star grading system to evaluate the experiences of Medicare beneficiaries with their respective health plans and the overall healthcare system. Health plans receive scores ranging from 1 to 5 stars, with 5 stars denoting the highest quality.
Benefits:
Historical Analysis: With data spanning from 2016 to 2020, researchers and analysts can observe trends over time, understanding how patient satisfaction has evolved over these years.
Benchmarking: Hospitals can compare their performance against national averages or against peer institutions to see where they stand.
Identifying Areas for Improvement: By analyzing specific metrics and feedback, hospitals can pinpoint areas where their services may be lacking and need enhancement.
Policy and Decision Making: Governments and healthcare administrators can use the data to make informed decisions about healthcare policies, funding allocations, and other strategic decisions.
Research and Academic Purposes: Academics and researchers can use the dataset for various studies, including correlational studies, predictions, and more.
Geographical Insights: The dataset may provide insights into regional variations in patient satisfaction, helping to identify areas or states with particularly high or low scores.
Understanding Factors Affecting Satisfaction: By correlating satisfaction scores with other variables (e.g., hospital type, size, location), it might be possible to determine which factors play the most significant role in patient satisfaction.
Performance Evaluation: Hospitals can use the data to evaluate the efficacy of any interventions or changes they've made over the years in terms of improving patient satisfaction.
Enhancing Patient Trust: Demonstrating transparency and a commitment to improvement can enhance patient trust and loyalty.
Informed Patients: By making such data publicly available, potential patients can make more informed decisions about where to seek care based on the satisfaction ratings of previous patients.
Source: https://data.cms.gov/provider-data/archived-data/hospitals
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
cancer patients in China, designed for medical research, survival prediction modeling, and healthcare disparity analysis. The data includes tumor characteristics, treatment types, survival status, and lifestyle factors such as smoking and alcohol use. It reflects realistic cancer epidemiology, with higher frequencies of lung, stomach, and liver cancers, and considers regional disparities in treatment and outcomes. Key features include:
Geographic spread across major Chinese provinces with proportional representation.
Cancer types, stages, and tumor sizes aligned with epidemiological trends in China.
Treatment methods (e.g., surgery, chemotherapy, immunotherapy) and session counts.
Comorbidities, genetic mutation data (with intentional 5–10% missing values).
Survival outcome and follow-up durations up to 60 months.
This dataset is suitable for use in machine learning models, public health studies, predictive analytics, and academic research—especially in the context of cancer outcome prediction, treatment effectiveness evaluation, and equity in access to advanced care.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We collected over 10,087 posts from cancer patients and their caregivers on platforms like Reddit, Daily Strength, and the Health Board. The posts were related to five types of cancer: brain, colon, liver, leukemia, and lung cancer. Two team members scored each post based on the emotions expressed, using a scale from -2 to 1. Negative scores (-1 or -2) were given to posts showing grief or suffering, a positive score (1) to posts expressing happy emotions such as relief or accomplishment, and a score of 0 to posts with no emotion, which were considered neutral. This analysis aims to understand the emotional aspects of cancer patients' posts for a mental health study.
In 2009, there were nearly 19 million federally funded community health center patients, whereas by 2022, there were 30.5 million patients in the United States. This statistic depicts the total number of health center patients in the U.S. from 2009 to 2022.
This dataset contains counts of inpatient hospitalizations and emergency department visits for persons experiencing homelessness.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A database of 100,000 virtual patients, with a total of 361,760 admissions and 107,535,387 lab observations.
https://www.pioneerdatahub.co.uk/data/data-request-process/
Community Acquired Pneumonia (CAP) is the leading cause of infectious death and the third leading cause of death globally. Disease severity and outcomes are highly variable, dependent on host factors (such as age, smoking history, frailty and comorbidities), microbial factors (the causative organism) and what treatments are given. Clinical decision pathways are complex and despite guidelines, there is significant national variability in how guidelines are adhered to and patient outcomes.
For clinicians treating pneumonia in the hospital setting, care of these patients can be challenging. Key decisions include the type of antibiotics (oral or intravenous), the appropriate place of care (home, hospital or intensive care), and when it is appropriate to stop antibiotics. Decision support tools to help inform clinical management would be highly valuable to the clinical community.
This dataset is synthetic, formed from statistical modelling using real patient data, and represents a population with significant diversity in terms of patient demography, socio-economic status, CAP severity, treatments and outcomes. It can be used to develop code for deployment on real data, train data analysts and increase familiarity with this disease and its management.
PIONEER geography: The West Midlands (WM) has a population of 5.9 million & includes a diverse ethnic & socio-economic mix.
EHR. UHB is one of the largest NHS Trusts in England, providing direct acute services & specialist care across four hospital sites, with 2.2 million patient episodes per year, 2750 beds & an expanded 250 ITU bed capacity during COVID. UHB runs a fully electronic healthcare record (EHR) (PICS; Birmingham Systems), a shared primary & secondary care record (Your Care Connected) & a patient portal “My Health”. This synthetic dataset has been modelled to reflect data collected from this EHR.
Scope: A synthetic dataset which has been statistically modelled on all hospitalised patients admitted to UHB with Community Acquired Pneumonia. The dataset includes highly granular patient demographics & co-morbidities taken from ICD-10 & SNOMED-CT codes. Serial, structured data pertaining to process of care including timings, admissions, escalation of care to ITU, discharge outcomes, physiology readings (heart rate, blood pressure, AVPU score and others), blood results and drug prescribing and administration.
Available supplementary data: Matched synthetic controls; ambulance, OMOP data, real patient CAP data. Available supplementary support: Analytics, Model build, validation & refinement; A.I.; Data partner support for ETL (extract, transform & load) process, Clinical expertise, Patient & end-user access, Purchaser access, Regulatory requirements, Data-driven trials, “fast screen” services.
These datasets focus on patients leaving California hospitals in 2019-2020 against medical advice (AMA), which is defined as choosing to leave the hospital before the treating physician recommends discharge. Patients leaving AMA are exposed to higher risks due to inadequately treated medical issues, which may result in the need for readmission.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset and data dictionary for the manuscript entitled "Multinational attitudes towards AI in healthcare and diagnostics among hospital patients: Cross-sectional evidence from the COMFORT study." Please cite the corresponding publication as a reference.
description:
The National Patient Care Database (NPCD), located at the Austin Information Technology Center, is part of the National Medical Information Systems (NMIS). The NPCD collects integrated patient care data from all Veterans Health Information Systems and Technology Architecture (VistA) IT systems. Data recorded in the VistA Patient Care Encounter (PCE) package, which captures clinical data resulting from ambulatory care patient encounters, is transmitted to the NPCD using the Ambulatory Care Reporting (ACR) Module of the VistA Patient Information Management System (PIMS) package. The Ambulatory Care Reporting Module provides necessary information on patient treatment, what services were rendered to patients, who provided the services, and whether services reported were synchronized with the VA medical center database. Directive 2006-026 (05/05/2006) extended the patient care data capture requirements to include inpatient encounters for patients seen in outpatient clinics and inpatient billable professional services. Additionally, NPCD includes VistA Spinal Cord Dysfunction (SCD) package and Primary Care Management Module (PCMM) data. The SCD central registry in NPCD is used to provide VA-wide review of patient demographics, clinical aspects of injury and disease, and resource utilization involved in providing care to patients. As of October 2010, data for Spinal Cord Dysfunction is maintained in the Spinal Cord Injury and Disorders Outcomes (SCIDO) database; current SCD data in NPCD is residual data only. The data load and extraction process for SCD data in NPCD will be discontinued in FY12. The PCMM data in NPCD includes primary care patient-to-provider assignments and provider utilization data. The NPCD is used by Veterans Health Administration (VHA) program offices for a wide variety of tasks, including research and budget allocation to medical centers.
Department of State Hospitals Patient Population Demographic (Fiscal Effective Dates: 2010-2020)
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Non-expenditure health care data provide information on institutions providing health care in countries, on resources used and on output produced in the framework of health care provision. Data on health care form a major element of public health information as they describe the capacities available for different types of health care provision as well as potential 'bottlenecks' observed. The quantity and quality of health care services provided and the work sharing established between the different institutions are a subject of ongoing debate in all countries. Sustainability - continuously providing the necessary monetary and personal resources needed - and meeting the challenges of ageing societies are the primary perspectives used when analysing and using the data. The output-related data ('activities') refer to contacts between patients and the health care system, and to the treatment thereby received. Data are available for hospital discharges of in-patients and day cases, average length of stay of in-patients and medical procedures performed in hospitals. Annual national and regional data are provided in absolute numbers and in population-standardised rates (per 100 000 inhabitants). Wherever applicable, the definitions and classifications of the System of Health Accounts (SHA) are followed, e.g. International Classification for Health Accounts - Providers of health care (ICHA-HP). For hospital discharges, the International Shortlist for Hospital Morbidity Tabulation (ISHMT) is used. Health care data on activities are largely based on administrative data sources in the countries. Therefore, they reflect the country-specific way of organising health care and may not always be completely comparable.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Importance: Serious illness conversations (SICs) that elicit patients’ values, goals, and care preferences reduce anxiety and depression and improve quality of life, but occur infrequently for patients with cancer. Behavioral economic implementation strategies (nudges) directed at clinicians and/or patients may increase SIC completion.
Objective: To test the independent and combined effects of clinician and patient nudges on SIC completion.
Design, Setting, and Participants: A 2 × 2 factorial, cluster randomized trial was conducted from September 7, 2021, to March 11, 2022, at oncology clinics across 4 hospitals and 6 community sites within a large academic health system in Pennsylvania and New Jersey among 163 medical and gynecologic oncology clinicians and 4450 patients with cancer at high risk of mortality (≥10% risk of 180-day mortality).
Interventions: Clinician clusters and patients were independently randomized to receive usual care vs nudges, resulting in 4 arms: (1) active control, operating for 2 years prior to trial start, consisting of clinician text message reminders to complete SICs for patients at high mortality risk; (2) clinician nudge only, consisting of active control plus weekly peer comparisons of clinician-level SIC completion rates; (3) patient nudge only, consisting of active control plus a preclinic electronic communication designed to prime patients for SICs; and (4) combined clinician and patient nudges.
Main Outcomes and Measures: The primary outcome was a documented SIC in the electronic health record within 6 months of a participant’s first clinic visit after randomization. Analysis was performed on an intent-to-treat basis at the patient level.
Results: The study accrued 4450 patients (median age, 67 years [IQR, 59-75 years]; 2352 women [52.9%]) seen by 163 clinicians, randomized to active control (n = 1004), clinician nudge (n = 1179), patient nudge (n = 997), or combined nudges (n = 1270). Overall patient-level rates of 6-month SIC completion were 11.2% for the active control arm (112 of 1004), 11.5% for the clinician nudge arm (136 of 1179), 11.5% for the patient nudge arm (115 of 997), and 14.1% for the combined nudge arm (179 of 1270). Compared with active control, the combined nudges were associated with an increase in SIC rates (ratio of hazard ratios [rHR], 1.55 [95% CI, 1.00-2.40]; P = .049), whereas the clinician nudge (HR, 0.95 [95% CI, 0.64-1.41]; P = .79) and patient nudge (HR, 0.99 [95% CI, 0.73-1.33]; P = .93) were not.
Conclusions and Relevance: In this cluster randomized trial, nudges combining clinician peer comparisons with patient priming questionnaires were associated with a marginal increase in documented SICs compared with an active control. Combining clinician- and patient-directed nudges may help to promote SICs in routine cancer care.
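The per-arm completion percentages reported for this trial follow directly from the stated counts. As a small illustrative check (the variable names are my own, and this reproduces only the crude rates, not the hazard-ratio analysis), the arithmetic can be sketched as:

```python
# Counts as reported in the results: (patients with a documented SIC, patients in arm).
arms = {
    "active control": (112, 1004),
    "clinician nudge": (136, 1179),
    "patient nudge": (115, 997),
    "combined nudges": (179, 1270),
}

# Crude 6-month completion rate per arm, as a percentage rounded to one decimal.
rates = {
    name: round(100 * completed / total, 1)
    for name, (completed, total) in arms.items()
}

# The arm sizes should add up to the number of accrued patients.
total_patients = sum(total for _, total in arms.values())

for name, rate in rates.items():
    print(f"{name}: {rate}%")
print("total patients:", total_patients)
```

The computed rates match the reported 11.2%, 11.5%, 11.5% and 14.1%, and the arm sizes sum to the 4450 accrued patients.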
https://dataintelo.com/privacy-and-policy
The global market size for patient experience technology was valued at approximately $12.4 billion in 2023 and is expected to reach a staggering $35.2 billion by 2032, growing at a robust CAGR of 12.5% over the forecast period. The key growth factors driving this market include increasing patient expectations for high-quality care, burgeoning healthcare costs, and advancements in telehealth and digital health technologies.
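For readers who want to relate the two market-size figures to the quoted growth rate, the compound annual growth rate (CAGR) formula can be sketched as follows. This is only an illustration: it assumes a nine-year window from 2023 to 2032, which the report does not state explicitly, and the function names are hypothetical.

```python
def implied_cagr(start_value, end_value, years):
    """Annual growth rate that turns start_value into end_value over `years` years."""
    return (end_value / start_value) ** (1 / years) - 1


def project(start_value, rate, years):
    """Value after compounding `rate` annually for `years` years."""
    return start_value * (1 + rate) ** years


start, end = 12.4, 35.2      # USD billions, as quoted in the text
years = 2032 - 2023          # assumption: growth is measured from 2023 to 2032

implied = implied_cagr(start, end, years)
projected = project(start, 0.125, years)

print(f"implied CAGR: {implied:.2%}")
print(f"value at 12.5% for {years} years: {projected:.2f} billion")
```

Under this assumption the implied rate comes out slightly below the quoted 12.5%; a different base year or forecast window in the report could account for the small gap.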
One of the primary growth drivers for the patient experience technology market is the increasing demand for personalized and patient-centered care. Patients today are more informed and involved in their healthcare decisions than ever before. The rise of digital health tools such as patient portals, mobile health apps, and telemedicine platforms has empowered patients to take an active role in managing their health. These technologies not only enhance patient satisfaction but also improve health outcomes by providing timely and accurate information.
Another significant factor contributing to the market's growth is the escalating healthcare costs. Healthcare providers are under immense pressure to reduce costs while maintaining high standards of care. Patient experience technologies offer a viable solution by streamlining administrative processes, reducing hospital readmission rates, and improving operational efficiency. For instance, automated appointment scheduling and electronic health records (EHR) systems can significantly reduce administrative burdens, allowing healthcare providers to focus more on patient care.
Advancements in telehealth and digital health technologies have also played a crucial role in the market's expansion. The COVID-19 pandemic has accelerated the adoption of telehealth services, leading to a surge in demand for virtual care solutions. These technologies have made healthcare more accessible, especially for patients in remote or underserved areas. Moreover, innovations such as artificial intelligence (AI) and machine learning are being integrated into patient experience technologies, providing more sophisticated and personalized care solutions.
Regionally, North America holds the largest market share, driven by advanced healthcare infrastructure, high healthcare expenditure, and the presence of major technology providers. However, the Asia Pacific region is expected to witness the highest growth rate during the forecast period. Factors such as increasing government initiatives to improve healthcare services, rising healthcare expenditure, and growing awareness about patient-centered care are contributing to this growth. Developing countries in the region are increasingly adopting advanced healthcare technologies, thereby creating lucrative opportunities for market players.
The patient experience technology market is segmented by components into software, hardware, and services. The software segment holds a significant share in the market due to the increasing adoption of EHR systems, patient portals, and telehealth platforms. These software solutions provide comprehensive and integrated patient care, facilitating better communication between patients and healthcare providers. Additionally, the growing trend of mobile health applications has further boosted the demand for software solutions, enabling patients to access healthcare services conveniently from their mobile devices.
The hardware segment, although smaller compared to software, is also witnessing substantial growth. Hardware components such as wearable devices, monitors, and medical kiosks play a crucial role in enhancing patient experience. Wearable devices, for instance, allow continuous monitoring of patients' vital signs, providing real-time data to healthcare providers. This not only improves patient outcomes but also empowers patients to take a proactive approach to their health. The integration of advanced technologies like IoT (Internet of Things) in these devices is further driving their adoption.
Services form another critical component of the patient experience technology market. These services include consulting, implementation, maintenance, and support services. As healthcare providers increasingly adopt advanced technologies, the need for professional services to effectively deploy and manage these solutions is growing. Consulting services help healthcare organizations to identify the right technologies and develop strategies to enhance patient experience. Implementation services ensure the smooth deployment of these technologies
About 33 percent of U.S. physicians spent 17-24 minutes with their patients, according to a survey conducted in 2018. Physicians are often constrained in the time they spend working directly with patients, which could have an impact on patient care outcomes. Studies have found that physicians spend almost half of their time in the office on data entry and other desk work. More sophisticated, network-enabled EHR (electronic health records) systems could be a step towards giving physicians more time directly with patients.
U.S. physicians
Physicians work in a variety of fields and across direct patient care and research. Within the last 50 years, the total number of active physicians has increased dramatically throughout the United States. Among all U.S. states and the District of Columbia, the District of Columbia had the highest rate of active physicians.
Physician time
In a recent study, physicians were asked about the time they spend with their patients. According to the results, a majority of physicians said that they felt their time with patients was limited. In 2018, most physicians saw 11-20 patients per day. Some reports have estimated that for every hour of direct patient contact, physicians spend an additional 2 hours working on reporting and desk work. Recent physician surveys have also indicated that one of the primary reasons for physician burn-out is having too many bureaucratic tasks.