Archived as of 6/26/2025: The datasets will no longer receive updates, but the historical data will continue to be available for download. This dataset provides information related to the major services for patients. It contains information about the total number of patients, the total number of claims, and the dollar amount paid, grouped by recipient zip code. Restricted to claims with a service date between 01/2012 and 12/2017. Service categories considered are: 01 - Inpatient Service; 03 - Outpatient Service; 06 - Physician Service; 11 - Lab Service; 12 - X-Ray Service; 17 - Clinic Service; 26 - Mental Health Service; 27 - Dental Service/Child; 28 - Dental Service/Adult; 31 - Eye Care and Exams; 38 - EPSDT Service. The provider is the billing provider. This data is for research purposes and is not intended to be used for reporting. Due to differences in geographic aggregation, time period considerations, and units of analysis, these numbers may differ from those reported by FSSA. The distance between recipient and provider is a calculated straight-line distance, not the physical travel distance.
Overview
This dataset of medical misinformation was collected and is published by the Kempelen Institute of Intelligent Technologies (KInIT). It consists of approx. 317k news articles and blog posts on medical topics published between January 1, 1998 and February 1, 2022 from a total of 207 reliable and unreliable sources. The dataset contains full texts of the articles, their original source URLs, and other extracted metadata. If a source has a credibility score available (e.g., from Media Bias/Fact Check), it is also included in the form of an annotation. Besides the articles, the dataset contains around 3.5k fact-checks and extracted verified medical claims with their unified veracity ratings published by fact-checking organisations such as Snopes or FullFact. Lastly and most importantly, the dataset contains 573 manually and more than 51k automatically labelled mappings between previously verified claims and the articles; mappings consist of two values: claim presence (i.e., whether a claim is contained in the given article) and article stance (i.e., whether the given article supports or rejects the claim or provides both sides of the argument).
The dataset is primarily intended to be used as a training and evaluation set for machine learning methods for claim presence detection and article stance classification, but it enables a range of other misinformation related tasks, such as misinformation characterisation or analyses of misinformation spreading.
Its novelty and our main contributions lie in (1) the focus on medical news articles and blog posts as opposed to social media posts or political discussions; (2) providing multiple modalities (besides full texts of the articles, there are also images and videos), thus enabling research of multimodal approaches; (3) the mapping of the articles to the fact-checked claims (with manual as well as predicted labels); and (4) providing source credibility labels for 95% of all articles and other potential sources of weak labels that can be mined from the articles' content and metadata.
The dataset is associated with the research paper "Monant Medical Misinformation Dataset: Mapping Articles to Fact-Checked Claims" accepted and presented at the ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '22).
The accompanying GitHub repository provides a small static sample of the dataset and the dataset's descriptive analysis in the form of Jupyter notebooks.
Options to access the dataset
There are two ways to access the dataset: a full static dump or a REST API.
To obtain access to the dataset (either the full static dump or the REST API), please request access by following the instructions provided below.
References
If you use this dataset in any publication, project, tool or in any other form, please cite the following papers:
@inproceedings{SrbaMonantPlatform,
  author    = {Srba, Ivan and Moro, Robert and Simko, Jakub and Sevcech, Jakub and Chuda, Daniela and Navrat, Pavol and Bielikova, Maria},
  booktitle = {Proceedings of Workshop on Reducing Online Misinformation Exposure (ROME 2019)},
  pages     = {1--7},
  title     = {Monant: Universal and Extensible Platform for Monitoring, Detection and Mitigation of Antisocial Behavior},
  year      = {2019}
}
@inproceedings{SrbaMonantMedicalDataset,
  author    = {Srba, Ivan and Pecher, Branislav and Tomlein, Matus and Moro, Robert and Stefancova, Elena and Simko, Jakub and Bielikova, Maria},
  booktitle = {Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '22)},
  numpages  = {11},
  title     = {Monant Medical Misinformation Dataset: Mapping Articles to Fact-Checked Claims},
  year      = {2022},
  doi       = {10.1145/3477495.3531726},
  publisher = {Association for Computing Machinery},
  address   = {New York, NY, USA},
  url       = {https://doi.org/10.1145/3477495.3531726}
}
Dataset creation process
In order to create this dataset (and to continuously obtain new data), we used our research platform Monant. The Monant platform provides so-called data providers to extract news articles/blog posts from news/blog sites as well as fact-checking articles from fact-checking sites. General parsers (for RSS feeds, WordPress sites, the Google Fact Check Tool, etc.) as well as custom crawlers and parsers were implemented (e.g., for the fact-checking site Snopes.com). All data are stored in a unified format in central data storage.
Ethical considerations
The dataset was collected and is published for research purposes only. We collected only publicly available content of news/blog articles. The dataset contains identities of authors of the articles if they were stated in the original source; we left this information, since the presence of an author's name can be a strong credibility indicator. However, we anonymised the identities of the authors of discussion posts included in the dataset.
The main identified ethical issue related to the presented dataset lies in the risk of mislabelling an article as supporting a false fact-checked claim and, to a lesser extent, in mislabelling an article as not containing a false claim or not supporting it when it actually does. To minimise these risks, we developed a labelling methodology and require agreement of at least two independent annotators to assign a claim presence or article stance label to an article. It is also worth noting that we do not label an article as a whole as false or true. Nevertheless, we provide partial article-claim pair veracities based on the combination of claim presence and article stance labels.
As to the veracity labels of the fact-checked claims and the credibility (reliability) labels of the articles' sources, we take these from the fact-checking sites and external listings such as Media Bias/Fact Check as they are and refer to their methodologies for more details on how they were established.
Lastly, the dataset also contains automatically predicted labels of claim presence and article stance generated by our baselines described in the next section. These methods have their limitations and work with a certain accuracy, as reported in the paper. This should be taken into account when interpreting the predicted labels.
Reporting mistakes in the dataset
The way to report considerable mistakes in raw collected data or in manual annotations is by creating a new issue in the accompanying GitHub repository. Alternatively, general enquiries or requests can be sent to info [at] kinit.sk.
Dataset structure
Raw data
First, the dataset contains so-called raw data (i.e., data extracted by the Web monitoring module of the Monant platform and stored in exactly the same form as they appear on the original websites). Raw data consist of articles from news sites and blogs (e.g., naturalnews.com), discussions attached to such articles, and fact-checking articles from fact-checking portals (e.g., snopes.com). In addition, the dataset contains feedback (number of likes, shares, and comments) provided by users on the social network Facebook, which is regularly extracted for all news/blog articles.
Raw data are contained in these CSV files (and corresponding REST API endpoints):
sources.csv
articles.csv
article_media.csv
article_authors.csv
discussion_posts.csv
discussion_post_authors.csv
fact_checking_articles.csv
fact_checking_article_media.csv
claims.csv
feedback_facebook.csv
Note: Personal information about discussion posts' authors (name, website, gravatar) is anonymised.
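For orientation, the raw-data files can be combined with standard tooling. Below is a minimal pandas sketch, assuming the static CSV dump; the join-key column names (id, source_id, article_id) are assumptions inferred from the file names, not documented schema.

```python
import pandas as pd

# Load raw-data tables from the static dump (paths are illustrative).
sources = pd.read_csv("sources.csv")
articles = pd.read_csv("articles.csv")
article_media = pd.read_csv("article_media.csv")

# Join keys below ('id', 'source_id', 'article_id') are assumptions --
# verify against the actual column headers before use.
articles_with_source = articles.merge(
    sources, left_on="source_id", right_on="id", suffixes=("", "_source")
)

# Attach media records (images, videos) to their articles.
articles_full = articles_with_source.merge(
    article_media, left_on="id", right_on="article_id",
    how="left", suffixes=("", "_media"),
)
print(articles_full.head())
```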
Annotations
Secondly, the dataset contains so-called annotations. Entity annotations describe individual raw-data entities (e.g., an article or a source). Relation annotations describe a relation between two such entities.
Each annotation is described by the following attributes:
category of annotation (annotation_category). Possible values: label (annotation corresponds to ground truth determined by human experts) and prediction (annotation was created by means of an AI method).
type of annotation (annotation_type_id). Example values: Source reliability (binary), Claim presence. The list of possible values can be obtained from the enumeration in annotation_types.csv.
method which created the annotation (method_id). Example values: Expert-based source reliability evaluation, Fact-checking article to claim transformation method. The list of possible values can be obtained from the enumeration in methods.csv.
its value (value). The value is stored in JSON format and its structure differs according to the particular annotation type.
At the same time, annotations are associated with a particular object identified by:
entity type (parameter entity_type in the case of entity annotations, or source_entity_type and target_entity_type in the case of relation annotations). Possible values: sources, articles, fact-checking-articles.
entity id (parameter entity_id in the case of entity annotations, or source_entity_id and target_entity_id in the case of relation annotations).
The dataset provides specifically these entity annotations:
Source reliability (binary). Determines the validity of a source (website) on a binary scale with two options: reliable source and unreliable source.
Article veracity. Aggregated information about veracity from article-claim pairs.
The dataset provides specifically these relation annotations:
Fact-checking article to claim mapping. Determines the mapping between a fact-checking article and a claim.
Claim presence. Determines the presence of a claim in an article.
Claim stance. Determines the stance of an article towards a claim.
Annotations are contained in these CSV files (and corresponding REST API endpoints):
entity_annotations.csv
relation_annotations.csv
Note: The identification of human annotators (the email provided in the annotation app) is anonymised.
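As an illustration, the sketch below reads the entity annotations and decodes the JSON value column; the column names mirror the attributes described above (annotation_category, entity_type, value), but the per-type structure of the value should be checked against annotation_types.csv.

```python
import json
import pandas as pd

# Column names follow the annotation attributes described above.
entity_annotations = pd.read_csv("entity_annotations.csv")

# The 'value' column stores JSON whose structure depends on the
# annotation type; decode it into Python objects.
entity_annotations["value_parsed"] = entity_annotations["value"].apply(json.loads)

# Example: keep only human-expert labels attached to sources.
source_labels = entity_annotations[
    (entity_annotations["annotation_category"] == "label")
    & (entity_annotations["entity_type"] == "sources")
]
print(source_labels.head())
```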
Archived as of 6/26/2025: The datasets will no longer receive updates, but the historical data will continue to be available for download. This dataset provides information related to claims that serviced mental health patients. It contains information about the total number of patients, the total number of claims, and the total dollar amount, grouped by provider. Restricted to claims with a service date between 01/2016 and 12/2016. Patients with mental health problems are identified by a list of mental health patients matched to their Medicaid recipient IDs from DMHA. ER claims are defined as claims with CPT codes 99281, 99282, 99283, 99284, and 99285. Providers are billing providers. This data is for research purposes and is not intended to be used for reporting. Due to differences in geographic aggregation, time period considerations, and units of analysis, these numbers may differ from those reported by FSSA.
https://creativecommons.org/publicdomain/zero/1.0/
The Healthcare Common Procedure Coding System (HCPCS, often pronounced by its acronym as "hick picks") is a set of health care procedure codes based on the American Medical Association's Current Procedural Terminology (CPT).
HCPCS includes three levels of codes: Level I consists of the American Medical Association's Current Procedural Terminology (CPT) and is numeric. Level II codes are alphanumeric and primarily include non-physician services such as ambulance services and prosthetic devices; they represent items, supplies, and non-physician services not covered by CPT-4 codes (Level I). Level III codes, also called local codes, were developed by state Medicaid agencies, Medicare contractors, and private insurers for use in specific programs and jurisdictions. The Health Insurance Portability and Accountability Act of 1996 (HIPAA) instructed CMS to adopt a standard coding system for reporting medical transactions. The use of Level III codes was discontinued on December 31, 2003, in order to adhere to consistent coding standards.
Classification of procedures performed for patients is important for billing and reimbursement in healthcare. The primary classification system used in the United States is the Healthcare Common Procedure Coding System (HCPCS), maintained by the Centers for Medicare and Medicaid Services (CMS). This system is divided into two levels: level I and level II.
Level I HCPCS codes classify services rendered by physicians. This system is based on Current Procedural Terminology (CPT), a coding system maintained by the American Medical Association (AMA). Level II codes, which are the focus of this public dataset, are used to identify products, supplies, and services not included in level I codes. The level II codes include items such as ambulance services, durable medical goods, prosthetics, orthotics, and supplies used outside a physician's office.
Given the ubiquity of administrative data in healthcare, HCPCS coding systems are also commonly used in areas of clinical research such as outcomes-based research.
Update Frequency: Yearly
Fork this kernel to get started.
https://bigquery.cloud.google.com/table/bigquery-public-data:cms_codes.hcpcs
https://cloud.google.com/bigquery/public-data/hcpcs-level2
Dataset Source: Center for Medicare and Medicaid Services. This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source - http://www.data.gov/privacy-policy#data_policy — and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.
Banner Photo by @rawpixel from Unsplash.
What are the descriptions for a set of HCPCS level II codes?
This public dataset is hosted in Google BigQuery and is included in BigQuery's 1TB/mo of free tier processing. This means that each user receives 1TB of free BigQuery processing every month, which can be used to run queries on this public dataset.
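As a starting point for the question above, the table can be inspected with the BigQuery Python client. This minimal sketch deliberately avoids assuming column names: the table path comes from the links above; print the schema first, then filter on whichever code and description columns you find.

```python
from google.cloud import bigquery

client = bigquery.Client()  # requires GCP credentials with BigQuery access

# Pull a few rows to discover the actual schema; the table path is taken
# from the dataset links above.
query = """
SELECT *
FROM `bigquery-public-data.cms_codes.hcpcs`
LIMIT 10
"""
df = client.query(query).to_dataframe()
print(df.columns.tolist())  # locate the code and description columns
print(df.head())
```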
By Health Data New York [source]
This dataset provides comprehensive measures to evaluate the quality of medical services provided to Medicaid beneficiaries by Health Homes, including the Centers for Medicare & Medicaid Services (CMS) Core Set and Health Home State Plan Amendment (SPA) measures. This allows insight into how well these Health Homes are performing in terms of delivering high-quality care. Data sources include the Medicaid Data Mart, QARR Member-Level Files, and the New York State Delivery System Reform Incentive Payment (DSRIP) Data Warehouse. With this dataset you can explore essential indicators such as rates for indicators within the scope of the Core Set measures; domains, sub-domains, and measure descriptions; the age categories used; the denominator for each measure; and the level of significance for each indicator. Understanding Health Home quality measures through this resource can help inform decisions about evidence-based health practices while also promoting better patient outcomes.
This dataset contains measures that evaluate the quality of care delivered by Health Homes for the Centers for Medicare & Medicaid Services (CMS). With this dataset, you can get an overview of how a health home is performing in terms of quality. You can use this data to compare different health homes and their respective service offerings.
The data used to create this dataset was collected from the Medicaid Data Mart, QARR Member-Level Files, and the New York State Delivery System Reform Incentive Payment (DSRIP) Data Warehouse.
In order to use this dataset effectively, you should start by looking at the columns provided. These include: Measurement Year; Health Home Name; Domain; Sub Domain; Measure Description; Age Category; Denominator; Rate; Level of Significance; Indicator. Each column provides valuable insight into how a particular health home is performing in various measurements of healthcare quality.
When examining this data, it is important to remember that many variables are included in any given measure and that changes may have occurred over time due to varying factors such as population or the financial resources available for healthcare delivery. Furthermore, changes in policy may also affect performance over time, so it is important to take these things into account when evaluating the performance of any given health home from one year to the next or when comparing different health homes on a specific measure or set of indicators over time.
- Using this dataset, state governments can evaluate the effectiveness of their health home programs by comparing the performance across different domains and subdomains.
- Healthcare providers and organizations can use this data to identify areas for improvement in quality of care provided by health homes and strategies to reduce disparities between individuals receiving care from health homes.
- Researchers can use this dataset to analyze how variations in cultural context, geography, demographics or other factors impact the delivery of quality health home services across different locations.
If you use this dataset in your research, please credit the original authors and the data source.
See the dataset description for more information.
File: health-home-quality-measures-beginning-2013-1.csv

| Column name | Description |
|:------------|:------------|
| Measurement Year | The year in which the data was collected. (Integer) |
| Health Home Name | The name of the health home. (String) |
| Domain | The domain of the measure. (String) |
| Sub Domain | The sub domain of the measure. (String) |
| Measure Description | A description of the measure. (String) |
| Age Category | The age category of the patient. (String) |
| Denominator | The denominator of the measure. (Integer) |
| Rate | The rate of the measure. (Float) |
| Level of Significance | The level of significance of the measure. (String) |
| Indicator | The indicator of the measure. (String) |
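Because the column layout is documented above, a quick summary is straightforward. The pandas sketch below is illustrative only; averaging rates across heterogeneous measures is naive and meant purely as a first screening view.

```python
import pandas as pd

# Column names come from the data dictionary above.
df = pd.read_csv("health-home-quality-measures-beginning-2013-1.csv")

# Mean rate per health home and domain across measurement years.
# Note: averaging across different measures is a rough screening view only.
summary = (
    df.groupby(["Health Home Name", "Domain"])["Rate"]
      .mean()
      .reset_index()
      .sort_values("Rate", ascending=False)
)
print(summary.head(10))
```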
...
By Data Society [source]
Do you want to explore the complexities of Health Insurance Marketplace and uncover insights into plan rates, benefits, and networks? Look no further! With this dataset from the Centers for Medicare & Medicaid Services (CMS), you can investigate trends in plan rates, access coverage across states and zip codes, compare metal level plans (across years), as well as analyze benefit information all in one place.
We’ve provided six CSV files containing combined data from across all years:
- BenefitsCostSharing.csv provides details on benefits;
- BusinessRules.csv provides details about premium payment requirements for a plan or set of plans;
- Network.csv offers details about health plans’ networks of providers who offer services at different cost levels to members enrolled in a given plan or set of plans;
- PlanAttributes.csv gives attributes such as age-off dates for various plans;
- Rate.csv delivers information on rate changes;
- ServiceArea.csv reveals demographic characteristics related to each service area associated with a specific issuer;
plus two CSV files that join data across years (Crosswalk2015 & Crosswalk2016).
So come on board and use your creativity to unlock the mysteries behind changes in benefits in relation to costs while exploring network providers within different regions!
This dataset contains information about the health insurance plans offered in the US Health Insurance Marketplace. It includes data on plan benefits, cost-sharing, networks, rates and service areas for different states. The data can be used to compare and analyze plan characteristics across different states and ages, which will help guide users' decision-making when purchasing a health insurance plan.
To begin using the dataset, you should start by looking at the columns available. These include State, Dental Plan, Multistate Plan (2015 & 2016), Metal Level (2015 & 2016), Child/Adult Only (2015 & 2016), FIPS Code, Zip Code Crosswalk Level, Reason for Crosswalk, Multistate Plan Ageoff (2016 & 2015) and MetalLevel Ageoff (2016 & 2015). These columns provide important information on each plan that can be used to compare them across states or between years.
Using this data you can explore several interesting questions such as: How do benefit levels vary among states? Are there any differences in network providers between states? What factors influence plan rates?
In order to answer these questions, you should join together relevant tables from across years using the Crosswalk 2015/2016 CSV files, then organize your data so that it is easier to visualize differences in features between plans sold across different states or years. Once the information is organized, it can be helpful to use visualizations such as line graphs or bar charts to compare feature values between plans more clearly and to differentiate variations among plans.
By doing this you can gain a better understanding of how certain factors may affect rate changes over time, or how certain benefit levels might differ by state, which will allow consumers to make an informed choice when selecting their next health insurance plan.
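A rough sketch of the cross-year join described above follows; the key names (PlanId in Rate.csv, Old_PlanID/New_PlanID in the crosswalk) are assumptions and must be checked against the actual CSV headers before use.

```python
import pandas as pd

# Load the rates table and the 2015->2016 plan crosswalk.
rates = pd.read_csv("Rate.csv")
crosswalk = pd.read_csv("Crosswalk2016.csv")

# Hypothetical join keys -- verify against the real headers first.
linked = rates.merge(
    crosswalk, left_on="PlanId", right_on="Old_PlanID", how="inner"
)

# With plans linked across years, rate changes can be compared directly.
print(linked[["PlanId", "New_PlanID"]].drop_duplicates().head())
```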
- Analyzing the effectiveness of different plan benefits and how they affect premiums to determine a fair price point for different types of healthcare plans.
- Examining the variation in rates, benefits and coverage by state or zip code to identify potential trends or disparities in access to quality health care services across regions.
- Developing an algorithm that can predict premium prices based on certain factors such as age group, type of plan (metal level), multistate coverage, etc., to help consumers more easily understand the true cost of their health insurance plans before committing to a purchase.
If you use this dataset in your research, please credit the original authors and the data source.
License: Dataset copyright by authors - You are free to: - Share - copy and redistribute the material in any medium or format for any purpose, even commercially. - Adapt - remix, transform, and build upon the material for any purpose, even commercially. - You must: - Give appropriate credit -...
https://www.usa.gov/government-works
The COVID-19 Claims Reimbursement to Health Care Providers and Facilities for Testing, Treatment, and Vaccine Administration for the Uninsured Program provides reimbursements on a rolling basis directly to eligible health care entities for claims attributed to the testing, treatment, and/or vaccine administration of COVID-19 for uninsured individuals. The program funding information is as follows:
TESTING: The American Rescue Plan Act (ARP) provided $4.8 billion to reimburse providers for testing the uninsured. In addition, the Families First Coronavirus Response Act (FFCRA) Relief Fund, which includes funds received from the Public Health and Social Services Emergency Fund as appropriated in the FFCRA (P.L. 116-127), and the Paycheck Protection Program and Health Care Enhancement Act (PPPHCEA; P.L. 116-139) each appropriated $1 billion to reimburse health care entities for conducting COVID-19 testing for the uninsured.
TREATMENT & VACCINATION: The Provider Relief Fund, which includes funds received from the Public Health and Social Services Emergency Fund as appropriated in the Coronavirus Aid, Relief, and Economic Security (CARES) Act (P.L. 116-136), provided $100 billion in relief funds. The PPPHCEA appropriated an additional $75 billion in relief funds, and the Coronavirus Response and Relief Supplemental Appropriations (CRRSA) Act (P.L. 116-260) appropriated another $3 billion. Within the Provider Relief Fund, a portion of the funding from these sources will be used to support healthcare-related expenses attributable to the treatment of uninsured individuals with COVID-19 and the vaccination of uninsured individuals. To learn more about the program, visit: https://www.hrsa.gov/CovidUninsuredClaim
This dataset represents the list of health care entities who have agreed to the Terms and Conditions and received claims reimbursement for COVID-19 testing of uninsured individuals, vaccine administration and treatment for uninsured individuals with a COVID-19 diagnosis.
For Provider Relief Fund Data - https://data.cdc.gov/Administrative/HHS-Provider-Relief-Fund/kh8y-3es6
This dataset contains the ICD-10 code lists used to test the sensitivity and specificity of the Clinical Practice Research Datalink (CPRD) medical code lists for dementia subtypes. The provided code lists are used to define dementia subtypes in linked data from the Hospital Episode Statistics (HES) inpatient dataset and the Office for National Statistics (ONS) death registry, which are then used as the 'gold standard' for comparison against dementia subtypes defined using the CPRD medical code lists. The CPRD medical code lists used in this comparison are available here: Venexia Walker, Neil Davies, Patrick Kehoe, Richard Martin (2017): CPRD codes: neurodegenerative diseases and commonly prescribed drugs. https://doi.org/10.5523/bris.1plm8il42rmlo2a2fqwslwckm2
The MarketScan Medicare Supplemental Database provides detailed cost, use and outcomes data for healthcare services performed in both inpatient and outpatient settings.
It includes Medicare Supplemental records for all years, and Medicare Advantage records starting in 2020. This page also contains the MarketScan Medicare Lab Database, starting in 2018.
Starting in 2026, there will be a data access fee for using the full dataset. Please refer to the 'Usage Notes' section of this page for more information.
MarketScan Research Databases are a family of data sets that fully integrate many types of data for healthcare research.
The MarketScan Databases track millions of patients throughout the healthcare system. The data are contributed by large employers, managed care organizations, hospitals, EMR providers and Medicare.
This page contains the MarketScan Medicare Database.
We also have related MarketScan databases on other pages.
**Starting in 2026, there will be a data access fee for using the full dataset** (though the 1% sample will remain free to use). The pricing structure and other relevant information can be found in this **FAQ Sheet**.
All manuscripts (and other items you'd like to publish) must be submitted to support@stanfordphs.freshdesk.com for approval prior to journal submission. We will check your cell sizes and citations.
For more information about how to cite PHS and PHS datasets, please visit:
https://phsdocs.developerhub.io/need-help/citing-phs-data-core
This dataset contains Hospital General Information from the U.S. Department of Health & Human Services. This is the BigQuery COVID-19 public dataset. This data contains a list of all hospitals that have been registered with Medicare. This list includes addresses, phone numbers, hospital types and quality of care information. The quality of care data is provided for over 4,000 Medicare-certified hospitals, including over 130 Veterans Administration (VA) medical centers, across the country. You can use this data to find hospitals and compare the quality of their care.
You can use the BigQuery Python client library to query tables in this dataset in Kernels. Note that methods available in Kernels are limited to querying data. Tables are at bigquery-public-data.cms_medicare.hospital_general_info.
How do the hospitals in Mountain View, CA compare to the average hospital in the US? With the hospital compare data you can quickly understand how hospitals in one geographic location compare to another location. In this example query we compare Google’s home in Mountain View, California, to the average hospital in the United States. You can also modify the query to learn how the hospitals in your city compare to the US national average.
#standardSQL
SELECT
  MTV_AVG_HOSPITAL_RATING,
  US_AVG_HOSPITAL_RATING
FROM (
  SELECT
    ROUND(AVG(CAST(hospital_overall_rating AS INT64)), 2) AS MTV_AVG_HOSPITAL_RATING
  FROM
    `bigquery-public-data.cms_medicare.hospital_general_info`
  WHERE
    city = 'MOUNTAIN VIEW'
    AND state = 'CA'
    AND hospital_overall_rating <> 'Not Available') MTV
JOIN (
  SELECT
    ROUND(AVG(CAST(hospital_overall_rating AS INT64)), 2) AS US_AVG_HOSPITAL_RATING
  FROM
    `bigquery-public-data.cms_medicare.hospital_general_info`
  WHERE
    hospital_overall_rating <> 'Not Available')
ON
  1 = 1
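The query above can be run from a notebook with the BigQuery Python client library mentioned earlier; a minimal sketch (it assumes GCP credentials with BigQuery access are configured):

```python
from google.cloud import bigquery

client = bigquery.Client()  # requires GCP credentials with BigQuery access

query = """
SELECT
  MTV_AVG_HOSPITAL_RATING,
  US_AVG_HOSPITAL_RATING
FROM (
  SELECT ROUND(AVG(CAST(hospital_overall_rating AS INT64)), 2) AS MTV_AVG_HOSPITAL_RATING
  FROM `bigquery-public-data.cms_medicare.hospital_general_info`
  WHERE city = 'MOUNTAIN VIEW'
    AND state = 'CA'
    AND hospital_overall_rating <> 'Not Available') MTV
JOIN (
  SELECT ROUND(AVG(CAST(hospital_overall_rating AS INT64)), 2) AS US_AVG_HOSPITAL_RATING
  FROM `bigquery-public-data.cms_medicare.hospital_general_info`
  WHERE hospital_overall_rating <> 'Not Available')
ON 1 = 1
"""

# Execute and fetch the single comparison row into a pandas DataFrame.
df = client.query(query).to_dataframe()
print(df)
```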
What are the most common diseases treated at hospitals that do well in the category of patient readmissions?
For hospitals that achieved “Above the national average” in the category of patient readmissions, it might be interesting to review the types of diagnoses treated at those inpatient facilities. While this query won’t provide the granular detail that went into the readmission calculation, it gives a quick glimpse into the top diagnosis-related groups (DRGs), or classifications of inpatient stays, found at those hospitals. By joining the general hospital information to the inpatient charge data, also provided by CMS, you can quickly identify DRGs that may warrant additional research. You can also modify the query to review the top diagnosis-related groups for hospital metrics you might be interested in.
#standardSQL
SELECT
  drg_definition,
  SUM(total_discharges) AS total_discharge_per_drg
FROM
  `bigquery-public-data.cms_medicare.hospital_general_info` gi
INNER JOIN
  `bigquery-public-data.cms_medicare.inpatient_charges_2015` ic
ON
  gi.provider_id = ic.provider_id
WHERE
  readmission_national_comparison = 'Above the national average'
GROUP BY
  drg_definition
ORDER BY
  total_discharge_per_drg DESC
LIMIT
  10;
https://choosealicense.com/licenses/gpl-3.0/
Dataset Card for [Dataset Name]
Dataset Summary
This data set contains over 6,000 medical terms and their Wikipedia text. It is intended to be used for a downstream task that requires medical terms and their Wikipedia explanations.
Dataset Structure
Data Instances
[More Information Needed]
Data Fields
[More Information Needed]
Data Splits
[More Information Needed]
Dataset Creation
Curation Rationale
[More… See the full description on the dataset page: https://huggingface.co/datasets/shankarsubramony/medwikidataset.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We include the sets of adversarial questions for each of the seven EquityMedQA datasets (OMAQ, EHAI, FBRT-Manual, FBRT-LLM, TRINDS, CC-Manual, and CC-LLM), the three other non-EquityMedQA datasets used in this work (HealthSearchQA, Mixed MMQA-OMAQ, and Omiye et al.), as well as the data generated as a part of the empirical study, including the generated model outputs (Med-PaLM 2 [1] primarily, with Med-PaLM [2] answers for pairwise analyses) and ratings from human annotators (physicians, health equity experts, and consumers). See the paper for details on all datasets.
We include other datasets evaluated in this work: HealthSearchQA [2], Mixed MMQA-OMAQ, and Omiye et al. [3].
A limited number of data elements described in the paper are not included here. The following elements are excluded:
The reference answers written by physicians to HealthSearchQA questions, introduced in [2], and the set of corresponding pairwise ratings. This accounts for 2,122 rated instances.
The free-text comments written by raters during the ratings process.
Demographic information associated with the consumer raters (only age group information is included).
[1] Singhal, K., et al. Towards expert-level medical question answering with large language models. arXiv preprint arXiv:2305.09617 (2023).
[2] Singhal, K., Azizi, S., Tu, T., et al. Large language models encode clinical knowledge. Nature 620, 172–180 (2023). https://doi.org/10.1038/s41586-023-06291-2
[3] Omiye, J.A., Lester, J.C., Spichak, S., et al. Large language models propagate race-based medicine. npj Digit. Med. 6, 195 (2023). https://doi.org/10.1038/s41746-023-00939-z
[4] Abacha, Asma Ben, et al. "Overview of the medical question answering task at TREC 2017 LiveQA." TREC. 2017.
[5] Abacha, Asma Ben, et al. "Bridging the gap between consumers’ medication questions and trusted answers." MEDINFO 2019: Health and Wellbeing e-Networks for All. IOS Press, 2019. 25–29.
Independent Ratings [ratings_independent.csv]: Contains ratings of the presence of bias and its dimensions in Med-PaLM 2 outputs using the independent assessment rubric for each of the datasets studied. The primary response regarding the presence of bias is encoded in the column bias_presence with three possible values (No bias, Minor bias, Severe bias). Binary assessments of the dimensions of bias are encoded in separate columns (e.g., inaccuracy_for_some_axes). Instances for the Mixed MMQA-OMAQ dataset are triple-rated for each rater group; other datasets are single-rated. Ratings were missing for five instances in MMQA-OMAQ and two instances in CC-Manual. This file contains 7,519 rated instances.
Paired Ratings [ratings_pairwise.csv]: Contains comparisons of the presence or degree of bias and its dimensions in Med-PaLM and Med-PaLM 2 outputs for each of the datasets studied. Pairwise responses are encoded in terms of two binary columns corresponding to which of the answers was judged to contain a greater degree of bias (e.g., Med-PaLM-2_answer_more_bias). Dimensions of bias are encoded in the same way as for ratings_independent.csv. Instances for the Mixed MMQA-OMAQ dataset are triple-rated for each rater group; other datasets are single-rated. Four ratings were missing (one for EHAI, two for FBRT-Manual, one for FBRT-LLM). This file contains 6,446 rated instances.
Counterfactual Paired Ratings [ratings_counterfactual.csv]: Contains ratings under the counterfactual rubric for pairs of questions defined in the CC-Manual and CC-LLM datasets. Contains a binary assessment of the presence of bias (bias_presence), columns for each dimension of bias, and categorical columns corresponding to other elements of the rubric (ideal_answers_diff, how_answers_diff). Instances for the CC-Manual dataset are triple-rated; instances for CC-LLM are single-rated. Due to a data processing error, we removed questions that refer to "Natal" from the analysis of the counterfactual rubric on the CC-Manual dataset. This affects three questions (corresponding to 21 pairs) derived from one seed question based on the TRINDS dataset. This file contains 1,012 rated instances.
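As an example of working with these files, the sketch below tabulates the independent-rubric bias labels; the bias_presence column and its three values are documented above, while any per-dataset identifier column is an assumption to verify against the file header.

```python
import pandas as pd

ratings = pd.read_csv("ratings_independent.csv")

# 'bias_presence' and its values (No bias / Minor bias / Severe bias)
# are documented above.
print(ratings["bias_presence"].value_counts())

# A per-dataset breakdown would look like the following, assuming a
# hypothetical 'dataset' column (check the actual header):
# print(ratings.groupby("dataset")["bias_presence"].value_counts())
```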
Open-ended Medical Adversarial Queries (OMAQ) [equitymedqa_omaq.csv]: Contains questions that compose the OMAQ dataset. The OMAQ dataset was first described in [1].
Equity in Health AI (EHAI) [equitymedqa_ehai.csv]: Contains questions that compose the EHAI dataset.
Failure-Based Red Teaming - Manual (FBRT-Manual) [equitymedqa_fbrt_manual.csv]: Contains questions that compose the FBRT-Manual dataset.
Failure-Based Red Teaming - LLM (FBRT-LLM); full [equitymedqa_fbrt_llm.csv]: Contains questions that compose the extended FBRT-LLM dataset.
Failure-Based Red Teaming - LLM (FBRT-LLM) [equitymedqa_fbrt_llm_661_sampled.csv]: Contains questions that compose the sampled FBRT-LLM dataset used in the empirical study.
TRopical and INfectious DiseaseS (TRINDS) [equitymedqa_trinds.csv]: Contains questions that compose the TRINDS dataset.
Counterfactual Context - Manual (CC-Manual) [equitymedqa_cc_manual.csv]: Contains pairs of questions that compose the CC-Manual dataset.
Counterfactual Context - LLM (CC-LLM) [equitymedqa_cc_llm.csv]: Contains pairs of questions that compose the CC-LLM dataset.
HealthSearchQA [other_datasets_healthsearchqa.csv]: Contains questions sampled from the HealthSearchQA dataset [1,2].
Mixed MMQA-OMAQ [other_datasets_mixed_mmqa_omaq]: Contains questions that compose the Mixed MMQA-OMAQ dataset.
Omiye et al. [other_datasets_omiye_et_al]: Contains questions proposed in Omiye et al. [3].
Version 2: Updated to include ratings and generated model outputs. Dataset files were updated to include unique IDs associated with each question.
Version 1: Contained datasets of questions without ratings; consistent with v1 available as a preprint on arXiv (https://arxiv.org/abs/2403.12025).
WARNING: These datasets contain adversarial questions designed specifically to probe biases in AI systems. They can include human-written and model-generated language and content that may be inaccurate, misleading, biased, disturbing, sensitive, or offensive.
NOTE: the content of this research repository (i) is not intended to be a medical device; and (ii) is not intended for clinical use of any kind, including but not limited to diagnosis or prognosis.
The Medical Care Cost Recovery National Database (MCCR NDB) provides a repository of summary Medical Care Collections Fund (MCCF) billing and collection information used by program management to compare facility performance. It stores summary information for Veterans Health Administration (VHA) receivables including the number of receivables and their summarized status information. This database is used to monitor the status of the VHA's collection process and to provide visibility on the types of bills and collections being done by the Department. The objective of the VA MCCF Program is to collect reimbursement from third party health insurers and co-payments from certain non-service-connected (NSC) Veterans for the cost of medical care furnished to Veterans. Legislation has authorized VHA to: submit claims to and recover payments from Veterans' third party health insurance carriers for treatment of non-service-connected conditions; recover co-payments from certain Veterans for treatment of non-service-connected conditions; and recover co-payments for medications from certain Veterans for treatment of non-service-connected conditions. All of the information captured in the MCCR NDB is derived from the Accounts Receivable (AR) modules running at each medical center. MCCR NDB is not used for official collections figures; instead, the Department uses the Financial Management System (FMS).
Archived as of 5/30/2025: The datasets will no longer receive updates, but the historical data will continue to be available for download. This dataset provides information related to mothers with a live birth during the time period 07/2016 to 07/2020 and their claims 2 years prior and 2 years post delivery. It contains information about the overall number of claims and the overall total dollar amount, total claims prebirthing and postbirthing, and total dollar amount prebirthing and postbirthing, by mother’s county of residence at the time of delivery. Maternal health claims are defined as claims with mothers diagnosed with at least one of the following ICD codes: 650, V270, V272, V273, V275, V276, V3000, V3100, V3200, V3300, V3400, V3500, V3600, V3700, V3900, O80, Z370, Z372, Z373, Z3750, Z3751, Z3752, Z3753, Z3754, Z3759, Z3760, Z3761, Z3762, Z3763, Z3764, Z3769, Z3800, Z382, Z385, Z3830, Z3830, Z3861, Z3863, Z3865, Z3868, Z388, V7242, V220, V239, V221, V222, V230, V232, V234, V2341, V2342, V724, V237, V279, V6511, V241, V242, V251, V723, V762, Z37, Z370, Z371, Z372, Z373, Z374, Z375, Z3750, Z3751, Z3752, Z3753, Z3754, Z3759, Z376, Z3760, Z3761, Z3762, Z3763, Z3764, Z3769, Z377, Z379, Z34, Z340, Z3400, Z3401, Z3402, Z3403, Z348, Z3480, Z3481, Z3482, Z3483, Z349, Z3490, Z3491, Z3492, Z3493, O09, O090, O0900, O0901, O0902, O0903, O091, O0910, O0911, O0912, O0913, O09A, O09A0, O09A1, O09A2, O09A3, O092, O0921, O09211, O09212, O09213, O09219, O0929, O09291, O09292, O09293, O09299, O093, O0930, O0931, O0932, O0933, O094, O0940, O0941, O0942, O0943, O095, O0951, O09511, O09512, O09513, O09519, O0952, O09521, O09522, O09523, O09529, O096, O0961, O09611, O09612, O09613, O09619, O0962, O09621, O09622, O09623, O09629, O097, O0970, O0971, O0972, O0973, O098, O0981, O09811, O09812, O09813, O09819, O0982, O09821, O09822, O09823, O09829, O0989, O09891, O09892, O09893, O09899, O099, O0990, O0991, O0992, O0993. A maternal health claim is also defined as a claim with at least one of the following CPT codes: 59025, 59424, 59425, 59426, 76818, 88291, 59400, 59409, 59410, 59510, 59514, 59515, 59610, 59612, 59614, 59618, 59620, 59622, 57170, 58300, 59430, 88141, 88142, 88143, 88147, 88148, 88150, 88152, 88153, 88154, 88155, 88164, 88165, 88166, 88167, 88174, 88175. Prebirthing is restricted to claims with a service date within 2 years before the delivery date of the child. Postbirthing is restricted to claims with a service date within 2 years after the delivery date of the child. This data is for research purposes and is not intended to be used for reporting. Due to differences in geographic aggregation, time period considerations, and units of analysis, these numbers may differ from those reported by FSSA.
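To make the claim definition concrete, here is a hedged sketch of flagging maternal health claims using the listed codes; the claims-table layout (diagnosis_code, cpt_code, service_date, delivery_date columns) is an assumption for illustration, and only a small subset of the code lists is shown.

```python
import pandas as pd

# Illustrative subsets of the ICD and CPT lists above (truncated).
MATERNAL_ICD = {"650", "V270", "O80", "Z370", "Z372", "O0900", "Z3400"}
MATERNAL_CPT = {"59400", "59409", "59410", "59510", "59610", "76818"}

# Hypothetical claims table; the column names below are assumptions,
# not a published schema.
claims = pd.read_csv("claims.csv", dtype=str)
claims["service_date"] = pd.to_datetime(claims["service_date"])
claims["delivery_date"] = pd.to_datetime(claims["delivery_date"])

maternal = claims[
    claims["diagnosis_code"].isin(MATERNAL_ICD)
    | claims["cpt_code"].isin(MATERNAL_CPT)
].copy()

# Prebirthing: service date within 2 years before delivery;
# postbirthing: within 2 years after.
delta = maternal["service_date"] - maternal["delivery_date"]
two_years = pd.Timedelta(days=730)
maternal["prebirthing"] = (delta < pd.Timedelta(0)) & (delta >= -two_years)
maternal["postbirthing"] = (delta >= pd.Timedelta(0)) & (delta <= two_years)
print(maternal[["prebirthing", "postbirthing"]].sum())
```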
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background: In Brazil, studies that map electronic healthcare databases in order to assess their suitability for use in pharmacoepidemiologic research are lacking. We aimed to identify, catalogue, and characterize Brazilian data sources for Drug Utilization Research (DUR).
Methods: The present study is part of the project entitled “Publicly Available Data Sources for Drug Utilization Research in Latin American (LatAm) Countries.” A network of Brazilian health experts was assembled to map secondary administrative data from healthcare organizations that might provide information related to medication use. A multi-phase approach including an internet search of institutional government websites, traditional bibliographic databases, and experts’ input was used for mapping the data sources. The reviewers searched, screened, and selected the data sources independently; disagreements were resolved by consensus. Data sources were grouped into the following categories: 1) automated databases; 2) Electronic Medical Records (EMR); 3) national surveys or datasets; 4) adverse event reporting systems; and 5) others. Each data source was characterized by accessibility, geographic granularity, setting, type of data (aggregate or individual-level), and years of coverage. We also searched for publications related to each data source.
Results: A total of 62 data sources were identified and screened; 38 met the eligibility criteria for inclusion and were fully characterized. We grouped 23 (60%) as automated databases, four (11%) as adverse event reporting systems, four (11%) as EMRs, three (8%) as national surveys or datasets, and four (11%) as other types. Eighteen (47%) were classified as publicly and conveniently accessible online, providing information at the national level. Most of them offered more than 5 years of comprehensive data coverage and presented data at both the individual and aggregated levels. No information about population coverage was found. Drug coding is not uniform; each data source has its own coding system, depending on the purpose of the data. At least one scientific publication was found for each publicly available data source.
Conclusions: There are several types of data sources for DUR in Brazil, but a uniform system for drug classification and data quality evaluation does not exist. The extent of population covered by year is unknown. Our comprehensive and structured inventory reveals a need for full characterization of these data sources.
The Medicare Physician & Other Practitioners by Provider and Service dataset provides information on use, payments, and submitted charges organized by National Provider Identifier (NPI), Healthcare Common Procedure Coding System (HCPCS) code, and place of service. Note: This full dataset contains more records than most spreadsheet programs can handle, which will result in an incomplete load of data. Use of a database or statistical software is required.
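Since the note above warns that the full file exceeds spreadsheet limits, a chunked read is one practical approach. A pandas sketch follows, with the file name and column name as assumptions to adjust to the actual download.

```python
import pandas as pd
from collections import Counter

# Stream the large file in chunks instead of loading it whole.
# File and column names are assumptions -- adjust to the actual download.
totals = Counter()
for chunk in pd.read_csv("medicare_physician_by_provider_and_service.csv",
                         chunksize=500_000):
    totals.update(chunk["HCPCS_Cd"].value_counts().to_dict())

# Ten most frequent HCPCS codes across all records.
print(totals.most_common(10))
```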
Background
Our aim was to compare access to effective care among elderly Medicare patients in a Staff Model and Group Model HMO and in Fee-for-Service (FFS) care.
Methods
We used a retrospective cohort study design, using claims and automated medical record data to compare achievement on quality indicators for elderly Medicare recipients. Secondary data were collected from 1) HMO data sets and 2) Medicare claims files for the time period 1994–95. All subjects were Medicare enrollees in a defined area of New England: those enrolled in two divisions of a managed care plan with different physician payment arrangements: a staff model, and a group model; and the Medicare FFS population. We abstracted information on indicators covering several domains: preventive, diagnosis-specific, and chronic disease care.
Results
On the indicators we created and tested, access in the single managed care plan under study was comparable to or better than FFS care in the same geographic region. Percent of Medicare recipients with breast cancer screening was 36 percentage points higher in the staff model versus FFS (95% confidence interval 34–38 percentage points). Follow up after hospitalization for myocardial infarction was 20 percentage points higher in the group model than in FFS (95% confidence interval 14–26 percentage points).
Conclusion
According to indicators developed for use in both claims and automated medical record data, access to care for elderly Medicare beneficiaries in one large managed care organization was as good as or better than that in FFS care in the same geographic area.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Here, you will find resources to use the Bynum-Standard 1-Year Algorithm, including a README file that accompanies SAS and Stata scripts for the 1-Year Standard Method for identifying Alzheimer’s Disease and Related Dementias (ADRD) in Medicare Claims data. There are seven script files (plus a parameters file for SAS [parm.sas]) for both SAS and Stata. The files are numbered in the order in which they should be run; the five “1” files may be run in any order.
The full algorithm requires access to a single year of Medicare Claims data for (1) MedPAR, (2) the Home Health Agency (HHA) Claims File, (3) the Hospice Claims File, (4) the Carrier Claims and Line Files, and (5) the Hospital Outpatient File (HOF) Claims and Revenue Files. All Medicare Claims files are expected to be in SAS format (.sas7bdat).
For each data source, the script will output three files*:
Diagnosis-level file: Lists individual ADRD diagnoses for each beneficiary for a given visit. This file allows researchers to identify which ICD-9-CM or ICD-10-CM codes are used in the claims data.
Service Date-level file: Aggregated from the Diagnosis-level file, this file includes all beneficiaries with an ADRD diagnosis by Service Date (date of a claim with at least one ADRD diagnosis).
Beneficiary-level file: Aggregated from the Service Date-level file, this file includes all beneficiaries with at least one ADRD diagnosis at any point in the year within a specific file.
*The algorithm combines the Carrier and HOF files at the Service Date-level. The final combined Carrier and HOF Beneficiary-level file includes those with at least two (2) claims that are seven (7) or more days apart.
A final combined file is created by merging all Beneficiary-level files. This file is used to identify beneficiaries with ADRD and can be merged onto other files by the Beneficiary ID (BENE_ID).
With appreciation and acknowledgement to colleagues from a grant funded by the NIA for their involvement in the development and validation of the Bynum-Standard 1-Year Algorithm.
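For the combined Carrier/HOF rule ("at least two claims that are seven or more days apart"), a minimal pandas sketch is shown below; the file name is hypothetical, BENE_ID follows the text, and service_date is an assumed header to check against the script outputs.

```python
import pandas as pd

# Service Date-level file for the combined Carrier + HOF sources.
# 'BENE_ID' follows the text above; 'service_date' is an assumed header.
svc = pd.read_csv("carrier_hof_service_date_level.csv",
                  parse_dates=["service_date"])

def qualifies(dates: pd.Series) -> bool:
    """True if there are 2+ ADRD claim dates spanning 7 or more days."""
    d = dates.drop_duplicates().sort_values()
    return len(d) >= 2 and (d.iloc[-1] - d.iloc[0]).days >= 7

qualified = svc.groupby("BENE_ID")["service_date"].apply(qualifies)
bene_ids = qualified[qualified].index
print(f"{len(bene_ids)} beneficiaries meet the 2-claims / 7-days rule")
```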
The Medicare Home Health Agency tables provide use and payment data for home health agencies. The tables include use and expenditure data from home health Part A (Hospital Insurance) and Part B (Medical Insurance) claims. For additional information on enrollment, providers, and Medicare use and payment, visit the CMS Program Statistics page. These data do not exist in a machine-readable format, so the view data and API options are not available. Please use the download function to access the data. Below is the list of tables:
MDCR HHA 1. Medicare Home Health Agencies: Utilization and Program Payments for Original Medicare Beneficiaries, by Type of Entitlement, Yearly Trend
MDCR HHA 2. Medicare Home Health Agencies: Utilization and Program Payments for Original Medicare Beneficiaries, by Demographic Characteristics and Medicare-Medicaid Enrollment Status
MDCR HHA 3. Medicare Home Health Agencies: Utilization and Program Payments for Original Medicare Beneficiaries, by Area of Residence
MDCR HHA 4. Medicare Home Health Agencies: Persons with Utilization and Total Service Visits for Original Medicare Beneficiaries, by Type of Agency and Type of Service Visit
MDCR HHA 5. Medicare Home Health Agencies: Persons with Utilization and Total Service Visits for Original Medicare Beneficiaries, by Type of Control and Type of Service Visit
MDCR HHA 6. Medicare Home Health Agencies: Persons with Utilization, Total Service Visits, and Program Payments for Original Medicare Beneficiaries, by Number of Service Visits and Number of Episodes