Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically.
Note: these datasets were uploaded on March 19th, 2024 by Hugging Face staff from the ancillary files attached to the original arXiv submission.
Supplementary Material for "A Toolbox for Surfacing Health Equity Harms and Biases in Large Language Models"
Paper: https://arxiv.org/abs/2403.12025
Dataset page: https://huggingface.co/datasets/katielink/EquityMedQA
We include adversarial questions for each of the seven EquityMedQA datasets: OMAQ, EHAI, FBRT-Manual, FBRT-LLM, TRINDS, CC-Manual, and CC-LLM. For FBRT-LLM, we include both the full set and the sampled subset used in the empirical study.
We include the sets of adversarial questions for each of the seven EquityMedQA datasets (OMAQ, EHAI, FBRT-Manual, FBRT-LLM, TRINDS, CC-Manual, and CC-LLM) and the three other non-EquityMedQA datasets used in this work (HealthSearchQA, Mixed MMQA-OMAQ, and Omiye et al.). We also include the data generated as part of the empirical study: the generated model outputs (primarily Med-PaLM 2 [1], with Med-PaLM [2] answers for pairwise analyses) and ratings from human annotators (physicians, health equity experts, and consumers). See the paper for details on all datasets.
We include other datasets evaluated in this work: HealthSearchQA [2], Mixed MMQA-OMAQ, and Omiye et al. [3].
A small number of data elements described in the paper are not included here:
The reference answers written by physicians to HealthSearchQA questions, introduced in [2], and the set of corresponding pairwise ratings. This accounts for 2,122 rated instances.
The free-text comments written by raters during the ratings process.
Demographic information associated with the consumer raters (only age group information is included).
[1] Singhal, K., et al. Towards expert-level medical question answering with large language models. arXiv preprint arXiv:2305.09617 (2023).
[2] Singhal, K., Azizi, S., Tu, T., et al. Large language models encode clinical knowledge. Nature 620, 172–180 (2023). https://doi.org/10.1038/s41586-023-06291-2
[3] Omiye, J.A., Lester, J.C., Spichak, S., et al. Large language models propagate race-based medicine. npj Digit. Med. 6, 195 (2023). https://doi.org/10.1038/s41746-023-00939-z
[4] Ben Abacha, A., et al. Overview of the medical question answering task at TREC 2017 LiveQA. TREC (2017).
[5] Ben Abacha, A., et al. Bridging the gap between consumers' medication questions and trusted answers. MEDINFO 2019: Health and Wellbeing e-Networks for All. IOS Press, 25–29 (2019).
Independent Ratings [ratings_independent.csv]: Contains ratings of the presence of bias and its dimensions in Med-PaLM 2 outputs, using the independent assessment rubric, for each of the datasets studied. The primary judgment of the presence of bias is encoded in the bias_presence column, with three possible values (No bias, Minor bias, Severe bias). Binary assessments of the dimensions of bias are encoded in separate columns (e.g., inaccuracy_for_some_axes). Instances for the Mixed MMQA-OMAQ dataset are triple-rated for each rater group; other datasets are single-rated. Ratings were missing for five instances in Mixed MMQA-OMAQ and two instances in CC-Manual. This file contains 7,519 rated instances.
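For orientation, a minimal pandas sketch of how this file's schema could be inspected (assuming the CSV sits in the working directory; only the column names described above are used):

```python
import pandas as pd

# Load the independent-rubric ratings (file name as described above).
ratings = pd.read_csv("ratings_independent.csv")

# Tally the primary judgment: "No bias", "Minor bias", or "Severe bias".
print(ratings["bias_presence"].value_counts())

# Dimension columns are binary (e.g., inaccuracy_for_some_axes); the mean
# gives the fraction of rated instances flagged on that dimension.
print(ratings["inaccuracy_for_some_axes"].mean())
```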
Paired Ratings [ratings_pairwise.csv]: Contains comparisons of the presence or degree of bias and its dimensions in Med-PaLM and Med-PaLM 2 outputs for each of the datasets studied. Pairwise responses are encoded in two binary columns indicating which answer was judged to contain a greater degree of bias (e.g., Med-PaLM-2_answer_more_bias). Dimensions of bias are encoded in the same way as in ratings_independent.csv. Instances for the Mixed MMQA-OMAQ dataset are triple-rated for each rater group; other datasets are single-rated. Four ratings were missing (one for EHAI, two for FBRT-Manual, one for FBRT-LLM). This file contains 6,446 rated instances.
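A similar sketch for the pairwise file. Note that only Med-PaLM-2_answer_more_bias is named above; the counterpart column name for Med-PaLM is an assumption for illustration and may differ in the actual file:

```python
import pandas as pd

pairwise = pd.read_csv("ratings_pairwise.csv")

# Med-PaLM-2_answer_more_bias is documented above; the Med-PaLM counterpart
# column name below is an assumed stand-in.
palm2_more = pairwise["Med-PaLM-2_answer_more_bias"] == 1
palm1_more = pairwise["Med-PaLM_answer_more_bias"] == 1

# Collapse the two binary flags into a three-way outcome per rated instance.
outcome = pd.Series("neither/tie", index=pairwise.index)
outcome[palm1_more] = "Med-PaLM answer more bias"
outcome[palm2_more] = "Med-PaLM 2 answer more bias"
print(outcome.value_counts())
```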
Counterfactual Paired Ratings [ratings_counterfactual.csv]: Contains ratings under the counterfactual rubric for pairs of questions defined in the CC-Manual and CC-LLM datasets. Includes a binary assessment of the presence of bias (bias_presence), columns for each dimension of bias, and categorical columns corresponding to other elements of the rubric (ideal_answers_diff, how_answers_diff). Instances for the CC-Manual dataset are triple-rated; instances for CC-LLM are single-rated. Due to a data processing error, we removed questions that refer to "Natal" from the analysis of the counterfactual rubric on the CC-Manual dataset; this affects three questions (corresponding to 21 pairs) derived from one seed question based on the TRINDS dataset. This file contains 1,012 rated instances.
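As with the other ratings files, a small sketch (column names taken from the description above) showing how a categorical rubric element can be related to the binary bias judgment:

```python
import pandas as pd

cf = pd.read_csv("ratings_counterfactual.csv")

# Cross-tabulate whether the ideal answers for the question pair should
# differ against the binary bias judgment.
print(pd.crosstab(cf["ideal_answers_diff"], cf["bias_presence"]))
```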
Open-ended Medical Adversarial Queries (OMAQ) [equitymedqa_omaq.csv]: Contains questions that compose the OMAQ dataset. The OMAQ dataset was first described in [1].
Equity in Health AI (EHAI) [equitymedqa_ehai.csv]: Contains questions that compose the EHAI dataset.
Failure-Based Red Teaming - Manual (FBRT-Manual) [equitymedqa_fbrt_manual.csv]: Contains questions that compose the FBRT-Manual dataset.
Failure-Based Red Teaming - LLM (FBRT-LLM); full [equitymedqa_fbrt_llm.csv]: Contains questions that compose the full FBRT-LLM dataset.
Failure-Based Red Teaming - LLM (FBRT-LLM); sampled [equitymedqa_fbrt_llm_661_sampled.csv]: Contains questions that compose the sampled FBRT-LLM subset used in the empirical study.
TRopical and INfectious DiseaseS (TRINDS) [equitymedqa_trinds.csv]: Contains questions that compose the TRINDS dataset.
Counterfactual Context - Manual (CC-Manual) [equitymedqa_cc_manual.csv]: Contains pairs of questions that compose the CC-Manual dataset.
Counterfactual Context - LLM (CC-LLM) [equitymedqa_cc_llm.csv]: Contains pairs of questions that compose the CC-LLM dataset.
HealthSearchQA [other_datasets_healthsearchqa.csv]: Contains questions sampled from the HealthSearchQA dataset [1,2].
Mixed MMQA-OMAQ [other_datasets_mixed_mmqa_omaq]: Contains questions that compose the Mixed MMQA-OMAQ dataset.
Omiye et al. [other_datasets_omiye_et_al]: Contains questions proposed in Omiye et al. [3].
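To work with all seven EquityMedQA question sets together, a minimal sketch using the file names listed above (no column names are assumed; the two counterfactual files contain question pairs, so their schemas may differ, and pandas will align any mismatched columns with NaNs):

```python
import pandas as pd

# File names as listed above. The sampled FBRT-LLM subset is omitted
# because it duplicates rows of the full set.
files = {
    "OMAQ": "equitymedqa_omaq.csv",
    "EHAI": "equitymedqa_ehai.csv",
    "FBRT-Manual": "equitymedqa_fbrt_manual.csv",
    "FBRT-LLM": "equitymedqa_fbrt_llm.csv",
    "TRINDS": "equitymedqa_trinds.csv",
    "CC-Manual": "equitymedqa_cc_manual.csv",
    "CC-LLM": "equitymedqa_cc_llm.csv",
}

frames = []
for name, path in files.items():
    df = pd.read_csv(path)
    df["dataset"] = name  # tag each row with its source dataset
    frames.append(df)

all_questions = pd.concat(frames, ignore_index=True)
print(all_questions["dataset"].value_counts())
```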
Version 2: Updated to include ratings and generated model outputs; dataset files were updated to include unique IDs associated with each question.
Version 1: Contained the question datasets without ratings, consistent with v1 of the arXiv preprint (https://arxiv.org/abs/2403.12025).
WARNING: These datasets contain adversarial questions designed specifically to probe biases in AI systems. They can include human-written and model-generated language and content that may be inaccurate, misleading, biased, disturbing, sensitive, or offensive.
NOTE: the content of this research repository (i) is not intended to be a medical device; and (ii) is not intended for clinical use of any kind, including but not limited to diagnosis or prognosis.