MedMCQA is a large-scale, Multiple-Choice Question Answering (MCQA) dataset designed to address real-world medical entrance exam questions.
MedMCQA contains more than 194k high-quality AIIMS & NEET PG entrance exam MCQs covering 2.4k healthcare topics and 21 medical subjects, with an average token length of 12.77 and high topical diversity.
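A minimal sketch of loading MedMCQA with the Hugging Face datasets library; the repository id openlifescienceai/medmcqa and the field names are assumptions based on the public release, so verify them against the copy you use:

```python
from datasets import load_dataset

# Repository id is an assumption based on the public MedMCQA release.
medmcqa = load_dataset("openlifescienceai/medmcqa", split="train")

# Field names (question, opa..opd, cop) are assumptions from the public
# card; cop is the integer index of the correct option.
item = medmcqa[0]
options = [item["opa"], item["opb"], item["opc"], item["opd"]]
print(item["question"])
print("Correct:", options[item["cop"]])
```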
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Multiple Choice Questions and Large Language Models: A Case Study with Fictional Medical Data
This multiple-choice question dataset on a fictional organ, the Glianorex, is used to assess models' ability to answer questions about knowledge they have never encountered. We provide only a test set, since training models on this dataset would defeat the purpose of isolating linguistic capabilities from knowledge.
Motivation
We designed this dataset to evaluate the… See the full description on the dataset page: https://huggingface.co/datasets/maximegmd/glianorex.
This paper introduces FrenchMedMCQA, the first publicly available Multiple-Choice Question Answering (MCQA) dataset in French for the medical domain. It is composed of 3,105 questions taken from real exams of the French medical specialization diploma in pharmacy, mixing single and multiple answers. Each instance of the dataset contains an identifier, a question, five possible answers and their manual correction(s). We also propose first baseline models to automatically process this MCQA task, in order to report on current performance and to highlight the difficulty of the task. A detailed analysis of the results showed that representations adapted to the medical domain or to the MCQA task are necessary: in our case, English specialized models yielded better results than generic French ones, even though FrenchMedMCQA is in French. Corpus, models and tools are available online.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Dataset Card for Greek Medical Multiple Choice QA
The Greek Medical Multiple Choice QA dataset is a set of 2034 multiple choice questions in Greek for the medical exams of the Hellenic National Academic Recognition and Information Center (DOATAP-ΔΟΑΤΑΠ). The questions were extracted from past exams available at https://www.doatap.gr.
Dataset Details
Dataset Description
Curated by: ILSP/Athena RC
Language(s) (NLP): el
License: cc-by-nc-sa-4.0… See the full description on the dataset page: https://huggingface.co/datasets/ilsp/medical_mcqa_greek.
Dataset Card for "MedQuAD"
This dataset is the converted version of MedQuAD. Some notes about the data:
- Multiple values in the umls_cui, umls_semantic_types, and synonyms columns are separated by the | character.
- Answers for the GARD, MPlusHerbsSupplements, ADAM, and MPlusDrugs sources (31,034 records) were removed from the original dataset to respect the MedlinePlus copyright.
- UMLS (umls): Unified Medical Language System
- CUI (cui): Concept Unique Identifier
Question type discrepancies… See the full description on the dataset page: https://huggingface.co/datasets/lavita/MedQuAD.
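As a minimal sketch of unpacking the |-separated columns described above (the repository id comes from the card; exact column availability should be verified against the loaded split):

```python
from datasets import load_dataset

medquad = load_dataset("lavita/MedQuAD", split="train")

def split_multi(value):
    """Split a |-separated field into a list; tolerate empty or missing values."""
    return value.split("|") if value else []

row = medquad[0]
print(split_multi(row.get("umls_cui")))
print(split_multi(row.get("synonyms")))
```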
ODC-BY License: https://choosealicense.com/licenses/odc-by/
agentlans/finewebedu-multiple-choice dataset hosted on Hugging Face and contributed by the HF Datasets community
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Open domain question answering (OpenQA) tasks have recently been attracting increasing attention from the natural language processing (NLP) community. In this work, we present the first free-form multiple-choice OpenQA dataset for solving medical problems, MedQA, collected from professional medical board exams. It covers three languages: English, simplified Chinese, and traditional Chinese, and contains 12,723, 34,251, and 14,123 questions for the three languages, respectively. We implement both rule-based and popular neural methods by sequentially combining a document retriever and a machine comprehension model. Through experiments, we find that even the current best method achieves only 36.7%, 42.0%, and 70.1% test accuracy on the English, traditional Chinese, and simplified Chinese questions, respectively. We expect MedQA to present great challenges to existing OpenQA systems and hope that it can serve as a platform to promote much stronger OpenQA models from the NLP community in the future.
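As a hedged illustration of the retriever-plus-reader pipeline the abstract describes, here is a minimal TF-IDF retrieval sketch; the toy corpus, the question, and the use of scikit-learn are illustrative stand-ins, not the paper's actual setup:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-ins for the evidence corpus and one exam question.
docs = [
    "Scurvy results from vitamin C deficiency.",
    "Insulin lowers blood glucose levels.",
    "Warfarin inhibits vitamin K dependent clotting factors.",
]
question = "Which vitamin deficiency causes scurvy?"

# Rank documents by TF-IDF cosine similarity to the question; the top
# hit would then be passed to a machine comprehension model.
vectorizer = TfidfVectorizer().fit(docs + [question])
doc_vecs = vectorizer.transform(docs)
q_vec = vectorizer.transform([question])
best = cosine_similarity(q_vec, doc_vecs).argmax()
print(docs[best])
```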
Llama 3.1 License: https://choosealicense.com/licenses/llama3.1/
Medprompt-MedMCQA-ToT
Dataset Summary
Medprompt-MedMCQA-ToT is a retrieval-augmented database designed to enhance contextual reasoning in multiple-choice medical question answering (MCQA). The dataset follows a Tree-of-Thoughts (ToT) reasoning format, where multiple independent reasoning paths are explored collaboratively before arriving at the correct answer. This structured… See the full description on the dataset page: https://huggingface.co/datasets/HPAI-BSC/Medprompt-MedMCQA-ToT.
MMLU (Massive Multitask Language Understanding) is a new benchmark designed to measure knowledge acquired during pretraining by evaluating models exclusively in zero-shot and few-shot settings. This makes the benchmark more challenging and more similar to how we evaluate humans. The benchmark covers 57 subjects across STEM, the humanities, the social sciences, and more. It ranges in difficulty from an elementary level to an advanced professional level, and it tests both world knowledge and problem solving ability. Subjects range from traditional areas, such as mathematics and history, to more specialized areas like law and ethics. The granularity and breadth of the subjects make the benchmark ideal for identifying a model’s blind spots.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
MedMCQA MCQA Dataset
This dataset contains the MedMCQA dataset converted to Multiple Choice Question Answering (MCQA) format.
Dataset Description
MedMCQA is a large-scale, Multiple-Choice Question Answering (MCQA) dataset designed to address real-world medical entrance exam questions. It covers various medical subjects and topics, making it ideal for evaluating AI systems on medical knowledge.
Dataset Structure
Each example contains:
question: The medical… See the full description on the dataset page: https://huggingface.co/datasets/RikoteMaster/medmcqa-mcqa.
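Since the card's field list is truncated above, here is a hedged sketch of rendering one MCQA-format record as a prompt; the field names question, choices, and answer are assumptions to adapt to the actual schema:

```python
def format_mcqa_prompt(item):
    """Render one MCQA item as a lettered multiple-choice prompt.

    Field names are assumptions; adapt them to the dataset's real schema.
    """
    letters = "ABCD"
    lines = [item["question"]]
    lines += [f"{letters[i]}. {choice}" for i, choice in enumerate(item["choices"])]
    lines.append("Answer:")
    return "\n".join(lines)

example = {
    "question": "Which vitamin deficiency causes scurvy?",
    "choices": ["Vitamin A", "Vitamin B12", "Vitamin C", "Vitamin D"],
    "answer": "C",
}
print(format_mcqa_prompt(example))
```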
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
A Comparative Study of Open-Source Large Language Models
Dataset Overview
Welcome to the dataset repository for our paper, "A Comparative Study of Open-Source Large Language Models, GPT-4 and Claude 2: Multiple-Choice Test Taking in Nephrology." The preprint of the paper can be accessed here.
Files
This repository contains two key files:
NEJM_All_Questions_And_Answers.csv: This file includes all the questions and corresponding answers used in the… See the full description on the dataset page: https://huggingface.co/datasets/SeanWu25/NEJM-AI_Benchmarking_Medical_Language_Models.
Overview
MILU (Multi-task Indic Language Understanding Benchmark) is a comprehensive evaluation dataset designed to assess the performance of Large Language Models (LLMs) across 11 Indic languages. It spans 8 domains and 42 subjects, reflecting both general and culturally specific knowledge from India.
Key Features
- Languages: Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Odia, Punjabi, Tamil, Telugu, and English
- Domains: 8 diverse domains including Arts & Humanities, Social Sciences, STEM, and more
- Subjects: 42 subjects covering a wide range of topics
- Questions: ~85,000 multiple-choice questions
- Cultural Relevance: Incorporates India-specific knowledge from regional and state-level examinations
Dataset Statistics

| Language  | Total Questions | Translated Questions | Avg Words Per Question |
|-----------|-----------------|----------------------|------------------------|
| Bengali   | 7138  | 1601  | 15.72 |
| Gujarati  | 5327  | 2755  | 16.69 |
| Hindi     | 15450 | 115   | 20.63 |
| Kannada   | 6734  | 1522  | 12.83 |
| Malayalam | 4670  | 1534  | 12.82 |
| Marathi   | 7424  | 1235  | 18.8  |
| Odia      | 5025  | 1452  | 15.63 |
| Punjabi   | 4363  | 2341  | 19.9  |
| Tamil     | 7059  | 1524  | 13.32 |
| Telugu    | 7847  | 1298  | 16.13 |
| English   | 14036 | -     | 22.01 |
| Total     | 85073 | 15377 | 16.77 (avg) |
Dataset Structure
Test Set
The test set consists of the MILU (Multi-task Indic Language Understanding) benchmark, which contains approximately 85,000 multiple-choice questions across 11 Indic languages.
Validation Set
The dataset includes a separate validation set of 9,157 samples that can be used for few-shot examples during evaluation. This validation set was created by sampling from each of the 42 subject tags, which were then condensed into 8 broader domains. This approach ensures a balanced representation across subjects and domains, allowing for consistent few-shot prompting across different models and experiments.
Subjects spanning MILU

| Domain | Subjects |
|--------|----------|
| Arts & Humanities | Architecture and Design, Arts and Culture, Education, History, Language Studies, Literature and Linguistics, Media and Communication, Music and Performing Arts, Religion and Spirituality |
| Business Studies | Business and Management, Economics, Finance and Investment |
| Engineering & Tech | Energy and Power, Engineering, Information Technology, Materials Science, Technology and Innovation, Transportation and Logistics |
| Environmental Sciences | Agriculture, Earth Sciences, Environmental Science, Geography |
| Health & Medicine | Food Science, Health and Medicine |
| Law & Governance | Defense and Security, Ethics and Human Rights, Law and Ethics, Politics and Governance |
| Math and Sciences | Astronomy and Astrophysics, Biology, Chemistry, Computer Science, Logical Reasoning, Mathematics, Physics |
| Social Sciences | Anthropology, International Relations, Psychology, Public Administration, Social Welfare and Development, Sociology, Sports and Recreation |
Usage
Since this is a gated dataset, once your access request has been accepted you can set your Hugging Face token:

```bash
export HF_TOKEN=YOUR_TOKEN_HERE
```
To load the MILU dataset for a language:

```python
from datasets import load_dataset

language = 'Hindi'

# Use the 'test' split for evaluation and the 'validation' split for few-shot examples.
split = 'test'

language_data = load_dataset("ai4bharat/MILU", data_dir=language, split=split, token=True)
print(language_data[0])
```
Evaluation
We evaluated 45 different LLMs on MILU, including:

- Closed proprietary models (e.g., GPT-4o, Gemini-1.5)
- Open-source multilingual models
- Language-specific fine-tuned models
Key findings:
- GPT-4o achieved the highest average accuracy at 72%
- Open multilingual models outperformed language-specific fine-tuned models
- Models performed better in high-resource languages compared to low-resource ones
- Performance was lower in culturally relevant areas (e.g., Arts & Humanities) compared to general fields like STEM
For detailed results and analysis, please refer to our paper.
Citation
If you use MILU in your research, please cite our paper:

```bibtex
@misc{verma2024milumultitaskindiclanguage,
      title={MILU: A Multi-task Indic Language Understanding Benchmark},
      author={Sshubam Verma and Mohammed Safi Ur Rahman Khan and Vishwajeet Kumar and Rudra Murthy and Jaydeep Sen},
      year={2024},
      eprint={2411.02538},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2411.02538},
}
```
License
This dataset is released under the MIT License.
Contact
For any questions or feedback, please contact:

- Sshubam Verma (sshubamverma@ai4bharat.org)
- Mohammed Safi Ur Rahman Khan (safikhan@ai4bharat.org)
- Rudra Murthy (rmurthyv@in.ibm.com)
- Vishwajeet Kumar (vishk024@in.ibm.com)
Links

- GitHub Repository
- Paper
- Hugging Face Dataset
Dataset Card for Swedish Medical Exam MCQs
Dataset Description
This dataset contains multiple-choice questions from Swedish medical exams.
Languages
The dataset is in Swedish (sv).
Dataset Structure
Each entry in the dataset contains the following fields:
- question: The question
- options: An array of possible answers
- answer: The correct answer
- language: The language of the question (always "sv" for Swedish)
- country: The country of… See the full description on the dataset page: https://huggingface.co/datasets/serhany/swedish-medical-exams-mcq-1002-json.
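A small sanity-check sketch over records with the fields listed above; the filename is hypothetical, since the card does not specify the file layout, and a top-level JSON list of objects is assumed:

```python
import json

# Hypothetical filename; the card does not specify the file layout.
with open("swedish_medical_exams.json", encoding="utf-8") as f:
    records = json.load(f)  # assumed: a top-level list of objects

# Verify that every correct answer actually appears among its options.
for i, rec in enumerate(records):
    assert rec["answer"] in rec["options"], f"record {i}: answer not in options"
```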
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Medprompt-MedQA-R1
Medprompt-MedQA-R1 is a reasoning-augmented database designed for context retrieval in multiple-choice medical question answering. The dataset supports the development and evaluation of AI systems tailored to healthcare, particularly in tasks requiring enhanced contextual reasoning and retrieval-based assistance. By including structured reasoning and verified responses… See the full description on the dataset page: https://huggingface.co/datasets/HPAI-BSC/Medprompt-MedQA-R1.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Card for Internal Medicine MCQ
Dataset Details
Dataset Description
This dataset consists of 41 high-quality, two-choice multiple-choice questions (MCQs) focused on core biomedical knowledge and clinical scenarios from internal medicine. These questions were specifically curated for research evaluating medical knowledge, clinical reasoning, and confidence-based interactions among medical trainees and large language models (LLMs).
Curated by: Tom Sheffer… See the full description on the dataset page: https://huggingface.co/datasets/tomshe/Internal_Medicine_questions_binary.
Medical textbook question answering
This corpus contains multiple-choice quiz questions for 13 commonly used medical textbooks. The questions are designed to examine understanding of the main concepts in the textbooks. The QA data are used to evaluate the knowledge learned by language models in the following paper:
Paper: Conditional language learning with context
Data Splits
subjects: anatomy, biochemistry, cell biology, gynecology, histology, immunology… See the full description on the dataset page: https://huggingface.co/datasets/winder-hybrids/MedicalTextbook_QA.
Attribution-NonCommercial 2.0 (CC BY-NC 2.0): https://creativecommons.org/licenses/by-nc/2.0/
License information was derived automatically
KorMedMCQA: Multi-Choice Question Answering Benchmark for Korean Healthcare Professional Licensing Examinations
We present KorMedMCQA, the first Korean Medical Multiple-Choice Question Answering benchmark, derived from professional healthcare licensing examinations conducted in Korea between 2012 and 2024. The dataset contains 7,469 questions from examinations for doctor, nurse, pharmacist, and dentist, covering a wide range of medical disciplines. We evaluate the performance of 59… See the full description on the dataset page: https://huggingface.co/datasets/sean0042/KorMedMCQA.
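A hedged sketch of loading one licensing-exam track with the Hugging Face datasets library; the config name 'doctor' is an assumption drawn from the exam tracks named above and should be checked against the repository's actual configurations:

```python
from datasets import load_dataset

# 'doctor' is an assumed config name based on the exam tracks in the card.
kormedmcqa = load_dataset("sean0042/KorMedMCQA", "doctor", split="test")
print(kormedmcqa[0])
```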
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for MedQA with Prompts
Dataset Summary
This dataset is a modified version of the MedQA dataset, enhanced with additional prompts to provide context or instructions for each question. It is designed to improve the performance of models on medical question-answering tasks.
Dataset Structure
Data Fields
- question: The medical question posed.
- options: A dictionary containing multiple-choice options.
- answer: The correct answer to the… See the full description on the dataset page: https://huggingface.co/datasets/jaswanth27/test.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Card for Pediatrics MCQ
Dataset Details
Dataset Description
This dataset comprises high-quality multiple-choice questions (MCQs) covering core biomedical knowledge and clinical scenarios from pediatrics. It includes 50 questions, each with four possible answer choices. These questions were specifically curated for research evaluating pediatric medical knowledge, clinical reasoning, and confidence-based interactions among medical trainees and large… See the full description on the dataset page: https://huggingface.co/datasets/tomshe/Pediatrics_questions.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for MMLU
Dataset Summary
Measuring Massive Multitask Language Understanding by Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt (ICLR 2021). This is a massive multitask test consisting of multiple-choice questions from various branches of knowledge. The test spans subjects in the humanities, social sciences, hard sciences, and other areas that are important for some people to learn. This covers 57 tasks… See the full description on the dataset page: https://huggingface.co/datasets/cais/mmlu.
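For illustration, a sketch of loading one MMLU subject with the Hugging Face datasets library; the subject config 'anatomy' and the field names reflect the public cais/mmlu card, but treat them as assumptions to verify against your copy:

```python
from datasets import load_dataset

mmlu_anatomy = load_dataset("cais/mmlu", "anatomy", split="test")

item = mmlu_anatomy[0]
# Per the public card: 'question' (str), 'choices' (list of 4 str),
# 'answer' (int index into choices).
print(item["question"])
print(item["choices"][item["answer"]])
```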