https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub [source]
The MedQuad dataset provides a comprehensive source of medical questions and answers for natural language processing. With over 43,000 patient inquiries from real-life situations categorized into 31 distinct types of questions, the dataset offers an invaluable opportunity to research correlations between treatments, chronic diseases, medical protocols and more. Answers provided in this database come not only from doctors but also other healthcare professionals such as nurses and pharmacists, providing a more complete array of responses to help researchers unlock deeper insights within the realm of healthcare. This incredible trove of knowledge is just waiting to be mined - so grab your data mining equipment and get exploring!
To get the most out of this dataset, start with the column names and what each one holds: qtype (the type of medical question), Question (the question itself), and Answer (the expert response). The qtype column lets you narrow the dataset to the question topics you care about. Once you have filtered by qtype, decide what you want to learn, for example “What treatments do most patients search for?” or “Are there any correlations between chronic conditions and protocols?” A simple query such as SELECT Answer FROM MedQuad WHERE qtype='Treatment' AND Question LIKE '%pain%' then gets you closer to an answer.
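The query quoted above can be tried out with Python's built-in sqlite3 module. This is a minimal sketch, not part of the dataset's tooling: the in-memory table, the helper name `query_answers`, and the three sample rows are all made up for illustration, and the real rows would come from train.csv.

```python
import sqlite3

def query_answers(rows, qtype, keyword):
    """Return Answer values whose qtype matches and whose Question contains keyword."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE MedQuad (qtype TEXT, Question TEXT, Answer TEXT)")
    conn.executemany("INSERT INTO MedQuad VALUES (?, ?, ?)", rows)
    # Parameterized form of:
    #   SELECT Answer FROM MedQuad WHERE qtype='Treatment' AND Question LIKE '%pain%'
    # Note: the qtype comparison is case-sensitive; LIKE is case-insensitive for ASCII.
    cur = conn.execute(
        "SELECT Answer FROM MedQuad WHERE qtype = ? AND Question LIKE ?",
        (qtype, f"%{keyword}%"),
    )
    answers = [row[0] for row in cur.fetchall()]
    conn.close()
    return answers

# Hypothetical placeholder rows, not real dataset content.
SAMPLE_ROWS = [
    ("Treatment", "What are the treatments for back pain?", "placeholder answer 1"),
    ("Treatment", "What are the treatments for asthma?", "placeholder answer 2"),
    ("Symptoms", "What are the symptoms of chest pain?", "placeholder answer 3"),
]

if __name__ == "__main__":
    print(query_answers(SAMPLE_ROWS, "Treatment", "pain"))
```

Passing the values as parameters rather than pasting them into the SQL string avoids quoting problems when a keyword contains an apostrophe.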
Once you have drawn new insights about healthcare from the answers in this dynamic dataset, put them to use: for example, develop educational materials that address the patient needs you identified. If you need more criteria for querying, check whether MedQuad offers additional columns; extra columns may be added over time that could extend what you can analyze.
Finally, when you publish or deploy work built on this dataset, follow proper citation etiquette and give credit where credit is due.
- Developing medical diagnostic tools that use natural language processing (NLP) to better identify and diagnose health conditions in patients.
- Creating predictive models to anticipate treatment options for different medical conditions using machine learning techniques.
- Leveraging the dataset to build chatbots and virtual assistants that are able to answer a broad range of questions about healthcare with expert-level accuracy.
If you use this dataset in your research, please credit the original authors.
License: CC0 1.0 Universal (CC0 1.0), Public Domain Dedication. No copyright: you can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission.
File: train.csv

| Column name | Description |
|:------------|:------------|
| qtype | The type of medical question. (String) |
| Question | The medical question posed by the patient. (String) |
| Answer | The expert response to the medical question. (String) |
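As a rough illustration of working with the three columns above, the sketch below counts how many rows fall under each qtype using only the standard library. The inline CSV text is a hypothetical stand-in with placeholder values; the helper name `count_qtypes` is likewise an assumption.

```python
import csv
import io
from collections import Counter

def count_qtypes(csv_file):
    """Count rows per qtype in a CSV with qtype, Question and Answer columns."""
    reader = csv.DictReader(csv_file)
    return Counter(row["qtype"] for row in reader)

# Hypothetical stand-in for train.csv; the values are placeholders.
SAMPLE_CSV = """qtype,Question,Answer
treatment,placeholder question 1,placeholder answer 1
symptoms,placeholder question 2,placeholder answer 2
treatment,placeholder question 3,placeholder answer 3
"""

counts = count_qtypes(io.StringIO(SAMPLE_CSV))

if __name__ == "__main__":
    for qtype, n in counts.most_common():
        print(qtype, n)
```

To run it on the real file, one would pass an open file handle instead, for example `open("train.csv", newline="", encoding="utf-8")`, assuming the file sits in the working directory.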
If you use this dataset in your research, please also credit Huggingface Hub.
The all-processed dataset is a concatenation of the medical-meadow-* and chatdoctor_healthcaremagic datasets. In the chatdoctor_healthcaremagic dataset, the term Chat Doctor is replaced by the term chatbot. Similar to the literature, the medical_meadow_cord19 dataset is subsampled to 50,000 samples. truthful-qa-* is a benchmark dataset for evaluating the truthfulness of models in text generation, which is used in the Llama 2 paper. Within this dataset, there are 55 and 16 questions related to Health and… See the full description on the dataset page: https://huggingface.co/datasets/lavita/medical-qa-datasets.
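The three preparation steps described for all-processed (concatenate, replace the term, subsample) can be sketched as follows. This is an illustration under stated assumptions, not the maintainers' script: samples are treated as plain strings, the literal text "Chat Doctor" is assumed to be what gets replaced, the seed is arbitrary, and the part name `medical_meadow_other` in the usage line is hypothetical.

```python
import random
from typing import Dict, List

def build_all_processed(
    meadow: Dict[str, List[str]],
    healthcaremagic: List[str],
    cord19_cap: int = 50_000,
    seed: int = 0,
) -> List[str]:
    """Concatenate the medical-meadow parts with chatdoctor_healthcaremagic."""
    rng = random.Random(seed)  # fixed seed is an assumption, for repeatability
    combined: List[str] = []
    for name, samples in meadow.items():
        # Only the cord19 part is subsampled, and only if it exceeds the cap.
        if name == "medical_meadow_cord19" and len(samples) > cord19_cap:
            samples = rng.sample(samples, cord19_cap)
        combined.extend(samples)
    # Assumed literal replacement of the "Chat Doctor" term with "chatbot".
    combined.extend(text.replace("Chat Doctor", "chatbot") for text in healthcaremagic)
    return combined

if __name__ == "__main__":
    demo = build_all_processed(
        {"medical_meadow_cord19": [f"c{i}" for i in range(10)],
         "medical_meadow_other": ["a", "b"]},
        ["You are Chat Doctor."],
        cord19_cap=4,
    )
    print(len(demo), demo[-1])
```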
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Have you ever wondered where medical chatbots or intelligent search engines for health information get their knowledge? The answer lies in large datasets like MedQuAD! This rich resource provides a treasure trove of real-world medical questions and informative answers, paving the way for advancements in Natural Language Processing (NLP) and Information Retrieval (IR) within the healthcare domain.
MedQuAD, short for Medical Question Answering Dataset, is a collection of question-answer pairs meticulously curated from 12 trusted National Institutes of Health (NIH) websites. These websites cover a wide range of health topics, from cancer.gov to GARD (Genetic and Rare Diseases Information Resource).
Beyond the sheer volume of data, MedQuAD offers unique features that empower researchers and developers:
MedQuAD serves as a valuable springboard for various applications in the medical NLP and IR field. Here are some potential uses:
In essence, MedQuAD is a powerful tool for unlocking the potential of NLP and IR in the medical domain. By leveraging this rich dataset, researchers and developers are paving the way for a future where individuals can access accurate and comprehensive health information with increasing ease and efficiency.
Reference:
If you use the MedQuAD dataset or the associated QA test collection, please cite the following paper: Ben Abacha, A., & Demner-Fushman, D. (2019). A Question-Entailment Approach to Question Answering. BMC Bioinformatics, 20(1), 511. https://doi.org/10.1186/s12859-019-3119-4
Ajayaadhi/Medical-QA dataset hosted on Hugging Face and contributed by the HF Datasets community
https://creativecommons.org/publicdomain/zero/1.0/
About the dataset
The dataset was created from two datasets, MedQuAD and doctor_qa_bangla.
Dataset making procedure:
- Took 500 samples from both the MedQuAD and doctor_qa_bangla datasets.
- Merged the samples.
- Randomly shuffled the samples.
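The procedure above can be sketched in a few lines of Python. The sketch assumes that 500 samples are drawn from each source without replacement and that a fixed seed is acceptable; neither detail is stated in the description. The function name and the stand-in records are hypothetical.

```python
import random

def build_mixed_sample(medquad, doctor_qa_bangla, n_per_source=500, seed=0):
    """Sample from both sources, merge the samples, then shuffle them."""
    rng = random.Random(seed)  # seed chosen arbitrarily for repeatability
    merged = rng.sample(medquad, n_per_source) + rng.sample(doctor_qa_bangla, n_per_source)
    rng.shuffle(merged)  # in-place shuffle of the merged list
    return merged

# Stand-in records tagged with their source; not real dataset rows.
medquad_rows = [("medquad", i) for i in range(600)]
bangla_rows = [("doctor_qa_bangla", i) for i in range(600)]

mixed = build_mixed_sample(medquad_rows, bangla_rows)

if __name__ == "__main__":
    print(len(mixed), mixed[:3])
```

Tagging each record with its source before merging makes it easy to confirm afterwards that both datasets are equally represented.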
Dataset Card for "medical-qa-shared-task-v1-toy-eval"
More Information needed
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for Dataset Name
This dataset card aims to be a base template for new datasets. It has been generated using this raw template.
Dataset Details
Dataset Description
- Curated by: [More Information Needed]
- Funded by [optional]: [More Information Needed]
- Shared by [optional]: [More Information Needed]
- Language(s) (NLP): [More Information Needed]
- License: [More Information Needed]
Dataset Sources [optional]
Repository: [More… See the full description on the dataset page: https://huggingface.co/datasets/TUDB-Labs/medical-qa.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
This comprehensive dataset contains 87,930 medical questions and answers, meticulously compiled from the "medical" website. It offers a unique focus on Arabic language, catering specifically to research and development in medical natural language processing and AI in Arabic-speaking regions.
Arabic Language Focus: As an Arabic dataset, it offers a valuable resource for developing and testing AI models in a language that is underrepresented in medical NLP research.
Structured for Machine Learning: The data is organized into three distinct sets:
- Training Data: The largest portion, designed for AI models to learn and identify patterns.
- Validation Data: A separate set for fine-tuning and optimizing model parameters.
- Test Data: A final set to evaluate the performance and accuracy of models in a realistic setting.
https://github.com/MIT-LCP/license-and-dua/tree/master/drafts
This dataset was designed and created to enable advancements in healthcare-focused large language models, particularly in the context of retrieval-augmented clinical question-answering capabilities. Developed using a self-constructed pipeline based on the 13-billion parameter Meta Llama 2 model, this dataset encompasses 21466 medical discharge summaries extracted from the MIMIC-IV-Note dataset, with 156599 synthetically generated question-and-answer pairs, a subset of which was verified for accuracy by a physician. These pairs were generated by providing the model with a discharge summary and instructing it to generate question-and-answer pairs based on the contextual information present in the summaries. This work aims to generate data in support of the development of compact large language models capable of efficiently extracting information from medical notes and discharge summaries, thus enabling potential improvements for real-time decision-making processes in clinical settings. Additionally, accompanying the dataset is code facilitating question-and-answer pair generation from any medical and non-medical text. Despite the robustness of the presented dataset, it has certain limitations. The generation process was confined to a maximum context length of 6000 input tokens, owing to hardware constraints. The large language model's nature in generating these question-and-answer pairs may introduce an underlying bias or a lack in diversity and complexity. Future iterations should focus on rectifying these issues, possibly through diversified training and expanded verification procedures as well as the employment of more powerful large language models.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset was created by mouadenna
Released under Apache 2.0
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset was created by umama aleem
Released under Apache 2.0
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The XBMU Chinese–Tibetan Medical QA Dataset is the first large-scale bilingual medical question-answering dataset constructed by Northwest Minzu University. It contains a total of 40,274 parallel question-answer pairs, covering six major clinical fields: otorhinolaryngology, ophthalmology, internal medicine, neurology, surgery, and nutrition and healthcare. The data are derived from real medical consultation texts and have undergone multiple rounds of cleaning, de-identification, standardization, and expert review to ensure privacy compliance and semantic accuracy. Each sample includes the fields question_zh, answer_zh, and the corresponding question_bo, answer_bo, maintaining a consistent structure. The dataset supports both Chinese medical question-answering tasks and Chinese-Tibetan machine translation research and can be used for multilingual QA generation, terminology alignment, cross-lingual knowledge transfer, and model robustness evaluation.
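Because every sample carries the same four fields, one record can feed both the QA task and the translation task. The sketch below shows one way to derive both views; the helper names and the placeholder strings are assumptions, and only the four field names come from the description above.

```python
from typing import Dict, List, Tuple

Record = Dict[str, str]

def qa_pair(record: Record, lang: str) -> Tuple[str, str]:
    """Return (question, answer) for one language code: 'zh' or 'bo'."""
    return record[f"question_{lang}"], record[f"answer_{lang}"]

def translation_pairs(record: Record) -> List[Tuple[str, str]]:
    """Return (Chinese, Tibetan) text pairs for machine translation."""
    return [
        (record["question_zh"], record["question_bo"]),
        (record["answer_zh"], record["answer_bo"]),
    ]

# Placeholder strings stand in for the real Chinese and Tibetan text.
SAMPLE: Record = {
    "question_zh": "<zh question>",
    "answer_zh": "<zh answer>",
    "question_bo": "<bo question>",
    "answer_bo": "<bo answer>",
}

if __name__ == "__main__":
    print(qa_pair(SAMPLE, "zh"))
    print(translation_pairs(SAMPLE))
```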
Starlord1010/Medical-QA-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community
This dataset was created by mohamed ardif
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
This dataset was created by Henry41148
Released under MIT
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
MedQuAD Dataset (11 medical websites via XML files). MedQuAD (Medical Question Answering Dataset) was released by the U.S. National Library of Medicine (NLM). The dataset contains approximately 47,000 question–answer pairs covering symptoms, causes, diagnosis, treatment, prevention, prognosis, and follow-up, each paired with authoritative answers from NIH/NLM websites. Domain: Biomedical / Healthcare. It spans multiple diseases, conditions, and treatments, and is suitable for training a general-purpose medical QA system. • Each file contains multiple
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
MedQuAD - Medical Question Answering Dataset
Dataset Overview
MedQuAD (Medical Question Answering Dataset) is a collection of 16,407 medical question-answer pairs derived from 9 NIH websites. It covers 37 question types related to diseases, drugs, medical tests, and treatments. This dataset is useful for building medical question-answering models, retrieval-augmented generation (RAG) systems, and other NLP applications in the healthcare domain.
Dataset… See the full description on the dataset page: https://huggingface.co/datasets/mukulb/clustered_MEDQUAD_dataset_with_groups.
MIRAGE consists of 7,663 questions from 5 commonly used medical QA datasets:
MMLU-Med: A medical QA dataset with 1,089 questions, extracted from six biomedical topics in MMLU, including anatomy, clinical knowledge, professional medicine, human genetics, university medicine, and 996 questions on university-level biology.
MedQA-US: A medical QA dataset focusing on the real-world English subset of MedQA. The questions are sourced from the United States Medical Licensing Examination (USMLE), totaling 1,273 multiple-choice questions with four answer options.
MedMCQA: A medical QA dataset containing 4,183 questions from Indian medical entrance exams. The dataset is selected from the development set of the original MedMCQA.
PubMedQA: A biomedical research QA dataset. PubMedQA* is constructed by removing some contexts in 500 expert-annotated test samples from PubMedQA. The questions may have answers as "Yes," "No," or "Maybe," reflecting the degree of certainty based on scientific literature.
BioASQ-Y/N: A biomedical research QA dataset containing 618 Yes/No questions extracted from the BioASQ Task B test set over the past five years (2019-2023). In this dataset, the original factual snippets have been removed.
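The per-dataset counts listed above can be checked against the stated total of 7,663. The short sketch below does that arithmetic, taking PubMedQA* as 500 questions on the assumption that each of the 500 expert-annotated test samples is one question; the variable names are illustrative.

```python
# Question counts as given in the list above.
MIRAGE_SIZES = {
    "MMLU-Med": 1089,
    "MedQA-US": 1273,
    "MedMCQA": 4183,
    "PubMedQA*": 500,   # assumed: one question per expert-annotated sample
    "BioASQ-Y/N": 618,
}

total = sum(MIRAGE_SIZES.values())

# Share of the benchmark contributed by each dataset, in percent.
shares = {name: 100 * n / total for name, n in MIRAGE_SIZES.items()}

if __name__ == "__main__":
    print("total:", total)
    for name, pct in shares.items():
        print(f"{name}: {pct:.1f}%")
```

The sum comes to 7,663, matching the headline figure, and MedMCQA alone accounts for more than half of the questions.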
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
This dataset was created by Henry41148
Released under MIT
Dataset Card for "medical-qa-shared-task-v1-half"
More Information needed