Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Details
The dataset has about 1 million tokens for training and about 1,500 question-answer pairs.
Dataset Description
This dataset is a comprehensive compilation of questions related to dermatology, spanning inquiries about various skin diseases, their symptoms, recommended medications, and available treatment modalities. Each question is paired with a concise and informative response, making it an ideal resource for training and fine-tuning language models in the… See the full description on the dataset page: https://huggingface.co/datasets/Mreeb/Dermatology-Question-Answer-Dataset-For-Fine-Tuning.
Malikeh1375/medical-question-answering-datasets dataset hosted on Hugging Face and contributed by the HF Datasets community
https://choosealicense.com/licenses/unknown/
SubjQA is a question answering dataset that focuses on subjective questions and answers. The dataset consists of roughly 10,000 questions over reviews from 6 different domains: books, movies, grocery, electronics, TripAdvisor (i.e. hotels), and restaurants.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Dataset Card for SQuAD
Dataset Summary
Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. SQuAD 1.1 contains 100,000+ question-answer pairs on 500+ articles.
Supported Tasks and Leaderboards
Question Answering.… See the full description on the dataset page: https://huggingface.co/datasets/rajpurkar/squad.
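The span-based answer format described in the SQuAD summary above can be illustrated with a small sketch. The record below is made up for illustration; the field names (`context`, `question`, `answers` with `text` and `answer_start`) follow the layout commonly used for SQuAD-style data, and `span_matches` is a hypothetical helper name, not part of any library.

```python
def span_matches(record):
    """Return True if every answer text equals the context slice at its offset."""
    context = record["context"]
    answers = record["answers"]
    return all(
        context[start:start + len(text)] == text
        for text, start in zip(answers["text"], answers["answer_start"])
    )


# Hypothetical passage and answer, not taken from the dataset.
context = "The Amazon rainforest covers much of the Amazon basin of South America."
answer = "much of the Amazon basin"

record = {
    "context": context,
    "question": "What does the Amazon rainforest cover?",
    # answer_start is the character offset of the answer inside the context.
    "answers": {"text": [answer], "answer_start": [context.index(answer)]},
}

print(span_matches(record))
```

A check like this is a cheap way to catch offset drift after any preprocessing that changes the context string, such as whitespace normalization.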
https://choosealicense.com/licenses/gpl-3.0/
Quora Question Answer Dataset (Quora-QuAD) contains 56,402 question-answer pairs scraped from Quora.
Usage:
For instructions on fine-tuning a model (Flan-T5) with this dataset, please check out the article: https://www.toughdata.net/blog/post/finetune-flan-t5-question-answer-quora-dataset
https://choosealicense.com/licenses/other/
Dataset Card for "wiki_qa"
Dataset Summary
Wiki Question Answering corpus from Microsoft. The WikiQA corpus is a publicly available set of question and sentence pairs, collected and annotated for research on open-domain question answering.
Supported Tasks and Leaderboards
More Information Needed
Languages
More Information Needed
Dataset Structure
Data Instances
default
Size of downloaded dataset files: 7.10 MB Size… See the full description on the dataset page: https://huggingface.co/datasets/microsoft/wiki_qa.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
GooAQ is a large-scale dataset with a variety of answer types. This dataset contains over 5 million questions and 3 million answers collected from Google. GooAQ questions are collected semi-automatically from the Google search engine using its autocomplete feature. This results in naturalistic questions of practical interest that are nonetheless short and expressed using simple language. GooAQ answers are mined from Google's responses to our collected questions, specifically from the answer boxes in the search results. This yields a rich space of answer types, containing both textual answers (short and long) as well as more structured ones such as collections.
Attribution-NonCommercial-ShareAlike 3.0 (CC BY-NC-SA 3.0): https://creativecommons.org/licenses/by-nc-sa/3.0/
License information was derived automatically
FQuAD: French Question Answering Dataset. We introduce FQuAD, a native French question answering dataset. FQuAD contains 25,000+ question-answer pairs. Fine-tuning CamemBERT on FQuAD yields an F1 score of 88% and an exact match of 77.9%.
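The two metrics quoted for FQuAD, exact match and F1, are usually computed per question over answer tokens. The following is a minimal sketch of that idea, not the official FQuAD evaluation script: the function names are illustrative, and the normalization here (lowercasing and whitespace splitting) is a simplifying assumption, whereas real evaluation scripts typically also strip punctuation and articles.

```python
from collections import Counter


def normalize(text):
    """Lowercase and split on whitespace (a simplified normalizer)."""
    return text.lower().split()


def exact_match(prediction, reference):
    """1.0 if the normalized token sequences are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def f1_score(prediction, reference):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction)
    ref_tokens = normalize(reference)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("la tour Eiffel", "La tour Eiffel"))
print(f1_score("la tour", "la tour Eiffel"))
```

Dataset-level scores are then the average of these per-question values, usually taking the maximum over all reference answers when a question has several.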
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Question Answering in Context is a dataset for modeling, understanding, and participating in information seeking dialog. Data instances consist of an interactive dialog between two crowd workers: (1) a student who poses a sequence of freeform questions to learn as much as possible about a hidden Wikipedia text, and (2) a teacher who answers the questions by providing short excerpts (spans) from the text. QuAC introduces challenges not found in existing machine comprehension datasets: its questions are often more open-ended, unanswerable, or only meaningful within the dialog context.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ArXiv QA
(TBD) Automated ArXiv question answering via large language models
Automated Question Answering with ArXiv Papers
Latest 25 Papers
LIME: Localized Image Editing via Attention Regularization in Diffusion Models - [Arxiv] [QA]
Revisiting Depth Completion from a Stereo Matching Perspective for Cross-domain Generalization - [Arxiv] [QA]
VL-GPT: A Generative Pre-trained Transformer for Vision and… See the full description on the dataset page: https://huggingface.co/datasets/taesiri/arxiv_qa.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Dataset Card for TweetQA
Dataset Summary
As social media becomes an increasingly popular place where news and real-time events are reported, developing automated question answering systems is critical to the effectiveness of many applications that rely on real-time knowledge. While previous question answering (QA) datasets have concentrated on formal text like news and Wikipedia, TweetQA is presented as the first large-scale dataset for QA over social media data. To make sure… See the full description on the dataset page: https://huggingface.co/datasets/ucsbnlp/tweet_qa.
CNCF QA Dataset for LLM Tuning
Description
This dataset, named cncf-qa-dataset-for-llm-tuning, is designed for fine-tuning large language models (LLMs) and is formatted in a question-answer (QA) style. The data is sourced from PDF and Markdown (MD) files extracted from various project repositories within the CNCF (Cloud Native Computing Foundation) landscape. These files were processed and converted into a QA format to be fed into an LLM. The dataset includes the… See the full description on the dataset page: https://huggingface.co/datasets/Kubermatic/cncf-question-and-answer-dataset-for-llm-training.
Dataset Card for Natural Questions
This dataset is a collection of question-answer pairs from the Natural Questions dataset. See Natural Questions for additional information. This dataset can be used directly with Sentence Transformers to train embedding models.
Dataset Subsets
pair subset
Columns: "query", "answer" Column types: str, str Examples: { 'query': 'the si unit of the electric field is', 'answer': 'Electric field An electric field is a field… See the full description on the dataset page: https://huggingface.co/datasets/sentence-transformers/natural-questions.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for MedMCQA
Dataset Summary
MedMCQA is a large-scale, Multiple-Choice Question Answering (MCQA) dataset designed to address real-world medical entrance exam questions. MedMCQA has more than 194k high-quality AIIMS & NEET PG entrance exam MCQs covering 2.4k healthcare topics and 21 medical subjects, with an average token length of 12.77 and high topical diversity. Each sample contains a question, correct answer(s), and other options which require… See the full description on the dataset page: https://huggingface.co/datasets/openlifescienceai/medmcqa.
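A sample made of a question, its options, and the correct answer can be turned into a prompt with a few lines of code. This is a sketch under stated assumptions: the field names `question`, `opa` to `opd`, and `cop` (taken here as a zero-based index of the correct option) are assumptions about the dataset layout and should be checked against the dataset page; the sample itself is made up, and `format_mcq` is a hypothetical helper.

```python
def format_mcq(sample):
    """Render a multiple-choice sample as a prompt and return the answer letter."""
    letters = "ABCD"
    # Assumed field names for the four options; verify against the dataset page.
    options = [sample["opa"], sample["opb"], sample["opc"], sample["opd"]]
    lines = [sample["question"]]
    lines += [f"{letter}. {option}" for letter, option in zip(letters, options)]
    # Assumption: "cop" is a zero-based index into the options.
    return "\n".join(lines), letters[sample["cop"]]


# Hypothetical sample, not taken from MedMCQA.
sample = {
    "question": "Which vitamin deficiency causes scurvy?",
    "opa": "Vitamin A",
    "opb": "Vitamin B12",
    "opc": "Vitamin C",
    "opd": "Vitamin D",
    "cop": 2,
}

prompt, answer = format_mcq(sample)
print(prompt)
print("Answer:", answer)
```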
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Card for "squad"
Dataset Summary
Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.
Supported Tasks and Leaderboards
More Information Needed
Languages
More Information Needed
Dataset… See the full description on the dataset page: https://huggingface.co/datasets/badokorach/NewQA.
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
MLQA (MultiLingual Question Answering) is a benchmark dataset for evaluating cross-lingual question answering performance. MLQA consists of over 5K extractive QA instances (12K in English) in SQuAD format in seven languages - English, Arabic, German, Spanish, Hindi, Vietnamese and Simplified Chinese. MLQA is highly parallel, with QA instances parallel between 4 different languages on average.
Dataset Card for QA-Expert-multi-hop-qa-V1.0
This dataset aims to provide multi-domain training data for the task: Question Answering, with a focus on Multi-hop Question Answering. In total, this dataset contains 25.5k for training and 3.19k for evaluation. You can take a look at the model we trained on this data: https://huggingface.co/khaimaitien/qa-expert-7B-V1.0 The dataset is mostly generated using the OpenAI model (gpt-3.5-turbo-instruct). Please read more information about… See the full description on the dataset page: https://huggingface.co/datasets/khaimaitien/qa-expert-multi-hop-qa-V1.0.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
CovidQA is the beginning of a question answering dataset specifically designed for COVID-19, built by hand from knowledge gathered from Kaggle's COVID-19 Open Research Dataset Challenge.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Dataset Card for SQuAD 2.0
Dataset Summary
Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. SQuAD 2.0 combines the 100,000 questions in SQuAD1.1 with over 50,000 unanswerable questions written adversarially by crowdworkers… See the full description on the dataset page: https://huggingface.co/datasets/rajpurkar/squad_v2.
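Because SQuAD 2.0 mixes answerable and unanswerable questions, a common first step is to separate the two. The sketch below assumes that unanswerable questions are stored with empty `text` and `answer_start` lists inside `answers`; both records are made up, and the helper names are hypothetical.

```python
def is_answerable(record):
    """A record counts as answerable if it carries at least one answer text."""
    return len(record["answers"]["text"]) > 0


def split_by_answerability(records):
    """Partition records into (answerable, unanswerable) lists."""
    answerable = [r for r in records if is_answerable(r)]
    unanswerable = [r for r in records if not is_answerable(r)]
    return answerable, unanswerable


# Hypothetical records, not taken from the dataset.
records = [
    {
        "question": "What covers much of the Amazon basin?",
        "answers": {"text": ["The Amazon rainforest"], "answer_start": [0]},
    },
    {
        "question": "Who discovered the Amazon rainforest in 1850?",
        # Assumption: empty lists mark an unanswerable question.
        "answers": {"text": [], "answer_start": []},
    },
]

answerable, unanswerable = split_by_answerability(records)
print(len(answerable), len(unanswerable))
```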
Dataset Card for Yahoo Answers
This dataset is a collection of pairs containing titles, questions, and answers collected from Yahoo Answers. See the Yahoo Answers dataset for additional information. This dataset can be used directly with Sentence Transformers to train embedding models.
Dataset Subsets
title-question-answer-pair subset
Columns: "question", "answer" Column types: str, str Examples:{ 'question': "why doesn't an optical mouse work on a glass… See the full description on the dataset page: https://huggingface.co/datasets/sentence-transformers/yahoo-answers.