The Stanford Question Answering Dataset (SQuAD) is a collection of question-answer pairs derived from Wikipedia articles. In SQuAD, the correct answer to a question can be any sequence of tokens in the given text. Because the questions and answers are produced by humans through crowdsourcing, it is more diverse than some other question-answering datasets. SQuAD 1.1 contains 107,785 question-answer pairs on 536 articles. SQuAD 2.0, the latest version, combines the 100,000 questions in SQuAD 1.1 with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to the answerable ones.
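As a quick illustration of the span format, here is a minimal Python sketch (assuming the Hugging Face datasets library and the rajpurkar/squad dataset id) that checks that each answer is literally a slice of its context:

from datasets import load_dataset

squad = load_dataset("rajpurkar/squad", split="train")
ex = squad[0]

# Each answer is a literal slice of the context: `answer_start` is a
# character offset and `text` is the token sequence it points at.
start = ex["answers"]["answer_start"][0]
text = ex["answers"]["text"][0]
assert ex["context"][start:start + len(text)] == text
print(ex["question"], "->", text)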
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
Dataset Card for SQuAD
Dataset Summary
Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. SQuAD 1.1 contains 100,000+ question-answer pairs on 500+ articles.
Supported Tasks and Leaderboards
Question Answering.… See the full description on the dataset page: https://huggingface.co/datasets/rajpurkar/squad.
Dataset Card for Dataset Name
Dataset Summary
This dataset card aims to be a base template for new datasets. It has been generated using this raw template.
Supported Tasks and Leaderboards
[More Information Needed]
Languages
[More Information Needed]
Dataset Structure
Data Instances
[More Information Needed]
Data Fields
[More Information Needed]
Data Splits
[More Information Needed]
Dataset Creation… See the full description on the dataset page: https://huggingface.co/datasets/autoevaluate/squad-sample.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
Dataset Card for SQuAD 2.0
Dataset Summary
Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. SQuAD 2.0 combines the 100,000 questions in SQuAD 1.1 with over 50,000 unanswerable questions written adversarially by crowdworkers… See the full description on the dataset page: https://huggingface.co/datasets/rajpurkar/squad_v2.
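A minimal sketch of how the unanswerable questions surface in practice, assuming the rajpurkar/squad_v2 dataset id and the convention that unanswerable questions carry an empty answers["text"] list:

from datasets import load_dataset

squad_v2 = load_dataset("rajpurkar/squad_v2", split="validation")

# Unanswerable questions have an empty answers["text"] list.
unanswerable = sum(1 for ex in squad_v2 if not ex["answers"]["text"])
print(f"{unanswerable} of {len(squad_v2)} validation questions are unanswerable")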
MedQuAD includes 47,457 medical question-answer pairs created from 12 NIH websites (e.g. cancer.gov, niddk.nih.gov, GARD, MedlinePlus Health Topics). The collection covers 37 question types (e.g. Treatment, Diagnosis, Side Effects) associated with diseases, drugs and other medical entities such as tests.
The use of artificial intelligence in public services is often identified as an opportunity to query documentary texts and build automatic question-answering (Q&A) tools for users. Querying the labour code in natural language, providing a conversational agent for a given service, developing efficient search engines, improving knowledge management: all of these require a corpus of high-quality training data in order to develop Q&A algorithms. The PIAF dataset is a public, open French-language training dataset for training such algorithms.
Inspired by SQuAD, the well-known English Q&A dataset, we set out to build a similar dataset that would be open to all. The protocol we followed is very similar to that of the first version of SQuAD (SQuAD v1.1), although some changes had to be made to adapt to the characteristics of the French Wikipedia. Another big difference is that we did not employ micro-workers via crowd-sourcing platforms.
After several months of annotation, we have a robust and free annotation platform, a sufficient quantity of annotations, and a well-established, innovative approach to community animation and collaborative participation within the French administration.
In March 2018, France launched its national strategy for artificial intelligence. Led by the Interministerial Digital Directorate (Direction interministérielle du numérique), this strategy has three components: research, the economy and public-sector transformation.
Since data policy is a major focus of the development of artificial intelligence, the Etalab mission is leading the establishment of an interministerial "Lab IA", whose mission is to accelerate the deployment of AI in administrations via three main activities:
The PIAF project is one of the shared tools of the Lab IA.
The dataset follows the SQuAD v1.1 format. PIAF v1.2 contains 9,225 question-answer pairs in a single JSON file. A sketch illustrating the schema is included below; the file can be used to train and evaluate question answering models, for example by following these instructions.
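Since the schema file itself is not reproduced here, the following Python sketch walks the SQuAD v1.1 layout that PIAF follows (the file name piaf-v1.2.json is illustrative):

import json

# PIAF v1.2 follows the SQuAD v1.1 layout:
# data -> articles -> paragraphs -> qas -> answers.
with open("piaf-v1.2.json", encoding="utf-8") as f:
    piaf = json.load(f)

for article in piaf["data"]:
    for paragraph in article["paragraphs"]:
        context = paragraph["context"]
        for qa in paragraph["qas"]:
            question = qa["question"]
            for answer in qa["answers"]:
                text = answer["text"]            # answer string
                start = answer["answer_start"]   # character offset in context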
We deeply thank our contributors, who have kept this project alive on a voluntary basis to this day.
Information on the protocol followed, the project news, the annotation platform and the related code are here:
TriviaQA is a realistic text-based question answering dataset which includes 950K question-answer pairs from 662K documents collected from Wikipedia and the web. This dataset is more challenging than standard QA benchmark datasets such as Stanford Question Answering Dataset (SQuAD), as the answers for a question may not be directly obtained by span prediction and the context is very long. TriviaQA dataset consists of both human-verified and machine-generated QA subsets.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
Dataset Card
Dataset Summary
This dataset contains multilingual, parallel SQuAD dataset examples across EN, DE, ES, and IT. To construct the dataset, identifiers were aligned across the following SQuAD-related datasets:
EN, DE, ES: XQuAD (Cross-lingual Question Answering Dataset)
IT: SQuAD-it
See citation information below.
Citation Information
XQuAD: @article{Artetxe:etal:2019, author = {Mikel Artetxe and Sebastian Ruder and Dani Yogatama}… See the full description on the dataset page: https://huggingface.co/datasets/kyle-obrien/multilingual-squad.
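As an illustration of the id-based alignment described above, this sketch pairs English and German questions through their shared ids, assuming the public XQuAD config names (xquad.en, xquad.de) and its single validation split:

from datasets import load_dataset

# Config names follow the public XQuAD release; XQuAD ships one
# "validation" split per language.
en = load_dataset("xquad", "xquad.en", split="validation")
de = load_dataset("xquad", "xquad.de", split="validation")

# Align examples across languages through their shared ids.
de_by_id = {ex["id"]: ex for ex in de}
for ex in en:
    if ex["id"] in de_by_id:
        print(ex["question"])
        print(de_by_id[ex["id"]]["question"])
        break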
The QNLI (Question-answering NLI) dataset is a Natural Language Inference dataset automatically derived from the Stanford Question Answering Dataset v1.1 (SQuAD). SQuAD v1.1 consists of question-paragraph pairs, where one of the sentences in the paragraph (drawn from Wikipedia) contains the answer to the corresponding question (written by an annotator). The dataset was converted into sentence pair classification by forming a pair between each question and each sentence in the corresponding context, and filtering out pairs with low lexical overlap between the question and the context sentence. The task is to determine whether the context sentence contains the answer to the question. This modified version of the original task removes the requirement that the model select the exact answer, but also removes the simplifying assumptions that the answer is always present in the input and that lexical overlap is a reliable cue. The QNLI dataset is part of GLUE benchmark.
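To make the construction concrete, here is a deliberately naive sketch of the question/sentence pairing and lexical-overlap filter; the actual GLUE conversion differs in its tokenization and threshold choices:

def squad_to_qnli_pairs(question, context, answer_start, min_overlap=2):
    """Pair a question with each context sentence (naive sentence split)."""
    q_tokens = set(question.lower().split())
    pairs, offset = [], 0
    for sentence in context.split(". "):
        overlap = len(q_tokens & set(sentence.lower().split()))
        if overlap >= min_overlap:  # filter out pairs with low lexical overlap
            # Entailment iff the answer span starts inside this sentence.
            contains = offset <= answer_start < offset + len(sentence)
            pairs.append((question, sentence,
                          "entailment" if contains else "not_entailment"))
        offset += len(sentence) + 2  # 2 = len(". ") removed by the split
    return pairs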
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
Questions, answers and documents are stored in the dataset. Every question has an answer, and the answer comes from a page of Rijksportaal Personeel (the central-government intranet). With this dataset a question-and-answer model can be trained; the computer thus learns to answer questions in the context of P-Direkt. A total of 322 questions were used that were once asked by e-mail to the contact center of P-Direkt. The questions are very general and never ask about personal circumstances. The aim of the dataset was to test whether question-and-answer models could be used in a P-Direkt environment. The structure of the dataset corresponds to the SQuAD 2.0 dataset.

Example:

Question: Is it true that my SCV hours of 2020 expire if I don't take them?

Answer: You can save your IKB hours in your IKB savings leave. IKB hours that you have not taken as leave and have not paid out will be added to your IKB savings leave at the end of December. Your IKB savings leave cannot expire.

Source*: You can save your IKB hours in your IKB savings leave. IKB hours that you have not taken as leave and have not paid out will be added to your IKB savings leave at the end of December. Your IKB savings leave cannot expire. You cannot have your IKB savings leave paid out. Payment is only made in the event of termination of employment or death. You can save up to 1,800 hours. Do you work part-time or more than an average of 36 hours per week? In that case, the maximum number of hours to be saved is calculated proportionally and rounded down to whole hours. Any remaining holiday hours from 2015 and extra-statutory holiday hours that you had left over from 2016 up to and including 2019 were converted into IKB hours on 1 January 2020 and added to your IKB savings leave.

* Please note: the source is a snapshot of Rijksportaal Personeel from April 2021. Go to Rijksportaal Personeel on the intranet for up-to-date information about personnel matters.
Dataset Card for SQuAD
This dataset is a collection of question-answer pairs from the SQuAD dataset. See SQuAD for additional information. This dataset can be used directly with Sentence Transformers to train embedding models.
Dataset Subsets
pair subset
Columns: "question", "answer"
Column types: str, str
Examples: { 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?', 'answer': 'Architecturally, the school has a Catholic… See the full description on the dataset page: https://huggingface.co/datasets/sentence-transformers/squad.
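A minimal fine-tuning sketch with Sentence Transformers v3+; the loss choice and base model are illustrative assumptions, not prescribed by the dataset card:

from datasets import load_dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import MultipleNegativesRankingLoss

# Base model chosen arbitrarily for illustration.
model = SentenceTransformer("all-MiniLM-L6-v2")
train = load_dataset("sentence-transformers/squad", split="train")

# In-batch negatives: each (question, answer) pair treats the other
# answers in the batch as negatives.
loss = MultipleNegativesRankingLoss(model)
trainer = SentenceTransformerTrainer(model=model, train_dataset=train, loss=loss)
trainer.train()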
The MRQA 2019 Shared Task focuses on generalization in question answering. An effective question answering system should do more than merely interpolate from the training set to answer test examples drawn from the same distribution: it should also be able to extrapolate to out-of-distribution examples — a significantly harder challenge.
MRQA adapts and unifies multiple distinct question answering datasets (carefully selected subsets of existing datasets) into the same format (SQuAD format). Among them, six datasets were made available for training, and six datasets were made available for testing. Small portions of the training datasets were held-out as in-domain data that may be used for development. The testing datasets only contain out-of-domain data. This benchmark is released as part of the MRQA 2019 Shared Task.
More information can be found at: https://mrqa.github.io/2019/shared.html.
To use this dataset:
import tensorflow_datasets as tfds

# Load the training split of the unified MRQA data.
ds = tfds.load('mrqa', split='train')

# Inspect a few examples.
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
SQuAD-it
This dataset is an adapted version of SQuAD-it, prepared for training Hugging Face models. It contains:
train samples: 87599
test samples: 10570
This dataset is for question answering and its format is the following:

[
  {
    "answers": [
      {
        "answer_start": [1],
        "text": ["Questo è un testo"]
      }
    ],
    "context": "Questo è un testo relativo al contesto.",
    "id": "1",
    "question": "Questo è un testo?",
    "title": "train test"
  }
]
It can… See the full description on the dataset page: https://huggingface.co/datasets/z-uo/squad-it.
MLQA (MultiLingual Question Answering) is a benchmark dataset for evaluating cross-lingual question answering performance. MLQA consists of over 5K extractive QA instances (12K in English) in SQuAD format in seven languages - English, Arabic, German, Spanish, Hindi, Vietnamese and Simplified Chinese. MLQA is highly parallel, with QA instances parallel between 4 different languages on average.
MIT License: https://opensource.org/licenses/MIT
SynQA is a Reading Comprehension dataset created in the work "Improving Question Answering Model Robustness with Synthetic Adversarial Data Generation" (https://aclanthology.org/2021.emnlp-main.696/). It consists of 314,811 synthetically generated questions on the passages in the SQuAD v1.1 (https://arxiv.org/abs/1606.05250) training set.
In this work, we use synthetic adversarial data generation to make QA models more robust to human adversaries. We develop a data generation pipeline that selects source passages, identifies candidate answers, generates questions, then finally filters or re-labels them to improve quality. Using this approach, we amplify a smaller human-written adversarial dataset to a much larger set of synthetic question-answer pairs. By incorporating our synthetic data, we improve the state-of-the-art on the AdversarialQA (https://adversarialqa.github.io/) dataset by 3.7 F1 and improve model generalisation on nine of the twelve MRQA datasets. We further conduct a novel human-in-the-loop evaluation to show that our models are considerably more robust to new human-written adversarial examples: crowdworkers can fool our model only 8.8% of the time on average, compared to 17.6% for a model trained without synthetic data.
For full details on how the dataset was created, kindly refer to the paper.
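In the spirit of that generate-then-filter pipeline, here is a rough sketch using stand-in models from the Hugging Face Hub; the model names are illustrative assumptions, not the paper's trained components:

from transformers import pipeline

# Stand-in models: a highlight-based question generator and a SQuAD-tuned
# QA model used as the filter. Both names are assumptions for illustration.
qg = pipeline("text2text-generation", model="valhalla/t5-base-qg-hl")
qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

def synthesize(passage, answer):
    # Highlight the candidate answer so the QG model knows what to ask about.
    highlighted = passage.replace(answer, "<hl> " + answer + " <hl>", 1)
    question = qg("generate question: " + highlighted)[0]["generated_text"]
    # Filtering step: keep the pair only if the QA model recovers the answer.
    predicted = qa(question=question, context=passage)["answer"]
    return (question, answer) if predicted.strip() == answer else None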
SQUAD - Smart Qualitative Data: Methods and Community Tools for Data Mark-Up is a demonstrator project that will explore methodological and technical solutions for exposing digital qualitative data to make them fully shareable, exploitable and archivable for the longer term. Such tools are required to exploit fully the potential of qualitative data for adventurous collaborative research using web-based and e-science systems. An example of the latter might be linking multiple data and information sources, such as text, statistics and maps.

Initially, the project deals with specifying and testing flexible means of storing and marking up, or annotating, qualitative data using universal standards and technologies, through eXtensible Mark-up Language (XML). A community standard, or schema, will be proposed that will be applicable to most kinds of qualitative data. The second strand investigates optimal requirements for describing or 'contextualising' research data (e.g. interview setting or interviewer characteristics), aiming to develop standards for data documentation. The third strand aims to use natural language processing technologies to develop and implement user-friendly tools for semi-automating the preparation of marked-up qualitative data. Finally, the project will investigate tools for publishing the enriched data and contextual information to web-based systems and for exporting to preservation formats.