The Stanford Question Answering Dataset (SQuAD) is a collection of question-answer pairs derived from Wikipedia articles. In SQuAD, the correct answer to a question can be any sequence of tokens in the given text. Because the questions and answers are produced by humans through crowdsourcing, it is more diverse than some other question-answering datasets. SQuAD 1.1 contains 107,785 question-answer pairs on 536 articles. SQuAD 2.0, the latest version, combines the 100,000 questions of SQuAD 1.1 with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to the answerable ones.
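For readers new to the format, the following is a minimal, invented illustration (in Python) of how a single paragraph and its questions are laid out in the official SQuAD 2.0 JSON release; the context, questions, and ids below are made up and are not actual SQuAD entries:

    # Minimal, invented illustration of the SQuAD 2.0 paragraph/qas layout.
    # The context, questions, and ids are made up; only the field names follow the release format.
    paragraph = {
        "context": "The Amazon rainforest covers much of the Amazon basin of South America.",
        "qas": [
            {
                "id": "example-0001",
                "question": "Which basin does the Amazon rainforest cover?",
                "answers": [{"text": "the Amazon basin", "answer_start": 37}],  # span copied from the context
                "is_impossible": False,
            },
            {
                "id": "example-0002",
                "question": "When was the Amazon rainforest first mapped?",
                "answers": [],  # unanswerable question: no answer span exists in the context
                "is_impossible": True,
            },
        ],
    }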
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Dataset Card for SQuAD
Dataset Summary
Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. SQuAD 1.1 contains 100,000+ question-answer pairs on 500+ articles.
Supported Tasks and Leaderboards
Question… See the full description on the dataset page: https://huggingface.co/datasets/rajpurkar/squad.
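As a quick usage sketch (assuming the Hugging Face datasets library is installed; the dataset id is the one from the page linked above):

    # Load SQuAD from the Hugging Face Hub and inspect one training record.
    from datasets import load_dataset

    squad = load_dataset("rajpurkar/squad")  # splits: "train" and "validation"
    sample = squad["train"][0]
    print(sample["question"])
    print(sample["answers"])  # {"text": [...], "answer_start": [...]}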
The bayes-group-diffusion/squad-2.0 dataset is hosted on Hugging Face and was contributed by the HF Datasets community.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
The Czech translation of the SQuAD 2.0 and SQuAD 1.1 datasets contains automatically translated texts, questions, and answers from the training and development sets of the respective datasets.
The test set is not included because it is not publicly available.
The data is released under the CC BY-NC-SA 4.0 license.
If you use the dataset, please cite the following paper (the exact citation format was not available when the dataset was submitted): Kateřina Macková and Milan Straka: Reading Comprehension in Czech via Machine Translation and Cross-lingual Transfer, presented at TSD 2020, Brno, Czech Republic, September 8-11, 2020.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Automatic translation of the Stanford Question Answering Dataset (SQuAD) v2 into Spanish
This dataset was created by Pi Esposito
Released under Data files © Original Authors
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
SQuAD-NL v2.0 [translated SQuAD / XQuAD]
SQuAD-NL v2.0 is a translation of The Stanford Question Answering Dataset (SQuAD) v2.0. Since the original English SQuAD test data is not public, we reserve the same documents that were used for XQuAD for testing purposes. These documents are sampled from the original dev data split. The English data is automatically translated using Google Translate (February 2023) and the test data is manually post-edited. This version of SQuAD-NL also… See the full description on the dataset page: https://huggingface.co/datasets/GroNLP/squad-nl-v2.0.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The E-Commerce Question Answering Dataset (ECQuAD) is a reading comprehension dataset for question answering on Brazilian e-commerce platforms. It consists of questions annotated by crowdworkers on a set of product descriptions. It follows the SQuAD-v2 format, so questions may be unanswerable.
This is a development set, released for public use by GoBots.
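Since the card only states that ECQuAD follows the SQuAD-v2 format, here is a rough sketch of how one might count answerable versus unanswerable questions in a SQuAD-v2-style JSON file; the file name ecquad-dev.json is a hypothetical placeholder, and the top-level data/paragraphs/qas layout is assumed to match SQuAD v2:

    # Count answerable vs. unanswerable questions in a SQuAD-v2-style JSON file.
    # "ecquad-dev.json" is a hypothetical file name used only for illustration.
    import json

    with open("ecquad-dev.json", encoding="utf-8") as f:
        articles = json.load(f)["data"]

    answerable = unanswerable = 0
    for article in articles:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                if qa.get("is_impossible", False):
                    unanswerable += 1
                else:
                    answerable += 1
    print(answerable, "answerable /", unanswerable, "unanswerable")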
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created by Vissarion Moutafis
Released under CC0: Public Domain
This dataset was created by sy8200
Dataset Card for squad-v2-modified
This dataset has been created with distilabel.
Dataset Summary
This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it in distilabel using the distilabel CLI:
    distilabel pipeline run --config "https://huggingface.co/datasets/fshala/squad-v2-modified/raw/main/pipeline.yaml"
or explore the configuration:
    distilabel pipeline info --config "https://huggingface.co/datasets/fshala/squad-v2-modified/raw/main/pipeline.yaml"
See the full description on the dataset page: https://huggingface.co/datasets/fshala/squad-v2-modified.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created by KungKaching
Released under CC0: Public Domain
The PageTurnIO/squad-v2-reference-task dataset is hosted on Hugging Face and was contributed by the HF Datasets community.
Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.
Disfl-QA is a targeted dataset for contextual disfluencies in an information seeking setting, namely question answering over Wikipedia passages. Disfl-QA builds upon the SQuAD-v2 dataset, where each question in the dev set is annotated to add a contextual disfluency using the paragraph as a source of distractors.
The final dataset consists of ~12k (disfluent question, answer) pairs. Over 90% of the disfluencies are corrections or restarts, making it a much harder test set for disfluency correction. Disfl-QA aims to fill a major gap between the speech and NLP research communities. We hope the dataset can serve as a benchmark for testing the robustness of models against disfluent inputs.
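To make the idea concrete, here is a constructed illustration (not an actual Disfl-QA record) of how a correction-type disfluency changes the surface form of a question while leaving the answer untouched:

    # Constructed illustration (not taken from Disfl-QA) of a correction-type disfluency.
    fluent_question = "In what country is Normandy located?"
    disfluent_question = "In what city, no sorry, in what country is Normandy located?"
    answer = "France"  # the answer span is unchanged; only the question wording differs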
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
https://data.overheid.nl/dataset/vraag-en-antwoord-dataset-rijksportaal-personeel
The dataset contains questions, answers, and documents. Every question has an answer, and the answer comes from a page of Rijksportaal Personeel (the Rijksoverheid intranet). The dataset can be used to train a question-answering model, so that the computer learns to answer questions in the context of P-Direkt. In total, 322 questions were used that had at some point been sent by e-mail to the P-Direkt contact center. The questions are very general and never ask about personal circumstances. The goal of the dataset was to explore whether question-answering models could be used in a P-Direkt environment. The structure of the dataset corresponds to the SQuAD 2.0 dataset.
Example:
Question: Is it true that my IKB hours from 2020 expire if I do not take them?
Answer: You can save your IKB hours in your IKB saved leave. IKB hours that you have not taken as leave and have not had paid out are added to your IKB saved leave at the end of December. Your IKB saved leave cannot expire.
Source*: You can save your IKB hours in your IKB saved leave. IKB hours that you have not taken as leave and have not had paid out are added to your IKB saved leave at the end of December. Your IKB saved leave cannot expire. You cannot have your IKB saved leave paid out; payment only takes place when you leave employment or in the event of death. You can save a maximum of 1,800 hours. Do you work part-time, or more than an average of 36 hours per week? Then the maximum number of hours you can save is calculated proportionally and rounded down to whole hours. Any remaining vacation hours from 2015 and above-statutory vacation hours left over from 2016 through 2019 were converted into IKB hours on 1 January 2020 and added to your IKB saved leave.
* Note: the source is a snapshot of Rijksportaal Personeel from April 2021. Go to Rijksportaal Personeel on the intranet for up-to-date information about personnel matters.
Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.
Clean SQuAD v2
This is a refined version of the SQuAD v2 dataset. It has been preprocessed to ensure higher data quality and usability for NLP tasks such as Question Answering.
Description
The Clean SQuAD v2 dataset was created by applying preprocessing steps to the original SQuAD v2 dataset, including:
Trimming whitespace: All leading and trailing spaces have been removed from the question field.
Minimum question length: Questions with fewer than 12 characters were… See the full description on the dataset page: https://huggingface.co/datasets/decodingchris/clean_squad_v2.
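A rough sketch of what the two preprocessing steps named above might look like with the Hugging Face datasets library; the original authors' exact code is not shown on this page, and the source dataset id rajpurkar/squad_v2 is assumed:

    # Approximate sketch of the two described preprocessing steps, applied to SQuAD v2.
    from datasets import load_dataset

    squad_v2 = load_dataset("rajpurkar/squad_v2")

    # 1. Trim leading and trailing whitespace from the question field.
    squad_v2 = squad_v2.map(lambda ex: {"question": ex["question"].strip()})

    # 2. Drop questions with fewer than 12 characters.
    squad_v2 = squad_v2.filter(lambda ex: len(ex["question"]) >= 12)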