Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Dataset Card for SQuAD 2.0
Dataset Summary
Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. SQuAD 2.0 combines the 100,000 questions in SQuAD 1.1 with over 50,000 unanswerable questions written adversarially by crowdworkers… See the full description on the dataset page: https://huggingface.co/datasets/rajpurkar/squad_v2.
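For orientation, the dataset can be pulled straight from the Hub with the datasets library; this is a minimal sketch assuming the standard rajpurkar/squad_v2 splits and the SQuAD-style fields described above.
```
from datasets import load_dataset

# Load SQuAD 2.0 from the Hugging Face Hub; it ships with train and validation splits.
squad_v2 = load_dataset("rajpurkar/squad_v2")

example = squad_v2["train"][0]
print(example["question"])
print(example["context"][:200])
# Unanswerable questions carry an empty answers["text"] list.
print(example["answers"])
```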
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.
There are two files to help you get started with the dataset and evaluate your models:
The original datasets can be found here.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Dataset Card for SQuAD
Dataset Summary
Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. SQuAD 1.1 contains 100,000+ question-answer pairs on 500+ articles.
Supported Tasks and Leaderboards
Question Answering.… See the full description on the dataset page: https://huggingface.co/datasets/rajpurkar/squad.
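For the question-answering task above, scoring is conventionally reported as exact match and F1; a minimal sketch using the evaluate library's squad metric (the example id and texts are illustrative) might look like this.
```
import evaluate

# SQuAD-style exact match / F1 scoring.
squad_metric = evaluate.load("squad")

predictions = [{"id": "q1", "prediction_text": "Denver Broncos"}]
references = [{"id": "q1", "answers": {"text": ["Denver Broncos"], "answer_start": [177]}}]

print(squad_metric.compute(predictions=predictions, references=references))
# Expected output: {'exact_match': 100.0, 'f1': 100.0}
```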
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Rust QA is a dataset for training and evaluating QA systems. The dataset consists of 1068 questions to "The Rust Programming Language" book (https://doc.rust-lang.org/stable/book/) with the answers provided as text spans from the book. The dataset is released in SQuAD 2.0 format.
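Since the entry only says the data follows the SQuAD 2.0 format, here is a small sketch of that layout; the field names are the standard SQuAD 2.0 ones, while the example content and id are invented for illustration.
```
# Illustrative sketch of a SQuAD 2.0-format record, as used by Rust QA.
example = {
    "version": "v2.0",
    "data": [{
        "title": "The Rust Programming Language",
        "paragraphs": [{
            "context": "Ownership is a set of rules that govern how a Rust program manages memory.",
            "qas": [{
                "id": "q-0001",
                "question": "What do ownership rules govern?",
                # For unanswerable questions, is_impossible is True and answers is empty.
                "is_impossible": False,
                "answers": [{"text": "how a Rust program manages memory", "answer_start": 40}],
            }],
        }],
    }],
}
```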
bayes-group-diffusion/squad-2.0 dataset hosted on Hugging Face and contributed by the HF Datasets community
This dataset was created by Asrst
CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
SQuAD 2.0 train dataset converted from JSON to CSV data. The dataset can be used to build complex open QA systems.
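As a rough idea of what such a conversion involves, here is a minimal sketch that flattens SQuAD 2.0 JSON into one CSV row per question; the input/output file names and column names are assumptions, not necessarily those used in this particular dataset.
```
import csv
import json

# File names and CSV columns below are illustrative assumptions.
with open("train-v2.0.json", encoding="utf-8") as f:
    squad = json.load(f)

with open("squad_v2_train.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    writer.writerow(["id", "title", "context", "question",
                     "answer_text", "answer_start", "is_impossible"])
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                answers = qa.get("answers", [])
                text = answers[0]["text"] if answers else ""
                start = answers[0]["answer_start"] if answers else -1
                writer.writerow([qa["id"], article["title"], paragraph["context"],
                                 qa["question"], text, start, qa.get("is_impossible", False)])
```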
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
The Czech translation of the SQuAD 2.0 and SQuAD 1.1 datasets contains automatically translated texts, questions and answers from the training set and the development set of the respective datasets. The test set is missing because it is not publicly available. The data is released under the CC BY-NC-SA 4.0 license. If you use the dataset, please cite the following paper (the exact format was not available during the submission of the dataset): Kateřina Macková and Milan Straka: Reading Comprehension in Czech via Machine Translation and Cross-lingual Transfer, presented at TSD 2020, Brno, Czech Republic, September 8-11, 2020.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Automatic translation of the Stanford Question Answering Dataset (SQuAD) v2 into Spanish.
I(n)dontKnow-MRC (IDK-MRC) is an Indonesian Machine Reading Comprehension dataset that covers answerable and unanswerable questions. Based on the existing answerable questions in TyDiQA, the new unanswerable questions in IDK-MRC are generated using a question generation model and human-written questions. Each paragraph in the dataset has a set of answerable and unanswerable questions with the corresponding answers.
Besides the IDK-MRC (idk_mrc) dataset, several baseline datasets are also provided (a loading sketch follows the list):
1. Trans SQuAD (trans_squad): machine-translated SQuAD 2.0 (Muis and Purwarianti, 2020)
2. TyDiQA (tydiqa): the Indonesian answerable question set from TyDiQA-GoldP (Clark et al., 2020)
3. Model Gen (model_gen): TyDiQA plus the unanswerable questions produced by the question generation model
4. Human Filt (human_filt): the Model Gen dataset after filtering by human annotators
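If the IDK-MRC data is consumed through the datasets library, the configuration names above would typically be passed as the second argument to load_dataset; the Hub path below is a placeholder, and whether every baseline is exposed as a separate config depends on how the repository is organized.
```
from datasets import load_dataset

# "user/idk-mrc" is a placeholder Hub path; adjust it to the actual repository.
for config in ["idk_mrc", "trans_squad", "tydiqa", "model_gen", "human_filt"]:
    ds = load_dataset("user/idk-mrc", config)
    print(config, {split: len(ds[split]) for split in ds})
```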
GNU General Public License v3.0 (GPL-3.0): https://choosealicense.com/licenses/gpl-3.0/
Dataset Card for "squad-v2-fi"
Dataset Summary
Machine translated and normalized Finnish version of the SQuAD-v2.0 dataset. Details about the translation and normalization processes can be found here. Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the… See the full description on the dataset page: https://huggingface.co/datasets/ilmariky/SQuAD_v2_fi.
This dataset was created by Vijender Singh
CDLA-Sharing-1.0: https://cdla.io/sharing-1-0/
This BERT baseline model was trained from scratch on the SQuAD 2.0 dataset for 3 epochs. Due to computational resource limitations, further regularization and optimizations weren't added.
Overview
This dataset contains the data for the paper "Deep learning based question answering system in Bengali". It is a translation of the SQuAD 2.0 dataset into Bengali. Preprocessing details can be found in the paper.
This model checkpoint was trained using the Hugging Face Transformers library. To reproduce, use the run_squad.py script from the provided examples with the following command:
python run_squad.py \
--model_type bert \
--model_name_or_path monologg/biobert_v1.1_pubmed \
--do_train \
--do_eval \
--train_file $SQUAD_DIR/train-v2.0.json \
--predict_file $SQUAD_DIR/dev-v2.0.json \
--per_gpu_train_batch_size 8 \
--learning_rate 3e-5 \
--num_train_epochs 4 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir /tmp/biobert_squad2_cased/ \
--version_2_with_negative
Load the model checkpoint in just a few lines of code:
```
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

model_path = '/kaggle/input/biobert-qa/biobert_squad2_cased'
model = AutoModelForQuestionAnswering.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
```
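As a usage sketch, the loaded checkpoint can be wrapped in the transformers question-answering pipeline; the question and context below are illustrative, and handle_impossible_answer lets the pipeline return an empty span for SQuAD 2.0-style unanswerable questions.
```
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, pipeline

model_path = '/kaggle/input/biobert-qa/biobert_squad2_cased'
model = AutoModelForQuestionAnswering.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Span-extraction QA with the fine-tuned BioBERT checkpoint.
qa = pipeline("question-answering", model=model, tokenizer=tokenizer)

result = qa(
    question="What dataset was the model fine-tuned on?",  # illustrative question
    context="This BioBERT checkpoint was fine-tuned on SQuAD 2.0, which adds unanswerable questions.",
    handle_impossible_answer=True,
)
print(result["answer"], result["score"])
```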
This model was used to power a Q&A engine for the CORD-19 challenge in the following submissions: Transmission, Risk factors, Vaccines and therapeutics, and Medical care.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
SQuAD-NL v2.0 [translated SQuAD / XQuAD]
SQuAD-NL v2.0 is a translation of The Stanford Question Answering Dataset (SQuAD) v2.0. Since the original English SQuAD test data is not public, we reserve the same documents that were used for XQuAD for testing purposes. These documents are sampled from the original dev data split. The English data is automatically translated using Google Translate (February 2023) and the test data is manually post-edited. This version of SQuAD-NL also… See the full description on the dataset page: https://huggingface.co/datasets/GroNLP/squad-nl-v2.0.
eanderson/dutch-squad-v2.0 dataset hosted on Hugging Face and contributed by the HF Datasets community
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset was created by Shengxiang Lin
Released under MIT
Dataset Card for squad-v2-mod
This dataset has been created with distilabel.
Dataset Summary
This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it in distilabel using the distilabel CLI:
distilabel pipeline run --config "https://huggingface.co/datasets/fshala/squad-v2-mod/raw/main/pipeline.yaml"
or explore the configuration:
distilabel pipeline info --config… See the full description on the dataset page: https://huggingface.co/datasets/fshala/squad-v2-mod.
M2QA (Multi-domain Multilingual Question Answering) is an extractive question answering benchmark for evaluating joint language and domain transfer. M2QA includes 13,500 SQuAD 2.0-style question-answer instances in German, Turkish, and Chinese for the domains of product reviews, news, and creative writing.