Dataset Card for Natural Questions
This dataset is a collection of question-answer pairs from the Natural Questions dataset. See Natural Questions for additional information. This dataset can be used directly with Sentence Transformers to train embedding models.
Dataset Subsets
pair subset
Columns: "query", "answer". Column types: str, str. Example: { 'query': 'the si unit of the electric field is', 'answer': 'Electric field An electric field is a field… See the full description on the dataset page: https://huggingface.co/datasets/sentence-transformers/natural-questions.
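Training embedding models on such (query, answer) pairs typically uses in-batch negatives: each answer is the positive for its own query and a negative for every other query in the batch. A dependency-free sketch of that scoring with toy vectors (not a real encoder; everything here is illustrative):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Toy embeddings for a batch of two (query, answer) pairs.
queries = [[1.0, 0.0], [0.0, 1.0]]
answers = [[0.9, 0.1], [0.1, 0.9]]

# Score matrix: row i should peak at column i (the paired answer);
# the off-diagonal entries act as in-batch negatives.
scores = [[cosine(q, a) for a in answers] for q in queries]
best = [row.index(max(row)) for row in scores]
print(best)  # [0, 1]
```

In practice a loss such as Sentence Transformers' MultipleNegativesRankingLoss computes exactly this kind of score matrix and applies cross-entropy over each row.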
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Dataset Card for "nq"
Dataset Summary
This is a modified version of the original Natural Questions (NQ) dataset for QA tasks. The original is available here. Each sample was preprocessed into a SQuAD-like format, and the context was shortened from an entire Wikipedia article to the passage containing the answer.
Dataset Structure
Data Instances
An example from the 'train' split looks as follows. { "context": "The 2017 Major League Baseball All - Star Game was… See the full description on the dataset page: https://huggingface.co/datasets/LLukas22/nq-simplified.
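Concretely, a SQuAD-like instance stores the question, the shortened context, and the answer text with its character offset into that context. A hypothetical record (values invented for illustration, not taken from the dataset) and the offset invariant it must satisfy:

```python
# Hypothetical instance in the SQuAD-like layout the card describes:
# a shortened context plus the answer with its character offset.
example = {
    "context": "The 2017 Major League Baseball All-Star Game was played at Marlins Park.",
    "question": "where was the 2017 mlb all star game played",
    "answers": {"text": ["Marlins Park"], "answer_start": [59]},
}

start = example["answers"]["answer_start"][0]
text = example["answers"]["text"][0]
# The stored offset must recover the answer verbatim from the context.
assert example["context"][start:start + len(text)] == text
print(text)  # Marlins Park
```

This offset check is what makes the format usable for training extractive QA models, which predict the start and end positions of the answer span.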
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
ScandiQA is a dataset of questions and answers in Danish, Norwegian, and Swedish. All samples come from the Natural Questions (NQ) dataset, a large question answering dataset built from Google search queries. The Scandinavian questions and answers come from the MKQA dataset, in which 10,000 NQ samples were manually translated into, among other languages, Danish, Norwegian, and Swedish. However, MKQA does not include translated contexts, which hinders the training of extractive question answering models.
We merged the NQ dataset with the MKQA dataset and extracted a context for each sample: either the "long answer" from NQ, i.e. the paragraph in which the answer was found, or, failing that, the paragraph with the largest cosine similarity to the question among those that contain the desired answer.
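The fallback retrieval step can be sketched with a simple bag-of-words cosine over the candidate paragraphs; the actual pipeline is not described at that level of detail here, so treat this as an illustrative stand-in rather than the authors' implementation:

```python
import math
from collections import Counter

def bow_cosine(a, b):
    """Cosine similarity between bag-of-words counts of two strings."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(n * n for n in ca.values()))
    nb = math.sqrt(sum(n * n for n in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def pick_context(question, paragraphs, answer):
    """Most question-similar paragraph that actually contains the answer."""
    candidates = [p for p in paragraphs if answer.lower() in p.lower()]
    return max(candidates, key=lambda p: bow_cosine(question, p), default=None)

paragraphs = [
    "Copenhagen is the capital of Denmark.",
    "Denmark has a population of about 5.8 million people.",
]
print(pick_context("what is the capital of denmark", paragraphs, "Copenhagen"))
```

Note the two-stage filter: only paragraphs containing the answer are ranked, which mirrors the card's requirement that the selected context "contains the desired answer".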
Further, many answers in the MKQA dataset were "language normalised": for instance, all date answers were converted to the format "YYYY-MM-DD", meaning that in most cases these answers do not appear verbatim in any paragraph. We solve this by extending the MKQA answers with plausible "answer candidates": slight perturbations or translations of the answer.
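For the date case, candidate generation amounts to rendering the normalised "YYYY-MM-DD" string in surface forms a paragraph might plausibly use. A minimal sketch (the specific formats are chosen for illustration; the dataset's actual candidate set is not specified here):

```python
from datetime import date

def date_candidates(iso_answer):
    """Expand a normalised 'YYYY-MM-DD' answer into surface forms that
    might actually appear in a paragraph (illustrative formats only)."""
    d = date.fromisoformat(iso_answer)
    return [
        iso_answer,                                   # 1776-07-04
        d.strftime("%d %B %Y").lstrip("0"),           # 4 July 1776
        d.strftime("%B %d, %Y").replace(" 0", " "),   # July 4, 1776
        str(d.year),                                  # 1776
    ]

print(date_candidates("1776-07-04"))
```

A context then counts as answer-bearing if any candidate occurs in it, not just the normalised form.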
With the contexts extracted, we translated them into Danish, Swedish, and Norwegian, using the DeepL translation service for Danish and Swedish and the Google Translation service for Norwegian. After translation we verified that the Scandinavian answers do indeed occur in the translated contexts.
Because MKQA samples are filtered at both the merging stage and the translation stage, not all 10,000 samples could be converted to the Scandinavian languages; we end up with roughly 8,000 samples per language. These have been split into training, validation, and test splits, with the validation and test splits each containing roughly 750 samples. The splits were created so that the proportion of samples without an answer is roughly the same in each split.
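The answer-aware split can be sketched as stratified sampling over the has-answer flag, so each split preserves the proportion of unanswerable samples (field name and fractions here are illustrative, not the authors' actual code):

```python
import random

def stratified_split(samples, test_frac=0.1, seed=0):
    """Split so the share of unanswerable samples is preserved.
    `samples` are dicts with a possibly-None 'answer' field (illustrative)."""
    rng = random.Random(seed)
    answerable = [s for s in samples if s["answer"] is not None]
    unanswerable = [s for s in samples if s["answer"] is None]
    test, train = [], []
    # Sample the same fraction from each stratum independently.
    for group in (answerable, unanswerable):
        group = group[:]
        rng.shuffle(group)
        k = round(len(group) * test_frac)
        test += group[:k]
        train += group[k:]
    return train, test

samples = [{"answer": "x"}] * 90 + [{"answer": None}] * 10
train, test = stratified_split(samples)
print(len(test), sum(s["answer"] is None for s in test))  # 10 1
```

With a 10% unanswerable rate in the input, the test split above also ends up with 10% unanswerable samples, which is the property the card describes.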
The Tevatron/wikipedia-nq dataset is hosted on Hugging Face and was contributed by the HF Datasets community.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for Serbian Natural Questions (Subset)
Dataset Summary
This dataset is a Serbian translation of the first 8,000 examples from Google's Natural Questions (NQ) dataset. It contains real user questions and corresponding Wikipedia articles, automatically translated from English to Serbian. The dataset is designed for evaluating embedding models on Question Answering (QA) and Information Retrieval (IR) tasks in the Serbian language, offering a more realistic and… See the full description on the dataset page: https://huggingface.co/datasets/smartcat/natural_quesions_sr.
📘 Dataset Card for ReaRAG-20k
🤗 Model • 💻 GitHub • 📃 Paper
ReaRAG-20k is a reasoning-focused dataset designed for training the ReaRAG model. It contains approximately 20,000 multi-turn retrieval examples constructed from QA datasets such as HotpotQA, MuSiQue, and Natural Questions (NQ). Each instance follows a conversational format supporting reasoning and retrieval steps: { "messages": [{"role": "user", "content": "..."}, {"role": "assistant"… See the full description on the dataset page: https://huggingface.co/datasets/THU-KEG/ReaRAG-20k.
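A hypothetical instance in that conversational format (the content strings are invented for illustration; only the "messages"/"role"/"content" structure comes from the snippet above) and the invariant each turn satisfies:

```python
# Hypothetical ReaRAG-20k-style instance; content strings are illustrative.
instance = {
    "messages": [
        {"role": "user", "content": "Who directed the film that won Best Picture in 1998?"},
        {"role": "assistant", "content": "I need to retrieve which film won Best Picture in 1998."},
    ]
}

# Every turn carries exactly a role and a content string.
assert all(set(m) == {"role", "content"} for m in instance["messages"])
roles = [m["role"] for m in instance["messages"]]
print(roles)  # ['user', 'assistant']
```

This is the same messages layout used by most chat-style fine-tuning pipelines, so such instances can be fed to standard chat templates directly.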