Dataset Card for Natural Questions
This dataset is a collection of question-answer pairs from the Natural Questions dataset. See Natural Questions for additional information. This dataset can be used directly with Sentence Transformers to train embedding models.
Dataset Subsets
pair subset
Columns: "query", "answer". Column types: str, str. Example: { 'query': 'the si unit of the electric field is', 'answer': 'Electric field An electric field is a field… See the full description on the dataset page: https://huggingface.co/datasets/sentence-transformers/natural-questions.
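Training embedding models on such (query, answer) pairs typically uses in-batch negatives: each answer is the positive for its own query and a negative for every other query in the batch. A dependency-free sketch of that scoring with toy vectors (not a real encoder; everything here is illustrative):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Toy embeddings for a batch of two (query, answer) pairs.
queries = [[1.0, 0.0], [0.0, 1.0]]
answers = [[0.9, 0.1], [0.1, 0.9]]

# Score matrix: row i should peak at column i (the paired answer);
# the off-diagonal entries act as in-batch negatives.
scores = [[cosine(q, a) for a in answers] for q in queries]
best = [row.index(max(row)) for row in scores]
print(best)  # [0, 1]
```

In practice a loss such as Sentence Transformers' MultipleNegativesRankingLoss computes exactly this kind of score matrix and applies cross-entropy over each row.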
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Dataset Card for "nq"
Dataset Summary
This is a modified version of the original Natural Questions (NQ) dataset for QA tasks. The original is available here. Each sample was preprocessed into a SQuAD-like format, and the context was shortened from an entire Wikipedia article to the passage containing the answer.
Dataset Structure
Data Instances
An example from the 'train' split looks as follows. { "context": "The 2017 Major League Baseball All - Star Game was… See the full description on the dataset page: https://huggingface.co/datasets/LLukas22/nq-simplified.
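Concretely, a SQuAD-like instance stores the question, the shortened context, and the answer text with its character offset into that context. A hypothetical record (values invented for illustration, not taken from the dataset) and the offset invariant it must satisfy:

```python
# Hypothetical instance in the SQuAD-like layout the card describes:
# a shortened context plus the answer with its character offset.
example = {
    "context": "The 2017 Major League Baseball All-Star Game was played at Marlins Park.",
    "question": "where was the 2017 mlb all star game played",
    "answers": {"text": ["Marlins Park"], "answer_start": [59]},
}

start = example["answers"]["answer_start"][0]
text = example["answers"]["text"][0]
# The stored offset must recover the answer verbatim from the context.
assert example["context"][start:start + len(text)] == text
print(text)  # Marlins Park
```

This offset check is what makes the format usable for training extractive QA models, which predict the start and end positions of the answer span.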
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
ScandiQA is a dataset of questions and answers in Danish, Norwegian, and Swedish. All samples come from the Natural Questions (NQ) dataset, a large question answering dataset built from Google search queries. The Scandinavian questions and answers come from the MKQA dataset, in which 10,000 NQ samples were manually translated into, among other languages, Danish, Norwegian, and Swedish. However, MKQA does not include translated contexts, which hinders the training of extractive question answering models.
We merged the NQ dataset with the MKQA dataset and extracted a context for each sample: either the "long answer" from NQ, i.e. the paragraph in which the answer was found, or, failing that, the paragraph with the largest cosine similarity to the question among those that contain the desired answer.
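The fallback retrieval step can be sketched with a simple bag-of-words cosine over the candidate paragraphs; the actual pipeline is not described at that level of detail here, so treat this as an illustrative stand-in rather than the authors' implementation:

```python
import math
from collections import Counter

def bow_cosine(a, b):
    """Cosine similarity between bag-of-words counts of two strings."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(n * n for n in ca.values()))
    nb = math.sqrt(sum(n * n for n in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def pick_context(question, paragraphs, answer):
    """Most question-similar paragraph that actually contains the answer."""
    candidates = [p for p in paragraphs if answer.lower() in p.lower()]
    return max(candidates, key=lambda p: bow_cosine(question, p), default=None)

paragraphs = [
    "Copenhagen is the capital of Denmark.",
    "Denmark has a population of about 5.8 million people.",
]
print(pick_context("what is the capital of denmark", paragraphs, "Copenhagen"))
```

Note the two-stage filter: only paragraphs containing the answer are ranked, which mirrors the card's requirement that the selected context "contains the desired answer".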
Further, many answers in the MKQA dataset were "language normalised": for instance, all date answers were converted to the format "YYYY-MM-DD", meaning that in most cases these answers do not appear verbatim in any paragraph. We solve this by extending the MKQA answers with plausible "answer candidates": slight perturbations or translations of the answer.
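For the date case, candidate generation amounts to rendering the normalised "YYYY-MM-DD" string in surface forms a paragraph might plausibly use. A minimal sketch (the specific formats are chosen for illustration; the dataset's actual candidate set is not specified here):

```python
from datetime import date

def date_candidates(iso_answer):
    """Expand a normalised 'YYYY-MM-DD' answer into surface forms that
    might actually appear in a paragraph (illustrative formats only)."""
    d = date.fromisoformat(iso_answer)
    return [
        iso_answer,                                   # 1776-07-04
        d.strftime("%d %B %Y").lstrip("0"),           # 4 July 1776
        d.strftime("%B %d, %Y").replace(" 0", " "),   # July 4, 1776
        str(d.year),                                  # 1776
    ]

print(date_candidates("1776-07-04"))
```

A context then counts as answer-bearing if any candidate occurs in it, not just the normalised form.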
With the contexts extracted, we translated them into Danish, Swedish, and Norwegian, using the DeepL translation service for Danish and Swedish and the Google Translation service for Norwegian. After translation we verified that the Scandinavian answers do indeed occur in the translated contexts.
Because MKQA samples are filtered at both the merging stage and the translation stage, not all 10,000 samples could be converted to the Scandinavian languages; we end up with roughly 8,000 samples per language. These have been split into training, validation, and test splits, with the validation and test splits each containing roughly 750 samples. The splits were created so that the proportion of samples without an answer is roughly the same in each split.
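The answer-aware split can be sketched as stratified sampling over the has-answer flag, so each split preserves the proportion of unanswerable samples (field name and fractions here are illustrative, not the authors' actual code):

```python
import random

def stratified_split(samples, test_frac=0.1, seed=0):
    """Split so the share of unanswerable samples is preserved.
    `samples` are dicts with a possibly-None 'answer' field (illustrative)."""
    rng = random.Random(seed)
    answerable = [s for s in samples if s["answer"] is not None]
    unanswerable = [s for s in samples if s["answer"] is None]
    test, train = [], []
    # Sample the same fraction from each stratum independently.
    for group in (answerable, unanswerable):
        group = group[:]
        rng.shuffle(group)
        k = round(len(group) * test_frac)
        test += group[:k]
        train += group[k:]
    return train, test

samples = [{"answer": "x"}] * 90 + [{"answer": None}] * 10
train, test = stratified_split(samples)
print(len(test), sum(s["answer"] is None for s in test))  # 10 1
```

With a 10% unanswerable rate in the input, the test split above also ends up with 10% unanswerable samples, which is the property the card describes.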
The Tevatron/wikipedia-nq dataset is hosted on Hugging Face and was contributed by the HF Datasets community.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for Serbian Natural Questions (Subset)
Dataset Summary
This dataset is a Serbian translation of the first 8,000 examples from Google's Natural Questions (NQ) dataset. It contains real user questions and corresponding Wikipedia articles, automatically translated from English to Serbian. The dataset is designed for evaluating embedding models on Question Answering (QA) and Information Retrieval (IR) tasks in the Serbian language, offering a more realistic and… See the full description on the dataset page: https://huggingface.co/datasets/smartcat/natural_quesions_sr.
📘 Dataset Card for ReaRAG-20k
🤗 Model • 💻 GitHub • 📃 Paper
ReaRAG-20k is a reasoning-focused dataset designed for training the ReaRAG model. It contains approximately 20,000 multi-turn retrieval examples constructed from QA datasets such as HotpotQA, MuSiQue, and Natural Questions (NQ). Each instance follows a conversational format supporting reasoning and retrieval steps: { "messages": [{"role": "user", "content": "..."}, {"role": "assistant"… See the full description on the dataset page: https://huggingface.co/datasets/THU-KEG/ReaRAG-20k.
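A hypothetical instance in that conversational format (the content strings are invented for illustration; only the "messages"/"role"/"content" structure comes from the snippet above) and the invariant each turn satisfies:

```python
# Hypothetical ReaRAG-20k-style instance; content strings are illustrative.
instance = {
    "messages": [
        {"role": "user", "content": "Who directed the film that won Best Picture in 1998?"},
        {"role": "assistant", "content": "I need to retrieve which film won Best Picture in 1998."},
    ]
}

# Every turn carries exactly a role and a content string.
assert all(set(m) == {"role", "content"} for m in instance["messages"])
roles = [m["role"] for m in instance["messages"]]
print(roles)  # ['user', 'assistant']
```

This is the same messages layout used by most chat-style fine-tuning pipelines, so such instances can be fed to standard chat templates directly.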