6 datasets found
  1. natural-questions

    • huggingface.co
    Updated Jan 30, 2018
    Cite
    Sentence Transformers (2018). natural-questions [Dataset]. https://huggingface.co/datasets/sentence-transformers/natural-questions
    Dataset authored and provided by
    Sentence Transformers
    Description

    Dataset Card for Natural Questions

    This dataset is a collection of question-answer pairs from the Natural Questions dataset. See Natural Questions for additional information. This dataset can be used directly with Sentence Transformers to train embedding models.

    Dataset Subsets

    pair subset

    Columns: "query", "answer"
    Column types: str, str
    Examples: { 'query': 'the si unit of the electric field is', 'answer': 'Electric field An electric field is a field…
    See the full description on the dataset page: https://huggingface.co/datasets/sentence-transformers/natural-questions.

  2. nq-simplified

    • huggingface.co
    Cite
    Lukas Kreussel, nq-simplified [Dataset]. https://huggingface.co/datasets/LLukas22/nq-simplified
    Authors
    Lukas Kreussel
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0) https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    Dataset Card for "nq"

      Dataset Summary
    

    This is a modified version of the original Natural Questions (NQ) dataset for QA tasks. The original is available here. Each sample was preprocessed into a SQuAD-like format. The context was shortened from an entire Wikipedia article to the passage containing the answer.

    Dataset Structure

    Data Instances

    An example of 'train' looks as follows:
    { "context": "The 2017 Major League Baseball All - Star Game was…
    See the full description on the dataset page: https://huggingface.co/datasets/LLukas22/nq-simplified.
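The preprocessing described in the summary (reduce a whole article to the passage that contains the answer, then store it in a SQuAD-like record) can be sketched roughly as follows. This is a minimal illustration, not the dataset author's code: the helper `to_squad_like`, the sample article, and every field name other than `context` are assumptions borrowed from the common SQuAD layout.

```python
def to_squad_like(question, paragraphs, answer):
    """Build a SQuAD-style record from the first paragraph containing the answer.

    Field names other than "context" are assumed (SQuAD convention), not
    taken from the nq-simplified dataset card.
    """
    for paragraph in paragraphs:
        start = paragraph.find(answer)
        if start != -1:
            return {
                "question": question,
                "context": paragraph,
                "answers": {"text": [answer], "answer_start": [start]},
            }
    return None  # no paragraph contains the answer verbatim


# Hypothetical article split into paragraphs.
article = [
    "The electric field is defined at each point in space.",
    "The SI unit of the electric field is the volt per metre.",
]
record = to_squad_like(
    "the si unit of the electric field is", article, "volt per metre"
)
```

Storing `answer_start` as a character offset into the shortened context lets an extractive QA model recover the answer span by slicing.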

  3. scandi-qa

    • huggingface.co
    Updated Aug 10, 2023
    Cite
    Alexandra Institute (2023). scandi-qa [Dataset]. http://doi.org/10.57967/hf/6061
    Dataset authored and provided by
    Alexandra Institute
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    ScandiQA is a dataset of questions and answers in the Danish, Norwegian, and Swedish languages. All samples come from the Natural Questions (NQ) dataset, which is a large question answering dataset from Google searches. The Scandinavian questions and answers come from the MKQA dataset, where 10,000 NQ samples were manually translated into, among others, Danish, Norwegian, and Swedish. However, this did not include a translated context, hindering the training of extractive question answering models.

    We merged the NQ dataset with the MKQA dataset and extracted contexts in one of two ways. Where NQ provides a "long answer", being the paragraph in which the answer was found, we used it as the context. Otherwise, we extracted the context by locating the paragraphs that have the largest cosine similarity to the question and that contain the desired answer.
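The fallback step (choose, among paragraphs containing the answer, the one most similar to the question) might look like the sketch below. The description does not say which text representation was used for the cosine similarity, so plain bag-of-words counts are assumed here purely for illustration; `bag_of_words`, `cosine`, and `pick_context` are hypothetical helper names.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lower-cased token counts; an assumed stand-in for the real representation."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def pick_context(question, paragraphs, answer):
    """Return the answer-bearing paragraph most similar to the question, or None."""
    candidates = [p for p in paragraphs if answer.lower() in p.lower()]
    if not candidates:
        return None
    q = bag_of_words(question)
    return max(candidates, key=lambda p: cosine(q, bag_of_words(p)))


# Hypothetical paragraphs for illustration.
paragraphs = [
    "Paris is a city in Texas.",
    "The capital of France is Paris.",
    "France borders Spain.",
]
context = pick_context("what is the capital of france", paragraphs, "Paris")
```

Filtering by answer presence before ranking guarantees that the selected context can support extractive training, even if a paragraph without the answer would score higher.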

    Further, many answers in the MKQA dataset were "language normalised": for instance, all date answers were converted to the format "YYYY-MM-DD", so in most cases these answers do not appear in any paragraph. We solved this by extending the MKQA answers with plausible "answer candidates", being slight perturbations or translations of the answer.
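To make the "answer candidates" idea concrete, here is a small sketch for the date case. The specific surface forms generated below are assumptions chosen for illustration (English month names only); the description does not list the perturbations or translations that were actually used, and `date_candidates` and `find_answer` are hypothetical names.

```python
from datetime import date

MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]


def date_candidates(normalised):
    """Expand a 'YYYY-MM-DD' answer into assumed surface forms, most specific first."""
    d = date.fromisoformat(normalised)
    month = MONTHS[d.month - 1]
    return [
        normalised,
        f"{month} {d.day}, {d.year}",
        f"{d.day} {month} {d.year}",
        f"{month} {d.year}",
        str(d.year),
    ]


def find_answer(context, normalised):
    """Return the first candidate that occurs verbatim in the context, else None."""
    for candidate in date_candidates(normalised):
        if candidate in context:
            return candidate
    return None


# Hypothetical context sentence.
match = find_answer("The game was played on July 11, 2017 in Miami.", "2017-07-11")
```

Ordering candidates from most to least specific means a full date is preferred over a bare year when both occur in the paragraph.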

    With the contexts extracted, we translated these to Danish, Swedish and Norwegian using the DeepL translation service for Danish and Swedish, and the Google Translation service for Norwegian. After translation we ensured that the Scandinavian answers do indeed occur in the translated contexts.

    As we are filtering the MKQA samples at both the "merging stage" and the "translation stage", we are not able to fully convert the 10,000 samples to the Scandinavian languages, and instead get roughly 8,000 samples per language. These have further been split into a training, validation and test split, with the former two containing roughly 750 samples. The splits have been created in such a way that the proportion of samples without an answer is roughly the same in each split.
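A split that keeps the share of unanswerable samples roughly equal across partitions can be sketched as a stratified split. This is only an illustration under stated assumptions: samples are assumed to be dicts whose `answer` field is an empty string when there is no answer, and `stratified_split` with its parameters is a hypothetical helper, not the procedure the authors ran.

```python
import random


def stratified_split(samples, val_size, test_size, seed=0):
    """Split samples so each part keeps roughly the same no-answer proportion.

    Assumes each sample is a dict with an "answer" key that is empty
    when the question has no answer (a hypothetical convention).
    """
    rng = random.Random(seed)
    splits = {"train": [], "val": [], "test": []}
    total = len(samples)
    for has_answer in (True, False):
        group = [s for s in samples if bool(s["answer"]) == has_answer]
        rng.shuffle(group)
        # Give each held-out split a share proportional to the group size.
        n_val = round(val_size * len(group) / total)
        n_test = round(test_size * len(group) / total)
        splits["val"] += group[:n_val]
        splits["test"] += group[n_val:n_val + n_test]
        splits["train"] += group[n_val + n_test:]
    return splits


# Hypothetical data: 80 answerable and 20 unanswerable samples.
samples = [{"id": i, "answer": "x" if i < 80 else ""} for i in range(100)]
splits = stratified_split(samples, val_size=10, test_size=10)
```

Splitting each group separately, rather than shuffling everything together, is what keeps the no-answer proportion stable even for small validation and test sets.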

  4. wikipedia-nq

    • huggingface.co
    Cite
    Tevatron, wikipedia-nq [Dataset]. https://huggingface.co/datasets/Tevatron/wikipedia-nq
    Dataset authored and provided by
    Tevatron
    Description

    Tevatron/wikipedia-nq dataset hosted on Hugging Face and contributed by the HF Datasets community

  5. natural_quesions_sr

    • huggingface.co
    Updated Oct 10, 2024
    Cite
    SmartCat (2024). natural_quesions_sr [Dataset]. https://huggingface.co/datasets/smartcat/natural_quesions_sr
    Dataset authored and provided by
    SmartCat
    License

    MIT License https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card for Serbian Natural Questions (Subset)

      Dataset Summary
    

    This dataset is a Serbian translation of the first 8,000 examples from Google's Natural Questions (NQ) dataset. It contains real user questions and corresponding Wikipedia articles, automatically translated from English to Serbian. The dataset is designed for evaluating embedding models on Question Answering (QA) and Information Retrieval (IR) tasks in the Serbian language, offering a more realistic and… See the full description on the dataset page: https://huggingface.co/datasets/smartcat/natural_quesions_sr.

  6. ReaRAG-20k

    • huggingface.co
    Updated Apr 18, 2025
    Cite
    Knowledge Engineer Group @ Tsinghua University (2025). ReaRAG-20k [Dataset]. https://huggingface.co/datasets/THU-KEG/ReaRAG-20k
    Dataset authored and provided by
    Knowledge Engineer Group @ Tsinghua University
    Description

    📘 Dataset Card for ReaRAG-20k


    ReaRAG-20k is a reasoning-focused dataset designed for training the ReaRAG model. It contains approximately 20,000 multi-turn retrieval examples constructed from QA datasets such as HotpotQA, MuSiQue, and Natural Questions (NQ). Each instance follows a conversational format supporting reasoning and retrieval steps:
    { "messages": [{"role": "user", "content": "..."}, {"role": "assistant"…
    See the full description on the dataset page: https://huggingface.co/datasets/THU-KEG/ReaRAG-20k.


