72 datasets found
  1. tweet_qa

    • huggingface.co
    • opendatalab.com
    Updated Jun 30, 2021
    Cite
    UC Santa Barbara NLP Group (2021). tweet_qa [Dataset]. https://huggingface.co/datasets/ucsbnlp/tweet_qa
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jun 30, 2021
    Dataset authored and provided by
    UC Santa Barbara NLP Group
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dataset Card for TweetQA

      Dataset Summary
    

    As social media becomes an increasingly popular place to report news and real-time events, developing automated question answering systems is critical to the effectiveness of many applications that rely on real-time knowledge. While previous question answering (QA) datasets have concentrated on formal text like news and Wikipedia, the first large-scale dataset for QA over social media data is presented. To make sure… See the full description on the dataset page: https://huggingface.co/datasets/ucsbnlp/tweet_qa.

  2. openbookqa

    • huggingface.co
    • paperswithcode.com
    • +1 more
    Updated Mar 1, 2024
    Cite
    Ai2 (2024). openbookqa [Dataset]. https://huggingface.co/datasets/allenai/openbookqa
    Dataset updated
    Mar 1, 2024
    Dataset provided by
    Allen Institute for AI: http://allenai.org/
    Authors
    Ai2
    License

    https://choosealicense.com/licenses/unknown/

    Description

    Dataset Card for OpenBookQA

      Dataset Summary
    

    OpenBookQA aims to promote research in advanced question-answering, probing a deeper understanding of both the topic (with salient facts summarized as an open book, also provided with the dataset) and the language it is expressed in. In particular, it contains questions that require multi-step reasoning, use of additional common and commonsense knowledge, and rich text comprehension. OpenBookQA is a new kind of… See the full description on the dataset page: https://huggingface.co/datasets/allenai/openbookqa.

  3. covid_qa_deepset

    • huggingface.co
    Updated Sep 2, 2023
    + more versions
    Cite
    deepset (2023). covid_qa_deepset [Dataset]. https://huggingface.co/datasets/deepset/covid_qa_deepset
    Dataset updated
    Sep 2, 2023
    Dataset authored and provided by
    deepset: https://www.deepset.ai/
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Card for COVID-QA

      Dataset Summary
    

    COVID-QA is a Question Answering dataset consisting of 2,019 question/answer pairs annotated by volunteer biomedical experts on scientific articles related to COVID-19. A total of 147 scientific articles from the CORD-19 dataset were annotated by 15 experts.

      Supported Tasks and Leaderboards
    

    [More Information Needed]

      Languages
    

    The text in the dataset is in English.

      Dataset Structure
    
    
    
    
    
      Data… See the full description on the dataset page: https://huggingface.co/datasets/deepset/covid_qa_deepset.
    
  4. squad

    • huggingface.co
    • tensorflow.org
    • +1 more
    Updated Jun 12, 2020
    + more versions
    Cite
    Pranav R (2020). squad [Dataset]. https://huggingface.co/datasets/rajpurkar/squad
    Dataset updated
    Jun 12, 2020
    Authors
    Pranav R
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dataset Card for SQuAD

      Dataset Summary
    

    Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. SQuAD 1.1 contains 100,000+ question-answer pairs on 500+ articles.

      Supported Tasks and Leaderboards
    

    Question Answering.… See the full description on the dataset page: https://huggingface.co/datasets/rajpurkar/squad.

  5. narrativeqa

    • huggingface.co
    Updated Jun 3, 2024
    Cite
    Deepmind (2024). narrativeqa [Dataset]. https://huggingface.co/datasets/deepmind/narrativeqa
    Dataset updated
    Jun 3, 2024
    Dataset provided by
    DeepMind: http://deepmind.com/
    Authors
    Deepmind
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Card for Narrative QA

      Dataset Summary
    

    NarrativeQA is an English-language dataset of stories and corresponding questions designed to test reading comprehension, especially on long documents.

      Supported Tasks and Leaderboards
    

    The dataset is used to test reading comprehension. There are 2 tasks proposed in the paper: "summaries only" and "stories only", depending on whether the human-generated summary or the full story text is used to answer the question.… See the full description on the dataset page: https://huggingface.co/datasets/deepmind/narrativeqa.

  6. coqa

    • huggingface.co
    • tensorflow.org
    Updated Jan 24, 2024
    Cite
    Stanford NLP (2024). coqa [Dataset]. https://huggingface.co/datasets/stanfordnlp/coqa
    Dataset updated
    Jan 24, 2024
    Dataset authored and provided by
    Stanford NLP
    License

    https://choosealicense.com/licenses/other/

    Description

    Dataset Card for "coqa"

      Dataset Summary
    

    CoQA is a large-scale dataset for building Conversational Question Answering systems. Our dataset contains 127k questions with answers, obtained from 8k conversations about text passages from seven diverse domains. The questions are conversational, and the answers are free-form text with their corresponding evidence highlighted in the passage.

      Supported Tasks and Leaderboards
    

    More Information Needed

      Languages… See the full description on the dataset page: https://huggingface.co/datasets/stanfordnlp/coqa.
    
  7. hotpotqa

    • huggingface.co
    • opendatalab.com
    • +1 more
    Updated Mar 2, 2024
    + more versions
    Cite
    Massive Text Embedding Benchmark (2024). hotpotqa [Dataset]. https://huggingface.co/datasets/mteb/hotpotqa
    Dataset updated
    Mar 2, 2024
    Dataset authored and provided by
    Massive Text Embedding Benchmark
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    HotpotQA: an MTEB (Massive Text Embedding Benchmark) dataset

    HotpotQA is a question answering dataset featuring natural, multi-hop questions, with strong supervision for supporting facts to enable more explainable question answering systems.

    Task category: t2t

    Domains: Web, Written

    Reference: https://hotpotqa.github.io/

      How to evaluate on this task
    

    You can evaluate an embedding model on this dataset using the following code: import mteb

    task =… See the full description on the dataset page: https://huggingface.co/datasets/mteb/hotpotqa.

  8. search_qa

    • huggingface.co
    Updated May 19, 2024
    Cite
    Kyunghyun Cho (2024). search_qa [Dataset]. https://huggingface.co/datasets/kyunghyuncho/search_qa
    Dataset updated
    May 19, 2024
    Authors
    Kyunghyun Cho
    License

    https://choosealicense.com/licenses/unknown/

    Description

    We publicly release a new large-scale dataset, called SearchQA, for machine comprehension, or question-answering. Unlike recently released datasets, such as DeepMind CNN/DailyMail and SQuAD, the proposed SearchQA was constructed to reflect a full pipeline of general question-answering. That is, we start not from an existing article and generate a question-answer pair, but start from an existing question-answer pair, crawled from J! Archive, and augment it with text snippets retrieved by Google. Following this approach, we built SearchQA, which consists of more than 140k question-answer pairs with each pair having 49.6 snippets on average. Each question-answer-context tuple of the SearchQA comes with additional meta-data such as the snippet's URL, which we believe will be valuable resources for future research. We conduct human evaluation as well as test two baseline methods, one simple word selection and the other deep learning based, on the SearchQA. We show that there is a meaningful gap between the human and machine performances. This suggests that the proposed dataset could well serve as a benchmark for question-answering.

  9. Data from: quac

    • huggingface.co
    • tensorflow.org
    • +1 more
    Updated Dec 12, 2020
    + more versions
    Cite
    Ai2 (2020). quac [Dataset]. https://huggingface.co/datasets/allenai/quac
    Dataset updated
    Dec 12, 2020
    Dataset provided by
    Allen Institute for AI: http://allenai.org/
    Authors
    Ai2
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Question Answering in Context is a dataset for modeling, understanding, and participating in information seeking dialog. Data instances consist of an interactive dialog between two crowd workers: (1) a student who poses a sequence of freeform questions to learn as much as possible about a hidden Wikipedia text, and (2) a teacher who answers the questions by providing short excerpts (spans) from the text. QuAC introduces challenges not found in existing machine comprehension datasets: its questions are often more open-ended, unanswerable, or only meaningful within the dialog context.

  10. squad_v2

    • huggingface.co
    Updated Jun 15, 2005
    Cite
    Pranav R (2005). squad_v2 [Dataset]. https://huggingface.co/datasets/rajpurkar/squad_v2
    Dataset updated
    Jun 15, 2005
    Authors
    Pranav R
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dataset Card for SQuAD 2.0

      Dataset Summary
    

    Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. SQuAD 2.0 combines the 100,000 questions in SQuAD1.1 with over 50,000 unanswerable questions written adversarially by crowdworkers… See the full description on the dataset page: https://huggingface.co/datasets/rajpurkar/squad_v2.

  11. pile-of-law

    • huggingface.co
    • opendatalab.com
    Updated Jul 10, 2022
    Cite
    Pile of Law (2022). pile-of-law [Dataset]. https://huggingface.co/datasets/pile-of-law/pile-of-law
    Dataset updated
    Jul 10, 2022
    Dataset authored and provided by
    Pile of Law
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    We curate a large corpus of legal and administrative data. The utility of this data is twofold: (1) to aggregate legal and administrative data sources that demonstrate different norms and legal standards for data filtering; (2) to collect a dataset that can be used in the future for pretraining legal-domain language models, a key direction in access-to-justice initiatives.

  12. alpaca

    • huggingface.co
    • opendatalab.com
    Updated Mar 14, 2023
    + more versions
    Cite
    Tatsu Lab (2023). alpaca [Dataset]. https://huggingface.co/datasets/tatsu-lab/alpaca
    Dataset updated
    Mar 14, 2023
    Dataset authored and provided by
    Tatsu Lab
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Dataset Card for Alpaca

      Dataset Summary
    

    Alpaca is a dataset of 52,000 instructions and demonstrations generated by OpenAI's text-davinci-003 engine. This instruction data can be used to conduct instruction-tuning for language models and make them follow instructions better. The authors built on the data generation pipeline from the Self-Instruct framework and made the following modifications:

    The text-davinci-003 engine to generate the instruction data instead… See the full description on the dataset page: https://huggingface.co/datasets/tatsu-lab/alpaca.

  13. multi-signals-QA-dataset

    • huggingface.co
    Updated Aug 9, 2024
    Cite
    p b (2024). multi-signals-QA-dataset [Dataset]. https://huggingface.co/datasets/bobox/multi-signals-QA-dataset
    Dataset updated
    Aug 9, 2024
    Authors
    p b
    Description

    Dataset Card for LLM-Generated QA Dataset for Sentence Transformers

      Dataset Summary
    

    This dataset contains question-answer pairs generated by a large language model (LLM) for training sentence transformer models. Each entry includes a query, a main response, and various metadata fields to provide context and facilitate different downstream tasks.

      Supported Tasks and Leaderboards
    

    The dataset is primarily designed for:

    Open-domain question answering Text… See the full description on the dataset page: https://huggingface.co/datasets/bobox/multi-signals-QA-dataset.

  14. GenerativeAI-QA-Dataset

    • huggingface.co
    Cite
    nglif, GenerativeAI-QA-Dataset [Dataset]. https://huggingface.co/datasets/nglif/GenerativeAI-QA-Dataset
    Authors
    nglif
    Description

    🧠 Generative AI QA Dataset

      📝 Dataset Overview
    

    This dataset consists of 10,000 high-quality question-answer pairs focused on Generative AI. It is designed for instruction tuning of text-to-text generation models, enabling improved performance on AI-related question-answering tasks.

      ⚠️ Important Note
    

    This dataset was originally intended for use in AWS AI Singapore 2025. However, the fine-tuned model performed very inaccurately because of my lack of prompt… See the full description on the dataset page: https://huggingface.co/datasets/nglif/GenerativeAI-QA-Dataset.

  15. cosmos_qa

    • huggingface.co
    • paperswithcode.com
    • +2 more
    Updated May 23, 2024
    + more versions
    Cite
    Ai2 (2024). cosmos_qa [Dataset]. https://huggingface.co/datasets/allenai/cosmos_qa
    Dataset updated
    May 23, 2024
    Dataset provided by
    Allen Institute for AI: http://allenai.org/
    Authors
    Ai2
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Cosmos QA is a large-scale dataset of 35.6K problems that require commonsense-based reading comprehension, formulated as multiple-choice questions. It focuses on reading between the lines over a diverse collection of people's everyday narratives, asking questions concerning the likely causes or effects of events that require reasoning beyond the exact text spans in the context.

  16. ambig_qa

    • huggingface.co
    Cite
    Sewon Min, ambig_qa [Dataset]. https://huggingface.co/datasets/sewon/ambig_qa
    Authors
    Sewon Min
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
    License information was derived automatically

    Description

    Dataset Card for AmbigQA: Answering Ambiguous Open-domain Questions

      Dataset Summary
    

    AmbigNQ is a dataset covering 14,042 questions from NQ-open, an existing open-domain QA benchmark. We find that over half of the questions in NQ-open are ambiguous. The types of ambiguity are diverse and sometimes subtle, many of which are only apparent after examining evidence provided by a very large text corpus. AMBIGNQ, a dataset with 14,042 annotations on NQ-OPEN questions containing… See the full description on the dataset page: https://huggingface.co/datasets/sewon/ambig_qa.

  17. principles-qa-llama-formatted-text

    • huggingface.co
    Cite
    Abhay Sheshadri, principles-qa-llama-formatted-text [Dataset]. https://huggingface.co/datasets/abhayesian/principles-qa-llama-formatted-text
    Authors
    Abhay Sheshadri
    Description

    abhayesian/principles-qa-llama-formatted-text dataset hosted on Hugging Face and contributed by the HF Datasets community

  18. webnlg-qa

    • huggingface.co
    Updated Jun 11, 2025
    Cite
    Orange (2025). webnlg-qa [Dataset]. https://huggingface.co/datasets/Orange/webnlg-qa
    Dataset updated
    Jun 11, 2025
    Dataset authored and provided by
    Orange
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dataset Card for WEBNLG-QA

      Dataset Summary
    

    WEBNLG-QA is a conversational question answering dataset grounded on WEBNLG. It consists of a set of question-answering dialogues (follow-up question-answer pairs) based on short paragraphs of text. Each paragraph is associated with a knowledge graph (from WEBNLG). The questions are associated with SPARQL queries.

      Supported tasks
    

    Knowledge-based question-answering; SPARQL-to-Text conversion

      Knowledge based… See the full description on the dataset page: https://huggingface.co/datasets/Orange/webnlg-qa.
    
  19. Hebrew_Squad_v1

    • huggingface.co
    Updated Oct 6, 2022
    Cite
    Technion Data and Knowledge Lab (2022). Hebrew_Squad_v1 [Dataset]. https://huggingface.co/datasets/tdklab/Hebrew_Squad_v1
    Dataset updated
    Oct 6, 2022
    Dataset authored and provided by
    Technion Data and Knowledge Lab
    Description

    Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. This Hebrew dataset is an automatic translation of the English SQuAD dataset.

  20. gpqa

    • huggingface.co
    • opendatalab.com
    Updated Nov 21, 2023
    + more versions
    Cite
    David Rein (2023). gpqa [Dataset]. https://huggingface.co/datasets/Idavidrein/gpqa
    Dataset updated
    Nov 21, 2023
    Authors
    David Rein
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Card for GPQA

    GPQA is a multiple-choice Q&A dataset of very hard questions written and validated by experts in biology, physics, and chemistry. When attempting questions outside their own domain (e.g., a physicist answers a chemistry question), these experts get only 34% accuracy, despite spending more than 30 minutes with full access to Google. We request that you do not reveal examples from this dataset in plain text or images online, to reduce the risk of leakage into foundation model… See the full description on the dataset page: https://huggingface.co/datasets/Idavidrein/gpqa.

tweet_qa

TweetQA

ucsbnlp/tweet_qa

10 scholarly articles cite this dataset (View in Google Scholar)