The Advice-Seeking Questions (ASQ) dataset is a collection of personal narratives paired with advice-seeking questions. The dataset has been split into train, test, and heldout sets containing 8,865, 2,500, and 10,000 instances, respectively. It is used to train and evaluate methods that infer the advice-seeking goal behind a personal narrative. The task is formulated as a cloze test: identify which of two advice-seeking questions was removed from a given narrative.
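As a rough illustration of that cloze formulation, the sketch below scores both candidate questions against a narrative and picks the better fit. The field names (narrative, question_a, question_b, label) and the word-overlap scorer are assumptions for illustration, not the official evaluation code.

```python
# Minimal sketch of the two-candidate cloze evaluation described above.
# Field names and the scorer are assumptions; the actual ASQ release may differ.
from typing import Callable, Dict, List

def cloze_accuracy(examples: List[Dict], score: Callable[[str, str], float]) -> float:
    """score(narrative, question) returns a compatibility score; higher is better."""
    correct = 0
    for ex in examples:
        s_a = score(ex["narrative"], ex["question_a"])
        s_b = score(ex["narrative"], ex["question_b"])
        predicted = "a" if s_a >= s_b else "b"
        correct += int(predicted == ex["label"])
    return correct / len(examples)

# Trivial word-overlap scorer, used here only to make the sketch runnable.
def overlap_score(narrative: str, question: str) -> float:
    n_words, q_words = set(narrative.lower().split()), set(question.lower().split())
    return len(n_words & q_words) / max(len(q_words), 1)
```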
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset, QASPER: NLP Questions and Evidence, is an exceptional collection of over 5,000 questions and answers focused on Natural Language Processing (NLP) papers. It has been crowdsourced from experienced NLP practitioners, with each question meticulously crafted based solely on the titles and abstracts of the respective papers. The answers provided are expertly enriched with evidence taken directly from the full text of each paper. QASPER features structured fields including 'qas' for questions and answers, 'evidence' for supporting information, paper titles, abstracts, figures and tables, and full text. This makes it a valuable resource for researchers aiming to understand how practitioners interpret NLP topics and to validate solutions for problems found in existing literature. The dataset contains 5,049 questions spanning 1,585 distinct papers.
The QASPER dataset comprises 5,049 questions across 1,585 papers. It is distributed across five files: four in .csv format plus one .json file for figures and tables. These comprise two test datasets (test.csv and validation.csv), two train datasets (train-v2-0_lessons_only_.csv and trainv2-0_unsplit.csv), and a figures dataset (figures_and_tables_.json). Each CSV file contains a distinct dataset with columns dedicated to titles, abstracts, full texts, and Q&A fields, along with evidence for each paper mentioned in the respective rows.
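A minimal loading sketch for the distribution described above, assuming the five files sit in the working directory; the exact column headers inside each CSV may differ from the description.

```python
# Sketch: load the four CSV splits and the figures/tables JSON named above.
import json
import pandas as pd

splits = {
    "test": pd.read_csv("test.csv"),
    "validation": pd.read_csv("validation.csv"),
    "train_lessons_only": pd.read_csv("train-v2-0_lessons_only_.csv"),
    "train_unsplit": pd.read_csv("trainv2-0_unsplit.csv"),
}

with open("figures_and_tables_.json") as f:
    figures_and_tables = json.load(f)

for name, df in splits.items():
    print(name, df.shape, list(df.columns)[:6])
```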
This dataset is ideal for various applications, including:
* Developing AI models to automatically generate questions and answers from paper titles and abstracts.
* Enhancing machine learning algorithms by combining answers with evidence to discover relationships between papers.
* Creating online forums for NLP practitioners, using dataset questions to spark discussion within the community.
* Conducting basic descriptive statistics or advanced predictive analytics, such as logistic regression or naive Bayes models.
* Summarising basic crosstabs between any two variables, like titles and abstracts.
* Correlating title lengths with the number of words in their corresponding abstracts to identify patterns (see the sketch after this list).
* Utilising text mining technologies like topic modelling, machine learning techniques, or automated processes to summarise underlying patterns.
* Filtering terms relevant to specific research hypotheses and processing them via web crawlers, search engines, or document similarity algorithms.
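For instance, the title-length/abstract-length correlation mentioned above could be computed roughly as follows, assuming the training CSV exposes title and abstract columns (an assumption about the schema):

```python
# Sketch: correlate title length with abstract word count in one training file.
import pandas as pd

train = pd.read_csv("trainv2-0_unsplit.csv")
title_len = train["title"].fillna("").astype(str).str.split().str.len()
abstract_len = train["abstract"].fillna("").astype(str).str.split().str.len()
print("Pearson r between title length and abstract length:",
      title_len.corr(abstract_len))
```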
The dataset has a GLOBAL region scope. It focuses on papers within the field of Natural Language Processing. The questions and answers are crowdsourced from experienced NLP practitioners. The dataset was listed on 22/06/2025.
CC0
This dataset is highly suitable for:
* Researchers seeking insights into how NLP practitioners interpret complex topics.
* Those requiring effective validation for developing clear-cut solutions to problems encountered in existing NLP literature.
* NLP practitioners looking for a resource to stimulate discussions within their community.
* Data scientists and analysts interested in exploring NLP datasets through descriptive statistics or advanced predictive analytics.
* Developers and researchers working with text mining, machine learning techniques, or automated text processing.
Original Data Source: QASPER: NLP Questions and Evidence
Round 8 Test Dataset

This is the training data used to construct and evaluate trojan detection software solutions. This data, generated at NIST, consists of natural language processing (NLP) AIs trained to perform extractive question answering. A known percentage of these trained AI models have been poisoned with a known trigger which induces incorrect behavior. This data will be used to develop software solutions for detecting which trained AI models have been poisoned via embedded triggers. This dataset consists of 360 QA AI models using a small set of model architectures. Half (50%) of the models have been poisoned with an embedded trigger which causes misclassification of the input when the trigger is present.
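A hedged sketch of how detector outputs might be scored on a benchmark like this one, given one predicted probability of poisoning per model and the known 50/50 ground truth; the actual NIST evaluation protocol is not described here, so cross-entropy and ROC-AUC are merely illustrative choices.

```python
# Sketch: evaluate a trojan detector's per-model probabilities against ground truth.
import numpy as np
from sklearn.metrics import log_loss, roc_auc_score

rng = np.random.default_rng(0)
y_true = np.array([0, 1] * 180)                     # 360 models, half poisoned
y_prob = np.clip(rng.random(360), 1e-6, 1 - 1e-6)   # placeholder detector outputs

print("cross-entropy:", log_loss(y_true, y_prob))
print("ROC-AUC:", roc_auc_score(y_true, y_prob))
```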
20 Questions Dataset
Dataset Overview
The 20 Questions Dataset was used in the EYE-Llama paper (https://www.biorxiv.org/content/10.1101/2024.04.26.591355v1) as a test set for evaluating ophthalmic language models. This dataset contains a collection of questions and answers specifically tailored to the ophthalmic domain. It serves as a valuable resource for assessing the performance of models in answering domain-specific queries.
License
This dataset is… See the full description on the dataset page: https://huggingface.co/datasets/QIAIUNCC/EYE-TEST-2.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
CinePile: A Long Video Question Answering Dataset and Benchmark
CinePile is a question-answering-based, long-form video understanding dataset. It was created using advanced large language models (LLMs) in a human-in-the-loop pipeline that leverages existing human-generated raw data. It consists of approximately 300,000 training data points and 5,000 test data points. If you have any comments or questions, reach out to Ruchit Rawal or Gowthami Somepalli. Other links - Website… See the full description on the dataset page: https://huggingface.co/datasets/tomg-group-umd/cinepile.
What does PISA actually assess? This book presents all the publicly available questions from the PISA surveys. Some of these questions were used in the PISA 2000, 2003 and 2006 surveys, while others were used in developing and trying out the assessment. After a brief introduction to the PISA assessment, the book presents three chapters containing PISA questions for the reading, mathematics and science tests, respectively. Each chapter opens with an overview of what exactly the questions assess. The second section of each chapter presents questions which were used in the PISA 2000, 2003 and 2006 surveys, that is, the actual PISA tests for which results were published. The third section presents questions used in trying out the assessment. Although these questions were not used in the PISA 2000, 2003 and 2006 surveys, they are nevertheless illustrative of the kinds of questions PISA uses. The final section shows all the answers, along with brief comments on each question.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset provides a collection of German question-answer pairs along with their corresponding context [1]. It is designed to enhance and facilitate natural language processing (NLP) tasks in the German language [1]. The dataset includes two main files, train.csv and test.csv, each containing numerous entries of various contexts with associated questions and answers in German [1]. The contextual information can range from paragraphs to concise sentences, offering a well-rounded representation of different scenarios [1]. It serves as a valuable resource for training machine learning models to improve question-answering systems or other NLP applications specific to the German language [1].
The dataset consists of the following columns [1, 2]:
* id: An identifier for each entry [2].
* context: The context in which the question is being asked. It can be a paragraph, a sentence, or any other relevant information [1].
* question: The question related to the given context [2].
* answers: The answer or answers to the given question within the corresponding context [1]. The answers can be single or multiple [1].
* Label Count: Numerical ranges with corresponding counts [2].
The dataset is provided in CSV format [1, 3], comprising two main files: train.csv and test.csv [1]. Both files contain a significant number of question-answer pairs and their respective contexts [1]. While specific total row or record counts are not explicitly stated, the source material indicates substantial amounts of data [1]. For instance, certain label counts range from 36,419.00 to 45,662.00, with varying numbers of entries within those ranges, such as 529, 508, or 29 unique values for specific segments [2].
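A quick inspection sketch under the column description above; file locations and exact header names are assumptions.

```python
# Sketch: load both CSV files and peek at the documented columns.
import pandas as pd

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

print(train[["id", "context", "question", "answers"]].head())
print("train/test rows:", len(train), len(test))
```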
This dataset is ideal for a variety of applications and use cases, including [1]:
* Building question-answering systems in German.
* Training models for German language understanding and translation tasks.
* Developing information retrieval systems that can process German user queries and return relevant information from provided contexts.
* Enhancing NLP models for accuracy and robustness in German.
* Exploring state-of-the-art methodologies or developing novel approaches for natural language understanding in German [1].
The dataset's linguistic scope is specifically the German language [1]. Geographically, it is intended for global use [4]. There are no specific notes on time range or demographic availability within the provided sources.
CC0
The dataset is intended for [1]:
* Researchers working on advancements in machine learning techniques applied to natural language understanding in German.
* Developers building and refining NLP applications for the German language.
* Enthusiasts exploring and implementing machine learning models for language processing.
Original Data Source: German Question-Answer Context Dataset
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Diverse learning theories have been constructed to understand learners' internal states through various tangible predictors. We focus on self-regulatory actions that are subconscious and habitual actions triggered by behavior agents' 'awareness' of their attention loss. We hypothesize that self-regulatory behaviors (i.e., attention regulation behaviors) also occur in e-reading as 'regulators' as found in other behavior models (Ekman, P., & Friesen, W. V., 1969). In this work, we try to define the types and frequencies of attention regulation behaviors in e-reading. We collected various cues that reflect learners' moment-to-moment and page-to-page cognitive states to understand the learners' attention in e-reading.
The text 'How to make the most of your day at Disneyland Resort Paris' was implemented on a screen-based e-reader, which we developed in a PDF-reader format. An informative, entertaining text was adopted to capture learners' attentional shifts during knowledge acquisition. The text has 2,685 words, distributed over ten pages, with one subtopic on each page. A built-in webcam on a Mac Pro and a mouse were used for data collection, aiming for real-world implementation with only essential computational devices. A height-adjustable laptop stand was used to compensate for participants' eye levels.
Thirty learners in higher education were invited to a screen-based e-reading task (M = 16.2, SD = 5.2 minutes). A pre-test questionnaire with ten multiple-choice questions was given before the reading to check their prior knowledge of the topic; there was no specific time limit for the questionnaire. We collected cues that reflect learners' moment-to-moment and page-to-page cognitive states to understand learners' attention in e-reading. Learners were asked to report their distractions on two levels during the reading: 1) in-text distraction (e.g., still reading the text with low attentiveness) or 2) out-of-text distraction (e.g., thinking of something else while no longer reading the text). We implemented two noticeably designed buttons on the right-hand side of the screen interface to minimize possible distraction from the reporting task.

After a new page was opened, blur stimuli were applied to the text within a random range of 20 seconds, which ensures that the blur stimuli occur at least once on each page. Participants were asked to click the de-blur button on the text area of the screen to proceed with the reading. The button covers the whole text area, so participants can minimize the effort of finding and clicking it. Reaction time for de-blurring was also measured, to gauge learners' arousal during the reading.

We asked participants to answer pre-test and post-test questionnaires about the reading material. Participants were given ten multiple-choice questions before the session, and the same set of questions was given after the reading session (i.e., formative questions), with added subtopic summarization questions (i.e., summative questions). This provides insight into the quantitative and qualitative knowledge gained through the session and into different learning outcomes based on individual differences.

A video dataset of 931,440 frames was annotated with the attention regulator behaviors using an annotation tool that plays the long sequence clip by clip, where each clip contains 30 frames. Two annotators (doctoral students) carried out two stages of labeling. In the first stage, the annotators were trained on the labeling criteria and annotated the attention regulator behaviors separately based on their own judgments. The labels were summarized and cross-checked in the second round to resolve inconsistent cases, resulting in five attention regulation behaviors and one neutral state. See WEDAR_readme.csv for detailed descriptions of features.
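Since the behavior labels are produced clip by clip (30 frames per clip), a common preprocessing step is to expand them back to per-frame labels. The sketch below assumes a hypothetical table with clip_index and label columns; consult WEDAR_readme.csv for the real schema.

```python
# Hedged sketch: expand clip-level behavior labels to per-frame labels.
# File layout and column names ('clip_index', 'label') are hypothetical.
import pandas as pd

FRAMES_PER_CLIP = 30
clips = pd.DataFrame({"clip_index": [0, 1, 2],
                      "label": ["neutral", "regulator_1", "neutral"]})

frames = clips.loc[clips.index.repeat(FRAMES_PER_CLIP)].reset_index(drop=True)
frames["frame_index"] = range(len(frames))
print(frames.head())
```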
The dataset has been uploaded in two forms: 1) raw data, in the form in which it was collected, and 2) preprocessed data, from which we extracted useful features for further learning analytics based on real-time and post-hoc data.
Reference
Ekman, P., & Friesen, W. V. (1969). The repertoire of nonverbal behavior: Categories, origins, usage, and coding. Semiotica, 1(1), 49-98.
https://creativecommons.org/publicdomain/zero/1.0/
By ai2_arc (From Huggingface) [source]
The ai2_arc dataset, also known as the A Challenge Dataset for Advanced Question-Answering in Grade-School Level Science, is a comprehensive and valuable resource created to facilitate research in advanced question-answering. This dataset consists of a collection of 7,787 genuine grade-school level science questions presented in multiple-choice format.
The primary objective behind assembling this dataset was to provide researchers with a powerful tool to explore and develop question-answering models capable of tackling complex scientific inquiries typically encountered at a grade-school level. The questions within this dataset are carefully crafted to test the knowledge and understanding of various scientific concepts in an engaging manner.
The ai2_arc dataset is further divided into two primary sets: the Challenge Set and the Easy Set. Each set contains numerous highly curated science questions that cover a wide range of topics commonly taught at a grade-school level. These questions are designed specifically for advanced question-answering research purposes, offering an opportunity for model evaluation, comparison, and improvement.
In terms of data structure, the ai2_arc dataset features several columns providing vital information about each question. These include columns such as question, which contains the text of the actual question being asked; choices, which presents the multiple-choice options available for each question; and answerKey, which indicates the correct answer corresponding to each specific question.
Researchers can utilize this comprehensive dataset not only for developing advanced algorithms but also for training machine learning models that exhibit sophisticated cognitive capabilities when it comes to comprehending scientific queries from a grade-school perspective. Moreover, by leveraging these meticulously curated questions, researchers can analyze performance metrics such as accuracy or examine biases within their models' decision-making processes.
In conclusion, the ai2_arc dataset serves as an invaluable resource for anyone involved in advanced question-answering research within grade-school level science education. With its extensive collection of genuine multiple-choice science questions spanning various difficulty levels, researchers can delve into the intricate nuances of scientific knowledge acquisition, processing, and reasoning, ultimately unlocking novel insights and innovations in the field.
- Developing advanced question-answering models: The ai2_arc dataset provides a valuable resource for training and evaluating advanced question-answering models. Researchers can use this dataset to develop and test algorithms that can accurately answer grade-school level science questions.
- Evaluating natural language processing (NLP) models: NLP models that aim to understand and generate human-like responses can be evaluated using this dataset. The multiple-choice format of the questions allows for objective evaluation of the model's ability to comprehend and provide correct answers.
- Assessing human-level performance: The dataset can be used as a benchmark to measure the performance of human participants in answering grade-school level science questions. By comparing the accuracy of humans with that of AI systems, researchers can gain insights into the strengths and weaknesses of both approaches
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: ARC-Challenge_test.csv

| Column name | Description |
|:------------|:------------|
| question | The text content of each question being asked. (Text) |
| choices | A list of multiple-choice options associated with each question. (List of Text) |
| answerKey | The correct answer option (choice) for a particular question. (Text) |
File: ARC-Easy_test.csv | Column name | Description ...
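Working from the column layout above, the sketch below reads ARC-Challenge_test.csv and scores a trivial "always pick the first option" baseline. How the choices column is serialized (assumed here to be a Python/JSON-style list literal) and whether answerKey uses letters or digits are assumptions.

```python
# Sketch: parse the choices column and score a first-option baseline.
import ast
import pandas as pd

df = pd.read_csv("ARC-Challenge_test.csv")

def parse_choices(raw):
    try:
        return ast.literal_eval(raw) if isinstance(raw, str) else raw
    except (ValueError, SyntaxError):
        return [raw]

df["choices"] = df["choices"].apply(parse_choices)
# Assume answerKey uses letters A-D (or occasionally digits 1-4).
first_option_acc = df["answerKey"].astype(str).str.upper().isin(["A", "1"]).mean()
print("Accuracy of always guessing the first option:", first_option_acc)
```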
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A Malayalam question answering dataset of 5,000 training samples and 5,000 testing samples was generated by translating the Facebook bAbI tasks. Facebook's bAbI tasks were originally created in English and have been translated into several languages, including French, German, Hindi, Chinese, and Russian. The original dataset includes twenty fictitious tasks that test a system's capacity to respond to a range of themes, including text comprehension and reasoning. Five task-oriented usability questions with comparable sentence patterns are also included in the collection, and the questions range in difficulty. Every task has 1,000 training samples and 1,000 test samples in the dataset. We created the dataset for the proposed work by using the bAbI dataset and translating the English data into Malayalam for five tasks (original tasks 1, 4, 11, 12, and 13), represented here as tasks 1 through 5. The tasks carry titles such as "Single Supporting Facts," "Two Argument Relations," "Basic Coreference," "Conjunction," and "Compound Coreference." Every sample in the dataset comprises a series of statements (sometimes called stories) about people's movements around objects, a question, and a suitable answer.

Tasks:
* Task 1: Single supporting fact. This task tests whether a model can identify a single important fact from a story to answer a question. The story usually contains several sentences, but only one sentence is directly useful in answering the question.
* Task 2: Relationships with two arguments. This task involves understanding the relationship between two entities. The model must infer relationships between pairs of objects, people, or places.
* Task 3: Basic coreference. Coreference resolution is the task of linking pronouns or phrases to the correct entities. In this task, the model must resolve simple pronominal references.
* Task 4: Conjunctions. This task tests the model's ability to understand sentences in which several actions or facts are joined by conjunctions such as "and" or "or". The model must process these linked statements to answer the questions correctly.
* Task 5: Compound coreference. This task is more complex because it requires the model to resolve references involving compound entities or more complex structures.
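If the Malayalam release keeps the plain-text layout of the original bAbI files (numbered story lines; question lines carrying a tab-separated answer and supporting-fact ids), a parser might look like the sketch below. That layout, and the example file name, are assumptions.

```python
# Hedged sketch: parse a bAbI-style task file into (story, question, answer) samples.
def parse_babi(path):
    stories, story = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            idx, _, text = line.strip().partition(" ")
            if idx == "1":
                story = []          # numbering restarts at 1 for each new story
            if "\t" in text:
                question, answer, supporting = text.split("\t")
                stories.append({"story": list(story),
                                "question": question.strip(),
                                "answer": answer.strip(),
                                "supporting": supporting.split()})
            else:
                story.append(text)
    return stories

# Hypothetical file name, following the English bAbI naming convention:
# samples = parse_babi("qa1_single-supporting-fact_train.txt")
```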
Dataset Card for "sat-reading"
This dataset contains the passages and questions from the Reading part of ten publicly available SAT Practice Tests. For more information see the blog post Language Models vs. The SAT Reading Test. For each question, the reading passage from the section it is contained in is prefixed. Then, the question is prompted with Question #:, followed by the four possible answers. Each entry ends with Answer:. Questions which reference a diagram, chart, table… See the full description on the dataset page: https://huggingface.co/datasets/emozilla/sat-reading.
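A sketch of assembling a prompt in the stated format (passage first, then "Question N:", the options, and a trailing "Answer:"); the option lettering and exact spacing are assumptions about the released entries.

```python
# Sketch: build a SAT-reading style prompt in the described layout.
def build_sat_prompt(passage: str, number: int, question: str, options: list[str]) -> str:
    letters = ["A", "B", "C", "D"]
    lines = [passage.strip(), "", f"Question {number}: {question.strip()}"]
    lines += [f"{letter}) {opt}" for letter, opt in zip(letters, options)]
    lines.append("Answer:")
    return "\n".join(lines)

print(build_sat_prompt("Reading passage text ...", 1,
                       "What is the main purpose of the passage?",
                       ["To inform", "To persuade", "To entertain", "To critique"]))
```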
The ReAding Comprehension dataset from Examinations (RACE) dataset is a machine reading comprehension dataset consisting of 27,933 passages and 97,867 questions from English exams, targeting Chinese students aged 12-18. RACE consists of two subsets, RACE-M and RACE-H, from middle school and high school exams, respectively. RACE-M has 28,293 questions and RACE-H has 69,574. Each question is associated with 4 candidate answers, one of which is correct. The data generation process of RACE differs from most machine reading comprehension datasets - instead of generating questions and answers by heuristics or crowd-sourcing, questions in RACE are specifically designed for testing human reading skills, and are created by domain experts.
https://www.usa.gov/government-works
Note: Reporting of new COVID-19 Case Surveillance data will be discontinued July 1, 2024, to align with the process of removing SARS-CoV-2 infections (COVID-19 cases) from the list of nationally notifiable diseases. Although these data will continue to be publicly available, the dataset will no longer be updated.
Authorizations to collect certain public health data expired at the end of the U.S. public health emergency declaration on May 11, 2023. The following jurisdictions discontinued COVID-19 case notifications to CDC: Iowa (11/8/21), Kansas (5/12/23), Kentucky (1/1/24), Louisiana (10/31/23), New Hampshire (5/23/23), and Oklahoma (5/2/23). Please note that these jurisdictions will not routinely send new case data after the dates indicated. As of 7/13/23, case notifications from Oregon will only include pediatric cases resulting in death.
This case surveillance public use dataset has 12 elements for all COVID-19 cases shared with CDC and includes demographics, any exposure history, disease severity indicators and outcomes, and presence of any underlying medical conditions and risk behaviors; it contains no geographic data.
The COVID-19 case surveillance database includes individual-level data reported to U.S. states and autonomous reporting entities, including New York City and the District of Columbia (D.C.), as well as U.S. territories and affiliates. On April 5, 2020, COVID-19 was added to the Nationally Notifiable Condition List and classified as “immediately notifiable, urgent (within 24 hours)” by a Council of State and Territorial Epidemiologists (CSTE) Interim Position Statement (Interim-20-ID-01). CSTE updated the position statement on August 5, 2020, to clarify the interpretation of antigen detection tests and serologic test results within the case classification (Interim-20-ID-02). The statement also recommended that all states and territories enact laws to make COVID-19 reportable in their jurisdiction, and that jurisdictions conducting surveillance should submit case notifications to CDC. COVID-19 case surveillance data are collected by jurisdictions and reported voluntarily to CDC.
For more information:
NNDSS Supports the COVID-19 Response | CDC.
The deidentified data in the “COVID-19 Case Surveillance Public Use Data” include demographic characteristics, any exposure history, disease severity indicators and outcomes, clinical data, laboratory diagnostic test results, and presence of any underlying medical conditions and risk behaviors. All data elements can be found on the COVID-19 case report form located at www.cdc.gov/coronavirus/2019-ncov/downloads/pui-form.pdf.
COVID-19 case reports have been routinely submitted using nationally standardized case reporting forms. On April 5, 2020, CSTE released an Interim Position Statement with national surveillance case definitions for COVID-19 included. Current versions of these case definitions are available here: https://ndc.services.cdc.gov/case-definitions/coronavirus-disease-2019-2021/.
All cases reported on or after were requested to be shared by public health departments to CDC using the standardized case definitions for laboratory-confirmed or probable cases. On May 5, 2020, the standardized case reporting form was revised. Case reporting using this new form is ongoing among U.S. states and territories.
To learn more about the limitations in using case surveillance data, visit FAQ: COVID-19 Data and Surveillance.
CDC’s Case Surveillance Section routinely performs data quality assurance procedures (i.e., ongoing corrections and logic checks to address data errors). To date, the following data cleaning steps have been implemented:
To prevent release of data that could be used to identify people, data cells are suppressed for low frequency (<5) records and indirect identifiers (e.g., date of first positive specimen). Suppression includes rare combinations of demographic characteristics (sex, age group, race/ethnicity). Suppressed values are re-coded to the NA answer option; records with data suppression are never removed.
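Purely as an illustration of that suppression rule (not the CDC's actual pipeline), a sketch with assumed column names:

```python
# Sketch: re-code rare demographic combinations (< 5 records) to NA; never drop rows.
import pandas as pd

def suppress_rare_combinations(df, cols=("sex", "age_group", "race_ethnicity"), threshold=5):
    cols = list(cols)
    counts = df.groupby(cols, dropna=False)[cols[0]].transform("size")
    out = df.copy()
    out.loc[counts < threshold, cols] = pd.NA   # re-code the identifying fields
    return out
```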
For questions, please contact Ask SRRG (eocevent394@cdc.gov).
COVID-19 data are available to the public as summary or aggregate count files, including total counts of cases and deaths by state and by county. These
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Questions, answers, and documents are stored in the dataset. Every question has an answer, and the answer comes from a page of Rijksportaal Personnel (the central government intranet). With this dataset a question-and-answer model can be trained; the computer thus learns to answer questions in the context of P-Direkt. A total of 322 questions were used that were once asked by e-mail to the contact center of P-Direkt. The questions are very general and never ask about personal circumstances. The aim of the dataset was to test whether question-and-answer models could be used in a P-Direkt environment. The structure of the dataset corresponds to the SQuAD 2.0 dataset.

Example:

Question: Is it true that my SCV hours of 2020 expire if I don't take them?

Answer: You can save your IKB hours in your IKB savings leave. IKB hours that you have not taken as leave and have not paid out will be added to your IKB savings leave at the end of December. Your IKB savings leave cannot expire.

Source*: You can save your IKB hours in your IKB savings leave. IKB hours that you have not taken as leave and have not paid out will be added to your IKB savings leave at the end of December. Your IKB savings leave cannot expire. You cannot have your IKB savings leave paid out. Payment is only made in the event of termination of employment or death. You can save up to 1800 hours. Do you work part-time or more than an average of 36 hours per week? In that case, the maximum number of hours to be saved is calculated proportionally and rounded down to whole hours. Any remaining holiday hours from 2015 and extra-statutory holiday hours that you had left over from 2016 up to and including 2019 were converted into IKB hours on 1 January 2020 and added to your IKB savings leave.

* Please note: the source is a snapshot of Rijksportaal Personnel from April 2021. Go to Rijksportaal Personnel on the intranet for up-to-date information about personnel matters.
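Since the description says the structure follows SQuAD 2.0, one entry would look roughly like the sketch below, built from the translated example above; the id and title values are placeholders.

```python
# Sketch: one SQuAD 2.0-style entry for the example question above.
import json

context = ("You can save your IKB hours in your IKB savings leave. IKB hours that you "
           "have not taken as leave and have not paid out will be added to your IKB "
           "savings leave at the end of December. Your IKB savings leave cannot expire. "
           "You cannot have your IKB savings leave paid out. ...")
answer = ("You can save your IKB hours in your IKB savings leave. IKB hours that you "
          "have not taken as leave and have not paid out will be added to your IKB "
          "savings leave at the end of December. Your IKB savings leave cannot expire")

entry = {
    "data": [{
        "title": "Rijksportaal Personnel (placeholder)",
        "paragraphs": [{
            "context": context,
            "qas": [{
                "id": "pdirekt-0001",   # placeholder id
                "question": "Is it true that my SCV hours of 2020 expire if I don't take them?",
                "answers": [{"text": answer, "answer_start": context.find(answer)}],
                "is_impossible": False,
            }],
        }],
    }]
}
print(json.dumps(entry, indent=2)[:400])
```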
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
AQUA-RAT MCQA Dataset
This dataset contains the AQUA-RAT dataset converted to Multiple Choice Question Answering (MCQA) format with modifications.
Dataset Description
AQUA-RAT is a dataset of algebraic word problems with rationales. This version has been processed to:
* Remove all questions where the correct answer was option "E" (the 5th choice)
* Remove the "E" option from all remaining questions (leaving 4 choices: A, B, C, D)
* Merge validation and test splits into a single test split… See the full description on the dataset page: https://huggingface.co/datasets/RikoteMaster/aqua-rat-mcqa.
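A hedged sketch of those three steps applied to the upstream AQUA-RAT dataset on the Hugging Face Hub; the hub id deepmind/aqua_rat and the correct/options field names are assumptions about the source data, and the actual processing script for this release may differ.

```python
# Sketch: drop "E"-answer questions, trim the "E" option, merge validation + test.
from datasets import load_dataset, concatenate_datasets

raw = load_dataset("deepmind/aqua_rat")   # assumed hub id for upstream AQUA-RAT

def drop_e(split):
    split = split.filter(lambda ex: ex["correct"] != "E")        # step 1
    return split.map(lambda ex: {"options": ex["options"][:4]})  # step 2

train = drop_e(raw["train"])
test = concatenate_datasets([drop_e(raw["validation"]), drop_e(raw["test"])])  # step 3
print(len(train), len(test))
```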
MMXU benchmark
MMXU (Multimodal and MultiX-ray Understanding) is a novel dataset for MedVQA that focuses on identifying changes in specific regions between two patient visits. Unlike previous datasets that primarily address single-image questions, MMXU enables multi-image questions, incorporating both current and historical patient data.
News
(05/16/2025) Our paper has been accepted by ACL 2025 Findings!
(02/22/2025) The MMXU-test benchmark has been released… See the full description on the dataset page: https://huggingface.co/datasets/LinjieMu/MMXU.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Context-Based Question Generation Dataset
This dataset is designed for context-based question generation, where questions of different types (true/false, multiple-choice, open-ended) are generated based on a given context. The dataset is synthetically created using ChatGPT, providing a diverse set of questions to test comprehension and reasoning skills.
Dataset Structure
The dataset is structured with the following fields for each example:
context: The context provided… See the full description on the dataset page: https://huggingface.co/datasets/mito0o852/ContextToQuestions.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset provides essential information for entries related to question answering tasks using AI models. It is designed to offer valuable insights for researchers and practitioners, enabling them to effectively train and rigorously evaluate their machine learning models. The dataset serves as a valuable resource for building and assessing question-answering systems. It is available free of charge.
The data files are typically in CSV format, with a dedicated train.csv file for training data and a test.csv file for testing purposes. The training file contains a large number of examples. Specific dates are not included in this dataset description, which focuses solely on the content and purpose of the data. Specific numbers of rows or records are not detailed in the available information.
This dataset is ideal for a variety of applications and use cases:
* Training and Testing: Utilise train.csv to train question-answering models or algorithms, and test.csv to evaluate their performance on unseen questions.
* Machine Learning Model Creation: Develop machine learning models specifically for question-answering by leveraging the instructional components, including instructions, responses, next responses, and human-generated answers, along with their is_human_response labels.
* Model Performance Evaluation: Assess model performance by comparing predicted responses with actual human-generated answers from the test.csv file (see the sketch after this list).
* Data Augmentation: Expand existing data by paraphrasing instructions or generating alternative responses within similar contexts.
* Conversational Agents: Build conversational agents or chatbots by utilising the instruction-response pairs for training.
* Language Understanding: Train models to understand language and generate responses based on instructions and previous responses.
* Educational Materials: Develop interactive quizzes or study guides, with models providing instant feedback to students.
* Information Retrieval Systems: Create systems that help users find specific answers from large datasets.
* Customer Support: Train customer support chatbots to provide quick and accurate responses to inquiries.
* Language Generation Research: Develop novel algorithms for generating coherent responses in question-answering scenarios.
* Automatic Summarisation Systems: Train systems to generate concise summaries by understanding main content through question answering.
* Dialogue Systems Evaluation: Use the instruction-response pairs as a benchmark for evaluating dialogue system performance.
* NLP Algorithm Benchmarking: Establish baselines against which other NLP tools and methods can be measured.
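For the "Model Performance Evaluation" item, a minimal sketch comparing placeholder predictions against the human-written answers in test.csv; every column name except is_human_response is an assumption about the schema.

```python
# Sketch: exact-match comparison of predicted responses vs. human-generated answers.
import pandas as pd

test = pd.read_csv("test.csv")
human = test[test["is_human_response"] == True]   # keep human-written answers

def exact_match(pred: str, gold: str) -> bool:
    return pred.strip().lower() == gold.strip().lower()

# Placeholder "model": echo the instruction back; replace with real predictions.
preds = human["instruction"].astype(str)
score = sum(exact_match(p, g) for p, g in zip(preds, human["response"].astype(str)))
print("exact-match accuracy:", score / max(len(human), 1))
```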
The dataset's geographic scope is global. There is no specific time range or demographic scope noted within the available details, as specific dates are not included.
CC0
This dataset is highly suitable for:
* Researchers and Practitioners: To gain insights into question answering tasks using AI models.
* Developers: To train models, create chatbots, and build conversational agents.
* Students: For developing educational materials and enhancing their learning experience through interactive tools.
* Individuals and teams working on Natural Language Processing (NLP) projects.
* Those creating information retrieval systems or customer support solutions.
* Experts in natural language generation (NLG) and automatic summarisation systems.
* Anyone involved in the evaluation of dialogue systems and machine learning model training.
Original Data Source: Question-Answering Training and Testing Data
The Natural Questions corpus is a question answering dataset containing 307,373 training examples, 7,830 development examples, and 7,842 test examples. Each example is comprised of a google.com query and a corresponding Wikipedia page. Each Wikipedia page has a passage (or long answer) annotated on the page that answers the question, and one or more short spans from the annotated passage containing the actual answer. The long and short answer annotations can, however, be empty. If they are both empty, then there is no answer on the page at all. If the long answer annotation is non-empty but the short answer annotation is empty, then the annotated passage answers the question but no explicit short answer could be found. Finally, 1% of the documents have a passage annotated with a short answer that is "yes" or "no", instead of a list of short spans.
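A small sketch that buckets a single annotation into those cases; the field names follow the simplified Natural Questions release and are an assumption here.

```python
# Sketch: classify an NQ annotation as no answer, long only, long + short, or yes/no.
def answer_type(annotation: dict) -> str:
    has_long = annotation.get("long_answer", {}).get("start_token", -1) != -1
    has_short = bool(annotation.get("short_answers"))
    yes_no = annotation.get("yes_no_answer", "NONE")
    if yes_no in ("YES", "NO"):
        return "yes/no answer"
    if has_long and has_short:
        return "long answer with short span(s)"
    if has_long:
        return "long answer only"
    return "no answer on the page"

print(answer_type({"long_answer": {"start_token": 10, "end_token": 50},
                   "short_answers": [], "yes_no_answer": "NONE"}))
```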
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is an open-source, publicly available dataset which can be found at https://shahariarrabby.github.io/ekush/. We split the dataset into three sets: train, validation, and test. For our experiments, we created two other versions of the dataset. We applied 10-fold cross-validation on the train set and created ten folds. We also created ten bags of datasets using the bootstrap aggregating method on the train and validation sets. Lastly, we created another dataset using a pre-trained ResNet50 model as a feature extractor. On the features extracted by ResNet50 we applied PCA and created a tabular dataset containing 80 features. pca_features.csv is the train set and pca_test_features.csv is the test set. Fold.tar.gz contains the ten folds of images described above; those folds have also been compressed. Similarly, Bagging.tar.gz contains the ten compressed bags of images. The original train, validation, and test sets are in Train.tar.gz, Validation.tar.gz, and Test.tar.gz, respectively. The compression was performed to speed up uploads and downloads and mostly for the sake of convenience. If anyone has any questions about how the datasets are organized, please feel free to ask me at shiblygnr@gmail.com. I will get back to you as soon as possible.
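A quick-start sketch for the PCA feature tables, assuming a label column named label (the real header may differ) and scikit-learn available:

```python
# Sketch: train a simple classifier on the 80 PCA features described above.
import pandas as pd
from sklearn.linear_model import LogisticRegression

train = pd.read_csv("pca_features.csv")
test = pd.read_csv("pca_test_features.csv")

label_col = "label"                      # assumed label column name
X_train, y_train = train.drop(columns=[label_col]), train[label_col]
X_test, y_test = test.drop(columns=[label_col]), test[label_col]

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```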