The Advice-Seeking Questions (ASQ) dataset is a collection of personal narratives paired with advice-seeking questions. The dataset has been split into train, test, and heldout sets containing 8,865, 2,500, and 10,000 instances, respectively. It is used to train and evaluate methods that infer the advice-seeking goal behind a personal narrative. The task is formulated as a cloze test: identify which of two advice-seeking questions was removed from a given narrative.
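As a rough illustration of that cloze formulation, the sketch below scores both candidate questions against a narrative and picks the better fit. The field names (narrative, question_a, question_b, label) and the word-overlap scorer are assumptions for illustration, not the official evaluation code.

```python
# Minimal sketch of the two-candidate cloze evaluation described above.
# Field names and the scorer are assumptions; the actual ASQ release may differ.
from typing import Callable, Dict, List

def cloze_accuracy(examples: List[Dict], score: Callable[[str, str], float]) -> float:
    """score(narrative, question) returns a compatibility score; higher is better."""
    correct = 0
    for ex in examples:
        s_a = score(ex["narrative"], ex["question_a"])
        s_b = score(ex["narrative"], ex["question_b"])
        predicted = "a" if s_a >= s_b else "b"
        correct += int(predicted == ex["label"])
    return correct / len(examples)

# Trivial word-overlap scorer, used here only to make the sketch runnable.
def overlap_score(narrative: str, question: str) -> float:
    n_words, q_words = set(narrative.lower().split()), set(question.lower().split())
    return len(n_words & q_words) / max(len(q_words), 1)
```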
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset, QASPER: NLP Questions and Evidence, is an exceptional collection of over 5,000 questions and answers focused on Natural Language Processing (NLP) papers. It has been crowdsourced from experienced NLP practitioners, with each question meticulously crafted based solely on the titles and abstracts of the respective papers. The answers provided are expertly enriched with evidence taken directly from the full text of each paper. QASPER features structured fields including 'qas' for questions and answers, 'evidence' for supporting information, paper titles, abstracts, figures and tables, and full text. This makes it a valuable resource for researchers aiming to understand how practitioners interpret NLP topics and to validate solutions for problems found in existing literature. The dataset contains 5,049 questions spanning 1,585 distinct papers.
The QASPER dataset comprises 5,049 questions across 1,585 papers. It is distributed across five files: four in .csv format plus one .json file for figures and tables. These comprise two test datasets (test.csv and validation.csv), two train datasets (train-v2-0_lessons_only_.csv and trainv2-0_unsplit.csv), and a figures dataset (figures_and_tables_.json). Each CSV file contains a distinct dataset with columns dedicated to titles, abstracts, full texts, and Q&A fields, along with evidence for each paper mentioned in the respective rows.
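A minimal loading sketch for the distribution described above, assuming the five files sit in the working directory; the exact column headers inside each CSV may differ from the description.

```python
# Sketch: load the four CSV splits and the figures/tables JSON named above.
import json
import pandas as pd

splits = {
    "test": pd.read_csv("test.csv"),
    "validation": pd.read_csv("validation.csv"),
    "train_lessons_only": pd.read_csv("train-v2-0_lessons_only_.csv"),
    "train_unsplit": pd.read_csv("trainv2-0_unsplit.csv"),
}

with open("figures_and_tables_.json") as f:
    figures_and_tables = json.load(f)

for name, df in splits.items():
    print(name, df.shape, list(df.columns)[:6])
```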
This dataset is ideal for various applications, including:
* Developing AI models to automatically generate questions and answers from paper titles and abstracts.
* Enhancing machine learning algorithms by combining answers with evidence to discover relationships between papers.
* Creating online forums for NLP practitioners, using dataset questions to spark discussion within the community.
* Conducting basic descriptive statistics or advanced predictive analytics, such as logistic regression or naive Bayes models.
* Summarising basic crosstabs between any two variables, like titles and abstracts.
* Correlating title lengths with the number of words in their corresponding abstracts to identify patterns (see the sketch after this list).
* Utilising text mining technologies like topic modelling, machine learning techniques, or automated processes to summarise underlying patterns.
* Filtering terms relevant to specific research hypotheses and processing them via web crawlers, search engines, or document similarity algorithms.
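For instance, the title-length/abstract-length correlation mentioned above could be computed roughly as follows, assuming the training CSV exposes title and abstract columns (an assumption about the schema):

```python
# Sketch: correlate title length with abstract word count in one training file.
import pandas as pd

train = pd.read_csv("trainv2-0_unsplit.csv")
title_len = train["title"].fillna("").astype(str).str.split().str.len()
abstract_len = train["abstract"].fillna("").astype(str).str.split().str.len()
print("Pearson r between title length and abstract length:",
      title_len.corr(abstract_len))
```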
The dataset has a GLOBAL region scope. It focuses on papers within the field of Natural Language Processing. The questions and answers are crowdsourced from experienced NLP practitioners. The dataset was listed on 22/06/2025.
CC0
This dataset is highly suitable for:
* Researchers seeking insights into how NLP practitioners interpret complex topics.
* Those requiring effective validation for developing clear-cut solutions to problems encountered in existing NLP literature.
* NLP practitioners looking for a resource to stimulate discussions within their community.
* Data scientists and analysts interested in exploring NLP datasets through descriptive statistics or advanced predictive analytics.
* Developers and researchers working with text mining, machine learning techniques, or automated text processing.
Original Data Source: QASPER: NLP Questions and Evidence
Round 8 Test Dataset

This is the training data used to construct and evaluate trojan detection software solutions. This data, generated at NIST, consists of natural language processing (NLP) AIs trained to perform extractive question answering. A known percentage of these trained AI models have been poisoned with a known trigger which induces incorrect behavior. This data will be used to develop software solutions for detecting which trained AI models have been poisoned via embedded triggers. This dataset consists of 360 QA AI models using a small set of model architectures. Half (50%) of the models have been poisoned with an embedded trigger which causes misclassification of the input when the trigger is present.
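A hedged sketch of how detector outputs might be scored on a benchmark like this one, given one predicted probability of poisoning per model and the known 50/50 ground truth; the actual NIST evaluation protocol is not described here, so cross-entropy and ROC-AUC are merely illustrative choices.

```python
# Sketch: evaluate a trojan detector's per-model probabilities against ground truth.
import numpy as np
from sklearn.metrics import log_loss, roc_auc_score

rng = np.random.default_rng(0)
y_true = np.array([0, 1] * 180)                     # 360 models, half poisoned
y_prob = np.clip(rng.random(360), 1e-6, 1 - 1e-6)   # placeholder detector outputs

print("cross-entropy:", log_loss(y_true, y_prob))
print("ROC-AUC:", roc_auc_score(y_true, y_prob))
```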
20 Questions Dataset
Dataset Overview
The 20 Questions Dataset was used in the EYE-Llama paper (https://www.biorxiv.org/content/10.1101/2024.04.26.591355v1) as a test set for evaluating ophthalmic language models. This dataset contains a collection of questions and answers specifically tailored to the ophthalmic domain. It serves as a valuable resource for assessing the performance of models in answering domain-specific queries.
License
This dataset is… See the full description on the dataset page: https://huggingface.co/datasets/QIAIUNCC/EYE-TEST-2.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
CinePile: A Long Video Question Answering Dataset and Benchmark
CinePile is a question-answering-based, long-form video understanding dataset. It was created using advanced large language models (LLMs) in a human-in-the-loop pipeline that leverages existing human-generated raw data. It consists of approximately 300,000 training data points and 5,000 test data points. If you have any comments or questions, reach out to Ruchit Rawal or Gowthami Somepalli. Other links - Website… See the full description on the dataset page: https://huggingface.co/datasets/tomg-group-umd/cinepile.
What does PISA actually assess? This book presents all the publicly available questions from the PISA surveys. Some of these questions were used in the PISA 2000, 2003 and 2006 surveys, while others were used in developing and trying out the assessment. After a brief introduction to the PISA assessment, the book presents three chapters containing PISA questions for the reading, mathematics and science tests, respectively. Each chapter opens with an overview of what exactly the questions assess. The second section of each chapter presents questions which were used in the PISA 2000, 2003 and 2006 surveys, that is, the actual PISA tests for which results were published. The third section presents questions used in trying out the assessment. Although these questions were not used in the PISA 2000, 2003 and 2006 surveys, they are nevertheless illustrative of the kinds of questions PISA uses. The final section shows all the answers, along with brief comments on each question.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset provides a collection of German question-answer pairs along with their corresponding context [1]. It is designed to enhance and facilitate natural language processing (NLP) tasks in the German language [1]. The dataset includes two main files, train.csv and test.csv, each containing numerous entries of various contexts with associated questions and answers in German [1]. The contextual information can range from paragraphs to concise sentences, offering a well-rounded representation of different scenarios [1]. It serves as a valuable resource for training machine learning models to improve question-answering systems or other NLP applications specific to the German language [1].
The dataset consists of the following columns [1, 2]:
* id: An identifier for each entry [2].
* context: The context in which the question is being asked. It can be a paragraph, a sentence, or any other relevant information [1].
* question: The question related to the given context [2].
* answers: The answer or answers to the given question within the corresponding context [1]. The answers can be single or multiple [1].
* Label Count: Numerical ranges with corresponding counts [2].
The dataset is provided in CSV format [1, 3], comprising two main files: train.csv and test.csv [1]. Both files contain a significant number of question-answer pairs and their respective contexts [1]. While specific total row or record counts are not explicitly stated, the source material indicates substantial amounts of data [1]. For instance, certain label counts range from 36,419.00 to 45,662.00, with varying numbers of entries within those ranges, such as 529, 508, or 29 unique values for specific segments [2].
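A quick inspection sketch under the column description above; file locations and exact header names are assumptions.

```python
# Sketch: load both CSV files and peek at the documented columns.
import pandas as pd

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

print(train[["id", "context", "question", "answers"]].head())
print("train/test rows:", len(train), len(test))
```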
This dataset is ideal for a variety of applications and use cases, including [1]:
* Building question-answering systems in German.
* Training models for German language understanding and translation tasks.
* Developing information retrieval systems that can process German user queries and return relevant information from provided contexts.
* Enhancing NLP models for accuracy and robustness in German.
* Exploring state-of-the-art methodologies or developing novel approaches for natural language understanding in German [1].
The dataset's linguistic scope is specifically the German language [1]. Geographically, it is intended for global use [4]. There are no specific notes on time range or demographic availability within the provided sources.
CC0
The dataset is intended for [1]:
* Researchers working on advancements in machine learning techniques applied to natural language understanding in German.
* Developers building and refining NLP applications for the German language.
* Enthusiasts exploring and implementing machine learning models for language processing.
Original Data Source: German Question-Answer Context Dataset
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Diverse learning theories have been constructed to understand learners' internal states through various tangible predictors. We focus on self-regulatory actions that are subconscious and habitual actions triggered by behavior agents' 'awareness' of their attention loss. We hypothesize that self-regulatory behaviors (i.e., attention regulation behaviors) also occur in e-reading as 'regulators' as found in other behavior models (Ekman, P., & Friesen, W. V., 1969). In this work, we try to define the types and frequencies of attention regulation behaviors in e-reading. We collected various cues that reflect learners' moment-to-moment and page-to-page cognitive states to understand the learners' attention in e-reading.
The text 'How to make the most of your day at Disneyland Resort Paris' was implemented on a screen-based e-reader, which we developed in a PDF-reader format. An informative, entertaining text was adopted to capture learners' attentional shifts during knowledge acquisition. The text has 2,685 words, distributed over ten pages, with one subtopic on each page. A built-in webcam on a Mac Pro and a mouse were used for data collection, aiming for real-world implementation with only essential computational devices. A height-adjustable laptop stand was used to compensate for participants' eye levels.
Thirty learners in higher education were invited to a screen-based e-reading task (M = 16.2, SD = 5.2 minutes). A pre-test questionnaire with ten multiple-choice questions was given before the reading to check their prior knowledge of the topic; there was no specific time limit for the questionnaire. We collected cues that reflect learners' moment-to-moment and page-to-page cognitive states to understand learners' attention in e-reading. Learners were asked to report their distractions on two levels during the reading: 1) in-text distraction (e.g., still reading the text with low attentiveness) or 2) out-of-text distraction (e.g., thinking of something else while no longer reading the text). We implemented two noticeably designed buttons on the right-hand side of the screen interface to minimize possible distraction from the reporting task.

After a new page was opened, blur stimuli were applied to the text within a random range of 20 seconds, which ensures that the blur stimuli occur at least once on each page. Participants were asked to click the de-blur button on the text area of the screen to proceed with the reading. The button covers the whole text area, so participants can minimize the effort of finding and clicking it. Reaction time for de-blurring was also measured, to gauge learners' arousal during the reading.

We asked participants to answer pre-test and post-test questionnaires about the reading material. Participants were given ten multiple-choice questions before the session, and the same set of questions was given after the reading session (i.e., formative questions), with added subtopic summarization questions (i.e., summative questions). This provides insight into the quantitative and qualitative knowledge gained through the session and into different learning outcomes based on individual differences.

A video dataset of 931,440 frames was annotated with the attention regulator behaviors using an annotation tool that plays the long sequence clip by clip, where each clip contains 30 frames. Two annotators (doctoral students) carried out two stages of labeling. In the first stage, the annotators were trained on the labeling criteria and annotated the attention regulator behaviors separately based on their own judgments. The labels were summarized and cross-checked in the second round to resolve inconsistent cases, resulting in five attention regulation behaviors and one neutral state. See WEDAR_readme.csv for detailed descriptions of features.
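Since the behavior labels are produced clip by clip (30 frames per clip), a common preprocessing step is to expand them back to per-frame labels. The sketch below assumes a hypothetical table with clip_index and label columns; consult WEDAR_readme.csv for the real schema.

```python
# Hedged sketch: expand clip-level behavior labels to per-frame labels.
# File layout and column names ('clip_index', 'label') are hypothetical.
import pandas as pd

FRAMES_PER_CLIP = 30
clips = pd.DataFrame({"clip_index": [0, 1, 2],
                      "label": ["neutral", "regulator_1", "neutral"]})

frames = clips.loc[clips.index.repeat(FRAMES_PER_CLIP)].reset_index(drop=True)
frames["frame_index"] = range(len(frames))
print(frames.head())
```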
The dataset has been uploaded in two forms: 1) raw data, in the form in which it was collected, and 2) preprocessed data, from which we extracted useful features for further learning analytics based on real-time and post-hoc data.
Reference
Ekman, P., & Friesen, W. V. (1969). The repertoire of nonverbal behavior: Categories, origins, usage, and coding. Semiotica, 1(1), 49-98.
https://creativecommons.org/publicdomain/zero/1.0/
By ai2_arc (From Huggingface) [source]
The ai2_arc dataset, also known as the A Challenge Dataset for Advanced Question-Answering in Grade-School Level Science, is a comprehensive and valuable resource created to facilitate research in advanced question-answering. This dataset consists of a collection of 7,787 genuine grade-school level science questions presented in multiple-choice format.
The primary objective behind assembling this dataset was to provide researchers with a powerful tool to explore and develop question-answering models capable of tackling complex scientific inquiries typically encountered at a grade-school level. The questions within this dataset are carefully crafted to test the knowledge and understanding of various scientific concepts in an engaging manner.
The ai2_arc dataset is further divided into two primary sets: the Challenge Set and the Easy Set. Each set contains numerous highly curated science questions that cover a wide range of topics commonly taught at a grade-school level. These questions are designed specifically for advanced question-answering research purposes, offering an opportunity for model evaluation, comparison, and improvement.
In terms of data structure, the ai2_arc dataset features several columns providing vital information about each question. These include columns such as question, which contains the text of the actual question being asked; choices, which presents the multiple-choice options available for each question; and answerKey, which indicates the correct answer corresponding to each specific question.
Researchers can utilize this comprehensive dataset not only for developing advanced algorithms but also for training machine learning models that exhibit sophisticated cognitive capabilities when it comes to comprehending scientific queries from a grade-school perspective. Moreover, by leveraging these meticulously curated questions, researchers can analyze performance metrics such as accuracy or examine biases within their models' decision-making processes.
In conclusion, the ai2_arc dataset serves as an invaluable resource for anyone involved in advanced question-answering research within grade-school level science education. With its extensive collection of genuine multiple-choice science questions spanning various difficulty levels, researchers can delve into the intricate nuances of scientific knowledge acquisition, processing, and reasoning, ultimately unlocking novel insights and innovations in the field.
- Developing advanced question-answering models: The ai2_arc dataset provides a valuable resource for training and evaluating advanced question-answering models. Researchers can use this dataset to develop and test algorithms that can accurately answer grade-school level science questions.
- Evaluating natural language processing (NLP) models: NLP models that aim to understand and generate human-like responses can be evaluated using this dataset. The multiple-choice format of the questions allows for objective evaluation of the model's ability to comprehend and provide correct answers.
- Assessing human-level performance: The dataset can be used as a benchmark to measure the performance of human participants in answering grade-school level science questions. By comparing the accuracy of humans with that of AI systems, researchers can gain insights into the strengths and weaknesses of both approaches
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: ARC-Challenge_test.csv

| Column name | Description |
|:------------|:------------|
| question | The text content of each question being asked. (Text) |
| choices | A list of multiple-choice options associated with each question. (List of Text) |
| answerKey | The correct answer option (choice) for a particular question. (Text) |
File: ARC-Easy_test.csv | Column name | Description ...
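Working from the column layout above, the sketch below reads ARC-Challenge_test.csv and scores a trivial "always pick the first option" baseline. How the choices column is serialized (assumed here to be a Python/JSON-style list literal) and whether answerKey uses letters or digits are assumptions.

```python
# Sketch: parse the choices column and score a first-option baseline.
import ast
import pandas as pd

df = pd.read_csv("ARC-Challenge_test.csv")

def parse_choices(raw):
    try:
        return ast.literal_eval(raw) if isinstance(raw, str) else raw
    except (ValueError, SyntaxError):
        return [raw]

df["choices"] = df["choices"].apply(parse_choices)
# Assume answerKey uses letters A-D (or occasionally digits 1-4).
first_option_acc = df["answerKey"].astype(str).str.upper().isin(["A", "1"]).mean()
print("Accuracy of always guessing the first option:", first_option_acc)
```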
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A Malayalam question answering dataset of 5,000 training samples and 5,000 testing samples was generated by translating the Facebook bAbI tasks. Facebook's bAbI tasks were originally created in English and have been translated into several languages, including French, German, Hindi, Chinese, and Russian. The original dataset includes twenty fictitious tasks that test a system's capacity to respond to a range of themes, including text comprehension and reasoning. Five task-oriented usability questions with comparable sentence patterns are also included in the collection, and the questions range in difficulty. Every task has 1,000 training samples and 1,000 test samples in the dataset. We created the dataset for the proposed work by using the bAbI dataset and translating the English data into Malayalam for five tasks (original tasks 1, 4, 11, 12, and 13), represented here as tasks 1 through 5. The tasks carry titles such as "Single Supporting Facts," "Two Argument Relations," "Basic Coreference," "Conjunction," and "Compound Coreference." Every sample in the dataset comprises a series of statements (sometimes called stories) about people's movements around objects, a question, and a suitable answer.

Tasks:
* Task 1: Single supporting fact. This task tests whether a model can identify a single important fact from a story to answer a question. The story usually contains several sentences, but only one sentence is directly useful in answering the question.
* Task 2: Relationships with two arguments. This task involves understanding the relationship between two entities. The model must infer relationships between pairs of objects, people, or places.
* Task 3: Basic coreference. Coreference resolution is the task of linking pronouns or phrases to the correct entities. In this task, the model must resolve simple pronominal references.
* Task 4: Conjunctions. This task tests the model's ability to understand sentences in which several actions or facts are joined by conjunctions such as "and" or "or". The model must process these linked statements to answer the questions correctly.
* Task 5: Compound coreference. This task is more complex because it requires the model to resolve references involving compound entities or more complex structures.
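If the Malayalam release keeps the plain-text layout of the original bAbI files (numbered story lines; question lines carrying a tab-separated answer and supporting-fact ids), a parser might look like the sketch below. That layout, and the example file name, are assumptions.

```python
# Hedged sketch: parse a bAbI-style task file into (story, question, answer) samples.
def parse_babi(path):
    stories, story = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            idx, _, text = line.strip().partition(" ")
            if idx == "1":
                story = []          # numbering restarts at 1 for each new story
            if "\t" in text:
                question, answer, supporting = text.split("\t")
                stories.append({"story": list(story),
                                "question": question.strip(),
                                "answer": answer.strip(),
                                "supporting": supporting.split()})
            else:
                story.append(text)
    return stories

# Hypothetical file name, following the English bAbI naming convention:
# samples = parse_babi("qa1_single-supporting-fact_train.txt")
```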
Dataset Card for "sat-reading"
This dataset contains the passages and questions from the Reading part of ten publicly available SAT Practice Tests. For more information see the blog post Language Models vs. The SAT Reading Test. For each question, the reading passage from the section it is contained in is prefixed. Then, the question is prompted with Question #:, followed by the four possible answers. Each entry ends with Answer:. Questions which reference a diagram, chart, table… See the full description on the dataset page: https://huggingface.co/datasets/emozilla/sat-reading.
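A sketch of assembling a prompt in the stated format (passage first, then "Question N:", the options, and a trailing "Answer:"); the option lettering and exact spacing are assumptions about the released entries.

```python
# Sketch: build a SAT-reading style prompt in the described layout.
def build_sat_prompt(passage: str, number: int, question: str, options: list[str]) -> str:
    letters = ["A", "B", "C", "D"]
    lines = [passage.strip(), "", f"Question {number}: {question.strip()}"]
    lines += [f"{letter}) {opt}" for letter, opt in zip(letters, options)]
    lines.append("Answer:")
    return "\n".join(lines)

print(build_sat_prompt("Reading passage text ...", 1,
                       "What is the main purpose of the passage?",
                       ["To inform", "To persuade", "To entertain", "To critique"]))
```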
The ReAding Comprehension dataset from Examinations (RACE) dataset is a machine reading comprehension dataset consisting of 27,933 passages and 97,867 questions from English exams, targeting Chinese students aged 12-18. RACE consists of two subsets, RACE-M and RACE-H, from middle school and high school exams, respectively. RACE-M has 28,293 questions and RACE-H has 69,574. Each question is associated with 4 candidate answers, one of which is correct. The data generation process of RACE differs from most machine reading comprehension datasets - instead of generating questions and answers by heuristics or crowd-sourcing, questions in RACE are specifically designed for testing human reading skills, and are created by domain experts.
https://www.usa.gov/government-works
Note: Reporting of new COVID-19 Case Surveillance data will be discontinued July 1, 2024, to align with the process of removing SARS-CoV-2 infections (COVID-19 cases) from the list of nationally notifiable diseases. Although these data will continue to be publicly available, the dataset will no longer be updated.
Authorizations to collect certain public health data expired at the end of the U.S. public health emergency declaration on May 11, 2023. The following jurisdictions discontinued COVID-19 case notifications to CDC: Iowa (11/8/21), Kansas (5/12/23), Kentucky (1/1/24), Louisiana (10/31/23), New Hampshire (5/23/23), and Oklahoma (5/2/23). Please note that these jurisdictions will not routinely send new case data after the dates indicated. As of 7/13/23, case notifications from Oregon will only include pediatric cases resulting in death.
This case surveillance public use dataset has 12 elements for all COVID-19 cases shared with CDC and includes demographics, any exposure history, disease severity indicators and outcomes, and presence of any underlying medical conditions and risk behaviors; it contains no geographic data.
The COVID-19 case surveillance database includes individual-level data reported to U.S. states and autonomous reporting entities, including New York City and the District of Columbia (D.C.), as well as U.S. territories and affiliates. On April 5, 2020, COVID-19 was added to the Nationally Notifiable Condition List and classified as “immediately notifiable, urgent (within 24 hours)” by a Council of State and Territorial Epidemiologists (CSTE) Interim Position Statement (Interim-20-ID-01). CSTE updated the position statement on August 5, 2020, to clarify the interpretation of antigen detection tests and serologic test results within the case classification (Interim-20-ID-02). The statement also recommended that all states and territories enact laws to make COVID-19 reportable in their jurisdiction, and that jurisdictions conducting surveillance should submit case notifications to CDC. COVID-19 case surveillance data are collected by jurisdictions and reported voluntarily to CDC.
For more information:
NNDSS Supports the COVID-19 Response | CDC.
The deidentified data in the “COVID-19 Case Surveillance Public Use Data” include demographic characteristics, any exposure history, disease severity indicators and outcomes, clinical data, laboratory diagnostic test results, and presence of any underlying medical conditions and risk behaviors. All data elements can be found on the COVID-19 case report form located at www.cdc.gov/coronavirus/2019-ncov/downloads/pui-form.pdf.
COVID-19 case reports have been routinely submitted using nationally standardized case reporting forms. On April 5, 2020, CSTE released an Interim Position Statement with national surveillance case definitions for COVID-19 included. Current versions of these case definitions are available here: https://ndc.services.cdc.gov/case-definitions/coronavirus-disease-2019-2021/.
All cases reported on or after were requested to be shared by public health departments to CDC using the standardized case definitions for laboratory-confirmed or probable cases. On May 5, 2020, the standardized case reporting form was revised. Case reporting using this new form is ongoing among U.S. states and territories.
To learn more about the limitations in using case surveillance data, visit FAQ: COVID-19 Data and Surveillance.
CDC’s Case Surveillance Section routinely performs data quality assurance procedures (i.e., ongoing corrections and logic checks to address data errors). To date, the following data cleaning steps have been implemented:
To prevent release of data that could be used to identify people, data cells are suppressed for low frequency (<5) records and indirect identifiers (e.g., date of first positive specimen). Suppression includes rare combinations of demographic characteristics (sex, age group, race/ethnicity). Suppressed values are re-coded to the NA answer option; records with data suppression are never removed.
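Purely as an illustration of that suppression rule (not the CDC's actual pipeline), a sketch with assumed column names:

```python
# Sketch: re-code rare demographic combinations (< 5 records) to NA; never drop rows.
import pandas as pd

def suppress_rare_combinations(df, cols=("sex", "age_group", "race_ethnicity"), threshold=5):
    cols = list(cols)
    counts = df.groupby(cols, dropna=False)[cols[0]].transform("size")
    out = df.copy()
    out.loc[counts < threshold, cols] = pd.NA   # re-code the identifying fields
    return out
```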
For questions, please contact Ask SRRG (eocevent394@cdc.gov).
COVID-19 data are available to the public as summary or aggregate count files, including total counts of cases and deaths by state and by county. These
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Questions, answers, and documents are stored in the dataset. Every question has an answer, and the answer comes from a page of Rijksportaal Personnel (the central government intranet). With this dataset a question-and-answer model can be trained; the computer thus learns to answer questions in the context of P-Direkt. A total of 322 questions were used that were once asked by e-mail to the contact center of P-Direkt. The questions are very general and never ask about personal circumstances. The aim of the dataset was to test whether question-and-answer models could be used in a P-Direkt environment. The structure of the dataset corresponds to the SQuAD 2.0 dataset.

Example:

Question: Is it true that my SCV hours of 2020 expire if I don't take them?

Answer: You can save your IKB hours in your IKB savings leave. IKB hours that you have not taken as leave and have not paid out will be added to your IKB savings leave at the end of December. Your IKB savings leave cannot expire.

Source*: You can save your IKB hours in your IKB savings leave. IKB hours that you have not taken as leave and have not paid out will be added to your IKB savings leave at the end of December. Your IKB savings leave cannot expire. You cannot have your IKB savings leave paid out. Payment is only made in the event of termination of employment or death. You can save up to 1800 hours. Do you work part-time or more than an average of 36 hours per week? In that case, the maximum number of hours to be saved is calculated proportionally and rounded down to whole hours. Any remaining holiday hours from 2015 and extra-statutory holiday hours that you had left over from 2016 up to and including 2019 were converted into IKB hours on 1 January 2020 and added to your IKB savings leave.

* Please note: the source is a snapshot of Rijksportaal Personnel from April 2021. Go to Rijksportaal Personnel on the intranet for up-to-date information about personnel matters.
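Since the description says the structure follows SQuAD 2.0, one entry would look roughly like the sketch below, built from the translated example above; the id and title values are placeholders.

```python
# Sketch: one SQuAD 2.0-style entry for the example question above.
import json

context = ("You can save your IKB hours in your IKB savings leave. IKB hours that you "
           "have not taken as leave and have not paid out will be added to your IKB "
           "savings leave at the end of December. Your IKB savings leave cannot expire. "
           "You cannot have your IKB savings leave paid out. ...")
answer = ("You can save your IKB hours in your IKB savings leave. IKB hours that you "
          "have not taken as leave and have not paid out will be added to your IKB "
          "savings leave at the end of December. Your IKB savings leave cannot expire")

entry = {
    "data": [{
        "title": "Rijksportaal Personnel (placeholder)",
        "paragraphs": [{
            "context": context,
            "qas": [{
                "id": "pdirekt-0001",   # placeholder id
                "question": "Is it true that my SCV hours of 2020 expire if I don't take them?",
                "answers": [{"text": answer, "answer_start": context.find(answer)}],
                "is_impossible": False,
            }],
        }],
    }]
}
print(json.dumps(entry, indent=2)[:400])
```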
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
AQUA-RAT MCQA Dataset
This dataset contains the AQUA-RAT dataset converted to Multiple Choice Question Answering (MCQA) format with modifications.
Dataset Description
AQUA-RAT is a dataset of algebraic word problems with rationales. This version has been processed to:
* Remove all questions where the correct answer was option "E" (the 5th choice)
* Remove the "E" option from all remaining questions (leaving 4 choices: A, B, C, D)
* Merge validation and test splits into a single test split… See the full description on the dataset page: https://huggingface.co/datasets/RikoteMaster/aqua-rat-mcqa.
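A hedged sketch of those three steps applied to the upstream AQUA-RAT dataset on the Hugging Face Hub; the hub id deepmind/aqua_rat and the correct/options field names are assumptions about the source data, and the actual processing script for this release may differ.

```python
# Sketch: drop "E"-answer questions, trim the "E" option, merge validation + test.
from datasets import load_dataset, concatenate_datasets

raw = load_dataset("deepmind/aqua_rat")   # assumed hub id for upstream AQUA-RAT

def drop_e(split):
    split = split.filter(lambda ex: ex["correct"] != "E")        # step 1
    return split.map(lambda ex: {"options": ex["options"][:4]})  # step 2

train = drop_e(raw["train"])
test = concatenate_datasets([drop_e(raw["validation"]), drop_e(raw["test"])])  # step 3
print(len(train), len(test))
```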
MMXU benchmark
MMXU (Multimodal and MultiX-ray Understanding) is a novel dataset for MedVQA that focuses on identifying changes in specific regions between two patient visits. Unlike previous datasets that primarily address single-image questions, MMXU enables multi-image questions, incorporating both current and historical patient data.
News
(05/16/2025) Our paper has been accepted by ACL 2025 Findings!
(02/22/2025) The MMXU-test benchmark has been released… See the full description on the dataset page: https://huggingface.co/datasets/LinjieMu/MMXU.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Context-Based Question Generation Dataset
This dataset is designed for context-based question generation, where questions of different types (true/false, multiple-choice, open-ended) are generated based on a given context. The dataset is synthetically created using ChatGPT, providing a diverse set of questions to test comprehension and reasoning skills.
Dataset Structure
The dataset is structured with the following fields for each example:
context: The context provided… See the full description on the dataset page: https://huggingface.co/datasets/mito0o852/ContextToQuestions.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset provides essential information for entries related to question answering tasks using AI models. It is designed to offer valuable insights for researchers and practitioners, enabling them to effectively train and rigorously evaluate their machine learning models. The dataset serves as a valuable resource for building and assessing question-answering systems. It is available free of charge.
The data files are typically in CSV format, with a dedicated train.csv file for training data and a test.csv file for testing purposes. The training file contains a large number of examples. Specific dates are not included in this dataset description, which focuses solely on the content and purpose of the data. Specific numbers of rows or records are not detailed in the available information.
This dataset is ideal for a variety of applications and use cases:
* Training and Testing: Utilise train.csv to train question-answering models or algorithms, and test.csv to evaluate their performance on unseen questions.
* Machine Learning Model Creation: Develop machine learning models specifically for question-answering by leveraging the instructional components, including instructions, responses, next responses, and human-generated answers, along with their is_human_response labels.
* Model Performance Evaluation: Assess model performance by comparing predicted responses with actual human-generated answers from the test.csv file (see the sketch after this list).
* Data Augmentation: Expand existing data by paraphrasing instructions or generating alternative responses within similar contexts.
* Conversational Agents: Build conversational agents or chatbots by utilising the instruction-response pairs for training.
* Language Understanding: Train models to understand language and generate responses based on instructions and previous responses.
* Educational Materials: Develop interactive quizzes or study guides, with models providing instant feedback to students.
* Information Retrieval Systems: Create systems that help users find specific answers from large datasets.
* Customer Support: Train customer support chatbots to provide quick and accurate responses to inquiries.
* Language Generation Research: Develop novel algorithms for generating coherent responses in question-answering scenarios.
* Automatic Summarisation Systems: Train systems to generate concise summaries by understanding main content through question answering.
* Dialogue Systems Evaluation: Use the instruction-response pairs as a benchmark for evaluating dialogue system performance.
* NLP Algorithm Benchmarking: Establish baselines against which other NLP tools and methods can be measured.
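For the "Model Performance Evaluation" item, a minimal sketch comparing placeholder predictions against the human-written answers in test.csv; every column name except is_human_response is an assumption about the schema.

```python
# Sketch: exact-match comparison of predicted responses vs. human-generated answers.
import pandas as pd

test = pd.read_csv("test.csv")
human = test[test["is_human_response"] == True]   # keep human-written answers

def exact_match(pred: str, gold: str) -> bool:
    return pred.strip().lower() == gold.strip().lower()

# Placeholder "model": echo the instruction back; replace with real predictions.
preds = human["instruction"].astype(str)
score = sum(exact_match(p, g) for p, g in zip(preds, human["response"].astype(str)))
print("exact-match accuracy:", score / max(len(human), 1))
```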
The dataset's geographic scope is global. There is no specific time range or demographic scope noted within the available details, as specific dates are not included.
CC0
This dataset is highly suitable for:
* Researchers and Practitioners: To gain insights into question answering tasks using AI models.
* Developers: To train models, create chatbots, and build conversational agents.
* Students: For developing educational materials and enhancing their learning experience through interactive tools.
* Individuals and teams working on Natural Language Processing (NLP) projects.
* Those creating information retrieval systems or customer support solutions.
* Experts in natural language generation (NLG) and automatic summarisation systems.
* Anyone involved in the evaluation of dialogue systems and machine learning model training.
Original Data Source: Question-Answering Training and Testing Data
The Natural Questions corpus is a question answering dataset containing 307,373 training examples, 7,830 development examples, and 7,842 test examples. Each example is comprised of a google.com query and a corresponding Wikipedia page. Each Wikipedia page has a passage (or long answer) annotated on the page that answers the question, and one or more short spans from the annotated passage containing the actual answer. The long and short answer annotations can, however, be empty. If they are both empty, then there is no answer on the page at all. If the long answer annotation is non-empty but the short answer annotation is empty, then the annotated passage answers the question but no explicit short answer could be found. Finally, 1% of the documents have a passage annotated with a short answer that is "yes" or "no", instead of a list of short spans.
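A small sketch that buckets a single annotation into those cases; the field names follow the simplified Natural Questions release and are an assumption here.

```python
# Sketch: classify an NQ annotation as no answer, long only, long + short, or yes/no.
def answer_type(annotation: dict) -> str:
    has_long = annotation.get("long_answer", {}).get("start_token", -1) != -1
    has_short = bool(annotation.get("short_answers"))
    yes_no = annotation.get("yes_no_answer", "NONE")
    if yes_no in ("YES", "NO"):
        return "yes/no answer"
    if has_long and has_short:
        return "long answer with short span(s)"
    if has_long:
        return "long answer only"
    return "no answer on the page"

print(answer_type({"long_answer": {"start_token": 10, "end_token": 50},
                   "short_answers": [], "yes_no_answer": "NONE"}))
```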
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is an open-source, publicly available dataset which can be found at https://shahariarrabby.github.io/ekush/. We split the dataset into three sets: train, validation, and test. For our experiments, we created two other versions of the dataset. We applied 10-fold cross-validation on the train set and created ten folds. We also created ten bags of datasets using the bootstrap aggregating method on the train and validation sets. Lastly, we created another dataset using a pre-trained ResNet50 model as a feature extractor. On the features extracted by ResNet50 we applied PCA and created a tabular dataset containing 80 features. pca_features.csv is the train set and pca_test_features.csv is the test set. Fold.tar.gz contains the ten folds of images described above; those folds have also been compressed. Similarly, Bagging.tar.gz contains the ten compressed bags of images. The original train, validation, and test sets are in Train.tar.gz, Validation.tar.gz, and Test.tar.gz, respectively. The compression was performed to speed up uploads and downloads and mostly for the sake of convenience. If anyone has any questions about how the datasets are organized, please feel free to ask me at shiblygnr@gmail.com. I will get back to you as soon as possible.
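A quick-start sketch for the PCA feature tables, assuming a label column named label (the real header may differ) and scikit-learn available:

```python
# Sketch: train a simple classifier on the 80 PCA features described above.
import pandas as pd
from sklearn.linear_model import LogisticRegression

train = pd.read_csv("pca_features.csv")
test = pd.read_csv("pca_test_features.csv")

label_col = "label"                      # assumed label column name
X_train, y_train = train.drop(columns=[label_col]), train[label_col]
X_test, y_test = test.drop(columns=[label_col]), test[label_col]

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```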