100+ datasets found
  1. Question Answering for Financial data (FinQA)

    • kaggle.com
    zip
    Updated Mar 29, 2022
    Cite
    VISALAKSHI IYER (2022). Question Answering for Financial data (FinQA) [Dataset]. https://www.kaggle.com/datasets/visalakshiiyer/question-answering-financial-data
    Explore at:
    Available download formats: zip (13416653 bytes)
    Dataset updated
    Mar 29, 2022
    Authors
    VISALAKSHI IYER
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    The sheer volume of financial statements makes it difficult for humans to access and analyze a business's financials. Robust numerical reasoning likewise faces unique challenges in this domain. In this work, we focus on answering deep questions about financial data, aiming to automate the analysis of a large corpus of financial documents. In contrast to existing tasks in the general domain, the finance domain includes complex numerical reasoning and an understanding of heterogeneous representations. To facilitate analytical progress, we propose a new large-scale dataset, FinQA, with Question-Answering pairs over Financial reports, written by financial experts. More details are provided here: Paper, Preview

    The dataset is stored as JSON files; each entry has the following format:

    ```
    "pre_text": the texts before the table;
    "post_text": the text after the table;
    "table": the table;
    "id": unique example id, composed of the original report name plus the example index for this report;
    "qa": {
        "question": the question;
        "program": the reasoning program;
        "gold_inds": the gold supporting facts;
        "exe_ans": the gold execution result;
        "program_re": the reasoning program in nested format;
    }
    ```

    This dataset is the first of its kind, intended to enable significant new community research into complex application domains. It was hosted as a competition on CodaLab: given a financial report containing both text and tables, the goal is to answer a question requiring numerical reasoning. The code is publicly available @GitHub/FinQA.

  2. Open Ended Question Answer Text Dataset in English

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Cite
    FutureBee AI (2022). Open Ended Question Answer Text Dataset in English [Dataset]. https://www.futurebeeai.com/dataset/prompt-response-dataset/english-open-ended-question-answer-text-dataset
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    The English Open-Ended Question Answering Dataset is a meticulously curated collection of comprehensive Question-Answer pairs. It serves as a valuable resource for training Large Language Models (LLMs) and Question-answering models in the English language, advancing the field of artificial intelligence.

    Dataset Content:

    This QA dataset comprises a diverse set of open-ended questions paired with corresponding answers in English. There is no context paragraph given to choose an answer from, and each question is answered without any predefined context content. The questions cover a broad range of topics, including science, history, technology, geography, literature, current affairs, and more.

    Each question is accompanied by an answer, providing valuable information and insights to enhance the language model training process. Both the questions and answers were manually curated by native English people, and references were taken from diverse sources like books, news articles, websites, and other reliable references.

    This question-answer prompt completion dataset contains different types of prompts, including instruction type, continuation type, and in-context learning (zero-shot, few-shot) type. The dataset also contains questions and answers with different types of rich text, including tables, code, JSON, etc., with proper markdown.

    Question Diversity:

    To ensure diversity, this Q&A dataset includes questions with varying complexity levels, ranging from easy to medium and hard. Different types of questions, such as multiple-choice, direct, and true/false, are included. Additionally, questions are further classified into fact-based and opinion-based categories, creating a comprehensive variety. The QA dataset also contains questions with constraints and persona restrictions, which makes it even more useful for LLM training.

    Answer Formats:

    To accommodate varied learning experiences, the dataset incorporates different types of answer formats. These formats include single-word, short-phrase, single-sentence, and paragraph-length answers. Answers contain text strings, numerical values, and date and time formats as well. Such diversity strengthens the language model's ability to generate coherent and contextually appropriate answers.

    Data Format and Annotation Details:

    This fully labeled English Open Ended Question Answer Dataset is available in JSON and CSV formats. It includes annotation details such as id, language, domain, question_length, prompt_type, question_category, question_type, complexity, answer_type, rich_text.
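
    As an illustration of working with the annotation columns listed above, a minimal Python sketch that filters rows by the documented complexity field; the CSV rows are invented placeholders:

```python
import csv
import io

# Hypothetical rows following the documented annotation columns
# (id, language, domain, question_length, prompt_type, question_category,
#  question_type, complexity, answer_type, rich_text). Values are invented.
raw = """id,language,domain,question_length,prompt_type,question_category,question_type,complexity,answer_type,rich_text
q1,en,science,12,instruction,fact-based,direct,easy,single_word,none
q2,en,history,45,continuation,opinion-based,multiple-choice,hard,paragraph,table
"""

rows = list(csv.DictReader(io.StringIO(raw)))
# Select only the hard questions, e.g. for curriculum-style training.
hard_rows = [r for r in rows if r["complexity"] == "hard"]
print(len(hard_rows))
```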

    Quality and Accuracy:

    The dataset upholds the highest standards of quality and accuracy. Each question undergoes careful validation, and the corresponding answers are thoroughly verified. To prioritize inclusivity, the dataset incorporates questions and answers representing diverse perspectives and writing styles, ensuring it remains unbiased and avoids perpetuating discrimination.

    Both the questions and answers in English are grammatically accurate, without any spelling or grammatical errors. No copyrighted, toxic, or harmful content is used while building this dataset.

    Continuous Updates and Customization:

    The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Continuous efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to collect custom question-answer data tailored to specific needs, providing flexibility and customization options.

    License:

    The dataset, created by FutureBeeAI, is now ready for commercial use. Researchers, data scientists, and developers can utilize this fully labeled and ready-to-deploy English Open Ended Question Answer Dataset to enhance the language understanding capabilities of their generative AI models, improve response generation, and explore new approaches to NLP question-answering tasks.

  3. Clinical Questions Collection

    • catalog.data.gov
    • data.virginia.gov
    • +3more
    Updated Jul 11, 2025
    + more versions
    Cite
    National Library of Medicine (2025). Clinical Questions Collection [Dataset]. https://catalog.data.gov/dataset/clinical-questions-collection-665af
    Explore at:
    Dataset updated
    Jul 11, 2025
    Dataset provided by
    National Library of Medicine
    Description

    The Clinical Questions Collection is a repository of questions collected between 1991 and 2003 from healthcare providers in clinical settings across the country. The questions were submitted by investigators who wished to share their data with other researchers. This dataset is no longer updated with new content. The collection is used in developing approaches to clinical and consumer-health question answering, as well as in researching the information needs of clinicians and the language they use to express those needs. All files are formatted in XML.

  4. English Closed Ended Question Answer Text Dataset

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Cite
    FutureBee AI (2022). English Closed Ended Question Answer Text Dataset [Dataset]. https://www.futurebeeai.com/dataset/prompt-response-dataset/english-closed-ended-question-answer-text-dataset
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    The English Closed-Ended Question Answering Dataset is a meticulously curated collection of 5000 comprehensive Question-Answer pairs. It serves as a valuable resource for training Large Language Models (LLMs) and question-answering models in the English language, advancing the field of artificial intelligence.

    Dataset Content

    This closed-ended QA dataset comprises a diverse set of context paragraphs and questions paired with corresponding answers in English. There is a context paragraph given for each question to get the answer from. The questions cover a broad range of topics, including science, history, technology, geography, literature, current affairs, and more.

    Each question is accompanied by an answer, providing valuable information and insights to enhance the language model training process. Both the questions and answers were manually curated by native English people, and references were taken from diverse sources like books, news articles, websites, web forums, and other reliable references.

    This question-answer prompt completion dataset contains different types of prompts, including instruction type, continuation type, and in-context learning (zero-shot, few-shot) type. The dataset also contains questions and answers with different types of rich text, including tables, code, JSON, etc., with proper markdown.

    Question Diversity

    To ensure diversity, this Q&A dataset includes questions with varying complexity levels, ranging from easy to medium and hard. Different types of questions, such as multiple-choice, direct, and true/false, are included. The QA dataset also contains questions with constraints, which makes it even more useful for LLM training.

    Answer Formats

    To accommodate varied learning experiences, the dataset incorporates different types of answer formats. These formats include single-word, short-phrase, single-sentence, and paragraph-length answers. Answers contain text strings, numerical values, and date and time formats as well. Such diversity strengthens the language model's ability to generate coherent and contextually appropriate answers.

    Data Format and Annotation Details

    This fully labeled English Closed-Ended Question Answer Dataset is available in JSON and CSV formats. It includes annotation details such as a unique id, context paragraph, context reference link, question, question type, question complexity, question category, domain, prompt type, answer, answer type, and rich text presence.
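
    A minimal sketch of a single record following the documented annotation fields; all values, including the reference link, are invented for illustration:

```python
import json

# Invented example record with the documented closed-ended fields:
# unique id, context paragraph, context reference link, question, question
# type, question complexity, question category, domain, prompt type, answer,
# answer type, and rich text presence.
record = {
    "id": "ce-0001",
    "context_paragraph": "Water boils at 100 degrees Celsius at sea level.",
    "context_reference_link": "https://example.com/source",  # placeholder
    "question": "At what temperature does water boil at sea level?",
    "question_type": "direct",
    "question_complexity": "easy",
    "question_category": "fact-based",
    "domain": "science",
    "prompt_type": "instruction",
    "answer": "100 degrees Celsius",
    "answer_type": "short_phrase",
    "rich_text": False,
}

# Closed-ended QA: the answer should be recoverable from the context.
assert record["answer"].split()[0] in record["context_paragraph"]
print(json.dumps(record)[:40])
```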

    Quality and Accuracy

    The dataset upholds the highest standards of quality and accuracy. Each question undergoes careful validation, and the corresponding answers are thoroughly verified. To prioritize inclusivity, the dataset incorporates questions and answers representing diverse perspectives and writing styles, ensuring it remains unbiased and avoids perpetuating discrimination.

    The English version is grammatically accurate, without any spelling or grammatical errors. No toxic or harmful content is used while building this dataset.

    Continuous Updates and Customization

    The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Continuous efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to collect custom question-answer data tailored to specific needs, providing flexibility and customization options.

    License:

    The dataset, created by FutureBeeAI, is now ready for commercial use. Researchers, data scientists, and developers can utilize this fully labeled and ready-to-deploy English Closed-Ended Question Answer Dataset to enhance the language understanding capabilities of their generative AI models, improve response generation, and explore new approaches to NLP question-answering tasks.

  5. Italian Closed Ended Question Answer Text Dataset

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Cite
    FutureBee AI (2022). Italian Closed Ended Question Answer Text Dataset [Dataset]. https://www.futurebeeai.com/dataset/prompt-response-dataset/italian-closed-ended-question-answer-text-dataset
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    The Italian Closed-Ended Question Answering Dataset is a meticulously curated collection of 5000 comprehensive Question-Answer pairs. It serves as a valuable resource for training Large Language Models (LLMs) and question-answering models in the Italian language, advancing the field of artificial intelligence.

    Dataset Content

    This closed-ended QA dataset comprises a diverse set of context paragraphs and questions paired with corresponding answers in Italian. There is a context paragraph given for each question to get the answer from. The questions cover a broad range of topics, including science, history, technology, geography, literature, current affairs, and more.

    Each question is accompanied by an answer, providing valuable information and insights to enhance the language model training process. Both the questions and answers were manually curated by native Italian people, and references were taken from diverse sources like books, news articles, websites, web forums, and other reliable references.

    This question-answer prompt completion dataset contains different types of prompts, including instruction type, continuation type, and in-context learning (zero-shot, few-shot) type. The dataset also contains questions and answers with different types of rich text, including tables, code, JSON, etc., with proper markdown.

    Question Diversity

    To ensure diversity, this Q&A dataset includes questions with varying complexity levels, ranging from easy to medium and hard. Different types of questions, such as multiple-choice, direct, and true/false, are included. The QA dataset also contains questions with constraints, which makes it even more useful for LLM training.

    Answer Formats

    To accommodate varied learning experiences, the dataset incorporates different types of answer formats. These formats include single-word, short-phrase, single-sentence, and paragraph-length answers. Answers contain text strings, numerical values, and date and time formats as well. Such diversity strengthens the language model's ability to generate coherent and contextually appropriate answers.

    Data Format and Annotation Details

    This fully labeled Italian Closed-Ended Question Answer Dataset is available in JSON and CSV formats. It includes annotation details such as a unique id, context paragraph, context reference link, question, question type, question complexity, question category, domain, prompt type, answer, answer type, and rich text presence.

    Quality and Accuracy

    The dataset upholds the highest standards of quality and accuracy. Each question undergoes careful validation, and the corresponding answers are thoroughly verified. To prioritize inclusivity, the dataset incorporates questions and answers representing diverse perspectives and writing styles, ensuring it remains unbiased and avoids perpetuating discrimination.

    The Italian version is grammatically accurate, without any spelling or grammatical errors. No toxic or harmful content is used while building this dataset.

    Continuous Updates and Customization

    The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Continuous efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to collect custom question-answer data tailored to specific needs, providing flexibility and customization options.

    License:

    The dataset, created by FutureBeeAI, is now ready for commercial use. Researchers, data scientists, and developers can utilize this fully labeled and ready-to-deploy Italian Closed-Ended Question Answer Dataset to enhance the language understanding capabilities of their generative AI models, improve response generation, and explore new approaches to NLP question-answering tasks.

  6. Financial Q&A - 10k

    • kaggle.com
    zip
    Updated Jun 17, 2024
    Cite
    Yousef Saeedian (2024). Financial Q&A - 10k [Dataset]. https://www.kaggle.com/datasets/yousefsaeedian/financial-q-and-a-10k
    Explore at:
    Available download formats: zip (753665 bytes)
    Dataset updated
    Jun 17, 2024
    Authors
    Yousef Saeedian
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This dataset, titled "Financial-QA-10k", contains 10,000 question-answer pairs derived from company financial reports, specifically the 10-K filings. The questions are designed to cover a wide range of topics relevant to financial analysis, company operations, and strategic insights, making it a valuable resource for researchers, data scientists, and finance professionals. Each entry includes the question, the corresponding answer, the context from which the answer is derived, the company's stock ticker, and the specific filing year. The dataset aims to facilitate the development and evaluation of natural language processing models in the financial domain.

    About the Dataset Dataset Structure:

    • Rows: 7000
    • Columns: 5
    • question: The financial or operational question asked.
    • answer: The specific answer to the question.
    • context: The textual context extracted from the 10-K filing, providing additional information.
    • ticker: The stock ticker symbol of the company.
    • filing: The year of the 10-K filing from which the question and answer are derived.

    Sample Data:

    Question: What area did NVIDIA initially focus on before expanding into other markets?
    Answer: NVIDIA initially focused on PC graphics.
    Context: Since our original focus on PC graphics, we have expanded into various markets.
    Ticker: NVDA
    Filing: 2023_10K
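
    A minimal Python sketch of loading rows with the five documented columns and filtering by ticker; the single row below adapts the sample entry from the description:

```python
import csv
import io

# Toy CSV mirroring the five documented columns:
# question, answer, context, ticker, filing.
raw = """question,answer,context,ticker,filing
"What area did NVIDIA initially focus on before expanding into other markets?","NVIDIA initially focused on PC graphics.","Since our original focus on PC graphics, we have expanded into various markets.",NVDA,2023_10K
"""

rows = list(csv.DictReader(io.StringIO(raw)))
# Filter Q&A pairs for one company by its stock ticker.
nvda = [r for r in rows if r["ticker"] == "NVDA"]
print(nvda[0]["answer"])
```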

    Potential Uses:

    • Natural Language Processing (NLP): Develop and test NLP models for question answering, context understanding, and information retrieval.
    • Financial Analysis: Extract and analyze specific financial and operational insights from large volumes of textual data.
    • Educational Purposes: Serve as a training and testing resource for students and researchers in finance and data science.

  7. GPT-4o Game Play Data - LLM 20 Questions

    • kaggle.com
    zip
    Updated Sep 2, 2024
    Cite
    Sadhaklal (2024). GPT-4o Game Play Data - LLM 20 Questions [Dataset]. https://www.kaggle.com/datasets/sambitmukherjee/gpt-4o-game-play-data-llm-20-questions
    Explore at:
    Available download formats: zip (2125392 bytes)
    Dataset updated
    Sep 2, 2024
    Authors
    Sadhaklal
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    "20 questions" game play data generated by making a GPT-4o Guesser agent play with a GPT-4o Answerer agent.

    The dataset follows the structure of games in the LLM 20 Questions Kaggle competition. In particular, the Guesser has a maximum of 20 rounds to guess the secret keyword, which is only available to the Answerer. Each round has the following sequence of turns:

    1. Turn type "ask" by the Guesser: The Guesser asks a question.
    2. Turn type "answer" by the Answerer: The Answerer replies with a binary "no" / "yes" answer. (Any other answer is illegal.)
    3. Turn type "guess" by the Guesser: The Guesser guesses the secret keyword.

    The dataset was generated with the objective of cloning GPT-4o's behavior (on the successful games) into a smaller open source LLM such as "Meta-Llama-3.1-8B-Instruct".
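
    The round structure above can be sketched as a simple validation in Python; the game content below is invented for illustration:

```python
# Each round of the documented game structure is an "ask" turn, an "answer"
# turn (strictly "yes" or "no"; anything else is illegal), then a "guess"
# turn, for at most 20 rounds.
MAX_ROUNDS = 20

game = [
    {"ask": "Is it a living thing?", "answer": "no", "guess": "car"},
    {"ask": "Is it a vehicle?", "answer": "yes", "guess": "bicycle"},
]

def validate(game):
    """Check that a game transcript obeys the documented round rules."""
    assert len(game) <= MAX_ROUNDS
    for round_ in game:
        # Binary answers only; any other answer is illegal.
        assert round_["answer"] in {"yes", "no"}
    return True

print(validate(game))
```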

  8. Data from: RadQA: A Question Answering Dataset to Improve Comprehension of...

    • physionet.org
    • oppositeofnorth.com
    Updated Dec 9, 2022
    Cite
    Sarvesh Soni; Kirk Roberts (2022). RadQA: A Question Answering Dataset to Improve Comprehension of Radiology Reports [Dataset]. http://doi.org/10.13026/ckkp-6y19
    Explore at:
    Dataset updated
    Dec 9, 2022
    Authors
    Sarvesh Soni; Kirk Roberts
    License

    https://github.com/MIT-LCP/license-and-dua/tree/master/drafts

    Description

    We present a radiology question answering dataset, RadQA, with 3074 questions posed against radiology reports and annotated with their corresponding answer spans (resulting in a total of 6148 question-answer evidence pairs) by physicians. The questions are manually created using the clinical referral section of the reports that take into account the actual information needs of ordering physicians and eliminate bias from seeing the answer context (and, further, organically create unanswerable questions). The answer spans are marked within the Findings and Impressions sections of a report. The dataset aims to satisfy the complex clinical requirements by including complete (yet concise) answer phrases (which are not just entities) that can span multiple lines. In published work, we conducted a thorough analysis of the proposed dataset by examining the broad categories of disagreement in annotation (providing insights on the errors made by humans) and the reasoning requirements to answer a question (uncovering the huge dependence on medical knowledge for answering the questions). In that work, the best-performing transformer language model achieved an F1 of 63.55 on the test set. However, the top human-level performance on this dataset is 90.31 (with an average human performance of 84.52), which demonstrates the challenging nature of RadQA that leaves ample scope for future method research.

  9. Comprehensive Medical Q&A Dataset

    • kaggle.com
    zip
    Updated Nov 24, 2023
    Cite
    The Devastator (2023). Comprehensive Medical Q&A Dataset [Dataset]. https://www.kaggle.com/datasets/thedevastator/comprehensive-medical-q-a-dataset
    Explore at:
    Available download formats: zip (5126941 bytes)
    Dataset updated
    Nov 24, 2023
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Comprehensive Medical Q&A Dataset

    Unlocking Healthcare Data with Natural Language Processing

    By Huggingface Hub [source]

    About this dataset

    The MedQuad dataset provides a comprehensive source of medical questions and answers for natural language processing. With over 43,000 patient inquiries from real-life situations categorized into 31 distinct types of questions, the dataset offers an invaluable opportunity to research correlations between treatments, chronic diseases, medical protocols and more. Answers provided in this database come not only from doctors but also other healthcare professionals such as nurses and pharmacists, providing a more complete array of responses to help researchers unlock deeper insights within the realm of healthcare. This incredible trove of knowledge is just waiting to be mined - so grab your data mining equipment and get exploring!


    How to use the dataset

    In order to make the most out of this dataset, start by having a look at the column names and understanding what information they offer: qtype (the type of medical question), Question (the question in itself), and Answer (the expert response). The qtype column will help you categorize the dataset according to your desired question topics. Once you have filtered down your criteria as much as possible using qtype, it is time to analyze the data. Start by asking yourself questions such as “What treatments do most patients search for?” or “Are there any correlations between chronic conditions and protocols?” Then use simple queries such as SELECT Answer FROM MedQuad WHERE qtype='Treatment' AND Question LIKE '%pain%' to get closer to answering those questions.

    Once you have obtained new insights about healthcare from the answers in this dynamic dataset, it's time for action! Use that newfound understanding of patient needs to develop educational materials and implement any suggested changes. If more criteria are needed for querying this dataset, check whether MedQuad offers additional columns; extra columns may be added periodically that could further enhance analysis capabilities, so look out for notifications.

    Finally, once you have made an impact with your use case(s), don't forget proper citation etiquette: give credit where credit is due!

    Research Ideas

    • Developing medical diagnostic tools that use natural language processing (NLP) to better identify and diagnose health conditions in patients.
    • Creating predictive models to anticipate treatment options for different medical conditions using machine learning techniques.
    • Leveraging the dataset to build chatbots and virtual assistants that are able to answer a broad range of questions about healthcare with expert-level accuracy.

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: train.csv

    | Column name | Description                                            |
    |:------------|:-------------------------------------------------------|
    | qtype       | The type of medical question. (String)                 |
    | Question    | The medical question posed by the patient. (String)    |
    | Answer      | The expert response to the medical question. (String)  |
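
    As a runnable counterpart to the SQL-style query shown earlier, a minimal Python sketch over the three documented columns; the rows are invented examples, not real MedQuad data:

```python
import csv
import io

# Toy rows with the documented columns: qtype, Question, Answer.
raw = """qtype,Question,Answer
Treatment,"What treatments are available for chronic back pain?","Options include physical therapy and medication."
Symptoms,"What are the symptoms of flu?","Fever, cough, and fatigue."
"""

rows = list(csv.DictReader(io.StringIO(raw)))
# Equivalent of:
#   SELECT Answer FROM MedQuad
#   WHERE qtype='Treatment' AND Question LIKE '%pain%'
answers = [r["Answer"] for r in rows
           if r["qtype"] == "Treatment" and "pain" in r["Question"]]
print(answers)
```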

    Acknowledgements

    If you use this dataset in your research, please credit the original authors and Huggingface Hub.

  10. Bulgarian Open Ended Question Answer Text Dataset

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Bulgarian Open Ended Question Answer Text Dataset [Dataset]. https://www.futurebeeai.com/dataset/prompt-response-dataset/bulgarian-open-ended-question-answer-text-dataset
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    The Bulgarian Open-Ended Question Answering Dataset is a meticulously curated collection of comprehensive Question-Answer pairs. It serves as a valuable resource for training Large Language Models (LLMs) and Question-answering models in the Bulgarian language, advancing the field of artificial intelligence.

    Dataset Content:

    This QA dataset comprises a diverse set of open-ended questions paired with corresponding answers in Bulgarian. There is no context paragraph given to choose an answer from, and each question is answered without any predefined context content. The questions cover a broad range of topics, including science, history, technology, geography, literature, current affairs, and more.

    Each question is accompanied by an answer, providing valuable information and insights to enhance the language model training process. Both the questions and answers were manually curated by native Bulgarian people, and references were taken from diverse sources like books, news articles, websites, and other reliable references.

    This question-answer prompt completion dataset contains different types of prompts, including instruction type, continuation type, and in-context learning (zero-shot, few-shot) type. The dataset also contains questions and answers with different types of rich text, including tables, code, JSON, etc., with proper markdown.

    Question Diversity:

    To ensure diversity, this Q&A dataset includes questions with varying complexity levels, ranging from easy to medium and hard. Different types of questions, such as multiple-choice, direct, and true/false, are included. Additionally, questions are further classified into fact-based and opinion-based categories, creating a comprehensive variety. The QA dataset also contains questions with constraints and persona restrictions, which makes it even more useful for LLM training.

    Answer Formats:

    To accommodate varied learning experiences, the dataset incorporates different types of answer formats. These formats include single-word, short-phrase, single-sentence, and paragraph-length answers. Answers contain text strings, numerical values, and date and time formats as well. Such diversity strengthens the language model's ability to generate coherent and contextually appropriate answers.

    Data Format and Annotation Details:

    This fully labeled Bulgarian Open Ended Question Answer Dataset is available in JSON and CSV formats. It includes annotation details such as id, language, domain, question_length, prompt_type, question_category, question_type, complexity, answer_type, rich_text.

    Quality and Accuracy:

    The dataset upholds the highest standards of quality and accuracy. Each question undergoes careful validation, and the corresponding answers are thoroughly verified. To prioritize inclusivity, the dataset incorporates questions and answers representing diverse perspectives and writing styles, ensuring it remains unbiased and avoids perpetuating discrimination.

    Both the questions and answers in Bulgarian are grammatically accurate, without any spelling or grammatical errors. No copyrighted, toxic, or harmful content is used while building this dataset.

    Continuous Updates and Customization:

    The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Continuous efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to collect custom question-answer data tailored to specific needs, providing flexibility and customization options.

    License:

    The dataset, created by FutureBeeAI, is ready for commercial use. Researchers, data scientists, and developers can utilize this fully labeled and ready-to-deploy Bulgarian Open Ended Question Answer Dataset to enhance the language understanding capabilities of their generative AI models, improve response generation, and explore new approaches to NLP question-answering tasks.

  11. Data from: quac

    • huggingface.co
    • tensorflow.org
    • +1more
    Updated Dec 12, 2020
    Cite
    Ai2 (2020). quac [Dataset]. https://huggingface.co/datasets/allenai/quac
    Explore at:
    Dataset updated
    Dec 12, 2020
    Dataset provided by
    Allen Institute for AI (http://allenai.org/)
    Authors
    Ai2
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Question Answering in Context is a dataset for modeling, understanding, and participating in information seeking dialog. Data instances consist of an interactive dialog between two crowd workers: (1) a student who poses a sequence of freeform questions to learn as much as possible about a hidden Wikipedia text, and (2) a teacher who answers the questions by providing short excerpts (spans) from the text. QuAC introduces challenges not found in existing machine comprehension datasets: its questions are often more open-ended, unanswerable, or only meaningful within the dialog context.
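    The span-based dialog format described above can be sketched as follows (a minimal illustration with made-up offsets, not actual QuAC data):

```python
# One QuAC-style instance: a hidden section text plus a dialog of
# student questions, each answered by a character span of the text.
context = ("Daffy Duck first appeared in Porky's Duck Hunt, released in 1937. "
           "The cartoon was directed by Tex Avery.")

dialog = [
    {"question": "When did Daffy Duck first appear?",
     "answer_start": 29, "answer_end": 65},
    {"question": "Who directed it?",
     "answer_start": 66, "answer_end": 104},
]

def extract_span(ctx, turn):
    """Recover the teacher's answer text from its span offsets."""
    return ctx[turn["answer_start"]:turn["answer_end"]]

for turn in dialog:
    print(turn["question"], "->", extract_span(context, turn))
```

    Note how the second question ("Who directed it?") is only meaningful given the earlier turns, which is exactly the dialog-context challenge QuAC introduces.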

  12. Toloka Visual Question Answering Dataset

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    • +1more
    Updated Oct 10, 2023
    Cite
    Ustalov, Dmitry (2023). Toloka Visual Question Answering Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7057740
    Explore at:
    Dataset updated
    Oct 10, 2023
    Dataset provided by
    Toloka (https://www.toloka.ai/)
    Authors
    Ustalov, Dmitry
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Our dataset consists of the images associated with textual questions. One entry (instance) in our dataset is a question-image pair labeled with the ground truth coordinates of a bounding box containing the visual answer to the given question. The images were obtained from a CC BY-licensed subset of the Microsoft Common Objects in Context dataset, MS COCO. All data labeling was performed on the Toloka crowdsourcing platform, https://toloka.ai/.

    Our dataset has 45,199 instances split among three subsets: train (38,990 instances), public test (1,705 instances), and private test (4,504 instances). The entire train set has been available to everyone since the start of the challenge. The public test set became available, without ground-truth labels, during the evaluation phase of the competition. After the competition ended, both the public and private test sets were released.

    The datasets will be provided as files in the comma-separated values (CSV) format containing the following columns:

        Column      Type      Description
        image       string    URL of an image on a public content delivery network
        width       integer   image width
        height      integer   image height
        left        integer   bounding box coordinate: left
        top         integer   bounding box coordinate: top
        right       integer   bounding box coordinate: right
        bottom      integer   bounding box coordinate: bottom
        question    string    question in English
    This upload also contains a ZIP file with the images from MS COCO.
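    Given the (left, top, right, bottom) columns above, a predicted box can be scored against the ground truth with intersection-over-union; this is a generic sketch, not the competition's official evaluation code:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (left, top, right, bottom) boxes,
    following the column layout described above."""
    la, ta, ra, ba = box_a
    lb, tb, rb, bb = box_b
    inter_w = max(0, min(ra, rb) - max(la, lb))
    inter_h = max(0, min(ba, bb) - max(ta, tb))
    inter = inter_w * inter_h
    union = (ra - la) * (ba - ta) + (rb - lb) * (bb - tb) - inter
    return inter / union if union else 0.0

# Illustrative ground-truth vs. predicted box:
print(iou((10, 10, 50, 50), (30, 30, 70, 70)))  # 400 / 2800, about 0.143
```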

  13. FAQ

    • catalog.data.gov
    • s.cnmilf.com
    • +1more
    Updated Oct 30, 2025
    Cite
    Dashlink (2025). FAQ [Dataset]. https://catalog.data.gov/dataset/faq
    Explore at:
    Dataset updated
    Oct 30, 2025
    Dataset provided by
    Dashlink
    Description

    Answers to frequently asked questions will be posted here for the benefit of all users of this data. Questions posed in the future may also be incorporated into the document.

  14. Amazon Question and Answer Data

    • cseweb.ucsd.edu
    json
    Cite
    UCSD CSE Research Project, Amazon Question and Answer Data [Dataset]. https://cseweb.ucsd.edu/~jmcauley/datasets.html
    Explore at:
    jsonAvailable download formats
    Dataset authored and provided by
    UCSD CSE Research Project
    Description

    These datasets contain 1.48 million questions about products from Amazon, together with their answers.

    Metadata includes

    • question and answer text

    • is the question binary (yes/no), and if so does it have a yes/no answer?

    • timestamps

    • product ID (to reference the review dataset)

    Basic Statistics:

    • Questions: 1.48 million

    • Answers: 4,019,744

    • Labeled yes/no questions: 309,419

    • Number of unique products with questions: 191,185
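    A sketch of parsing such records in Python (the one-dict-literal-per-line layout and the exact field names, e.g. questionType and asin, are assumptions here and should be checked against the dataset page):

```python
import ast

# Hypothetical lines in the style of the UCSD product-QA dumps;
# the record content is made up for illustration.
raw_lines = [
    "{'questionType': 'yes/no', 'asin': 'B00004U9JP', 'answerType': 'Y', "
    "'question': 'Does this fit a 9x13 pan?', 'answer': 'Yes, it does.', "
    "'unixTime': 1382659200}",
]

def parse(lines):
    """Each line is assumed to be a Python dict literal; ast.literal_eval
    parses it without executing arbitrary code."""
    return [ast.literal_eval(line) for line in lines]

records = parse(raw_lines)
yes_no = [r for r in records if r.get("questionType") == "yes/no"]
print(len(yes_no))  # 1
```

    The asin field would link each question back to the corresponding product in the review dataset, as the metadata list above describes.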

  15. 10.44M English Test QA Dataset – All Grades & Subjects

    • nexdata.ai
    Updated Aug 30, 2024
    Cite
    Nexdata (2024). 10.44M English Test QA Dataset – All Grades & Subjects [Dataset]. https://www.nexdata.ai/datasets/llm/1572
    Explore at:
    Dataset updated
    Aug 30, 2024
    Dataset authored and provided by
    Nexdata
    Variables measured
    Format, Content, Language, Data Size, Data Fields, Data processing, Subject categories, Question type categories
    Description

    This dataset contains 10.44 million English-language test questions parsed and structured for large-scale educational AI and NLP applications. Each question record includes the title, answer, explanation (parse), subject, grade level, and question type. Covering a full range of academic stages from primary and middle school to high school and university, the dataset spans core subjects such as English, mathematics, biology, and accounting. The content follows the Anglo-American education system and supports tasks such as question answering, subject knowledge enhancement, educational chatbot training, and intelligent tutoring systems. All data are formatted for efficient machine learning use and comply with data protection regulations including GDPR, CCPA, and PIPL.

  16. Data from: Event-QA: A Dataset for Event-Centric Question Answering over...

    • data.europa.eu
    • data.niaid.nih.gov
    • +1more
    unknown
    Updated Jan 27, 2022
    Cite
    Zenodo (2022). Event-QA: A Dataset for Event-Centric Question Answering over Knowledge Graphs [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-3568387?locale=en
    Explore at:
    unknown(826)Available download formats
    Dataset updated
    Jan 27, 2022
    Dataset authored and provided by
    Zenodo (http://zenodo.org/)
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Event-QA dataset contains 1,000 semantic queries and their corresponding verbalisations for EventKG, a recently proposed event-centric knowledge graph containing over 970 thousand events.

  17. Data from: QuAC Question Answering in Context Dataset

    • kaggle.com
    • opendatalab.com
    zip
    Updated Jan 26, 2020
    Cite
    Jérôme E. Blanchet (2020). QuAC Question Answering in Context Dataset [Dataset]. https://www.kaggle.com/datasets/jeromeblanchet/quac-question-answering-in-context-dataset
    Explore at:
    zip(18443228 bytes)Available download formats
    Dataset updated
    Jan 26, 2020
    Authors
    Jérôme E. Blanchet
    Description


    What is QuAC?

    Question Answering in Context is a dataset for modeling, understanding, and participating in information seeking dialog. Data instances consist of an interactive dialog between two crowd workers: (1) a student who poses a sequence of freeform questions to learn as much as possible about a hidden Wikipedia text, and (2) a teacher who answers the questions by providing short excerpts (spans) from the text. QuAC introduces challenges not found in existing machine comprehension datasets: its questions are often more open-ended, unanswerable, or only meaningful within the dialog context.

    QuAC is meant to be an academic resource and has significant limitations. Please read our detailed datasheet before considering it for any practical application.

    Is QuAC exactly like SQuAD 2.0?

    No, QuAC shares many principles with SQuAD 2.0 such as span based evaluation and unanswerable questions (including website design principles! Big thanks for sharing the code!) but incorporates a new dialog component. We expect models can be easily evaluated on both resources and have tried to make our evaluation protocol as similar as possible to their own.

    QuAC Poster:

    https://quac.ai/quac_poster_pdf.pdf

    Paper:

    QuAC : Question Answering in Context (2018) https://arxiv.org/abs/1808.07036

    Data Source:

    https://quac.ai/


  18. Data from: Evaluating SQuAD-based Question Answering for the Open Research...

    • data.uni-hannover.de
    csv, json
    Updated Dec 5, 2022
    Cite
    TIB (2022). Evaluating SQuAD-based Question Answering for the Open Research Knowledge Graph Completion [Dataset]. https://data.uni-hannover.de/dataset/evaluating-squad-based-question-answering-for-the-open-research-knowledge-graph-completion
    Explore at:
    csv, jsonAvailable download formats
    Dataset updated
    Dec 5, 2022
    Dataset authored and provided by
    TIB
    License

    Attribution 3.0 (CC BY 3.0), https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    This dataset is part of the bachelor thesis "Evaluating SQuAD-based Question Answering for the Open Research Knowledge Graph Completion". It was created for fine-tuning BERT-based models pre-trained on the SQuAD dataset, using a semi-automatic approach on the ORKG data.

    The dataset.csv file contains the entire data (all properties) in tabular form and is unsplit. The JSON files contain only the fields necessary for training and evaluation, plus additional fields (the start and end indices of the answers in the abstracts). The data in the JSON files is split into training and evaluation sets. We create four variants of the training and evaluation sets, one for each question label ("no label", "how", "what", "which").
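    Deriving the answer start and end indices stored in the JSON files can be sketched like this (a minimal illustration; the actual preprocessing in the linked repositories may differ):

```python
def answer_span(abstract, answer):
    """Locate the answer string in the abstract and return SQuAD-style
    character offsets (start index, end index), or None if absent."""
    start = abstract.find(answer)
    if start == -1:
        return None
    return start, start + len(answer)

# Illustrative abstract and answer, not taken from the ORKG corpus:
abstract = "We evaluate BERT-based models on the ORKG completion task."
print(answer_span(abstract, "BERT-based models"))  # (12, 29)
```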

    For detailed information on each of the fields in the dataset, refer to section 4.2 (Corpus) of the Thesis document that can be found in https://www.repo.uni-hannover.de/handle/123456789/12958.

    The script used to generate the dataset can be found in the public repository https://github.com/as18cia/thesis_work and https://gitlab.com/TIBHannover/orkg/nlp/experiments/orkg-fine-tuning-squad-based-models

  19. Data from: svq

    • huggingface.co
    Updated May 23, 2025
    Cite
    Google (2025). svq [Dataset]. https://huggingface.co/datasets/google/svq
    Explore at:
    Dataset updated
    May 23, 2025
    Dataset authored and provided by
    Google (http://google.com/)
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Simple Voice Questions

    Simple Voice Questions (SVQ) is a set of short audio questions recorded in 26 locales across 17 languages under multiple audio conditions.

      Data Collection
    

    Speakers were presented with recording instructions specifying the recording environment and the text query to be recorded. They recorded using their own phones or tablets under four conditions:

    • clean: record in a quiet environment
    • background speech noise: record while audio from sources like podcasts… See the full description on the dataset page: https://huggingface.co/datasets/google/svq.

  20. MTR-QA: Multi-Type Reasoning Question Answering Dataset

    • scidb.cn
    Updated Apr 17, 2025
    Cite
    Wang Qiang; Jiang Chenglin; Ma Ning; Li Yingjie; Wu Wenshe (2025). MTR-QA:Multi-Type Reasoning Question Answering Dataset [Dataset]. http://doi.org/10.57760/sciencedb.23774
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 17, 2025
    Dataset provided by
    Science Data Bank
    Authors
    Wang Qiang; Jiang Chenglin; Ma Ning; Li Yingjie; Wu Wenshe
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0), https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    The MTR-QA dataset contains 24,312 reasoning examples: 8,740 logical reasoning, 9,105 semantic reasoning, 2,647 mathematical reasoning, and 3,818 comprehensive knowledge reasoning items, 34.1 MB in total. It is stored in JSON format; each JSON object has six attributes: instruction, question, answer, target, label, and difficulty, corresponding to the user-provided instruction, the user-provided options, the correct answer, the chain of thought, the question's reasoning type, and the question's difficulty level. The label attribute distinguishes the four reasoning types: logical, semantic, mathematical, and comprehensive knowledge. Difficulty has three levels (beginner, intermediate, and advanced), assigned according to language difficulty, number of problem-solving steps, and complexity of the knowledge points, so the difficulty of each question can be assessed clearly.
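    Tallying such records by reasoning type and difficulty can be sketched as follows (illustrative records only, not actual MTR-QA content):

```python
from collections import Counter

# Records following the six attributes described above (values are made up):
records = [
    {"instruction": "Choose the correct option.", "question": "...",
     "answer": "B", "target": "...", "label": "logical", "difficulty": "beginner"},
    {"instruction": "Solve the problem.", "question": "...",
     "answer": "42", "target": "...", "label": "mathematical", "difficulty": "advanced"},
    {"instruction": "Infer the relation.", "question": "...",
     "answer": "A", "target": "...", "label": "logical", "difficulty": "intermediate"},
]

by_label = Counter(r["label"] for r in records)
by_difficulty = Counter(r["difficulty"] for r in records)
print(dict(by_label))  # {'logical': 2, 'mathematical': 1}
```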
