100+ datasets found
1. Filipino Closed Ended Question Answer Text Dataset

    • futurebeeai.com
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Filipino Closed Ended Question Answer Text Dataset [Dataset]. https://www.futurebeeai.com/dataset/prompt-response-dataset/filipino-closed-ended-question-answer-text-dataset
    Explore at:
Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    What’s Included

    The Filipino Closed-Ended Question Answering Dataset is a meticulously curated collection of 5000 comprehensive Question-Answer pairs. It serves as a valuable resource for training Large Language Models (LLMs) and question-answering models in the Filipino language, advancing the field of artificial intelligence.

    Dataset Content:

This closed-ended QA dataset comprises a diverse set of context paragraphs and questions paired with corresponding answers in Filipino. Each question is paired with a context paragraph from which the answer is drawn. The questions cover a broad range of topics, including science, history, technology, geography, literature, current affairs, and more.

Each question is accompanied by an answer, providing valuable information and insights to enhance the language model training process. Both the questions and answers were manually curated by native Filipino speakers, with references drawn from diverse sources such as books, news articles, websites, web forums, and other reliable materials.

    This question-answer prompt completion dataset contains different types of prompts, including instruction type, continuation type, and in-context learning (zero-shot, few-shot) type. The dataset also contains questions and answers with different types of rich text, including tables, code, JSON, etc., with proper markdown.

    Question Diversity:

    To ensure diversity, this Q&A dataset includes questions with varying complexity levels, ranging from easy to medium and hard. Different types of questions, such as multiple-choice, direct, and true/false, are included. The QA dataset also contains questions with constraints, which makes it even more useful for LLM training.

    Answer Formats:

To accommodate varied learning experiences, the dataset incorporates different answer formats, including single-word, short-phrase, single-sentence, and paragraph-length answers. Answers include text strings, numerical values, and date and time formats. Such diversity strengthens the language model's ability to generate coherent and contextually appropriate answers.

    Data Format and Annotation Details:

    This fully labeled Filipino Closed-Ended Question Answer Dataset is available in JSON and CSV formats. It includes annotation details such as a unique id, context paragraph, context reference link, question, question type, question complexity, question category, domain, prompt type, answer, answer type, and rich text presence.
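The annotation fields listed above can be pictured as a single record. The following is a minimal sketch with a fabricated example; the field names and values are assumptions for illustration, and the dataset's actual JSON schema may differ:

```python
import json

# Hypothetical record illustrating the annotation fields described above;
# the real field names and values in the dataset may differ.
record = {
    "id": "fil-qa-00001",
    "context_paragraph": "Ang Maynila ang kabisera ng Pilipinas.",
    "context_reference_link": "https://example.com/source",  # placeholder URL
    "question": "Ano ang kabisera ng Pilipinas?",
    "question_type": "direct",
    "question_complexity": "easy",
    "question_category": "geography",
    "domain": "general",
    "prompt_type": "instruction",
    "answer": "Maynila",
    "answer_type": "single-word",
    "rich_text_presence": False,
}

# Round-trip through JSON, since the dataset ships in JSON (and CSV) formats.
serialized = json.dumps(record, ensure_ascii=False)
restored = json.loads(serialized)
print(restored["answer"])  # -> Maynila
```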

    Quality and Accuracy:

    The dataset upholds the highest standards of quality and accuracy. Each question undergoes careful validation, and the corresponding answers are thoroughly verified. To prioritize inclusivity, the dataset incorporates questions and answers representing diverse perspectives and writing styles, ensuring it remains unbiased and avoids perpetuating discrimination.

The Filipino version is grammatically accurate, with no spelling or grammatical errors. No toxic or harmful content was used in building this dataset.

    Continuous Updates and Customization:

    The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Continuous efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to collect custom question-answer data tailored to specific needs, providing flexibility and customization options.

    License:

    The dataset, created by FutureBeeAI, is now ready for commercial use. Researchers, data scientists, and developers can utilize this fully labeled and ready-to-deploy Filipino Closed-Ended Question Answer Dataset to enhance the language understanding capabilities of their generative AI models, improve response generation, and explore new approaches to NLP question-answering tasks.

2. Hindi Open Ended Question Answer Text Dataset

    • futurebeeai.com
    Updated Aug 1, 2022
    + more versions
    Cite
    FutureBee AI (2022). Hindi Open Ended Question Answer Text Dataset [Dataset]. https://www.futurebeeai.com/dataset/prompt-response-dataset/hindi-open-ended-question-answer-text-dataset
    Explore at:
Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    What’s Included

The Hindi Open-Ended Question Answering Dataset is a meticulously curated collection of comprehensive Question-Answer pairs. It serves as a valuable resource for training Large Language Models (LLMs) and question-answering models in the Hindi language, advancing the field of artificial intelligence.

    Dataset Content:

    This QA dataset comprises a diverse set of open-ended questions paired with corresponding answers in Hindi. There is no context paragraph given to choose an answer from, and each question is answered without any predefined context content. The questions cover a broad range of topics, including science, history, technology, geography, literature, current affairs, and more.

Each question is accompanied by an answer, providing valuable information and insights to enhance the language model training process. Both the questions and answers were manually curated by native Hindi speakers, with references drawn from diverse sources such as books, news articles, websites, and other reliable materials.

    This question-answer prompt completion dataset contains different types of prompts, including instruction type, continuation type, and in-context learning (zero-shot, few-shot) type. The dataset also contains questions and answers with different types of rich text, including tables, code, JSON, etc., with proper markdown.

    Question Diversity:

To ensure diversity, this Q&A dataset includes questions with varying complexity levels, ranging from easy to medium and hard. Different types of questions, such as multiple-choice, direct, and true/false, are included. Additionally, questions are further classified into fact-based and opinion-based categories, creating a comprehensive variety. The QA dataset also contains questions with constraints and persona restrictions, which makes it even more useful for LLM training.

    Answer Formats:

To accommodate varied learning experiences, the dataset incorporates different answer formats, including single-word, short-phrase, single-sentence, and paragraph-length answers. Answers include text strings, numerical values, and date and time formats. Such diversity strengthens the language model's ability to generate coherent and contextually appropriate answers.

    Data Format and Annotation Details:

    This fully labeled Hindi Open Ended Question Answer Dataset is available in JSON and CSV formats. It includes annotation details such as id, language, domain, question_length, prompt_type, question_category, question_type, complexity, answer_type, rich_text.
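Since the dataset ships in CSV as well, the annotation columns listed above can be read with the standard library. A minimal sketch with a fabricated one-row CSV (the real rows and values will differ):

```python
import csv
import io

# Fabricated CSV using the annotation column names listed above;
# real dataset rows will differ.
csv_text = (
    "id,language,domain,question_length,prompt_type,question_category,"
    "question_type,complexity,answer_type,rich_text\n"
    "hin-qa-00001,Hindi,general,42,instruction,science,direct,easy,short-phrase,False\n"
)

# DictReader maps each row to a dict keyed by the header columns.
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(rows[0]["complexity"])  # -> easy
```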

    Quality and Accuracy:

    The dataset upholds the highest standards of quality and accuracy. Each question undergoes careful validation, and the corresponding answers are thoroughly verified. To prioritize inclusivity, the dataset incorporates questions and answers representing diverse perspectives and writing styles, ensuring it remains unbiased and avoids perpetuating discrimination.

Both the questions and answers in Hindi are grammatically accurate, without any spelling or grammatical errors. No copyrighted, toxic, or harmful content was used in building this dataset.

    Continuous Updates and Customization:

    The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Continuous efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to collect custom question-answer data tailored to specific needs, providing flexibility and customization options.

    License:

The dataset, created by FutureBeeAI, is now ready for commercial use. Researchers, data scientists, and developers can utilize this fully labeled and ready-to-deploy Hindi Open Ended Question Answer Dataset to enhance the language understanding capabilities of their generative AI models, improve response generation, and explore new approaches to NLP question-answering tasks.

3. Data from: Fine-tuned models for extractive question answering in the Slovenian language

    • live.european-language-grid.eu
    Updated Sep 21, 2022
    Cite
    (2022). Fine-tuned models for extractive question answering in the Slovenian language [Dataset]. https://live.european-language-grid.eu/catalogue/tool-service/21436
    Explore at:
    Dataset updated
    Sep 21, 2022
    License

Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

Six different fine-tuned Transformer-based models that solve the downstream task of extractive question answering in the Slovenian language. The fine-tuned models included are: bert-base-cased-squad2-SLO, bert-base-multilingual-cased-squad2-SLO, electra-base-squad2-SLO, roberta-base-squad2-SLO, sloberta-squad2-SLO and xlm-roberta-base-squad2-SLO. The models were trained and evaluated using the Slovene translation of the SQuAD2.0 dataset (https://www.clarin.si/repository/xmlui/handle/11356/1756).

The models achieve the following metric values:

sloberta-squad2-SLO: EM=67.1, F1=73.56
xlm-roberta-base-squad2-SLO: EM=62.52, F1=69.51
bert-base-multilingual-cased-squad2-SLO: EM=61.37, F1=68.1
roberta-base-squad2-SLO: EM=58.23, F1=64.62
bert-base-cased-squad2-SLO: EM=55.12, F1=60.52
electra-base-squad2-SLO: EM=53.69, F1=60.85
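The EM and F1 numbers above are the standard extractive-QA metrics: exact match of the normalized answer string, and token-overlap F1. A simplified sketch of how they are typically computed per example (official SQuAD-style scripts additionally strip articles and punctuation):

```python
from collections import Counter

def exact_match(prediction: str, gold: str) -> int:
    # 1 if the normalized strings match exactly, else 0.
    return int(prediction.strip().lower() == gold.strip().lower())

def token_f1(prediction: str, gold: str) -> float:
    # Token-overlap F1 between the predicted and gold answer spans.
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Ljubljana", "ljubljana"))                     # -> 1
print(round(token_f1("the capital Ljubljana", "Ljubljana"), 2))  # -> 0.5
```

Corpus-level EM/F1 are then the averages of these per-example scores over the evaluation set.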

4. Data from: quora-question-answer-dataset

    • huggingface.co
    Updated Sep 2, 2023
    + more versions
    Cite
    Gregory Bizup (2023). quora-question-answer-dataset [Dataset]. https://huggingface.co/datasets/toughdata/quora-question-answer-dataset
    Explore at:
Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Sep 2, 2023
    Authors
    Gregory Bizup
    License

https://choosealicense.com/licenses/gpl-3.0/

    Description

    Quora Question Answer Dataset (Quora-QuAD) contains 56,402 question-answer pairs scraped from Quora.

Usage:

    For instructions on fine-tuning a model (Flan-T5) with this dataset, please check out the article: https://www.toughdata.net/blog/post/finetune-flan-t5-question-answer-quora-dataset
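Fine-tuning a seq2seq model such as Flan-T5 on question-answer pairs generally starts by mapping each pair to an (input text, target text) example. A minimal, library-free sketch of that preprocessing step; the prompt template and field names are assumptions for illustration, not the linked article's exact recipe:

```python
# Hypothetical QA pairs in the shape of a question/answer dataset;
# real Quora-QuAD rows will differ.
pairs = [
    {"question": "What is overfitting?",
     "answer": "When a model memorizes training data instead of generalizing."},
]

def to_seq2seq_example(pair):
    # Map a QA pair to the (input_text, target_text) form that
    # seq2seq fine-tuning expects; the prompt wording is an assumption.
    return {
        "input_text": f"Answer the question: {pair['question']}",
        "target_text": pair["answer"],
    }

examples = [to_seq2seq_example(p) for p in pairs]
print(examples[0]["input_text"])  # -> Answer the question: What is overfitting?
```

The resulting input/target pairs would then be tokenized and fed to the model's trainer.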

5. Data from: QuAC Dataset

    • paperswithcode.com
    Updated Aug 22, 2018
    + more versions
    Cite
    Eunsol Choi; He He; Mohit Iyyer; Mark Yatskar; Wen-tau Yih; Yejin Choi; Percy Liang; Luke Zettlemoyer (2018). QuAC Dataset [Dataset]. https://paperswithcode.com/dataset/quac
    Explore at:
    Dataset updated
    Aug 22, 2018
    Authors
    Eunsol Choi; He He; Mohit Iyyer; Mark Yatskar; Wen-tau Yih; Yejin Choi; Percy Liang; Luke Zettlemoyer
    Description

Question Answering in Context (QuAC) is a large-scale dataset that consists of around 14K crowdsourced question answering dialogs with 98K question-answer pairs in total. Each data instance is an interactive dialog between two crowd workers: (1) a student who poses a sequence of freeform questions to learn as much as possible about a hidden Wikipedia text, and (2) a teacher who answers the questions by providing short excerpts (spans) from the text.

6. Data from: HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering

    • hotpotqa.github.io
    • explagraphs.github.io
    • +1more
    Updated Jun 25, 2024
    + more versions
    Cite
    Carnegie Mellon University, Stanford University, Université de Montréal (2024). HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering [Dataset]. https://hotpotqa.github.io/
    Explore at:
Available download formats: json
    Dataset updated
    Jun 25, 2024
    Dataset authored and provided by
    Carnegie Mellon University, Stanford University, Université de Montréal
    Description

HotpotQA is a question answering dataset featuring natural, multi-hop questions, with strong supervision for supporting facts, enabling more explainable question answering systems. It is built from Wikipedia.

7. English Closed Ended Question Answer Text Dataset

    • futurebeeai.com
    Updated Aug 1, 2022
    + more versions
    Cite
    FutureBee AI (2022). English Closed Ended Question Answer Text Dataset [Dataset]. https://www.futurebeeai.com/dataset/prompt-response-dataset/english-closed-ended-question-answer-text-dataset
    Explore at:
Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    What’s Included

    The English Closed-Ended Question Answering Dataset is a meticulously curated collection of 5000 comprehensive Question-Answer pairs. It serves as a valuable resource for training Large Language Models (LLMs) and question-answering models in the English language, advancing the field of artificial intelligence.

    Dataset Content:

This closed-ended QA dataset comprises a diverse set of context paragraphs and questions paired with corresponding answers in English. Each question is paired with a context paragraph from which the answer is drawn. The questions cover a broad range of topics, including science, history, technology, geography, literature, current affairs, and more.

Each question is accompanied by an answer, providing valuable information and insights to enhance the language model training process. Both the questions and answers were manually curated by native English speakers, with references drawn from diverse sources such as books, news articles, websites, web forums, and other reliable materials.

    This question-answer prompt completion dataset contains different types of prompts, including instruction type, continuation type, and in-context learning (zero-shot, few-shot) type. The dataset also contains questions and answers with different types of rich text, including tables, code, JSON, etc., with proper markdown.

    Question Diversity:

    To ensure diversity, this Q&A dataset includes questions with varying complexity levels, ranging from easy to medium and hard. Different types of questions, such as multiple-choice, direct, and true/false, are included. The QA dataset also contains questions with constraints, which makes it even more useful for LLM training.

    Answer Formats:

To accommodate varied learning experiences, the dataset incorporates different answer formats, including single-word, short-phrase, single-sentence, and paragraph-length answers. Answers include text strings, numerical values, and date and time formats. Such diversity strengthens the language model's ability to generate coherent and contextually appropriate answers.

    Data Format and Annotation Details:

    This fully labeled English Closed-Ended Question Answer Dataset is available in JSON and CSV formats. It includes annotation details such as a unique id, context paragraph, context reference link, question, question type, question complexity, question category, domain, prompt type, answer, answer type, and rich text presence.

    Quality and Accuracy:

    The dataset upholds the highest standards of quality and accuracy. Each question undergoes careful validation, and the corresponding answers are thoroughly verified. To prioritize inclusivity, the dataset incorporates questions and answers representing diverse perspectives and writing styles, ensuring it remains unbiased and avoids perpetuating discrimination.

The English version is grammatically accurate, with no spelling or grammatical errors. No toxic or harmful content was used in building this dataset.

    Continuous Updates and Customization:

    The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Continuous efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to collect custom question-answer data tailored to specific needs, providing flexibility and customization options.

    License:

    The dataset, created by FutureBeeAI, is now ready for commercial use. Researchers, data scientists, and developers can utilize this fully labeled and ready-to-deploy English Closed-Ended Question Answer Dataset to enhance the language understanding capabilities of their generative AI models, improve response generation, and explore new approaches to NLP question-answering tasks.

8. Data from: SPIQA: A Dataset for Multimodal Question Answering on Scientific Papers

    • paperswithcode.com
    Updated Jul 12, 2024
    Cite
Shraman Pramanick; Rama Chellappa; Subhashini Venugopalan (2024). SPIQA: A Dataset for Multimodal Question Answering on Scientific Papers [Dataset]. https://paperswithcode.com/dataset/spiqa-a-dataset-for-multimodal-question
    Explore at:
    Dataset updated
    Jul 12, 2024
    Authors
    Shraman Pramanick; Rama Chellappa; Subhashini Venugopalan
    Description

SPIQA Dataset Card

Dataset Details

Dataset Name: SPIQA (Scientific Paper Image Question Answering)

    Paper: SPIQA: A Dataset for Multimodal Question Answering on Scientific Papers

    Github: SPIQA eval and metrics code repo

Dataset Summary: SPIQA is a large-scale and challenging QA dataset focused on figures, tables, and text paragraphs from scientific research papers in various computer science domains. The figures cover a wide variety of plots, charts, schematic diagrams, result visualizations, etc. The dataset is the result of a meticulous curation process, leveraging the breadth of expertise and the ability of multimodal large language models (MLLMs) to understand figures. We employ both automatic and manual curation to ensure the highest level of quality and reliability. SPIQA consists of more than 270K questions divided into training, validation, and three different evaluation splits. The purpose of the dataset is to evaluate the ability of Large Multimodal Models to comprehend complex figures and tables together with the textual paragraphs of scientific papers.

    This Data Card describes the structure of the SPIQA dataset, divided into training, validation, and three different evaluation splits. The test-B and test-C splits are filtered from the QASA and QASPER datasets and contain human-written QAs. We collect all scientific papers published at top computer science conferences between 2018 and 2023 from arXiv.

    If you have any comments or questions, reach out to Shraman Pramanick or Subhashini Venugopalan.

Supported Tasks:
- Direct QA with figures and tables
- Direct QA with full paper
- CoT QA (retrieval of helpful figures and tables, then answering)

    Language: English

Release Date: SPIQA was released in June 2024.

Data Splits

The statistics of the different splits of SPIQA are shown below.

Split  | Papers | Questions | Schematics | Plots & Charts | Visualizations | Other figures | Tables
Train  | 25,459 | 262,524   | 44,008     | 70,041         | 27,297         | 6,450         | 114,728
Val    | 200    | 2,085     | 360        | 582            | 173            | 55            | 915
test-A | 118    | 666       | 154        | 301            | 131            | 95            | 434
test-B | 65     | 228       | 147        | 156            | 133            | 17            | 341
test-C | 314    | 493       | 415        | 404            | 26             | 66            | 1,332

Dataset Structure

The contents of this dataset card are structured as follows:

SPIQA
├── SPIQA_train_val_test-A_extracted_paragraphs.zip
│   └── Extracted textual paragraphs from the papers in SPIQA train, val and test-A splits
├── SPIQA_train_val_test-A_raw_tex.zip
│   └── The raw tex files from the papers in SPIQA train, val and test-A splits. These files are not required to reproduce our results; we open-source them for future research.
├── train_val
│   ├── SPIQA_train_val_Images.zip
│   │   └── Full resolution figures and tables from the papers in SPIQA train, val splits
│   ├── SPIQA_train.json
│   │   └── SPIQA train metadata
│   └── SPIQA_val.json
│       └── SPIQA val metadata
├── test-A
│   ├── SPIQA_testA_Images.zip
│   │   └── Full resolution figures and tables from the papers in SPIQA test-A split
│   ├── SPIQA_testA_Images_224px.zip
│   │   └── 224px figures and tables from the papers in SPIQA test-A split
│   └── SPIQA_testA.json
│       └── SPIQA test-A metadata
├── test-B
│   ├── SPIQA_testB_Images.zip
│   │   └── Full resolution figures and tables from the papers in SPIQA test-B split
│   ├── SPIQA_testB_Images_224px.zip
│   │   └── 224px figures and tables from the papers in SPIQA test-B split
│   └── SPIQA_testB.json
│       └── SPIQA test-B metadata
└── test-C
    ├── SPIQA_testC_Images.zip
    │   └── Full resolution figures and tables from the papers in SPIQA test-C split
    ├── SPIQA_testC_Images_224px.zip
    │   └── 224px figures and tables from the papers in SPIQA test-C split
    └── SPIQA_testC.json
        └── SPIQA test-C metadata

    The testA_data_viewer.json file is only for viewing a portion of the data on HuggingFace viewer to get a quick sense of the metadata.

Metadata Structure

The metadata for every split is provided as a dictionary whose keys are the arXiv IDs of the papers. The primary contents of each dictionary item are:

- arXiv ID
- Semantic Scholar ID (for test-B)
- Figures and tables:
  - Name of the png file
  - Caption
  - Content type (figure or table)
  - Figure type (schematic, plot, photo (visualization), others)
- QAs:
  - Question, answer, and rationale
  - Reference figures and tables
  - Textual evidence (for test-B and test-C)
- Abstract and full paper text (for test-B and test-C; full paper text for the other splits is provided as a zip)

Dataset Use and Starter Snippets

Downloading the Dataset to Local

We recommend users download the metadata and images to their local machine.

Download the whole dataset (all splits):

    from huggingface_hub import snapshot_download
    snapshot_download(repo_id="google/spiqa", repo_type="dataset", local_dir='.')  # mention the local directory path

Download a specific file:

    from huggingface_hub import hf_hub_download
    hf_hub_download(repo_id="google/spiqa", filename="test-A/SPIQA_testA.json", repo_type="dataset", local_dir='.')  # mention the local directory path

Questions and Answers from a Specific Paper in test-A:

    import json
    testA_metadata = json.load(open('test-A/SPIQA_testA.json', 'r'))
    paper_id = '1702.03584v3'
    print(testA_metadata[paper_id]['qa'])

Questions and Answers from a Specific Paper in test-B:

    import json
    testB_metadata = json.load(open('test-B/SPIQA_testB.json', 'r'))
    paper_id = '1707.07012'
    print(testB_metadata[paper_id]['question'])     ## Questions
    print(testB_metadata[paper_id]['composition'])  ## Answers

Questions and Answers from a Specific Paper in test-C:

    import json
    testC_metadata = json.load(open('test-C/SPIQA_testC.json', 'r'))
    paper_id = '1808.08780'
    print(testC_metadata[paper_id]['question'])  ## Questions
    print(testC_metadata[paper_id]['answer'])    ## Answers

Annotation Overview

Questions and answers for the SPIQA train, validation, and test-A sets were machine-generated. Additionally, the SPIQA test-A set was manually filtered and curated. Questions in the SPIQA test-B set are collected from the QASA dataset, while those in the SPIQA test-C set are from the QASPER dataset. Answering the questions in all splits requires a holistic understanding of figures and tables together with related text from the scientific papers.

Personal and Sensitive Information

We are not aware of any personal or sensitive information in the dataset.

Licensing Information

CC BY 4.0

Citation Information:

    @article{pramanick2024spiqa,
      title={SPIQA: A Dataset for Multimodal Question Answering on Scientific Papers},
      author={Pramanick, Shraman and Chellappa, Rama and Venugopalan, Subhashini},
      journal={NeurIPS},
      year={2024}
    }

9. Question Answering Dataset

    • opendatabay.com
    Updated Jun 6, 2025
    Cite
    Datasimple (2025). Question Answering Dataset [Dataset]. https://www.opendatabay.com/data/dataset/f629f4eb-7708-4285-b55b-6766d9a1f15a
    Explore at:
Available download formats: .csv
    Dataset updated
    Jun 6, 2025
    Dataset authored and provided by
    Datasimple
    License

CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Data Science and Analytics
    Description

    This dataset is curated to support research and development in natural language processing (NLP), particularly in the area of question answering systems. Focused on the domain of Data Science and Analytics, it contains a diverse collection of question-answer pairs designed to reflect real-world inquiries about key concepts, tools, techniques, and trends within the field.

    Each entry includes:

    A natural language question related to data science topics such as machine learning, data wrangling, statistical analysis, data visualization, big data technologies, and analytics methods.

    A corresponding answer, verified for accuracy and clarity, suitable for use in both retrieval-based and generative QA models.

    Optional metadata such as topic category, difficulty level, and source context, where applicable.

    Use Cases:

    Training and evaluating QA models and chatbots focused on technical domains.

    Developing educational tools and intelligent tutoring systems for data science learners.

    Benchmarking NLP systems for domain-specific understanding and reasoning.

    Target Audience:

    AI/ML researchers

    Data science educators and students

    NLP developers working on domain-specific applications

    This dataset aims to bridge the gap between technical knowledge and natural language understanding by providing high-quality QA pairs tailored to one of today’s most in-demand fields.

    Original Data Source: Question Answering Dataset

10. cncf-question-and-answer-dataset-for-llm-training

    • huggingface.co
    Updated Nov 29, 2020
    Cite
    Kubermatic (2020). cncf-question-and-answer-dataset-for-llm-training [Dataset]. https://huggingface.co/datasets/Kubermatic/cncf-question-and-answer-dataset-for-llm-training
    Explore at:
Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Nov 29, 2020
    Dataset authored and provided by
    Kubermatic
    Description

    CNCF QA Dataset for LLM Tuning

Description

This dataset, named cncf-qa-dataset-for-llm-tuning, is designed for fine-tuning large language models (LLMs) and is formatted in a question-answer (QA) style. The data is sourced from PDF and markdown (MD) files extracted from various project repositories within the CNCF (Cloud Native Computing Foundation) landscape. These files were processed and converted into a QA format to be fed into the LLM. The dataset includes the… See the full description on the dataset page: https://huggingface.co/datasets/Kubermatic/cncf-question-and-answer-dataset-for-llm-training.
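The markdown-to-QA conversion described above can be pictured as splitting docs into sections and pairing each heading with the text under it. A toy sketch under that assumption; the headings, regex, and output shape here are illustrative, not Kubermatic's actual pipeline:

```python
import re

# Toy markdown standing in for a CNCF project doc; real files differ.
md = """# What is Kubernetes?
Kubernetes is an open-source container orchestration system.

# What is Helm?
Helm is a package manager for Kubernetes.
"""

def markdown_to_qa(text):
    # Pair each '# heading' with the body below it as one QA example.
    qa_pairs = []
    for match in re.finditer(r"^# (.+?)\n(.+?)(?=\n# |\Z)", text, re.S | re.M):
        question, answer = match.group(1).strip(), match.group(2).strip()
        qa_pairs.append({"question": question, "answer": answer})
    return qa_pairs

qa = markdown_to_qa(md)
print(len(qa))  # -> 2
```

A real pipeline would also need PDF text extraction and cleanup steps before this stage.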

11. Spanish Open Ended Question Answer Text Dataset

    • futurebeeai.com
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Spanish Open Ended Question Answer Text Dataset [Dataset]. https://www.futurebeeai.com/dataset/prompt-response-dataset/spanish-open-ended-question-answer-text-dataset
    Explore at:
Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    What’s Included

The Spanish Open-Ended Question Answering Dataset is a meticulously curated collection of comprehensive Question-Answer pairs. It serves as a valuable resource for training Large Language Models (LLMs) and question-answering models in the Spanish language, advancing the field of artificial intelligence.

    Dataset Content:

    This QA dataset comprises a diverse set of open-ended questions paired with corresponding answers in Spanish. There is no context paragraph given to choose an answer from, and each question is answered without any predefined context content. The questions cover a broad range of topics, including science, history, technology, geography, literature, current affairs, and more.

Each question is accompanied by an answer, providing valuable information and insights to enhance the language model training process. Both the questions and answers were manually curated by native Spanish speakers, with references drawn from diverse sources such as books, news articles, websites, and other reliable materials.

    This question-answer prompt completion dataset contains different types of prompts, including instruction type, continuation type, and in-context learning (zero-shot, few-shot) type. The dataset also contains questions and answers with different types of rich text, including tables, code, JSON, etc., with proper markdown.

    Question Diversity:

To ensure diversity, this Q&A dataset includes questions with varying complexity levels, ranging from easy to medium and hard. Different types of questions, such as multiple-choice, direct, and true/false, are included. Additionally, questions are further classified into fact-based and opinion-based categories, creating a comprehensive variety. The QA dataset also contains questions with constraints and persona restrictions, which makes it even more useful for LLM training.

    Answer Formats:

    To accommodate varied learning experiences, the dataset incorporates different types of answer formats. These formats include single-word, short-phrase, single-sentence, and paragraph-length answers. The answers contain text strings, numerical values, and date and time formats. Such diversity strengthens the language model's ability to generate coherent and contextually appropriate answers.

    Data Format and Annotation Details:

    This fully labeled Spanish Open Ended Question Answer Dataset is available in JSON and CSV formats. It includes annotation details such as id, language, domain, question_length, prompt_type, question_category, question_type, complexity, answer_type, rich_text.
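
    As a purely illustrative sketch, a record carrying the annotation fields listed above might look like the following. Every field value here is invented for illustration, not drawn from the actual dataset:

```python
import json

# Hypothetical record using the annotation fields listed above.
# All values are invented; the real dataset defines its own content.
record = {
    "id": "es-qa-000123",
    "language": "Spanish",
    "domain": "science",
    "question_length": 12,
    "prompt_type": "instruction",
    "question_category": "fact-based",
    "question_type": "direct",
    "complexity": "easy",
    "answer_type": "single_sentence",
    "rich_text": False,
    "question": "¿Cuál es el planeta más grande del sistema solar?",
    "answer": "Júpiter es el planeta más grande del sistema solar.",
}

# Round-trip through JSON, since the dataset ships in JSON/CSV form.
serialized = json.dumps(record, ensure_ascii=False)
parsed = json.loads(serialized)
print(parsed["id"], parsed["complexity"])
```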

    Quality and Accuracy:

    The dataset upholds the highest standards of quality and accuracy. Each question undergoes careful validation, and the corresponding answers are thoroughly verified. To prioritize inclusivity, the dataset incorporates questions and answers representing diverse perspectives and writing styles, ensuring it remains unbiased and avoids perpetuating discrimination.

    Both the questions and answers in Spanish are grammatically accurate, without any spelling or grammatical errors. No copyrighted, toxic, or harmful content was used while building this dataset.

    Continuous Updates and Customization:

    The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Continuous efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to collect custom question-answer data tailored to specific needs, providing flexibility and customization options.

    License:

    The dataset, created by FutureBeeAI, is now ready for commercial use. Researchers, data scientists, and developers can utilize this fully labeled and ready-to-deploy Spanish Open Ended Question Answer Dataset to enhance the language understanding capabilities of their generative AI models, improve response generation, and explore new approaches to NLP question-answering tasks.

  12. Autism Spectrum Disorder and Asperger Syndrome Question Answering Dataset...

    • figshare.com
    bin
    Updated Sep 13, 2023
    Cite
    Victoria Firsanova (2023). Autism Spectrum Disorder and Asperger Syndrome Question Answering Dataset 1.0 [Dataset]. http://doi.org/10.6084/m9.figshare.13295831.v19
    Explore at:
    Available download formats: bin
    Dataset updated
    Sep 13, 2023
    Dataset provided by
    figshare
    Authors
    Victoria Firsanova
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    RUS (translated)
    Last updated: 13/09/2023. The dataset is intended for developing Russian-language dialogue systems (chatbots, question-answering systems, etc.) about autism spectrum disorders. Text source: https://aspergers.ru. The project is carried out by a winner of the "Practices of Personal Philanthropy and Altruism" competition of the Vladimir Potanin Charitable Foundation. 75% of the data was collected via the Toloka platform.

    Dataset composition:
    1. original.json: the original version of the dataset
    2. multiple.json: a version with multiple answer options
    3. short.json: a version with shortened answers
    4. half_sized.json: a version containing 50% of the collected data
    5. no_impossible.json: a version containing only relevant questions
    6. age_dataset.tsv: a dataset for determining the user's age (can be used for model customization)

    ENG
    A question-answering dataset used for building an informational Russian-language chatbot for the inclusion of people with autism spectrum disorder, and Asperger syndrome in particular, based on data from the following website: https://aspergers.ru.

    Detailed dataset statistics:
    - Number of QA pairs: 4,138
    - Number of irrelevant questions: 352
    - Average question length: 53 symbols / 8 words
    - Average answer length: 141 symbols / 20 words
    - Average reading paragraph length: 453 symbols / 63 words
    - Max question length: 226 symbols / 32 words
    - Max answer length: 555 symbols / 85 words
    - Max reading paragraph length: 551 symbols / 94 words
    - Min question length: 9 symbols / 2 words
    - Min answer length: 5 symbols / 1 word
    - Min reading paragraph length: 144 symbols / 17 words

    The dataset has several versions:
    1. Original version
    2. Half-sized version (50% of the original data)
    3. No-impossible version (without irrelevant/impossible questions)
    4. Short version (with shortened answers)
    5. Multiple version (with several answers; all other versions contain only one answer to each question)

  13. Data from: Question Answering (QA) for bridge design

    • jstagedata.jst.go.jp
    • figshare.com
    txt
    Updated May 16, 2024
    Cite
    Riku Ogata; Junichi Okubo; Junichiro Fujii; Masazumi Amakata (2024). Question Answering (QA) for bridge design [Dataset]. http://doi.org/10.50915/data.jsceiii.25459144.v2
    Explore at:
    Available download formats: txt
    Dataset updated
    May 16, 2024
    Dataset provided by
    Japan Society of Civil Engineers
    Authors
    Riku Ogata; Junichi Okubo; Junichiro Fujii; Masazumi Amakata
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is a small dataset created to evaluate the performance of Question Answering (QA) for large-scale language models in the field of civil engineering. The dataset is targeted at the field of bridge design, and was created using the document (Survey on the common issues of bridge design (2018-2019) (Technical Note of NILIM No.1162)) used in bridge design projects in Japan. The dataset consists of 50 pairs of QAs, where each pair consists of a question asking about the content of a document and an answer extracted from the document associated with the question.

    Each column of the csv file shows the following data.

    Column 1: ID of the QA
    Column 2: Referenced page (page number of the full PDF)
    Column 3: Question
    Column 4: Answer
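
    A minimal sketch of reading such a four-column CSV with Python's standard csv module follows; the row contents below are invented for illustration, not taken from the dataset:

```python
import csv
import io

# Invented sample mirroring the four-column layout described above:
# QA id, referenced page, question, answer.
sample = io.StringIO(
    "1,12,What is the design span of the bridge?,The span is 45 m.\n"
    "2,30,Which load combination governs?,Combination D+L governs.\n"
)

qa_pairs = []
for qa_id, page, question, answer in csv.reader(sample):
    qa_pairs.append({"id": int(qa_id), "page": int(page),
                     "question": question, "answer": answer})

print(len(qa_pairs))  # number of sample rows parsed
```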

  14. Data from: quac

    • tensorflow.org
    • huggingface.co
    • +1more
    Updated Dec 20, 2022
    + more versions
    Cite
    (2022). quac [Dataset]. https://www.tensorflow.org/datasets/catalog/quac
    Explore at:
    Dataset updated
    Dec 20, 2022
    Description

    Question Answering in Context is a dataset for modeling, understanding, and participating in information seeking dialog. Data instances consist of an interactive dialog between two crowd workers: (1) a student who poses a sequence of freeform questions to learn as much as possible about a hidden Wikipedia text, and (2) a teacher who answers the questions by providing short excerpts (spans) from the text. QuAC introduces challenges not found in existing machine comprehension datasets: its questions are often more open-ended, unanswerable, or only meaningful within the dialog context.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('quac', split='train')
    for ex in ds.take(4):
      print(ex)
    

    See the guide for more information on tensorflow_datasets.

  15. Trojan Detection Software Challenge - nlp-question-answering-sep2021-holdout...

    • catalog.data.gov
    • data.nist.gov
    Updated Sep 30, 2023
    + more versions
    Cite
    National Institute of Standards and Technology (2023). Trojan Detection Software Challenge - nlp-question-answering-sep2021-holdout [Dataset]. https://catalog.data.gov/dataset/trojan-detection-software-challenge-round-8-holdout-dataset
    Explore at:
    Dataset updated
    Sep 30, 2023
    Dataset provided by
    National Institute of Standards and Technology (http://www.nist.gov/)
    Description

    Round 8 Holdout Dataset

    This is the training data used to construct and evaluate trojan detection software solutions. This data, generated at NIST, consists of natural language processing (NLP) AIs trained to perform extractive question answering on English text. A known percentage of these trained AI models have been poisoned with a known trigger which induces incorrect behavior. This data will be used to develop software solutions for detecting which trained AI models have been poisoned via embedded triggers. This dataset consists of 360 QA AI models using a small set of model architectures. Half (50%) of the models have been poisoned with an embedded trigger which causes misclassification of the input when the trigger is present.

  16. Dataset Question Answering for Admission of Higher Education Institution

    • data.mendeley.com
    Updated Sep 26, 2023
    Cite
    Emny Yossy (2023). Dataset Question Answering for Admission of Higher Education Institution [Dataset]. http://doi.org/10.17632/jc4df8srcb.2
    Explore at:
    Dataset updated
    Sep 26, 2023
    Authors
    Emny Yossy
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The data collection process began with web scraping of a selected higher education institution's website, gathering any data related to the admission topic of higher education institutions, during the period from July to September 2023. This resulted in a raw dataset primarily centered on admission-related content. Subsequently, careful data cleaning and organization procedures were applied to refine the dataset. The primary data, in its raw form before annotation into a question-and-answer format, was predominantly in the Indonesian language. A comprehensive annotation process was then conducted to enrich the dataset with specific admission-related information, transforming it into secondary data. Both primary and secondary data remained predominantly in Indonesian. To enhance data quality, filters were added to remove or exclude: 1) data not in the Indonesian language, 2) data unrelated to the admission topic, and 3) redundant entries. This curation culminated in a finalized dataset, now readily available for research and analysis in the domain of higher education admission.

  17. Piaf — The French-language dataset of Questions-Answers

    • data.europa.eu
    csv, json, plain text
    Updated Jun 19, 2022
    Cite
    Etalab (2022). Piaf — The French-language dataset of Questions-Answers [Dataset]. https://data.europa.eu/data/datasets/5e83c3ed38f46c1808801fbb?locale=en
    Explore at:
    Available download formats: csv (1014663), plain text (816), json (4744747), json (2834209)
    Dataset updated
    Jun 19, 2022
    Dataset authored and provided by
    Etalab
    Area covered
    France, French
    Description

    Piaf, build an open French-language dataset for AI

    The use of artificial intelligence in public action is often identified as an opportunity to query documentary texts and produce automatic question-answering tools for users. Querying the labour code in natural language, providing a conversational agent for a given service, developing efficient search engines, improving knowledge management: all of these require a body of quality training data in order to develop Q&A algorithms. The PIAF dataset is a public and open French-language training dataset for training such algorithms.

    Inspired by SQuAD, the well-known English question-answering dataset, we had the ambition to build a similar dataset that would be open to all. The protocol we followed is very similar to that of the first version of SQuAD (SQuAD v1.1). However, some changes had to be made to adapt to the characteristics of the French Wikipedia. Another big difference is that we did not employ micro-workers via crowd-sourcing platforms.

    After several months of annotation, we have a robust and free annotation platform, a sufficient amount of annotations and a well-founded and innovative community animation and collaborative participation approach within the French administration.

    PIAF: a shared tool of the IA Lab

    In March 2018, France launched its national strategy for artificial intelligence. Piloted within the Interdepartmental Digital Branch, this strategy has three components: research, the economy and public transformation.

    Given that the data policy is a major focus of the development of artificial intelligence, the Etalab mission is piloting the establishment of an interministerial “Lab IA”, whose mission is to accelerate the deployment of AI in administrations via 3 main activities:

    1. Build a core team to internalise skills and expertise around AI
    2. Supporting AI projects in administrations through calls for expressions of interest
    3. Co-build shared tools that can be used as openly as possible

    The PIAF project is one of the shared tools of the IA Lab.

    Descriptive of the data made available

    The dataset follows the format of SQuAD v1.1. PIAF v1.2 contains 9225 question-answer pairs. It is distributed as a JSON file; a text file illustrating the schema is included below. This file can be used to train and evaluate question-answering models, for example by following these instructions.
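
    Since PIAF follows the SQuAD v1.1 layout (data -> articles -> paragraphs -> context plus qas), iterating over its question-answer pairs can be sketched as below. The miniature `piaf_like` dict is an invented stand-in for the real file, not actual PIAF content:

```python
# Invented miniature example in the SQuAD v1.1 layout that PIAF follows.
piaf_like = {
    "version": "1.1",
    "data": [
        {
            "title": "Tour Eiffel",
            "paragraphs": [
                {
                    "context": "La tour Eiffel a été achevée en 1889.",
                    "qas": [
                        {
                            "id": "q1",
                            "question": "Quand la tour Eiffel a-t-elle été achevée ?",
                            # answer_start is the character offset into context
                            "answers": [{"text": "1889", "answer_start": 32}],
                        }
                    ],
                }
            ],
        }
    ],
}

# Flatten the nested structure into (question, answer) tuples.
pairs = [
    (qa["question"], qa["answers"][0]["text"])
    for article in piaf_like["data"]
    for paragraph in article["paragraphs"]
    for qa in paragraph["qas"]
]
print(len(pairs))
```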

    Thanks to the 500 contributors!

    We deeply thank our contributors who have made this project live on a voluntary basis to this day.

    Links

    Information on the protocol followed, the project news, the annotation platform and the related code are here:

  18. COVID-QA-question-answering-biencoder-data-45_45_10

    • huggingface.co
    Updated Oct 18, 2023
    + more versions
    Cite
    minh anh (2023). COVID-QA-question-answering-biencoder-data-45_45_10 [Dataset]. https://huggingface.co/datasets/minh21/COVID-QA-question-answering-biencoder-data-45_45_10
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 18, 2023
    Authors
    minh anh
    Description

    Dataset Card for "COVID-QA-question-answering-biencoder-data-45_45_10"

    More Information needed

  19. COVID-QA-Chunk-64-question-answering-biencoder-data-65_25_10

    • huggingface.co
    Updated Oct 17, 2023
    + more versions
    Cite
    minh anh (2023). COVID-QA-Chunk-64-question-answering-biencoder-data-65_25_10 [Dataset]. https://huggingface.co/datasets/minh21/COVID-QA-Chunk-64-question-answering-biencoder-data-65_25_10
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 17, 2023
    Authors
    minh anh
    Description

    Dataset Card for "COVID-QA-Chunk-64-question-answering-biencoder-data-65_25_10"

    More Information needed

  20. French Open Ended Question Answer Text Dataset

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Cite
    FutureBee AI (2022). French Open Ended Question Answer Text Dataset [Dataset]. https://www.futurebeeai.com/dataset/prompt-response-dataset/french-open-ended-question-answer-text-dataset
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    What’s Included

    The French Open-Ended Question Answering Dataset is a meticulously curated collection of comprehensive Question-Answer pairs. It serves as a valuable resource for training Large Language Models (LLMs) and Question-answering models in the French language, advancing the field of artificial intelligence.

    Dataset Content:

    This QA dataset comprises a diverse set of open-ended questions paired with corresponding answers in French. There is no context paragraph given to choose an answer from, and each question is answered without any predefined context content. The questions cover a broad range of topics, including science, history, technology, geography, literature, current affairs, and more.

    Each question is accompanied by an answer, providing valuable information and insights to enhance the language model training process. Both the questions and answers were manually curated by native French people, and references were taken from diverse sources like books, news articles, websites, and other reliable references.

    This question-answer prompt completion dataset contains different types of prompts, including instruction type, continuation type, and in-context learning (zero-shot, few-shot) type. The dataset also contains questions and answers with different types of rich text, including tables, code, JSON, etc., with proper markdown.

    Question Diversity:

    To ensure diversity, this Q&A dataset includes questions with varying complexity levels, ranging from easy to medium and hard. Different types of questions, such as multiple-choice, direct, and true/false, are included. Additionally, questions are further classified into fact-based and opinion-based categories, creating a comprehensive variety. The QA dataset also contains questions with constraints and persona restrictions, which makes it even more useful for LLM training.

    Answer Formats:

    To accommodate varied learning experiences, the dataset incorporates different types of answer formats. These formats include single-word, short-phrase, single-sentence, and paragraph-length answers. The answers contain text strings, numerical values, and date and time formats. Such diversity strengthens the language model's ability to generate coherent and contextually appropriate answers.

    Data Format and Annotation Details:

    This fully labeled French Open Ended Question Answer Dataset is available in JSON and CSV formats. It includes annotation details such as id, language, domain, question_length, prompt_type, question_category, question_type, complexity, answer_type, rich_text.

    Quality and Accuracy:

    The dataset upholds the highest standards of quality and accuracy. Each question undergoes careful validation, and the corresponding answers are thoroughly verified. To prioritize inclusivity, the dataset incorporates questions and answers representing diverse perspectives and writing styles, ensuring it remains unbiased and avoids perpetuating discrimination.

    Both the questions and answers in French are grammatically accurate, without any spelling or grammatical errors. No copyrighted, toxic, or harmful content was used while building this dataset.

    Continuous Updates and Customization:

    The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Continuous efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to collect custom question-answer data tailored to specific needs, providing flexibility and customization options.

    License:

    The dataset, created by FutureBeeAI, is now ready for commercial use. Researchers, data scientists, and developers can utilize this fully labeled and ready-to-deploy French Open Ended Question Answer Dataset to enhance the language understanding capabilities of their generative AI models, improve response generation, and explore new approaches to NLP question-answering tasks.

Cite
FutureBee AI (2022). Filipino Closed Ended Question Answer Text Dataset [Dataset]. https://www.futurebeeai.com/dataset/prompt-response-dataset/filipino-closed-ended-question-answer-text-dataset

Filipino Closed Ended Question Answer Text Dataset

Explore at:
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License

https://www.futurebeeai.com/policies/ai-data-license-agreement

Dataset funded by
FutureBeeAI
Description

What’s Included

The Filipino Closed-Ended Question Answering Dataset is a meticulously curated collection of 5000 comprehensive Question-Answer pairs. It serves as a valuable resource for training Large Language Models (LLMs) and question-answering models in the Filipino language, advancing the field of artificial intelligence.

Dataset Content:

This closed-ended QA dataset comprises a diverse set of context paragraphs and questions paired with corresponding answers in Filipino. There is a context paragraph given for each question to get the answer from. The questions cover a broad range of topics, including science, history, technology, geography, literature, current affairs, and more.

Each question is accompanied by an answer, providing valuable information and insights to enhance the language model training process. Both the questions and answers were manually curated by native Filipino people, and references were taken from diverse sources like books, news articles, websites, web forums, and other reliable references.

This question-answer prompt completion dataset contains different types of prompts, including instruction type, continuation type, and in-context learning (zero-shot, few-shot) type. The dataset also contains questions and answers with different types of rich text, including tables, code, JSON, etc., with proper markdown.

Question Diversity:

To ensure diversity, this Q&A dataset includes questions with varying complexity levels, ranging from easy to medium and hard. Different types of questions, such as multiple-choice, direct, and true/false, are included. The QA dataset also contains questions with constraints, which makes it even more useful for LLM training.

Answer Formats:

To accommodate varied learning experiences, the dataset incorporates different types of answer formats. These formats include single-word, short-phrase, single-sentence, and paragraph-length answers. The answers contain text strings, numerical values, and date and time formats. Such diversity strengthens the language model's ability to generate coherent and contextually appropriate answers.

Data Format and Annotation Details:

This fully labeled Filipino Closed-Ended Question Answer Dataset is available in JSON and CSV formats. It includes annotation details such as a unique id, context paragraph, context reference link, question, question type, question complexity, question category, domain, prompt type, answer, answer type, and rich text presence.
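
As a sketch, the CSV form of such a record could be read with Python's `csv.DictReader`. The header names below are paraphrased from the annotation fields listed above, and the row values are invented for illustration:

```python
import csv
import io

# Invented one-row CSV using paraphrased versions of the fields above.
sample = io.StringIO(
    "id,question,question_type,complexity,domain,answer,answer_type,rich_text\n"
    "fil-0001,Ano ang kabisera ng Pilipinas?,direct,easy,geography,"
    "Maynila,single_word,false\n"
)

# DictReader maps each row to a dict keyed by the header line.
rows = list(csv.DictReader(sample))
print(rows[0]["answer"])
```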

Quality and Accuracy:

The dataset upholds the highest standards of quality and accuracy. Each question undergoes careful validation, and the corresponding answers are thoroughly verified. To prioritize inclusivity, the dataset incorporates questions and answers representing diverse perspectives and writing styles, ensuring it remains unbiased and avoids perpetuating discrimination.

The Filipino version is grammatically accurate, without any spelling or grammatical errors. No toxic or harmful content was used while building this dataset.

Continuous Updates and Customization:

The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Continuous efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to collect custom question-answer data tailored to specific needs, providing flexibility and customization options.

License:

The dataset, created by FutureBeeAI, is now ready for commercial use. Researchers, data scientists, and developers can utilize this fully labeled and ready-to-deploy Filipino Closed-Ended Question Answer Dataset to enhance the language understanding capabilities of their generative AI models, improve response generation, and explore new approaches to NLP question-answering tasks.
