https://www.futurebeeai.com/policies/ai-data-license-agreement
The Filipino Closed-Ended Question Answering Dataset is a meticulously curated collection of 5000 comprehensive Question-Answer pairs. It serves as a valuable resource for training Large Language Models (LLMs) and question-answering models in the Filipino language, advancing the field of artificial intelligence.
Dataset Content:This closed-ended QA dataset comprises a diverse set of context paragraphs and questions paired with corresponding answers in Filipino. Each question is paired with a context paragraph from which the answer can be derived. The questions cover a broad range of topics, including science, history, technology, geography, literature, current affairs, and more.
Each question is accompanied by an answer, providing valuable information and insights to enhance the language model training process. Both the questions and answers were manually curated by native Filipino speakers, and references were taken from diverse sources such as books, news articles, websites, web forums, and other reliable references.
This question-answer prompt completion dataset contains different types of prompts, including instruction type, continuation type, and in-context learning (zero-shot, few-shot) type. The dataset also contains questions and answers with different types of rich text, including tables, code, JSON, etc., with proper markdown.
Question Diversity:To ensure diversity, this Q&A dataset includes questions with varying complexity levels, ranging from easy to medium and hard. Different types of questions, such as multiple-choice, direct, and true/false, are included. The QA dataset also contains questions with constraints, which makes it even more useful for LLM training.
Answer Formats:To accommodate varied learning experiences, the dataset incorporates different answer formats, including single-word, short-phrase, single-sentence, and paragraph-length answers. The answers include text strings, numerical values, and date and time formats. Such diversity strengthens the language model's ability to generate coherent and contextually appropriate answers.
Data Format and Annotation Details:This fully labeled Filipino Closed-Ended Question Answer Dataset is available in JSON and CSV formats. It includes annotation details such as a unique id, context paragraph, context reference link, question, question type, question complexity, question category, domain, prompt type, answer, answer type, and rich text presence.
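A record in the JSON delivery might be inspected as sketched below. This is a minimal, hypothetical example: the key names mirror the annotation details listed above, but the exact field names in the delivered files may differ.

```python
import json

# Hypothetical records mirroring the annotation fields listed above;
# key names are assumptions, not the dataset's confirmed schema.
raw = json.dumps([
    {"id": "qa-0001", "question_type": "direct", "question_complexity": "easy",
     "domain": "science", "prompt_type": "instruction", "rich_text": False},
    {"id": "qa-0002", "question_type": "multiple-choice", "question_complexity": "hard",
     "domain": "history", "prompt_type": "continuation", "rich_text": True},
])

records = json.loads(raw)
# Filter by annotated complexity, e.g. to build a hard-question training subset
hard_ids = [r["id"] for r in records if r["question_complexity"] == "hard"]
print(hard_ids)
```

Filtering on annotation fields like this is the typical first step when carving out curriculum or evaluation subsets from a labeled QA dataset.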
Quality and Accuracy:The dataset upholds the highest standards of quality and accuracy. Each question undergoes careful validation, and the corresponding answers are thoroughly verified. To prioritize inclusivity, the dataset incorporates questions and answers representing diverse perspectives and writing styles, ensuring it remains unbiased and avoids perpetuating discrimination.
The Filipino text is grammatically accurate, without any spelling or grammatical errors. No toxic or harmful content was used in building this dataset.
Continuous Updates and Customization:The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Continuous efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to collect custom question-answer data tailored to specific needs, providing flexibility and customization options.
License:The dataset, created by FutureBeeAI, is now ready for commercial use. Researchers, data scientists, and developers can utilize this fully labeled and ready-to-deploy Filipino Closed-Ended Question Answer Dataset to enhance the language understanding capabilities of their generative AI models, improve response generation, and explore new approaches to NLP question-answering tasks.
https://www.futurebeeai.com/policies/ai-data-license-agreement
The Hindi Open-Ended Question Answering Dataset is a meticulously curated collection of comprehensive Question-Answer pairs. It serves as a valuable resource for training Large Language Models (LLMs) and Question-answering models in the Hindi language, advancing the field of artificial intelligence.
Dataset Content:This QA dataset comprises a diverse set of open-ended questions paired with corresponding answers in Hindi. There is no context paragraph given to choose an answer from, and each question is answered without any predefined context content. The questions cover a broad range of topics, including science, history, technology, geography, literature, current affairs, and more.
Each question is accompanied by an answer, providing valuable information and insights to enhance the language model training process. Both the questions and answers were manually curated by native Hindi speakers, and references were taken from diverse sources such as books, news articles, websites, and other reliable references.
This question-answer prompt completion dataset contains different types of prompts, including instruction type, continuation type, and in-context learning (zero-shot, few-shot) type. The dataset also contains questions and answers with different types of rich text, including tables, code, JSON, etc., with proper markdown.
Question Diversity:To ensure diversity, this Q&A dataset includes questions with varying complexity levels, ranging from easy to medium and hard. Different types of questions, such as multiple-choice, direct, and true/false, are included. Additionally, questions are further classified into fact-based and opinion-based categories, creating a comprehensive variety. The QA dataset also contains questions with constraints and persona restrictions, which makes it even more useful for LLM training.
Answer Formats:To accommodate varied learning experiences, the dataset incorporates different answer formats, including single-word, short-phrase, single-sentence, and paragraph-length answers. The answers include text strings, numerical values, and date and time formats. Such diversity strengthens the language model's ability to generate coherent and contextually appropriate answers.
Data Format and Annotation Details:This fully labeled Hindi Open Ended Question Answer Dataset is available in JSON and CSV formats. It includes annotation details such as id, language, domain, question_length, prompt_type, question_category, question_type, complexity, answer_type, rich_text.
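The CSV delivery can be read with the standard library, as in this sketch. The column names follow the annotation fields listed above, but the values are placeholders and the delivered file's exact headers and delimiter should be verified on receipt.

```python
import csv
import io

# Illustrative CSV excerpt using the annotation column names listed above;
# the values are placeholders, not real dataset rows.
raw = """id,language,domain,question_length,prompt_type,question_category,question_type,complexity,answer_type,rich_text
q1,hi,science,8,instruction,fact-based,direct,easy,short phrase,no
q2,hi,history,12,continuation,opinion-based,true/false,hard,single-word,no
"""
rows = list(csv.DictReader(io.StringIO(raw)))
# Select hard questions, e.g. for a difficulty-stratified evaluation split
hard_ids = [r["id"] for r in rows if r["complexity"] == "hard"]
print(hard_ids)
```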
Quality and Accuracy:The dataset upholds the highest standards of quality and accuracy. Each question undergoes careful validation, and the corresponding answers are thoroughly verified. To prioritize inclusivity, the dataset incorporates questions and answers representing diverse perspectives and writing styles, ensuring it remains unbiased and avoids perpetuating discrimination.
Both the questions and answers in Hindi are grammatically accurate, without any spelling or grammatical errors. No copyrighted, toxic, or harmful content was used in building this dataset.
Continuous Updates and Customization:The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Continuous efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to collect custom question-answer data tailored to specific needs, providing flexibility and customization options.
License:The dataset, created by FutureBeeAI, is now ready for commercial use. Researchers, data scientists, and developers can utilize this fully labeled and ready-to-deploy Hindi Open Ended Question Answer Dataset to enhance the language understanding capabilities of their generative AI models, improve response generation, and explore new approaches to NLP question-answering tasks.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
6 different fine-tuned Transformer-based models that solve the downstream task of extractive question answering in the Slovenian language. The fine-tuned models included are: bert-base-cased-squad2-SLO, bert-base-multilingual-cased-squad2-SLO, electra-base-squad2-SLO, roberta-base-squad2-SLO, sloberta-squad2-SLO and xlm-roberta-base-squad2-SLO. The models were trained and evaluated using the Slovene translation of the SQuAD2.0 dataset (https://www.clarin.si/repository/xmlui/handle/11356/1756).
The models achieve the following metric values:

- sloberta-squad2-SLO: EM=67.1, F1=73.56
- xlm-roberta-base-squad2-SLO: EM=62.52, F1=69.51
- bert-base-multilingual-cased-squad2-SLO: EM=61.37, F1=68.1
- roberta-base-squad2-SLO: EM=58.23, F1=64.62
- bert-base-cased-squad2-SLO: EM=55.12, F1=60.52
- electra-base-squad2-SLO: EM=53.69, F1=60.85
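When picking one of these checkpoints for deployment, it can help to hold the reported scores in a structure and rank them programmatically. The scores below are the EM/F1 values reported above; the ranking logic is a plain sketch, not part of the released models.

```python
# Reported evaluation scores for the six fine-tuned Slovenian QA models.
scores = {
    "sloberta-squad2-SLO": {"EM": 67.10, "F1": 73.56},
    "xlm-roberta-base-squad2-SLO": {"EM": 62.52, "F1": 69.51},
    "bert-base-multilingual-cased-squad2-SLO": {"EM": 61.37, "F1": 68.10},
    "roberta-base-squad2-SLO": {"EM": 58.23, "F1": 64.62},
    "bert-base-cased-squad2-SLO": {"EM": 55.12, "F1": 60.52},
    "electra-base-squad2-SLO": {"EM": 53.69, "F1": 60.85},
}

# Rank models by F1 (descending); sloberta leads on both EM and F1.
ranked = sorted(scores, key=lambda m: scores[m]["F1"], reverse=True)
print(ranked[0])
```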
https://choosealicense.com/licenses/gpl-3.0/https://choosealicense.com/licenses/gpl-3.0/
Quora Question Answer Dataset (Quora-QuAD) contains 56,402 question-answer pairs scraped from Quora.
Usage:
For instructions on fine-tuning a model (Flan-T5) with this dataset, please check out the article: https://www.toughdata.net/blog/post/finetune-flan-t5-question-answer-quora-dataset
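Because Flan-T5 is a text-to-text model, each Quora question-answer pair is typically reshaped into an input/target sequence pair before fine-tuning. The sketch below illustrates that reshaping under assumed field names and prompt wording; it is not the linked article's exact recipe.

```python
def to_seq2seq_example(question: str, answer: str) -> dict:
    # Flan-T5 consumes a text prompt and is trained to emit the target text.
    # The prompt phrasing and the field names here are illustrative choices.
    return {
        "input_text": f"Please answer this question: {question}",
        "target_text": answer,
    }

ex = to_seq2seq_example("Why is the sky blue?", "Because of Rayleigh scattering.")
print(ex["input_text"])
```

A list of such dicts can then be tokenized and passed to a standard sequence-to-sequence training loop.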
Question Answering in Context is a large-scale dataset that consists of around 14K crowdsourced Question Answering dialogs with 98K question-answer pairs in total. Data instances consist of an interactive dialog between two crowd workers: (1) a student who poses a sequence of freeform questions to learn as much as possible about a hidden Wikipedia text, and (2) a teacher who answers the questions by providing short excerpts (spans) from the text.
HotpotQA is a question answering dataset featuring natural, multi-hop questions, with strong supervision for supporting facts to enable more explainable question answering systems built based on Wikipedia.
https://www.futurebeeai.com/policies/ai-data-license-agreement
The English Closed-Ended Question Answering Dataset is a meticulously curated collection of 5000 comprehensive Question-Answer pairs. It serves as a valuable resource for training Large Language Models (LLMs) and question-answering models in the English language, advancing the field of artificial intelligence.
Dataset Content:This closed-ended QA dataset comprises a diverse set of context paragraphs and questions paired with corresponding answers in English. Each question is paired with a context paragraph from which the answer can be derived. The questions cover a broad range of topics, including science, history, technology, geography, literature, current affairs, and more.
Each question is accompanied by an answer, providing valuable information and insights to enhance the language model training process. Both the questions and answers were manually curated by native English speakers, and references were taken from diverse sources such as books, news articles, websites, web forums, and other reliable references.
This question-answer prompt completion dataset contains different types of prompts, including instruction type, continuation type, and in-context learning (zero-shot, few-shot) type. The dataset also contains questions and answers with different types of rich text, including tables, code, JSON, etc., with proper markdown.
Question Diversity:To ensure diversity, this Q&A dataset includes questions with varying complexity levels, ranging from easy to medium and hard. Different types of questions, such as multiple-choice, direct, and true/false, are included. The QA dataset also contains questions with constraints, which makes it even more useful for LLM training.
Answer Formats:To accommodate varied learning experiences, the dataset incorporates different answer formats, including single-word, short-phrase, single-sentence, and paragraph-length answers. The answers include text strings, numerical values, and date and time formats. Such diversity strengthens the language model's ability to generate coherent and contextually appropriate answers.
Data Format and Annotation Details:This fully labeled English Closed-Ended Question Answer Dataset is available in JSON and CSV formats. It includes annotation details such as a unique id, context paragraph, context reference link, question, question type, question complexity, question category, domain, prompt type, answer, answer type, and rich text presence.
Quality and Accuracy:The dataset upholds the highest standards of quality and accuracy. Each question undergoes careful validation, and the corresponding answers are thoroughly verified. To prioritize inclusivity, the dataset incorporates questions and answers representing diverse perspectives and writing styles, ensuring it remains unbiased and avoids perpetuating discrimination.
The English text is grammatically accurate, without any spelling or grammatical errors. No toxic or harmful content was used in building this dataset.
Continuous Updates and Customization:The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Continuous efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to collect custom question-answer data tailored to specific needs, providing flexibility and customization options.
License:The dataset, created by FutureBeeAI, is now ready for commercial use. Researchers, data scientists, and developers can utilize this fully labeled and ready-to-deploy English Closed-Ended Question Answer Dataset to enhance the language understanding capabilities of their generative AI models, improve response generation, and explore new approaches to NLP question-answering tasks.
SPIQA Dataset Card Dataset Details Dataset Name: SPIQA (Scientific Paper Image Question Answering)
Paper: SPIQA: A Dataset for Multimodal Question Answering on Scientific Papers
Github: SPIQA eval and metrics code repo
Dataset Summary: SPIQA is a large-scale and challenging QA dataset focused on figures, tables, and text paragraphs from scientific research papers in various computer science domains. The figures cover a wide variety of plots, charts, schematic diagrams, result visualizations, etc. The dataset is the result of a meticulous curation process, leveraging the breadth of expertise and the ability of multimodal large language models (MLLMs) to understand figures. We employ both automatic and manual curation to ensure the highest level of quality and reliability. SPIQA consists of more than 270K questions divided into training, validation, and three different evaluation splits. The purpose of the dataset is to evaluate the ability of large multimodal models to comprehend complex figures and tables together with the textual paragraphs of scientific papers.
This Data Card describes the structure of the SPIQA dataset, divided into training, validation, and three different evaluation splits. The test-B and test-C splits are filtered from the QASA and QASPER datasets and contain human-written QAs. We collect all scientific papers published at top computer science conferences between 2018 and 2023 from arXiv.
If you have any comments or questions, reach out to Shraman Pramanick or Subhashini Venugopalan.
Supported Tasks: - Direct QA with figures and tables - Direct QA with full paper - CoT QA (retrieval of helpful figures, tables; then answering)
Language: English
Release Date: SPIQA is released in June 2024.
Data Splits The statistics of the different splits of SPIQA are shown below.
Split | Papers | Questions | Schematics | Plots & Charts | Visualizations | Other figures | Tables |
---|---|---|---|---|---|---|---|
Train | 25,459 | 262,524 | 44,008 | 70,041 | 27,297 | 6,450 | 114,728 |
Val | 200 | 2,085 | 360 | 582 | 173 | 55 | 915 |
test-A | 118 | 666 | 154 | 301 | 131 | 95 | 434 |
test-B | 65 | 228 | 147 | 156 | 133 | 17 | 341 |
test-C | 314 | 493 | 415 | 404 | 26 | 66 | 1,332 |
Dataset Structure: The contents of this dataset card are structured as follows:
```
SPIQA
├── SPIQA_train_val_test-A_extracted_paragraphs.zip
│   └── Extracted textual paragraphs from the papers in SPIQA train, val and test-A splits
├── SPIQA_train_val_test-A_raw_tex.zip
│   └── The raw tex files from the papers in SPIQA train, val and test-A splits. These files are
│       not required to reproduce our results; we open-source them for future research.
├── train_val
│   ├── SPIQA_train_val_Images.zip
│   │   └── Full resolution figures and tables from the papers in SPIQA train, val splits
│   ├── SPIQA_train.json
│   │   └── SPIQA train metadata
│   └── SPIQA_val.json
│       └── SPIQA val metadata
├── test-A
│   ├── SPIQA_testA_Images.zip
│   │   └── Full resolution figures and tables from the papers in SPIQA test-A split
│   ├── SPIQA_testA_Images_224px.zip
│   │   └── 224px figures and tables from the papers in SPIQA test-A split
│   └── SPIQA_testA.json
│       └── SPIQA test-A metadata
├── test-B
│   ├── SPIQA_testB_Images.zip
│   │   └── Full resolution figures and tables from the papers in SPIQA test-B split
│   ├── SPIQA_testB_Images_224px.zip
│   │   └── 224px figures and tables from the papers in SPIQA test-B split
│   └── SPIQA_testB.json
│       └── SPIQA test-B metadata
└── test-C
    ├── SPIQA_testC_Images.zip
    │   └── Full resolution figures and tables from the papers in SPIQA test-C split
    ├── SPIQA_testC_Images_224px.zip
    │   └── 224px figures and tables from the papers in SPIQA test-C split
    └── SPIQA_testC.json
        └── SPIQA test-C metadata
```
The testA_data_viewer.json file is provided only for previewing a portion of the data in the Hugging Face dataset viewer, to give a quick sense of the metadata.
Metadata Structure The metadata for every split is provided as a dictionary whose keys are the arXiv IDs of the papers. The primary contents of each dictionary item are:
- arXiv ID
- Semantic Scholar ID (for test-B)
- Figures and tables: name of the PNG file, caption, content type (figure or table), figure type (schematic, plot, photo (visualization), others)
- QAs: question, answer and rationale; reference figures and tables; textual evidence (for test-B and test-C)
- Abstract and full paper text (for test-B and test-C; full papers for the other splits are provided as a zip)
Dataset Use and Starter Snippets
Downloading the Dataset to Local: We recommend downloading the metadata and images to your local machine.
Download the whole dataset (all splits):

```python
from huggingface_hub import snapshot_download

# Pass the local directory path via local_dir
snapshot_download(repo_id="google/spiqa", repo_type="dataset", local_dir='.')
```
Download a specific file:

```python
from huggingface_hub import hf_hub_download

# Pass the local directory path via local_dir
hf_hub_download(repo_id="google/spiqa", filename="test-A/SPIQA_testA.json",
                repo_type="dataset", local_dir='.')
```
Questions and answers from a specific paper in test-A:

```python
import json

testA_metadata = json.load(open('test-A/SPIQA_testA.json', 'r'))
paper_id = '1702.03584v3'
print(testA_metadata[paper_id]['qa'])
```
Questions and answers from a specific paper in test-B:

```python
import json

testB_metadata = json.load(open('test-B/SPIQA_testB.json', 'r'))
paper_id = '1707.07012'
print(testB_metadata[paper_id]['question'])     # questions
print(testB_metadata[paper_id]['composition'])  # answers
```
Questions and answers from a specific paper in test-C:

```python
import json

testC_metadata = json.load(open('test-C/SPIQA_testC.json', 'r'))
paper_id = '1808.08780'
print(testC_metadata[paper_id]['question'])  # questions
print(testC_metadata[paper_id]['answer'])    # answers
```
Annotation Overview Questions and answers for the SPIQA train, validation, and test-A sets were machine-generated. Additionally, the SPIQA test-A set was manually filtered and curated. Questions in the SPIQA test-B set are collected from the QASA dataset, while those in the SPIQA test-C set are from the QASPER dataset. Answering the questions in all splits requires holistic understanding of figures and tables with related text from the scientific papers.
Personal and Sensitive Information We are not aware of any personal or sensitive information in the dataset.
Licensing Information CC BY 4.0
Citation Information

```bibtex
@article{pramanick2024spiqa,
  title={SPIQA: A Dataset for Multimodal Question Answering on Scientific Papers},
  author={Pramanick, Shraman and Chellappa, Rama and Venugopalan, Subhashini},
  journal={NeurIPS},
  year={2024}
}
```
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset is curated to support research and development in natural language processing (NLP), particularly in the area of question answering systems. Focused on the domain of Data Science and Analytics, it contains a diverse collection of question-answer pairs designed to reflect real-world inquiries about key concepts, tools, techniques, and trends within the field.
Each entry includes:
A natural language question related to data science topics such as machine learning, data wrangling, statistical analysis, data visualization, big data technologies, and analytics methods.
A corresponding answer, verified for accuracy and clarity, suitable for use in both retrieval-based and generative QA models.
Optional metadata such as topic category, difficulty level, and source context, where applicable.
Use Cases:
Training and evaluating QA models and chatbots focused on technical domains.
Developing educational tools and intelligent tutoring systems for data science learners.
Benchmarking NLP systems for domain-specific understanding and reasoning.
Target Audience:
AI/ML researchers
Data science educators and students
NLP developers working on domain-specific applications
This dataset aims to bridge the gap between technical knowledge and natural language understanding by providing high-quality QA pairs tailored to one of today’s most in-demand fields.
Original Data Source: Question Answering Dataset
CNCF QA Dataset for LLM Tuning
Description
This dataset, named cncf-qa-dataset-for-llm-tuning, is designed for fine-tuning large language models (LLMs) and is formatted in a question-answer (QA) style. The data is sourced from PDF and markdown (MD) files extracted from various project repositories within the CNCF (Cloud Native Computing Foundation) landscape. These files were processed and converted into a QA format to be fed to the LLM. The dataset includes the… See the full description on the dataset page: https://huggingface.co/datasets/Kubermatic/cncf-question-and-answer-dataset-for-llm-training.
https://www.futurebeeai.com/policies/ai-data-license-agreement
The Spanish Open-Ended Question Answering Dataset is a meticulously curated collection of comprehensive Question-Answer pairs. It serves as a valuable resource for training Large Language Models (LLMs) and Question-answering models in the Spanish language, advancing the field of artificial intelligence.
Dataset Content:This QA dataset comprises a diverse set of open-ended questions paired with corresponding answers in Spanish. There is no context paragraph given to choose an answer from, and each question is answered without any predefined context content. The questions cover a broad range of topics, including science, history, technology, geography, literature, current affairs, and more.
Each question is accompanied by an answer, providing valuable information and insights to enhance the language model training process. Both the questions and answers were manually curated by native Spanish speakers, and references were taken from diverse sources such as books, news articles, websites, and other reliable references.
This question-answer prompt completion dataset contains different types of prompts, including instruction type, continuation type, and in-context learning (zero-shot, few-shot) type. The dataset also contains questions and answers with different types of rich text, including tables, code, JSON, etc., with proper markdown.
Question Diversity:To ensure diversity, this Q&A dataset includes questions with varying complexity levels, ranging from easy to medium and hard. Different types of questions, such as multiple-choice, direct, and true/false, are included. Additionally, questions are further classified into fact-based and opinion-based categories, creating a comprehensive variety. The QA dataset also contains questions with constraints and persona restrictions, which makes it even more useful for LLM training.
Answer Formats:To accommodate varied learning experiences, the dataset incorporates different answer formats, including single-word, short-phrase, single-sentence, and paragraph-length answers. The answers include text strings, numerical values, and date and time formats. Such diversity strengthens the language model's ability to generate coherent and contextually appropriate answers.
Data Format and Annotation Details:This fully labeled Spanish Open Ended Question Answer Dataset is available in JSON and CSV formats. It includes annotation details such as id, language, domain, question_length, prompt_type, question_category, question_type, complexity, answer_type, rich_text.
Quality and Accuracy:The dataset upholds the highest standards of quality and accuracy. Each question undergoes careful validation, and the corresponding answers are thoroughly verified. To prioritize inclusivity, the dataset incorporates questions and answers representing diverse perspectives and writing styles, ensuring it remains unbiased and avoids perpetuating discrimination.
Both the questions and answers in Spanish are grammatically accurate, without any spelling or grammatical errors. No copyrighted, toxic, or harmful content was used in building this dataset.
Continuous Updates and Customization:The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Continuous efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to collect custom question-answer data tailored to specific needs, providing flexibility and customization options.
License:The dataset, created by FutureBeeAI, is now ready for commercial use. Researchers, data scientists, and developers can utilize this fully labeled and ready-to-deploy Spanish Open Ended Question Answer Dataset to enhance the language understanding capabilities of their generative AI models, improve response generation, and explore new approaches to NLP question-answering tasks.
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Last updated: 13/09/2023. This dataset is intended for developing Russian-language dialogue systems (chatbots, question-answering systems, etc.) about autism spectrum disorders, and in particular for building an informational Russian-language chatbot for the inclusion of people with autism spectrum disorder and Asperger syndrome. Text source: https://aspergers.ru. The project is carried out by a winner of the "Practices of Personal Philanthropy and Altruism" competition of the Vladimir Potanin Charitable Foundation. 75% of the data was collected via the Toloka platform.

Dataset composition:
1. original.json: the original version of the dataset
2. multiple.json: a version with multiple answer options (all other versions contain only one answer per question)
3. short.json: a version with shortened answers
4. half_sized.json: a version containing 50% of the collected data
5. no_impossible.json: a version containing only relevant questions
6. age_dataset.tsv: a dataset for determining the user's age (can be used for model customization)

The detailed dataset statistics:

Parameter | Value |
---|---|
Number of QA pairs | 4,138 |
Number of irrelevant questions | 352 |
Average question length | 53 symbols / 8 words |
Average answer length | 141 symbols / 20 words |
Average reading paragraph length | 453 symbols / 63 words |
Max question length | 226 symbols / 32 words |
Max answer length | 555 symbols / 85 words |
Max reading paragraph length | 551 symbols / 94 words |
Min question length | 9 symbols / 2 words |
Min answer length | 5 symbols / 1 word |
Min reading paragraph length | 144 symbols / 17 words |
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a small dataset created to evaluate the performance of Question Answering (QA) for large-scale language models in the field of civil engineering. The dataset is targeted at the field of bridge design, and was created using the document (Survey on the common issues of bridge design (2018-2019) (Technical Note of NILIM No.1162)) used in bridge design projects in Japan. The dataset consists of 50 pairs of QAs, where each pair consists of a question asking about the content of a document and an answer extracted from the document associated with the question.
Each column of the csv file shows the following data.
Column 1: ID of the QA
Column 2: Referenced page (page number of the full PDF)
Column 3: Question
Column 4: Answer
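A minimal sketch of reading the four-column layout with the standard library is shown below. The rows here are placeholders (the actual file's questions and answers are drawn from the Japanese bridge-design document), so only the column structure should be taken as given.

```python
import csv
import io

# Placeholder rows following the four-column layout described above:
# ID, referenced page, question, answer.
raw = ("1,12,What is the design span length?,45 m\n"
       "2,30,Which section of the note applies?,Section 3.2\n")

qa_by_id = {}
for row in csv.reader(io.StringIO(raw)):
    qa_id, page, question, answer = row
    qa_by_id[qa_id] = {"page": int(page), "question": question, "answer": answer}
print(qa_by_id["1"]["page"])
```

Keeping the referenced page number alongside each pair makes it easy to check an answer against the source PDF.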
Question Answering in Context is a dataset for modeling, understanding, and participating in information seeking dialog. Data instances consist of an interactive dialog between two crowd workers: (1) a student who poses a sequence of freeform questions to learn as much as possible about a hidden Wikipedia text, and (2) a teacher who answers the questions by providing short excerpts (spans) from the text. QuAC introduces challenges not found in existing machine comprehension datasets: its questions are often more open-ended, unanswerable, or only meaningful within the dialog context.
To use this dataset:
```python
import tensorflow_datasets as tfds

ds = tfds.load('quac', split='train')
for ex in ds.take(4):
  print(ex)
```
See the guide for more information on tensorflow_datasets.
Round 8 Holdout Dataset: This is the training data used to construct and evaluate trojan detection software solutions. This data, generated at NIST, consists of natural language processing (NLP) AIs trained to perform extractive question answering on English text. A known percentage of these trained AI models have been poisoned with a known trigger which induces incorrect behavior. This data will be used to develop software solutions for detecting which trained AI models have been poisoned via embedded triggers. This dataset consists of 360 QA AI models using a small set of model architectures. Half (50%) of the models have been poisoned with an embedded trigger which causes misclassification of the input when the trigger is present.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The data collection process began with web scraping of a selected higher education institution's website between July and September 2023, gathering any data related to the admission topic of higher education institutions. This resulted in a raw dataset primarily centered on admission-related content. Meticulous data cleaning and organization procedures were then applied to refine the dataset. The primary data, in its raw form before annotation into a question-and-answer format, was predominantly in the Indonesian language. A comprehensive annotation process followed, enriching the dataset with specific admission-related information and transforming it into secondary data. Both primary and secondary data remained predominantly in Indonesian. To enhance data quality, filters were added to remove or exclude: 1) data not in the Indonesian language, 2) data unrelated to the admission topic, and 3) redundant entries. This careful curation culminated in a finalized dataset, now readily available for research and analysis in the domain of higher education admission.
The use of artificial intelligence in public services is often identified as an opportunity to query documentary texts and build automatic question-answering tools for users. Querying the labour code in natural language, providing a conversational agent for a given service, developing efficient search engines, improving knowledge management: all of these require a body of quality training data in order to develop question-answering algorithms. The PIAF dataset is a public and open French-language training dataset for training such algorithms.
Inspired by SQuAD, the well-known English question-answering dataset, we set out to build a similar dataset that would be open to all. The protocol we followed closely mirrors that of the first version of SQuAD (SQuAD v1.1). However, some changes had to be made to adapt to the characteristics of the French Wikipedia. Another major difference is that we did not employ micro-workers via crowd-sourcing platforms.
After several months of annotation, we have a robust and free annotation platform, a sufficient volume of annotations, and a well-founded, innovative approach to community engagement and collaborative participation within the French administration.
In March 2018, France launched its national strategy for artificial intelligence. Piloted within the Interdepartmental Digital Branch, this strategy has three components: research, the economy and public transformation.
Given that the data policy is a major focus of the development of artificial intelligence, the Etalab mission is piloting the establishment of an interministerial “Lab IA”, whose mission is to accelerate the deployment of AI in administrations via 3 main activities:
The PIAF project is one of the shared tools of the IA Lab.
The dataset follows the SQuAD v1.1 format. PIAF v1.2 contains 9,225 Q&A pairs in a single JSON file. A text file illustrating the schema is included below. This file can be used to train and evaluate question-answering models, for example by following these instructions.
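As an illustration, the nested SQuAD v1.1 schema that PIAF follows can be sketched with a minimal, hand-written example. The article title, context, and IDs below are invented for illustration; the real file contains thousands of entries.

```python
import json

# Minimal hand-written example of the SQuAD v1.1 JSON schema.
# Values are invented; only the structure matches the real file.
context = "La tour Eiffel a été achevée en 1889 pour l'Exposition universelle de Paris."

piaf_like = {
    "version": "1.1",
    "data": [
        {
            "title": "Tour Eiffel",
            "paragraphs": [
                {
                    "context": context,
                    "qas": [
                        {
                            "id": "piaf-0001",
                            "question": "En quelle année la tour Eiffel a-t-elle été achevée ?",
                            # answer_start is the character offset of the answer
                            # span inside the context paragraph.
                            "answers": [
                                {"text": "1889", "answer_start": context.index("1889")}
                            ],
                        }
                    ],
                }
            ],
        }
    ],
}

# Extract every (question, answer) pair from the nested structure.
pairs = [
    (qa["question"], qa["answers"][0]["text"])
    for article in piaf_like["data"]
    for paragraph in article["paragraphs"]
    for qa in paragraph["qas"]
]
print(json.dumps(pairs, ensure_ascii=False))
```

Because the format matches SQuAD v1.1, any training or evaluation pipeline that accepts SQuAD-style JSON should work with this file unchanged.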
We deeply thank our contributors who have made this project live on a voluntary basis to this day.
Information on the protocol followed, the project news, the annotation platform and the related code are here:
Dataset Card for "COVID-QA-question-answering-biencoder-data-45_45_10"
More Information needed
Dataset Card for "COVID-QA-Chunk-64-question-answering-biencoder-data-65_25_10"
More Information needed
https://www.futurebeeai.com/policies/ai-data-license-agreement
The French Open-Ended Question Answering Dataset is a meticulously curated collection of comprehensive Question-Answer pairs. It serves as a valuable resource for training Large Language Models (LLMs) and question-answering models in the French language, advancing the field of artificial intelligence.
Dataset Content:This QA dataset comprises a diverse set of open-ended questions paired with corresponding answers in French. There is no context paragraph given to choose an answer from, and each question is answered without any predefined context content. The questions cover a broad range of topics, including science, history, technology, geography, literature, current affairs, and more.
Each question is accompanied by an answer, providing valuable information and insights to enhance the language model training process. Both the questions and answers were manually curated by native French people, and references were taken from diverse sources like books, news articles, websites, and other reliable references.
This question-answer prompt completion dataset contains different types of prompts, including instruction type, continuation type, and in-context learning (zero-shot, few-shot) type. The dataset also contains questions and answers with different types of rich text, including tables, code, JSON, etc., with proper markdown.
Question Diversity:To ensure diversity, this Q&A dataset includes questions with varying complexity levels, ranging from easy to medium and hard. Different types of questions, such as multiple-choice, direct, and true/false, are included. Additionally, questions are further classified into fact-based and opinion-based categories, creating a comprehensive variety. The QA dataset also contains questions with constraints and persona restrictions, which makes it even more useful for LLM training.
Answer Formats:To accommodate varied learning experiences, the dataset incorporates different types of answer formats. These formats include single-word, short-phrase, single-sentence, and paragraph-length answers. The answers contain text strings, numerical values, and date and time formats as well. Such diversity strengthens the language model's ability to generate coherent and contextually appropriate answers.
Data Format and Annotation Details:This fully labeled French Open Ended Question Answer Dataset is available in JSON and CSV formats. It includes annotation details such as id, language, domain, question_length, prompt_type, question_category, question_type, complexity, answer_type, rich_text.
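A single record in the JSON format might look as follows. The field values here are invented for illustration and are not taken from the dataset; only the field names come from the annotation details listed above.

```python
import json

# Hypothetical example record illustrating the annotation fields listed
# above; all values are invented for illustration.
record = {
    "id": "fr-qa-00001",
    "language": "French",
    "domain": "science",
    "question_length": 12,
    "prompt_type": "instruction",
    "question_category": "fact-based",
    "question_type": "direct",
    "complexity": "easy",
    "answer_type": "single_sentence",
    "rich_text": False,
}
print(json.dumps(record, ensure_ascii=False, indent=2))
```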
Quality and Accuracy:The dataset upholds the highest standards of quality and accuracy. Each question undergoes careful validation, and the corresponding answers are thoroughly verified. To prioritize inclusivity, the dataset incorporates questions and answers representing diverse perspectives and writing styles, ensuring it remains unbiased and avoids perpetuating discrimination.
Both the questions and answers in French are grammatically accurate, without any spelling or grammatical errors. No copyrighted, toxic, or harmful content was used while building this dataset.
Continuous Updates and Customization:The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Continuous efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to collect custom question-answer data tailored to specific needs, providing flexibility and customization options.
License:The dataset, created by FutureBeeAI, is now ready for commercial use. Researchers, data scientists, and developers can utilize this fully labeled and ready-to-deploy French Open Ended Question Answer Dataset to enhance the language understanding capabilities of their generative AI models, improve response generation, and explore new approaches to NLP question-answering tasks.
https://www.futurebeeai.com/policies/ai-data-license-agreement
The Filipino Closed-Ended Question Answering Dataset is a meticulously curated collection of 5000 comprehensive Question-Answer pairs. It serves as a valuable resource for training Large Language Models (LLMs) and question-answering models in the Filipino language, advancing the field of artificial intelligence.
Dataset Content:This closed-ended QA dataset comprises a diverse set of context paragraphs and questions paired with corresponding answers in Filipino. There is a context paragraph given for each question to get the answer from. The questions cover a broad range of topics, including science, history, technology, geography, literature, current affairs, and more.
Each question is accompanied by an answer, providing valuable information and insights to enhance the language model training process. Both the questions and answers were manually curated by native Filipino people, and references were taken from diverse sources like books, news articles, websites, web forums, and other reliable references.
This question-answer prompt completion dataset contains different types of prompts, including instruction type, continuation type, and in-context learning (zero-shot, few-shot) type. The dataset also contains questions and answers with different types of rich text, including tables, code, JSON, etc., with proper markdown.
Question Diversity:To ensure diversity, this Q&A dataset includes questions with varying complexity levels, ranging from easy to medium and hard. Different types of questions, such as multiple-choice, direct, and true/false, are included. The QA dataset also contains questions with constraints, which makes it even more useful for LLM training.
Answer Formats:To accommodate varied learning experiences, the dataset incorporates different types of answer formats. These formats include single-word, short-phrase, single-sentence, and paragraph-length answers. The answers contain text strings, numerical values, and date and time formats as well. Such diversity strengthens the language model's ability to generate coherent and contextually appropriate answers.
Data Format and Annotation Details:This fully labeled Filipino Closed-Ended Question Answer Dataset is available in JSON and CSV formats. It includes annotation details such as a unique id, context paragraph, context reference link, question, question type, question complexity, question category, domain, prompt type, answer, answer type, and rich text presence.
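A row in the CSV format might look as follows. The field names are a plausible rendering of the annotation details listed above, and every value is invented for illustration; the dataset's actual column names may differ.

```python
import csv
import io

# Hypothetical field names and a sample row for the CSV format;
# all values are invented for illustration.
fieldnames = [
    "id", "context", "context_reference_link", "question", "question_type",
    "question_complexity", "question_category", "domain", "prompt_type",
    "answer", "answer_type", "rich_text",
]
row = {
    "id": "fil-qa-00001",
    "context": "Si Jose Rizal ay isinilang noong 1861 sa Calamba, Laguna.",
    "context_reference_link": "https://example.com/source",
    "question": "Kailan isinilang si Jose Rizal?",
    "question_type": "direct",
    "question_complexity": "easy",
    "question_category": "history",
    "domain": "history",
    "prompt_type": "instruction",
    "answer": "1861",
    "answer_type": "single_word",
    "rich_text": "no",
}

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=fieldnames)
writer.writeheader()
writer.writerow(row)
print(buf.getvalue())
```

Note that in this closed-ended dataset each row carries its own context paragraph, so the answer can always be grounded in the provided text.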
Quality and Accuracy:The dataset upholds the highest standards of quality and accuracy. Each question undergoes careful validation, and the corresponding answers are thoroughly verified. To prioritize inclusivity, the dataset incorporates questions and answers representing diverse perspectives and writing styles, ensuring it remains unbiased and avoids perpetuating discrimination.
The Filipino version is grammatically accurate, without any spelling or grammatical errors. No toxic or harmful content was used while building this dataset.
Continuous Updates and Customization:The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Continuous efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to collect custom question-answer data tailored to specific needs, providing flexibility and customization options.
License:The dataset, created by FutureBeeAI, is now ready for commercial use. Researchers, data scientists, and developers can utilize this fully labeled and ready-to-deploy Filipino Closed-Ended Question Answer Dataset to enhance the language understanding capabilities of their generative AI models, improve response generation, and explore new approaches to NLP question-answering tasks.