https://www.futurebeeai.com/policies/ai-data-license-agreement
The Filipino Closed-Ended Question Answering Dataset is a meticulously curated collection of 5000 comprehensive Question-Answer pairs. It serves as a valuable resource for training Large Language Models (LLMs) and question-answering models in the Filipino language, advancing the field of artificial intelligence.
Dataset Content:This closed-ended QA dataset comprises a diverse set of context paragraphs and questions paired with corresponding answers in Filipino. Each question is paired with a context paragraph from which the answer can be derived. The questions cover a broad range of topics, including science, history, technology, geography, literature, current affairs, and more.
Each question is accompanied by an answer, providing valuable information and insights to enhance the language model training process. Both the questions and answers were manually curated by native Filipino speakers, and references were taken from diverse sources such as books, news articles, websites, web forums, and other reliable references.
This question-answer prompt completion dataset contains different types of prompts, including instruction type, continuation type, and in-context learning (zero-shot, few-shot) type. The dataset also contains questions and answers with different types of rich text, including tables, code, JSON, etc., with proper markdown.
Question Diversity:To ensure diversity, this Q&A dataset includes questions with varying complexity levels, ranging from easy to medium and hard. Different types of questions, such as multiple-choice, direct, and true/false, are included. The QA dataset also contains questions with constraints, which makes it even more useful for LLM training.
Answer Formats:To accommodate varied learning experiences, the dataset incorporates different answer formats, including single-word, short-phrase, single-sentence, and paragraph-length answers. The answers include text strings, numerical values, and date and time formats. Such diversity strengthens the language model's ability to generate coherent and contextually appropriate answers.
Data Format and Annotation Details:This fully labeled Filipino Closed-Ended Question Answer Dataset is available in JSON and CSV formats. It includes annotation details such as a unique id, context paragraph, context reference link, question, question type, question complexity, question category, domain, prompt type, answer, answer type, and rich text presence.
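A record in the JSON delivery might be inspected as sketched below. This is a minimal, hypothetical example: the key names mirror the annotation details listed above, but the exact field names in the delivered files may differ.

```python
import json

# Hypothetical records mirroring the annotation fields listed above;
# key names are assumptions, not the dataset's confirmed schema.
raw = json.dumps([
    {"id": "qa-0001", "question_type": "direct", "question_complexity": "easy",
     "domain": "science", "prompt_type": "instruction", "rich_text": False},
    {"id": "qa-0002", "question_type": "multiple-choice", "question_complexity": "hard",
     "domain": "history", "prompt_type": "continuation", "rich_text": True},
])

records = json.loads(raw)
# Filter by annotated complexity, e.g. to build a hard-question training subset
hard_ids = [r["id"] for r in records if r["question_complexity"] == "hard"]
print(hard_ids)
```

Filtering on annotation fields like this is the typical first step when carving out curriculum or evaluation subsets from a labeled QA dataset.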
Quality and Accuracy:The dataset upholds the highest standards of quality and accuracy. Each question undergoes careful validation, and the corresponding answers are thoroughly verified. To prioritize inclusivity, the dataset incorporates questions and answers representing diverse perspectives and writing styles, ensuring it remains unbiased and avoids perpetuating discrimination.
The Filipino text is grammatically accurate, without any spelling or grammatical errors. No toxic or harmful content was used in building this dataset.
Continuous Updates and Customization:The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Continuous efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to collect custom question-answer data tailored to specific needs, providing flexibility and customization options.
License:The dataset, created by FutureBeeAI, is now ready for commercial use. Researchers, data scientists, and developers can utilize this fully labeled and ready-to-deploy Filipino Closed-Ended Question Answer Dataset to enhance the language understanding capabilities of their generative AI models, improve response generation, and explore new approaches to NLP question-answering tasks.
https://www.futurebeeai.com/policies/ai-data-license-agreement
The Hindi Open-Ended Question Answering Dataset is a meticulously curated collection of comprehensive Question-Answer pairs. It serves as a valuable resource for training Large Language Models (LLMs) and Question-answering models in the Hindi language, advancing the field of artificial intelligence.
Dataset Content:This QA dataset comprises a diverse set of open-ended questions paired with corresponding answers in Hindi. There is no context paragraph given to choose an answer from, and each question is answered without any predefined context content. The questions cover a broad range of topics, including science, history, technology, geography, literature, current affairs, and more.
Each question is accompanied by an answer, providing valuable information and insights to enhance the language model training process. Both the questions and answers were manually curated by native Hindi speakers, and references were taken from diverse sources such as books, news articles, websites, and other reliable references.
This question-answer prompt completion dataset contains different types of prompts, including instruction type, continuation type, and in-context learning (zero-shot, few-shot) type. The dataset also contains questions and answers with different types of rich text, including tables, code, JSON, etc., with proper markdown.
Question Diversity:To ensure diversity, this Q&A dataset includes questions with varying complexity levels, ranging from easy to medium and hard. Different types of questions, such as multiple-choice, direct, and true/false, are included. Additionally, questions are further classified into fact-based and opinion-based categories, creating a comprehensive variety. The QA dataset also contains questions with constraints and persona restrictions, which makes it even more useful for LLM training.
Answer Formats:To accommodate varied learning experiences, the dataset incorporates different answer formats, including single-word, short-phrase, single-sentence, and paragraph-length answers. The answers include text strings, numerical values, and date and time formats. Such diversity strengthens the language model's ability to generate coherent and contextually appropriate answers.
Data Format and Annotation Details:This fully labeled Hindi Open Ended Question Answer Dataset is available in JSON and CSV formats. It includes annotation details such as id, language, domain, question_length, prompt_type, question_category, question_type, complexity, answer_type, rich_text.
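The CSV delivery can be read with the standard library, as in this sketch. The column names follow the annotation fields listed above, but the values are placeholders and the delivered file's exact headers and delimiter should be verified on receipt.

```python
import csv
import io

# Illustrative CSV excerpt using the annotation column names listed above;
# the values are placeholders, not real dataset rows.
raw = """id,language,domain,question_length,prompt_type,question_category,question_type,complexity,answer_type,rich_text
q1,hi,science,8,instruction,fact-based,direct,easy,short phrase,no
q2,hi,history,12,continuation,opinion-based,true/false,hard,single-word,no
"""
rows = list(csv.DictReader(io.StringIO(raw)))
# Select hard questions, e.g. for a difficulty-stratified evaluation split
hard_ids = [r["id"] for r in rows if r["complexity"] == "hard"]
print(hard_ids)
```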
Quality and Accuracy:The dataset upholds the highest standards of quality and accuracy. Each question undergoes careful validation, and the corresponding answers are thoroughly verified. To prioritize inclusivity, the dataset incorporates questions and answers representing diverse perspectives and writing styles, ensuring it remains unbiased and avoids perpetuating discrimination.
Both the questions and answers in Hindi are grammatically accurate, without any spelling or grammatical errors. No copyrighted, toxic, or harmful content was used in building this dataset.
Continuous Updates and Customization:The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Continuous efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to collect custom question-answer data tailored to specific needs, providing flexibility and customization options.
License:The dataset, created by FutureBeeAI, is now ready for commercial use. Researchers, data scientists, and developers can utilize this fully labeled and ready-to-deploy Hindi Open Ended Question Answer Dataset to enhance the language understanding capabilities of their generative AI models, improve response generation, and explore new approaches to NLP question-answering tasks.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
6 different fine-tuned Transformer-based models that solve the downstream task of extractive question answering in the Slovenian language. The fine-tuned models included are: bert-base-cased-squad2-SLO, bert-base-multilingual-cased-squad2-SLO, electra-base-squad2-SLO, roberta-base-squad2-SLO, sloberta-squad2-SLO and xlm-roberta-base-squad2-SLO. The models were trained and evaluated using the Slovene translation of the SQuAD2.0 dataset (https://www.clarin.si/repository/xmlui/handle/11356/1756).
The models achieve the following metric values:

- sloberta-squad2-SLO: EM=67.1, F1=73.56
- xlm-roberta-base-squad2-SLO: EM=62.52, F1=69.51
- bert-base-multilingual-cased-squad2-SLO: EM=61.37, F1=68.1
- roberta-base-squad2-SLO: EM=58.23, F1=64.62
- bert-base-cased-squad2-SLO: EM=55.12, F1=60.52
- electra-base-squad2-SLO: EM=53.69, F1=60.85
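When picking one of these checkpoints for deployment, it can help to hold the reported scores in a structure and rank them programmatically. The scores below are the EM/F1 values reported above; the ranking logic is a plain sketch, not part of the released models.

```python
# Reported evaluation scores for the six fine-tuned Slovenian QA models.
scores = {
    "sloberta-squad2-SLO": {"EM": 67.10, "F1": 73.56},
    "xlm-roberta-base-squad2-SLO": {"EM": 62.52, "F1": 69.51},
    "bert-base-multilingual-cased-squad2-SLO": {"EM": 61.37, "F1": 68.10},
    "roberta-base-squad2-SLO": {"EM": 58.23, "F1": 64.62},
    "bert-base-cased-squad2-SLO": {"EM": 55.12, "F1": 60.52},
    "electra-base-squad2-SLO": {"EM": 53.69, "F1": 60.85},
}

# Rank models by F1 (descending); sloberta leads on both EM and F1.
ranked = sorted(scores, key=lambda m: scores[m]["F1"], reverse=True)
print(ranked[0])
```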
https://choosealicense.com/licenses/gpl-3.0/https://choosealicense.com/licenses/gpl-3.0/
Quora Question Answer Dataset (Quora-QuAD) contains 56,402 question-answer pairs scraped from Quora.
Usage:
For instructions on fine-tuning a model (Flan-T5) with this dataset, please check out the article: https://www.toughdata.net/blog/post/finetune-flan-t5-question-answer-quora-dataset
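Because Flan-T5 is a text-to-text model, each Quora question-answer pair is typically reshaped into an input/target sequence pair before fine-tuning. The sketch below illustrates that reshaping under assumed field names and prompt wording; it is not the linked article's exact recipe.

```python
def to_seq2seq_example(question: str, answer: str) -> dict:
    # Flan-T5 consumes a text prompt and is trained to emit the target text.
    # The prompt phrasing and the field names here are illustrative choices.
    return {
        "input_text": f"Please answer this question: {question}",
        "target_text": answer,
    }

ex = to_seq2seq_example("Why is the sky blue?", "Because of Rayleigh scattering.")
print(ex["input_text"])
```

A list of such dicts can then be tokenized and passed to a standard sequence-to-sequence training loop.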
Question Answering in Context is a large-scale dataset that consists of around 14K crowdsourced Question Answering dialogs with 98K question-answer pairs in total. Data instances consist of an interactive dialog between two crowd workers: (1) a student who poses a sequence of freeform questions to learn as much as possible about a hidden Wikipedia text, and (2) a teacher who answers the questions by providing short excerpts (spans) from the text.
HotpotQA is a question answering dataset featuring natural, multi-hop questions, with strong supervision for supporting facts to enable more explainable question answering systems built based on Wikipedia.
https://www.futurebeeai.com/policies/ai-data-license-agreement
The English Closed-Ended Question Answering Dataset is a meticulously curated collection of 5000 comprehensive Question-Answer pairs. It serves as a valuable resource for training Large Language Models (LLMs) and question-answering models in the English language, advancing the field of artificial intelligence.
Dataset Content:This closed-ended QA dataset comprises a diverse set of context paragraphs and questions paired with corresponding answers in English. Each question is paired with a context paragraph from which the answer can be derived. The questions cover a broad range of topics, including science, history, technology, geography, literature, current affairs, and more.
Each question is accompanied by an answer, providing valuable information and insights to enhance the language model training process. Both the questions and answers were manually curated by native English speakers, and references were taken from diverse sources such as books, news articles, websites, web forums, and other reliable references.
This question-answer prompt completion dataset contains different types of prompts, including instruction type, continuation type, and in-context learning (zero-shot, few-shot) type. The dataset also contains questions and answers with different types of rich text, including tables, code, JSON, etc., with proper markdown.
Question Diversity:To ensure diversity, this Q&A dataset includes questions with varying complexity levels, ranging from easy to medium and hard. Different types of questions, such as multiple-choice, direct, and true/false, are included. The QA dataset also contains questions with constraints, which makes it even more useful for LLM training.
Answer Formats:To accommodate varied learning experiences, the dataset incorporates different answer formats, including single-word, short-phrase, single-sentence, and paragraph-length answers. The answers include text strings, numerical values, and date and time formats. Such diversity strengthens the language model's ability to generate coherent and contextually appropriate answers.
Data Format and Annotation Details:This fully labeled English Closed-Ended Question Answer Dataset is available in JSON and CSV formats. It includes annotation details such as a unique id, context paragraph, context reference link, question, question type, question complexity, question category, domain, prompt type, answer, answer type, and rich text presence.
Quality and Accuracy:The dataset upholds the highest standards of quality and accuracy. Each question undergoes careful validation, and the corresponding answers are thoroughly verified. To prioritize inclusivity, the dataset incorporates questions and answers representing diverse perspectives and writing styles, ensuring it remains unbiased and avoids perpetuating discrimination.
The English text is grammatically accurate, without any spelling or grammatical errors. No toxic or harmful content was used in building this dataset.
Continuous Updates and Customization:The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Continuous efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to collect custom question-answer data tailored to specific needs, providing flexibility and customization options.
License:The dataset, created by FutureBeeAI, is now ready for commercial use. Researchers, data scientists, and developers can utilize this fully labeled and ready-to-deploy English Closed-Ended Question Answer Dataset to enhance the language understanding capabilities of their generative AI models, improve response generation, and explore new approaches to NLP question-answering tasks.
SPIQA Dataset Card Dataset Details Dataset Name: SPIQA (Scientific Paper Image Question Answering)
Paper: SPIQA: A Dataset for Multimodal Question Answering on Scientific Papers
Github: SPIQA eval and metrics code repo
Dataset Summary: SPIQA is a large-scale and challenging QA dataset focused on figures, tables, and text paragraphs from scientific research papers in various computer science domains. The figures cover a wide variety of plots, charts, schematic diagrams, result visualizations, etc. The dataset is the result of a meticulous curation process, leveraging the breadth of expertise and the ability of multimodal large language models (MLLMs) to understand figures. We employ both automatic and manual curation to ensure the highest level of quality and reliability. SPIQA consists of more than 270K questions divided into training, validation, and three different evaluation splits. The purpose of the dataset is to evaluate the ability of large multimodal models to comprehend complex figures and tables together with the textual paragraphs of scientific papers.
This Data Card describes the structure of the SPIQA dataset, divided into training, validation, and three different evaluation splits. The test-B and test-C splits are filtered from the QASA and QASPER datasets and contain human-written QAs. We collect all scientific papers published at top computer science conferences between 2018 and 2023 from arXiv.
If you have any comments or questions, reach out to Shraman Pramanick or Subhashini Venugopalan.
Supported Tasks: - Direct QA with figures and tables - Direct QA with full paper - CoT QA (retrieval of helpful figures, tables; then answering)
Language: English
Release Date: SPIQA is released in June 2024.
Data Splits The statistics of the different splits of SPIQA are shown below.
Split | Papers | Questions | Schematics | Plots & Charts | Visualizations | Other figures | Tables |
---|---|---|---|---|---|---|---|
Train | 25,459 | 262,524 | 44,008 | 70,041 | 27,297 | 6,450 | 114,728 |
Val | 200 | 2,085 | 360 | 582 | 173 | 55 | 915 |
test-A | 118 | 666 | 154 | 301 | 131 | 95 | 434 |
test-B | 65 | 228 | 147 | 156 | 133 | 17 | 341 |
test-C | 314 | 493 | 415 | 404 | 26 | 66 | 1,332 |
Dataset Structure: The contents of this dataset card are structured as follows:
```
SPIQA
├── SPIQA_train_val_test-A_extracted_paragraphs.zip
│   └── Extracted textual paragraphs from the papers in SPIQA train, val and test-A splits
├── SPIQA_train_val_test-A_raw_tex.zip
│   └── The raw tex files from the papers in SPIQA train, val and test-A splits. These files are
│       not required to reproduce our results; we open-source them for future research.
├── train_val
│   ├── SPIQA_train_val_Images.zip
│   │   └── Full resolution figures and tables from the papers in SPIQA train, val splits
│   ├── SPIQA_train.json
│   │   └── SPIQA train metadata
│   └── SPIQA_val.json
│       └── SPIQA val metadata
├── test-A
│   ├── SPIQA_testA_Images.zip
│   │   └── Full resolution figures and tables from the papers in SPIQA test-A split
│   ├── SPIQA_testA_Images_224px.zip
│   │   └── 224px figures and tables from the papers in SPIQA test-A split
│   └── SPIQA_testA.json
│       └── SPIQA test-A metadata
├── test-B
│   ├── SPIQA_testB_Images.zip
│   │   └── Full resolution figures and tables from the papers in SPIQA test-B split
│   ├── SPIQA_testB_Images_224px.zip
│   │   └── 224px figures and tables from the papers in SPIQA test-B split
│   └── SPIQA_testB.json
│       └── SPIQA test-B metadata
└── test-C
    ├── SPIQA_testC_Images.zip
    │   └── Full resolution figures and tables from the papers in SPIQA test-C split
    ├── SPIQA_testC_Images_224px.zip
    │   └── 224px figures and tables from the papers in SPIQA test-C split
    └── SPIQA_testC.json
        └── SPIQA test-C metadata
```
The testA_data_viewer.json file is provided only for previewing a portion of the data in the Hugging Face dataset viewer, to give a quick sense of the metadata.
Metadata Structure The metadata for every split is provided as a dictionary whose keys are the arXiv IDs of the papers. The primary contents of each dictionary item are:
- arXiv ID
- Semantic Scholar ID (for test-B)
- Figures and tables: name of the PNG file, caption, content type (figure or table), figure type (schematic, plot, photo (visualization), others)
- QAs: question, answer and rationale; reference figures and tables; textual evidence (for test-B and test-C)
- Abstract and full paper text (for test-B and test-C; full papers for the other splits are provided as a zip)
Dataset Use and Starter Snippets
Downloading the Dataset to Local: We recommend downloading the metadata and images to your local machine.
Download the whole dataset (all splits):

```python
from huggingface_hub import snapshot_download

# Pass the local directory path via local_dir
snapshot_download(repo_id="google/spiqa", repo_type="dataset", local_dir='.')
```
Download a specific file:

```python
from huggingface_hub import hf_hub_download

# Pass the local directory path via local_dir
hf_hub_download(repo_id="google/spiqa", filename="test-A/SPIQA_testA.json",
                repo_type="dataset", local_dir='.')
```
Questions and answers from a specific paper in test-A:

```python
import json

testA_metadata = json.load(open('test-A/SPIQA_testA.json', 'r'))
paper_id = '1702.03584v3'
print(testA_metadata[paper_id]['qa'])
```
Questions and answers from a specific paper in test-B:

```python
import json

testB_metadata = json.load(open('test-B/SPIQA_testB.json', 'r'))
paper_id = '1707.07012'
print(testB_metadata[paper_id]['question'])     # questions
print(testB_metadata[paper_id]['composition'])  # answers
```
Questions and answers from a specific paper in test-C:

```python
import json

testC_metadata = json.load(open('test-C/SPIQA_testC.json', 'r'))
paper_id = '1808.08780'
print(testC_metadata[paper_id]['question'])  # questions
print(testC_metadata[paper_id]['answer'])    # answers
```
Annotation Overview Questions and answers for the SPIQA train, validation, and test-A sets were machine-generated. Additionally, the SPIQA test-A set was manually filtered and curated. Questions in the SPIQA test-B set are collected from the QASA dataset, while those in the SPIQA test-C set are from the QASPER dataset. Answering the questions in all splits requires holistic understanding of figures and tables with related text from the scientific papers.
Personal and Sensitive Information We are not aware of any personal or sensitive information in the dataset.
Licensing Information CC BY 4.0
Citation Information

```bibtex
@article{pramanick2024spiqa,
  title={SPIQA: A Dataset for Multimodal Question Answering on Scientific Papers},
  author={Pramanick, Shraman and Chellappa, Rama and Venugopalan, Subhashini},
  journal={NeurIPS},
  year={2024}
}
```
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset is curated to support research and development in natural language processing (NLP), particularly in the area of question answering systems. Focused on the domain of Data Science and Analytics, it contains a diverse collection of question-answer pairs designed to reflect real-world inquiries about key concepts, tools, techniques, and trends within the field.
Each entry includes:
A natural language question related to data science topics such as machine learning, data wrangling, statistical analysis, data visualization, big data technologies, and analytics methods.
A corresponding answer, verified for accuracy and clarity, suitable for use in both retrieval-based and generative QA models.
Optional metadata such as topic category, difficulty level, and source context, where applicable.
Use Cases:
Training and evaluating QA models and chatbots focused on technical domains.
Developing educational tools and intelligent tutoring systems for data science learners.
Benchmarking NLP systems for domain-specific understanding and reasoning.
Target Audience:
AI/ML researchers
Data science educators and students
NLP developers working on domain-specific applications
This dataset aims to bridge the gap between technical knowledge and natural language understanding by providing high-quality QA pairs tailored to one of today’s most in-demand fields.
Original Data Source: Question Answering Dataset
CNCF QA Dataset for LLM Tuning
Description
This dataset, named cncf-qa-dataset-for-llm-tuning, is designed for fine-tuning large language models (LLMs) and is formatted in a question-answer (QA) style. The data is sourced from PDF and markdown (MD) files extracted from various project repositories within the CNCF (Cloud Native Computing Foundation) landscape. These files were processed and converted into a QA format to be fed to the LLM. The dataset includes the… See the full description on the dataset page: https://huggingface.co/datasets/Kubermatic/cncf-question-and-answer-dataset-for-llm-training.
https://www.futurebeeai.com/policies/ai-data-license-agreement
The Spanish Open-Ended Question Answering Dataset is a meticulously curated collection of comprehensive Question-Answer pairs. It serves as a valuable resource for training Large Language Models (LLMs) and Question-answering models in the Spanish language, advancing the field of artificial intelligence.
Dataset Content:This QA dataset comprises a diverse set of open-ended questions paired with corresponding answers in Spanish. There is no context paragraph given to choose an answer from, and each question is answered without any predefined context content. The questions cover a broad range of topics, including science, history, technology, geography, literature, current affairs, and more.
Each question is accompanied by an answer, providing valuable information and insights to enhance the language model training process. Both the questions and answers were manually curated by native Spanish speakers, and references were taken from diverse sources such as books, news articles, websites, and other reliable references.
This question-answer prompt completion dataset contains different types of prompts, including instruction type, continuation type, and in-context learning (zero-shot, few-shot) type. The dataset also contains questions and answers with different types of rich text, including tables, code, JSON, etc., with proper markdown.
Question Diversity:To ensure diversity, this Q&A dataset includes questions with varying complexity levels, ranging from easy to medium and hard. Different types of questions, such as multiple-choice, direct, and true/false, are included. Additionally, questions are further classified into fact-based and opinion-based categories, creating a comprehensive variety. The QA dataset also contains questions with constraints and persona restrictions, which makes it even more useful for LLM training.
Answer Formats:To accommodate varied learning experiences, the dataset incorporates different answer formats, including single-word, short-phrase, single-sentence, and paragraph-length answers. The answers include text strings, numerical values, and date and time formats. Such diversity strengthens the language model's ability to generate coherent and contextually appropriate answers.
Data Format and Annotation Details:This fully labeled Spanish Open Ended Question Answer Dataset is available in JSON and CSV formats. It includes annotation details such as id, language, domain, question_length, prompt_type, question_category, question_type, complexity, answer_type, rich_text.
Quality and Accuracy:The dataset upholds the highest standards of quality and accuracy. Each question undergoes careful validation, and the corresponding answers are thoroughly verified. To prioritize inclusivity, the dataset incorporates questions and answers representing diverse perspectives and writing styles, ensuring it remains unbiased and avoids perpetuating discrimination.
Both the questions and answers in Spanish are grammatically accurate, without any spelling or grammatical errors. No copyrighted, toxic, or harmful content was used in building this dataset.
Continuous Updates and Customization:The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Continuous efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to collect custom question-answer data tailored to specific needs, providing flexibility and customization options.
License:The dataset, created by FutureBeeAI, is now ready for commercial use. Researchers, data scientists, and developers can utilize this fully labeled and ready-to-deploy Spanish Open Ended Question Answer Dataset to enhance the language understanding capabilities of their generative AI models, improve response generation, and explore new approaches to NLP question-answering tasks.
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Last updated: 13/09/2023. This dataset is intended for developing Russian-language dialogue systems (chatbots, question-answering systems, etc.) about autism spectrum disorders, and in particular for building an informational Russian-language chatbot for the inclusion of people with autism spectrum disorder and Asperger syndrome. Text source: https://aspergers.ru. The project is carried out by a winner of the "Practices of Personal Philanthropy and Altruism" competition of the Vladimir Potanin Charitable Foundation. 75% of the data was collected via the Toloka platform.

Dataset composition:
1. original.json: the original version of the dataset
2. multiple.json: a version with multiple answer options (all other versions contain only one answer per question)
3. short.json: a version with shortened answers
4. half_sized.json: a version containing 50% of the collected data
5. no_impossible.json: a version containing only relevant questions
6. age_dataset.tsv: a dataset for determining the user's age (can be used for model customization)

The detailed dataset statistics:

Parameter | Value |
---|---|
Number of QA pairs | 4,138 |
Number of irrelevant questions | 352 |
Average question length | 53 symbols / 8 words |
Average answer length | 141 symbols / 20 words |
Average reading paragraph length | 453 symbols / 63 words |
Max question length | 226 symbols / 32 words |
Max answer length | 555 symbols / 85 words |
Max reading paragraph length | 551 symbols / 94 words |
Min question length | 9 symbols / 2 words |
Min answer length | 5 symbols / 1 word |
Min reading paragraph length | 144 symbols / 17 words |
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a small dataset created to evaluate the performance of Question Answering (QA) for large-scale language models in the field of civil engineering. The dataset is targeted at the field of bridge design, and was created using the document (Survey on the common issues of bridge design (2018-2019) (Technical Note of NILIM No.1162)) used in bridge design projects in Japan. The dataset consists of 50 pairs of QAs, where each pair consists of a question asking about the content of a document and an answer extracted from the document associated with the question.
Each column of the csv file shows the following data.
Column 1: ID of the QA
Column 2: Referenced page (page number of the full PDF)
Column 3: Question
Column 4: Answer
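A minimal sketch of reading the four-column layout with the standard library is shown below. The rows here are placeholders (the actual file's questions and answers are drawn from the Japanese bridge-design document), so only the column structure should be taken as given.

```python
import csv
import io

# Placeholder rows following the four-column layout described above:
# ID, referenced page, question, answer.
raw = ("1,12,What is the design span length?,45 m\n"
       "2,30,Which section of the note applies?,Section 3.2\n")

qa_by_id = {}
for row in csv.reader(io.StringIO(raw)):
    qa_id, page, question, answer = row
    qa_by_id[qa_id] = {"page": int(page), "question": question, "answer": answer}
print(qa_by_id["1"]["page"])
```

Keeping the referenced page number alongside each pair makes it easy to check an answer against the source PDF.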
Question Answering in Context is a dataset for modeling, understanding, and participating in information seeking dialog. Data instances consist of an interactive dialog between two crowd workers: (1) a student who poses a sequence of freeform questions to learn as much as possible about a hidden Wikipedia text, and (2) a teacher who answers the questions by providing short excerpts (spans) from the text. QuAC introduces challenges not found in existing machine comprehension datasets: its questions are often more open-ended, unanswerable, or only meaningful within the dialog context.
To use this dataset:
```python
import tensorflow_datasets as tfds

ds = tfds.load('quac', split='train')
for ex in ds.take(4):
  print(ex)
```
See the guide for more information on tensorflow_datasets.
Round 8 Holdout Dataset: This is the training data used to construct and evaluate trojan detection software solutions. This data, generated at NIST, consists of natural language processing (NLP) AIs trained to perform extractive question answering on English text. A known percentage of these trained AI models have been poisoned with a known trigger which induces incorrect behavior. This data will be used to develop software solutions for detecting which trained AI models have been poisoned via embedded triggers. This dataset consists of 360 QA AI models using a small set of model architectures. Half (50%) of the models have been poisoned with an embedded trigger which causes misclassification of the input when the trigger is present.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The data collection process began with web scraping of a selected higher education institution's website between July and September 2023, gathering any data related to the admission topic of higher education institutions. This resulted in a raw dataset primarily centered on admission-related content. Meticulous data cleaning and organization procedures were then applied to refine the dataset. The primary data, in its raw form before annotation into a question-and-answer format, was predominantly in the Indonesian language. A comprehensive annotation process followed, enriching the dataset with specific admission-related information and transforming it into secondary data. Both primary and secondary data remained predominantly in Indonesian. To enhance data quality, filters were added to remove or exclude: 1) data not in the Indonesian language, 2) data unrelated to the admission topic, and 3) redundant entries. This careful curation culminated in a finalized dataset, now readily available for research and analysis in the domain of higher education admission.
The use of artificial intelligence in public services is often identified as an opportunity to query documentary texts and build automatic question-answering tools for users. Querying the labour code in natural language, providing a conversational agent for a given service, developing efficient search engines, improving knowledge management: all of these require a body of quality training data in order to develop question-answering algorithms. The PIAF dataset is a public and open French-language training dataset for training such algorithms.
Inspired by SQuAD, the well-known English question-answering dataset, we set out to build a similar dataset that would be open to all. The protocol we followed closely mirrors that of the first version of SQuAD (SQuAD v1.1). However, some changes had to be made to adapt to the characteristics of the French Wikipedia. Another major difference is that we did not employ micro-workers via crowd-sourcing platforms.
After several months of annotation, we have a robust and free annotation platform, a sufficient volume of annotations, and a well-founded, innovative approach to community engagement and collaborative participation within the French administration.
In March 2018, France launched its national strategy for artificial intelligence. Piloted within the Interdepartmental Digital Branch, this strategy has three components: research, the economy and public transformation.
Given that the data policy is a major focus of the development of artificial intelligence, the Etalab mission is piloting the establishment of an interministerial “Lab IA”, whose mission is to accelerate the deployment of AI in administrations via 3 main activities:
The PIAF project is one of the shared tools of the IA Lab.
The dataset follows the SQuAD v1.1 format. PIAF v1.2 contains 9,225 Q&A pairs in a single JSON file. A text file illustrating the schema is included below. This file can be used to train and evaluate question-answering models, for example by following these instructions.
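As an illustration, the nested SQuAD v1.1 schema that PIAF follows can be sketched with a minimal, hand-written example. The article title, context, and IDs below are invented for illustration; the real file contains thousands of entries.

```python
import json

# Minimal hand-written example of the SQuAD v1.1 JSON schema.
# Values are invented; only the structure matches the real file.
context = "La tour Eiffel a été achevée en 1889 pour l'Exposition universelle de Paris."

piaf_like = {
    "version": "1.1",
    "data": [
        {
            "title": "Tour Eiffel",
            "paragraphs": [
                {
                    "context": context,
                    "qas": [
                        {
                            "id": "piaf-0001",
                            "question": "En quelle année la tour Eiffel a-t-elle été achevée ?",
                            # answer_start is the character offset of the answer
                            # span inside the context paragraph.
                            "answers": [
                                {"text": "1889", "answer_start": context.index("1889")}
                            ],
                        }
                    ],
                }
            ],
        }
    ],
}

# Extract every (question, answer) pair from the nested structure.
pairs = [
    (qa["question"], qa["answers"][0]["text"])
    for article in piaf_like["data"]
    for paragraph in article["paragraphs"]
    for qa in paragraph["qas"]
]
print(json.dumps(pairs, ensure_ascii=False))
```

Because the format matches SQuAD v1.1, any training or evaluation pipeline that accepts SQuAD-style JSON should work with this file unchanged.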
We deeply thank our contributors who have made this project live on a voluntary basis to this day.
Information on the protocol followed, the project news, the annotation platform and the related code are here:
Dataset Card for "COVID-QA-question-answering-biencoder-data-45_45_10"
More Information needed
Dataset Card for "COVID-QA-Chunk-64-question-answering-biencoder-data-65_25_10"
More Information needed
https://www.futurebeeai.com/policies/ai-data-license-agreement
The French Open-Ended Question Answering Dataset is a meticulously curated collection of comprehensive Question-Answer pairs. It serves as a valuable resource for training Large Language Models (LLMs) and question-answering models in the French language, advancing the field of artificial intelligence.
Dataset Content:This QA dataset comprises a diverse set of open-ended questions paired with corresponding answers in French. There is no context paragraph given to choose an answer from, and each question is answered without any predefined context content. The questions cover a broad range of topics, including science, history, technology, geography, literature, current affairs, and more.
Each question is accompanied by an answer, providing valuable information and insights to enhance the language model training process. Both the questions and answers were manually curated by native French people, and references were taken from diverse sources like books, news articles, websites, and other reliable references.
This question-answer prompt completion dataset contains different types of prompts, including instruction type, continuation type, and in-context learning (zero-shot, few-shot) type. The dataset also contains questions and answers with different types of rich text, including tables, code, JSON, etc., with proper markdown.
Question Diversity:To ensure diversity, this Q&A dataset includes questions with varying complexity levels, ranging from easy to medium and hard. Different types of questions, such as multiple-choice, direct, and true/false, are included. Additionally, questions are further classified into fact-based and opinion-based categories, creating a comprehensive variety. The QA dataset also contains questions with constraints and persona restrictions, which makes it even more useful for LLM training.
Answer Formats:To accommodate varied learning experiences, the dataset incorporates different types of answer formats. These formats include single-word, short-phrase, single-sentence, and paragraph-length answers. The answers contain text strings, numerical values, and date and time formats as well. Such diversity strengthens the language model's ability to generate coherent and contextually appropriate answers.
Data Format and Annotation Details:This fully labeled French Open Ended Question Answer Dataset is available in JSON and CSV formats. It includes annotation details such as id, language, domain, question_length, prompt_type, question_category, question_type, complexity, answer_type, rich_text.
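A single record in the JSON format might look as follows. The field values here are invented for illustration and are not taken from the dataset; only the field names come from the annotation details listed above.

```python
import json

# Hypothetical example record illustrating the annotation fields listed
# above; all values are invented for illustration.
record = {
    "id": "fr-qa-00001",
    "language": "French",
    "domain": "science",
    "question_length": 12,
    "prompt_type": "instruction",
    "question_category": "fact-based",
    "question_type": "direct",
    "complexity": "easy",
    "answer_type": "single_sentence",
    "rich_text": False,
}
print(json.dumps(record, ensure_ascii=False, indent=2))
```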
Quality and Accuracy:The dataset upholds the highest standards of quality and accuracy. Each question undergoes careful validation, and the corresponding answers are thoroughly verified. To prioritize inclusivity, the dataset incorporates questions and answers representing diverse perspectives and writing styles, ensuring it remains unbiased and avoids perpetuating discrimination.
Both the questions and answers in French are grammatically accurate, without any spelling or grammatical errors. No copyrighted, toxic, or harmful content was used while building this dataset.
Continuous Updates and Customization:The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Continuous efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to collect custom question-answer data tailored to specific needs, providing flexibility and customization options.
License:The dataset, created by FutureBeeAI, is now ready for commercial use. Researchers, data scientists, and developers can utilize this fully labeled and ready-to-deploy French Open Ended Question Answer Dataset to enhance the language understanding capabilities of their generative AI models, improve response generation, and explore new approaches to NLP question-answering tasks.
https://www.futurebeeai.com/policies/ai-data-license-agreement
The Filipino Closed-Ended Question Answering Dataset is a meticulously curated collection of 5000 comprehensive Question-Answer pairs. It serves as a valuable resource for training Large Language Models (LLMs) and question-answering models in the Filipino language, advancing the field of artificial intelligence.
Dataset Content:This closed-ended QA dataset comprises a diverse set of context paragraphs and questions paired with corresponding answers in Filipino. There is a context paragraph given for each question to get the answer from. The questions cover a broad range of topics, including science, history, technology, geography, literature, current affairs, and more.
Each question is accompanied by an answer, providing valuable information and insights to enhance the language model training process. Both the questions and answers were manually curated by native Filipino people, and references were taken from diverse sources like books, news articles, websites, web forums, and other reliable references.
This question-answer prompt completion dataset contains different types of prompts, including instruction type, continuation type, and in-context learning (zero-shot, few-shot) type. The dataset also contains questions and answers with different types of rich text, including tables, code, JSON, etc., with proper markdown.
Question Diversity:To ensure diversity, this Q&A dataset includes questions with varying complexity levels, ranging from easy to medium and hard. Different types of questions, such as multiple-choice, direct, and true/false, are included. The QA dataset also contains questions with constraints, which makes it even more useful for LLM training.
Answer Formats:To accommodate varied learning experiences, the dataset incorporates different types of answer formats. These formats include single-word, short-phrase, single-sentence, and paragraph-length answers. The answers contain text strings, numerical values, and date and time formats as well. Such diversity strengthens the language model's ability to generate coherent and contextually appropriate answers.
Data Format and Annotation Details:This fully labeled Filipino Closed-Ended Question Answer Dataset is available in JSON and CSV formats. It includes annotation details such as a unique id, context paragraph, context reference link, question, question type, question complexity, question category, domain, prompt type, answer, answer type, and rich text presence.
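A row in the CSV format might look as follows. The field names are a plausible rendering of the annotation details listed above, and every value is invented for illustration; the dataset's actual column names may differ.

```python
import csv
import io

# Hypothetical field names and a sample row for the CSV format;
# all values are invented for illustration.
fieldnames = [
    "id", "context", "context_reference_link", "question", "question_type",
    "question_complexity", "question_category", "domain", "prompt_type",
    "answer", "answer_type", "rich_text",
]
row = {
    "id": "fil-qa-00001",
    "context": "Si Jose Rizal ay isinilang noong 1861 sa Calamba, Laguna.",
    "context_reference_link": "https://example.com/source",
    "question": "Kailan isinilang si Jose Rizal?",
    "question_type": "direct",
    "question_complexity": "easy",
    "question_category": "history",
    "domain": "history",
    "prompt_type": "instruction",
    "answer": "1861",
    "answer_type": "single_word",
    "rich_text": "no",
}

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=fieldnames)
writer.writeheader()
writer.writerow(row)
print(buf.getvalue())
```

Note that in this closed-ended dataset each row carries its own context paragraph, so the answer can always be grounded in the provided text.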
Quality and Accuracy:The dataset upholds the highest standards of quality and accuracy. Each question undergoes careful validation, and the corresponding answers are thoroughly verified. To prioritize inclusivity, the dataset incorporates questions and answers representing diverse perspectives and writing styles, ensuring it remains unbiased and avoids perpetuating discrimination.
The Filipino version is grammatically accurate, without any spelling or grammatical errors. No toxic or harmful content was used while building this dataset.
Continuous Updates and Customization:The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Continuous efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to collect custom question-answer data tailored to specific needs, providing flexibility and customization options.
License:The dataset, created by FutureBeeAI, is now ready for commercial use. Researchers, data scientists, and developers can utilize this fully labeled and ready-to-deploy Filipino Closed-Ended Question Answer Dataset to enhance the language understanding capabilities of their generative AI models, improve response generation, and explore new approaches to NLP question-answering tasks.