The Stanford Question Answering Dataset (SQuAD) is a collection of question-answer pairs derived from Wikipedia articles. In SQuAD, the correct answer to a question can be any sequence of tokens in the given text. Because the questions and answers are produced by humans through crowdsourcing, it is more diverse than some other question-answering datasets. SQuAD 1.1 contains 107,785 question-answer pairs on 536 articles. SQuAD 2.0, the latest version, combines the 100,000 questions in SQuAD 1.1 with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to the answerable ones.
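As a quick illustration of the span format, here is a minimal Python sketch (assuming the Hugging Face datasets library and the rajpurkar/squad dataset id) that checks that each answer is literally a slice of its context:

from datasets import load_dataset

squad = load_dataset("rajpurkar/squad", split="train")
ex = squad[0]

# Each answer is a literal slice of the context: `answer_start` is a
# character offset and `text` is the token sequence it points at.
start = ex["answers"]["answer_start"][0]
text = ex["answers"]["text"][0]
assert ex["context"][start:start + len(text)] == text
print(ex["question"], "->", text)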
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
Dataset Card for SQuAD
Dataset Summary
Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. SQuAD 1.1 contains 100,000+ question-answer pairs on 500+ articles.
Supported Tasks and Leaderboards
Question Answering.… See the full description on the dataset page: https://huggingface.co/datasets/rajpurkar/squad.
Dataset Card for Dataset Name
Dataset Summary
This dataset card aims to be a base template for new datasets. It has been generated using this raw template.
Supported Tasks and Leaderboards
[More Information Needed]
Languages
[More Information Needed]
Dataset Structure
Data Instances
[More Information Needed]
Data Fields
[More Information Needed]
Data Splits
[More Information Needed]
Dataset Creation… See the full description on the dataset page: https://huggingface.co/datasets/autoevaluate/squad-sample.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
Dataset Card for SQuAD 2.0
Dataset Summary
Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. SQuAD 2.0 combines the 100,000 questions in SQuAD 1.1 with over 50,000 unanswerable questions written adversarially by crowdworkers… See the full description on the dataset page: https://huggingface.co/datasets/rajpurkar/squad_v2.
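A minimal sketch of how the unanswerable questions surface in practice, assuming the rajpurkar/squad_v2 dataset id and the convention that unanswerable questions carry an empty answers["text"] list:

from datasets import load_dataset

squad_v2 = load_dataset("rajpurkar/squad_v2", split="validation")

# Unanswerable questions have an empty answers["text"] list.
unanswerable = sum(1 for ex in squad_v2 if not ex["answers"]["text"])
print(f"{unanswerable} of {len(squad_v2)} validation questions are unanswerable")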
MedQuAD includes 47,457 medical question-answer pairs created from 12 NIH websites (e.g. cancer.gov, niddk.nih.gov, GARD, MedlinePlus Health Topics). The collection covers 37 question types (e.g. Treatment, Diagnosis, Side Effects) associated with diseases, drugs and other medical entities such as tests.
The use of artificial intelligence in public services is often identified as an opportunity to query documentary texts and build automatic question-answering (Q&A) tools for users. Querying the labour code in natural language, providing a conversational agent for a given service, developing efficient search engines, improving knowledge management: all of these require a corpus of high-quality training data in order to develop Q&A algorithms. The PIAF dataset is a public, open French-language training dataset for training such algorithms.
Inspired by SQuAD, the well-known English Q&A dataset, we set out to build a similar dataset that would be open to all. The protocol we followed is very similar to that of the first version of SQuAD (SQuAD v1.1), although some changes had to be made to adapt to the characteristics of the French Wikipedia. Another big difference is that we did not employ micro-workers via crowd-sourcing platforms.
After several months of annotation, we have a robust and free annotation platform, a sufficient quantity of annotations, and a well-established, innovative approach to community animation and collaborative participation within the French administration.
In March 2018, France launched its national strategy for artificial intelligence. Led by the Interministerial Digital Directorate (Direction interministérielle du numérique), this strategy has three components: research, the economy and public-sector transformation.
Since data policy is a major focus of the development of artificial intelligence, the Etalab mission is leading the establishment of an interministerial "Lab IA", whose mission is to accelerate the deployment of AI in administrations via three main activities:
The PIAF project is one of the shared tools of the Lab IA.
The dataset follows the SQuAD v1.1 format. PIAF v1.2 contains 9,225 question-answer pairs in a single JSON file. A sketch illustrating the schema is included below; the file can be used to train and evaluate question answering models, for example by following these instructions.
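Since the schema file itself is not reproduced here, the following Python sketch walks the SQuAD v1.1 layout that PIAF follows (the file name piaf-v1.2.json is illustrative):

import json

# PIAF v1.2 follows the SQuAD v1.1 layout:
# data -> articles -> paragraphs -> qas -> answers.
with open("piaf-v1.2.json", encoding="utf-8") as f:
    piaf = json.load(f)

for article in piaf["data"]:
    for paragraph in article["paragraphs"]:
        context = paragraph["context"]
        for qa in paragraph["qas"]:
            question = qa["question"]
            for answer in qa["answers"]:
                text = answer["text"]            # answer string
                start = answer["answer_start"]   # character offset in context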
We deeply thank our contributors, who have kept this project alive on a voluntary basis to this day.
Information on the protocol followed, the project news, the annotation platform and the related code are here:
TriviaQA is a realistic text-based question answering dataset which includes 950K question-answer pairs from 662K documents collected from Wikipedia and the web. This dataset is more challenging than standard QA benchmark datasets such as Stanford Question Answering Dataset (SQuAD), as the answers for a question may not be directly obtained by span prediction and the context is very long. TriviaQA dataset consists of both human-verified and machine-generated QA subsets.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
Dataset Card
Dataset Summary
This dataset contains multilingual, parallel SQuAD dataset examples across EN, DE, ES, and IT. To construct the dataset, identifiers were aligned across the following SQuAD-related datasets:
EN, DE, ES: XQuAD (Cross-lingual Question Answering Dataset)
IT: SQuAD-it
See citation information below.
Citation Information
XQuAD: @article{Artetxe:etal:2019, author = {Mikel Artetxe and Sebastian Ruder and Dani Yogatama}… See the full description on the dataset page: https://huggingface.co/datasets/kyle-obrien/multilingual-squad.
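As an illustration of the id-based alignment described above, this sketch pairs English and German questions through their shared ids, assuming the public XQuAD config names (xquad.en, xquad.de) and its single validation split:

from datasets import load_dataset

# Config names follow the public XQuAD release; XQuAD ships one
# "validation" split per language.
en = load_dataset("xquad", "xquad.en", split="validation")
de = load_dataset("xquad", "xquad.de", split="validation")

# Align examples across languages through their shared ids.
de_by_id = {ex["id"]: ex for ex in de}
for ex in en:
    if ex["id"] in de_by_id:
        print(ex["question"])
        print(de_by_id[ex["id"]]["question"])
        break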
The QNLI (Question-answering NLI) dataset is a Natural Language Inference dataset automatically derived from the Stanford Question Answering Dataset v1.1 (SQuAD). SQuAD v1.1 consists of question-paragraph pairs, where one of the sentences in the paragraph (drawn from Wikipedia) contains the answer to the corresponding question (written by an annotator). The dataset was converted into sentence pair classification by forming a pair between each question and each sentence in the corresponding context, and filtering out pairs with low lexical overlap between the question and the context sentence. The task is to determine whether the context sentence contains the answer to the question. This modified version of the original task removes the requirement that the model select the exact answer, but also removes the simplifying assumptions that the answer is always present in the input and that lexical overlap is a reliable cue. The QNLI dataset is part of GLUE benchmark.
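To make the construction concrete, here is a deliberately naive sketch of the question/sentence pairing and lexical-overlap filter; the actual GLUE conversion differs in its tokenization and threshold choices:

def squad_to_qnli_pairs(question, context, answer_start, min_overlap=2):
    """Pair a question with each context sentence (naive sentence split)."""
    q_tokens = set(question.lower().split())
    pairs, offset = [], 0
    for sentence in context.split(". "):
        overlap = len(q_tokens & set(sentence.lower().split()))
        if overlap >= min_overlap:  # filter out pairs with low lexical overlap
            # Entailment iff the answer span starts inside this sentence.
            contains = offset <= answer_start < offset + len(sentence)
            pairs.append((question, sentence,
                          "entailment" if contains else "not_entailment"))
        offset += len(sentence) + 2  # 2 = len(". ") removed by the split
    return pairs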
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
Questions, answers and documents are stored in the dataset. Every question has an answer, and the answer comes from a page of Rijksportaal Personeel (the central-government intranet). With this dataset a question-and-answer model can be trained; the computer thus learns to answer questions in the context of P-Direkt. A total of 322 questions were used that were once asked by e-mail to the contact center of P-Direkt. The questions are very general and never ask about personal circumstances. The aim of the dataset was to test whether question-and-answer models could be used in a P-Direkt environment. The structure of the dataset corresponds to the SQuAD 2.0 dataset.

Example:

Question: Is it true that my SCV hours of 2020 expire if I don't take them?

Answer: You can save your IKB hours in your IKB savings leave. IKB hours that you have not taken as leave and have not paid out will be added to your IKB savings leave at the end of December. Your IKB savings leave cannot expire.

Source*: You can save your IKB hours in your IKB savings leave. IKB hours that you have not taken as leave and have not paid out will be added to your IKB savings leave at the end of December. Your IKB savings leave cannot expire. You cannot have your IKB savings leave paid out. Payment is only made in the event of termination of employment or death. You can save up to 1,800 hours. Do you work part-time or more than an average of 36 hours per week? In that case, the maximum number of hours to be saved is calculated proportionally and rounded down to whole hours. Any remaining holiday hours from 2015 and extra-statutory holiday hours that you had left over from 2016 up to and including 2019 were converted into IKB hours on 1 January 2020 and added to your IKB savings leave.

* Please note: the source is a snapshot of Rijksportaal Personeel from April 2021. Go to Rijksportaal Personeel on the intranet for up-to-date information about personnel matters.
Dataset Card for SQuAD
This dataset is a collection of question-answer pairs from the SQuAD dataset. See SQuAD for additional information. This dataset can be used directly with Sentence Transformers to train embedding models.
Dataset Subsets
pair subset
Columns: "question", "answer"
Column types: str, str
Examples: { 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?', 'answer': 'Architecturally, the school has a Catholic… See the full description on the dataset page: https://huggingface.co/datasets/sentence-transformers/squad.
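A minimal fine-tuning sketch with Sentence Transformers v3+; the loss choice and base model are illustrative assumptions, not prescribed by the dataset card:

from datasets import load_dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import MultipleNegativesRankingLoss

# Base model chosen arbitrarily for illustration.
model = SentenceTransformer("all-MiniLM-L6-v2")
train = load_dataset("sentence-transformers/squad", split="train")

# In-batch negatives: each (question, answer) pair treats the other
# answers in the batch as negatives.
loss = MultipleNegativesRankingLoss(model)
trainer = SentenceTransformerTrainer(model=model, train_dataset=train, loss=loss)
trainer.train()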
The MRQA 2019 Shared Task focuses on generalization in question answering. An effective question answering system should do more than merely interpolate from the training set to answer test examples drawn from the same distribution: it should also be able to extrapolate to out-of-distribution examples — a significantly harder challenge.
MRQA adapts and unifies multiple distinct question answering datasets (carefully selected subsets of existing datasets) into the same format (SQuAD format). Among them, six datasets were made available for training, and six datasets were made available for testing. Small portions of the training datasets were held-out as in-domain data that may be used for development. The testing datasets only contain out-of-domain data. This benchmark is released as part of the MRQA 2019 Shared Task.
More information can be found at: https://mrqa.github.io/2019/shared.html.
To use this dataset:
import tensorflow_datasets as tfds

# Load the training split of the unified MRQA data.
ds = tfds.load('mrqa', split='train')

# Inspect a few examples.
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
SQuAD-it
This dataset is an adapted version of SQuAD-it, prepared for training Hugging Face models. It contains:
train samples: 87599
test samples: 10570
This dataset is for question answering and its format is the following:

[
  {
    "answers": [
      {
        "answer_start": [1],
        "text": ["Questo è un testo"]
      }
    ],
    "context": "Questo è un testo relativo al contesto.",
    "id": "1",
    "question": "Questo è un testo?",
    "title": "train test"
  }
]
It can… See the full description on the dataset page: https://huggingface.co/datasets/z-uo/squad-it.
MLQA (MultiLingual Question Answering) is a benchmark dataset for evaluating cross-lingual question answering performance. MLQA consists of over 5K extractive QA instances (12K in English) in SQuAD format in seven languages - English, Arabic, German, Spanish, Hindi, Vietnamese and Simplified Chinese. MLQA is highly parallel, with QA instances parallel between 4 different languages on average.
MIT License: https://opensource.org/licenses/MIT
SynQA is a Reading Comprehension dataset created in the work "Improving Question Answering Model Robustness with Synthetic Adversarial Data Generation" (https://aclanthology.org/2021.emnlp-main.696/). It consists of 314,811 synthetically generated questions on the passages in the SQuAD v1.1 (https://arxiv.org/abs/1606.05250) training set.
In this work, we use synthetic adversarial data generation to make QA models more robust to human adversaries. We develop a data generation pipeline that selects source passages, identifies candidate answers, generates questions, then finally filters or re-labels them to improve quality. Using this approach, we amplify a smaller human-written adversarial dataset to a much larger set of synthetic question-answer pairs. By incorporating our synthetic data, we improve the state-of-the-art on the AdversarialQA (https://adversarialqa.github.io/) dataset by 3.7 F1 and improve model generalisation on nine of the twelve MRQA datasets. We further conduct a novel human-in-the-loop evaluation to show that our models are considerably more robust to new human-written adversarial examples: crowdworkers can fool our model only 8.8% of the time on average, compared to 17.6% for a model trained without synthetic data.
For full details on how the dataset was created, kindly refer to the paper.
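In the spirit of that generate-then-filter pipeline, here is a rough sketch using stand-in models from the Hugging Face Hub; the model names are illustrative assumptions, not the paper's trained components:

from transformers import pipeline

# Stand-in models: a highlight-based question generator and a SQuAD-tuned
# QA model used as the filter. Both names are assumptions for illustration.
qg = pipeline("text2text-generation", model="valhalla/t5-base-qg-hl")
qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

def synthesize(passage, answer):
    # Highlight the candidate answer so the QG model knows what to ask about.
    highlighted = passage.replace(answer, "<hl> " + answer + " <hl>", 1)
    question = qg("generate question: " + highlighted)[0]["generated_text"]
    # Filtering step: keep the pair only if the QA model recovers the answer.
    predicted = qa(question=question, context=passage)["answer"]
    return (question, answer) if predicted.strip() == answer else None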
SQUAD - Smart Qualitative Data: Methods and Community Tools for Data Mark-Up is a demonstrator project that will explore methodological and technical solutions for exposing digital qualitative data to make them fully shareable, exploitable and archivable for the longer term. Such tools are required to exploit fully the potential of qualitative data for adventurous collaborative research using web-based and e-science systems. An example of the latter might be linking multiple data and information sources, such as text, statistics and maps.

Initially, the project deals with specifying and testing flexible means of storing and marking up, or annotating, qualitative data using universal standards and technologies, through eXtensible Mark-up Language (XML). A community standard, or schema, will be proposed that will be applicable to most kinds of qualitative data. The second strand investigates optimal requirements for describing or 'contextualising' research data (e.g. interview setting or interviewer characteristics), aiming to develop standards for data documentation. The third strand aims to use natural language processing technologies to develop and implement user-friendly tools for semi-automating the preparation of marked-up qualitative data. Finally, the project will investigate tools for publishing the enriched data and contextual information to web-based systems and for exporting to preservation formats.