50 datasets found
  1. SQuAD Dataset

    • paperswithcode.com
    Updated Oct 5, 2022
    Cite
    François Bienvenu; Mike Steel (2022). SQuAD Dataset [Dataset]. https://paperswithcode.com/dataset/squad
    Dataset updated
    Oct 5, 2022
    Authors
    François Bienvenu; Mike Steel
    Description

    The Stanford Question Answering Dataset (SQuAD) is a collection of question-answer pairs derived from Wikipedia articles. In SQuAD, the correct answer to a question can be any sequence of tokens in the given text. Because the questions and answers are produced by humans through crowdsourcing, SQuAD is more diverse than some other question-answering datasets. SQuAD 1.1 contains 107,785 question-answer pairs on 536 articles. SQuAD2.0 (open-domain SQuAD, SQuAD-Open), the latest version, combines the 100,000 questions in SQuAD1.1 with over 50,000 unanswerable questions written adversarially by crowdworkers to resemble the answerable ones.

  2. squad

    • huggingface.co
    • tensorflow.org
    • +1 more
    Updated Jun 12, 2020
    + more versions
    Cite
    Pranav R (2020). squad [Dataset]. https://huggingface.co/datasets/rajpurkar/squad
    Dataset updated
    Jun 12, 2020
    Authors
    Pranav R
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dataset Card for SQuAD

      Dataset Summary
    

    Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. SQuAD 1.1 contains 100,000+ question-answer pairs on 500+ articles.

      Supported Tasks and Leaderboards
    

    Question Answering.… See the full description on the dataset page: https://huggingface.co/datasets/rajpurkar/squad.

  3. squad-sample

    • huggingface.co
    Updated May 15, 2015
    + more versions
    Cite
    Evaluation on the Hub (2015). squad-sample [Dataset]. https://huggingface.co/datasets/autoevaluate/squad-sample
    Dataset updated
    May 15, 2015
    Dataset authored and provided by
    Evaluation on the Hub
    Description

    Dataset Card for Dataset Name

      Dataset Summary
    

    This dataset card aims to be a base template for new datasets. It has been generated using this raw template.

      Supported Tasks and Leaderboards
    

    [More Information Needed]

      Languages
    

    [More Information Needed]

      Dataset Structure
    
    
    
    
    
      Data Instances
    

    [More Information Needed]

      Data Fields
    

    [More Information Needed]

      Data Splits
    

    [More Information Needed]

      Dataset Creation… See the full description on the dataset page: https://huggingface.co/datasets/autoevaluate/squad-sample.
    
  4. squad_v2

    • huggingface.co
    Updated Jun 15, 2005
    Cite
    Pranav R (2005). squad_v2 [Dataset]. https://huggingface.co/datasets/rajpurkar/squad_v2
    Dataset updated
    Jun 15, 2005
    Authors
    Pranav R
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dataset Card for SQuAD 2.0

      Dataset Summary
    

    Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. SQuAD 2.0 combines the 100,000 questions in SQuAD1.1 with over 50,000 unanswerable questions written adversarially by crowdworkers… See the full description on the dataset page: https://huggingface.co/datasets/rajpurkar/squad_v2.
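    As a minimal sketch of what "unanswerable" means in practice: assuming each record carries an `answers` field with parallel `text` and `answer_start` lists (an assumption about the record layout, not something stated in this summary), an unanswerable question is one whose answer list is empty. The records and helper names below are made up for illustration.

```python
def is_answerable(record):
    """Return True when the record has at least one gold answer span."""
    return len(record["answers"]["text"]) > 0


def split_by_answerability(records):
    """Partition records into (answerable, unanswerable) lists."""
    answerable = [r for r in records if is_answerable(r)]
    unanswerable = [r for r in records if not is_answerable(r)]
    return answerable, unanswerable


# Hypothetical records, not taken from the dataset itself.
examples = [
    {
        "question": "Where is the tower?",
        "context": "The tower is in Paris.",
        "answers": {"text": ["Paris"], "answer_start": [16]},
    },
    {
        "question": "Who painted the tower?",
        "context": "The tower is in Paris.",
        "answers": {"text": [], "answer_start": []},  # no span answers this
    },
]

answerable, unanswerable = split_by_answerability(examples)
```

    A model evaluated on such data has to abstain on the second record instead of returning a span.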

  5. MedQuAD Dataset

    • paperswithcode.com
    Updated Feb 16, 2024
    + more versions
    Cite
    Asma Ben Abacha; Dina Demner-Fushman (2024). MedQuAD Dataset [Dataset]. https://paperswithcode.com/dataset/medquad
    Dataset updated
    Feb 16, 2024
    Authors
    Asma Ben Abacha; Dina Demner-Fushman
    Description

    MedQuAD includes 47,457 medical question-answer pairs created from 12 NIH websites (e.g. cancer.gov, niddk.nih.gov, GARD, MedlinePlus Health Topics). The collection covers 37 question types (e.g. Treatment, Diagnosis, Side Effects) associated with diseases, drugs and other medical entities such as tests.

  6. Piaf — The French-language dataset of Questions-Answers

    • data.europa.eu
    csv, json, plain text
    Updated Jun 19, 2022
    Cite
    Etalab (2022). Piaf — The French-language dataset of Questions-Answers [Dataset]. https://data.europa.eu/data/datasets/5e83c3ed38f46c1808801fbb?locale=en
    Available download formats: csv (1014663), plain text (816), json (4744747), json (2834209)
    Dataset updated
    Jun 19, 2022
    Dataset authored and provided by
    Etalab
    Area covered
    France, French
    Description

    PIAF: building an open French-language dataset for AI

    The use of artificial intelligence in public services is often seen as an opportunity to query documentary texts and build automatic question-answering (QA) tools for users. Querying the labour code in natural language, providing a conversational agent for a given service, developing efficient search engines, and improving knowledge management all require high-quality training corpora for developing question-answering algorithms. The PIAF dataset is a public, open French-language training dataset for training these algorithms.

    Inspired by SQuAD, the well-known English QA dataset, we set out to build a similar dataset that would be open to all. The protocol we followed is very similar to that of the first version of SQuAD (SQuAD v1.1). However, some changes had to be made to adapt to the characteristics of the French Wikipedia. Another major difference is that we do not employ micro-workers via crowdsourcing platforms.

    After several months of annotation, we have a robust and free annotation platform, a sufficient number of annotations, and a well-established, innovative approach to community building and collaborative participation within the French administration.

    PIAF: a shared tool of the Lab IA

    In March 2018, France launched its national strategy for artificial intelligence. Led from within the Interministerial Digital Directorate, this strategy has three components: research, the economy and public transformation.

    Since data policy is a major focus of the development of artificial intelligence, the Etalab mission is leading the establishment of an interministerial "Lab IA", whose mission is to accelerate the deployment of AI in administrations through 3 main activities:

    1. Build a core team to bring AI skills and expertise in-house
    2. Support AI projects in administrations through calls for expressions of interest
    3. Co-build shared tools that can be used as openly as possible

    The PIAF project is one of the shared tools of the Lab IA.

    Description of the data made available

    The dataset follows the format of SQuAD v1.1. PIAF v1.2 contains 9225 question-answer pairs. It is a JSON file. A text file illustrating the schema is included below. This file can be used to train and evaluate question-answering models, for example by following these instructions.

    Thanks to the 500 contributors!

    We warmly thank our contributors, who have kept this project alive on a voluntary basis to this day.

    Links

    Information on the protocol followed, the project news, the annotation platform and the related code are here:

  7. TriviaQA Dataset

    • paperswithcode.com
    • opendatalab.com
    • +2 more
    Updated Mar 19, 2019
    Cite
    Mandar Joshi; Eunsol Choi; Daniel S. Weld; Luke Zettlemoyer (2024). TriviaQA Dataset [Dataset]. https://paperswithcode.com/dataset/triviaqa
    Dataset updated
    Mar 19, 2019
    Authors
    Mandar Joshi; Eunsol Choi; Daniel S. Weld; Luke Zettlemoyer
    Description

    TriviaQA is a realistic text-based question answering dataset which includes 950K question-answer pairs from 662K documents collected from Wikipedia and the web. This dataset is more challenging than standard QA benchmark datasets such as Stanford Question Answering Dataset (SQuAD), as the answers for a question may not be directly obtained by span prediction and the context is very long. TriviaQA dataset consists of both human-verified and machine-generated QA subsets.

  8. multilingual-squad

    • huggingface.co
    Updated Feb 15, 2015
    Cite
    Kyle (2015). multilingual-squad [Dataset]. https://huggingface.co/datasets/kyle-obrien/multilingual-squad
    Dataset updated
    Feb 15, 2015
    Authors
    Kyle
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dataset Card

      Dataset Summary
    

    This dataset contains multilingual, parallel SQuAD dataset examples across EN, DE, ES, and IT. To construct the dataset, identifiers were aligned across the following SQuAD-related datasets:

    EN, DE, ES: XQuAD (Cross-lingual Question Answering Dataset) IT: SQuAD-it

    See citation information below.

      Citation Information
    

    XQuAD: @article{Artetxe:etal:2019, author = {Mikel Artetxe and Sebastian Ruder and Dani Yogatama}… See the full description on the dataset page: https://huggingface.co/datasets/kyle-obrien/multilingual-squad.

  9. QNLI Dataset

    • paperswithcode.com
    Updated May 19, 2021
    Cite
    Alex Wang; Amanpreet Singh; Julian Michael; Felix Hill; Omer Levy; Samuel R. Bowman (2021). QNLI Dataset [Dataset]. https://paperswithcode.com/dataset/qnli
    Dataset updated
    May 19, 2021
    Authors
    Alex Wang; Amanpreet Singh; Julian Michael; Felix Hill; Omer Levy; Samuel R. Bowman
    Description

    The QNLI (Question-answering NLI) dataset is a Natural Language Inference dataset automatically derived from the Stanford Question Answering Dataset v1.1 (SQuAD). SQuAD v1.1 consists of question-paragraph pairs, where one of the sentences in the paragraph (drawn from Wikipedia) contains the answer to the corresponding question (written by an annotator). The dataset was converted into sentence pair classification by forming a pair between each question and each sentence in the corresponding context, and filtering out pairs with low lexical overlap between the question and the context sentence. The task is to determine whether the context sentence contains the answer to the question. This modified version of the original task removes the requirement that the model select the exact answer, but also removes the simplifying assumptions that the answer is always present in the input and that lexical overlap is a reliable cue. The QNLI dataset is part of GLUE benchmark.
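    The conversion described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the procedure used to build QNLI: the regex sentence splitter, the overlap measure (fraction of question tokens found in the sentence) and the `min_overlap` threshold are all hypothetical choices.

```python
import re


def split_sentences(paragraph):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip()]


def tokens(text):
    """Lowercased set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def lexical_overlap(question, sentence):
    """Fraction of question tokens that also occur in the sentence."""
    q = tokens(question)
    if not q:
        return 0.0
    return len(q & tokens(sentence)) / len(q)


def make_pairs(question, paragraph, answer_start, min_overlap=0.2):
    """Pair the question with each context sentence, drop low-overlap pairs,
    and label a pair 'entailment' when the sentence covers the answer offset."""
    pairs = []
    offset = 0
    for sentence in split_sentences(paragraph):
        start = paragraph.index(sentence, offset)
        end = start + len(sentence)
        offset = end
        if lexical_overlap(question, sentence) < min_overlap:
            continue  # filtered out: too little overlap with the question
        label = "entailment" if start <= answer_start < end else "not_entailment"
        pairs.append((question, sentence, label))
    return pairs


# Made-up example, not taken from the dataset.
paragraph = ("The Eiffel Tower is in Paris. It was completed in 1889. "
             "Many tourists visit every year.")
question = "When was the Eiffel Tower completed?"
pairs = make_pairs(question, paragraph, paragraph.index("1889"))
```

    In this toy example the first sentence survives the filter but does not contain the answer, which is the kind of pair that makes lexical overlap an unreliable cue.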

  10. Question-and-answer dataset Rijksportaal Personnel

    • ckan.mobidatalab.eu
    Updated Jul 13, 2023
    + more versions
    Cite
    OverheidNl (2023). Question-and-answer dataset Rijksportaal Personnel [Dataset]. https://ckan.mobidatalab.eu/dataset/vraag-en-antwoord-dataset-rijksportaal-personeel
    Available download formats: http://publications.europa.eu/resource/authority/file-type/json, http://publications.europa.eu/resource/authority/file-type/pdf
    Dataset updated
    Jul 13, 2023
    Dataset provided by
    OverheidNl
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Questions, answers and documents are stored in the dataset. Every question has an answer and the answer comes from a page of Rijksportaal Personnel (intranet central government). With this dataset a question-and-answer model can be trained. The computer thus learns to answer questions in the context of P-Direkt. A total of 322 questions were used that were once asked by e-mail to the contact center of P-Direkt. The questions are very general and never ask about personal circumstances. The aim of the dataset was to test whether question-and-answer models could possibly be used in a P-Direkt environment. The structure of the dataset corresponds to the Squad 2.0 dataset.

    Example

    Question: Is it true that my SCV hours of 2020 expire if I don't take them?

    Answer: You can save your IKB hours in your IKB savings leave. IKB hours that you have not taken as leave and have not paid out will be added to your IKB savings leave at the end of December. Your IKB savings leave cannot expire

    Source*: You can save your IKB hours in your IKB savings leave. IKB hours that you have not taken as leave and have not paid out will be added to your IKB savings leave at the end of December. Your IKB savings leave cannot expire. You cannot have your IKB savings leave paid out. Payment is only made in the event of termination of employment or death. You can save up to 1800 hours. Do you work part-time or more than an average of 36 hours per week? In that case, the maximum number of hours to be saved is calculated proportionally and rounded down to whole hours. Any remaining holiday hours from 2015 and extra-statutory holiday hours that you had left over from 2016 up to and including 2019 will be converted into IKB hours on 1 January 2020 and these have been added to your IKB savings leave.

    * Please note, source is a snapshot of National Portal Personnel from April 2021. Go to National Portal Personnel on the intranet for up-to-date information about personnel matters.

  11. squad

    • huggingface.co
    Updated Apr 30, 2024
    Cite
    Sentence Transformers (2024). squad [Dataset]. https://huggingface.co/datasets/sentence-transformers/squad
    Dataset updated
    Apr 30, 2024
    Dataset authored and provided by
    Sentence Transformers
    Description

    Dataset Card for SQuAD

    This dataset is a collection of question-answer pairs from the SQuAD dataset. See SQuAD for additional information. This dataset can be used directly with Sentence Transformers to train embedding models.

      Dataset Subsets
    
    
    
    
    
      pair subset
    

    Columns: "question", "answer" Column types: str, str Examples:{ 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?', 'answer': 'Architecturally, the school has a Catholic… See the full description on the dataset page: https://huggingface.co/datasets/sentence-transformers/squad.

  12. mrqa

    • tensorflow.org
    Updated Jan 4, 2023
    + more versions
    Cite
    (2023). mrqa [Dataset]. https://www.tensorflow.org/datasets/catalog/mrqa
    Dataset updated
    Jan 4, 2023
    Description

    The MRQA 2019 Shared Task focuses on generalization in question answering. An effective question answering system should do more than merely interpolate from the training set to answer test examples drawn from the same distribution: it should also be able to extrapolate to out-of-distribution examples — a significantly harder challenge.

    MRQA adapts and unifies multiple distinct question answering datasets (carefully selected subsets of existing datasets) into the same format (SQuAD format). Among them, six datasets were made available for training, and six datasets were made available for testing. Small portions of the training datasets were held-out as in-domain data that may be used for development. The testing datasets only contain out-of-domain data. This benchmark is released as part of the MRQA 2019 Shared Task.

    More information can be found at: https://mrqa.github.io/2019/shared.html.

    To use this dataset:

    import tensorflow_datasets as tfds

    # Load the training split of the MRQA dataset
    ds = tfds.load('mrqa', split='train')
    # Print the first four examples
    for ex in ds.take(4):
        print(ex)
    

    See the guide for more information on tensorflow_datasets.

  13. squad-it

    • huggingface.co
    Updated Nov 1, 2021
    Cite
    Nicola Landro (2021). squad-it [Dataset]. https://huggingface.co/datasets/z-uo/squad-it
    Dataset updated
    Nov 1, 2021
    Authors
    Nicola Landro
    Description

    Squad-it

    This dataset is a version of squad-it adapted for training Hugging Face models. It contains:

    train samples: 87599
    test samples: 10570

    This dataset is for question answering, and its format is as follows:

    [
      {
        "answers": [
          { "answer_start": [1], "text": ["Questo è un testo"] }
        ],
        "context": "Questo è un testo relativo al contesto.",
        "id": "1",
        "question": "Questo è un testo?",
        "title": "train test"
      }
    ]

    It can… See the full description on the dataset page: https://huggingface.co/datasets/z-uo/squad-it.
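    A minimal sketch of a sanity check for records in the format shown above: it verifies that each `answer_start` offset really points at its answer `text` inside `context`. The helper names are hypothetical, and the sample record uses offset 0 (where the answer text actually begins) rather than the placeholder value 1 from the card.

```python
def answer_is_aligned(context, text, answer_start):
    """True when context contains `text` exactly at character offset `answer_start`."""
    return context[answer_start:answer_start + len(text)] == text


def misaligned_answers(record):
    """Return (text, answer_start) pairs whose offset does not match the context."""
    bad = []
    for answer in record["answers"]:
        for text, start in zip(answer["text"], answer["answer_start"]):
            if not answer_is_aligned(record["context"], text, start):
                bad.append((text, start))
    return bad


# Illustrative record following the layout shown on the card.
record = {
    "answers": [{"answer_start": [0], "text": ["Questo è un testo"]}],
    "context": "Questo è un testo relativo al contesto.",
    "id": "1",
    "question": "Questo è un testo?",
    "title": "train test",
}

problems = misaligned_answers(record)
```

    Running a check like this before training catches offsets that were shifted during translation or preprocessing.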

  14. Rescue Squads in Ohio, United States - 27 Available (Free Sample)

    • poidata.io
    csv
    Updated Jun 4, 2025
    Cite
    Poidata.io (2025). Rescue Squads in Ohio, United States - 27 Available (Free Sample) [Dataset]. https://www.poidata.io/report/rescue-squad/united-states/ohio
    Available download formats: csv
    Dataset updated
    Jun 4, 2025
    Dataset provided by
    Poidata.io
    Area covered
    Ohio, United States
    Description

    This dataset provides information on 27 rescue squads in Ohio, United States as of June 2025. It includes details such as email addresses (where publicly available), phone numbers (where publicly available), and geocoded addresses. Explore market trends, identify potential business partners, and gain valuable insights into the industry. Download a complimentary sample of 10 records to see what's included.

  15. MLQA Dataset

    • paperswithcode.com
    • opendatalab.com
    • +2 more
    Updated Jun 29, 2021
    + more versions
    Cite
    Patrick Lewis; Barlas Oğuz; Ruty Rinott; Sebastian Riedel; Holger Schwenk (2021). MLQA Dataset [Dataset]. https://paperswithcode.com/dataset/mlqa
    Dataset updated
    Jun 29, 2021
    Authors
    Patrick Lewis; Barlas Oğuz; Ruty Rinott; Sebastian Riedel; Holger Schwenk
    Description

    MLQA (MultiLingual Question Answering) is a benchmark dataset for evaluating cross-lingual question answering performance. MLQA consists of over 5K extractive QA instances (12K in English) in SQuAD format in seven languages - English, Arabic, German, Spanish, Hindi, Vietnamese and Simplified Chinese. MLQA is highly parallel, with QA instances parallel between 4 different languages on average.

  16. internal-datasets

    • huggingface.co
    Updated Jun 1, 2023
    + more versions
    Cite
    Ivan Rivaldo Marbun (2023). internal-datasets [Dataset]. https://huggingface.co/datasets/Marbyun/internal-datasets
    Dataset updated
    Jun 1, 2023
    Authors
    Ivan Rivaldo Marbun
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    SynQA is a Reading Comprehension dataset created in the work "Improving Question Answering Model Robustness with Synthetic Adversarial Data Generation" (https://aclanthology.org/2021.emnlp-main.696/). It consists of 314,811 synthetically generated questions on the passages in the SQuAD v1.1 (https://arxiv.org/abs/1606.05250) training set.

    In this work, we use synthetic adversarial data generation to make QA models more robust to human adversaries. We develop a data generation pipeline that selects source passages, identifies candidate answers, generates questions, then finally filters or re-labels them to improve quality. Using this approach, we amplify a smaller human-written adversarial dataset to a much larger set of synthetic question-answer pairs. By incorporating our synthetic data, we improve the state-of-the-art on the AdversarialQA (https://adversarialqa.github.io/) dataset by 3.7 F1 and improve model generalisation on nine of the twelve MRQA datasets. We further conduct a novel human-in-the-loop evaluation to show that our models are considerably more robust to new human-written adversarial examples: crowdworkers can fool our model only 8.8% of the time on average, compared to 17.6% for a model trained without synthetic data.

    For full details on how the dataset was created, kindly refer to the paper.

  17. Rescue Squads in California, United States - 27 Available (Free Sample)

    • poidata.io
    csv
    Updated May 8, 2025
    Cite
    Poidata.io (2025). Rescue Squads in California, United States - 27 Available (Free Sample) [Dataset]. https://www.poidata.io/report/rescue-squad/united-states/california
    Available download formats: csv
    Dataset updated
    May 8, 2025
    Dataset provided by
    Poidata.io
    Area covered
    United States, California
    Description

    This dataset provides information on 27 rescue squads in California, United States as of May 2025. It includes details such as email addresses (where publicly available), phone numbers (where publicly available), and geocoded addresses. Explore market trends, identify potential business partners, and gain valuable insights into the industry. Download a complimentary sample of 10 records to see what's included.

  18. Rescue Squads in Australia - 57 Available (Free Sample)

    • poidata.io
    csv
    Updated Jun 9, 2025
    Cite
    Poidata.io (2025). Rescue Squads in Australia - 57 Available (Free Sample) [Dataset]. https://www.poidata.io/report/rescue-squad/australia
    Available download formats: csv
    Dataset updated
    Jun 9, 2025
    Dataset provided by
    Poidata.io
    Area covered
    Australia
    Description

    This dataset provides information on 57 rescue squads in Australia as of June 2025. It includes details such as email addresses (where publicly available), phone numbers (where publicly available), and geocoded addresses. Explore market trends, identify potential business partners, and gain valuable insights into the industry. Download a complimentary sample of 10 records to see what's included.

  19. Smart qualitative data: Methods and community tools for data mark-Up (SQUAD)...

    • datacatalogue.cessda.eu
    Updated Jun 6, 2025
    Cite
    Corti, L (2025). Smart qualitative data: Methods and community tools for data mark-Up (SQUAD) [Dataset]. http://doi.org/10.5255/UKDA-SN-850003
    Dataset updated
    Jun 6, 2025
    Dataset provided by
    University of Essex
    Authors
    Corti, L
    Time period covered
    Mar 1, 2005 - Oct 31, 2006
    Area covered
    United Kingdom
    Variables measured
    Other
    Measurement technique
    Tools and technologies to explore new forms of sharing and disseminating qualitative data
    Description

    SQUAD - Smart Qualitative Data: Methods and Community Tools for Data Mark-Up is a demonstrator project that will explore methodological and technical solutions for exposing digital qualitative data to make them fully shareable, exploitable and archivable for the longer term. Such tools are required to exploit fully the potential of qualitative data for adventurous collaborative research using web-based and e-science systems. An example of the latter might be linking multiple data and information sources, such as text, statistics and maps. Initially, the project deals with specifying and testing flexible means of storing and marking-up, or annotating, qualitative data using universal standards and technologies, through eXtensible Mark-up Language (XML). A community standard, or schema, will be proposed that will be applicable to most kinds of qualitative data. The second strand investigates optimal requirements for describing or 'contextualising' research data (e.g. interview setting or interviewer characteristics), aiming to develop standards for data documentation. The third strand aims to use natural language processing technologies to develop and implement user-friendly tools for semi-automating processes to prepare marked-up qualitative data. Finally, the project will investigate tools for publishing the enriched data and contextual information to web-based systems and for exporting to preservation formats.

  20. PIAF - Le dataset francophone de Questions-Réponses

    • data.smartidf.services
    • data.gouv.fr
    csv, excel, json
    Updated Jun 29, 2023
    Cite
    (2023). PIAF - Le dataset francophone de Questions-Réponses [Dataset]. https://data.smartidf.services/explore/dataset/piaf-le-dataset-francophone-de-questions-reponses/
    Available download formats: csv, json, excel
    Dataset updated
    Jun 29, 2023
    Description

    PIAF: building an open French-language dataset for AI

    The use of artificial intelligence in public services is often seen as an opportunity to query documentary texts and build automatic question-answering (QA) tools for users. Querying the labour code in natural language, providing a conversational agent for a given service, developing efficient search engines, and improving knowledge management all require high-quality training corpora for developing question-answering algorithms. The PIAF dataset is a public, open French-language training dataset for training these algorithms.

    Inspired by SQuAD, the well-known English QA dataset, we set out to build a similar dataset that would be open to all. The protocol we followed is very similar to that of the first version of SQuAD (SQuAD v1.1). However, some changes had to be made to adapt to the characteristics of the French Wikipedia. Another major difference is that we do not employ micro-workers via crowdsourcing platforms.

    After several months of annotation, we have a robust and free annotation platform, a sufficient number of annotations, and a well-established, innovative approach to community building and collaborative participation within the French administration.

    PIAF: a shared tool of the Lab IA

    In March 2018, France launched its national strategy for artificial intelligence. Led from within the Interministerial Digital Directorate, this strategy has three components: research, the economy and public transformation.

    Since data policy is a major focus of the development of artificial intelligence, the Etalab mission is leading the establishment of an interministerial "Lab IA", whose mission is to accelerate the deployment of AI in administrations through 3 main activities:

    1. Build a core team to bring AI skills and expertise in-house
    2. Support AI projects in administrations through calls for expressions of interest
    3. Co-build shared tools that can be used as openly as possible

    The PIAF project is one of the shared tools of the Lab IA.

    Description of the data made available

    The dataset follows the format of SQuAD v1.1. PIAF v1.2 contains 9225 question-answer pairs. It is a JSON file. A text file illustrating the schema is included below. This file can be used to train and evaluate question-answering models, for example by following these instructions.

    Thanks to the 500 contributors!

    We warmly thank our contributors, who have kept this project alive on a voluntary basis to this day.

    Links

    Information on the protocol followed, the project news, the annotation platform and the related code are here:
