41 datasets found
  1. rag-mini-bioasq

    • huggingface.co
    Updated Nov 20, 2023
    Cite
    RAG Datasets (2023). rag-mini-bioasq [Dataset]. https://huggingface.co/datasets/rag-datasets/rag-mini-bioasq
    Dataset updated
    Nov 20, 2023
    Dataset authored and provided by
    RAG Datasets
    License

    Attribution 2.5 (CC BY 2.5): https://creativecommons.org/licenses/by/2.5/
    License information was derived automatically

    Description

    See here for an updated version without NaNs in text-corpus. In this Hugging Face discussion you can share what you used the dataset for. The data derives from http://participants-area.bioasq.org/Tasks/11b/trainingDataset/; we generated our own subset using generate.py.

  2. Data from: BioASQ Dataset

    • paperswithcode.com
    Updated Jun 29, 2022
    Cite
    George Tsatsaronis; Georgios Balikas; Prodromos Malakasiotis; Ioannis Partalas; Matthias Zschunke; Michael R. Alvers; Dirk Weissenborn; Anastasia Krithara; Sergios Petridis; Dimitris Polychronopoulos; Yannis Almirantis; John Pavlopoulos; Nicolas Baskiotis; Patrick Gallinari; Thierry Artières; Axel-Cyrille Ngonga Ngomo; Norman Heino; Éric Gaussier; Liliana Barrio-Alvers; Michael Schroeder; Ion Androutsopoulos; Georgios Paliouras (2022). BioASQ Dataset [Dataset]. https://paperswithcode.com/dataset/bioasq
    Dataset updated
    Jun 29, 2022
    Authors
    George Tsatsaronis; Georgios Balikas; Prodromos Malakasiotis; Ioannis Partalas; Matthias Zschunke; Michael R. Alvers; Dirk Weissenborn; Anastasia Krithara; Sergios Petridis; Dimitris Polychronopoulos; Yannis Almirantis; John Pavlopoulos; Nicolas Baskiotis; Patrick Gallinari; Thierry Artières; Axel-Cyrille Ngonga Ngomo; Norman Heino; Éric Gaussier; Liliana Barrio-Alvers; Michael Schroeder; Ion Androutsopoulos; Georgios Paliouras
    Description

    BioASQ is a question answering dataset. Instances in the BioASQ dataset are composed of a question (Q), human-annotated answers (A), and the relevant contexts (C) (also called snippets).

  3. Multilingual-BioASQ-6B

    • huggingface.co
    Updated Jun 16, 2025
    Share
    Cite
    HiTZ zentroa (2025). Multilingual-BioASQ-6B [Dataset]. https://huggingface.co/datasets/HiTZ/Multilingual-BioASQ-6B
    Dataset updated
    Jun 16, 2025
    Dataset authored and provided by
    HiTZ zentroa
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Multilingual BioASQ-6B

    We translate the BioASQ-6B English Question Answering dataset to generate parallel French, Italian and Spanish versions using the NLLB200 3B parameter model. For more info read the original task description: http://bioasq.org/participate/challenges_year_6

    We translate the body, snippets, ideal_answer and exact_answer fields. We have validated the quality of the ideal_answer field, however, the… See the full description on the dataset page: https://huggingface.co/datasets/HiTZ/Multilingual-BioASQ-6B.

  4. BioASQ-taskb

    • huggingface.co
    Cite
    James Burgess, BioASQ-taskb [Dataset]. https://huggingface.co/datasets/jmhb/BioASQ-taskb
    Authors
    James Burgess
    Description

    BioASQ task B dataset. Where did this data come from?

    • Signed up for a BioASQ account at https://www.bioasq.org/
    • Downloaded all the train and test sets for "Task B" (which is QA) and added them to a folder data/
    • Put this script in scripts/: https://gist.github.com/jmhb0/5a0789bf9c8605c5b95c63b72a1bbc8e

    Note: this script does deduplication. There is a lot of overlap between the train and test sets.

    Papers for attribution:

    https://www.nature.com/articles/s41597-023-02068-4… See the full description on the dataset page: https://huggingface.co/datasets/jmhb/BioASQ-taskb.

  5. Title-Based Semantic Subject Indexing

    • kaggle.com
    Updated Apr 3, 2018
    Cite
    Florian Mai (2018). Title-Based Semantic Subject Indexing [Dataset]. https://www.kaggle.com/hsrobo/titlebased-semantic-subject-indexing/discussion
    Dataset updated
    Apr 3, 2018
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    Florian Mai
    Description

    Semantic Subject Indexing

    Semantic subject indexing is the process of annotating documents with terms that describe what the document is about. This is often used in digital libraries to increase the findability of the documents. Annotations are usually created by human experts from the domain, who select appropriate terms from a pre-specified set of available labels. In order to keep up with the vast amount of new publications, (semi-)automatic tools are developed that assist the experts by suggesting terms for annotation. Unfortunately, due to legal restrictions these tools often cannot use the full text or the abstract of the publication. Therefore, it is desirable to explore techniques that work with the publications' metadata only. To some extent, it is already possible to achieve performance competitive with full-text methods by merely using titles. Yet, the performance of automatic subject indexing methods is still far from the level of human annotators. Semantic subject indexing can be framed as a multi-label classification problem, where the entry (i,j) of an indicator matrix is set to one if label j has been assigned to document i, and to zero otherwise. A major challenge is that the label space is usually very large (up to almost 30,000), that the label frequencies follow a power law, and that the labels are subject to concept drift (cf. Toepfer and Seifert).

    Here, we provide two large-scale datasets from the domain of economics and business studies (EconBiz) and biomedicine (PubMed) used in our recent study, which each come with the title and respective annotated labels. Do you find valuable insights in the data that can help understand the problem of semantic subject indexing better? Can you come up with clever ideas that push the state-of-the-art in automatic semantic subject indexing? We are excited to see what the collective power of data scientists can achieve on this task!

    Content

    We compiled two English datasets from two digital libraries, EconBiz and PubMed.

    EconBiz

    The EconBiz dataset was compiled from a meta-data export provided by ZBW - Leibniz Information Centre for Economics from July 2017. We only retained those publications that were flagged as being in English and that were annotated with STW labels. Afterwards, we removed duplicates by checking for same title and labels. In total, approximately 1,064k publications remain. The annotations were selected by human annotators from the Standard Thesaurus Wirtschaft (STW), which contains approximately 6,000 labels.

    PubMed

    The PubMed dataset was compiled from the training set of the 5th BioASQ challenge on large-scale semantic subject indexing of biomedical articles, which were all in English. Again, we removed duplicates by checking for same title and labels. In total, approximately 12.8 million publications remain. The labels are so-called MeSH terms. In our data, approximately 28k of them are used.

    Fields

    Both datasets share the same set of fields:

    • id: An identifier used to refer to the publication in the respective digital library.
    • title: The title of the publication
    • labels: A string that represents a list of labels, separated by TAB.
    • fold: For reproducibility of the results in our study: Number of the fold a sample belongs to as used in our study. 0 to 9 correspond to the samples that have a full-text, fold 10 to all other samples.
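
    The field layout above, together with the indicator-matrix framing from the introduction, can be sketched roughly as follows. This is a minimal illustration only: the sample rows, the label strings, and the helper names parse_record and indicator_matrix are hypothetical, and the sketch assumes each row has already been read into a dict of strings (the on-disk file format is not described here).

```python
def parse_record(row):
    """Convert one raw row (dict of strings) into typed fields."""
    return {
        "id": row["id"],
        "title": row["title"],
        # labels is a single string with TAB-separated label names
        "labels": [label for label in row["labels"].split("\t") if label],
        # folds 0-9: samples with a full text; fold 10: all other samples
        "fold": int(row["fold"]),
    }


def indicator_matrix(records):
    """Build a dense 0/1 matrix: entry (i, j) is 1 if label j is on document i."""
    vocab = sorted({label for rec in records for label in rec["labels"]})
    column = {label: j for j, label in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in records]
    for i, rec in enumerate(records):
        for label in rec["labels"]:
            matrix[i][column[label]] = 1
    return vocab, matrix


# Hypothetical sample rows, invented for illustration.
SAMPLE_ROWS = [
    {"id": "doc-1", "title": "First title", "labels": "Inflation\tMonetary policy", "fold": "0"},
    {"id": "doc-2", "title": "Second title", "labels": "Inflation", "fold": "10"},
]

records = [parse_record(row) for row in SAMPLE_ROWS]
vocab, matrix = indicator_matrix(records)
```

    With a label space of tens of thousands of labels, a dense list of lists like this would be wasteful; a sparse representation would be the more realistic choice for the real data.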

    Acknowledgements

    We would like to thank ZBW - Information Centre for Economics for providing the EconBiz dataset, and in particular Tamara Pianos and Tobias Rebholz.

    We would also like to thank the team from the BioASQ challenge, from where we compiled the PubMed dataset. This organization is dedicated to advancing the state-of-the-art in large-scale semantic indexing. It is currently running the 6th iteration of their challenge, which you should definitely check out!

    The PubMed dataset has been gathered by BioASQ following the terms from the U.S. National Library of Medicine regarding public use and redistribution of the data.

  6. BioASQvec Plus.txt

    • figshare.com
    txt
    Updated Apr 12, 2019
    Cite
    Peiliang Lou (2019). BioASQvec Plus.txt [Dataset]. http://doi.org/10.6084/m9.figshare.7981739.v1
    Dataset updated
    Apr 12, 2019
    Dataset provided by
    Figshare: http://figshare.com/
    Authors
    Peiliang Lou
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    BioASQvec Plus is an extended version of BioASQvec (http://bioasq.org/news/bioasq-releases-continuous-space-word-vectors-obtained-applying-word2vec-pubmed-abstracts) that takes advantage of a protein alias corpus retrieved from biological databases and biomedical publications. Not only does it contain a bigger corpus of bio-entity names, but it can also assign an equal representation to different names that correspond to the same entity. BioASQvec Plus is a generic set of word embeddings that can be applied to different biomedical text mining models to improve word representations.

  7. MultiClinSum Dataset: Summarization of Clinical Case Reports in English,...

    • zenodo.org
    zip
    Updated Apr 10, 2025
    Cite
    Salvador Lima López; Miguel Rodríguez Ortega; Eduard Rodríguez López; Martin Krallinger (2025). MultiClinSum Dataset: Summarization of Clinical Case Reports in English, Spanish, French and Portuguese [Dataset]. http://doi.org/10.5281/zenodo.15188952
    Dataset updated
    Apr 10, 2025
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Salvador Lima López; Miguel Rodríguez Ortega; Eduard Rodríguez López; Martin Krallinger
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    French
    Description

    MultiClinSum Shared Task Dataset

    MultiClinSum is a shared task about the automatic summarization of clinical case reports in English, Spanish, French and Portuguese held as part of the BioASQ workshop at CLEF 2025. The task relies on a corpus of manually selected full clinical case reports and their corresponding clinical case report summaries derived from case report publications written in the previously mentioned languages. In addition, participants are allowed to use any other data source available online as long as they report it.

    This version of the data contains the sample set: a small subset of 20 full-text documents and their summaries in English, meant to be used as a sample of the data that will be used in the task. Both the full texts and their summaries are .txt documents in UTF-8. They are stored in separate folders, and each pair has an almost identical filename, with the summaries having the suffix "_sum".
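
    A minimal sketch of how such pairs might be matched up is shown below. The folder names, the file names, and the helper pair_reports are hypothetical, and the sketch assumes that a summary's filename is exactly the full-text stem plus "_sum" (the description only says the names are almost identical), so the pattern may need adjusting against the real archive.

```python
import tempfile
from pathlib import Path


def pair_reports(fulltext_dir, summary_dir):
    """Map each full-text stem to (full text, summary) when both files exist."""
    pairs = {}
    for full_path in sorted(Path(fulltext_dir).glob("*.txt")):
        # Assumption: the summary is named "<stem>_sum.txt"
        summary_path = Path(summary_dir) / f"{full_path.stem}_sum.txt"
        if summary_path.exists():
            pairs[full_path.stem] = (
                full_path.read_text(encoding="utf-8"),
                summary_path.read_text(encoding="utf-8"),
            )
    return pairs


# Demonstration on a throwaway directory with made-up files.
with tempfile.TemporaryDirectory() as tmp:
    full_dir = Path(tmp) / "fulltext"
    sum_dir = Path(tmp) / "summaries"
    full_dir.mkdir()
    sum_dir.mkdir()
    (full_dir / "case_001.txt").write_text("Full report text.", encoding="utf-8")
    (sum_dir / "case_001_sum.txt").write_text("Short summary.", encoding="utf-8")
    (full_dir / "case_002.txt").write_text("Unpaired report.", encoding="utf-8")
    demo_pairs = pair_reports(full_dir, sum_dir)
```

    Full texts without a matching summary are skipped silently here; logging them instead would make missing or misnamed files easier to spot.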

    Resources:

    - MultiClinSum website

    - BioASQ website

    License

    This work is licensed under a Creative Commons Attribution 4.0 International License.

    Contact

    If you have any questions or suggestions, please contact us at:

    - Salvador Lima-López (

    Additional resources and corpora

    If you are interested in MultiClinSum, you might want to check out these corpora and resources:

    • DisTEMIST (Corpus of disease mentions and normalization to SNOMED CT)
    • MedProcNER (Corpus of clinical procedure mentions and normalization to SNOMED CT)
    • SympTEMIST (Corpus of clinical findings and normalization to SNOMED CT)
    • DrugTEMIST (Corpus of medication mentions)
    • CardioCCC (Corpus of diseases and medication mentions in cardiology texts)
    • PharmaCoNER (Corpus of medications, drugs, chemical substances, genes, proteins and vaccine mentions and normalization)
    • MEDDOPROF (Corpus of mentions of professions, occupations and working status and normalization)
    • MEDDOPLACE (Corpus of place-related entity mentions, including departments, nationalities or patient movements, etc., and normalization)
    • MEDDOCAN (Corpus of mentions of Personal Health Identifiers (PHI))
    • CANTEMIST (Corpus of cancer tumor morphology mentions and normalization)
    • CodiESP (Corpus of clinical case reports with assigned clinical codes from ICD10, Spanish version)
    • LivingNER (Corpus of mentions of species, including human/family members, pathogens, food, etc., and normalization to NCBI Taxonomy)
    • SPACCC-POS (Corpus of clinical case reports in Spanish annotated with POS-tags)
    • SPACCC-TOKEN (Corpus of clinical case reports in Spanish annotated with token-tags (word mention boundaries))
    • SPACCC-SPLIT (Corpus of clinical case reports in Spanish annotated with sentence boundary-tags)
    • MESINESP-2 (Corpus of manually indexed records with DeCS /MeSH terms comprising scientific literature abstracts, clinical trials, and patent abstracts)

  8. BioASQ-QA: A manually curated corpus for Biomedical Question Answering

    • zenodo.org
    json
    Updated Feb 20, 2023
    Cite
    Anastasia Krithara; Anastasios Nentidis; Konstantinos Bougiatiotis; Georgios Paliouras (2023). BioASQ-QA: A manually curated corpus for Biomedical Question Answering [Dataset]. http://doi.org/10.5281/zenodo.7655127
    Dataset updated
    Feb 20, 2023
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Anastasia Krithara; Anastasios Nentidis; Konstantinos Bougiatiotis; Georgios Paliouras
    Description

    The BioASQ question answering (QA) benchmark dataset contains questions in English, along with golden standard (reference) answers and related material. The dataset has been designed to reflect real information needs of biomedical experts and is therefore more realistic and challenging than most existing datasets. Furthermore, unlike most previous QA benchmarks that contain only exact answers, the BioASQ-QA dataset also includes ideal answers (in effect summaries), which are particularly useful for research on multi-document summarization. The dataset combines structured and unstructured data. The material linked with each question comprises documents and snippets, which are useful for Information Retrieval and Passage Retrieval experiments, as well as concepts that are useful in concept-to-text Natural Language Generation. Researchers working on paraphrasing and textual entailment can also measure the degree to which their methods improve the performance of biomedical QA systems. Last but not least, the dataset is continuously extended, as the BioASQ challenge is running and new data are generated.

  9. BioASQ-Task-B-Revised

    • huggingface.co
    Updated Mar 31, 2016
    Cite
    Bastien Hottelet (2016). BioASQ-Task-B-Revised [Dataset]. https://huggingface.co/datasets/BastienHot/BioASQ-Task-B-Revised
    Dataset updated
    Mar 31, 2016
    Authors
    Bastien Hottelet
    License

    Attribution 2.5 (CC BY 2.5): https://creativecommons.org/licenses/by/2.5/
    License information was derived automatically

    Description

    BastienHot/BioASQ-Task-B-Revised dataset hosted on Hugging Face and contributed by the HF Datasets community

  10. hotpotqa

    • huggingface.co
    Updated Aug 24, 2022
    Cite
    BEIR (2022). hotpotqa [Dataset]. https://huggingface.co/datasets/BeIR/hotpotqa
    Dataset updated
    Aug 24, 2022
    Dataset authored and provided by
    BEIR
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dataset Card for BEIR Benchmark

    Dataset Summary

    BEIR is a heterogeneous benchmark that has been built from 18 diverse datasets representing 9 information retrieval tasks:

    • Fact-checking: FEVER, Climate-FEVER, SciFact
    • Question-Answering: NQ, HotpotQA, FiQA-2018
    • Bio-Medical IR: TREC-COVID, BioASQ, NFCorpus
    • News Retrieval: TREC-NEWS, Robust04
    • Argument Retrieval: Touche-2020, ArguAna
    • Duplicate Question Retrieval: Quora, CqaDupstack
    • Citation-Prediction: SCIDOCS
    • Tweet…

    See the full description on the dataset page: https://huggingface.co/datasets/BeIR/hotpotqa.

  11. BioASQ-training13b

    • kaggle.com
    Updated Mar 29, 2025
    Cite
    kazzene (2025). BioASQ-training13b [Dataset]. https://www.kaggle.com/datasets/kazzene/bioasq-training13b
    Dataset updated
    Mar 29, 2025
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    kazzene
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset

    This dataset was created by kazzene

    Released under MIT

    Contents

  12. Machine reading compatible answers for BIOASQ 2018 Task B Training set

    • data.niaid.nih.gov
    • explore.openaire.eu
    Updated Jan 24, 2020
    Cite
    MA, YUE (2020). Machine reading compatible answers for BIOASQ 2018 Task B Training set [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_1346192
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    GRAU, BRIGITTE
    KAMATH, SANJAY
    MA, YUE
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Annotated files of BIOASQ 6B training set compatible with Brat tool.

  13. rag-mini-bioasq-with-metadata

    • huggingface.co
    Cite
    Enelpol, rag-mini-bioasq-with-metadata [Dataset]. https://huggingface.co/datasets/enelpol/rag-mini-bioasq-with-metadata
    Dataset authored and provided by
    Enelpol
    License

    Attribution 2.5 (CC BY 2.5): https://creativecommons.org/licenses/by/2.5/
    License information was derived automatically

    Description

    This dataset is an extension of the rag-mini-bioasq dataset. It differs in the text-corpus part, where metadata has been added for each passage. The metadata contains six separate categories, each in a dedicated column:

    • Year of the publication (publish_year)
    • Type of the publication (publish_type)
    • Country of the publication, often correlated with the homeland of the authors (country)
    • Number of pages (no_pages)
    • Authors (authors)
    • Keywords (keywords)

  14. BioASQ Sub-Corpus for the Pharmacology of Epilepsy (BioPepsy)

    • data.niaid.nih.gov
    Updated Sep 4, 2021
    Cite
    Müller, Bernd (2021). BioASQ Sub-Corpus for the Pharmacology of Epilepsy (BioPepsy) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4680825
    Dataset updated
    Sep 4, 2021
    Dataset authored and provided by
    Müller, Bernd
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    The sub corpus contains Standoff Annotations for Drug Names and Terms from Epilepsy Ontologies with their Aggregations Recognized in the 2021 BioASQ corpus.

    The terms for epilepsy ontologies are from NCBO BioPortal, namely from the ontologies EpSO, ESSO, EPILONT, EPISEM and FENICS:

    https://bioportal.bioontology.org/ontologies/EPSO

    https://bioportal.bioontology.org/ontologies/ESSO

    https://bioportal.bioontology.org/ontologies/EPILONT

    https://bioportal.bioontology.org/ontologies/EPISEM

    https://bioportal.bioontology.org/ontologies/FENICS

    The dictionary for the identification of drug names is derived from the DrugBank vocabulary available online at https://go.drugbank.com/releases/latest#open-data.

    The terms were identified using a custom implementation of a UIMA-based text mining workflow that annotates free text with the UIMA ConceptMapper. Further descriptions of this workflow can be found in the following publications:

    Bernd Müller, Alexandra Hagelstein: Beyond Metadata: Enriching life science publications in Livivo with semantic entities from the linked data cloud. SEMANTiCS (Posters, Demos, SuCCESS) 2016

    Bernd Müller, Alexandra Hagelstein, Thomas Gübitz: Life Science Ontologies in Literature Retrieval: A Comparison of Linked Data Sets for Use in Semantic Search on a Heterogeneous Corpus. EKAW (Satellite Events) 2016: 158-161

    Bernd Müller, Christoph Poley, Jana Pössel, Alexandra Hagelstein, Thomas Gübitz: LIVIVO - the Vertical Search Engine for Life Sciences. Datenbank-Spektrum 17(1): 29-34 (2017)

    Bernd Müller, Dietrich Rebholz-Schuhmann: Selected Approaches Ranking Contextual Term for the BioASQ Multi-label Classification (Task6a and 7a). PKDD/ECML Workshops (2) 2019: 569-580

    The file format is JSON. The file content is described as follows:

    bioasqepilepsy2021.json - All standoff annotations for each document in the 2021 BioASQ corpus

    aggepilepsy2021EPSOANDDrugNames.json - aggregation of frequencies for all standoff annotations in documents from the 2021 BioASQ corpus that contain terms from EpSO co-occurring with at least one drug name

    aggepilepsy2021ESSOANDDrugNames.json - aggregation of frequencies for all standoff annotations in documents from the 2021 BioASQ corpus that contain terms from ESSO co-occurring with at least one drug name

    aggepilepsy2021EPILONTANDDrugNames.json - aggregation of frequencies for all standoff annotations in documents from the 2021 BioASQ corpus that contain terms from EPILONT co-occurring with at least one drug name

    aggepilepsy2021EPISEMANDDrugNames.json - aggregation of frequencies for all standoff annotations in documents from the 2021 BioASQ corpus that contain terms from EPISEM co-occurring with at least one drug name

    aggepilepsy2021FENICSANDDrugNames.json - aggregation of frequencies for all standoff annotations in documents from the 2021 BioASQ corpus that contain terms from FENICS co-occurring with at least one drug name

    All JSON files should be importable into a MongoDB collection. Documents are identified by their PMIDs.

    Please cite this data as:

    Müller, Bernd. BioASQ Sub-Corpus for the Pharmacology of Epilepsy (BioPEpsy) 2021. ZENODO, 10.5281/zenodo.4680086

  15. Data from: BioASQ

    • opendatalab.com
    zip
    Updated Jan 1, 2022
    Cite
    NCSR Demokritos (2022). BioASQ [Dataset]. https://opendatalab.com/OpenDataLab/BioASQ
    Dataset updated
    Jan 1, 2022
    Dataset provided by
    Aristotle University of Thessaloniki
    NCSR Demokritos
    License

    Attribution 2.5 (CC BY 2.5): https://creativecommons.org/licenses/by/2.5/
    License information was derived automatically

    Description

    The BioASQ question answering (QA) benchmark dataset contains questions in English, along with golden standard (reference) answers and related material. The dataset has been designed to reflect real information needs of biomedical experts and is therefore more realistic and challenging than most existing datasets. Furthermore, unlike most previous QA benchmarks that contain only exact answers, the BioASQ-QA dataset also includes ideal answers (in effect summaries), which are particularly useful for research on multi-document summarization. The dataset combines structured and unstructured data. The materials linked with each question comprise documents and snippets, which are useful for Information Retrieval and Passage Retrieval experiments, as well as concepts that are useful in concept-to-text Natural Language Generation. Researchers working on paraphrasing and textual entailment can also measure the degree to which their methods improve the performance of biomedical QA systems. Last but not least, the dataset is continuously extended, as the BioASQ challenge is running and new data are generated.

  16. MESINESP2 Corpora: Annotated data for medical semantic indexing in Spanish

    • zenodo.org
    • data.niaid.nih.gov
    bin, tsv, zip
    Updated Oct 28, 2021
    Cite
    Luis Gasco; Martin Krallinger (2021). MESINESP2 Corpora: Annotated data for medical semantic indexing in Spanish [Dataset]. http://doi.org/10.5281/zenodo.4612275
    Dataset updated
    Oct 28, 2021
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Luis Gasco; Martin Krallinger
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Annotated corpora for MESINESP2 shared-task (Spanish BioASQ track, see https://temu.bsc.es/mesinesp2). BioASQ 2021 will be held at CLEF 2021 (scheduled in Bucharest, Romania in September) http://clef2021.clef-initiative.eu/

    Introduction:
    These corpora contain the data for each of the sub-tracks of MESINESP2 shared-task:

    • Track 1- Medical indexing:
      • Training set: It contains all Spanish records from the LILACS and IBECS databases at the Virtual Health Library (VHL) with a non-empty abstract written in Spanish. We have filtered out empty abstracts and non-Spanish abstracts. We have built the training dataset with the data crawled on 01/29/2021. This means that the data is a snapshot of that moment and may change over time, since LILACS and IBECS usually add or modify indexes after the first inclusion in the database. We distribute two different datasets:
        • Articles training set: This corpus contains the set of 237574 Spanish scientific papers in VHL that have at least one DeCS code assigned to them.
        • Full training set: This corpus contains the whole set of 249474 Spanish documents from VHL that have at least one DeCS code assigned to them.
      • Development set: We provide a development set manually indexed by expert annotators. This dataset includes 1065 articles annotated with DeCS by three expert indexers in this controlled vocabulary. The articles were initially indexed by 7 annotators; after analyzing the inter-annotator agreement among their annotations, we decided to select the 3 best ones, considering their annotations the valid ones to build the test set. From those 1065 records:
        • 213 articles were annotated by more than one annotator. We have selected the union of their annotations.
        • 852 articles were annotated by only one of the three selected annotators with better performance.
      • Test set: To be published
    • Track 2- Clinical trials:
      • Training set: The training dataset contains records from Registro Español de Estudios Clínicos (REEC). REEC doesn't provide documents with the title/abstract structure needed in BioASQ; for that reason we have built artificial abstracts based on the content available in the data crawled using the REEC API. Clinical trials are not indexed with DeCS terminology, so we have used as training data a set of 3592 clinical trials that were automatically annotated in the first edition of MESINESP and that were published as a Silver Standard outcome. Because the performance of the models used by the participants was variable, we have only selected predictions from runs with a MiF higher than 0.30, which corresponds to the submissions of the best three teams. We have selected the union of all codes assigned by those teams.
      • Development set: We provide a development set manually indexed by expert annotators. This dataset includes 147 clinical trials annotated with DeCS by seven expert indexers in this controlled vocabulary.
    • Track 3- Patents: To be published
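
    The silver-standard selection described for Track 2 (keep only runs above a MiF threshold, then take the union of their codes per document) can be illustrated with a small sketch. The run structure, the document identifiers, the placeholder codes, and the function name silver_labels are all hypothetical; this is not the organizers' actual pipeline.

```python
def silver_labels(runs, min_mif=0.30):
    """Union the codes predicted by every run whose MiF is strictly above min_mif.

    Each run is a dict with "mif" (float) and "predictions"
    (mapping of document id to a set of codes).
    """
    merged = {}
    for run in runs:
        if run["mif"] <= min_mif:
            continue  # discard runs at or below the threshold
        for doc_id, codes in run["predictions"].items():
            merged.setdefault(doc_id, set()).update(codes)
    return merged


# Made-up runs and placeholder codes, for illustration only.
RUNS = [
    {"mif": 0.42, "predictions": {"trial-1": {"C1", "C2"}}},
    {"mif": 0.35, "predictions": {"trial-1": {"C2", "C3"}, "trial-2": {"C4"}}},
    {"mif": 0.12, "predictions": {"trial-1": {"C9"}}},
]

labels = silver_labels(RUNS)
```

    Taking the union favors recall over precision, so labels built this way are noisier than the manually indexed development sets.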

    Files structure:

    MESINESP2_corpus.zip contains the corpora generated for the shared task. Content:

    • Subtrack1:
      • Train
        • training_set_track1_all.json: Full training set for sub-track 1.
        • training_set_track1_only_articles.json: Articles training set for sub-track 1.
      • Test
        • development_set_subtrack1.json: Manually annotated development set for sub-track 1.
    • Subtrack2:
      • Train
        • training_set_subtrack2.json: Training set for sub-track 2.
      • Test
        • development_set_subtrack2.json: Manually annotated development set for sub-track 2.
    • Subtrack3: This folder is empty. Data for sub-track 3 will be published soon.

    DeCS2020.tsv contains a DeCS table with the following structure:

    • DeCS code
    • Preferred descriptor (the preferred label in the Latin Spanish DeCS 2020 set)
    • List of synonyms (the descriptors and synonyms from Latin Spanish DeCS 2020, separated by pipes)
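
    A table with this three-column layout could be read along the following lines. This is a sketch under assumptions: it treats the file as tab-separated with no header row and no quoting, and the sample codes and terms are invented placeholders rather than real DeCS entries; read_decs is a hypothetical helper name.

```python
import csv
import io


def read_decs(handle):
    """Parse rows of (code, preferred descriptor, pipe-separated synonyms)."""
    table = {}
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 3:
            continue  # skip blank or malformed lines
        code, preferred, synonyms = row[0], row[1], row[2]
        table[code] = {
            "preferred": preferred,
            "synonyms": [term for term in synonyms.split("|") if term],
        }
    return table


# Invented sample content; the real file would be opened with
# open(path, encoding="utf-8", newline="") instead.
SAMPLE_TSV = "X1\tcefalea\tcefalea|dolor de cabeza\nX2\tfiebre\tfiebre\n"

decs = read_decs(io.StringIO(SAMPLE_TSV))
```

    If the real file turns out to start with a header row, that first row would need to be skipped before building the table.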

    DeCS2020.obo contains the *.obo file with the hierarchical relationships between DeCS descriptors.

    For further information, please visit https://temu.bsc.es/mesinesp2/ or email us at encargo-pln-life@bsc.es

  17. Z

    UIMA ConceptMapper Dictionaries for the Annotation of the 2021 BioASQ Corpus...

    • data.niaid.nih.gov
    Updated Sep 4, 2021
    Cite
    Müller, Bernd (2021). UIMA ConceptMapper Dictionaries for the Annotation of the 2021 BioASQ Corpus with Drug Names and Terms from Epilepsy Ontologies [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4683352
    Dataset updated
    Sep 4, 2021
    Dataset authored and provided by
    Müller, Bernd
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    The dictionary files for the annotation of free text with terms from epilepsy ontologies by the UIMA ConceptMapper are taken from NCBO BioPortal, namely from the ontologies EpSO, ESSO, EPILONT, EPISEM and FENICS:

    https://bioportal.bioontology.org/ontologies/EPSO

    https://bioportal.bioontology.org/ontologies/ESSO

    https://bioportal.bioontology.org/ontologies/EPILONT

    https://bioportal.bioontology.org/ontologies/EPISEM

    https://bioportal.bioontology.org/ontologies/FENICS

    The dictionary for the identification of drug names is derived from the DrugBank vocabulary available online at https://go.drugbank.com/releases/latest#open-data.

    Further descriptions of making use of the UIMA-based text mining workflow can be found in the following publications:

    Bernd Müller, Alexandra Hagelstein: Beyond Metadata: Enriching life science publications in Livivo with semantic entities from the linked data cloud. SEMANTiCS (Posters, Demos, SuCCESS) 2016

    Bernd Müller, Alexandra Hagelstein, Thomas Gübitz: Life Science Ontologies in Literature Retrieval: A Comparison of Linked Data Sets for Use in Semantic Search on a Heterogeneous Corpus. EKAW (Satellite Events) 2016: 158-161

    Bernd Müller, Christoph Poley, Jana Pössel, Alexandra Hagelstein, Thomas Gübitz: LIVIVO - the Vertical Search Engine for Life Sciences. Datenbank-Spektrum 17(1): 29-34 (2017)

    Bernd Müller, Dietrich Rebholz-Schuhmann: Selected Approaches Ranking Contextual Term for the BioASQ Multi-label Classification (Task6a and 7a). PKDD/ECML Workshops (2) 2019: 569-580

    The dictionary files are in particular:

    Dict_DrugNames.xml - constructed from the DrugBank vocabulary

    Dict_EpSO.xml - constructed from the EpSO ontology

    Dict_ESSO.xml - constructed from the ESSO ontology

    Dict_EPILONT.xml - constructed from the EPILONT ontology

    Dict_EPISEM.xml - constructed from the EPISEM ontology

    Dict_FENICS.xml - constructed from the FENICS ontology

    The dictionaries were used with the UIMA ConceptMapper for the annotation of the 2021 BioASQ corpus resulting in the BioASQ Sub-Corpus for the Pharmacology of Epilepsy (BioPepsy).

    Please cite this data as:

    Müller, Bernd. UIMA ConceptMapper Dictionaries for the Annotation of the 2021 BioASQ Corpus with Drug Names and Terms from Epilepsy Ontologies. ZENODO, 10.5281/zenodo.4683353

  18. f

    The MAP performances of our approach compared with the variants and...

    • plos.figshare.com
    xls
    Updated Jun 5, 2023
    Cite
    Yan Yan; Bo-Wen Zhang; Xu-Feng Li; Zhenhan Liu (2023). The MAP performances of our approach compared with the variants and classical IR models on BioASQ. [Dataset]. http://doi.org/10.1371/journal.pone.0242061.t001
    Available download formats: xls
    Dataset updated
    Jun 5, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Yan Yan; Bo-Wen Zhang; Xu-Feng Li; Zhenhan Liu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The MAP performances of our approach compared with the variants and classical IR models on BioASQ.
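
    For readers unfamiliar with the metric, the textbook definition of mean average precision (MAP) is sketched below. This is the standard formulation only; the exact variant used in the cited table (for example, any rank cutoff) is not stated here, and the function names are illustrative. Each ranked list is assumed to contain no duplicate documents.

```python
from typing import Iterable, Sequence, Set, Tuple


def average_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Average of precision@k over the ranks k at which a relevant item appears."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    # Relevant items that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(
    queries: Iterable[Tuple[Sequence[str], Set[str]]],
) -> float:
    """Mean of the per-query average precision values."""
    queries = list(queries)
    if not queries:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in queries) / len(queries)


if __name__ == "__main__":
    example = [
        (["a", "b", "c"], {"a", "c"}),  # AP = (1/1 + 2/3) / 2
        (["x", "y"], {"y"}),            # AP = 1/2
    ]
    print(mean_average_precision(example))
```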

  19. f

    Comparisons with BioASQ 2016 participants.

    • plos.figshare.com
    • figshare.com
    xls
    Updated Jun 1, 2023
    Cite
    Yan Yan; Bo-Wen Zhang; Xu-Feng Li; Zhenhan Liu (2023). Comparisons with BioASQ 2016 participants. [Dataset]. http://doi.org/10.1371/journal.pone.0242061.t005
    Available download formats: xls
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Yan Yan; Bo-Wen Zhang; Xu-Feng Li; Zhenhan Liu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Comparisons with BioASQ 2016 participants.

  20. F

    YESciEval Corpus

    • data.uni-hannover.de
    csv
    Updated May 28, 2025
    Cite
    TIB (2025). YESciEval Corpus [Dataset]. https://data.uni-hannover.de/lt/dataset/yescieval-corpus
    Available download formats: csv (82283594), csv (29470889)
    Dataset updated
    May 28, 2025
    Dataset authored and provided by
    TIB
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    YESciEval is a benchmark dataset for evaluating the robustness of Large Language Models (LLMs) as evaluators in scientific question answering (scienceQ&A). It features multi-domain and biomedical question-answering instances with both standard and adversarial variants. The dataset is part of the YESciEval framework, developed to support robust, transparent, and scalable evaluation using open-source LLMs.

    🔍 Overview

    YESciEval provides:

    • ScienceQ&A datasets generated using open-source LLMs
    • Adversarial variants designed using fine-grained rubric-based heuristics
    • Evaluation scores from multiple LLMs acting as evaluators (LLM-as-a-judge)

    📂 Dataset Structure

    The dataset is organized into two main parts:

    1. Benign (Original) ScienceQ&A Data

    Synthesized answers to research questions based on abstracts from relevant papers.

    • Sources:

      • ORKGSyn: Multidisciplinary questions from the Open Research Knowledge Graph
      • BioASQ: Biomedical questions from the BioASQ challenge
    • Format: For each Q&A instance:

      • question: research question
      • abstracts: relevant paper abstracts
      • answer: LLM-generated synthesis

    2. Adversarial ScienceQ&A Data

    Each benign answer is perturbed with two types of adversarial modifications:

    • Subtle Perturbations: Realistic, light-weight errors designed to be difficult for models to detect
    • Extreme Perturbations: Significant modifications that should be easily identifiable by robust evaluators

    Perturbations target nine qualitative rubrics:

    • Cohesion
    • Conciseness
    • Readability
    • Coherence
    • Integration
    • Relevancy
    • Correctness
    • Completeness
    • Informativeness

    Each rubric has a defined subtle and extreme perturbation heuristic.

    🧪 Evaluation Outputs

    Each dataset variant is scored using multiple open-source LLMs (e.g., LLaMA-3.1, Qwen-2.5, Mistral-Large) as evaluators. For each response, a JSON file records:

    • A 1–5 Likert score for each rubric
    • A rationale for the score
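
    The description above does not give the JSON schema, so the following is only a sketch of how such a record might be checked. It assumes a hypothetical layout in which each rubric name maps to an object with `score` and `rationale` keys; the real files may be organized differently.

```python
import json
from typing import Dict

# The nine rubric names come from the dataset description.
RUBRICS = (
    "Cohesion", "Conciseness", "Readability", "Coherence", "Integration",
    "Relevancy", "Correctness", "Completeness", "Informativeness",
)


def read_rubric_scores(payload: str) -> Dict[str, int]:
    """Parse one evaluation record and return {rubric: score}.

    Assumed (hypothetical) layout:
        {"Cohesion": {"score": 4, "rationale": "..."}, ...}
    """
    data = json.loads(payload)
    scores: Dict[str, int] = {}
    for rubric in RUBRICS:
        entry = data[rubric]  # KeyError if a rubric is missing
        score = entry["score"]
        # Likert scores are whole numbers from 1 to 5 (bool is rejected).
        if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
            raise ValueError(f"{rubric}: score {score!r} is not an integer in 1..5")
        if not str(entry.get("rationale", "")).strip():
            raise ValueError(f"{rubric}: missing rationale")
        scores[rubric] = score
    return scores
```

    Checking the range and the presence of a rationale up front makes it easier to spot malformed evaluator output before aggregating scores.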

    📊 Statistics

    ORKGSyn (33 disciplines)

    • Benign: 348 Q&A pairs
    • Subtle Adversarial: 348 Q&A pairs
    • Extreme Adversarial: 348 Q&A pairs

    BioASQ (Biomedical)

    • Benign: 73 Q&A pairs
    • Subtle Adversarial: 73 Q&A pairs
    • Extreme Adversarial: 73 Q&A pairs

    Total evaluations: ~45,000 across models and variants.
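
    The counts above can be tallied with a few lines of arithmetic. The sketch below only sums the figures listed in this description; how the ~45,000 total breaks down is not stated, so the per-evaluator figure assumes (hypothetically) that one evaluation is one rubric score for one answer by one evaluator model.

```python
# Q&A pairs per variant, as listed in the statistics above.
PAIRS_PER_VARIANT = {"ORKGSyn": 348, "BioASQ": 73}
VARIANTS = ("benign", "subtle adversarial", "extreme adversarial")
NUM_RUBRICS = 9  # nine qualitative rubrics

# Every source contributes the same number of pairs to each variant.
answers_total = sum(PAIRS_PER_VARIANT.values()) * len(VARIANTS)

# Assumption: each evaluator model scores every answer on every rubric.
scores_per_evaluator = answers_total * NUM_RUBRICS

print(answers_total)         # 1263
print(scores_per_evaluator)  # 11367
```

    Under that assumption, ~45,000 evaluations would correspond to roughly four evaluator models (45,000 / 11,367 ≈ 3.96), but the description does not confirm this breakdown.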

    🗃️ Access

    The dataset is also released on the YESciEval GitHub repository.

    A dedicated repository such as this one, containing only the dataset files, can simplify integration into benchmarking pipelines.

    📜 Citation

    If you use this dataset, please cite the following:

    D’Souza, J., Babaei Giglou, H., & Münch, Q. (2025). YESciEval: Robust LLM-as-a-Judge for Scientific Question Answering. Proceedings of ACL 2025. Preprint

    🛠️ License

    This dataset is released under the Creative Commons Attribution 4.0 International (CC BY 4.0).

    🙋‍♀️ Questions?

    For questions or collaborations, contact Jennifer D’Souza.

Cite
RAG Datasets (2023). rag-mini-bioasq [Dataset]. https://huggingface.co/datasets/rag-datasets/rag-mini-bioasq

rag-mini-bioasq

rag-datasets/rag-mini-bioasq

3 scholarly articles cite this dataset (View in Google Scholar)
Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Nov 20, 2023
Dataset authored and provided by
RAG Datasets
License

Attribution 2.5 (CC BY 2.5): https://creativecommons.org/licenses/by/2.5/
License information was derived automatically

Description

See here for an updated version without NaNs in text-corpus. In this Hugging Face discussion you can share what you used the dataset for. The dataset derives from http://participants-area.bioasq.org/Tasks/11b/trainingDataset/; we generated our own subset using generate.py.
