Attribution 2.5 (CC BY 2.5)https://creativecommons.org/licenses/by/2.5/
License information was derived automatically
See here for an updated version without NaNs in the text corpus. In this Hugging Face discussion you can share what you used the dataset for. Derived from http://participants-area.bioasq.org/Tasks/11b/trainingDataset/; we generated our own subset using generate.py.
BioASQ is a question answering dataset. Instances in the BioASQ dataset are composed of a question (Q), human-annotated answers (A), and the relevant contexts (C) (also called snippets).
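As a rough sketch, a single Q/A/C instance could be modeled as follows; the class and field names are illustrative assumptions, not the dataset's actual schema:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class BioASQInstance:
    """Illustrative Q/A/C container; names are assumptions, not the schema."""
    question: str          # Q: the natural-language question
    answers: List[str]     # A: human-annotated answers
    snippets: List[str]    # C: the relevant contexts (snippets)

# Toy instance with invented content.
inst = BioASQInstance(
    question="What causes disease X?",
    answers=["Gene Y"],
    snippets=["Toy snippet linking gene Y to disease X."],
)
```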
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Multilingual BioASQ-6B
We translate the BioASQ-6B English question-answering dataset to generate parallel French, Italian, and Spanish versions using the NLLB-200 3B-parameter model. For more information, read the original task description: http://bioasq.org/participate/challenges_year_6
We translate the body, snippets, ideal_answer and exact_answer fields. We have validated the quality of the ideal_answer field; however, the… See the full description on the dataset page: https://huggingface.co/datasets/HiTZ/Multilingual-BioASQ-6B.
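The per-field translation step can be sketched generically. The helper below and its `translate` callable are illustrative stand-ins (a real run would wrap an NLLB-200 model call); only the four field names come from the description above:

```python
def translate_example(example, translate):
    """Apply a translation callable to the translated fields listed above.

    `translate` stands in for an NLLB-200 model call (assumption); string
    fields are translated directly, list fields element-wise. Other fields
    are passed through unchanged.
    """
    out = dict(example)
    for field in ("body", "snippets", "ideal_answer", "exact_answer"):
        value = out.get(field)
        if isinstance(value, str):
            out[field] = translate(value)
        elif isinstance(value, list):
            out[field] = [translate(v) for v in value]
    return out

# Toy example using an uppercasing "translator" as a placeholder.
example = {"body": "what is tnf?", "snippets": ["tnf is a cytokine"], "id": "q1"}
print(translate_example(example, str.upper)["body"])  # WHAT IS TNF?
```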
BioASQ task B dataset. Where did this data come from?
Signed up for a BioASQ account at https://www.bioasq.org/, downloaded all the train and test sets for "Task B" (QA), and added them to a folder data/. Put this script in scripts/: https://gist.github.com/jmhb0/5a0789bf9c8605c5b95c63b72a1bbc8e
Note: this script performs deduplication; there is a lot of overlap between the train and test sets. Papers for attribution:
https://www.nature.com/articles/s41597-023-02068-4… See the full description on the dataset page: https://huggingface.co/datasets/jmhb/BioASQ-taskb.
Semantic subject indexing is the process of annotating documents with terms that describe what the document is about. It is often used in digital libraries to increase the findability of documents. Annotations are usually created by human domain experts, who select appropriate terms from a pre-specified set of available labels. In order to keep up with the vast amount of new publications, (semi-)automatic tools are being developed that assist the experts by suggesting terms for annotation. Unfortunately, due to legal restrictions, these tools often cannot use the full text or even the abstract of a publication. It is therefore desirable to explore techniques that work with the publications' metadata only; to some extent, merely using titles already achieves performance competitive with the full text. Yet the performance of automatic subject indexing methods is still far from the level of human annotators. Semantic subject indexing can be framed as a multi-label classification problem, where entry (i, j) of an indicator matrix is set to one if label j has been assigned to document i, and to zero otherwise. A major challenge is that the label space is usually very large (up to almost 30,000 labels), that the labels follow a power law, and that they are subject to concept drift (cf. Toepfer and Seifert).
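The indicator-matrix formulation can be sketched in a few lines of plain Python; the documents and labels below are invented toy examples, not drawn from either dataset:

```python
# Each document is represented by its set of assigned labels (toy data).
annotations = [
    {"Economics", "Labour market"},
    {"Labour market"},
    {"Economics", "Taxation"},
]

# Fix a label ordering, then set entry (i, j) to 1 iff label j is
# assigned to document i, and to 0 otherwise.
labels = sorted(set().union(*annotations))
indicator = [[1 if label in doc else 0 for label in labels]
             for doc in annotations]

print(labels)     # ['Economics', 'Labour market', 'Taxation']
print(indicator)  # [[1, 1, 0], [0, 1, 0], [1, 0, 1]]
```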
Here, we provide two large-scale datasets from the domain of economics and business studies (EconBiz) and biomedicine (PubMed) used in our recent study, which each come with the title and respective annotated labels. Do you find valuable insights in the data that can help understand the problem of semantic subject indexing better? Can you come up with clever ideas that push the state-of-the-art in automatic semantic subject indexing? We are excited to see what the collective power of data scientists can achieve on this task!
We compiled two English datasets from two digital libraries, EconBiz and PubMed.
EconBiz
The EconBiz dataset was compiled from a meta-data export provided by ZBW - Leibniz Information Centre for Economics from July 2017. We only retained those publications that were flagged as being in English and that were annotated with STW labels. Afterwards, we removed duplicates by checking for same title and labels. In total, approximately 1,064k publications remain. The annotations were selected by human annotators from the Standard Thesaurus Wirtschaft (STW), which contains approximately 6,000 labels.
PubMed
The PubMed dataset was compiled from the training set of the 5th BioASQ challenge on large-scale semantic subject indexing of biomedical articles, all of which are in English. Again, we removed duplicates by checking for the same title and labels. In total, approximately 12.8 million publications remain. The labels are so-called MeSH terms; in our data, approximately 28k of them are used.
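The deduplication rule applied to both datasets (dropping records with the same title and labels) can be sketched as follows; the helper name, the toy records, and the assumption that label order is irrelevant are mine:

```python
def deduplicate(records):
    """Keep the first record for each (title, labels) pair.

    Mirrors the dedup rule described above; treating the label list as an
    unordered set is an assumption. The records below are toy data.
    """
    seen = set()
    kept = []
    for rec in records:
        key = (rec["title"], tuple(sorted(rec["labels"])))
        if key not in seen:
            seen.add(key)
            kept.append(rec)
    return kept

docs = [
    {"title": "On inflation", "labels": ["Economics", "Prices"]},
    {"title": "On inflation", "labels": ["Prices", "Economics"]},  # duplicate
    {"title": "On taxation", "labels": ["Economics"]},
]
print(len(deduplicate(docs)))  # 2
```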
Fields
Both datasets share the same set of fields:
We would like to thank ZBW - Leibniz Information Centre for Economics for providing the EconBiz dataset, and in particular Tamara Pianos and Tobias Rebholz.
We would also like to thank the team behind the BioASQ challenge, from whose data we compiled the PubMed dataset. This organization is dedicated to advancing the state of the art in large-scale semantic indexing. It is currently running the 6th iteration of its challenge, which you should definitely check out!
The PubMed dataset has been gathered by BioASQ following the terms from the U.S. National Library of Medicine regarding public use and redistribution of the data.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
BioASQvec Plus is an extended version of BioASQvec (http://bioasq.org/news/bioasq-releases-continuous-space-word-vectors-obtained-applying-word2vec-pubmed-abstracts) that takes advantage of a protein alias corpus retrieved from biological databases and biomedical publications. Not only does it contain a bigger corpus of bio-entity names, but it can also assign an equal representation to different names that correspond to the same entity. BioASQvec Plus provides generic word embeddings that can be applied to different biomedical text-mining models to improve word representations.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
MultiClinSum is a shared task about the automatic summarization of clinical case reports in English, Spanish, French and Portuguese held as part of the BioASQ workshop at CLEF 2025. The task relies on a corpus of manually selected full clinical case reports and their corresponding clinical case report summaries derived from case report publications written in the previously mentioned languages. In addition, participants are allowed to use any other data source available online as long as they report it.
This version of the data contains the sample set: a small subset of 20 full-text documents and their summaries in English, meant as a sample of the data that will be used in the task. Both the full texts and their summaries are .txt documents in UTF-8. They are stored in separate folders, and each pair has an almost identical filename, with the summaries carrying the suffix "_sum".
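The folder layout can be traversed with a small helper like the following; the function and directory names are assumptions, and only the "_sum" suffix convention comes from the description above:

```python
from pathlib import Path

def pair_files(fulltext_dir, summary_dir):
    """Pair each full-text .txt file with the summary that shares its
    filename plus the "_sum" suffix, per the naming convention above."""
    pairs = {}
    for full in sorted(Path(fulltext_dir).glob("*.txt")):
        summary = Path(summary_dir) / f"{full.stem}_sum.txt"
        if summary.exists():
            pairs[full.name] = summary.name
    return pairs
```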
This work is licensed under a Creative Commons Attribution 4.0 International License.
If you have any questions or suggestions, please contact us at:
- Salvador Lima-López (
If you are interested in MultiClinSum, you might want to check out these corpora and resources:
The BioASQ question answering (QA) benchmark dataset contains questions in English, along with golden standard (reference) answers and related material. The dataset has been designed to reflect real information needs of biomedical experts and is therefore more realistic and challenging than most existing datasets. Furthermore, unlike most previous QA benchmarks that contain only exact answers, the BioASQ-QA dataset also includes ideal answers (in effect summaries), which are particularly useful for research on multi-document summarization. The dataset combines structured and unstructured data. The materials linked with each question comprise documents and snippets, which are useful for Information Retrieval and Passage Retrieval experiments, as well as concepts that are useful in concept-to-text Natural Language Generation. Researchers working on paraphrasing and textual entailment can also measure the degree to which their methods improve the performance of biomedical QA systems. Last but not least, the dataset is continuously extended, as the BioASQ challenge is running and new data are generated.
Attribution 2.5 (CC BY 2.5)https://creativecommons.org/licenses/by/2.5/
License information was derived automatically
BastienHot/BioASQ-Task-B-Revised dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Dataset Card for BEIR Benchmark
Dataset Summary
BEIR is a heterogeneous benchmark that has been built from 18 diverse datasets representing 9 information retrieval tasks:
Fact-checking: FEVER, Climate-FEVER, SciFact
Question-Answering: NQ, HotpotQA, FiQA-2018
Bio-Medical IR: TREC-COVID, BioASQ, NFCorpus
News Retrieval: TREC-NEWS, Robust04
Argument Retrieval: Touche-2020, ArguAna
Duplicate Question Retrieval: Quora, CqaDupstack
Citation-Prediction: SCIDOCS
Tweet… See the full description on the dataset page: https://huggingface.co/datasets/BeIR/hotpotqa.
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
This dataset was created by kazzene
Released under MIT
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Annotated files of the BioASQ 6B training set, compatible with the Brat annotation tool.
Attribution 2.5 (CC BY 2.5)https://creativecommons.org/licenses/by/2.5/
License information was derived automatically
This dataset is an extension of the rag-mini-bioasq dataset. It differs in the text-corpus part of the aforementioned set, where metadata was added for each passage. The metadata comprises six separate categories, each in a dedicated column:
Year of publication (publish_year)
Type of publication (publish_type)
Country of publication, often correlated with the authors' home country (country)
Number of pages (no_pages)
Authors (authors)
Keywords (keywords)
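The added columns allow simple metadata filtering of the passages; in the sketch below only the column names come from the list above, while all values are invented:

```python
# Toy passages mirroring the metadata columns described above.
corpus = [
    {"passage": "...", "publish_year": 2015, "country": "US", "no_pages": 8},
    {"passage": "...", "publish_year": 2019, "country": "DE", "no_pages": 12},
    {"passage": "...", "publish_year": 2021, "country": "ES", "no_pages": 6},
]

# Example: restrict a retrieval corpus to recent publications.
recent = [row for row in corpus if row["publish_year"] >= 2019]
print(len(recent))  # 2
```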
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This sub-corpus contains standoff annotations for drug names and for terms from epilepsy ontologies, together with their aggregations, recognized in the 2021 BioASQ corpus.
The terms for epilepsy ontologies are from NCBO BioPortal, namely from the ontologies EpSO, ESSO, EPILONT, EPISEM and FENICS:
https://bioportal.bioontology.org/ontologies/EPSO
https://bioportal.bioontology.org/ontologies/ESSO
https://bioportal.bioontology.org/ontologies/EPILONT
https://bioportal.bioontology.org/ontologies/EPISEM
https://bioportal.bioontology.org/ontologies/FENICS
The dictionary for the identification of drug names is derived from the DrugBank vocabulary available online at https://go.drugbank.com/releases/latest#open-data.
The terms were identified using a custom implementation of a UIMA-based text-mining workflow that annotates free text with the UIMA ConceptMapper. Further descriptions of this workflow can be found in the following publications:
Bernd Müller, Alexandra Hagelstein: Beyond Metadata: Enriching life science publications in Livivo with semantic entities from the linked data cloud. SEMANTiCS (Posters, Demos, SuCCESS) 2016
Bernd Müller, Alexandra Hagelstein, Thomas Gübitz: Life Science Ontologies in Literature Retrieval: A Comparison of Linked Data Sets for Use in Semantic Search on a Heterogeneous Corpus. EKAW (Satellite Events) 2016: 158-161
Bernd Müller, Christoph Poley, Jana Pössel, Alexandra Hagelstein, Thomas Gübitz: LIVIVO - the Vertical Search Engine for Life Sciences. Datenbank-Spektrum 17(1): 29-34 (2017)
Bernd Müller, Dietrich Rebholz-Schuhmann: Selected Approaches Ranking Contextual Term for the BioASQ Multi-label Classification (Task6a and 7a). PKDD/ECML Workshops (2) 2019: 569-580
The file format is JSON. The file content is described as follows:
bioasqepilepsy2021.json - All standoff annotations for each document in the 2021 BioASQ corpus
aggepilepsy2021EPSOANDDrugNames.json - aggregation of frequencies for all standoff annotations in documents from the 2021 BioASQ corpus that contain terms from EpSO co-occurring with at least one drug name
aggepilepsy2021ESSOANDDrugNames.json - aggregation of frequencies for all standoff annotations in documents from the 2021 BioASQ corpus that contain terms from ESSO co-occurring with at least one drug name
aggepilepsy2021EPILONTANDDrugNames.json - aggregation of frequencies for all standoff annotations in documents from the 2021 BioASQ corpus that contain terms from EPILONT co-occurring with at least one drug name
aggepilepsy2021EPISEMANDDrugNames.json - aggregation of frequencies for all standoff annotations in documents from the 2021 BioASQ corpus that contain terms from EPISEM co-occurring with at least one drug name
aggepilepsy2021FENICSANDDrugNames.json - aggregation of frequencies for all standoff annotations in documents from the 2021 BioASQ corpus that contain terms from FENICS co-occurring with at least one drug name
All JSON files can be imported into a MongoDB collection. Documents are identified by their PMIDs.
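Importing one of the files into MongoDB can look like this; the database and collection names are arbitrary, and the `--jsonArray` flag assumes the file holds a top-level JSON array (drop it if the file is newline-delimited JSON):

```shell
# Load the standoff annotations into a local MongoDB instance.
mongoimport --db bioasq --collection biopepsy \
    --file bioasqepilepsy2021.json --jsonArray
```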
Please cite this data as:
Müller, Bernd. BioASQ Sub-Corpus for the Pharmacology of Epilepsy (BioPEpsy) 2021. ZENODO, 10.5281/zenodo.4680086
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Annotated corpora for MESINESP2 shared-task (Spanish BioASQ track, see https://temu.bsc.es/mesinesp2). BioASQ 2021 will be held at CLEF 2021 (scheduled in Bucharest, Romania in September) http://clef2021.clef-initiative.eu/
Introduction:
These corpora contain the data for each of the sub-tracks of MESINESP2 shared-task:
File structure:
MESINESP2_corpus.zip contains the corpora generated for the shared task. Content:
DeCS2020.tsv contains a DeCS table built from the Latin and Spanish DeCS 2020 set (synonyms separated by pipes).
DeCS2020.obo contains the *.obo file with the hierarchical relationships between DeCS descriptors.
For further information, please visit https://temu.bsc.es/mesinesp2/ or email us at encargo-pln-life@bsc.es
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
The dictionary files for the annotation of free text with terms from epilepsy ontologies by the UIMA ConceptMapper are taken from NCBO BioPortal, namely from the ontologies EpSO, ESSO, EPILONT, EPISEM and FENICS:
https://bioportal.bioontology.org/ontologies/EPSO
https://bioportal.bioontology.org/ontologies/ESSO
https://bioportal.bioontology.org/ontologies/EPILONT
https://bioportal.bioontology.org/ontologies/EPISEM
https://bioportal.bioontology.org/ontologies/FENICS
The dictionary for the identification of drug names is derived from the DrugBank vocabulary available online at https://go.drugbank.com/releases/latest#open-data.
Further descriptions of making use of the UIMA-based text mining workflow can be found in the following publications:
Bernd Müller, Alexandra Hagelstein: Beyond Metadata: Enriching life science publications in Livivo with semantic entities from the linked data cloud. SEMANTiCS (Posters, Demos, SuCCESS) 2016
Bernd Müller, Alexandra Hagelstein, Thomas Gübitz: Life Science Ontologies in Literature Retrieval: A Comparison of Linked Data Sets for Use in Semantic Search on a Heterogeneous Corpus. EKAW (Satellite Events) 2016: 158-161
Bernd Müller, Christoph Poley, Jana Pössel, Alexandra Hagelstein, Thomas Gübitz: LIVIVO - the Vertical Search Engine for Life Sciences. Datenbank-Spektrum 17(1): 29-34 (2017)
Bernd Müller, Dietrich Rebholz-Schuhmann: Selected Approaches Ranking Contextual Term for the BioASQ Multi-label Classification (Task6a and 7a). PKDD/ECML Workshops (2) 2019: 569-580
The dictionary files are the following:
Dict_DrugNames.xml - constructed from the DrugBank vocabulary
Dict_EpSO.xml - constructed from the EpSO ontology
Dict_ESSO.xml - constructed from the ESSO ontology
Dict_EPILONT.xml - constructed from the EPILONT ontology
Dict_EPISEM.xml - constructed from the EPISEM ontology
Dict_FENICS.xml - constructed from the FENICS ontology
The dictionaries were used with the UIMA ConceptMapper for the annotation of the 2021 BioASQ corpus, resulting in the BioASQ Sub-Corpus for the Pharmacology of Epilepsy (BioPEpsy).
Please cite this data as:
Müller, Bernd. UIMA ConceptMapper Dictionaries for the Annotation of the 2021 BioASQ Corpus with Drug Names and Terms from Epilepsy Ontologies. ZENODO, 10.5281/zenodo.4683353
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The MAP performance of our approach compared with its variants and with classical IR models on BioASQ.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Comparisons with BioASQ 2016 participants.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
YESciEval is a benchmark dataset for evaluating the robustness of Large Language Models (LLMs) as evaluators in scientific question answering (science Q&A). It features multi-domain and biomedical question-answering instances with both standard and adversarial variants. The dataset is part of the YESciEval framework, developed to support robust, transparent, and scalable evaluation using open-source LLMs.
YESciEval provides:
The dataset is organized into two main parts:
Synthesized answers to research questions based on abstracts from relevant papers.
Sources:
Format: for each Q&A instance:
question: the research question
abstracts: relevant paper abstracts
answer: the LLM-generated synthesis
Each benign answer is perturbed with two types of adversarial modifications:
Perturbations target nine qualitative rubrics: Cohesion, Conciseness, Readability, Coherence, Integration, Relevancy, Correctness, Completeness, and Informativeness.
Each rubric has a defined subtle and extreme perturbation heuristic.
Each dataset variant is scored using multiple open-source LLMs (e.g., LLaMA-3.1, Qwen-2.5, Mistral-Large) as evaluators. For each response, a JSON file records:
Total evaluations: ~45,000 across models and variants.
The dataset is also released on the YESciEval GitHub repository.
A dedicated repository such as this one, containing only the dataset files, simplifies integration into benchmarking pipelines.
If you use this dataset, please cite the following:
D’Souza, J., Babaei Giglou, H., & Münch, Q. (2025). YESciEval: Robust LLM-as-a-Judge for Scientific Question Answering. Proceedings of ACL 2025. Preprint
This dataset is released under the Creative Commons Attribution 4.0 International (CC BY 4.0).
For questions or collaborations, contact Jennifer D’Souza.