Attribution 2.5 (CC BY 2.5)https://creativecommons.org/licenses/by/2.5/
License information was derived automatically
See here for an updated version without NaNs in the text corpus. In this Hugging Face discussion you can share what you used the dataset for. Derived from http://participants-area.bioasq.org/Tasks/11b/trainingDataset/; we generated our own subset using generate.py.
BioASQ is a question answering dataset. Instances in the BioASQ dataset are composed of a question (Q), human-annotated answers (A), and the relevant contexts (C) (also called snippets).
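As a rough sketch, a single Q/A/C instance could be modeled as follows; the class and field names are illustrative assumptions, not the dataset's actual schema:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class BioASQInstance:
    """Illustrative Q/A/C container; names are assumptions, not the schema."""
    question: str          # Q: the natural-language question
    answers: List[str]     # A: human-annotated answers
    snippets: List[str]    # C: the relevant contexts (snippets)

# Toy instance with invented content.
inst = BioASQInstance(
    question="What causes disease X?",
    answers=["Gene Y"],
    snippets=["Toy snippet linking gene Y to disease X."],
)
```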
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Multilingual BioASQ-6B
We translate the BioASQ-6B English question-answering dataset to generate parallel French, Italian, and Spanish versions using the NLLB-200 3B-parameter model. For more information, read the original task description: http://bioasq.org/participate/challenges_year_6
We translate the body, snippets, ideal_answer and exact_answer fields. We have validated the quality of the ideal_answer field; however, the… See the full description on the dataset page: https://huggingface.co/datasets/HiTZ/Multilingual-BioASQ-6B.
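The per-field translation step can be sketched generically. The helper below and its `translate` callable are illustrative stand-ins (a real run would wrap an NLLB-200 model call); only the four field names come from the description above:

```python
def translate_example(example, translate):
    """Apply a translation callable to the translated fields listed above.

    `translate` stands in for an NLLB-200 model call (assumption); string
    fields are translated directly, list fields element-wise. Other fields
    are passed through unchanged.
    """
    out = dict(example)
    for field in ("body", "snippets", "ideal_answer", "exact_answer"):
        value = out.get(field)
        if isinstance(value, str):
            out[field] = translate(value)
        elif isinstance(value, list):
            out[field] = [translate(v) for v in value]
    return out

# Toy example using an uppercasing "translator" as a placeholder.
example = {"body": "what is tnf?", "snippets": ["tnf is a cytokine"], "id": "q1"}
print(translate_example(example, str.upper)["body"])  # WHAT IS TNF?
```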
BioASQ task B dataset. Where did this data come from?
Signed up for a BioASQ account at https://www.bioasq.org/, downloaded all the train and test sets for "Task B" (QA), and added them to a folder data/. Put this script in scripts/: https://gist.github.com/jmhb0/5a0789bf9c8605c5b95c63b72a1bbc8e
Note: this script performs deduplication; there is a lot of overlap between the train and test sets. Papers for attribution:
https://www.nature.com/articles/s41597-023-02068-4… See the full description on the dataset page: https://huggingface.co/datasets/jmhb/BioASQ-taskb.
Semantic subject indexing is the process of annotating documents with terms that describe what the document is about. It is often used in digital libraries to increase the findability of documents. Annotations are usually created by human domain experts, who select appropriate terms from a pre-specified set of available labels. In order to keep up with the vast amount of new publications, (semi-)automatic tools are being developed that assist the experts by suggesting terms for annotation. Unfortunately, due to legal restrictions, these tools often cannot use the full text or even the abstract of a publication. It is therefore desirable to explore techniques that work with the publications' metadata only; to some extent, merely using titles already achieves performance competitive with the full text. Yet the performance of automatic subject indexing methods is still far from the level of human annotators. Semantic subject indexing can be framed as a multi-label classification problem, where entry (i, j) of an indicator matrix is set to one if label j has been assigned to document i, and to zero otherwise. A major challenge is that the label space is usually very large (up to almost 30,000 labels), that the labels follow a power law, and that they are subject to concept drift (cf. Toepfer and Seifert).
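The indicator-matrix formulation can be sketched in a few lines of plain Python; the documents and labels below are invented toy examples, not drawn from either dataset:

```python
# Each document is represented by its set of assigned labels (toy data).
annotations = [
    {"Economics", "Labour market"},
    {"Labour market"},
    {"Economics", "Taxation"},
]

# Fix a label ordering, then set entry (i, j) to 1 iff label j is
# assigned to document i, and to 0 otherwise.
labels = sorted(set().union(*annotations))
indicator = [[1 if label in doc else 0 for label in labels]
             for doc in annotations]

print(labels)     # ['Economics', 'Labour market', 'Taxation']
print(indicator)  # [[1, 1, 0], [0, 1, 0], [1, 0, 1]]
```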
Here, we provide two large-scale datasets from the domain of economics and business studies (EconBiz) and biomedicine (PubMed) used in our recent study, which each come with the title and respective annotated labels. Do you find valuable insights in the data that can help understand the problem of semantic subject indexing better? Can you come up with clever ideas that push the state-of-the-art in automatic semantic subject indexing? We are excited to see what the collective power of data scientists can achieve on this task!
We compiled two English datasets from two digital libraries, EconBiz and PubMed.
EconBiz
The EconBiz dataset was compiled from a meta-data export provided by ZBW - Leibniz Information Centre for Economics from July 2017. We only retained those publications that were flagged as being in English and that were annotated with STW labels. Afterwards, we removed duplicates by checking for same title and labels. In total, approximately 1,064k publications remain. The annotations were selected by human annotators from the Standard Thesaurus Wirtschaft (STW), which contains approximately 6,000 labels.
PubMed
The PubMed dataset was compiled from the training set of the 5th BioASQ challenge on large-scale semantic subject indexing of biomedical articles, all of which are in English. Again, we removed duplicates by checking for the same title and labels. In total, approximately 12.8 million publications remain. The labels are so-called MeSH terms; in our data, approximately 28k of them are used.
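The deduplication rule applied to both datasets (dropping records with the same title and labels) can be sketched as follows; the helper name, the toy records, and the assumption that label order is irrelevant are mine:

```python
def deduplicate(records):
    """Keep the first record for each (title, labels) pair.

    Mirrors the dedup rule described above; treating the label list as an
    unordered set is an assumption. The records below are toy data.
    """
    seen = set()
    kept = []
    for rec in records:
        key = (rec["title"], tuple(sorted(rec["labels"])))
        if key not in seen:
            seen.add(key)
            kept.append(rec)
    return kept

docs = [
    {"title": "On inflation", "labels": ["Economics", "Prices"]},
    {"title": "On inflation", "labels": ["Prices", "Economics"]},  # duplicate
    {"title": "On taxation", "labels": ["Economics"]},
]
print(len(deduplicate(docs)))  # 2
```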
Fields
Both datasets share the same set of fields:
We would like to thank ZBW - Leibniz Information Centre for Economics for providing the EconBiz dataset, and in particular Tamara Pianos and Tobias Rebholz.
We would also like to thank the team behind the BioASQ challenge, from whose data we compiled the PubMed dataset. This organization is dedicated to advancing the state of the art in large-scale semantic indexing. It is currently running the 6th iteration of its challenge, which you should definitely check out!
The PubMed dataset has been gathered by BioASQ following the terms from the U.S. National Library of Medicine regarding public use and redistribution of the data.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
BioASQvec Plus is an extended version of BioASQvec (http://bioasq.org/news/bioasq-releases-continuous-space-word-vectors-obtained-applying-word2vec-pubmed-abstracts) that takes advantage of a protein alias corpus retrieved from biological databases and biomedical publications. Not only does it contain a bigger corpus of bio-entity names, but it can also assign an equal representation to different names that correspond to the same entity. BioASQvec Plus provides generic word embeddings that can be applied to different biomedical text-mining models to improve word representations.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
MultiClinSum is a shared task about the automatic summarization of clinical case reports in English, Spanish, French and Portuguese held as part of the BioASQ workshop at CLEF 2025. The task relies on a corpus of manually selected full clinical case reports and their corresponding clinical case report summaries derived from case report publications written in the previously mentioned languages. In addition, participants are allowed to use any other data source available online as long as they report it.
This version of the data contains the sample set: a small subset of 20 full-text documents and their summaries in English, meant as a sample of the data that will be used in the task. Both the full texts and their summaries are .txt documents in UTF-8. They are stored in separate folders, and each pair has an almost identical filename, with the summaries carrying the suffix "_sum".
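The folder layout can be traversed with a small helper like the following; the function and directory names are assumptions, and only the "_sum" suffix convention comes from the description above:

```python
from pathlib import Path

def pair_files(fulltext_dir, summary_dir):
    """Pair each full-text .txt file with the summary that shares its
    filename plus the "_sum" suffix, per the naming convention above."""
    pairs = {}
    for full in sorted(Path(fulltext_dir).glob("*.txt")):
        summary = Path(summary_dir) / f"{full.stem}_sum.txt"
        if summary.exists():
            pairs[full.name] = summary.name
    return pairs
```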
This work is licensed under a Creative Commons Attribution 4.0 International License.
If you have any questions or suggestions, please contact us at:
- Salvador Lima-López (
If you are interested in MultiClinSum, you might want to check out these corpora and resources:
The BioASQ question answering (QA) benchmark dataset contains questions in English, along with golden standard (reference) answers and related material. The dataset has been designed to reflect real information needs of biomedical experts and is therefore more realistic and challenging than most existing datasets. Furthermore, unlike most previous QA benchmarks that contain only exact answers, the BioASQ-QA dataset also includes ideal answers (in effect summaries), which are particularly useful for research on multi-document summarization. The dataset combines structured and unstructured data. The materials linked with each question comprise documents and snippets, which are useful for Information Retrieval and Passage Retrieval experiments, as well as concepts that are useful in concept-to-text Natural Language Generation. Researchers working on paraphrasing and textual entailment can also measure the degree to which their methods improve the performance of biomedical QA systems. Last but not least, the dataset is continuously extended, as the BioASQ challenge is running and new data are generated.
Attribution 2.5 (CC BY 2.5)https://creativecommons.org/licenses/by/2.5/
License information was derived automatically
BastienHot/BioASQ-Task-B-Revised dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Dataset Card for BEIR Benchmark
Dataset Summary
BEIR is a heterogeneous benchmark that has been built from 18 diverse datasets representing 9 information retrieval tasks:
Fact-checking: FEVER, Climate-FEVER, SciFact
Question-Answering: NQ, HotpotQA, FiQA-2018
Bio-Medical IR: TREC-COVID, BioASQ, NFCorpus
News Retrieval: TREC-NEWS, Robust04
Argument Retrieval: Touche-2020, ArguAna
Duplicate Question Retrieval: Quora, CqaDupstack
Citation-Prediction: SCIDOCS
Tweet… See the full description on the dataset page: https://huggingface.co/datasets/BeIR/hotpotqa.
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
This dataset was created by kazzene
Released under MIT
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Annotated files of the BioASQ 6B training set, compatible with the Brat annotation tool.
Attribution 2.5 (CC BY 2.5)https://creativecommons.org/licenses/by/2.5/
License information was derived automatically
This dataset is an extension of the rag-mini-bioasq dataset. It differs in the text-corpus part of the aforementioned set, where metadata was added for each passage. The metadata comprises six separate categories, each in a dedicated column:
Year of publication (publish_year)
Type of publication (publish_type)
Country of publication, often correlated with the authors' home country (country)
Number of pages (no_pages)
Authors (authors)
Keywords (keywords)
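The added columns allow simple metadata filtering of the passages; in the sketch below only the column names come from the list above, while all values are invented:

```python
# Toy passages mirroring the metadata columns described above.
corpus = [
    {"passage": "...", "publish_year": 2015, "country": "US", "no_pages": 8},
    {"passage": "...", "publish_year": 2019, "country": "DE", "no_pages": 12},
    {"passage": "...", "publish_year": 2021, "country": "ES", "no_pages": 6},
]

# Example: restrict a retrieval corpus to recent publications.
recent = [row for row in corpus if row["publish_year"] >= 2019]
print(len(recent))  # 2
```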
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This sub-corpus contains standoff annotations for drug names and for terms from epilepsy ontologies, together with their aggregations, recognized in the 2021 BioASQ corpus.
The terms for epilepsy ontologies are from NCBO BioPortal, namely from the ontologies EpSO, ESSO, EPILONT, EPISEM and FENICS:
https://bioportal.bioontology.org/ontologies/EPSO
https://bioportal.bioontology.org/ontologies/ESSO
https://bioportal.bioontology.org/ontologies/EPILONT
https://bioportal.bioontology.org/ontologies/EPISEM
https://bioportal.bioontology.org/ontologies/FENICS
The dictionary for the identification of drug names is derived from the DrugBank vocabulary available online at https://go.drugbank.com/releases/latest#open-data.
The terms were identified using a custom implementation of a UIMA-based text-mining workflow that annotates free text with the UIMA ConceptMapper. Further descriptions of this workflow can be found in the following publications:
Bernd Müller, Alexandra Hagelstein: Beyond Metadata: Enriching life science publications in Livivo with semantic entities from the linked data cloud. SEMANTiCS (Posters, Demos, SuCCESS) 2016
Bernd Müller, Alexandra Hagelstein, Thomas Gübitz: Life Science Ontologies in Literature Retrieval: A Comparison of Linked Data Sets for Use in Semantic Search on a Heterogeneous Corpus. EKAW (Satellite Events) 2016: 158-161
Bernd Müller, Christoph Poley, Jana Pössel, Alexandra Hagelstein, Thomas Gübitz: LIVIVO - the Vertical Search Engine for Life Sciences. Datenbank-Spektrum 17(1): 29-34 (2017)
Bernd Müller, Dietrich Rebholz-Schuhmann: Selected Approaches Ranking Contextual Term for the BioASQ Multi-label Classification (Task6a and 7a). PKDD/ECML Workshops (2) 2019: 569-580
The file format is JSON. The file content is described as follows:
bioasqepilepsy2021.json - All standoff annotations for each document in the 2021 BioASQ corpus
aggepilepsy2021EPSOANDDrugNames.json - aggregation of frequencies for all standoff annotations in documents from the 2021 BioASQ corpus that contain terms from EpSO co-occurring with at least one drug name
aggepilepsy2021ESSOANDDrugNames.json - aggregation of frequencies for all standoff annotations in documents from the 2021 BioASQ corpus that contain terms from ESSO co-occurring with at least one drug name
aggepilepsy2021EPILONTANDDrugNames.json - aggregation of frequencies for all standoff annotations in documents from the 2021 BioASQ corpus that contain terms from EPILONT co-occurring with at least one drug name
aggepilepsy2021EPISEMANDDrugNames.json - aggregation of frequencies for all standoff annotations in documents from the 2021 BioASQ corpus that contain terms from EPISEM co-occurring with at least one drug name
aggepilepsy2021FENICSANDDrugNames.json - aggregation of frequencies for all standoff annotations in documents from the 2021 BioASQ corpus that contain terms from FENICS co-occurring with at least one drug name
All JSON files can be imported into a MongoDB collection. Documents are identified by their PMIDs.
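Importing one of the files into MongoDB can look like this; the database and collection names are arbitrary, and the `--jsonArray` flag assumes the file holds a top-level JSON array (drop it if the file is newline-delimited JSON):

```shell
# Load the standoff annotations into a local MongoDB instance.
mongoimport --db bioasq --collection biopepsy \
    --file bioasqepilepsy2021.json --jsonArray
```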
Please cite this data as:
Müller, Bernd. BioASQ Sub-Corpus for the Pharmacology of Epilepsy (BioPEpsy) 2021. ZENODO, 10.5281/zenodo.4680086
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Annotated corpora for MESINESP2 shared-task (Spanish BioASQ track, see https://temu.bsc.es/mesinesp2). BioASQ 2021 will be held at CLEF 2021 (scheduled in Bucharest, Romania in September) http://clef2021.clef-initiative.eu/
Introduction:
These corpora contain the data for each of the sub-tracks of MESINESP2 shared-task:
File structure:
MESINESP2_corpus.zip contains the corpora generated for the shared task. Content:
DeCS2020.tsv contains a DeCS table built from the Latin and Spanish DeCS 2020 set (synonyms separated by pipes).
DeCS2020.obo contains the *.obo file with the hierarchical relationships between DeCS descriptors.
For further information, please visit https://temu.bsc.es/mesinesp2/ or email us at encargo-pln-life@bsc.es
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
The dictionary files for the annotation of free text with terms from epilepsy ontologies by the UIMA ConceptMapper are taken from NCBO BioPortal, namely from the ontologies EpSO, ESSO, EPILONT, EPISEM and FENICS:
https://bioportal.bioontology.org/ontologies/EPSO
https://bioportal.bioontology.org/ontologies/ESSO
https://bioportal.bioontology.org/ontologies/EPILONT
https://bioportal.bioontology.org/ontologies/EPISEM
https://bioportal.bioontology.org/ontologies/FENICS
The dictionary for the identification of drug names is derived from the DrugBank vocabulary available online at https://go.drugbank.com/releases/latest#open-data.
Further descriptions of making use of the UIMA-based text mining workflow can be found in the following publications:
Bernd Müller, Alexandra Hagelstein: Beyond Metadata: Enriching life science publications in Livivo with semantic entities from the linked data cloud. SEMANTiCS (Posters, Demos, SuCCESS) 2016
Bernd Müller, Alexandra Hagelstein, Thomas Gübitz: Life Science Ontologies in Literature Retrieval: A Comparison of Linked Data Sets for Use in Semantic Search on a Heterogeneous Corpus. EKAW (Satellite Events) 2016: 158-161
Bernd Müller, Christoph Poley, Jana Pössel, Alexandra Hagelstein, Thomas Gübitz: LIVIVO - the Vertical Search Engine for Life Sciences. Datenbank-Spektrum 17(1): 29-34 (2017)
Bernd Müller, Dietrich Rebholz-Schuhmann: Selected Approaches Ranking Contextual Term for the BioASQ Multi-label Classification (Task6a and 7a). PKDD/ECML Workshops (2) 2019: 569-580
The dictionary files are the following:
Dict_DrugNames.xml - constructed from the DrugBank vocabulary
Dict_EpSO.xml - constructed from the EpSO ontology
Dict_ESSO.xml - constructed from the ESSO ontology
Dict_EPILONT.xml - constructed from the EPILONT ontology
Dict_EPISEM.xml - constructed from the EPISEM ontology
Dict_FENICS.xml - constructed from the FENICS ontology
The dictionaries were used with the UIMA ConceptMapper for the annotation of the 2021 BioASQ corpus, resulting in the BioASQ Sub-Corpus for the Pharmacology of Epilepsy (BioPEpsy).
Please cite this data as:
Müller, Bernd. UIMA ConceptMapper Dictionaries for the Annotation of the 2021 BioASQ Corpus with Drug Names and Terms from Epilepsy Ontologies. ZENODO, 10.5281/zenodo.4683353
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The MAP performance of our approach compared with its variants and with classical IR models on BioASQ.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Comparisons with BioASQ 2016 participants.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
YESciEval is a benchmark dataset for evaluating the robustness of Large Language Models (LLMs) as evaluators in scientific question answering (science Q&A). It features multi-domain and biomedical question-answering instances with both standard and adversarial variants. The dataset is part of the YESciEval framework, developed to support robust, transparent, and scalable evaluation using open-source LLMs.
YESciEval provides:
The dataset is organized into two main parts:
Synthesized answers to research questions based on abstracts from relevant papers.
Sources:
Format: for each Q&A instance:
question: the research question
abstracts: relevant paper abstracts
answer: the LLM-generated synthesis
Each benign answer is perturbed with two types of adversarial modifications:
Perturbations target nine qualitative rubrics: Cohesion, Conciseness, Readability, Coherence, Integration, Relevancy, Correctness, Completeness, and Informativeness.
Each rubric has a defined subtle and extreme perturbation heuristic.
Each dataset variant is scored using multiple open-source LLMs (e.g., LLaMA-3.1, Qwen-2.5, Mistral-Large) as evaluators. For each response, a JSON file records:
Total evaluations: ~45,000 across models and variants.
The dataset is also released on the YESciEval GitHub repository.
A dedicated repository such as this one, containing only the dataset files, simplifies integration into benchmarking pipelines.
If you use this dataset, please cite the following:
D’Souza, J., Babaei Giglou, H., & Münch, Q. (2025). YESciEval: Robust LLM-as-a-Judge for Scientific Question Answering. Proceedings of ACL 2025. Preprint
This dataset is released under the Creative Commons Attribution 4.0 International (CC BY 4.0).
For questions or collaborations, contact Jennifer D’Souza.