54 datasets found

ebm_pico
huggingface.co
Updated Mar 14, 2023
Cite
BigScience Biomedical Datasets (2023). ebm_pico [Dataset]. https://huggingface.co/datasets/bigbio/ebm_pico
Explore at:
Dataset updated
Mar 14, 2023
Dataset authored and provided by
BigScience Biomedical Datasets
License
https://choosealicense.com/licenses/unknown/
Description
This corpus release contains 4,993 abstracts annotated with (P)articipants, (I)nterventions, and (O)utcomes. Training labels are sourced from AMT workers and aggregated to reduce noise. Test labels are collected from medical professionals.
pubmed_qa
huggingface.co
Updated Mar 3, 2023
Cite
BigScience Biomedical Datasets (2023). pubmed_qa [Dataset]. https://huggingface.co/datasets/bigbio/pubmed_qa
Explore at:
Dataset updated
Mar 3, 2023
Dataset authored and provided by
BigScience Biomedical Datasets
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
PubMedQA is a novel biomedical question answering (QA) dataset collected from PubMed abstracts. The task of PubMedQA is to answer research biomedical questions with yes/no/maybe using the corresponding abstracts. PubMedQA has 1k expert-annotated (PQA-L), 61.2k unlabeled (PQA-U) and 211.3k artificially generated QA instances (PQA-A).

Each PubMedQA instance is composed of: (1) a question which is either an existing research article title or derived from one, (2) a context which is the corresponding PubMed abstract without its conclusion, (3) a long answer, which is the conclusion of the abstract and, presumably, answers the research question, and (4) a yes/no/maybe answer which summarizes the conclusion.
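
The four-part instance described above can be sketched as a small record type. This is an illustrative sketch only: the class name, field names, and example text are assumptions made for this example, not the schema used by the published dataset.

```python
from dataclasses import dataclass

# The three answer labels named in the description.
ALLOWED_DECISIONS = ("yes", "no", "maybe")


@dataclass(frozen=True)
class PubMedQAInstance:
    """Hypothetical container mirroring the four parts of an instance."""

    question: str        # an article title, or a question derived from one
    context: str         # the abstract with its conclusion removed
    long_answer: str     # the conclusion of the abstract
    final_decision: str  # yes/no/maybe summary of the conclusion

    def __post_init__(self) -> None:
        # Reject labels outside the yes/no/maybe scheme.
        if self.final_decision not in ALLOWED_DECISIONS:
            raise ValueError(f"unexpected decision: {self.final_decision!r}")


# Made-up placeholder text, not taken from the dataset.
example = PubMedQAInstance(
    question="Does intervention X change outcome Y?",
    context="Background, methods and results of a hypothetical study.",
    long_answer="The evidence for an effect of X on Y is inconclusive.",
    final_decision="maybe",
)
```

Validating the label at construction time keeps malformed rows from silently reaching a training loop.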

PubMedQA is the first QA dataset where reasoning over biomedical research texts, especially their quantitative contents, is required to answer the questions.

PubMedQA comprises three subsets: (1) PubMedQA Labeled (PQA-L): 1k manually annotated yes/no/maybe QA instances collected from PubMed articles. (2) PubMedQA Artificial (PQA-A): 211.3k PubMed articles with questions automatically generated from the statement titles and yes/no answer labels generated using a simple heuristic. (3) PubMedQA Unlabeled (PQA-U): 61.2k unlabeled context-question pairs collected from PubMed articles.
an_em
huggingface.co
Updated Aug 22, 2023
Cite
BigScience Biomedical Datasets (2023). an_em [Dataset]. https://huggingface.co/datasets/bigbio/an_em
Explore at:
Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
Dataset updated
Aug 22, 2023
Dataset authored and provided by
BigScience Biomedical Datasets
License
Attribution-ShareAlike 3.0 (CC BY-SA 3.0): https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
Description
The AnEM corpus is a domain- and species-independent resource manually annotated for anatomical entity mentions using a fine-grained classification system. The corpus consists of 500 documents (over 90,000 words) selected randomly from citation abstracts and full-text papers, with the aim of making the corpus representative of the entire available biomedical scientific literature. The annotation covers mentions of both healthy and pathological anatomical entities and contains over 3,000 annotated mentions.
medmentions
huggingface.co
opendatalab.com
Updated Feb 25, 2019
Cite
BigScience Biomedical Datasets (2019). medmentions [Dataset]. https://huggingface.co/datasets/bigbio/medmentions
Explore at:
Dataset updated
Feb 25, 2019
Dataset authored and provided by
BigScience Biomedical Datasets
License
https://choosealicense.com/licenses/cc0-1.0/
Description
MedMentions is a new manually annotated resource for the recognition of biomedical concepts. What distinguishes MedMentions from other annotated biomedical corpora is its size (over 4,000 abstracts and over 350,000 linked mentions), as well as the size of the concept ontology (over 3 million concepts from UMLS 2017) and its broad coverage of biomedical disciplines.

Corpus: The MedMentions corpus consists of 4,392 papers (Titles and Abstracts) randomly selected from among papers released on PubMed in 2016, that were in the biomedical field, published in the English language, and had both a Title and an Abstract.

Annotators: We recruited a team of professional annotators with rich experience in biomedical content curation to exhaustively annotate all UMLS® (2017AA full version) entity mentions in these papers.

Annotation quality: We did not collect stringent IAA (Inter-annotator agreement) data. To gain insight on the annotation quality of MedMentions, we randomly selected eight papers from the annotated corpus, containing a total of 469 concepts. Two biologists ('Reviewer') who did not participate in the annotation task then each reviewed four papers. The agreement between Reviewers and Annotators, an estimate of the Precision of the annotations, was 97.3%.
medal
huggingface.co
opendatalab.com
Updated Aug 8, 2023
Cite
BigScience Biomedical Datasets (2023). medal [Dataset]. https://huggingface.co/datasets/bigbio/medal
Explore at:
Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
Dataset updated
Aug 8, 2023
Dataset authored and provided by
BigScience Biomedical Datasets
License
https://choosealicense.com/licenses/other/
Description
The Repository for Medical Dataset for Abbreviation Disambiguation for Natural Language Understanding (MeDAL) is a large medical text dataset curated for abbreviation disambiguation, designed for natural language understanding pre-training in the medical domain.
paramed
huggingface.co
Updated Oct 27, 2023
Cite
BigScience Biomedical Datasets (2023). paramed [Dataset]. https://huggingface.co/datasets/bigbio/paramed
Explore at:
Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
Dataset updated
Oct 27, 2023
Dataset authored and provided by
BigScience Biomedical Datasets
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
NEJM is a Chinese-English parallel corpus crawled from the New England Journal of Medicine website. English articles are distributed through https://www.nejm.org/ and Chinese articles are distributed through http://nejmqianyan.cn/. The corpus contains all article pairs (around 2000 pairs) since 2011.
scifact
huggingface.co
Updated Apr 20, 2023
Cite
BigScience Biomedical Datasets (2023). scifact [Dataset]. https://huggingface.co/datasets/bigbio/scifact
Explore at:
Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
Dataset updated
Apr 20, 2023
Dataset authored and provided by
BigScience Biomedical Datasets
License
Attribution-NonCommercial 2.0 (CC BY-NC 2.0): https://creativecommons.org/licenses/by-nc/2.0/
License information was derived automatically
Description
This config connects the claims to the evidence and document IDs.
cardiode
huggingface.co
heidata.uni-heidelberg.de
Updated Apr 1, 2023
Cite
BigScience Biomedical Datasets (2023). cardiode [Dataset]. http://doi.org/10.11588/data/AFYQDY
Explore at:
Unique identifier
https://doi.org/10.11588/data/AFYQDY
Dataset updated
Apr 1, 2023
Dataset authored and provided by
BigScience Biomedical Datasets
License
https://choosealicense.com/licenses/other/
Description
First freely available and distributable large German clinical corpus from the cardiovascular domain.
bio_sim_verb
huggingface.co
Updated Mar 11, 2023
Cite
BigScience Biomedical Datasets (2023). bio_sim_verb [Dataset]. https://huggingface.co/datasets/bigbio/bio_sim_verb
Explore at:
Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
Dataset updated
Mar 11, 2023
Dataset authored and provided by
BigScience Biomedical Datasets
License
https://choosealicense.com/licenses/unknown/
Description
This repository contains the evaluation datasets for the paper Bio-SimVerb and Bio-SimLex: Wide-coverage Evaluation Sets of Word Similarity in Biomedicine by Billy Chiu, Sampo Pyysalo and Anna Korhonen.
sourcedata_nlp
huggingface.co
Updated Oct 31, 2023
Cite
BigScience Biomedical Datasets (2023). sourcedata_nlp [Dataset]. https://huggingface.co/datasets/bigbio/sourcedata_nlp
Explore at:
Dataset updated
Oct 31, 2023
Dataset authored and provided by
BigScience Biomedical Datasets
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
SourceData is an NER/NED dataset of expert annotations of nine entity types in figure captions from biomedical research papers.
hallmarks_of_cancer
huggingface.co
Updated Apr 4, 2023
+ more versions
Cite
BigScience Biomedical Datasets (2023). hallmarks_of_cancer [Dataset]. https://huggingface.co/datasets/bigbio/hallmarks_of_cancer
Explore at:
Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
Dataset updated
Apr 4, 2023
Dataset authored and provided by
BigScience Biomedical Datasets
License
https://choosealicense.com/licenses/gpl-3.0/
Description
The Hallmarks of Cancer (HOC) Corpus consists of 1852 PubMed publication abstracts manually annotated by experts according to a taxonomy. The taxonomy consists of 37 classes in a hierarchy. Zero or more class labels are assigned to each sentence in the corpus. The labels are found under the "labels" directory, while the tokenized text can be found under "text" directory. The filenames are the corresponding PubMed IDs (PMID).
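
The directory layout described above (a "labels" directory and a "text" directory, with filenames given by PMID) suggests pairing the two files for each abstract. The sketch below shows one way to do that. The function name and the ".txt" suffix are assumptions for illustration; the description does not state the file extension or the format of the files' contents.

```python
from pathlib import Path
from typing import Dict, Tuple


def pair_by_pmid(root: Path, suffix: str = ".txt") -> Dict[str, Tuple[Path, Path]]:
    """Map each PMID to its (text file, label file) pair.

    Assumes root contains "text" and "labels" directories whose files
    share the same PMID-based name. The suffix is a guess, not a
    documented property of the corpus.
    """
    text_dir = root / "text"
    labels_dir = root / "labels"
    pairs: Dict[str, Tuple[Path, Path]] = {}
    for text_file in sorted(text_dir.glob(f"*{suffix}")):
        label_file = labels_dir / text_file.name
        # Skip abstracts that have no matching label file.
        if label_file.exists():
            pairs[text_file.stem] = (text_file, label_file)
    return pairs
```

Keying on the filename stem keeps the PMID available for looking up the original PubMed record.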
mednli
huggingface.co
physionet.org
Updated Mar 19, 2025
Cite
BigScience Biomedical Datasets (2025). mednli [Dataset]. https://huggingface.co/datasets/bigbio/mednli
Explore at:
Dataset updated
Mar 19, 2025
Dataset authored and provided by
BigScience Biomedical Datasets
License
https://choosealicense.com/licenses/other/
Description
State-of-the-art models using deep neural networks have become very good at learning an accurate mapping from inputs to outputs. However, they still lack generalization capabilities in conditions that differ from the ones encountered during training. This is even more challenging in specialized, knowledge-intensive domains, where training data is limited. To address this gap, we introduce MedNLI, a dataset annotated by doctors performing a natural language inference (NLI) task, grounded in the medical history of patients. As the source of premise sentences, we used MIMIC-III. More specifically, to minimize the risks to patient privacy, we worked with clinical notes corresponding to deceased patients. The clinicians in our team suggested the Past Medical History to be the most informative section of a clinical note, from which useful inferences can be drawn about the patient.
mediqa_nli
huggingface.co
physionet.org
Updated Mar 2, 2023
Cite
BigScience Biomedical Datasets (2023). mediqa_nli [Dataset]. https://huggingface.co/datasets/bigbio/mediqa_nli
Explore at:
Dataset updated
Mar 2, 2023
Dataset authored and provided by
BigScience Biomedical Datasets
License
https://choosealicense.com/licenses/other/
Description
Natural Language Inference (NLI) is the task of determining whether a given hypothesis can be inferred from a given premise. Also known as Recognizing Textual Entailment (RTE), this task has enjoyed popularity among researchers for some time. However, almost all datasets for this task focused on open-domain data such as news texts, blogs, and so on. To address this gap, the MedNLI dataset was created for language inference in the medical domain. MedNLI is a derived dataset with data sourced from MIMIC-III v1.4. In order to stimulate research on this problem, a shared task on Medical Inference and Question Answering (MEDIQA) was organized at the workshop for biomedical natural language processing (BioNLP) 2019. The dataset provided herein is a test set of 405 premise-hypothesis pairs for the NLI challenge in the MEDIQA shared task. Participants of the shared task are expected to use the MedNLI data for development of their models, and this dataset was used as an unseen dataset for scoring each participant submission.
grascco
huggingface.co
data.niaid.nih.gov
+1more
Updated Oct 23, 2024
Cite
BigScience Biomedical Datasets (2024). grascco [Dataset]. https://huggingface.co/datasets/bigbio/grascco
Explore at:
Dataset updated
Oct 23, 2024
Dataset authored and provided by
BigScience Biomedical Datasets
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
GraSCCo is a collection of artificially generated semi-structured and unstructured German-language clinical summaries. These summaries are formulated as letters from the hospital to the patient's GP after in-patient or out-patient care. This is common practice in Germany, Austria and Switzerland.

The creation of the GraSCCo documents was inspired by existing clinical texts, but all names and dates are purely fictional. There is no relation to existing patients, clinicians or institutions. Whereas the texts try to represent the range of German clinical language as well as possible, medical plausibility must not be assumed.

GraSCCo can therefore only be used to train clinical language models, not clinical domain models.
linnaeus
huggingface.co
opendatalab.com
Updated Oct 18, 2023
Cite
BigScience Biomedical Datasets (2023). linnaeus [Dataset]. https://huggingface.co/datasets/bigbio/linnaeus
Explore at:
Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
Dataset updated
Oct 18, 2023
Dataset authored and provided by
BigScience Biomedical Datasets
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Linnaeus is a novel corpus of full-text documents manually annotated for species mentions.
chia
huggingface.co
Updated Apr 4, 2023
Cite
BigScience Biomedical Datasets (2023). chia [Dataset]. https://huggingface.co/datasets/bigbio/chia
Explore at:
Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
Dataset updated
Apr 4, 2023
Dataset authored and provided by
BigScience Biomedical Datasets
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
A large annotated corpus of patient eligibility criteria extracted from 1,000 interventional, Phase IV clinical trials registered in ClinicalTrials.gov. This dataset includes 12,409 annotated eligibility criteria, represented by 41,487 distinctive entities of 15 entity types and 25,017 relationships of 12 relationship types.
head_qa
huggingface.co
Updated Oct 25, 2024
+ more versions
Cite
BigScience Biomedical Datasets (2024). head_qa [Dataset]. https://huggingface.co/datasets/bigbio/head_qa
Explore at:
Dataset updated
Oct 25, 2024
Dataset authored and provided by
BigScience Biomedical Datasets
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
HEAD-QA is a multi-choice HEAlthcare Dataset. The questions come from exams to access a specialized position in the Spanish healthcare system, and are challenging even for highly specialized humans. They are designed by the Ministerio de Sanidad, Consumo y Bienestar Social. The dataset contains questions about the following topics: medicine, nursing, psychology, chemistry, pharmacology and biology.
mayosrs
huggingface.co
Updated Dec 20, 2024
Cite
BigScience Biomedical Datasets (2024). mayosrs [Dataset]. https://huggingface.co/datasets/bigbio/mayosrs
Explore at:
Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
Dataset updated
Dec 20, 2024
Dataset authored and provided by
BigScience Biomedical Datasets
License
https://choosealicense.com/licenses/cc0-1.0/
Description
MayoSRS consists of 101 clinical term pairs whose relatedness was determined by nine medical coders and three physicians from the Mayo Clinic.
pmc_patients
huggingface.co
Updated Feb 28, 2022
Cite
BigScience Biomedical Datasets (2022). pmc_patients [Dataset]. https://huggingface.co/datasets/bigbio/pmc_patients
Explore at:
Dataset updated
Feb 28, 2022
Dataset authored and provided by
BigScience Biomedical Datasets
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
This dataset is used for calculating the similarity between two patient descriptions.
evidence_inference
huggingface.co
Updated Jul 10, 2023
+ more versions
Cite
BigScience Biomedical Datasets (2023). evidence_inference [Dataset]. https://huggingface.co/datasets/bigbio/evidence_inference
Explore at:
Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
Dataset updated
Jul 10, 2023
Dataset authored and provided by
BigScience Biomedical Datasets
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
The dataset consists of biomedical articles describing randomized controlled trials (RCTs) that compare multiple treatments. Each article has multiple questions, or 'prompts', associated with it. These prompts ask about the relationship between an intervention and a comparator with respect to an outcome, as reported in the trial. For example, a prompt may ask about the reported effects of aspirin as compared to placebo on the duration of headaches. For the sake of this task, we assume that a particular article will report that the intervention of interest either significantly increased, significantly decreased or had no significant effect on the outcome, relative to the comparator.
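
The prompt structure described above (an intervention, a comparator, and an outcome, answered with one of three findings) can be sketched as follows. All names here are hypothetical and chosen for illustration; they are not the field names of the published dataset. The third label is read here as "no significant effect", which is an interpretation of the description rather than a documented label.

```python
from dataclasses import dataclass
from enum import Enum


class EffectLabel(Enum):
    """The three findings an article is assumed to report."""

    SIGNIFICANTLY_INCREASED = "significantly increased"
    SIGNIFICANTLY_DECREASED = "significantly decreased"
    NO_SIGNIFICANT_EFFECT = "no significant effect"  # assumed reading


@dataclass(frozen=True)
class Prompt:
    """One question about a trial: intervention vs. comparator on an outcome."""

    intervention: str
    comparator: str
    outcome: str

    def as_question(self) -> str:
        # Render the prompt as a natural-language question.
        return (
            f"What is the reported effect of {self.intervention}, "
            f"compared to {self.comparator}, on {self.outcome}?"
        )


# The aspirin example from the description.
prompt = Prompt("aspirin", "placebo", "duration of headaches")
print(prompt.as_question())
```

Modelling the answer as a closed enumeration makes the task a three-way classification over each (article, prompt) pair.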

ebm_pico

EBM NLP

bigbio/ebm_pico

Explore at:

4 scholarly articles cite this dataset (View in Google Scholar)
