Facebook
Twitterhttps://choosealicense.com/licenses/unknown/https://choosealicense.com/licenses/unknown/
This corpus release contains 4,993 abstracts annotated with (P)articipants, (I)nterventions, and (O)utcomes. Training labels are sourced from AMT workers and aggregated to reduce noise. Test labels are collected from medical professionals.
Facebook
TwitterMIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
PubMedQA is a novel biomedical question answering (QA) dataset collected from PubMed abstracts. The task of PubMedQA is to answer research biomedical questions with yes/no/maybe using the corresponding abstracts. PubMedQA has 1k expert-annotated (PQA-L), 61.2k unlabeled (PQA-U) and 211.3k artificially generated QA instances (PQA-A).
Each PubMedQA instance is composed of: (1) a question which is either an existing research article title or derived from one, (2) a context which is the corresponding PubMed abstract without its conclusion, (3) a long answer, which is the conclusion of the abstract and, presumably, answers the research question, and (4) a yes/no/maybe answer which summarizes the conclusion.
PubMedQA is the first QA dataset where reasoning over biomedical research texts, especially their quantitative contents, is required to answer the questions.
PubMedQA datasets comprise of 3 different subsets: (1) PubMedQA Labeled (PQA-L): A labeled PubMedQA subset comprises of 1k manually annotated yes/no/maybe QA data collected from PubMed articles. (2) PubMedQA Artificial (PQA-A): An artificially labelled PubMedQA subset comprises of 211.3k PubMed articles with automatically generated questions from the statement titles and yes/no answer labels generated using a simple heuristic. (3) PubMedQA Unlabeled (PQA-U): An unlabeled PubMedQA subset comprises of 61.2k context-question pairs data collected from PubMed articles.
Facebook
TwitterAttribution-ShareAlike 3.0 (CC BY-SA 3.0)https://creativecommons.org/licenses/by-sa/3.0/
License information was derived automatically
AnEM corpus is a domain- and species-independent resource manually annotated for anatomical entity mentions using a fine-grained classification system. The corpus consists of 500 documents (over 90,000 words) selected randomly from citation abstracts and full-text papers with the aim of making the corpus representative of the entire available biomedical scientific literature. The corpus annotation covers mentions of both healthy and pathological anatomical entities and contains over 3,000 annotated mentions.
Facebook
Twitterhttps://choosealicense.com/licenses/cc0-1.0/https://choosealicense.com/licenses/cc0-1.0/
MedMentions is a new manually annotated resource for the recognition of biomedical concepts. What distinguishes MedMentions from other annotated biomedical corpora is its size (over 4,000 abstracts and over 350,000 linked mentions), as well as the size of the concept ontology (over 3 million concepts from UMLS 2017) and its broad coverage of biomedical disciplines.
Corpus: The MedMentions corpus consists of 4,392 papers (Titles and Abstracts) randomly selected from among papers released on PubMed in 2016, that were in the biomedical field, published in the English language, and had both a Title and an Abstract.
Annotators: We recruited a team of professional annotators with rich experience in biomedical content curation to exhaustively annotate all UMLS® (2017AA full version) entity mentions in these papers.
Annotation quality: We did not collect stringent IAA (Inter-annotator agreement) data. To gain insight on the annotation quality of MedMentions, we randomly selected eight papers from the annotated corpus, containing a total of 469 concepts. Two biologists ('Reviewer') who did not participate in the annotation task then each reviewed four papers. The agreement between Reviewers and Annotators, an estimate of the Precision of the annotations, was 97.3%.
Facebook
Twitterhttps://choosealicense.com/licenses/other/https://choosealicense.com/licenses/other/
The Repository for Medical Dataset for Abbreviation Disambiguation for Natural Language Understanding (MeDAL) is a large medical text dataset curated for abbreviation disambiguation, designed for natural language understanding pre-training in the medical domain.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
NEJM is a Chinese-English parallel corpus crawled from the New England Journal of Medicine website. English articles are distributed through https://www.nejm.org/ and Chinese articles are distributed through http://nejmqianyan.cn/. The corpus contains all article pairs (around 2000 pairs) since 2011.
Facebook
TwitterAttribution-NonCommercial 2.0 (CC BY-NC 2.0)https://creativecommons.org/licenses/by-nc/2.0/
License information was derived automatically
{_DESCRIPTION_BASE} This config connects the claims to the evidence and doc ids.
Facebook
Twitterhttps://choosealicense.com/licenses/other/https://choosealicense.com/licenses/other/
First freely available and distributable large German clinical corpus from the cardiovascular domain.
Facebook
Twitterhttps://choosealicense.com/licenses/unknown/https://choosealicense.com/licenses/unknown/
This repository contains the evaluation datasets for the paper Bio-SimVerb and Bio-SimLex: Wide-coverage Evaluation Sets of Word Similarity in Biomedicine by Billy Chiu, Sampo Pyysalo and Anna Korhonen.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
SourceData is an NER/NED dataset of expert annotations of nine entity types in figure captions from biomedical research papers.
Facebook
Twitterhttps://choosealicense.com/licenses/gpl-3.0/https://choosealicense.com/licenses/gpl-3.0/
The Hallmarks of Cancer (HOC) Corpus consists of 1852 PubMed publication abstracts manually annotated by experts according to a taxonomy. The taxonomy consists of 37 classes in a hierarchy. Zero or more class labels are assigned to each sentence in the corpus. The labels are found under the "labels" directory, while the tokenized text can be found under "text" directory. The filenames are the corresponding PubMed IDs (PMID).
Facebook
Twitterhttps://choosealicense.com/licenses/other/https://choosealicense.com/licenses/other/
State of the art models using deep neural networks have become very good in learning an accurate mapping from inputs to outputs. However, they still lack generalization capabilities in conditions that differ from the ones encountered during training. This is even more challenging in specialized, and knowledge intensive domains, where training data is limited. To address this gap, we introduce MedNLI - a dataset annotated by doctors, performing a natural language inference task (NLI), grounded in the medical history of patients. As the source of premise sentences, we used the MIMIC-III. More specifically, to minimize the risks to patient privacy, we worked with clinical notes corresponding to the deceased patients. The clinicians in our team suggested the Past Medical History to be the most informative section of a clinical note, from which useful inferences can be drawn about the patient.
Facebook
Twitterhttps://choosealicense.com/licenses/other/https://choosealicense.com/licenses/other/
Natural Language Inference (NLI) is the task of determining whether a given hypothesis can be inferred from a given premise. Also known as Recognizing Textual Entailment (RTE), this task has enjoyed popularity among researchers for some time. However, almost all datasets for this task focused on open domain data such as as news texts, blogs, and so on. To address this gap, the MedNLI dataset was created for language inference in the medical domain. MedNLI is a derived dataset with data sourced from MIMIC-III v1.4. In order to stimulate research for this problem, a shared task on Medical Inference and Question Answering (MEDIQA) was organized at the workshop for biomedical natural language processing (BioNLP) 2019. The dataset provided herein is a test set of 405 premise hypothesis pairs for the NLI challenge in the MEDIQA shared task. Participants of the shared task are expected to use the MedNLI data for development of their models and this dataset was used as an unseen dataset for scoring each participant submission.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
GraSCCo is a collection of artificially generated semi-structured and unstructured German-language clinical summaries. These summaries are formulated as letters from the hospital to the patient's GP after in-patient or out-patient care. This is common practice in Germany, Austria and Switzerland.
The creation of the GraSCCo documents were inspired by existing clinical texts, but all names and dates are purely fictional. There is no relation to existing patients, clinicians or institutions. Whereas the texts try to represent the range of German clinical language as best as possible, medical plausibility must not be assumed.
GraSCCo can therefore only be used to train clinical language models, not clinical domain models.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Linnaeus is a novel corpus of full-text documents manually annotated for species mentions.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A large annotated corpus of patient eligibility criteria extracted from 1,000 interventional, Phase IV clinical trials registered in ClinicalTrials.gov. This dataset includes 12,409 annotated eligibility criteria, represented by 41,487 distinctive entities of 15 entity types and 25,017 relationships of 12 relationship types.
Facebook
TwitterMIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
HEAD-QA is a multi-choice HEAlthcare Dataset. The questions come from exams to access a specialized position in the Spanish healthcare system, and are challenging even for highly specialized humans. They are designed by the Ministerio de Sanidad, Consumo y Bienestar Social.The dataset contains questions about following topics: medicine, nursing, psychology, chemistry, pharmacology and biology.
Facebook
Twitterhttps://choosealicense.com/licenses/cc0-1.0/https://choosealicense.com/licenses/cc0-1.0/
MayoSRS consists of 101 clinical term pairs whose relatedness was determined by nine medical coders and three physicians from the Mayo Clinic.
Facebook
TwitterAttribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset is used for calculating the similarity between two patient descriptions.
Facebook
TwitterMIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
The dataset consists of biomedical articles describing randomized control trials (RCTs) that compare multiple treatments. Each of these articles will have multiple questions, or 'prompts' associated with them. These prompts will ask about the relationship between an intervention and comparator with respect to an outcome, as reported in the trial. For example, a prompt may ask about the reported effects of aspirin as compared to placebo on the duration of headaches. For the sake of this task, we assume that a particular article will report that the intervention of interest either significantly increased, significantly decreased or had significant effect on the outcome, relative to the comparator.
Facebook
Twitterhttps://choosealicense.com/licenses/unknown/https://choosealicense.com/licenses/unknown/
This corpus release contains 4,993 abstracts annotated with (P)articipants, (I)nterventions, and (O)utcomes. Training labels are sourced from AMT workers and aggregated to reduce noise. Test labels are collected from medical professionals.