16 datasets found
  1. hallmarks_of_cancer

    • huggingface.co
    Updated Apr 4, 2023
    + more versions
    Cite
    BigScience Biomedical Datasets (2023). hallmarks_of_cancer [Dataset]. https://huggingface.co/datasets/bigbio/hallmarks_of_cancer
    Dataset updated
    Apr 4, 2023
    Dataset authored and provided by
    BigScience Biomedical Datasets
    License

    https://choosealicense.com/licenses/gpl-3.0/

    Description

    The Hallmarks of Cancer (HOC) Corpus consists of 1852 PubMed publication abstracts manually annotated by experts according to a taxonomy. The taxonomy consists of 37 classes in a hierarchy. Zero or more class labels are assigned to each sentence in the corpus. The labels are found under the "labels" directory, while the tokenized text can be found under the "text" directory. The filenames are the corresponding PubMed IDs (PMIDs).

  2. grascco

    • huggingface.co
    • data.niaid.nih.gov
    Updated Oct 23, 2024
    Cite
    BigScience Biomedical Datasets (2024). grascco [Dataset]. https://huggingface.co/datasets/bigbio/grascco
    Dataset updated
    Oct 23, 2024
    Dataset authored and provided by
    BigScience Biomedical Datasets
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    GraSCCo is a collection of artificially generated semi-structured and unstructured German-language clinical summaries. These summaries are formulated as letters from the hospital to the patient's GP after in-patient or out-patient care. This is common practice in Germany, Austria and Switzerland.

    The creation of the GraSCCo documents was inspired by existing clinical texts, but all names and dates are purely fictional. There is no relation to existing patients, clinicians or institutions. Although the texts try to represent the range of German clinical language as well as possible, medical plausibility must not be assumed.

    GraSCCo can therefore only be used to train clinical language models, not clinical domain models.

  3. med_qa

    • hf-proxy-cf.effarig.site
    • opendatalab.com
    • +1more
    Updated Jul 23, 2024
    Cite
    BigScience Biomedical Datasets (2024). med_qa [Dataset]. https://hf-proxy-cf.effarig.site/datasets/bigbio/med_qa
    Dataset updated
    Jul 23, 2024
    Dataset authored and provided by
    BigScience Biomedical Datasets
    License

    https://choosealicense.com/licenses/unknown/

    Description

    In this work, we present the first free-form multiple-choice OpenQA dataset for solving medical problems, MedQA, collected from the professional medical board exams. It covers three languages: English, simplified Chinese, and traditional Chinese, and contains 12,723, 34,251, and 14,123 questions for the three languages, respectively. Together with the question data, we also collect and release a large-scale corpus from medical textbooks from which the reading comprehension models can obtain necessary knowledge for answering the questions.

  4. ddi_corpus

    • huggingface.co
    • opendatalab.com
    Updated Dec 14, 2022
    Cite
    BigScience Biomedical Datasets (2022). ddi_corpus [Dataset]. https://huggingface.co/datasets/bigbio/ddi_corpus
    Dataset updated
    Dec 14, 2022
    Dataset authored and provided by
    BigScience Biomedical Datasets
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    The DDI corpus has been manually annotated with drugs and their pharmacokinetic and pharmacodynamic interactions. It contains 1025 documents from two different sources: the DrugBank database and MedLine.

  5. mediqa_qa

    • huggingface.co
    Updated Feb 22, 2023
    + more versions
    Cite
    BigScience Biomedical Datasets (2023). mediqa_qa [Dataset]. https://huggingface.co/datasets/bigbio/mediqa_qa
    Dataset updated
    Feb 22, 2023
    Dataset authored and provided by
    BigScience Biomedical Datasets
    License

    https://choosealicense.com/licenses/unknown/

    Description

    The MEDIQA challenge is an ACL-BioNLP 2019 shared task aiming to attract further research efforts in Natural Language Inference (NLI), Recognizing Question Entailment (RQE), and their applications in medical Question Answering (QA). Mailing List: https://groups.google.com/forum/#!forum/bionlp-mediqa

    In the QA task, participants are asked to:

    - filter/classify the provided answers (1: correct, 0: incorrect);
    - re-rank the answers.

  6. cpi

    • huggingface.co
    Cite
    BigScience Biomedical Datasets, cpi [Dataset]. https://huggingface.co/datasets/bigbio/cpi
    Dataset authored and provided by
    BigScience Biomedical Datasets
    License

    https://choosealicense.com/licenses/other/

    Description

    The compound-protein relationship (CPI) dataset consists of 2,613 sentences from abstracts, annotated with proteins, small molecules, and their relationships.

  7. head_qa

    • huggingface.co
    Updated Oct 25, 2024
    + more versions
    Cite
    BigScience Biomedical Datasets (2024). head_qa [Dataset]. https://huggingface.co/datasets/bigbio/head_qa
    Dataset updated
    Oct 25, 2024
    Dataset authored and provided by
    BigScience Biomedical Datasets
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    HEAD-QA is a multi-choice HEAlthcare Dataset. The questions come from exams to access a specialized position in the Spanish healthcare system, and are challenging even for highly specialized humans. They are designed by the Ministerio de Sanidad, Consumo y Bienestar Social. The dataset contains questions about the following topics: medicine, nursing, psychology, chemistry, pharmacology and biology.

  8. bionlp_st_2013_ge

    • huggingface.co
    + more versions
    Cite
    BigScience Biomedical Datasets, bionlp_st_2013_ge [Dataset]. https://huggingface.co/datasets/bigbio/bionlp_st_2013_ge
    Dataset authored and provided by
    BigScience Biomedical Datasets
    License

    https://choosealicense.com/licenses/other/

    Description

    The BioNLP-ST GE task has been promoting the development of fine-grained information extraction (IE) from biomedical documents since 2009. In particular, it has focused on NFkB as a model domain for biomedical IE.

  9. meddialog

    • huggingface.co
    • paperswithcode.com
    • +1more
    Updated Apr 22, 2023
    + more versions
    Cite
    BigScience Biomedical Datasets (2023). meddialog [Dataset]. https://huggingface.co/datasets/bigbio/meddialog
    Dataset updated
    Apr 22, 2023
    Dataset authored and provided by
    BigScience Biomedical Datasets
    License

    https://choosealicense.com/licenses/unknown/

    Description

    The MedDialog dataset (English) contains English-language conversations between doctors and patients. It has 0.26 million dialogues. The data is continuously growing and more dialogues will be added. The raw dialogues are from healthcaremagic.com and icliniq.com. All copyrights of the data belong to healthcaremagic.com and icliniq.com.

  10. linnaeus

    • huggingface.co
    • opendatalab.com
    • +1more
    Updated Oct 18, 2023
    Cite
    BigScience Biomedical Datasets (2023). linnaeus [Dataset]. https://huggingface.co/datasets/bigbio/linnaeus
    Dataset updated
    Oct 18, 2023
    Dataset authored and provided by
    BigScience Biomedical Datasets
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Linnaeus is a novel corpus of full-text documents manually annotated for species mentions.

  11. MeDAL Dataset

    • paperswithcode.com
    • opendatalab.com
    • +2more
    Updated Sep 30, 2020
    Cite
    Zhi Wen; Xing Han Lu; Siva Reddy (2020). MeDAL Dataset [Dataset]. https://paperswithcode.com/dataset/medal
    Dataset updated
    Sep 30, 2020
    Authors
    Zhi Wen; Xing Han Lu; Siva Reddy
    Description

    The Medical Dataset for Abbreviation Disambiguation for Natural Language Understanding (MeDAL) is a large medical text dataset curated for abbreviation disambiguation, designed for natural language understanding pre-training in the medical domain. It was published at the ClinicalNLP workshop at EMNLP.

  12. mediqa_nli

    • huggingface.co
    • physionet.org
    Updated Mar 2, 2023
    Cite
    BigScience Biomedical Datasets (2023). mediqa_nli [Dataset]. https://huggingface.co/datasets/bigbio/mediqa_nli
    Dataset updated
    Mar 2, 2023
    Dataset authored and provided by
    BigScience Biomedical Datasets
    License

    https://choosealicense.com/licenses/other/

    Description

    Natural Language Inference (NLI) is the task of determining whether a given hypothesis can be inferred from a given premise. Also known as Recognizing Textual Entailment (RTE), this task has enjoyed popularity among researchers for some time. However, almost all datasets for this task focus on open-domain data such as news texts, blogs, and so on. To address this gap, the MedNLI dataset was created for language inference in the medical domain. MedNLI is a derived dataset with data sourced from MIMIC-III v1.4. To stimulate research on this problem, a shared task on Medical Inference and Question Answering (MEDIQA) was organized at the workshop for biomedical natural language processing (BioNLP) 2019. The dataset provided herein is a test set of 405 premise-hypothesis pairs for the NLI challenge in the MEDIQA shared task. Participants in the shared task were expected to use the MedNLI data to develop their models, and this dataset was used as an unseen dataset for scoring each participant's submission.

  13. scitail

    • huggingface.co
    • tensorflow.org
    • +1more
    Updated Apr 10, 2023
    Cite
    BigScience Biomedical Datasets (2023). scitail [Dataset]. https://huggingface.co/datasets/bigbio/scitail
    Dataset updated
    Apr 10, 2023
    Dataset authored and provided by
    BigScience Biomedical Datasets
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    The SciTail dataset is an entailment dataset created from multiple-choice science exams and web sentences. Each question and the correct answer choice are converted into an assertive statement to form the hypothesis. We use information retrieval to obtain relevant text from a large text corpus of web sentences, and use these sentences as a premise P. We crowdsource the annotation of each such premise-hypothesis pair as supports (entails) or not (neutral), in order to create the SciTail dataset. The dataset contains 27,026 examples: 10,101 with the entails label and 16,925 with the neutral label.

  14. EHR-Rel Dataset

    • paperswithcode.com
    • opendatalab.com
    • +1more
    Updated Nov 26, 2023
    Cite
    Claudia Schulz; Josh Levy-Kramer; Camille Van Assel; Miklos Kepes; Nils Hammerla (2023). EHR-Rel Dataset [Dataset]. https://paperswithcode.com/dataset/ehr-rel
    Dataset updated
    Nov 26, 2023
    Authors
    Claudia Schulz; Josh Levy-Kramer; Camille Van Assel; Miklos Kepes; Nils Hammerla
    Description

    EHR-RelB is a benchmark dataset for biomedical concept relatedness, consisting of 3630 concept pairs sampled from electronic health records (EHRs). EHR-RelA is a smaller dataset of 111 concept pairs, which are mainly unrelated.

  15. pdr

    • huggingface.co
    Updated Mar 2, 2023
    Cite
    BigScience Biomedical Datasets (2023). pdr [Dataset]. https://huggingface.co/datasets/bigbio/pdr
    Dataset updated
    Mar 2, 2023
    Dataset authored and provided by
    BigScience Biomedical Datasets
    License

    https://choosealicense.com/licenses/unknown/

    Description

    The plant-disease relation corpus consists of plants, diseases, and the relations between them, annotated in PubMed abstracts. The corpus consists of about 2400 plant and disease entities and 300 annotated relations from 179 abstracts.

  16. paramed

    • huggingface.co
    Updated Oct 27, 2023
    Cite
    BigScience Biomedical Datasets (2023). paramed [Dataset]. https://huggingface.co/datasets/bigbio/paramed
    Dataset updated
    Oct 27, 2023
    Dataset authored and provided by
    BigScience Biomedical Datasets
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    NEJM is a Chinese-English parallel corpus crawled from the New England Journal of Medicine website. English articles are distributed through https://www.nejm.org/ and Chinese articles are distributed through http://nejmqianyan.cn/. The corpus contains all article pairs (around 2000 pairs) since 2011.



hallmarks_of_cancer

Hallmarks of Cancer

bigbio/hallmarks_of_cancer

19 scholarly articles cite this dataset (View in Google Scholar)