12 datasets found
  1. bc5cdr

    • huggingface.co
    Updated Sep 21, 2022
    Cite
    TNER (2022). bc5cdr [Dataset]. https://huggingface.co/datasets/tner/bc5cdr
    Dataset authored and provided by
    TNER
    License

    https://choosealicense.com/licenses/other/

    Description

    BioCreative V CDR (tner/bc5cdr).

  2. BC5CDR-IOB

    • huggingface.co
    Updated Jun 20, 2023
    + more versions
    Cite
    Satya (2023). BC5CDR-IOB [Dataset]. https://huggingface.co/datasets/omniquad/BC5CDR-IOB
    Authors
    Satya
    Description

    The automatic extraction of chemical information from text requires the recognition of chemical entity mentions as one of its key steps. When developing supervised named entity recognition (NER) systems, the availability of a large, manually annotated text corpus is desirable. Furthermore, large corpora permit the robust evaluation and comparison of different approaches that detect chemicals in documents. We present the CHEMDNER corpus, a collection of 10,000 PubMed abstracts that contain a total of 84,355 chemical entity mentions labeled manually by expert chemistry literature curators, following annotation guidelines specifically defined for this task. The abstracts of the CHEMDNER corpus were selected to be representative for all major chemical disciplines. Each of the chemical entity mentions was manually labeled according to its structure-associated chemical entity mention (SACEM) class: abbreviation, family, formula, identifier, multiple, systematic and trivial. The difficulty and consistency of tagging chemicals in text was measured using an agreement study between annotators, obtaining a percentage agreement of 91. For a subset of the CHEMDNER corpus (the test set of 3,000 abstracts) we provide not only the Gold Standard manual annotations, but also mentions automatically detected by the 26 teams that participated in the BioCreative IV CHEMDNER chemical mention recognition task. In addition, we release the CHEMDNER silver standard corpus of automatically extracted mentions from 17,000 randomly selected PubMed abstracts. A version of the CHEMDNER corpus in the BioC format has been generated as well. We propose a standard for required minimum information about entity annotations for the construction of domain specific corpora on chemical and drug entities. The CHEMDNER corpus and annotation guidelines are available at: http://www.biocreative.org/resources/biocreative-iv/chemdner-corpus/
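
    A minimal sketch of how one might pull this Hub dataset with the Hugging Face datasets library and inspect its IOB labels. The repository id is taken from the citation above; the split name and the column names ("tokens", "ner_tags") are assumptions about the repo layout, not documented facts.

        # Hedged sketch: load the BC5CDR-IOB repo from the Hugging Face Hub and
        # print the first sentence with its IOB tags. Split and column names assumed.
        from datasets import load_dataset

        ds = load_dataset("omniquad/BC5CDR-IOB")      # repo id from the citation above
        example = ds["train"][0]                      # "train" split assumed to exist
        for token, tag in zip(example["tokens"], example["ner_tags"]):
            print(f"{token}\t{tag}")                  # e.g. "Naloxone  B-Chemical"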

  3. NCBI; BC5CDR; i2b2 2010; HPRD50; AIMed; MedNLI

    • ieee-dataport.org
    Updated Apr 2, 2024
    Cite
    chen peng (2024). NCBI; BC5CDR; i2b2 2010; HPRD50; AIMed; MedNLI [Dataset]. https://ieee-dataport.org/documents/ncbi-bc5cdr-i2b2-2010-hprd50-aimed-mednli
    Authors
    chen peng
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    NCBI: The NCBI dataset is a biomedical corpus containing 793 PubMed abstracts

  4. BC5CDR-c - Dataset - LDM

    • service.tib.eu
    Updated Dec 16, 2024
    + more versions
    Cite
    (2024). BC5CDR-c - Dataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/bc5cdr-c
    Description

    Biomedical entity linking is the task of linking entity mentions in a biomedical document to referent entities in a knowledge base.
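
    A toy illustration of the task described above, not code from this dataset: an entity mention is normalised and looked up in a small dictionary standing in for the knowledge base. The identifiers are placeholders, not real MeSH codes.

        # Hypothetical linking step: map surface mentions to knowledge-base entries.
        knowledge_base = {"cisplatin": "KB:0001", "nausea": "KB:0002"}   # placeholder IDs

        def link(mention: str) -> str:
            """Return the KB identifier for a mention, or NIL if it is unknown."""
            return knowledge_base.get(mention.strip().lower(), "NIL")

        print(link("Cisplatin"))   # KB:0001
        print(link("headache"))    # NIL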

  5. autoeval-eval-tner_bc5cdr-bc5cdr-01abad-31923144985

    • huggingface.co
    Updated Oct 20, 2023
    + more versions
    Cite
    Evaluation on the Hub (2023). autoeval-eval-tner_bc5cdr-bc5cdr-01abad-31923144985 [Dataset]. https://huggingface.co/datasets/autoevaluate/autoeval-eval-tner_bc5cdr-bc5cdr-01abad-31923144985
    Dataset authored and provided by
    Evaluation on the Hub
    Description

    autoevaluate/autoeval-eval-tner_bc5cdr-bc5cdr-01abad-31923144985 dataset hosted on Hugging Face and contributed by the HF Datasets community

  6. Detailed representation of the dataset.

    • plos.figshare.com
    xls
    Updated May 28, 2024
    + more versions
    Cite
    Dequan Zheng; Rong Han; Feng Yu; Yannan Li (2024). Detailed representation of the dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0304329.t001
    Dataset provided by
    PLOS ONE
    Authors
    Dequan Zheng; Rong Han; Feng Yu; Yannan Li
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Currently, in biomedical named entity recognition, a CharCNN (character-level convolutional neural network) or a CharRNN (character-level recurrent neural network) is typically used on its own to extract character features. This ignores the complementary strengths of the two models and simply concatenates word features, discarding the feature information produced while the word representations are combined. To address this, the paper proposes a multi-cross-attention feature-fusion method. First, DistilBioBERT is combined with CharCNN and with CharLSTM in two separate cross-attention word-char fusions (word features with character features). The two feature vectors obtained from these fusions are then fused again through cross-attention to produce the final feature vector. A BiLSTM with a multi-head attention mechanism is then added to sharpen the model's focus on key features and further improve performance, and an output layer produces the final result. Experiments show that the proposed model achieves the best F1 scores of 90.76%, 89.79%, 94.98%, 80.27% and 88.84% on the NCBI-Disease, BC5CDR-Disease, BC5CDR-Chem, JNLPBA and BC2GM biomedical datasets, respectively, indicating that it captures richer semantic features and recognizes entities more accurately.
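
    A rough PyTorch sketch of the fusion idea described in the abstract, not the authors' implementation: word features query character features, character features query word features, and the two resulting streams are fused by one more cross-attention. The dimensions, head count, and the use of nn.MultiheadAttention are assumptions.

        # Hedged sketch of multi-cross-attention fusion over word and char features.
        import torch
        import torch.nn as nn

        class MultiCrossAttentionFusion(nn.Module):
            def __init__(self, dim=256, heads=4):
                super().__init__()
                self.word_over_char = nn.MultiheadAttention(dim, heads, batch_first=True)
                self.char_over_word = nn.MultiheadAttention(dim, heads, batch_first=True)
                self.final_fusion = nn.MultiheadAttention(dim, heads, batch_first=True)

            def forward(self, word_feats, char_feats):
                # word_feats, char_feats: (batch, seq_len, dim)
                w, _ = self.word_over_char(word_feats, char_feats, char_feats)
                c, _ = self.char_over_word(char_feats, word_feats, word_feats)
                fused, _ = self.final_fusion(w, c, c)   # second-stage cross-attention
                return fused

        # toy check on a 10-token sentence
        w = torch.randn(1, 10, 256)
        c = torch.randn(1, 10, 256)
        print(MultiCrossAttentionFusion()(w, c).shape)   # torch.Size([1, 10, 256])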

  7. Data from: NLMChem a new resource for chemical entity recognition in PubMed...

    • datadryad.org
    • data.niaid.nih.gov
    • +1 more
    zip
    Updated Mar 22, 2021
    Cite
    Rezarta Islamaj; Robert Leaman; Zhiyong Lu (2021). NLMChem a new resource for chemical entity recognition in PubMed full-text literature [Dataset]. http://doi.org/10.5061/dryad.3tx95x6dz
    Dataset provided by
    Dryad
    Authors
    Rezarta Islamaj; Robert Leaman; Zhiyong Lu
    Time period covered
    Mar 19, 2021
    Description

    We include the annotation guidelines document, which makes clear that the corpus can be combined with the ChemDNER and BC5CDR corpora (which provide chemical name annotations, and chemical name plus MeSH annotations, respectively) to further improve chemical NER in the biomedical literature.

    The corpus has been divided into train/dev/test to facilitate benchmarking and comparisons.

    The data annotations are inline in the BioC XML format, which is a minimalistic approach to facilitate text mining. We also maintain a copy here: https://www.ncbi.nlm.nih.gov/research/bionlp/
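
    A small sketch of reading such inline annotations with the Python standard library, assuming the usual BioC element layout (collection > document > passage > annotation) and a placeholder file name.

        # Hedged sketch: walk a BioC XML file and print each annotated mention.
        import xml.etree.ElementTree as ET

        tree = ET.parse("nlmchem_corpus.xml")                   # placeholder path
        for document in tree.getroot().iter("document"):
            doc_id = document.findtext("id")
            for passage in document.iter("passage"):
                for ann in passage.iter("annotation"):
                    ann_type = ann.findtext("infon[@key='type']")   # e.g. "Chemical"
                    print(doc_id, ann_type, ann.findtext("text"))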

  8. Experimental results of different fusion methods.

    • plos.figshare.com
    xls
    Updated May 28, 2024
    Cite
    Dequan Zheng; Rong Han; Feng Yu; Yannan Li (2024). Experimental results of different fusion methods. [Dataset]. http://doi.org/10.1371/journal.pone.0304329.t005
    Dataset provided by
    PLOS ONE
    Authors
    Dequan Zheng; Rong Han; Feng Yu; Yannan Li
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Same abstract as item 6 (multi-cross-attention feature fusion for biomedical NER); this record holds the paper's table of fusion-method results.

  9. autoeval-eval-tner_bc5cdr-bc5cdr-5c6ce1-69913145667

    • huggingface.co
    Updated Oct 21, 2023
    + more versions
    Cite
    Evaluation on the Hub (2023). autoeval-eval-tner_bc5cdr-bc5cdr-5c6ce1-69913145667 [Dataset]. https://huggingface.co/datasets/autoevaluate/autoeval-eval-tner_bc5cdr-bc5cdr-5c6ce1-69913145667
    Dataset authored and provided by
    Evaluation on the Hub
    Description

    autoevaluate/autoeval-eval-tner_bc5cdr-bc5cdr-5c6ce1-69913145667 dataset hosted on Hugging Face and contributed by the HF Datasets community

  10. Microchromosomes and their association with human diseases

    • data.niaid.nih.gov
    Updated Feb 24, 2022
    Cite
    Saldanha, Elveera (2022). Microchromosomes and their association with human diseases [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5880553
    Dataset provided by
    Dutt, Amit
    Kumar Prabhash
    Poojary, Disha
    Saldanha, Elveera
    Shripad Banavali
    Chandrani, Pratik
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Supplementary Table S1: commonly used patient-derived cell lines, the average number of microchromosomes, and citations to the original articles.

    Supplementary Table S2: list of PubMed abstracts and the disease annotations identified by an artificial-intelligence-based technique, relating to the incidence of microchromosomes in humans.

    Machine Learning_NER_code_output_20220120.zip: File containing the PubMed abstracts, machine learning analysis output and disease interpretation output.

    Cell lines karyotype.zip: File containing the raw karyotype data of all head & neck cancer cell lines.

    Brief description of methodology:

    To investigate the incidence of microchromosomes in the human genome, we mined the PubMed literature for studies matching the keywords “((microchromosome) OR ("marker chromosome") OR ("small chromosome"))” with the filter “human”. A total of 1,365 abstracts were retrieved from PubMed as of 08-Jan-2022. We analyzed the abstracts using the named entity recognition (NER) technique of machine learning (ML), implemented with spaCy (3.0), scispaCy (0.4.0) and Python (3.7) on a Windows 11 system. The scispaCy NER model “en_ner_bc5cdr_md”, pretrained on the BC5CDR corpus, was used for disease entity recognition (https://allenai.github.io/scispacy/). The model recognized approximately 2,000 disease entities in the abstract texts of the 1,365 articles; these were extracted and grouped into the most common broad disease classes, as shown in the Excel file Supply_Table-S1.xlsx. The Python code, PubMed input and output files are available in "Machine Learning_NER_code_output_20220120.zip". Overall, inherited or somatically acquired microchromosomes in human individuals are frequently reported alongside diseases and disorders such as cancer, trisomy, Turner’s syndrome, epilepsy, infertility, and autism.
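
    The NER call described above boils down to a few lines of spaCy/scispaCy. The sketch below uses the en_ner_bc5cdr_md model named in the description (installed separately from the scispaCy release page) on a made-up abstract sentence.

        # Sketch of the disease-entity extraction step; the example text is invented.
        import spacy

        nlp = spacy.load("en_ner_bc5cdr_md")   # BC5CDR-pretrained scispaCy model
        text = "A small supernumerary marker chromosome was associated with epilepsy and autism."
        for ent in nlp(text).ents:
            print(ent.text, ent.label_)        # e.g. "epilepsy DISEASE"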

  11. DataSheet1_Medical terminology-based computing system: a lightweight...

    • frontiersin.figshare.com
    pdf
    Updated Jun 1, 2023
    Cite
    Nadia Saeed; Hammad Naveed (2023). DataSheet1_Medical terminology-based computing system: a lightweight post-processing solution for out-of-vocabulary multi-word terms.PDF [Dataset]. http://doi.org/10.3389/fmolb.2022.928530.s001
    Dataset provided by
    Frontiers
    Authors
    Nadia Saeed; Hammad Naveed
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The linguistic rules of medical terminology assist in gaining acquaintance with rare/complex clinical and biomedical terms. The medical language follows a Greek and Latin-inspired nomenclature. This nomenclature aids the stakeholders in simplifying the medical terms and gaining semantic familiarity. However, natural language processing models misrepresent rare and complex biomedical words. In this study, we present MedTCS—a lightweight, post-processing module—to simplify hybridized or compound terms into regular words using medical nomenclature. MedTCS enabled the word-based embedding models to achieve 100% coverage and enabled the BiowordVec model to achieve high correlation scores (0.641 and 0.603 in UMNSRS similarity and relatedness datasets, respectively) that significantly surpass the n-gram and sub-word approaches of FastText and BERT. In the downstream task of named entity recognition (NER), MedTCS enabled the latest clinical embedding model of FastText-OA-All-300d to improve the F1-score from 0.45 to 0.80 on the BC5CDR corpus and from 0.59 to 0.81 on the NCBI-Disease corpus, respectively. Similarly, in the drug indication classification task, our model was able to increase the coverage by 9% and the F1-score by 1%. Our results indicate that incorporating a medical terminology-based module provides distinctive contextual clues to enhance vocabulary as a post-processing step on pre-trained embeddings. We demonstrate that the proposed module enables the word embedding models to generate vectors of out-of-vocabulary words effectively. We expect that our study can be a stepping stone for the use of biomedical knowledge-driven resources in NLP.
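
    A purely illustrative sketch of the post-processing idea, not the MedTCS code: split an out-of-vocabulary compound term into known Greek/Latin morphemes and average their embedding vectors. The morpheme table and vectors are toy values.

        # Hypothetical morpheme-based fallback for out-of-vocabulary medical terms.
        import numpy as np

        vectors = {"cardio": np.array([0.9, 0.1]),    # toy 2-d "embeddings"
                   "myo": np.array([0.2, 0.7]),
                   "pathy": np.array([0.4, 0.4])}

        def decompose(term, morphemes):
            """Greedy left-to-right split of a term into known morphemes."""
            parts, i = [], 0
            while i < len(term):
                match = next((m for m in sorted(morphemes, key=len, reverse=True)
                              if term.startswith(m, i)), None)
                if match is None:
                    return []                         # give up on unknown pieces
                parts.append(match)
                i += len(match)
            return parts

        def oov_vector(term):
            parts = decompose(term, vectors)
            return np.mean([vectors[p] for p in parts], axis=0) if parts else None

        print(decompose("cardiomyopathy", vectors))   # ['cardio', 'myo', 'pathy']
        print(oov_vector("cardiomyopathy"))           # averaged toy vector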

  12. BC5-CDR-Balanced

    • huggingface.co
    Updated May 28, 2025
    Cite
    Baoyang Liu (2025). BC5-CDR-Balanced [Dataset]. https://huggingface.co/datasets/n0v1cee/BC5-CDR-Balanced
    Authors
    Baoyang Liu
    License

    MIT License, https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    n0v1cee/BC5-CDR-Balanced dataset hosted on Hugging Face and contributed by the HF Datasets community
