https://choosealicense.com/licenses/other/
The automatic extraction of chemical information from text requires the recognition of chemical entity mentions as one of its key steps. When developing supervised named entity recognition (NER) systems, the availability of a large, manually annotated text corpus is desirable. Furthermore, large corpora permit the robust evaluation and comparison of different approaches that detect chemicals in documents. We present the CHEMDNER corpus, a collection of 10,000 PubMed abstracts that contain a total of 84,355 chemical entity mentions labeled manually by expert chemistry literature curators, following annotation guidelines specifically defined for this task. The abstracts of the CHEMDNER corpus were selected to be representative of all major chemical disciplines. Each of the chemical entity mentions was manually labeled according to its structure-associated chemical entity mention (SACEM) class: abbreviation, family, formula, identifier, multiple, systematic, and trivial. The difficulty and consistency of tagging chemicals in text was measured in an agreement study between annotators, which yielded a percentage agreement of 91%. For a subset of the CHEMDNER corpus (the test set of 3,000 abstracts) we provide not only the Gold Standard manual annotations but also the mentions automatically detected by the 26 teams that participated in the BioCreative IV CHEMDNER chemical mention recognition task. In addition, we release the CHEMDNER silver standard corpus of automatically extracted mentions from 17,000 randomly selected PubMed abstracts. A version of the CHEMDNER corpus in the BioC format has been generated as well. We propose a standard for the required minimum information about entity annotations for the construction of domain-specific corpora on chemical and drug entities. The CHEMDNER corpus and annotation guidelines are available at: http://www.biocreative.org/resources/biocreative-iv/chemdner-corpus/
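The inter-annotator percentage agreement reported above can be illustrated with a small sketch. The mention tuples below are hypothetical, and the union-based formula is a simple proxy; the exact protocol used for the CHEMDNER agreement study may differ.

```python
def percentage_agreement(ann_a, ann_b):
    """Percentage of mention annotations shared by both annotators,
    measured against the union of all annotations (a simple proxy;
    the corpus's exact agreement protocol may differ)."""
    set_a, set_b = set(ann_a), set(ann_b)
    union = set_a | set_b
    if not union:
        return 100.0
    return 100.0 * len(set_a & set_b) / len(union)

# Hypothetical mention spans: (start offset, end offset, SACEM class)
a = [(0, 7, "trivial"), (12, 20, "formula"), (25, 31, "family")]
b = [(0, 7, "trivial"), (12, 20, "formula"), (40, 48, "abbreviation")]
print(round(percentage_agreement(a, b), 1))  # 50.0
```

Two of the four distinct spans are annotated identically by both annotators, giving 50% under this measure.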
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
NCBI: The NCBI Disease dataset is a biomedical corpus containing 793 PubMed abstracts annotated with disease mentions.
Biomedical entity linking is the task of linking entity mentions in a biomedical document to referent entities in a knowledge base.
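The entity-linking task defined above can be sketched in its simplest form as a dictionary lookup against a knowledge base. The mapping below is a minimal illustration (the MeSH descriptors shown are real, but production linkers use alias tables, context, and learned rankers rather than exact match):

```python
# Minimal dictionary-based entity-linking sketch: map a mention string
# to a knowledge-base identifier by normalized exact match. Real
# systems handle aliases, abbreviations, and ambiguity.
KB = {
    "aspirin": "MESH:D001241",
    "acetylsalicylic acid": "MESH:D001241",
    "diabetes mellitus": "MESH:D003920",
}

def link(mention):
    """Return the KB identifier for a mention, or None if unresolved."""
    return KB.get(mention.strip().lower())

print(link("Aspirin"))       # MESH:D001241
print(link("unknown term"))  # None
```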
autoevaluate/autoeval-eval-tner_bc5cdr-bc5cdr-01abad-31923144985 dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Currently, in the field of biomedical named entity recognition, CharCNN (character-level convolutional neural networks) or CharRNN (character-level recurrent neural networks) is typically used on its own to extract character features. However, this approach does not exploit the complementary strengths of the two architectures and only concatenates word features, ignoring the feature information produced during word integration. To address this, this paper proposes a multi-cross-attention feature fusion method. First, DistilBioBERT is combined with CharCNN and with CharLSTM to perform cross-attention word-char (word feature and character feature) fusion separately. Then, the two feature vectors obtained from cross-attention fusion are fused again through cross-attention to obtain the final feature vector. Subsequently, a BiLSTM with a multi-head attention mechanism is introduced to strengthen the model's focus on key features and further improve performance. Finally, the output layer produces the final result. Experimental results show that the proposed model achieves the best F1 scores of 90.76%, 89.79%, 94.98%, 80.27% and 88.84% on the NCBI-Disease, BC5CDR-Disease, BC5CDR-Chem, JNLPBA and BC2GM biomedical datasets, respectively. This indicates that our model captures richer semantic features and improves entity recognition.
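The two-stage cross-attention fusion described above can be sketched with plain scaled dot-product attention. This is a single-head NumPy illustration of the fusion pattern only (random stand-in features, no trained encoders), not the paper's implementation:

```python
import numpy as np

def cross_attention(q, k, v):
    """Scaled dot-product cross-attention: queries from one feature
    stream attend over another stream's keys/values (single head)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
word_feats = rng.normal(size=(10, 64))  # stand-in word-level features
char_feats = rng.normal(size=(10, 64))  # stand-in character-level features

# Stage 1: fuse word and character streams in both directions.
w = cross_attention(word_feats, char_feats, char_feats)
c = cross_attention(char_feats, word_feats, word_feats)
# Stage 2: fuse the two fused views again via cross-attention.
final = cross_attention(w, c, c)
print(final.shape)  # (10, 64)
```

In the paper's pipeline the stage-1 inputs would come from DistilBioBERT, CharCNN, and CharLSTM, and the fused vector would feed the BiLSTM/multi-head-attention layers.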
We include the annotation guidelines document, which makes clear that the corpus can be combined with the CHEMDNER corpus (chemical name annotations) and the BC5CDR corpus (chemical name and MeSH annotations) to further improve chemical NER in the biomedical literature.
The corpus has been divided into train/dev/test to facilitate benchmarking and comparisons.
The annotations are provided inline in the BioC XML format, a minimalistic format designed to facilitate text mining. We also maintain a copy here: https://www.ncbi.nlm.nih.gov/research/bionlp/
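Inline BioC annotations of the kind described above can be read with the standard library alone. The fragment below is hand-written for illustration (real corpus files contain full abstracts and richer infons), but it follows the BioC collection/document/passage/annotation element structure:

```python
import xml.etree.ElementTree as ET

# A minimal BioC-style fragment, hand-written here for illustration.
bioc_xml = """<collection>
  <document>
    <id>12345</id>
    <passage>
      <offset>0</offset>
      <text>Aspirin reduces fever.</text>
      <annotation id="T1">
        <infon key="type">Chemical</infon>
        <location offset="0" length="7"/>
        <text>Aspirin</text>
      </annotation>
    </passage>
  </document>
</collection>"""

root = ET.fromstring(bioc_xml)
mentions = []
for ann in root.iter("annotation"):
    etype = ann.findtext("infon")  # value of the first infon (its type)
    loc = ann.find("location")
    mentions.append((ann.findtext("text"), etype,
                     int(loc.get("offset")), int(loc.get("length"))))
print(mentions)  # [('Aspirin', 'Chemical', 0, 7)]
```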
autoevaluate/autoeval-eval-tner_bc5cdr-bc5cdr-5c6ce1-69913145667 dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Supplementary Table S1: Commonly used patient-derived cell lines, average number of microchromosomes, and citation to the original article.
Supplementary Table S2: List of PubMed abstracts and annotation of diseases identified by the artificial intelligence-based technique, related to the incidence of microchromosomes in humans.
Machine Learning_NER_code_output_20220120.zip: File containing the PubMed abstracts, machine learning analysis output and disease interpretation output.
Cell lines karyotype.zip: File containing the raw karyotype data of all head & neck cancer cell lines.
Brief description of methodology:
To investigate the incidence of microchromosomes in the human genome, we mined the PubMed literature for studies matching the keywords “((microchromosome) OR ("marker chromosome") OR ("small chromosome"))” with the filter “human”. A total of 1,365 abstracts were retrieved from PubMed as of 08-Jan-2022. We analyzed the abstracts using the Named Entity Recognition (NER) technique of Machine Learning (ML) implemented with spaCy (3.0) and scispaCy (0.4.0) in Python (3.7) on a Windows 11 system. The scispaCy NER model “en_ner_bc5cdr_md”, which is pretrained on the BC5CDR corpus, was used for disease entity recognition (https://allenai.github.io/scispacy/). Approximately 2,000 disease entities were recognized by the model in the abstract text of the 1,365 articles. The disease entities present in the abstract texts were extracted and then grouped into the most common broad disease classes, as shown in the Excel file Supply_Table-S1.xlsx. The Python code, PubMed input, and output files are available in "Machine Learning_NER_code_output_20220120.zip". Overall, inherited or somatically acquired microchromosomes in human individuals are frequently reported together with diseases and disorders such as Cancer, Trisomy, Turner’s syndrome, Epilepsy, Infertility, and Autism.
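The grouping step described above (mapping NER-extracted disease entities to broad disease classes) can be sketched as a keyword lookup. The keyword table below is a hypothetical stand-in, not the mapping actually used in the study:

```python
from collections import Counter

# Hypothetical keyword-to-class mapping; the study's actual grouping
# of en_ner_bc5cdr_md output may differ.
BROAD_CLASSES = {
    "cancer": "Cancer", "carcinoma": "Cancer", "tumor": "Cancer",
    "trisomy": "Trisomy", "epilepsy": "Epilepsy",
    "infertility": "Infertility", "autism": "Autism",
}

def group_entities(entities):
    """Count extracted disease mentions by broad disease class."""
    counts = Counter()
    for ent in entities:
        for key, cls in BROAD_CLASSES.items():
            if key in ent.lower():
                counts[cls] += 1
                break
    return counts

# Example disease strings as an NER model might return them.
ents = ["breast cancer", "Trisomy 21", "epilepsy", "renal carcinoma"]
print(group_entities(ents))
```

In the actual pipeline, `ents` would be the entity texts produced by running the scispaCy `en_ner_bc5cdr_md` model over the 1,365 abstracts.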
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The linguistic rules of medical terminology assist in gaining acquaintance with rare or complex clinical and biomedical terms. The medical language follows a Greek- and Latin-inspired nomenclature. This nomenclature aids stakeholders in simplifying medical terms and gaining semantic familiarity. However, natural language processing models misrepresent rare and complex biomedical words. In this study, we present MedTCS, a lightweight post-processing module that simplifies hybridized or compound terms into regular words using medical nomenclature. MedTCS enabled the word-based embedding models to achieve 100% coverage and enabled the BiowordVec model to achieve high correlation scores (0.641 and 0.603 on the UMNSRS similarity and relatedness datasets, respectively) that significantly surpass the n-gram and sub-word approaches of FastText and BERT. In the downstream task of named entity recognition (NER), MedTCS enabled the latest clinical embedding model, FastText-OA-All-300d, to improve the F1-score from 0.45 to 0.80 on the BC5CDR corpus and from 0.59 to 0.81 on the NCBI-Disease corpus. Similarly, in the drug indication classification task, our model increased coverage by 9% and the F1-score by 1%. Our results indicate that incorporating a medical terminology-based module as a post-processing step on pre-trained embeddings provides distinctive contextual clues that enhance vocabulary coverage. We demonstrate that the proposed module enables word embedding models to generate vectors for out-of-vocabulary words effectively. We expect that our study can be a stepping stone for the use of biomedical knowledge-driven resources in NLP.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
n0v1cee/BC5-CDR-Balanced dataset hosted on Hugging Face and contributed by the HF Datasets community