https://choosealicense.com/licenses/other/
The automatic extraction of chemical information from text requires the recognition of chemical entity mentions as one of its key steps. When developing supervised named entity recognition (NER) systems, the availability of a large, manually annotated text corpus is desirable. Furthermore, large corpora permit the robust evaluation and comparison of different approaches that detect chemicals in documents. We present the CHEMDNER corpus, a collection of 10,000 PubMed abstracts that contain a total of 84,355 chemical entity mentions labeled manually by expert chemistry literature curators, following annotation guidelines specifically defined for this task. The abstracts of the CHEMDNER corpus were selected to be representative of all major chemical disciplines. Each of the chemical entity mentions was manually labeled according to its structure-associated chemical entity mention (SACEM) class: abbreviation, family, formula, identifier, multiple, systematic, and trivial. The difficulty and consistency of tagging chemicals in text was measured in an agreement study between annotators, which yielded a percentage agreement of 91%. For a subset of the CHEMDNER corpus (the test set of 3,000 abstracts) we provide not only the Gold Standard manual annotations but also the mentions automatically detected by the 26 teams that participated in the BioCreative IV CHEMDNER chemical mention recognition task. In addition, we release the CHEMDNER silver standard corpus of automatically extracted mentions from 17,000 randomly selected PubMed abstracts. A version of the CHEMDNER corpus in the BioC format has been generated as well. We propose a standard for the required minimum information about entity annotations for the construction of domain-specific corpora on chemical and drug entities. The CHEMDNER corpus and annotation guidelines are available at: http://www.biocreative.org/resources/biocreative-iv/chemdner-corpus/
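The inter-annotator percentage agreement reported above can be illustrated with a small sketch. The mention tuples below are hypothetical, and the union-based formula is a simple proxy; the exact protocol used for the CHEMDNER agreement study may differ.

```python
def percentage_agreement(ann_a, ann_b):
    """Percentage of mention annotations shared by both annotators,
    measured against the union of all annotations (a simple proxy;
    the corpus's exact agreement protocol may differ)."""
    set_a, set_b = set(ann_a), set(ann_b)
    union = set_a | set_b
    if not union:
        return 100.0
    return 100.0 * len(set_a & set_b) / len(union)

# Hypothetical mention spans: (start offset, end offset, SACEM class)
a = [(0, 7, "trivial"), (12, 20, "formula"), (25, 31, "family")]
b = [(0, 7, "trivial"), (12, 20, "formula"), (40, 48, "abbreviation")]
print(round(percentage_agreement(a, b), 1))  # 50.0
```

Two of the four distinct spans are annotated identically by both annotators, giving 50% under this measure.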
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
NCBI: The NCBI Disease dataset is a biomedical corpus containing 793 PubMed abstracts annotated with disease mentions.
Biomedical entity linking is the task of linking entity mentions in a biomedical document to referent entities in a knowledge base.
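The entity-linking task defined above can be sketched in its simplest form as a dictionary lookup against a knowledge base. The mapping below is a minimal illustration (the MeSH descriptors shown are real, but production linkers use alias tables, context, and learned rankers rather than exact match):

```python
# Minimal dictionary-based entity-linking sketch: map a mention string
# to a knowledge-base identifier by normalized exact match. Real
# systems handle aliases, abbreviations, and ambiguity.
KB = {
    "aspirin": "MESH:D001241",
    "acetylsalicylic acid": "MESH:D001241",
    "diabetes mellitus": "MESH:D003920",
}

def link(mention):
    """Return the KB identifier for a mention, or None if unresolved."""
    return KB.get(mention.strip().lower())

print(link("Aspirin"))       # MESH:D001241
print(link("unknown term"))  # None
```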
autoevaluate/autoeval-eval-tner_bc5cdr-bc5cdr-01abad-31923144985 dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Currently, in the field of biomedical named entity recognition, CharCNN (character-level convolutional neural networks) or CharRNN (character-level recurrent neural networks) is typically used on its own to extract character features. However, this approach does not exploit the complementary strengths of the two architectures and only concatenates word features, ignoring the feature information produced during word integration. To address this, this paper proposes a multi-cross-attention feature fusion method. First, DistilBioBERT is combined with CharCNN and with CharLSTM to perform cross-attention word-char (word feature and character feature) fusion separately. Then, the two feature vectors obtained from cross-attention fusion are fused again through cross-attention to obtain the final feature vector. Subsequently, a BiLSTM with a multi-head attention mechanism is introduced to strengthen the model's focus on key features and further improve performance. Finally, the output layer produces the final result. Experimental results show that the proposed model achieves the best F1 scores of 90.76%, 89.79%, 94.98%, 80.27% and 88.84% on the NCBI-Disease, BC5CDR-Disease, BC5CDR-Chem, JNLPBA and BC2GM biomedical datasets, respectively. This indicates that our model captures richer semantic features and improves entity recognition.
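The two-stage cross-attention fusion described above can be sketched with plain scaled dot-product attention. This is a single-head NumPy illustration of the fusion pattern only (random stand-in features, no trained encoders), not the paper's implementation:

```python
import numpy as np

def cross_attention(q, k, v):
    """Scaled dot-product cross-attention: queries from one feature
    stream attend over another stream's keys/values (single head)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
word_feats = rng.normal(size=(10, 64))  # stand-in word-level features
char_feats = rng.normal(size=(10, 64))  # stand-in character-level features

# Stage 1: fuse word and character streams in both directions.
w = cross_attention(word_feats, char_feats, char_feats)
c = cross_attention(char_feats, word_feats, word_feats)
# Stage 2: fuse the two fused views again via cross-attention.
final = cross_attention(w, c, c)
print(final.shape)  # (10, 64)
```

In the paper's pipeline the stage-1 inputs would come from DistilBioBERT, CharCNN, and CharLSTM, and the fused vector would feed the BiLSTM/multi-head-attention layers.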
We include the annotation guidelines document, which makes clear that the corpus can be combined with the CHEMDNER corpus (chemical name annotations) and the BC5CDR corpus (chemical name and MeSH annotations) to further improve chemical NER in the biomedical literature.
The corpus has been divided into train/dev/test to facilitate benchmarking and comparisons.
The annotations are provided inline in the BioC XML format, a minimalistic format designed to facilitate text mining. We also maintain a copy here: https://www.ncbi.nlm.nih.gov/research/bionlp/
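Inline BioC annotations of the kind described above can be read with the standard library alone. The fragment below is hand-written for illustration (real corpus files contain full abstracts and richer infons), but it follows the BioC collection/document/passage/annotation element structure:

```python
import xml.etree.ElementTree as ET

# A minimal BioC-style fragment, hand-written here for illustration.
bioc_xml = """<collection>
  <document>
    <id>12345</id>
    <passage>
      <offset>0</offset>
      <text>Aspirin reduces fever.</text>
      <annotation id="T1">
        <infon key="type">Chemical</infon>
        <location offset="0" length="7"/>
        <text>Aspirin</text>
      </annotation>
    </passage>
  </document>
</collection>"""

root = ET.fromstring(bioc_xml)
mentions = []
for ann in root.iter("annotation"):
    etype = ann.findtext("infon")  # value of the first infon (its type)
    loc = ann.find("location")
    mentions.append((ann.findtext("text"), etype,
                     int(loc.get("offset")), int(loc.get("length"))))
print(mentions)  # [('Aspirin', 'Chemical', 0, 7)]
```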
autoevaluate/autoeval-eval-tner_bc5cdr-bc5cdr-5c6ce1-69913145667 dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Supplementary Table S1: Commonly used patient-derived cell lines, average number of microchromosomes, and citation to the original article.
Supplementary Table S2: List of PubMed abstracts and annotation of diseases identified by the artificial intelligence-based technique, related to the incidence of microchromosomes in humans.
Machine Learning_NER_code_output_20220120.zip: File containing the PubMed abstracts, machine learning analysis output and disease interpretation output.
Cell lines karyotype.zip: File containing the raw karyotype data of all head & neck cancer cell lines.
Brief description of methodology:
To investigate the incidence of microchromosomes in the human genome, we mined the PubMed literature for studies matching the keywords “((microchromosome) OR ("marker chromosome") OR ("small chromosome"))” with the filter “human”. A total of 1,365 abstracts were retrieved from PubMed as of 08-Jan-2022. We analyzed the abstracts using the Named Entity Recognition (NER) technique of Machine Learning (ML) implemented with spaCy (3.0) and scispaCy (0.4.0) in Python (3.7) on a Windows 11 system. The scispaCy NER model “en_ner_bc5cdr_md”, which is pretrained on the BC5CDR corpus, was used for disease entity recognition (https://allenai.github.io/scispacy/). Approximately 2,000 disease entities were recognized by the model in the abstract text of the 1,365 articles. The disease entities present in the abstract texts were extracted and then grouped into the most common broad disease classes, as shown in the Excel file Supply_Table-S1.xlsx. The Python code, PubMed input, and output files are available in "Machine Learning_NER_code_output_20220120.zip". Overall, inherited or somatically acquired microchromosomes in human individuals are frequently reported together with diseases and disorders such as Cancer, Trisomy, Turner’s syndrome, Epilepsy, Infertility, and Autism.
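The grouping step described above (mapping NER-extracted disease entities to broad disease classes) can be sketched as a keyword lookup. The keyword table below is a hypothetical stand-in, not the mapping actually used in the study:

```python
from collections import Counter

# Hypothetical keyword-to-class mapping; the study's actual grouping
# of en_ner_bc5cdr_md output may differ.
BROAD_CLASSES = {
    "cancer": "Cancer", "carcinoma": "Cancer", "tumor": "Cancer",
    "trisomy": "Trisomy", "epilepsy": "Epilepsy",
    "infertility": "Infertility", "autism": "Autism",
}

def group_entities(entities):
    """Count extracted disease mentions by broad disease class."""
    counts = Counter()
    for ent in entities:
        for key, cls in BROAD_CLASSES.items():
            if key in ent.lower():
                counts[cls] += 1
                break
    return counts

# Example disease strings as an NER model might return them.
ents = ["breast cancer", "Trisomy 21", "epilepsy", "renal carcinoma"]
print(group_entities(ents))
```

In the actual pipeline, `ents` would be the entity texts produced by running the scispaCy `en_ner_bc5cdr_md` model over the 1,365 abstracts.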
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The linguistic rules of medical terminology assist in gaining acquaintance with rare or complex clinical and biomedical terms. The medical language follows a Greek- and Latin-inspired nomenclature. This nomenclature aids stakeholders in simplifying medical terms and gaining semantic familiarity. However, natural language processing models misrepresent rare and complex biomedical words. In this study, we present MedTCS, a lightweight post-processing module that simplifies hybridized or compound terms into regular words using medical nomenclature. MedTCS enabled the word-based embedding models to achieve 100% coverage and enabled the BiowordVec model to achieve high correlation scores (0.641 and 0.603 on the UMNSRS similarity and relatedness datasets, respectively) that significantly surpass the n-gram and sub-word approaches of FastText and BERT. In the downstream task of named entity recognition (NER), MedTCS enabled the latest clinical embedding model, FastText-OA-All-300d, to improve the F1-score from 0.45 to 0.80 on the BC5CDR corpus and from 0.59 to 0.81 on the NCBI-Disease corpus. Similarly, in the drug indication classification task, our model increased coverage by 9% and the F1-score by 1%. Our results indicate that incorporating a medical terminology-based module as a post-processing step on pre-trained embeddings provides distinctive contextual clues that enhance vocabulary coverage. We demonstrate that the proposed module enables word embedding models to generate vectors for out-of-vocabulary words effectively. We expect that our study can be a stepping stone for the use of biomedical knowledge-driven resources in NLP.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
n0v1cee/BC5-CDR-Balanced dataset hosted on Hugging Face and contributed by the HF Datasets community