17 datasets found
  1. aeroBERT-NER

    • huggingface.co
    Updated Apr 7, 2023
    Cite
    Archana Tikayat Ray (2023). aeroBERT-NER [Dataset]. http://doi.org/10.57967/hf/0470
    Authors
    Archana Tikayat Ray
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Dataset Card for aeroBERT-NER

    Dataset Summary

    This dataset contains sentences from the aerospace requirements domain. The sentences are tagged for five NER categories (SYS, VAL, ORG, DATETIME, and RES) using the BIO tagging scheme. There are 1432 sentences in total. The dataset was created to:
    (1) make available an open-source dataset for aerospace requirements, which are often proprietary;
    (2) fine-tune language models for token identification… See the full description on the dataset page: https://huggingface.co/datasets/archanatikayatray/aeroBERT-NER.
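
    To illustrate what the BIO tagging scheme mentioned above encodes, here is a minimal Python sketch that groups BIO-tagged tokens into entity spans. The function name and the example sentence are hypothetical and not taken from the dataset; only the category names (SYS, VAL) come from the description.

```python
from typing import List, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_type, text) spans."""
    spans = []
    current_type = None
    current_tokens = []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new entity; close any open one first.
            if current_type is not None:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type continues the open entity.
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open entity:
            # close whatever is open and skip this token.
            if current_type is not None:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_type is not None:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


# Hypothetical requirement sentence, for illustration only.
tokens = ["The", "flight", "control", "system", "shall", "respond", "within", "50", "ms"]
tags = ["O", "B-SYS", "I-SYS", "I-SYS", "O", "O", "O", "B-VAL", "I-VAL"]
spans = bio_to_spans(tokens, tags)
```

    In this sketch a stray I- tag with no matching B- tag is dropped; other decoders choose to start a new entity there instead.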

  2. Multilingual named entity recognition for medieval charters. Datasets and...

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Jan 16, 2023
    Cite
    Sergio Torres Aguilar (2023). Multilingual named entity recognition for medieval charters. Datasets and models [Dataset]. http://doi.org/10.5281/zenodo.6463699
    Available download formats: zip
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Sergio Torres Aguilar
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Annotated dataset for training named entity recognition models for medieval charters in Latin, French, and Spanish.

    The original raw texts for all charters were collected from four charter collections:

    - HOME-ALCAR corpus: https://zenodo.org/record/5600884

    - CBMA: http://www.cbma-project.eu

    - Diplomata Belgica: https://www.diplomata-belgica.be

    - CODEA corpus: https://corpuscodea.es/

    We include (i) the annotated training datasets, (ii) the contextual and static embeddings trained on medieval multilingual texts, and (iii) the named entity recognition models trained using two architectures: Bi-LSTM-CRF with stacked embeddings, and fine-tuned BERT-based models (mBERT and RoBERTa).

    The code, datasets, and notebooks used to train the models are available in our GitLab repository: https://gitlab.com/magistermilitum/ner_medieval_multilingual

    Our best RoBERTa model is also available on Hugging Face: https://huggingface.co/magistermilitum/roberta-multilingual-medieval-ner

  3. biobert-ner-fda-recalls-dataset

    • huggingface.co
    Cite
    Miriam Farrington, biobert-ner-fda-recalls-dataset [Dataset]. https://huggingface.co/datasets/mfarrington/biobert-ner-fda-recalls-dataset
    Authors
    Miriam Farrington
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Dataset Card for FDA CDRH Device Recalls NER Dataset

    This FDA medical device recalls dataset was created for medical device named entity recognition (NER).

    Dataset Details

    Dataset Description

    This dataset was created for NER tasks. It is based on the OpenFDA Device Recalls dataset, which has been processed and annotated for NER. The Device Recalls dataset has been further processed to extract the recall action element, which… See the full description on the dataset page: https://huggingface.co/datasets/mfarrington/biobert-ner-fda-recalls-dataset.

  4. dnrti-securebert-ner-512

    • huggingface.co
    + more versions
    Cite
    Naor Matania, dnrti-securebert-ner-512 [Dataset]. https://huggingface.co/datasets/naorm/dnrti-securebert-ner-512
    Authors
    Naor Matania
    Description

    naorm/dnrti-securebert-ner-512 dataset hosted on Hugging Face and contributed by the HF Datasets community

  5. Data from: Multi-head CRF classifier for biomedical multi-class Named Entity...

    • zenodo.org
    • data.niaid.nih.gov
    tsv, zip
    Updated May 10, 2024
    Cite
    Richard A A Jonker; Tiago Almeida; Rui Antunes; João Rafael Almeida; Sérgio Matos (2024). Multi-head CRF classifier for biomedical multi-class Named Entity Recognition on Spanish clinical notes [Dataset]. http://doi.org/10.5281/zenodo.11174163
    Available download formats: tsv, zip
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Richard A A Jonker; Tiago Almeida; Rui Antunes; João Rafael Almeida; Sérgio Matos
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Time period covered
    May 6, 2024
    Description

    This is the merged dataset described in the work "Multi-head CRF classifier for biomedical multi-class Named Entity Recognition on Spanish clinical notes".

    This dataset consists of 4 separate datasets.

    The dataset covers two tasks:

    Task 1: Multi-class Named Entity Recognition. The dataset contains 5 possible classes: SYMPTOM, PROCEDURE, DISEASE, CHEMICAL, and PROTEIN.

    Task 2: Named Entity Linking, where each code corresponds to a code within the SNOMED-CT corpus. The exact corpus used can be obtained here. In addition, for the MedProcNER, SympTEMIST, and DisTEMIST datasets, a gazetteer is provided in the original datasets.

    For more information on the construction of the dataset, as well as dataloaders, we refer you to our GitHub repository.

    The dataset also contains the embeddings from the SapBERT model.

    Please cite:

    @article{jonker2024a,
      title = {Multi-head {{CRF}} classifier for biomedical multi-class named entity recognition on {{Spanish}} clinical notes},
      author = {Jonker, Richard A. A. and Almeida, Tiago and Antunes, Rui and Almeida, Jo{\~a}o R. and Matos, S{\'e}rgio},
      year = {2024},
      journal = {Database},
      publisher = {Oxford University Press}
    }
    Jonker, R. A. A., Almeida, T., Antunes, R., Almeida, J. R., & Matos, S. (2024). Multi-head CRF classifier for biomedical multi-class named entity recognition on Spanish clinical notes. (Submitted.)

    License

    This work is licensed under a Creative Commons Attribution 4.0 International License.

  6. malware-text-db-securebert-ner-512

    • huggingface.co
    + more versions
    Cite
    Naor Matania, malware-text-db-securebert-ner-512 [Dataset]. https://huggingface.co/datasets/naorm/malware-text-db-securebert-ner-512
    Authors
    Naor Matania
    Description

    naorm/malware-text-db-securebert-ner-512 dataset hosted on Hugging Face and contributed by the HF Datasets community

  7. Multilingual NER Dataset

    • kaggle.com
    Updated Dec 5, 2023
    Cite
    The Devastator (2023). Multilingual NER Dataset [Dataset]. https://www.kaggle.com/datasets/thedevastator/multilingual-ner-dataset
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Multilingual NER Dataset

    Multilingual NER Dataset for Named Entity Recognition

    By Babelscape (From Huggingface) [source]

    About this dataset

    The Babelscape/wikineural NER Dataset is a comprehensive and diverse collection of multilingual text data specifically designed for the task of Named Entity Recognition (NER). It offers an extensive range of labeled sentences in nine different languages: French, German, Portuguese, Spanish, Polish, Dutch, Russian, English, and Italian.

    Each sentence in the dataset contains tokens (words or characters) that have been labeled with named entity recognition tags. These tags provide valuable information about the type of named entity each token represents. The dataset also includes a language column to indicate the language in which each sentence is written.

    This dataset serves as an invaluable resource for developing and evaluating NER models across multiple languages. It encompasses various domains and contexts to ensure diversity and representativeness. Researchers and practitioners can utilize this dataset to train and test their NER models in real-world scenarios.

    By using this dataset for NER tasks, users can enhance their understanding of how named entities are recognized across different languages. Furthermore, it enables benchmarking performance comparisons between various NER models developed for specific languages or trained on multiple languages simultaneously.

    Whether you are an experienced researcher or a beginner exploring multilingual NER tasks, the Babelscape/wikineural NER Dataset provides an informative and versatile resource that can contribute to advances in natural language processing and information extraction applications.

    How to use the dataset

    • Understand the Data Structure:

      • The dataset consists of labeled sentences in nine different languages: French (fr), German (de), Portuguese (pt), Spanish (es), Polish (pl), Dutch (nl), Russian (ru), English (en), and Italian (it).
      • Each sentence is represented by three columns: tokens, ner_tags, and lang.
      • The tokens column contains the individual words or characters in each labeled sentence.
      • The ner_tags column provides named entity recognition tags for each token, indicating their entity types.
      • The lang column specifies the language of each sentence.
    • Explore Different Languages:

      • Since this dataset covers multiple languages, you can choose to focus on a specific language or perform cross-lingual analysis.
      • Analyzing multiple languages can help uncover patterns and differences in named entities across various linguistic contexts.
    • Preprocessing and Cleaning:

      • Before training your NER models or applying any NLP techniques to this dataset, it's essential to preprocess and clean the data.
      • Consider removing any unnecessary punctuation marks or special characters unless they carry significant meaning in certain languages.
    • Training Named Entity Recognition Models:

      • Data Splitting: Divide the dataset into training, validation, and testing sets based on your requirements using appropriate ratios.
      • Feature Extraction: Prepare input features from tokenized text data, such as word embeddings or character-level representations, depending on your model choice.
      • Model Training: Utilize state-of-the-art NER models (e.g., LSTM-CRF, Transformer-based models) to train on the labeled sentences and ner_tags columns.
      • Evaluation: Evaluate your trained model's performance using the provided validation dataset or test datasets specific to each language.

    • Applying Pretrained Models:

      • Instead of training a model from scratch, you can leverage existing pretrained models such as BERT or GPT-2, or spaCy's named entity recognition capabilities.
      • Fine-tune these pre-trained models on your specific NER task using the labeled
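
    The data-splitting step described in the list above can be sketched as follows. This is a minimal illustration, assuming each row is a dict with the tokens, ner_tags, and lang columns named in the description; the function name, the 80/10/10 ratios, and the toy rows are hypothetical.

```python
import random
from collections import defaultdict


def split_by_language(rows, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle rows within each language, then cut train/validation/test."""
    by_lang = defaultdict(list)
    for row in rows:
        by_lang[row["lang"]].append(row)

    splits = {"train": [], "validation": [], "test": []}
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    for lang in sorted(by_lang):
        group = by_lang[lang][:]
        rng.shuffle(group)
        n_train = int(len(group) * ratios[0])
        n_val = int(len(group) * ratios[1])
        splits["train"].extend(group[:n_train])
        splits["validation"].extend(group[n_train:n_train + n_val])
        # Whatever remains after train and validation goes to test.
        splits["test"].extend(group[n_train + n_val:])
    return splits


# Toy rows with placeholder content (the integer tags are an assumption).
rows = [
    {"tokens": [f"w{i}"], "ner_tags": [0], "lang": lang}
    for lang in ("en", "it")
    for i in range(10)
]
splits = split_by_language(rows)
```

    Splitting inside each language keeps every language represented in all three sets, which matters when per-language evaluation is planned.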

    Research Ideas

    • Training NER models: This dataset can be used to train NER models in multiple languages. By providing labeled sentences and their corresponding named entity recognition tags, the dataset can help train models to accurately identify and classify named entities in different languages.
    • Evaluating NER performance: The dataset can be used as a benchmark to evaluate the performance of pre-trained or custom-built NER models. By using the labeled sentences as test data, developers and researchers can measure the accuracy, precision, recall, and F1-score of their models across multiple languages.
    • Cross-lingual analysis: With labeled sentences available in nine different languages, researchers can perform cross-lingual analysis...
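
    As a rough illustration of the precision, recall, and F1-score evaluation mentioned above, the sketch below scores predictions at the entity level. It assumes entities are represented as (sentence_id, start, end, type) tuples and that only exact matches count; that representation, the function name, and the toy data are assumptions, not part of the dataset.

```python
def entity_prf(gold, predicted):
    """Entity-level precision, recall, and F1 over exact-match entity tuples."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy entities: (sentence_id, start_token, end_token, entity_type).
gold = {(0, 0, 2, "PER"), (0, 5, 6, "LOC"), (1, 3, 4, "ORG"), (1, 7, 8, "LOC")}
pred = {(0, 0, 2, "PER"), (0, 5, 6, "ORG"), (1, 3, 4, "ORG")}
precision, recall, f1 = entity_prf(gold, pred)
```

    Here the prediction with the right span but the wrong type counts as both a false positive and a false negative, which is the usual strict-match convention.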
  8. location-ner-4-BERT

    • huggingface.co
    Cite
    Nabin Oli, location-ner-4-BERT [Dataset]. https://huggingface.co/datasets/nabin2004/location-ner-4-BERT
    Authors
    Nabin Oli
    Description

    nabin2004/location-ner-4-BERT dataset hosted on Hugging Face and contributed by the HF Datasets community

  9. stf_ner_pierreguillou-ner-bert-large-cased-pt-lenerbr

    • huggingface.co
    + more versions
    Cite
    Julia Dollis, stf_ner_pierreguillou-ner-bert-large-cased-pt-lenerbr [Dataset]. https://huggingface.co/datasets/juliadollis/stf_ner_pierreguillou-ner-bert-large-cased-pt-lenerbr
    Authors
    Julia Dollis
    Description

    juliadollis/stf_ner_pierreguillou-ner-bert-large-cased-pt-lenerbr dataset hosted on Hugging Face and contributed by the HF Datasets community

  10. protein_structure_NER_model_v1.2

    • huggingface.co
    + more versions
    Cite
    Melanie, protein_structure_NER_model_v1.2 [Dataset]. https://huggingface.co/datasets/mevol/protein_structure_NER_model_v1.2
    Authors
    Melanie
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Overview

    This data was used to train the model https://huggingface.co/mevol/BiomedNLP-PubMedBERT-ProteinStructure-NER-v1.2

    There are 19 different entity types in this dataset: "chemical", "complex_assembly", "evidence", "experimental_method", "gene", "mutant", "oligomeric_state", "protein", "protein_state", "protein_type", "ptm", "residue_name", "residue_name_number", "residue_number", "residue_range", "site", "species", "structure_element", "taxonomy_domain"

    The data prepared as IOB… See the full description on the dataset page: https://huggingface.co/datasets/mevol/protein_structure_NER_model_v1.2.
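
    For a sense of the label space these 19 entity types imply, the sketch below builds an IOB label set from them. It assumes the conventional B-/I- prefixes plus a single "O" label; the dataset card is truncated here, so the actual label names and id order used by the authors may differ.

```python
# The 19 entity types listed in the dataset description.
ENTITY_TYPES = [
    "chemical", "complex_assembly", "evidence", "experimental_method",
    "gene", "mutant", "oligomeric_state", "protein", "protein_state",
    "protein_type", "ptm", "residue_name", "residue_name_number",
    "residue_number", "residue_range", "site", "species",
    "structure_element", "taxonomy_domain",
]


def build_iob_labels(entity_types):
    """Return ["O", "B-x", "I-x", ...] for the given entity types."""
    labels = ["O"]
    for entity_type in entity_types:
        labels.extend([f"B-{entity_type}", f"I-{entity_type}"])
    return labels


labels = build_iob_labels(ENTITY_TYPES)
# Hypothetical id mapping; the order is an illustration choice.
label2id = {label: index for index, label in enumerate(labels)}
id2label = {index: label for label, index in label2id.items()}
```

    Under these assumptions a token classifier for this data would need 2 × 19 + 1 output classes.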

  11. protein_structure_NER_model_v1.2

    • huggingface.co
    + more versions
    Cite
    Protein Data Bank in Europe, protein_structure_NER_model_v1.2 [Dataset]. https://huggingface.co/datasets/PDBEurope/protein_structure_NER_model_v1.2
    Dataset authored and provided by
    Protein Data Bank in Europe
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Overview

    This data was used to train the model https://huggingface.co/PDBEurope/BiomedNLP-PubMedBERT-ProteinStructure-NER-v1.2

    There are 19 different entity types in this dataset: "chemical", "complex_assembly", "evidence", "experimental_method", "gene", "mutant", "oligomeric_state", "protein", "protein_state", "protein_type", "ptm", "residue_name", "residue_name_number", "residue_number", "residue_range", "site", "species", "structure_element", "taxonomy_domain"

    The data prepared as… See the full description on the dataset page: https://huggingface.co/datasets/PDBEurope/protein_structure_NER_model_v1.2.

  12. autoeval-staging-eval-project-2f2d3a43-7564875

    • huggingface.co
    Updated Oct 12, 2023
    + more versions
    Cite
    Evaluation on the Hub (2023). autoeval-staging-eval-project-2f2d3a43-7564875 [Dataset]. https://huggingface.co/datasets/autoevaluate/autoeval-staging-eval-project-2f2d3a43-7564875
    Dataset authored and provided by
    Evaluation on the Hub
    Description

    Dataset Card for AutoTrain Evaluator

    This repository contains model predictions generated by AutoTrain for the following task and dataset:

    Task: Token Classification
    Model: Ravindra001/bert-finetuned-ner
    Dataset: wikiann

    To run new evaluation jobs, visit Hugging Face's automatic model evaluator.

      Contributions
    

    Thanks to @lewtun for evaluating this model.

  13. autoeval-staging-eval-project-conll2003-2dc2f6d8-11805572

    • huggingface.co
    Updated Mar 1, 2003
    + more versions
    Cite
    Evaluation on the Hub (2003). autoeval-staging-eval-project-conll2003-2dc2f6d8-11805572 [Dataset]. https://huggingface.co/datasets/autoevaluate/autoeval-staging-eval-project-conll2003-2dc2f6d8-11805572
    Dataset authored and provided by
    Evaluation on the Hub
    Description

    Dataset Card for AutoTrain Evaluator

    This repository contains model predictions generated by AutoTrain for the following task and dataset:

    Task: Token Classification
    Model: AJGP/bert-finetuned-ner
    Dataset: conll2003
    Config: conll2003
    Split: test

    To run new evaluation jobs, visit Hugging Face's automatic model evaluator.

      Contributions
    

    Thanks to @hrezaeim for evaluating this model.

  14. autoeval-eval-conll2003-conll2003-0a1842-65397145557

    • huggingface.co
    Updated Nov 21, 1996
    + more versions
    Cite
    Evaluation on the Hub (1996). autoeval-eval-conll2003-conll2003-0a1842-65397145557 [Dataset]. https://huggingface.co/datasets/autoevaluate/autoeval-eval-conll2003-conll2003-0a1842-65397145557
    Dataset authored and provided by
    Evaluation on the Hub
    Description

    Dataset Card for AutoTrain Evaluator

    This repository contains model predictions generated by AutoTrain for the following task and dataset:

    Task: Token Classification
    Model: Abelll/bert-finetuned-ner
    Dataset: conll2003
    Config: conll2003
    Split: test

    To run new evaluation jobs, visit Hugging Face's automatic model evaluator.

      Contributions
    

    Thanks to @chnlyi for evaluating this model.

  15. resume_ner

    • huggingface.co
    Updated Jul 10, 2023
    Cite
    Cool (2023). resume_ner [Dataset]. https://huggingface.co/datasets/ttxy/resume_ner
    Authors
    Cool
    License

    https://choosealicense.com/licenses/bsd/

    Description

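    Chinese resume NER dataset. Source: https://github.com/luopeixiang/named_entity_recognition. The data format is as follows: each line consists of one character and its corresponding tag, the tag set follows the BIOES scheme, and sentences are separated by a blank line.

    美 B-LOC
    国 E-LOC
    的 O
    华 B-PER
    莱 I-PER
    士 E-PER

    我 O
    跟 O
    他 O
    谈 O
    笑 O
    风 O
    生 O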
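
    A minimal Python sketch of reading this format is shown below. It assumes exactly two whitespace-separated fields per line (character, then tag) and blank lines between sentences, as described above; the function names are hypothetical and the sample text reuses the example from the dataset card.

```python
def parse_bioes(text):
    """Split blank-line-separated 'char tag' lines into sentences."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        char, tag = line.split()
        current.append((char, tag))
    if current:
        sentences.append(current)
    return sentences


def bioes_to_entities(sentence):
    """Collect (entity_type, text) pairs from one BIOES-tagged sentence."""
    entities, buffer, buffer_type = [], [], None
    for char, tag in sentence:
        prefix, _, entity_type = tag.partition("-")
        if prefix == "S":
            # Single-character entity.
            entities.append((entity_type, char))
            buffer, buffer_type = [], None
        elif prefix == "B":
            buffer, buffer_type = [char], entity_type
        elif prefix == "I" and buffer_type == entity_type:
            buffer.append(char)
        elif prefix == "E" and buffer_type == entity_type:
            buffer.append(char)
            entities.append((entity_type, "".join(buffer)))
            buffer, buffer_type = [], None
        else:
            # "O" or an inconsistent tag: discard any partial entity.
            buffer, buffer_type = [], None
    return entities


sample = "美 B-LOC\n国 E-LOC\n的 O\n华 B-PER\n莱 I-PER\n士 E-PER\n\n我 O\n跟 O\n他 O\n"
sentences = parse_bioes(sample)
entities = [bioes_to_entities(sentence) for sentence in sentences]
```

    Characters are joined without spaces because the data is Chinese text tagged one character per line.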

    Results

    Comparison of results across models:

    BERT-tiny results

    model      precision  recall  f1-score  support
    BERT-tiny  0.9490     0.9538  0.9447    all
    BERT-tiny  0.9278     0.9251  0.9313    using 100 training samples

    Note:

    In later tests, BERT-tiny (softmax) with 100 training samples has not yet reproduced the 0.9313 result; the best result was 0.8612. BERT-tiny +… See the full description on the dataset page: https://huggingface.co/datasets/ttxy/resume_ner.

  16. autoeval-eval-lener_br-lener_br-280a5d-1776961679

    • huggingface.co
    Updated Nov 2, 2023
    + more versions
    Cite
    Evaluation on the Hub (2023). autoeval-eval-lener_br-lener_br-280a5d-1776961679 [Dataset]. https://huggingface.co/datasets/autoevaluate/autoeval-eval-lener_br-lener_br-280a5d-1776961679
    Dataset authored and provided by
    Evaluation on the Hub
    Description

    Dataset Card for AutoTrain Evaluator

    This repository contains model predictions generated by AutoTrain for the following task and dataset:

    Task: Token Classification
    Model: pierreguillou/ner-bert-large-cased-pt-lenerbr
    Dataset: lener_br
    Config: lener_br
    Split: test

    To run new evaluation jobs, visit Hugging Face's automatic model evaluator.

      Contributions
    

    Thanks to @Luciano for evaluating this model.

  17. autoeval-eval-project-jnlpba-c103d433-1295449602

    • huggingface.co
    Updated Sep 20, 2022
    + more versions
    Cite
    Evaluation on the Hub (2022). autoeval-eval-project-jnlpba-c103d433-1295449602 [Dataset]. https://huggingface.co/datasets/autoevaluate/autoeval-eval-project-jnlpba-c103d433-1295449602
    Dataset authored and provided by
    Evaluation on the Hub
    Description

    Dataset Card for AutoTrain Evaluator

    This repository contains model predictions generated by AutoTrain for the following task and dataset:

    Task: Token Classification
    Model: siddharthtumre/biobert-ner
    Dataset: jnlpba
    Config: jnlpba
    Split: validation

    To run new evaluation jobs, visit Hugging Face's automatic model evaluator.

      Contributions
    

    Thanks to @siddharthtumre for evaluating this model.



aeroBERT-NER

all_text_annotation_NER.txt

archanatikayatray/aeroBERT-NER

36 scholarly articles cite this dataset (View in Google Scholar)
