25 datasets found
  1. aeroBERT-NER

    • huggingface.co
    Updated Apr 7, 2023
    Cite
    Archana Tikayat Ray (2023). aeroBERT-NER [Dataset]. http://doi.org/10.57967/hf/0470
    Explore at:
    Dataset updated
    Apr 7, 2023
    Authors
    Archana Tikayat Ray
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Dataset Card for aeroBERT-NER

      Dataset Summary
    

    This dataset contains sentences from the aerospace requirements domain. The sentences are tagged for five NER categories (SYS, VAL, ORG, DATETIME, and RES) using the BIO tagging scheme. There are a total of 1,432 sentences. The creation of this dataset is aimed at:
    (1) making available an open-source dataset for aerospace requirements, which are often proprietary;
    (2) fine-tuning language models for token identification… See the full description on the dataset page: https://huggingface.co/datasets/archanatikayatray/aeroBERT-NER.
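
    As an illustration of the BIO scheme used here, the sketch below decodes B-/I-/O tags into entity spans; the sentence and tags are invented examples in the dataset's style, not rows from aeroBERT-NER.

```python
# Decode BIO tags into (entity_text, label) spans.
# Hypothetical aerospace-style sentence; not taken from aeroBERT-NER itself.
def bio_to_entities(tokens, tags):
    entities, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                entities.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)
        else:  # "O", or an I- tag without a matching B-
            if current:
                entities.append((" ".join(current), label))
            current, label = [], None
    if current:
        entities.append((" ".join(current), label))
    return entities

tokens = ["The", "flight", "control", "system", "shall", "respond", "within", "50", "ms"]
tags   = ["O", "B-SYS", "I-SYS", "I-SYS", "O", "O", "O", "B-VAL", "I-VAL"]
print(bio_to_entities(tokens, tags))
# [('flight control system', 'SYS'), ('50 ms', 'VAL')]
```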

  2. BERT-Base-Multilingual-Cased

    • kaggle.com
    zip
    Updated Dec 9, 2024
    Cite
    Mehrdad ALMASI (2024). BERT-Base-Multilingual-Cased [Dataset]. https://www.kaggle.com/datasets/mehrdadal2023/bert-base-multilingual-cased
    Explore at:
    zip (2992453291 bytes). Available download formats
    Dataset updated
    Dec 9, 2024
    Authors
    Mehrdad ALMASI
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    This file contains the pre-trained BERT-Base-Multilingual-Cased model files, originally provided by Google on Hugging Face. The model supports 104 languages and is ideal for a wide range of Natural Language Processing (NLP) tasks, including text classification, named entity recognition (NER), and question answering.

    This model is particularly helpful for multilingual NLP applications due to its ability to process cased text (case-sensitive input). Key details:

    Source: Hugging Face
    Architecture: 12-layer Transformer with 110M parameters
    Tasks: Text classification, NER, question answering, etc.

    To install, run the line below (quote the path, since it contains spaces):

    pip install "/kaggle/input/bert-base-multilingual-cased/Google Bert Multilingual"

  3. bert-ner-test-data

    • huggingface.co
    Updated Jan 8, 2025
    Cite
    Ashley Liu (2025). bert-ner-test-data [Dataset]. https://huggingface.co/datasets/ashleyliu31/bert-ner-test-data
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jan 8, 2025
    Authors
    Ashley Liu
    Description

    ashleyliu31/bert-ner-test-data dataset hosted on Hugging Face and contributed by the HF Datasets community

  4. autotrain-data-nlp-bert-ner-testing

    • huggingface.co
    Updated Feb 25, 2024
    Cite
    roua belgacem (2024). autotrain-data-nlp-bert-ner-testing [Dataset]. https://huggingface.co/datasets/rouabelgacem/autotrain-data-nlp-bert-ner-testing
    Explore at:
    Croissant
    Dataset updated
    Feb 25, 2024
    Authors
    roua belgacem
    Description

    rouabelgacem/autotrain-data-nlp-bert-ner-testing dataset hosted on Hugging Face and contributed by the HF Datasets community

  5. biobert-ner-fda-recalls-dataset

    • huggingface.co
    Cite
    Miriam Farrington, biobert-ner-fda-recalls-dataset [Dataset]. https://huggingface.co/datasets/mfarrington/biobert-ner-fda-recalls-dataset
    Explore at:
    Croissant
    Authors
    Miriam Farrington
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Dataset Card for FDA CDRH Device Recalls NER Dataset

    This is an FDA Medical Device Recalls dataset created for medical device Named Entity Recognition (NER).

  Dataset Details

  Dataset Description

    This dataset was created for performing NER tasks. It is built from the OpenFDA Device Recalls dataset, which has been processed and annotated for NER. The Device Recalls data has been further processed to extract the recall action element, which… See the full description on the dataset page: https://huggingface.co/datasets/mfarrington/biobert-ner-fda-recalls-dataset.

  6. huggingface_BERT_large_NER

    • kaggle.com
    zip
    Updated Feb 24, 2022
    Cite
    Dewei Chen (2022). huggingface_BERT_large_NER [Dataset]. https://www.kaggle.com/datasets/dwchen/huggingface-bert-large-ner
    Explore at:
    zip (7426354476 bytes). Available download formats
    Dataset updated
    Feb 24, 2022
    Authors
    Dewei Chen
    Description

    Dataset

    This dataset was created by Dewei Chen

    Contents

  7. medmentions-ner

    • huggingface.co
    Updated Aug 11, 2025
    Cite
    Gerald Amasi (2025). medmentions-ner [Dataset]. https://huggingface.co/datasets/geraldamasi/medmentions-ner
    Explore at:
    Dataset updated
    Aug 11, 2025
    Authors
    Gerald Amasi
    Description

    MedMentions BioNER (Custom Processed)

    This dataset is a custom preprocessed version of the MedMentions dataset for biomedical Named Entity Recognition (NER) tasks. It is compatible with Hugging Face Datasets and can be used directly for fine-tuning BERT-based models such as BERT or Bio_ClinicalBERT.

      Dataset Summary
    

    Task: Named Entity Recognition (NER) in biomedical text
    Source: MedMentions
    Language: English
    Entity Types: 128 entity classes derived from UMLS semantic… See the full description on the dataset page: https://huggingface.co/datasets/geraldamasi/medmentions-ner.

  8. Bio_ClinicalBERT

    • kaggle.com
    zip
    Updated Apr 21, 2022
    Cite
    Aditi Dutta (2022). Bio_ClinicalBERT [Dataset]. https://www.kaggle.com/datasets/aditidutta/bio-clinicalbert
    Explore at:
    zip (806570272 bytes). Available download formats
    Dataset updated
    Apr 21, 2022
    Authors
    Aditi Dutta
    Description

    ClinicalBERT - Bio + Clinical BERT Model

    The Publicly Available Clinical BERT Embeddings paper contains four unique clinicalBERT models: initialized with BERT-Base (cased_L-12_H-768_A-12) or BioBERT (BioBERT-Base v1.0 + PubMed 200K + PMC 270K) & trained on either all MIMIC notes or only discharge summaries.

    This model card describes the Bio+Clinical BERT model, which was initialized from BioBERT & trained on all MIMIC notes.

    Pretraining Data

    The Bio_ClinicalBERT model was trained on all notes from MIMIC-III, a database containing electronic health records from ICU patients at the Beth Israel Deaconess Medical Center in Boston, MA. For more details on MIMIC, see here. All notes from the NOTEEVENTS table were included (~880M words).

    Model Pretraining

    Note Preprocessing

    Each note in MIMIC was first split into sections using a rules-based section splitter (e.g., discharge summary notes were split into "History of Present Illness", "Family History", "Brief Hospital Course", etc. sections). Each section was then split into sentences using SciSpacy (the en_core_sci_md tokenizer).
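
    A minimal sketch of how such a rules-based section splitter can work; the header patterns and regex below are illustrative assumptions, not the authors' actual rules.

```python
import re

# Illustrative header list; the paper's actual section rules are not published here.
SECTION_HEADERS = [
    "History of Present Illness",
    "Family History",
    "Brief Hospital Course",
]
# Match any known header at the start of a line, capturing it.
HEADER_RE = re.compile(
    r"^(%s):" % "|".join(re.escape(h) for h in SECTION_HEADERS),
    flags=re.MULTILINE,
)

def split_sections(note):
    """Split a note into (header, body) pairs based on known headers."""
    parts = HEADER_RE.split(note)
    # parts = [preamble, header1, body1, header2, body2, ...]
    sections = []
    for i in range(1, len(parts), 2):
        sections.append((parts[i], parts[i + 1].strip()))
    return sections

note = (
    "History of Present Illness: 65yo male with chest pain.\n"
    "Family History: Father with CAD.\n"
)
print(split_sections(note))
```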

    Pretraining Hyperparameters

    We used a batch size of 32, a maximum sequence length of 128, and a learning rate of 5e-5 for pre-training our models. The models trained on all MIMIC notes were trained for 150,000 steps. The dup factor for duplicating input data with different masks was set to 5. All other default parameters were used (specifically, masked language model probability = 0.15 and max predictions per sequence = 20).
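
    As a sketch, these hyperparameters map roughly onto the flags of Google's original BERT pretraining scripts as follows; the paths and file names are placeholders, and this is an assumed invocation, not the authors' exact command.

```shell
# Data generation (create_pretraining_data.py): masking, duplication, sequence length.
python create_pretraining_data.py \
  --input_file=mimic_notes.txt \
  --output_file=pretrain.tfrecord \
  --vocab_file=vocab.txt \
  --max_seq_length=128 \
  --max_predictions_per_seq=20 \
  --masked_lm_prob=0.15 \
  --dupe_factor=5

# Pretraining (run_pretraining.py): batch size, learning rate, step count.
python run_pretraining.py \
  --input_file=pretrain.tfrecord \
  --output_dir=bio_clinical_bert \
  --train_batch_size=32 \
  --max_seq_length=128 \
  --max_predictions_per_seq=20 \
  --learning_rate=5e-5 \
  --num_train_steps=150000
```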

    How to use the model

    Load the model via the transformers library:

    from transformers import AutoTokenizer, AutoModel
    tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
    model = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
    

    More Information

    Refer to the original paper, Publicly Available Clinical BERT Embeddings (NAACL Clinical NLP Workshop 2019) for additional details and performance on NLI and NER tasks.

  9. GLiNER Github Repo

    • kaggle.com
    zip
    Updated Oct 26, 2025
    Cite
    Darien Schettler (2025). GLiNER Github Repo [Dataset]. https://www.kaggle.com/dschettler8845/gliner-github-repo
    Explore at:
    zip (545226 bytes). Available download formats
    Dataset updated
    Oct 26, 2025
    Authors
    Darien Schettler
    Description

    GLiNER : Generalist and Lightweight model for Named Entity Recognition

    GLiNER is a Named Entity Recognition (NER) model capable of identifying any entity type using a bidirectional transformer encoder (BERT-like). It provides a practical alternative to traditional NER models, which are limited to predefined entities, and to Large Language Models (LLMs), which, despite their flexibility, are costly and too large for resource-constrained scenarios.

    Demo Image

    Models Status

    📢 Updates

    • 📝 Finetuning notebook is available: examples/finetune.ipynb
    • 🗂 Training dataset preprocessing scripts are now available in the data/ directory, covering both Pile-NER 📚 and NuNER 📘 datasets.

    Available Models on Hugging Face

    To Release

    • [ ] ⏳ GLiNER-Multiv2
    • [ ] ⏳ GLiNER-Sup (trained on mixture of NER datasets)

    Area of improvements / research

    • [ ] Allow longer context (e.g. train with long-context transformers such as Longformer, LED, etc.)
    • [ ] Use a bi-encoder (entity encoder and span encoder), allowing entity embeddings to be precomputed
    • [ ] Filtering mechanism to reduce the number of spans before final classification, to save memory and computation when the number of entity types is large
    • [ ] Improve understanding of more detailed prompts/instructions, e.g. "Find the first name of the person in the text"
    • [ ] Better loss function: for instance, use Focal Loss (see this paper) instead of BCE to handle class imbalance, as some entity types are more frequent than others
    • [ ] Improve multilingual capabilities: train on more languages, and use multilingual training data
    • [ ] Decoding: allow a span to have multiple labels, e.g. "Cristiano Ronaldo" is both a "person" and a "football player"
    • [ ] Dynamic thresholding (in model.predict_entities(text, labels, threshold=0.5)): allow the model to predict more or fewer entities depending on the context. Currently, the model tends to predict fewer entities when the entity type or the domain is not well represented in the training data.
    • [ ] Train with EMAs (Exponential Moving Averages) or merge multiple checkpoints to improve model robustness (see this paper)
    • [ ] Extend the model to relation extraction, which needs a dataset with relation annotations. See our preliminary work, ATG.

    Installation

    To use this model, you must install the GLiNER Python library: !pip install gliner

    Usage

    Once you've installed the GLiNER library, you can import the GLiNER class. You can then load this model using GLiNER.from_pretrained and predict entities with predict_entities.

    from gliner import GLiNER

    model = GLiNER.from_pretrained("urchade/gliner_base")

    text = """
    Cristiano Ronaldo dos Santos Aveiro (born 5 February 1985) is a Portuguese professional footballer who plays as a forward for and captains both Saudi Pro League club Al Nassr and the Portugal national team.
    """

    labels = ["person", "award", "date", "competitions", "teams"]

    entities = model.predict_entities(text, labels)

    for entity in entities:
        print(entity["text"], "=>", entity["label"])
    
  10. Multilingual named entity recognition for medieval charters. Datasets and...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 16, 2023
    Cite
    Torres Aguilar, Sergio (2023). Multilingual named entity recognition for medieval charters. Datasets and models [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6463698
    Explore at:
    Dataset updated
    Jan 16, 2023
    Dataset provided by
    École nationale des chartes
    Authors
    Torres Aguilar, Sergio
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Annotated dataset for training named entity recognition models for medieval charters in Latin, French and Spanish.

    The original raw texts for all charters were collected from four charter collections.

    We include (i) the annotated training datasets, (ii) the contextual and static embeddings trained on medieval multilingual texts, and (iii) the named entity recognition models trained using two architectures: Bi-LSTM-CRF + stacked embeddings, and fine-tuning of BERT-based models (mBERT and RoBERTa).

    Codes, datasets and notebooks used to train the models can be consulted in our GitLab repository: https://gitlab.com/magistermilitum/ner_medieval_multilingual

    Our best RoBERTa model is also available on the Hugging Face Hub: https://huggingface.co/magistermilitum/roberta-multilingual-medieval-ner

  11. Multilingual NER Dataset

    • kaggle.com
    zip
    Updated Dec 5, 2023
    Cite
    The Devastator (2023). Multilingual NER Dataset [Dataset]. https://www.kaggle.com/thedevastator/multilingual-ner-dataset
    Explore at:
    zip (72419294 bytes). Available download formats
    Dataset updated
    Dec 5, 2023
    Authors
    The Devastator
    License

    CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)

    Description

    Multilingual NER Dataset

    Multilingual NER Dataset for Named Entity Recognition

    By Babelscape (from Hugging Face) [source]

    About this dataset

    The Babelscape/wikineural NER Dataset is a comprehensive and diverse collection of multilingual text data specifically designed for the task of Named Entity Recognition (NER). It offers an extensive range of labeled sentences in nine different languages: French, German, Portuguese, Spanish, Polish, Dutch, Russian, English, and Italian.

    Each sentence in the dataset contains tokens (words or characters) that have been labeled with named entity recognition tags. These tags provide valuable information about the type of named entity each token represents. The dataset also includes a language column to indicate the language in which each sentence is written.

    This dataset serves as an invaluable resource for developing and evaluating NER models across multiple languages. It encompasses various domains and contexts to ensure diversity and representativeness. Researchers and practitioners can utilize this dataset to train and test their NER models in real-world scenarios.

    By using this dataset for NER tasks, users can enhance their understanding of how named entities are recognized across different languages. Furthermore, it enables benchmarking performance comparisons between various NER models developed for specific languages or trained on multiple languages simultaneously.

    Whether you are an experienced researcher or a beginner exploring multilingual NER tasks, the Babelscape/wikineural NER Dataset provides a highly informative and versatile resource that can contribute to advancements in natural language processing and information extraction applications on a global scale.

    How to use the dataset

    • Understand the Data Structure:

      • The dataset consists of labeled sentences in nine different languages: French (fr), German (de), Portuguese (pt), Spanish (es), Polish (pl), Dutch (nl), Russian (ru), English (en), and Italian (it).
      • Each sentence is represented by three columns: tokens, ner_tags, and lang.
      • The tokens column contains the individual words or characters in each labeled sentence.
      • The ner_tags column provides named entity recognition tags for each token, indicating their entity types.
      • The lang column specifies the language of each sentence.
    • Explore Different Languages:

      • Since this dataset covers multiple languages, you can choose to focus on a specific language or perform cross-lingual analysis.
      • Analyzing multiple languages can help uncover patterns and differences in named entities across various linguistic contexts.
    • Preprocessing and Cleaning:

      • Before training your NER models or applying any NLP techniques to this dataset, it's essential to preprocess and clean the data.
      • Consider removing any unnecessary punctuation marks or special characters unless they carry significant meaning in certain languages.
    • Training Named Entity Recognition Models:

      • Data Splitting: Divide the dataset into training, validation, and testing sets based on your requirements using appropriate ratios.
      • Feature Extraction: Prepare input features from tokenized text data, such as word embeddings or character-level representations, depending on your model choice.
      • Model Training: Utilize state-of-the-art NER models (e.g., LSTM-CRF, Transformer-based models) to train on the labeled sentences and ner_tags columns.
      • Evaluation: Evaluate your trained model's performance using the provided validation dataset or test datasets specific to each language.

    • Applying Pretrained Models:

      • Instead of training a model from scratch, you can leverage existing pretrained NER models like BERT, GPT-2, or spaCy's named entity recognition capabilities.
      • Fine-tune these pre-trained models on your specific NER task using the labeled
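
    To make the column layout concrete, here is a minimal sketch that maps integer ner_tags to label strings and performs the simple train/validation/test split from step 4a; the tag map and example row are hypothetical, not taken from the actual dataset.

```python
import random

# Hypothetical tag map; the real dataset defines its own id-to-label mapping.
ID2LABEL = {0: "O", 1: "B-PER", 2: "I-PER", 3: "B-LOC", 4: "I-LOC"}

def decode_row(row):
    """Attach label strings to a row shaped like the dataset's columns."""
    return {
        "tokens": row["tokens"],
        "labels": [ID2LABEL[t] for t in row["ner_tags"]],
        "lang": row["lang"],
    }

row = {"tokens": ["Marie", "Curie", "lived", "in", "Paris"],
       "ner_tags": [1, 2, 0, 0, 3],
       "lang": "en"}
print(decode_row(row)["labels"])  # ['B-PER', 'I-PER', 'O', 'O', 'B-LOC']

# Simple 80/10/10 split (step 4a), shuffled with a fixed seed for reproducibility.
def split_dataset(rows, seed=0):
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    n = len(rows)
    return rows[: int(0.8 * n)], rows[int(0.8 * n): int(0.9 * n)], rows[int(0.9 * n):]
```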

    Research Ideas

    • Training NER models: This dataset can be used to train NER models in multiple languages. By providing labeled sentences and their corresponding named entity recognition tags, the dataset can help train models to accurately identify and classify named entities in different languages.
    • Evaluating NER performance: The dataset can be used as a benchmark to evaluate the performance of pre-trained or custom-built NER models. By using the labeled sentences as test data, developers and researchers can measure the accuracy, precision, recall, and F1-score of their models across multiple languages.
    • Cross-lingual analysis: With labeled sentences available in nine different languages, researchers can perform cross-lingual analysis...

  12. malware-text-db-securebert-ner

    • huggingface.co
    + more versions
    Cite
    Naor Matania, malware-text-db-securebert-ner [Dataset]. https://huggingface.co/datasets/naorm/malware-text-db-securebert-ner
    Explore at:
    Croissant
    Authors
    Naor Matania
    Description

    naorm/malware-text-db-securebert-ner dataset hosted on Hugging Face and contributed by the HF Datasets community

  13. dnrti-securebert-ner

    • huggingface.co
    + more versions
    Cite
    Naor Matania, dnrti-securebert-ner [Dataset]. https://huggingface.co/datasets/naorm/dnrti-securebert-ner
    Explore at:
    Croissant
    Authors
    Naor Matania
    Description

    naorm/dnrti-securebert-ner dataset hosted on Hugging Face and contributed by the HF Datasets community

  14. swahili-ner-dataset

    • huggingface.co
    Updated Oct 7, 2025
    Cite
    Kayode Balogun (2025). swahili-ner-dataset [Dataset]. https://huggingface.co/datasets/Balogvn/swahili-ner-dataset
    Explore at:
    Dataset updated
    Oct 7, 2025
    Authors
    Kayode Balogun
    Description

    swahili-ner-dataset

      Dataset Card
    

    Dataset Name: swahili-ner-dataset
    Language: sw (Swahili)
    Number of Samples: 3
    Model Used for Annotation: dslim/bert-base-NER
    Files Processed: 1
    Texts Processed: 3
    Processing Time: 4.01 seconds
    Generated: 2025-10-13 09:13:19 UTC

      Description
    

    This is an automatically annotated dataset for Swahili Named Entity Recognition (NER). The dataset was processed using the February AI Pipeline, which recursively discovers and processes JSON… See the full description on the dataset page: https://huggingface.co/datasets/Balogvn/swahili-ner-dataset.

  15. stf_ner_pierreguillou-ner-bert-large-cased-pt-lenerbr

    • huggingface.co
    + more versions
    Cite
    Julia Dollis, stf_ner_pierreguillou-ner-bert-large-cased-pt-lenerbr [Dataset]. https://huggingface.co/datasets/juliadollis/stf_ner_pierreguillou-ner-bert-large-cased-pt-lenerbr
    Explore at:
    Croissant
    Authors
    Julia Dollis
    Description

    juliadollis/stf_ner_pierreguillou-ner-bert-large-cased-pt-lenerbr dataset hosted on Hugging Face and contributed by the HF Datasets community

  16. bert-spanish-cased-finetuned-ner-6ent

    • huggingface.co
    Updated Oct 7, 2025
    Cite
    Andres Tituana (2025). bert-spanish-cased-finetuned-ner-6ent [Dataset]. https://huggingface.co/datasets/aatituanav/bert-spanish-cased-finetuned-ner-6ent
    Explore at:
    Dataset updated
    Oct 7, 2025
    Authors
    Andres Tituana
    Description

    aatituanav/bert-spanish-cased-finetuned-ner-6ent dataset hosted on Hugging Face and contributed by the HF Datasets community

  17. protein_structure_NER_model_v1.4

    • huggingface.co
    + more versions
    Cite
    Protein Data Bank in Europe, protein_structure_NER_model_v1.4 [Dataset]. https://huggingface.co/datasets/PDBEurope/protein_structure_NER_model_v1.4
    Explore at:
    Croissant
    Dataset authored and provided by
    Protein Data Bank in Europe
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Overview

    This data was used to train the model https://huggingface.co/PDBEurope/BiomedNLP-PubMedBERT-ProteinStructure-NER-v1.4. There are 19 different entity types in this dataset: "chemical", "complex_assembly", "evidence", "experimental_method", "gene", "mutant", "oligomeric_state", "protein", "protein_state", "protein_type", "ptm", "residue_name", "residue_name_number", "residue_number", "residue_range", "site", "species", "structure_element", "taxonomy_domain". The data prepared as… See the full description on the dataset page: https://huggingface.co/datasets/PDBEurope/protein_structure_NER_model_v1.4.

  18. autoeval-staging-eval-project-b20351ec-8855170

    • huggingface.co
    Updated Dec 6, 1996
    + more versions
    Cite
    Evaluation on the Hub (1996). autoeval-staging-eval-project-b20351ec-8855170 [Dataset]. https://huggingface.co/datasets/autoevaluate/autoeval-staging-eval-project-b20351ec-8855170
    Explore at:
    Croissant
    Dataset updated
    Dec 6, 1996
    Dataset authored and provided by
    Evaluation on the Hub
    Description

    Dataset Card for AutoTrain Evaluator

    This repository contains model predictions generated by AutoTrain for the following task and dataset:

    Task: Token Classification
    Model: huggingface-course/bert-finetuned-ner
    Dataset: conll2003

    To run new evaluation jobs, visit Hugging Face's automatic model evaluator.

      Contributions
    

    Thanks to @ for evaluating this model.

  19. ner_acro_combined

    • huggingface.co
    Updated Sep 21, 2023
    Cite
    Eduard E M (2023). ner_acro_combined [Dataset]. https://huggingface.co/datasets/eduardem/ner_acro_combined
    Explore at:
    Dataset updated
    Sep 21, 2023
    Authors
    Eduard E M
    License

    OpenRAIL (https://choosealicense.com/licenses/openrail/)

    Description

    European Languages Multipurpose Dataset for NER

      Introduction
    

    This is a multipurpose dataset that includes names, proper nouns, and acronyms from various European languages. It is designed to be particularly useful for Named Entity Recognition (NER) tasks.

      Language Composition
    

    The dataset predominantly features data in English, followed by Spanish, French, and Romanian.

      Objective
    

    The primary aim of this dataset is to further fine-tune base BERT or… See the full description on the dataset page: https://huggingface.co/datasets/eduardem/ner_acro_combined.

  20. protein_structure_NER_independent_val_set

    • huggingface.co
    Cite
    Protein Data Bank in Europe, protein_structure_NER_independent_val_set [Dataset]. https://huggingface.co/datasets/PDBEurope/protein_structure_NER_independent_val_set
    Explore at:
    Croissant
    Dataset authored and provided by
    Protein Data Bank in Europe
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Overview

    This data was used to evaluate the two models below to decide whether convergence was reached. https://huggingface.co/PDBEurope/BiomedNLP-PubMedBERT-ProteinStructure-NER-v2.1 https://huggingface.co/PDBEurope/BiomedNLP-PubMedBERT-ProteinStructure-NER-v3.1 There are 20 different entity types in this dataset: "bond_interaction", "chemical", "complex_assembly", "evidence", "experimental_method", "gene", "mutant", "oligomeric_state", "protein", "protein_state", "protein_type"… See the full description on the dataset page: https://huggingface.co/datasets/PDBEurope/protein_structure_NER_independent_val_set.

aeroBERT-NER (archanatikayatray/aeroBERT-NER)

all_text_annotation_NER.txt

43 scholarly articles cite this dataset (View in Google Scholar)