100+ datasets found
  1. kaggle-entity-annotated-corpus-ner-dataset

    • huggingface.co
    Updated Jul 10, 2022
    Cite
    Rafael Arias Calles (2022). kaggle-entity-annotated-corpus-ner-dataset [Dataset]. https://huggingface.co/datasets/rjac/kaggle-entity-annotated-corpus-ner-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 10, 2022
    Authors
    Rafael Arias Calles
    License

    https://choosealicense.com/licenses/odbl/

    Description

    Date: 2022-07-10. Files: ner_dataset.csv. Source: Kaggle entity annotated corpus. Notes: The dataset only contains the tokens and NER tag labels. Labels are uppercase.

      About Dataset
    

    from Kaggle Datasets

      Context
    

    Annotated Corpus for Named Entity Recognition using the GMB (Groningen Meaning Bank) corpus for entity classification, with enhanced and popular features by Natural Language Processing applied to the data set. Tip: Use Pandas Dataframe to load dataset if using Python for… See the full description on the dataset page: https://huggingface.co/datasets/rjac/kaggle-entity-annotated-corpus-ner-dataset.
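
A minimal sketch of reading a token/tag file such as ner_dataset.csv in Python. The column names `token` and `tag` and the inline sample rows are assumptions made for illustration; the real file layout may differ. The standard-library `csv` module is used here to keep the example dependency-free, and `pandas.read_csv` would serve the same purpose.

```python
import csv
import io
from collections import Counter

# Hypothetical sample standing in for ner_dataset.csv; the real column
# names and contents may differ.
SAMPLE_CSV = """token,tag
London,b-geo
is,o
in,o
England,b-geo
"""

def load_token_tags(handle, token_col="token", tag_col="tag"):
    """Read (token, tag) pairs, upper-casing tags as the dataset card describes."""
    reader = csv.DictReader(handle)
    return [(row[token_col], row[tag_col].upper()) for row in reader]

pairs = load_token_tags(io.StringIO(SAMPLE_CSV))
tag_counts = Counter(tag for _, tag in pairs)
print(pairs)
print(tag_counts)
```

To read a real file, pass `open(path, newline="", encoding="utf-8")` instead of the `io.StringIO` wrapper and adjust the two column names.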

  2. Pile-NER-definition

    • huggingface.co
    Updated Aug 13, 2023
    + more versions
    Cite
    Universal-NER (2023). Pile-NER-definition [Dataset]. https://huggingface.co/datasets/Universal-NER/Pile-NER-definition
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 13, 2023
    Authors
    Universal-NER
    Description

    Intro

    Pile-NER-definition is a set of GPT-generated data for named entity recognition using the definition-based data construction prompt. It was collected by prompting gpt-3.5-turbo-0301 and augmented by negative sampling. Check our project page for more information.

      License
    

    Attribution-NonCommercial 4.0 International

  3. kpwr-ner

    • huggingface.co
    Updated Apr 1, 2022
    Cite
    CLARIN-PL (2022). kpwr-ner [Dataset]. https://huggingface.co/datasets/clarin-pl/kpwr-ner
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 1, 2022
    Dataset authored and provided by
    CLARIN-PL
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    KPWR-NER tagging dataset.

  4. Financial-NER-NLP

    • huggingface.co
    Cite
    Joseph G Flowers, Financial-NER-NLP [Dataset]. https://huggingface.co/datasets/Josephgflowers/Financial-NER-NLP
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Authors
    Joseph G Flowers
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Card for Financial-NER-NLP. Dataset Summary: The Financial-NER-NLP Dataset is a derivative of the FiNER-139 dataset, which consists of 1.1 million sentences annotated with 139 XBRL tags. This new dataset transforms the original structured data into natural language prompts suitable for training language models. The dataset is designed to enhance models’ abilities in tasks such as named entity recognition (NER), summarization, and information extraction in the financial domain. The… See the full description on the dataset page: https://huggingface.co/datasets/Josephgflowers/Financial-NER-NLP.

  5. BERT-Base-Multilingual-Cased

    • kaggle.com
    zip
    Updated Dec 9, 2024
    Cite
    Mehrdad ALMASI (2024). BERT-Base-Multilingual-Cased [Dataset]. https://www.kaggle.com/datasets/mehrdadal2023/bert-base-multilingual-cased
    Explore at:
    Available download formats: zip (2992453291 bytes)
    Dataset updated
    Dec 9, 2024
    Authors
    Mehrdad ALMASI
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This file contains the pre-trained BERT-Base-Multilingual-Cased model files, originally provided by Google on Hugging Face. The model supports 104 languages and is ideal for a wide range of Natural Language Processing (NLP) tasks, including text classification, named entity recognition (NER), and question answering.

    This model is particularly helpful for multilingual NLP applications due to its ability to process cased text (case-sensitive input). Key details:

    Source: Hugging Face
    Architecture: 12-layer Transformer with 110M parameters
    Tasks: Text classification, NER, question answering, etc.

    To install, run the line below:

    pip install "/kaggle/input/bert-base-multilingual-cased/Google Bert Multilingual"

  6. MultiNERD NER models

    • kaggle.com
    zip
    Updated Dec 5, 2023
    Cite
    Jayant Yadav (2023). MultiNERD NER models [Dataset]. https://www.kaggle.com/datasets/jayantyadav/multinerd-ner-models/versions/5
    Explore at:
    Available download formats: zip (2588704751 bytes)
    Dataset updated
    Dec 5, 2023
    Authors
    Jayant Yadav
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description
  7. GLiNER Github Repo

    • kaggle.com
    zip
    Updated Oct 26, 2025
    Cite
    Darien Schettler (2025). GLiNER Github Repo [Dataset]. https://www.kaggle.com/dschettler8845/gliner-github-repo
    Explore at:
    Available download formats: zip (545226 bytes)
    Dataset updated
    Oct 26, 2025
    Authors
    Darien Schettler
    Description

    GLiNER : Generalist and Lightweight model for Named Entity Recognition

    GLiNER is a Named Entity Recognition (NER) model capable of identifying any entity type using a bidirectional transformer encoder (BERT-like). It provides a practical alternative to traditional NER models, which are limited to predefined entities, and Large Language Models (LLMs) that, despite their flexibility, are costly and large for resource-constrained scenarios.

    Models Status

    📢 Updates

    • 📝 Finetuning notebook is available: examples/finetune.ipynb
    • 🗂 Training dataset preprocessing scripts are now available in the data/ directory, covering both Pile-NER 📚 and NuNER 📘 datasets.

    Available Models on Hugging Face

    To Release

    • [ ] ⏳ GLiNER-Multiv2
    • [ ] ⏳ GLiNER-Sup (trained on mixture of NER datasets)

    Area of improvements / research

    • [ ] Allow longer context (eg. train with long context transformers such as Longformer, LED, etc.)
    • [ ] Use Bi-encoder (entity encoder and span encoder) allowing precompute entity embeddings
    • [ ] Filtering mechanism to reduce the number of spans before final classification, to save memory and computation when the number of entity types is large
    • [ ] Improve understanding of more detailed prompts/instruction, eg. "Find the first name of the person in the text"
    • [ ] Better loss function: for instance use Focal Loss (see this paper) instead of BCE to handle class imbalance, as some entity types are more frequent than others
    • [ ] Improve multi-lingual capabilities: train on more languages, and use multi-lingual training data
    • [ ] Decoding: allow a span to have multiple labels, eg: "Cristiano Ronaldo" is both a "person" and "football player"
    • [ ] Dynamic thresholding (in model.predict_entities(text, labels, threshold=0.5)): allow the model to predict more or fewer entities depending on the context. Currently, the model tends to predict fewer entities where the entity type or the domain is not well represented in the training data.
    • [ ] Train with EMAs (Exponential Moving Averages) or merge multiple checkpoints to improve model robustness (see this paper)
    • [ ] Extend the model to relation extraction, which requires a dataset with relation annotations. See our preliminary work, ATG.
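
The dynamic-thresholding item above can be approximated outside the model by post-filtering. The sketch below is not part of the GLiNER library: `filter_entities` is a hypothetical helper, the candidate list and its scores are invented for illustration, and it assumes each prediction is a dict with `label` and `score` keys. In practice the candidates would come from a call with a low global threshold.

```python
def filter_entities(entities, default_threshold=0.5, per_label=None):
    """Keep entities whose score meets a label-specific threshold.

    per_label maps a label to its own cut-off; labels not listed fall
    back to default_threshold. (Hypothetical helper, not a GLiNER API.)
    """
    per_label = per_label or {}
    return [
        e for e in entities
        if e["score"] >= per_label.get(e["label"], default_threshold)
    ]

# Invented candidates standing in for model output obtained with a low threshold.
candidates = [
    {"text": "Cristiano Ronaldo", "label": "person", "score": 0.92},
    {"text": "Al Nassr", "label": "team", "score": 0.41},
    {"text": "Portugal", "label": "country", "score": 0.55},
]

# Lower the bar for an under-represented label, raise it for a noisy one.
kept = filter_entities(candidates, per_label={"team": 0.3, "country": 0.6})
print([e["text"] for e in kept])
```

Lowering the cut-off for a rarely seen label trades precision for recall on that label only, which is the effect the to-do item describes.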

    Installation

    To use this model, you must install the GLiNER Python library: !pip install gliner

    Usage

    Once you've downloaded the GLiNER library, you can import the GLiNER class. You can then load this model using GLiNER.from_pretrained and predict entities with predict_entities.

    from gliner import GLiNER
    
    model = GLiNER.from_pretrained("urchade/gliner_base")
    
    text = """
    Cristiano Ronaldo dos Santos Aveiro (Portuguese pronunciation: [kɾiʃˈtjɐnu ʁɔˈnaldu]; born 5 February 1985) is a Portuguese professional footballer who plays as a forward for and captains both Saudi Pro League club Al Nassr and the Portugal national team. Widely regarded as one of the greatest players of all time, Ronaldo has won five Ballon d'Or awards,[note 3] a record three UEFA Men's Player of the Year Awards, and four European Golden Shoes, the most by a European player. He has won 33 trophies in his career, including seven league titles, five UEFA Champions Leagues, the UEFA European Championship and the UEFA Nations League. Ronaldo holds the records for most appearances (183), goals (140) and assists (42) in the Champions League, goals in the European Championship (14), international goals (128) and international appearances (205). He is one of the few players to have made over 1,200 professional career appearances, the most by an outfield player, and has scored over 850 offici...
    
  8. PII-NER

    • huggingface.co
    Updated Jul 20, 2024
    Cite
    Joseph G Flowers (2024). PII-NER [Dataset]. https://huggingface.co/datasets/Josephgflowers/PII-NER
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 20, 2024
    Authors
    Joseph G Flowers
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card for NER PII Extraction Dataset. Dataset Summary: This dataset is designed for training and evaluating Named Entity Recognition (NER) models focused on extracting Personally Identifiable Information (PII) from text. It includes a variety of entities such as names, addresses, phone numbers, email addresses, and identification numbers. The dataset is suitable for tasks that involve PII detection, compliance checks, and data anonymization. Supported Tasks and Leaderboards: Named Entity… See the full description on the dataset page: https://huggingface.co/datasets/Josephgflowers/PII-NER.

  9. Multilingual named entity recognition for medieval charters. Datasets and models

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Jan 16, 2023
    Cite
    Sergio Torres Aguilar (2023). Multilingual named entity recognition for medieval charters. Datasets and models [Dataset]. http://doi.org/10.5281/zenodo.6463699
    Explore at:
    Available download formats: zip
    Dataset updated
    Jan 16, 2023
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Sergio Torres Aguilar
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Annotated dataset for training named entity recognition models for medieval charters in Latin, French and Spanish.

    The original raw texts for all charters were collected from four charter collections:

    - HOME-ALCAR corpus : https://zenodo.org/record/5600884

    - CBMA : http://www.cbma-project.eu

    - Diplomata Belgica : https://www.diplomata-belgica.be

    - CODEA corpus : https://corpuscodea.es/

    We include (i) the annotated training datasets, (ii) the contextual and static embeddings trained on medieval multilingual texts and (iii) the named entity recognition models trained using two architectures: Bi-LSTM-CRF + stacked embeddings and fine-tuning on Bert-based models (mBert and RoBERTa).

    The code, datasets and notebooks used to train the models can be consulted in our gitlab repository: https://gitlab.com/magistermilitum/ner_medieval_multilingual

    Our best RoBERTa model is also available in the HuggingFace library: https://huggingface.co/magistermilitum/roberta-multilingual-medieval-ner

  10. AnythingNER

    • huggingface.co
    Updated Oct 6, 2024
    Cite
    CascadeNER (2024). AnythingNER [Dataset]. https://huggingface.co/datasets/CascadeNER/AnythingNER
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 6, 2024
    Authors
    CascadeNER
    Description

    DynamicNER: A Dynamic, Multilingual, and Fine-Grained Dataset for LLM-based Named Entity Recognition

    This repository is supplementary material for the paper: DynamicNER: A Dynamic, Multilingual, and Fine-Grained Dataset for LLM-based Named Entity Recognition

      💓Update!
    

    DynamicNER is now released! Please download it from Hugging Face for training and evaluation!

    We have added more existing datasets in GEIC format, as well as the format for fine-tuning and inference based on SWIFT! You can… See the full description on the dataset page: https://huggingface.co/datasets/CascadeNER/AnythingNER.

  11. HAREM Portuguese NER Corpus

    • kaggle.com
    zip
    Updated Dec 2, 2023
    Cite
    The Devastator (2023). HAREM Portuguese NER Corpus [Dataset]. https://www.kaggle.com/thedevastator/harem-portuguese-ner-corpus
    Explore at:
    Available download formats: zip (258157 bytes)
    Dataset updated
    Dec 2, 2023
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    HAREM Portuguese NER Corpus

    Portuguese NER Corpus with 10 Classes

    By harem (From Huggingface) [source]

    About this dataset

    The dataset is available in two versions: a complete version with 10 different named entity classes (Person, Organization, Location, Value, Date, Title, Thing, Event, Abstraction, and Other), and a selective version with only 5 classes (Person, Organization, Location, Value, and Date). The selective version focuses on the most commonly recognized named entity types.
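
One way to picture the relationship between the two versions is to project complete-version tags onto the five selective classes. This is only a sketch under assumptions: it supposes BIO-style tags such as `B-Person`, uses the English class names from the description above (the label strings in the actual files may differ), and `to_selective` is a hypothetical helper rather than how the selective version was actually produced.

```python
# The five classes named for the selective version in the description.
SELECTIVE = {"Person", "Organization", "Location", "Value", "Date"}

def to_selective(tags, outside="O"):
    """Map tags of non-selective classes to the outside label.

    Assumes BIO-style tags of the form "B-Class" / "I-Class".
    """
    projected = []
    for tag in tags:
        _prefix, sep, cls = tag.partition("-")
        if sep and cls in SELECTIVE:
            projected.append(tag)      # keep entities from the five classes
        else:
            projected.append(outside)  # drop Title, Thing, Event, etc.
    return projected

example = ["B-Person", "I-Person", "O", "B-Event", "I-Event", "B-Date"]
print(to_selective(example))
```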

    It's worth noting that the original HAREM dataset had two levels of NER detail: Category and Sub-type. However, the processed version of the corpus presented in this Kaggle dataset only includes information up to the Category level.

    Each entry in this dataset consists of tokenized words from the original text along with their corresponding NER tags assigned through annotation. The tokens column contains the individual words or tokens extracted from the text; a duplicate tokens column is provided for consistency purposes.

    Furthermore, the ner_tags column contains the class label assigned to each token, indicating its named entity class, such as Person or Organization. A duplicate ner_tags column is likewise provided for consistency within datasets where both columns co-occur.

    This particular Kaggle dataset also contains three separate CSV files: train.csv for training data, validation.csv for validating NER model performance on Portuguese texts, and test.csv, another subset of the HAREM corpus with tokenized words alongside their respective NER tags. The separate files let users train, validate, and test NER models on Portuguese texts.

    How to use the dataset

    Introduction:

    • Dataset Overview:

    • Dataset Files: a) train.csv - Contains the training data with tokens (individual words or tokens) and their corresponding named entity recognition (NER) tags. b) validation.csv - Provides a subset of the corpus for validating model performance in identifying named entities. c) test.csv - Contains tokenized words from the corpus along with their respective NER tags.

    • Named Entity Classes: The dataset includes 10 different named entity classes: Person, Organization, Location, Value, Date, Title, Thing, Abstraction, Event, and Other.

    • Understanding the Columns: a) tokens - This column comprises individual tokens or words extracted from the text. b) ner_tags - This column lists the named entity recognition tag assigned to each token.

    • Training and Evaluation: To use this dataset for training a NER model, you can utilize the train.csv file. The tokens column will provide you with the words or tokens, while the ner_tags column will guide you in labeling the named entities within your training data.

    For evaluating your model's performance, the validation.csv file can be used. Similar to the train.csv file, it contains tokenized words and their corresponding NER tags.

    • Applying Pretrained Models: You can also use this dataset to fine-tune or evaluate pretrained NER models in Portuguese. By utilizing transfer learning techniques on this corpus, you may improve their performance on relevant named entity recognition tasks specific

    Research Ideas

    • Entity Recognition and Classification: This dataset can be used to train and evaluate models for named entity recognition (NER) tasks in Portuguese. The NER tags provided in the dataset can serve as labels for training models to accurately identify and classify entities such as person names, organization names, locations, dates, etc.
    • Cross-lingual Transfer Learning: The dataset can also be leveraged for cross-lingual transfer learning tasks by training models on this dataset and then using the trained model to extract named entities from other languages as well. This would enable NER tasks in multiple languages using a single trained model by leveraging knowledge gained from this rich resource of labeled data in Portuguese

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, mod...

  12. WIESP2022-NER

    • huggingface.co
    Cite
    SAO/NASA Astrophysics Data System, WIESP2022-NER [Dataset]. https://huggingface.co/datasets/adsabs/WIESP2022-NER
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset provided by
    Astrophysics Data System: http://www.adsabs.harvard.edu/
    Authors
    SAO/NASA Astrophysics Data System
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset for the first Workshop on Information Extraction from Scientific Publications (WIESP/2022).

      Dataset Description
    

    Datasets with text fragments from astrophysics papers, provided by the NASA Astrophysical Data System with manually tagged astronomical facilities and other entities of interest (e.g., celestial objects). Datasets are in JSON Lines format (each line is a JSON dictionary). The datasets are formatted similarly to the CONLL2003 format. Each token is… See the full description on the dataset page: https://huggingface.co/datasets/adsabs/WIESP2022-NER.
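
A minimal sketch of reading a JSON Lines file of this kind, where each line is one JSON dictionary. The field names `tokens` and `ner_tags`, the tag names, and the two sample records are assumptions made for illustration; the dataset page documents the real fields.

```python
import json

# Hypothetical two-record sample; real field and tag names may differ.
SAMPLE_JSONL = """\
{"tokens": ["Observed", "with", "Hubble"], "ner_tags": ["O", "O", "B-Telescope"]}
{"tokens": ["Data", "from", "ALMA"], "ner_tags": ["O", "O", "B-Observatory"]}
"""

def read_jsonl(text):
    """Parse one JSON dictionary per non-empty line and check alignment."""
    records = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        if len(record["tokens"]) != len(record["ner_tags"]):
            raise ValueError(f"line {line_no}: token/tag length mismatch")
        records.append(record)
    return records

records = read_jsonl(SAMPLE_JSONL)
print(len(records), records[0]["tokens"])
```

Checking that tokens and tags have equal length on load catches misaligned records early, before they reach a tokenizer or training loop.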

  13. ancora-ca-ner

    • huggingface.co
    Updated Nov 1, 2021
    Cite
    Projecte Aina (2021). ancora-ca-ner [Dataset]. https://huggingface.co/datasets/projecte-aina/ancora-ca-ner
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 1, 2021
    Dataset authored and provided by
    Projecte Aina
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Card for AnCora-Ca-NER

      Dataset Summary
    

    This is a dataset for Named Entity Recognition (NER) in Catalan. It adapts AnCora corpus for Machine Learning and Language Model evaluation purposes. This dataset was developed by BSC TeMU as part of the Projecte AINA, to enrich the Catalan Language Understanding Benchmark (CLUB).

      Supported Tasks and Leaderboards
    

    Named Entities Recognition, Language Model

      Languages
    

    The dataset is in Catalan… See the full description on the dataset page: https://huggingface.co/datasets/projecte-aina/ancora-ca-ner.

  14. RWCS-NER-DC

    • huggingface.co
    Updated Mar 24, 2025
    + more versions
    Cite
    Shilin (2025). RWCS-NER-DC [Dataset]. https://huggingface.co/datasets/zsLin/RWCS-NER-DC
    Explore at:
    Dataset updated
    Mar 24, 2025
    Authors
    Shilin
    Description

    zsLin/RWCS-NER-DC dataset hosted on Hugging Face and contributed by the HF Datasets community

  15. Annotated_NER_PDF_Resumes

    • huggingface.co
    Updated Jul 22, 2024
    Cite
    MehyarMlaweh (2024). Annotated_NER_PDF_Resumes [Dataset]. https://huggingface.co/datasets/Mehyaar/Annotated_NER_PDF_Resumes
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 22, 2024
    Authors
    MehyarMlaweh
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    IT Skills Named Entity Recognition (NER) Dataset

      Description:
    

    This dataset includes 5,029 curriculum vitae (CV) samples, each annotated with IT skills using Named Entity Recognition (NER). The skills are manually labeled and extracted from PDFs, and the data is provided in JSON format. This dataset is ideal for training and evaluating NER models, especially for extracting IT skills from CVs.

      Highlights:
    

    5,029 CV samples with annotated IT skills; manual annotations for… See the full description on the dataset page: https://huggingface.co/datasets/Mehyaar/Annotated_NER_PDF_Resumes.

  16. aeroBERT-NER

    • huggingface.co
    Updated Apr 7, 2023
    Cite
    Archana Tikayat Ray (2023). aeroBERT-NER [Dataset]. http://doi.org/10.57967/hf/0470
    Explore at:
    Dataset updated
    Apr 7, 2023
    Authors
    Archana Tikayat Ray
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Card for aeroBERT-NER

      Dataset Summary
    

    This dataset contains sentences from the aerospace requirements domain. The sentences are tagged for five NER categories (SYS, VAL, ORG, DATETIME, and RES) using the BIO tagging scheme. There are a total of 1432 sentences. The creation of this dataset is aimed at -
    (1) Making available an open-source dataset for aerospace requirements which are often proprietary
    (2) Fine-tuning language models for token identification… See the full description on the dataset page: https://huggingface.co/datasets/archanatikayatray/aeroBERT-NER.
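
The BIO tagging scheme mentioned in the summary marks the first token of an entity with `B-<category>` and each continuation token with `I-<category>`; everything else is `O`. The sketch below groups such tags back into entity spans. The category names come from the summary, while the sentence, its tags, and the `bio_to_spans` helper are invented for illustration and are not taken from the dataset.

```python
def bio_to_spans(tokens, tags):
    """Group BIO tags into (category, text) spans.

    An "I-" tag that does not continue the current category is treated
    leniently as the start of a new span.
    """
    spans = []
    start, label = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel flushes the last span
        continues = tag.startswith("I-") and label == tag[2:]
        if continues:
            continue
        if label is not None:
            spans.append((label, " ".join(tokens[start:i])))
            start, label = None, None
        if tag.startswith(("B-", "I-")):
            start, label = i, tag[2:]
    return spans

# Invented example sentence and tags.
tokens = ["The", "system", "shall", "respond", "within", "5", "seconds", "per", "FAA", "rules"]
tags = ["O", "B-SYS", "O", "O", "O", "B-VAL", "I-VAL", "O", "B-ORG", "O"]
print(bio_to_spans(tokens, tags))
```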

  17. ner-model-tune

    • huggingface.co
    Updated Jul 11, 2023
    Cite
    Maro Huang (2023). ner-model-tune [Dataset]. https://huggingface.co/datasets/ayuhamaro/ner-model-tune
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 11, 2023
    Authors
    Maro Huang
    License

    https://choosealicense.com/licenses/unknown/

    Description

    Dataset Card for "NER Model Tune"

      Dataset Summary
    

    [More Information Needed]

      Supported Tasks and Leaderboards
    

    [More Information Needed]

      Languages
    

    [More Information Needed]

      Dataset Structure
    
    
    
    
    
      Data Instances
    

    [More Information Needed]

      Data Fields
    

    [More Information Needed]

      Data Splits
    

    [More Information Needed]

      Dataset Creation
    
    
    
    
    
      Curation Rationale
    

    [More Information Needed]

      Source Data… See the full description on the dataset page: https://huggingface.co/datasets/ayuhamaro/ner-model-tune.
    
  18. AskNews-NER-v0

    • huggingface.co
    Updated Mar 30, 2024
    Cite
    Emergent Methods (2024). AskNews-NER-v0 [Dataset]. https://huggingface.co/datasets/EmergentMethods/AskNews-NER-v0
    Explore at:
    Dataset updated
    Mar 30, 2024
    Dataset authored and provided by
    Emergent Methods
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Card for Dataset Name

    This dataset aims to improve the representation of underrepresented topics and entities in entity extractors, thereby improving entity extraction accuracy and generalization, especially on the latest news events (dataset represents broad news coverage between February 20-March 31, 2024). The dataset is a collection of news article summaries, translated and summarized with Llama2, and then entities extracted with Llama3. The distribution of data origin… See the full description on the dataset page: https://huggingface.co/datasets/EmergentMethods/AskNews-NER-v0.

  19. medmentions-ner

    • huggingface.co
    Updated Aug 11, 2025
    Cite
    Gerald Amasi (2025). medmentions-ner [Dataset]. https://huggingface.co/datasets/geraldamasi/medmentions-ner
    Explore at:
    Dataset updated
    Aug 11, 2025
    Authors
    Gerald Amasi
    Description

    MedMentions BioNER (Custom Processed)

    This dataset is a custom preprocessed version of the MedMentions dataset for biomedical Named Entity Recognition (NER) tasks. It is compatible with Hugging Face Datasets and can be used directly for fine-tuning BERT-based models such as BERT or Bio_ClinicalBERT.

      Dataset Summary
    

    Task: Named Entity Recognition (NER) in biomedical text. Source: MedMentions. Language: English. Entity Types: 128 entity classes derived from UMLS semantic… See the full description on the dataset page: https://huggingface.co/datasets/geraldamasi/medmentions-ner.

  20. bioleaflets-biomedical-ner

    • huggingface.co
    Updated May 7, 2023
    Cite
    Ruslan Yermak (2023). bioleaflets-biomedical-ner [Dataset]. https://huggingface.co/datasets/ruslan/bioleaflets-biomedical-ner
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 7, 2023
    Authors
    Ruslan Yermak
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Card for BioLeaflets Dataset

      Dataset Summary
    

    BioLeaflets is a biomedical dataset for Data2Text generation. It is a corpus of 1,336 package leaflets of medicines authorised in Europe, which were obtained by scraping the European Medicines Agency (EMA) website. Package leaflets are included in the packaging of medicinal products and contain information to help patients use the product safely and appropriately. This dataset comprises the large majority (∼ 90%) of… See the full description on the dataset page: https://huggingface.co/datasets/ruslan/bioleaflets-biomedical-ner.
