100+ datasets found

kaggle-entity-annotated-corpus-ner-dataset
huggingface.co
Updated Jul 10, 2022
Cite
Rafael Arias Calles (2022). kaggle-entity-annotated-corpus-ner-dataset [Dataset]. https://huggingface.co/datasets/rjac/kaggle-entity-annotated-corpus-ner-dataset
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Jul 10, 2022
Authors
Rafael Arias Calles
License
https://choosealicense.com/licenses/odbl/
Description
Date: 2022-07-10
Files: ner_dataset.csv
Source: Kaggle entity annotated corpus
Notes: The dataset contains only the tokens and NER tag labels. Labels are uppercase.

About Dataset

from Kaggle Datasets

Context

Annotated corpus for named entity recognition, built on the GMB (Groningen Meaning Bank) corpus for entity classification, with enhanced and popular features produced by natural language processing applied to the data set. Tip: Use a pandas DataFrame to load the dataset if using Python for… See the full description on the dataset page: https://huggingface.co/datasets/rjac/kaggle-entity-annotated-corpus-ner-dataset.
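As a rough illustration of working with a tokens-plus-labels file like the one described above, the sketch below groups rows into sentences. It uses Python's standard csv module instead of pandas to stay dependency-free. The column names (sentence_id, token, tag) and the sample rows are hypothetical; the real layout of ner_dataset.csv should be checked on the dataset page.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample data; the real ner_dataset.csv columns may differ.
SAMPLE_CSV = """sentence_id,token,tag
1,London,B-GEO
1,is,O
1,rainy,O
2,Obama,B-PER
2,spoke,O
"""


def load_sentences(handle):
    """Group (token, tag) pairs by sentence id, preserving row order."""
    sentences = defaultdict(list)
    for row in csv.DictReader(handle):
        # The card says labels are uppercase; normalise defensively anyway.
        sentences[row["sentence_id"]].append((row["token"], row["tag"].upper()))
    return dict(sentences)


sentences = load_sentences(io.StringIO(SAMPLE_CSV))
for sentence_id, pairs in sentences.items():
    print(sentence_id, pairs)
```

To read from disk, pass an open file handle in place of the in-memory string.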
Pile-NER-definition
huggingface.co
Updated Aug 13, 2023
+ more versions
Cite
Universal-NER (2023). Pile-NER-definition [Dataset]. https://huggingface.co/datasets/Universal-NER/Pile-NER-definition
Explore at:
Croissant
Dataset updated
Aug 13, 2023
Authors
Universal-NER
Description
Intro

Pile-NER-definition is a set of GPT-generated data for named entity recognition using the definition-based data construction prompt. It was collected by prompting gpt-3.5-turbo-0301 and augmented by negative sampling. Check our project page for more information.

License

Attribution-NonCommercial 4.0 International
kpwr-ner
huggingface.co
Updated Apr 1, 2022
Cite
CLARIN-PL (2022). kpwr-ner [Dataset]. https://huggingface.co/datasets/clarin-pl/kpwr-ner
Explore at:
Croissant
Dataset updated
Apr 1, 2022
Dataset authored and provided by
CLARIN-PL
License
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
Description
KPWR-NER tagging dataset.
Financial-NER-NLP
huggingface.co
Cite
Joseph G Flowers, Financial-NER-NLP [Dataset]. https://huggingface.co/datasets/Josephgflowers/Financial-NER-NLP
Explore at:
Croissant
Authors
Joseph G Flowers
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Dataset Card for Financial-NER-NLP

Dataset Summary

The Financial-NER-NLP Dataset is a derivative of the FiNER-139 dataset, which consists of 1.1 million sentences annotated with 139 XBRL tags. This new dataset transforms the original structured data into natural language prompts suitable for training language models. The dataset is designed to enhance models’ abilities in tasks such as named entity recognition (NER), summarization, and information extraction in the financial domain. The… See the full description on the dataset page: https://huggingface.co/datasets/Josephgflowers/Financial-NER-NLP.
BERT-Base-Multilingual-Cased
kaggle.com
zip
Updated Dec 9, 2024
Cite
Mehrdad ALMASI (2024). BERT-Base-Multilingual-Cased [Dataset]. https://www.kaggle.com/datasets/mehrdadal2023/bert-base-multilingual-cased
Explore at:
zip (2992453291 bytes)
Dataset updated
Dec 9, 2024
Authors
Mehrdad ALMASI
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
This file contains the pre-trained BERT-Base-Multilingual-Cased model files, originally provided by Google on Hugging Face. The model supports 104 languages and is ideal for a wide range of Natural Language Processing (NLP) tasks, including text classification, named entity recognition (NER), and question answering.

This model is particularly helpful for multilingual NLP applications due to its ability to process cased text (case-sensitive input). Key details:

Source: Hugging Face
Architecture: 12-layer Transformer with 110M parameters
Tasks: Text classification, NER, question answering, etc.

To install, run the line below (the path is quoted because it contains spaces):

pip install "/kaggle/input/bert-base-multilingual-cased/Google Bert Multilingual"
MultiNERD NER models
kaggle.com
zip
Updated Dec 5, 2023
Cite
Jayant Yadav (2023). MultiNERD NER models [Dataset]. https://www.kaggle.com/datasets/jayantyadav/multinerd-ner-models/versions/5
Explore at:
zip (2588704751 bytes)
Dataset updated
Dec 5, 2023
Authors
Jayant Yadav
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
MultiNERD Named Entity Recognition dataset: https://huggingface.co/datasets/Babelscape/multinerd
Paper: https://aclanthology.org/2022.findings-naacl.60.pdf
Sample token classification script: https://github.com/huggingface/notebooks/blob/main/examples/token_classification.ipynb
RoBERTa-base pretrained model link: https://huggingface.co/docs/transformers/model_doc/roberta
GLiNER Github Repo
kaggle.com
zip
Updated Oct 26, 2025
Cite
Darien Schettler (2025). GLiNER Github Repo [Dataset]. https://www.kaggle.com/dschettler8845/gliner-github-repo
Explore at:
zip (545226 bytes)
Dataset updated
Oct 26, 2025
Authors
Darien Schettler
Description
GLiNER: Generalist and Lightweight model for Named Entity Recognition

GLiNER is a Named Entity Recognition (NER) model capable of identifying any entity type using a bidirectional transformer encoder (BERT-like). It provides a practical alternative to traditional NER models, which are limited to predefined entities, and Large Language Models (LLMs) that, despite their flexibility, are costly and large for resource-constrained scenarios.

Paper: https://arxiv.org/abs/2311.08526 (by Urchade Zaratiana, Nadi Tomeh, Pierre Holat, Thierry Charnois)

Demo: https://huggingface.co/spaces/tomaarsen/gliner_base

Colab: https://colab.research.google.com/drive/1mhalKWzmfSTqMnR0wQBZvt9-ktTsATHB?usp=sharing

Models Status

📢 Updates

📝 Finetuning notebook is available: examples/finetune.ipynb

🗂 Training dataset preprocessing scripts are now available in the data/ directory, covering both Pile-NER 📚 and NuNER 📘 datasets.

Available Models on Hugging Face

[x] GLiNER-Base (CC BY NC 4.0)

[x] GLiNER-Multi (CC BY NC 4.0)

[x] GLiNER-small (CC BY NC 4.0)

[x] GLiNER-small-v2 (Apache)

[x] GLiNER-medium (CC BY NC 4.0)

[x] GLiNER-medium-v2 (Apache)

[x] GLiNER-large (CC BY NC 4.0)

[x] GLiNER-large-v2 (Apache)

To Release

[ ] ⏳ GLiNER-Multiv2

[ ] ⏳ GLiNER-Sup (trained on mixture of NER datasets)

Area of improvements / research

[ ] Allow longer context (e.g., train with long-context transformers such as Longformer, LED, etc.)

[ ] Use a bi-encoder (entity encoder and span encoder), allowing entity embeddings to be precomputed

[ ] Add a filtering mechanism to reduce the number of spans before final classification, to save memory and computation when the number of entity types is large

[ ] Improve understanding of more detailed prompts/instructions, e.g., "Find the first name of the person in the text"

[ ] Better loss function: for instance, use Focal Loss instead of BCE to handle class imbalance, as some entity types are more frequent than others

[ ] Improve multilingual capabilities: train on more languages and use multilingual training data

[ ] Decoding: allow a span to have multiple labels, e.g., "Cristiano Ronaldo" is both a "person" and a "football player"

[ ] Dynamic thresholding (in model.predict_entities(text, labels, threshold=0.5)): allow the model to predict more or fewer entities depending on the context. Currently, the model tends to predict fewer entities where the entity type or the domain is not well represented in the training data.

[ ] Train with EMAs (Exponential Moving Averages) or merge multiple checkpoints to improve model robustness

[ ] Extend the model to relation extraction, which requires a dataset with relation annotations. See our preliminary work, ATG.
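To make the loss-function item in the list above concrete, here is a minimal sketch comparing binary cross-entropy (BCE) with binary focal loss for a single prediction. It is written in plain Python for illustration and is not taken from the GLiNER code base. The defaults gamma=2.0 and alpha=0.25 are commonly used values and are an assumption here, not something stated on this page.

```python
import math


def bce(p, y):
    """Binary cross-entropy for one predicted probability p and label y in {0, 1}."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss: -alpha_t * (1 - p_t)**gamma * log(p_t)."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = (0.95, 1)  # positive example the model already gets right
hard = (0.10, 1)  # positive example the model gets badly wrong

# How much more the hard example weighs than the easy one under each loss.
bce_ratio = bce(*hard) / bce(*easy)
focal_ratio = focal_loss(*hard) / focal_loss(*easy)
print(f"BCE hard/easy ratio:   {bce_ratio:.1f}")
print(f"Focal hard/easy ratio: {focal_ratio:.1f}")
```

The focal term (1 - p_t)**gamma shrinks the contribution of well-classified examples, so rare or difficult entity types carry relatively more of the total loss than they would under BCE.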

Installation

To use this model, you must install the GLiNER Python library:

!pip install gliner

Usage

Once you've installed the GLiNER library, you can import the GLiNER class. You can then load this model using GLiNER.from_pretrained and predict entities with predict_entities.

from gliner import GLiNER

model = GLiNER.from_pretrained("urchade/gliner_base")

text = """
Cristiano Ronaldo dos Santos Aveiro (Portuguese pronunciation: [kɾiʃˈtjɐnu ʁɔˈnaldu]; born 5 February 1985) is a Portuguese professional footballer who plays as a forward for and captains both Saudi Pro League club Al Nassr and the Portugal national team. Widely regarded as one of the greatest players of all time, Ronaldo has won five Ballon d'Or awards,[note 3] a record three UEFA Men's Player of the Year Awards, and four European Golden Shoes, the most by a European player. He has won 33 trophies in his career, including seven league titles, five UEFA Champions Leagues, the UEFA European Championship and the UEFA Nations League. Ronaldo holds the records for most appearances (183), goals (140) and assists (42) in the Champions League, goals in the European Championship (14), international goals (128) and international appearances (205). He is one of the few players to have made over 1,200 professional career appearances, the most by an outfield player, and has scored over 850 offici...
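The effect of the threshold argument to predict_entities can be illustrated without loading the model. The sketch below filters a hand-written list of predictions by score, with an optional per-label override. The dict shape (text, label, score), the labels, and the scores are all hypothetical values chosen for illustration, and filter_by_threshold is a helper invented here, not part of the GLiNER library.

```python
def filter_by_threshold(predictions, default=0.5, per_label=None):
    """Keep predictions whose score meets the threshold for their label.

    per_label optionally maps a label to its own threshold; labels not
    listed there fall back to the default threshold.
    """
    per_label = per_label or {}
    return [
        p for p in predictions
        if p["score"] >= per_label.get(p["label"], default)
    ]


# Hypothetical predictions; real scores depend on the model and the text.
predictions = [
    {"text": "Cristiano Ronaldo", "label": "person", "score": 0.97},
    {"text": "Al Nassr", "label": "club", "score": 0.42},
    {"text": "Ballon d'Or", "label": "award", "score": 0.61},
]

strict = filter_by_threshold(predictions)
relaxed = filter_by_threshold(predictions, per_label={"club": 0.4})
print([p["text"] for p in strict])
print([p["text"] for p in relaxed])
```

Lowering the threshold for a label that is under-represented in training data is one simple way to recover entities that a single global threshold would drop.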
PII-NER
huggingface.co
Updated Jul 20, 2024
Cite
Joseph G Flowers (2024). PII-NER [Dataset]. https://huggingface.co/datasets/Josephgflowers/PII-NER
Explore at:
Croissant
Dataset updated
Jul 20, 2024
Authors
Joseph G Flowers
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
Dataset Card for NER PII Extraction Dataset

Dataset Summary

This dataset is designed for training and evaluating Named Entity Recognition (NER) models focused on extracting Personally Identifiable Information (PII) from text. It includes a variety of entities such as names, addresses, phone numbers, email addresses, and identification numbers. The dataset is suitable for tasks that involve PII detection, compliance checks, and data anonymization.

Supported Tasks and Leaderboards

Named Entity… See the full description on the dataset page: https://huggingface.co/datasets/Josephgflowers/PII-NER.
Multilingual named entity recognition for medieval charters. Datasets and...
zenodo.org
data.niaid.nih.gov
zip
Updated Jan 16, 2023
Cite
Sergio Torres Aguilar (2023). Multilingual named entity recognition for medieval charters. Datasets and models [Dataset]. http://doi.org/10.5281/zenodo.6463699
Explore at:
zip
Unique identifier
https://doi.org/10.5281/zenodo.6463699
Dataset updated
Jan 16, 2023
Dataset provided by
Zenodo: http://zenodo.org/
Authors
Sergio Torres Aguilar
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Annotated dataset for training named entity recognition models for medieval charters in Latin, French and Spanish.

The original raw texts for all charters were collected from four charter collections:

- HOME-ALCAR corpus : https://zenodo.org/record/5600884

- CBMA : http://www.cbma-project.eu

- Diplomata Belgica : https://www.diplomata-belgica.be

- CODEA corpus : https://corpuscodea.es/

We include (i) the annotated training datasets, (ii) the contextual and static embeddings trained on medieval multilingual texts and (iii) the named entity recognition models trained using two architectures: Bi-LSTM-CRF + stacked embeddings and fine-tuning on BERT-based models (mBERT and RoBERTa).

The code, datasets and notebooks used to train the models can be consulted in our GitLab repository: https://gitlab.com/magistermilitum/ner_medieval_multilingual

Our best RoBERTa model is also available in the HuggingFace library: https://huggingface.co/magistermilitum/roberta-multilingual-medieval-ner
AnythingNER
huggingface.co
Updated Oct 6, 2024
Cite
CascadeNER (2024). AnythingNER [Dataset]. https://huggingface.co/datasets/CascadeNER/AnythingNER
Explore at:
Croissant
Dataset updated
Oct 6, 2024
Authors
CascadeNER
Description
DynamicNER: A Dynamic, Multilingual, and Fine-Grained Dataset for LLM-based Named Entity Recognition

This repository is supplementary material for the paper: DynamicNER: A Dynamic, Multilingual, and Fine-Grained Dataset for LLM-based Named Entity Recognition

💓Update!

DynamicNER is now released! Please download it from Hugging Face for training and evaluation!

We have added more existing datasets in GEIC format, as well as the format for fine-tuning and inference based on SWIFT! You can… See the full description on the dataset page: https://huggingface.co/datasets/CascadeNER/AnythingNER.
HAREM Portuguese NER Corpus
kaggle.com
zip
Updated Dec 2, 2023
Cite
The Devastator (2023). HAREM Portuguese NER Corpus [Dataset]. https://www.kaggle.com/thedevastator/harem-portuguese-ner-corpus
Explore at:
zip (258157 bytes)
Dataset updated
Dec 2, 2023
Authors
The Devastator
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
HAREM Portuguese NER Corpus

Portuguese NER Corpus with 10 Classes

By harem (from Hugging Face)

About this dataset

The dataset is available in two versions: a complete version with 10 different named entity classes (Person, Organization, Location, Value, Date, Title, Thing, Event, Abstraction, and Other) and a selective version with only 5 classes (Person, Organization, Location, Value, and Date). The selective version focuses on the most commonly recognized named entity types.
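The relationship between the two versions can be sketched as a tag mapping: any tag whose class falls outside the five selective classes becomes the outside tag. This is only an illustration of the idea. The page does not say how the selective version was actually produced, and the BIO-prefixed tag strings used here (for example B-Person) are an assumed format, not the dataset's documented labels.

```python
# The five classes kept in the selective version, as listed above.
SELECTIVE = {"Person", "Organization", "Location", "Value", "Date"}


def to_selective(tag):
    """Map a BIO tag such as 'B-Title' to 'O' unless its class is selective.

    Assumes tags look like 'B-Class' / 'I-Class' / 'O' (a hypothetical format).
    """
    if tag == "O":
        return tag
    _prefix, _, cls = tag.partition("-")
    return tag if cls in SELECTIVE else "O"


full_tags = ["B-Person", "I-Person", "O", "B-Event", "I-Event", "B-Date"]
selective_tags = [to_selective(t) for t in full_tags]
print(selective_tags)
```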

It's worth noting that the original HAREM dataset had two levels of NER detail: Category and Sub-type. However, the processed version of the corpus presented in this Kaggle dataset only includes information up to the Category level.

Each entry in this dataset consists of tokenized words from the original text along with their corresponding NER tags assigned through annotation. The tokens column contains the individual words or tokens extracted from the text; a duplicate tokens column is included for consistency purposes.

Furthermore, the ner_tags column contains the class label assigned to each token, indicating its named entity class, such as Person or Organization. A duplicate ner_tags column is likewise included, for consistency within datasets where both columns might co-occur.

This Kaggle dataset contains three separate CSV files: train.csv for training data; validation.csv, a subset used for validating NER model performance on Portuguese texts; and test.csv, another subset of the HAREM corpus with tokenized words alongside their respective NER tags. The separate files let users train, validate, and test NER models on Portuguese texts.

How to use the dataset

Introduction:

Dataset Overview:

Dataset Files:
a) train.csv - Contains the training data with tokens (individual words or tokens) and their corresponding named entity recognition (NER) tags.
b) validation.csv - Provides a subset of the corpus for validating model performance in identifying named entities.
c) test.csv - Contains tokenized words from the corpus along with their respective NER tags.

Named Entity Classes: The dataset includes 10 different named entity classes: Person, Organization, Location, Value, Date, Title, Thing, Abstraction, Event, and Other.

Understanding the Columns:
a) tokens - This column comprises individual tokens or words extracted from the text.
b) ner_tags - This column lists the named entity recognition tag assigned to each token, indicating its class.

Training and Evaluation: To use this dataset for training a NER model, you can utilize the train.csv file. The tokens column will provide you with the words or tokens, while the ner_tags column will guide you in labeling the named entities within your training data.

For evaluating your model's performance, the validation.csv file can be used. Similar to the train.csv file, it contains tokenized words and their corresponding NER tags.

Applying Pretrained Models: You can also use this dataset to fine-tune or evaluate pretrained NER models in Portuguese. By utilizing transfer learning techniques on this corpus, you may improve their performance on relevant named entity recognition tasks specific

Research Ideas

Entity Recognition and Classification: This dataset can be used to train and evaluate models for named entity recognition (NER) tasks in Portuguese. The NER tags provided in the dataset can serve as labels for training models to accurately identify and classify entities such as person names, organization names, locations, dates, etc.

Cross-lingual Transfer Learning: The dataset can also be leveraged for cross-lingual transfer learning by training models on it and then using the trained model to extract named entities from text in other languages. This would enable NER in multiple languages with a single trained model, drawing on the knowledge gained from this labeled Portuguese data.

Acknowledgements

If you use this dataset in your research, please credit the original authors. Data Source

License

License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, mod...
WIESP2022-NER
huggingface.co
Cite
SAO/NASA Astrophysics Data System, WIESP2022-NER [Dataset]. https://huggingface.co/datasets/adsabs/WIESP2022-NER
Explore at:
Croissant
Dataset provided by
Astrophysics Data System: http://www.adsabs.harvard.edu/
Authors
SAO/NASA Astrophysics Data System
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Dataset for the first Workshop on Information Extraction from Scientific Publications (WIESP/2022).

Dataset Description

Datasets with text fragments from astrophysics papers, provided by the NASA Astrophysics Data System, with manually tagged astronomical facilities and other entities of interest (e.g., celestial objects). Datasets are in JSON Lines format (each line is a JSON dictionary). The datasets are formatted similarly to the CONLL2003 format. Each token is… See the full description on the dataset page: https://huggingface.co/datasets/adsabs/WIESP2022-NER.
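Reading a JSON Lines file of this kind takes only the standard library: each non-empty line is parsed as its own JSON dictionary. The sketch below uses an in-memory sample. The field names (tokens, ner_tags), the tag names, and the sample records are hypothetical; the actual keys are documented on the dataset page.

```python
import io
import json

# Hypothetical records; real field names and tags may differ.
SAMPLE_JSONL = (
    '{"tokens": ["Observed", "with", "Hubble"], "ner_tags": ["O", "O", "B-Telescope"]}\n'
    '\n'
    '{"tokens": ["M31"], "ner_tags": ["B-CelestialObject"]}\n'
)


def read_jsonl(handle):
    """Parse one JSON dictionary per non-empty line."""
    records = []
    for line in handle:
        line = line.strip()
        if line:  # skip blank lines
            records.append(json.loads(line))
    return records


records = read_jsonl(io.StringIO(SAMPLE_JSONL))
for record in records:
    print(list(zip(record["tokens"], record["ner_tags"])))
```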
ancora-ca-ner
huggingface.co
Updated Nov 1, 2021
Cite
Projecte Aina (2021). ancora-ca-ner [Dataset]. https://huggingface.co/datasets/projecte-aina/ancora-ca-ner
Explore at:
Croissant
Dataset updated
Nov 1, 2021
Dataset authored and provided by
Projecte Aina
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Dataset Card for AnCora-Ca-NER

Dataset Summary

This is a dataset for Named Entity Recognition (NER) in Catalan. It adapts the AnCora corpus for Machine Learning and Language Model evaluation purposes. This dataset was developed by BSC TeMU as part of Projecte AINA, to enrich the Catalan Language Understanding Benchmark (CLUB).

Supported Tasks and Leaderboards

Named Entity Recognition, Language Model

Languages

The dataset is in Catalan… See the full description on the dataset page: https://huggingface.co/datasets/projecte-aina/ancora-ca-ner.
RWCS-NER-DC
huggingface.co
Updated Mar 24, 2025
+ more versions
Cite
Shilin (2025). RWCS-NER-DC [Dataset]. https://huggingface.co/datasets/zsLin/RWCS-NER-DC
Explore at:
Dataset updated
Mar 24, 2025
Authors
Shilin
Description
zsLin/RWCS-NER-DC dataset hosted on Hugging Face and contributed by the HF Datasets community
Annotated_NER_PDF_Resumes
huggingface.co
Updated Jul 22, 2024
Cite
MehyarMlaweh (2024). Annotated_NER_PDF_Resumes [Dataset]. https://huggingface.co/datasets/Mehyaar/Annotated_NER_PDF_Resumes
Explore at:
Croissant
Dataset updated
Jul 22, 2024
Authors
MehyarMlaweh
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
IT Skills Named Entity Recognition (NER) Dataset

Description:

This dataset includes 5,029 curriculum vitae (CV) samples, each annotated with IT skills using Named Entity Recognition (NER). The skills are manually labeled and extracted from PDFs, and the data is provided in JSON format. This dataset is ideal for training and evaluating NER models, especially for extracting IT skills from CVs.

Highlights:

- 5,029 CV samples with annotated IT skills
- Manual annotations for… See the full description on the dataset page: https://huggingface.co/datasets/Mehyaar/Annotated_NER_PDF_Resumes.
aeroBERT-NER
huggingface.co
Updated Apr 7, 2023
Cite
Archana Tikayat Ray (2023). aeroBERT-NER [Dataset]. http://doi.org/10.57967/hf/0470
Explore at:
Unique identifier
https://doi.org/10.57967/hf/0470
Dataset updated
Apr 7, 2023
Authors
Archana Tikayat Ray
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Dataset Card for aeroBERT-NER

Dataset Summary

This dataset contains sentences from the aerospace requirements domain. The sentences are tagged for five NER categories (SYS, VAL, ORG, DATETIME, and RES) using the BIO tagging scheme. There are a total of 1432 sentences. This dataset was created with two aims:
(1) Making available an open-source dataset for aerospace requirements, which are often proprietary
(2) Fine-tuning language models for token identification… See the full description on the dataset page: https://huggingface.co/datasets/archanatikayatray/aeroBERT-NER.
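For readers unfamiliar with the BIO tagging scheme mentioned above, the sketch below turns a sequence of B-/I-/O tags into labelled spans: B- opens an entity, I- continues it, and O is outside any entity. The example sentence and its tags are made up for illustration and are not drawn from the dataset; only the category names SYS and VAL come from the description.

```python
def bio_to_spans(tokens, tags):
    """Collect (label, text) spans from parallel token and BIO tag lists.

    An I- tag that does not continue the current entity is treated as
    outside (O) in this simple sketch.
    """
    spans = []
    label, words = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if words:
                spans.append((label, " ".join(words)))
            label, words = tag[2:], [token]
        elif tag.startswith("I-") and label == tag[2:]:
            words.append(token)
        else:
            if words:
                spans.append((label, " ".join(words)))
            label, words = None, []
    if words:  # flush an entity that runs to the end of the sentence
        spans.append((label, " ".join(words)))
    return spans


# Hypothetical example sentence, not taken from aeroBERT-NER.
tokens = ["The", "flight", "control", "system", "shall", "respond", "within", "5", "ms"]
tags = ["O", "B-SYS", "I-SYS", "I-SYS", "O", "O", "O", "B-VAL", "I-VAL"]
spans = bio_to_spans(tokens, tags)
print(spans)
```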
ner-model-tune
huggingface.co
Updated Jul 11, 2023
Cite
Maro Huang (2023). ner-model-tune [Dataset]. https://huggingface.co/datasets/ayuhamaro/ner-model-tune
Explore at:
Croissant
Dataset updated
Jul 11, 2023
Authors
Maro Huang
License
https://choosealicense.com/licenses/unknown/
Description
Dataset Card for "NER Model Tune"

Dataset Summary

[More Information Needed]

Supported Tasks and Leaderboards

[More Information Needed]

Languages

[More Information Needed]

Dataset Structure

Data Instances

[More Information Needed]

Data Fields

[More Information Needed]

Data Splits

[More Information Needed]

Dataset Creation

Curation Rationale

[More Information Needed]

Source Data… See the full description on the dataset page: https://huggingface.co/datasets/ayuhamaro/ner-model-tune.
AskNews-NER-v0
huggingface.co
Updated Mar 30, 2024
Cite
Emergent Methods (2024). AskNews-NER-v0 [Dataset]. https://huggingface.co/datasets/EmergentMethods/AskNews-NER-v0
Explore at:
Dataset updated
Mar 30, 2024
Dataset authored and provided by
Emergent Methods
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Dataset Card for Dataset Name

This dataset aims to improve the representation of underrepresented topics and entities in entity extractors, thereby improving entity extraction accuracy and generalization, especially on the latest news events (dataset represents broad news coverage between February 20-March 31, 2024). The dataset is a collection of news article summaries, translated and summarized with Llama2, and then entities extracted with Llama3. The distribution of data origin… See the full description on the dataset page: https://huggingface.co/datasets/EmergentMethods/AskNews-NER-v0.
medmentions-ner
huggingface.co
Updated Aug 11, 2025
Cite
Gerald Amasi (2025). medmentions-ner [Dataset]. https://huggingface.co/datasets/geraldamasi/medmentions-ner
Explore at:
Dataset updated
Aug 11, 2025
Authors
Gerald Amasi
Description
MedMentions BioNER (Custom Processed)

This dataset is a custom preprocessed version of the MedMentions dataset for biomedical Named Entity Recognition (NER) tasks. It is compatible with Hugging Face Datasets and can be used directly for fine-tuning BERT-based models such as BERT or Bio_ClinicalBERT.

Dataset Summary

Task: Named Entity Recognition (NER) in biomedical text
Source: MedMentions
Language: English
Entity Types: 128 entity classes derived from UMLS semantic… See the full description on the dataset page: https://huggingface.co/datasets/geraldamasi/medmentions-ner.
bioleaflets-biomedical-ner
huggingface.co
Updated May 7, 2023
Cite
Ruslan Yermak (2023). bioleaflets-biomedical-ner [Dataset]. https://huggingface.co/datasets/ruslan/bioleaflets-biomedical-ner
Explore at:
Croissant
Dataset updated
May 7, 2023
Authors
Ruslan Yermak
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Dataset Card for BioLeaflets Dataset

Dataset Summary

BioLeaflets is a biomedical dataset for Data2Text generation. It is a corpus of 1,336 package leaflets of medicines authorised in Europe, which were obtained by scraping the European Medicines Agency (EMA) website. Package leaflets are included in the packaging of medicinal products and contain information to help patients use the product safely and appropriately. This dataset comprises the large majority (∼ 90%) of… See the full description on the dataset page: https://huggingface.co/datasets/ruslan/bioleaflets-biomedical-ner.
