100+ datasets found
  1. kaggle-entity-annotated-corpus-ner-dataset

    • huggingface.co
    Updated Jul 10, 2022
    Cite
    Rafael Arias Calles (2022). kaggle-entity-annotated-corpus-ner-dataset [Dataset]. https://huggingface.co/datasets/rjac/kaggle-entity-annotated-corpus-ner-dataset
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jul 10, 2022
    Authors
    Rafael Arias Calles
    License

    https://choosealicense.com/licenses/odbl/

    Description

    Date: 2022-07-10. Files: ner_dataset.csv. Source: Kaggle entity annotated corpus. Notes: the dataset only contains the tokens and NER tag labels. Labels are uppercase.

      About Dataset
    

    from Kaggle Datasets

      Context
    

    Annotated corpus for Named Entity Recognition, built from the GMB (Groningen Meaning Bank) corpus for entity classification, with enhanced and popular Natural Language Processing features applied to the data set. Tip: use a Pandas DataFrame to load the dataset if using Python for… See the full description on the dataset page: https://huggingface.co/datasets/rjac/kaggle-entity-annotated-corpus-ner-dataset.
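The tip above about Pandas reflects the CSV layout of the original Kaggle corpus, where a sparse sentence-marker column groups token rows into sentences. A minimal stdlib sketch; the column names ("Sentence #", "Word", "POS", "Tag") are assumptions based on that corpus, so check the actual header of ner_dataset.csv:

```python
import csv
import io

# A few rows in the layout used by the Kaggle entity-annotated corpus.
# Column names are assumptions; adjust them to ner_dataset.csv's header.
SAMPLE = """Sentence #,Word,POS,Tag
Sentence: 1,Thousands,NNS,O
,of,IN,O
,demonstrators,NNS,O
,marched,VBN,O
,through,IN,O
,London,NNP,B-GEO
Sentence: 2,Families,NNS,O
,of,IN,O
,soldiers,NNS,O
"""

def read_sentences(fileobj):
    """Group (word, tag) rows into sentences; a filled 'Sentence #' cell starts a new one."""
    sentences = []
    for row in csv.DictReader(fileobj):
        if row["Sentence #"]:          # non-empty marker -> new sentence begins
            sentences.append([])
        sentences[-1].append((row["Word"], row["Tag"]))
    return sentences

sentences = read_sentences(io.StringIO(SAMPLE))
print(len(sentences))        # 2
print(sentences[0][-1])      # ('London', 'B-GEO')
```

The same grouping is what a pandas `groupby` on a forward-filled sentence column would produce.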

  2. Financial-NER-NLP

    • huggingface.co
    Cite
    Joseph G Flowers, Financial-NER-NLP [Dataset]. https://huggingface.co/datasets/Josephgflowers/Financial-NER-NLP
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Authors
    Joseph G Flowers
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Dataset Card for Financial-NER-NLP Dataset Summary The Financial-NER-NLP Dataset is a derivative of the FiNER-139 dataset, which consists of 1.1 million sentences annotated with 139 XBRL tags. This new dataset transforms the original structured data into natural language prompts suitable for training language models. The dataset is designed to enhance models’ abilities in tasks such as named entity recognition (NER), summarization, and information extraction in the financial domain. The… See the full description on the dataset page: https://huggingface.co/datasets/Josephgflowers/Financial-NER-NLP.

  3. Pile-NER-type

    • huggingface.co
    Updated Aug 9, 2023
    + more versions
    Cite
    Universal-NER (2023). Pile-NER-type [Dataset]. https://huggingface.co/datasets/Universal-NER/Pile-NER-type
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Aug 9, 2023
    Authors
    Universal-NER
    Description

    Intro

    Pile-NER-type is a set of GPT-generated data for named entity recognition using the type-based data construction prompt. It was collected by prompting gpt-3.5-turbo-0301 and augmented by negative sampling. Check our project page for more information.

      License
    

    Attribution-NonCommercial 4.0 International

  4. Data from: PyTorch model for Slovenian Named Entity Recognition SloNER 1.0

    • live.european-language-grid.eu
    Updated Jan 26, 2023
    Cite
    (2023). PyTorch model for Slovenian Named Entity Recognition SloNER 1.0 [Dataset]. https://live.european-language-grid.eu/catalogue/tool-service/20980
    Explore at:
    Dataset updated
    Jan 26, 2023
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    The SloNER is a model for Slovenian Named Entity Recognition. It is a PyTorch neural network model, intended for use with the Hugging Face transformers library (https://github.com/huggingface/transformers).

    The model is based on the Slovenian RoBERTa contextual embeddings model SloBERTa 2.0 (http://hdl.handle.net/11356/1397) and was trained on the SUK 1.0 training corpus (http://hdl.handle.net/11356/1747). The source code of the model is available in the GitHub repository https://github.com/clarinsi/SloNER.

  5. MultiNERD NER models

    • kaggle.com
    Updated Dec 5, 2023
    Cite
    Jayant Yadav (2023). MultiNERD NER models [Dataset]. https://www.kaggle.com/datasets/jayantyadav/multinerd-ner-models/versions/5
    Explore at:
    Available download formats: zip (2588704751 bytes)
    Dataset updated
    Dec 5, 2023
    Authors
    Jayant Yadav
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description
  6. Multilingual NER Dataset

    • kaggle.com
    Updated Dec 5, 2023
    Cite
    The Devastator (2023). Multilingual NER Dataset [Dataset]. https://www.kaggle.com/thedevastator/multilingual-ner-dataset
    Explore at:
    Available download formats: zip (72419294 bytes)
    Dataset updated
    Dec 5, 2023
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Multilingual NER Dataset

    Multilingual NER Dataset for Named Entity Recognition

    By Babelscape (From Huggingface) [source]

    About this dataset

    The Babelscape/wikineural NER Dataset is a comprehensive and diverse collection of multilingual text data specifically designed for the task of Named Entity Recognition (NER). It offers an extensive range of labeled sentences in nine different languages: French, German, Portuguese, Spanish, Polish, Dutch, Russian, English, and Italian.

    Each sentence in the dataset contains tokens (words or characters) that have been labeled with named entity recognition tags. These tags provide valuable information about the type of named entity each token represents. The dataset also includes a language column to indicate the language in which each sentence is written.

    This dataset serves as an invaluable resource for developing and evaluating NER models across multiple languages. It encompasses various domains and contexts to ensure diversity and representativeness. Researchers and practitioners can utilize this dataset to train and test their NER models in real-world scenarios.

    By using this dataset for NER tasks, users can enhance their understanding of how named entities are recognized across different languages. Furthermore, it enables benchmarking performance comparisons between various NER models developed for specific languages or trained on multiple languages simultaneously.

    Whether you are an experienced researcher or a beginner exploring multilingual NER tasks, the Babelscape/wikineural NER Dataset provides a highly informative and versatile resource that can contribute to advancements in natural language processing and information extraction applications on a global scale.

    How to use the dataset

    • Understand the Data Structure:

      • The dataset consists of labeled sentences in nine different languages: French (fr), German (de), Portuguese (pt), Spanish (es), Polish (pl), Dutch (nl), Russian (ru), English (en), and Italian (it).
      • Each sentence is represented by three columns: tokens, ner_tags, and lang.
      • The tokens column contains the individual words or characters in each labeled sentence.
      • The ner_tags column provides named entity recognition tags for each token, indicating their entity types.
      • The lang column specifies the language of each sentence.
    • Explore Different Languages:

      • Since this dataset covers multiple languages, you can choose to focus on a specific language or perform cross-lingual analysis.
      • Analyzing multiple languages can help uncover patterns and differences in named entities across various linguistic contexts.
    • Preprocessing and Cleaning:

      • Before training your NER models or applying any NLP techniques to this dataset, it's essential to preprocess and clean the data.
      • Consider removing any unnecessary punctuation marks or special characters unless they carry significant meaning in certain languages.
    • Training Named Entity Recognition Models:
      4a. Data Splitting: divide the dataset into training, validation, and testing sets based on your requirements, using appropriate ratios.
      4b. Feature Extraction: prepare input features from tokenized text data, such as word embeddings or character-level representations, depending on your model choice.
      4c. Model Training: utilize state-of-the-art NER models (e.g., LSTM-CRF or Transformer-based models) to train on the labeled sentences and ner_tags columns.
      4d. Evaluation: evaluate your trained model's performance using the provided validation dataset or test datasets specific to each language.

    • Applying Pretrained Models:

      • Instead of training a model from scratch, you can leverage existing pretrained NER models like BERT, GPT-2, or SpaCy's named entity recognition capabilities.
      • Fine-tune these pre-trained models on your specific NER task using the labeled
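Step 4a (data splitting) can be sketched with the standard library alone; the 80/10/10 ratios and the fixed seed below are arbitrary illustrative choices, and the string sentences stand in for real (tokens, ner_tags, lang) rows:

```python
import random

# Stand-ins for (tokens, ner_tags, lang) rows from the dataset.
sentences = [f"sentence-{i}" for i in range(100)]

rng = random.Random(42)      # fixed seed so the split is reproducible
shuffled = sentences[:]
rng.shuffle(shuffled)

n = len(shuffled)
train = shuffled[: int(0.8 * n)]
val = shuffled[int(0.8 * n): int(0.9 * n)]
test = shuffled[int(0.9 * n):]

print(len(train), len(val), len(test))   # 80 10 10
```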

    Research Ideas

    • Training NER models: This dataset can be used to train NER models in multiple languages. By providing labeled sentences and their corresponding named entity recognition tags, the dataset can help train models to accurately identify and classify named entities in different languages.
    • Evaluating NER performance: The dataset can be used as a benchmark to evaluate the performance of pre-trained or custom-built NER models. By using the labeled sentences as test data, developers and researchers can measure the accuracy, precision, recall, and F1-score of their models across multiple languages.
    • Cross-lingual analysis: With labeled sentences available in nine different languages, researchers can perform cross-lingual analysis...
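Training and evaluation on this data both start from the tokens/ner_tags pairing described above, so a helper that collapses BIO-style tags into entity spans is a common first step. The sketch below assumes string tags; in the released dataset, ner_tags may be integer ids that first need mapping to label strings (an assumption to verify against the dataset's feature definitions):

```python
def bio_to_spans(tokens, tags):
    """Collapse BIO tags (e.g. B-PER, I-PER, O) into (entity_text, label) spans."""
    spans, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)
        else:
            if current:
                spans.append((" ".join(current), label))
            current, label = [], None
    if current:
        spans.append((" ".join(current), label))
    return spans

tokens = ["Angela", "Merkel", "visited", "Paris", "."]
tags = ["B-PER", "I-PER", "O", "B-LOC", "O"]
print(bio_to_spans(tokens, tags))  # [('Angela Merkel', 'PER'), ('Paris', 'LOC')]
```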
  7. GeoEDdA: A Gold Standard Dataset for Named Entity Recognition and Span Categorization Annotations of Diderot & d'Alembert's Encyclopédie

    • data.niaid.nih.gov
    Updated Mar 20, 2024
    Cite
    Moncla, Ludovic; Vigier, Denis; McDonough, Katherine (2024). GeoEDdA: A Gold Standard Dataset for Named Entity Recognition and Span Categorization Annotations of Diderot & d'Alembert's Encyclopédie [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10530177
    Explore at:
    Dataset updated
    Mar 20, 2024
    Dataset provided by
    Lancaster University
    Laboratoire d'Informatique en Images et Systèmes d'Information
    Interactions, Corpus, Apprentissages, Représentations
    Authors
    Moncla, Ludovic; Vigier, Denis; McDonough, Katherine
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This repository contains a gold standard dataset for named entity recognition and span categorization annotations from Diderot & d’Alembert’s Encyclopédie entries.

    The dataset is available in the following formats:

    JSONL format provided by Prodigy

    binary spaCy format (ready to use with the spaCy train pipeline)

    The Gold Standard dataset is composed of 2,200 paragraphs from 2,001 randomly selected Encyclopédie entries. All paragraphs are written in 18th-century French.

    The spans/entities were labelled by the project team, aided by pre-labelling with early machine learning models to speed up the labelling process. A train/val/test split was used. The validation and test sets are composed of 200 paragraphs each: 100 classified under 'Géographie' and 100 from another knowledge domain. The datasets have the following breakdown of tokens and spans/entities.

    Tagset

    NC-Spatial: a common noun that identifies a spatial entity (nominal spatial entity) including natural features, e.g. ville, la rivière, royaume.

    NP-Spatial: a proper noun identifying the name of a place (spatial named entities), e.g. France, Paris, la Chine.

    ENE-Spatial: nested spatial entity, e.g. ville de France, royaume de Naples, la mer Baltique.

    Relation: spatial relation, e.g. dans, sur, à 10 lieues de.

    Latlong: geographic coordinates, e.g. Long. 19. 49. lat. 43. 55. 44.

    NC-Person: a common noun that identifies a person (nominal person entity), e.g. roi, l'empereur, les auteurs.

    NP-Person: a proper noun identifying the name of a person (person named entities), e.g. Louis XIV, Pline.

    ENE-Person: nested people entity, e.g. le czar Pierre, roi de Macédoine.

    NP-Misc: a proper noun identifying entities not classified as spatial or person, e.g. l'Eglise, 1702, Pélasgique.

    ENE-Misc: nested named entity not classified as spatial or person, e.g. l'ordre de S. Jacques, la déclaration du 21 Mars 1671.

    Head: entry name

    Domain-Mark: words indicating the knowledge domain (usually after the head and between parenthesis), e.g. Géographie, Geog., en Anatomie.

    HuggingFace

    The GeoEDdA dataset is available on the HuggingFace Hub: https://huggingface.co/datasets/GEODE/GeoEDdA

    spaCy Custom Spancat trained on Diderot & d’Alembert’s Encyclopédie entries

    This dataset was used to train and evaluate a custom spancat model for French using spaCy. The model is available on HuggingFace's model hub: https://huggingface.co/GEODE/fr_spacy_custom_spancat_edda.
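As a rough sketch of consuming the Prodigy JSONL format mentioned above, each line is one JSON record holding the annotated text and its spans. The field names and both example records below are assumptions for illustration, not actual GeoEDdA data, so verify them against the released files:

```python
import io
import json

# Hypothetical records in a Prodigy-style span layout ("text" plus a
# "spans" list of start/end/label); verify against the real files.
SAMPLE = io.StringIO(
    '{"text": "ALBI, ville de France.", "spans": [{"start": 6, "end": 21, "label": "ENE-Spatial"}]}\n'
    '{"text": "Louis XIV, roi de France.", "spans": [{"start": 0, "end": 9, "label": "NP-Person"}]}\n'
)

for line in SAMPLE:
    record = json.loads(line)
    for span in record["spans"]:
        # Character offsets index directly into the raw text.
        surface = record["text"][span["start"]:span["end"]]
        print(surface, "->", span["label"])
```

The span labels used here (ENE-Spatial, NP-Person) follow the tagset listed above.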

    Acknowledgement

    The authors are grateful to the ASLAN project (ANR-10-LABX-0081) of the Université de Lyon for its financial support within the French program "Investments for the Future" operated by the National Research Agency (ANR). Data courtesy of the ARTFL Encyclopédie Project, University of Chicago.

  8. GLiNER Github Repo

    • kaggle.com
    Updated Oct 26, 2025
    Cite
    Darien Schettler (2025). GLiNER Github Repo [Dataset]. https://www.kaggle.com/dschettler8845/gliner-github-repo
    Explore at:
    Available download formats: zip (545226 bytes)
    Dataset updated
    Oct 26, 2025
    Authors
    Darien Schettler
    Description

    GLiNER : Generalist and Lightweight model for Named Entity Recognition

    GLiNER is a Named Entity Recognition (NER) model capable of identifying any entity type using a bidirectional transformer encoder (BERT-like). It provides a practical alternative to traditional NER models, which are limited to predefined entities, and to Large Language Models (LLMs), which, despite their flexibility, are costly and too large for resource-constrained scenarios.


    Models Status

    📢 Updates

    • 📝 Finetuning notebook is available: examples/finetune.ipynb
    • 🗂 Training dataset preprocessing scripts are now available in the data/ directory, covering both Pile-NER 📚 and NuNER 📘 datasets.

    Available Models on Hugging Face

    To Release

    • [ ] ⏳ GLiNER-Multiv2
    • [ ] ⏳ GLiNER-Sup (trained on mixture of NER datasets)

    Area of improvements / research

    • [ ] Allow longer context (e.g. train with long-context transformers such as Longformer, LED, etc.)
    • [ ] Use a bi-encoder (entity encoder and span encoder), allowing entity embeddings to be precomputed
    • [ ] Add a filtering mechanism to reduce the number of spans before final classification, to save memory and computation when the number of entity types is large
    • [ ] Improve understanding of more detailed prompts/instructions, e.g. "Find the first name of the person in the text"
    • [ ] Better loss function: for instance, use Focal Loss (see this paper) instead of BCE to handle class imbalance, as some entity types are more frequent than others
    • [ ] Improve multilingual capabilities: train on more languages, and use multilingual training data
    • [ ] Decoding: allow a span to have multiple labels, e.g. "Cristiano Ronaldo" is both a "person" and a "football player"
    • [ ] Dynamic thresholding (in model.predict_entities(text, labels, threshold=0.5)): allow the model to predict more or fewer entities depending on the context. Currently, the model tends to predict fewer entities when the entity type or the domain is not well represented in the training data.
    • [ ] Train with EMAs (Exponential Moving Averages) or merge multiple checkpoints to improve model robustness (see this paper)
    • [ ] Extend the model to relation extraction, which needs a dataset with relation annotations; see our preliminary work, ATG.

    Installation

    To use this model, you must install the GLiNER Python library: !pip install gliner

    Usage

    Once you've downloaded the GLiNER library, you can import the GLiNER class. You can then load this model using GLiNER.from_pretrained and predict entities with predict_entities.

    from gliner import GLiNER

    model = GLiNER.from_pretrained("urchade/gliner_base")

    text = """
    Cristiano Ronaldo dos Santos Aveiro (Portuguese pronunciation: [kɾiʃˈtjɐnu ʁɔˈnaldu]; born 5 February 1985) is a Portuguese professional footballer who plays as a forward for and captains both Saudi Pro League club Al Nassr and the Portugal national team. Widely regarded as one of the greatest players of all time, Ronaldo has won five Ballon d'Or awards,[note 3] a record three UEFA Men's Player of the Year Awards, and four European Golden Shoes, the most by a European player. He has won 33 trophies in his career, including seven league titles, five UEFA Champions Leagues, the UEFA European Championship and the UEFA Nations League. Ronaldo holds the records for most appearances (183), goals (140) and assists (42) in the Champions League, goals in the European Championship (14), international goals (128) and international appearances (205). He is one of the few players to have made over 1,200 professional career appearances, the most by an outfield player, and has scored over 850 offici...
    """

    # Labels can be any strings describing the entity types of interest.
    labels = ["person", "award", "date", "competitions", "teams"]

    entities = model.predict_entities(text, labels)

    for entity in entities:
        print(entity["text"], "=>", entity["label"])

  9. HiNER-original

    • huggingface.co
    Updated May 2, 2022
    Cite
    Computation for Indian Language Technology (2022). HiNER-original [Dataset]. https://huggingface.co/datasets/cfilt/HiNER-original
    Explore at:
    Dataset updated
    May 2, 2022
    Dataset authored and provided by
    Computation for Indian Language Technology
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This is the dataset repository for the HiNER dataset, accepted for publication at LREC 2022. The dataset can help build sequence labelling models for the task of Named Entity Recognition for the Hindi language.

  10. Multilingual named entity recognition for medieval charters. Datasets and models

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 16, 2023
    Cite
    Torres Aguilar, Sergio (2023). Multilingual named entity recognition for medieval charters. Datasets and models [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6463698
    Explore at:
    Dataset updated
    Jan 16, 2023
    Dataset provided by
    École nationale des chartes
    Authors
    Torres Aguilar, Sergio
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Annotated dataset for training named entity recognition models for medieval charters in Latin, French and Spanish.

    The original raw texts for all charters were collected from four charter collections.

    We include (i) the annotated training datasets, (ii) the contextual and static embeddings trained on medieval multilingual texts, and (iii) the named entity recognition models trained using two architectures: Bi-LSTM-CRF with stacked embeddings, and fine-tuning of BERT-based models (mBERT and RoBERTa).

    Codes, datasets and notebooks used to train models can be consulted in our gitlab repository: https://gitlab.com/magistermilitum/ner_medieval_multilingual

    Our best RoBERTa model is also available in the HuggingFace library: https://huggingface.co/magistermilitum/roberta-multilingual-medieval-ner

  11. PyVulDet-NER

    • data.mendeley.com
    • huggingface.co
    Updated Sep 19, 2023
    Cite
    Melanie Ehrenberg (2023). PyVulDet-NER [Dataset]. http://doi.org/10.17632/h22kxj6ydt.1
    Explore at:
    Dataset updated
    Sep 19, 2023
    Authors
    Melanie Ehrenberg
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The data in this repository is associated with a manuscript entitled "Python Source Code Vulnerability Detection with Named Entity Recognition". The paper has been submitted to the "DevSecOps: Advances for Secure Software Development" special issue of the "Computers & Security" journal. This research is part of an in-progress dissertation at George Washington University. In addition to the data shown in this repository, the following NER models were created with this data to identify 6 vulnerability types in Python source code:

    • https://huggingface.co/mmeberg/RoRo_PyVulDet_NER
    • https://huggingface.co/mmeberg/RoCo_PyVulDet_NER
    • https://huggingface.co/mmeberg/DiDi_PyVulDet_NER
    • https://huggingface.co/mmeberg/CoRo_PyVulDet_NER
    • https://huggingface.co/mmeberg/CoCo_PyVulDet_NER

  12. NLUCat

    • zenodo.org
    • huggingface.co
    Updated Mar 5, 2024
    Cite
    Zenodo (2024). NLUCat [Dataset]. http://doi.org/10.5281/zenodo.10362026
    Explore at:
    Dataset updated
    Mar 5, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    NLUCat is a dataset for NLU in Catalan. It consists of nearly 12,000 instructions annotated with the most relevant intents and spans. Each instruction is also accompanied by the instructions the annotator received when writing it.

    The intents taken into account are the habitual ones of a virtual home assistant (activity calendar, IOT, list management, leisure, etc.), but specific ones have also been added to take into account social and healthcare needs for vulnerable people (information on administrative procedures, menu and medication reminders, etc.).

    The spans have been annotated with a tag describing the type of information they contain. They are fine-grained, but can be easily grouped to use them in robust systems.

    The examples are not only written in Catalan; they also take into account the geographical and cultural reality of the speakers of this language (geographic points, cultural references, etc.).

    This dataset can be used to train models for intent classification, span identification and example generation.

    This is the complete version of the dataset. A version prepared to train and evaluate intent classifiers has been published in HuggingFace.

    This work is licensed under a CC0 International License.

    In this repository you'll find the following items:

    • NLUCat_annotation_guidelines.docx: the guidelines provided to the annotation team
    • NLUCat_dataset.json: the complete NLUCat dataset
    • NLUCat_stats.tsv: statistics about the NLUCat dataset
    • dataset: folder with the dataset as published in HuggingFace (https://huggingface.co/datasets/projecte-aina/NLUCat), split and prepared for training and evaluating intent classifiers
    • reports: folder with the reports given as feedback to the annotators during the annotation process

    This work was funded by the Departament de la Vicepresidència i de Polítiques Digitals i Territori de la Generalitat de Catalunya within the framework of Projecte AINA.

  13. kpwr-ner

    • huggingface.co
    Updated Apr 1, 2022
    Cite
    CLARIN-PL (2022). kpwr-ner [Dataset]. https://huggingface.co/datasets/clarin-pl/kpwr-ner
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Apr 1, 2022
    Dataset authored and provided by
    CLARIN-PL
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    KPWR-NER tagging dataset.

  14. Named Entity Recognition

    • sdiinnovation-geoplatform.hub.arcgis.com
    Updated May 28, 2022
    Cite
    Esri (2022). Named Entity Recognition [Dataset]. https://sdiinnovation-geoplatform.hub.arcgis.com/content/97369a6f1200428ba060410d13dbb078
    Explore at:
    Dataset updated
    May 28, 2022
    Dataset authored and provided by
    Esri (http://esri.com/)
    Description

    This deep learning model is used to identify or categorize entities in unstructured text. An entity may refer to a word or a sequence of words such as the name of “Organizations,” “Persons,” “Country,” or “Date” and “Time” in the text. This model detects entities from the given text and classifies them into pre-determined categories. Named entity recognition (NER) is useful when a high-level overview of a large quantity of text is required. NER can surface crucial and important information in text by extracting the main entities from it. The extracted entities are categorized into pre-determined classes and can help in drawing meaningful decisions and conclusions.

    Using the model

    Follow the guide to use the model. Before using this model, ensure that the supported deep learning libraries are installed. For more details, check the Deep Learning Libraries Installer for ArcGIS.

    Fine-tuning the model

    This model cannot be fine-tuned using ArcGIS tools.

    Input

    Text files on which named entity extraction will be performed.

    Output

    Tokens classified into the following pre-defined entity classes:

    • PERSON – People, including fictional
    • NORP – Nationalities or religious or political groups
    • FACILITY – Buildings, airports, highways, bridges, etc.
    • ORGANIZATION – Companies, agencies, institutions, etc.
    • GPE – Countries, cities, states
    • LOCATION – Non-GPE locations, mountain ranges, bodies of water
    • PRODUCT – Vehicles, weapons, foods, etc. (not services)
    • EVENT – Named hurricanes, battles, wars, sports events, etc.
    • WORK OF ART – Titles of books, songs, etc.
    • LAW – Named documents made into laws
    • LANGUAGE – Any named language
    • DATE – Absolute or relative dates or periods
    • TIME – Times smaller than a day
    • PERCENT – Percentage (including “%”)
    • MONEY – Monetary values, including unit
    • QUANTITY – Measurements, as of weight or distance
    • ORDINAL – “first,” “second”
    • CARDINAL – Numerals that do not fall under another type

    Model architecture

    This model uses the XLM-RoBERTa architecture implemented in Hugging Face transformers using the TNER library.

    Accuracy metrics

    This model has an accuracy of 91.6 percent.

    Training data

    The model has been trained on the OntoNotes Release 5.0 dataset.

    Sample results

    Here are a few results from the model.

    Citations

    Weischedel, Ralph, et al. OntoNotes Release 5.0 LDC2013T19. Web Download. Philadelphia: Linguistic Data Consortium, 2013.

    Asahi Ushio and Jose Camacho-Collados. 2021. TNER: An all-round Python library for transformer-based named entity recognition. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, pages 53–62, Online. Association for Computational Linguistics.

  15. Few-NERD

    • kaggle.com
    • opendatalab.com
    • +1 more
    Updated Jun 3, 2021
    Cite
    Nicholas Broad (2021). Few-NERD [Dataset]. https://www.kaggle.com/nbroad/fewnerd
    Explore at:
    Available download formats: zip (49275322 bytes)
    Dataset updated
    Jun 3, 2021
    Authors
    Nicholas Broad
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Few-NERD is a large-scale, fine-grained, manually annotated named entity recognition dataset, which contains 8 coarse-grained types, 66 fine-grained types, 188,200 sentences, 491,711 entities and 4,601,223 tokens. Three benchmark tasks are built: one supervised (Few-NERD (SUP)) and two few-shot (Few-NERD (INTRA) and Few-NERD (INTER)). Few-NERD is collected by researchers from Tsinghua University and DAMO Academy, Alibaba Group.

    For more details about Few-NERD, please refer to the ACL-IJCNLP 2021 paper: https://arxiv.org/abs/2105.07464

    The official Few-NERD website is here: https://ningding97.github.io/fewnerd/

  16. german-ler

    • huggingface.co
    • opendatalab.com
    Updated Nov 2, 2024
    Cite
    Elena Leitner (2024). german-ler [Dataset]. http://doi.org/10.57967/hf/0046
    Explore at:
    Dataset updated
    Nov 2, 2024
    Authors
    Elena Leitner
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Card for "German LER"

      Dataset Summary
    

    A dataset of legal documents from German federal court decisions for Named Entity Recognition. The dataset is human-annotated with 19 fine-grained entity classes. It consists of approx. 67,000 sentences and contains 54,000 annotated entities. NER tags use the BIO tagging scheme. The dataset includes two different versions of annotations, one with a set of 19 fine-grained semantic classes (ner_tags) and another one… See the full description on the dataset page: https://huggingface.co/datasets/elenanereiss/german-ler.

  17. HAREM Portuguese NER Corpus

    • kaggle.com
    Updated Dec 2, 2023
    Cite
    The Devastator (2023). HAREM Portuguese NER Corpus [Dataset]. https://www.kaggle.com/thedevastator/harem-portuguese-ner-corpus
    Explore at:
    Available download formats: zip (258157 bytes)
    Dataset updated
    Dec 2, 2023
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    HAREM Portuguese NER Corpus

    Portuguese NER Corpus with 10 Classes

    By harem (From Huggingface) [source]

    About this dataset

    The dataset is available in two versions: a complete version with 10 different named entity classes, including Person, Organization, Location, Value, Date, Title, Thing, Event, Abstraction, and Other; and a selective version with only 5 classes (Person, Organization, Location, Value, and Date). The selective version focuses on the most commonly recognized named entity types.

    It's worth noting that the original HAREM dataset had two levels of NER detail: Category and Sub-type. However, the processed version of the corpus presented in this Kaggle dataset only includes information up to the Category level.

    Each entry in this dataset consists of tokenized words from the original text along with their corresponding NER tags assigned through annotation. The tokens column contains the individual words or tokens extracted from the text; a duplicate of this column is provided for consistency.

    Furthermore, the ner_tags column assigns each token a class label indicating its named entity class, such as Person or Organization. An identical duplicate of ner_tags is likewise included to keep datasets consistent where both columns co-occur.

    This particular Kaggle dataset also contains three separate CSV files: train.csv for training data; validation.csv, a subset used for validating NER model performance on Portuguese texts; and test.csv, another subset of the HAREM corpus with tokenized words alongside their respective NER tags. The availability of separate files enables users to efficiently train, test, and validate NER models on Portuguese texts using reliable sources.

    How to use the dataset


    • Dataset Files: a) train.csv - Contains the training data with tokens (individual words or tokens) and their corresponding named entity recognition (NER) tags. b) validation.csv - Provides a subset of the corpus for validating model performance in identifying named entities. c) test.csv - Contains tokenized words from the corpus along with their respective NER tags.

    • Named Entity Classes: The dataset includes 10 different named entity classes: Person, Organization, Location, Value, Date, Title, Thing, Event, Abstraction, and Other.

    • Understanding the Columns: a) tokens - This column comprises the individual tokens or words extracted from the text. b) ner_tags - This column lists the named entity recognition tag assigned to each token, indicating its class.

    • Training and Evaluation: To use this dataset for training a NER model, you can utilize the train.csv file. The tokens column will provide you with the words or tokens, while the ner_tags column will guide you in labeling the named entities within your training data.

    For evaluating your model's performance, the validation.csv file can be used. Similar to the train.csv file, it contains tokenized words and their corresponding NER tags.

    • Applying Pretrained Models: You can also use this dataset to fine-tune or evaluate pretrained NER models in Portuguese. By applying transfer learning techniques to this corpus, you may improve their performance on named entity recognition tasks specific to Portuguese.
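
    As a sketch of the loading step described above: the column names tokens and ner_tags come from the dataset description, but the sample row and its tag spellings are invented for illustration, so inspect the real train.csv before relying on them.

    ```python
    import ast
    import csv
    import io

    # Hypothetical excerpt mimicking the described train.csv layout:
    # each row stores one sentence's tokens and ner_tags as list literals.
    sample = io.StringIO(
        'tokens,ner_tags\n'
        '"[\'Lisboa\', \'é\', \'linda\']","[\'B-Location\', \'O\', \'O\']"\n'
    )

    for row in csv.DictReader(sample):
        tokens = ast.literal_eval(row["tokens"])   # list of words
        tags = ast.literal_eval(row["ner_tags"])   # parallel list of tags
        labeled = list(zip(tokens, tags))
        print(labeled)
    ```

    For the real files, replace the io.StringIO sample with open('train.csv', encoding='utf-8').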

    Research Ideas

    • Entity Recognition and Classification: This dataset can be used to train and evaluate models for named entity recognition (NER) tasks in Portuguese. The NER tags provided in the dataset can serve as labels for training models to accurately identify and classify entities such as person names, organization names, locations, dates, etc.
    • Cross-lingual Transfer Learning: The dataset can also be leveraged for cross-lingual transfer learning by training models on it and then using the trained model to extract named entities from other languages. This would enable NER in multiple languages with a single trained model, leveraging knowledge gained from this rich resource of labeled data in Portuguese.

    Acknowledgements

    If you use this dataset in your research, please credit the original authors (Data Source).

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, mod...

  18. h

    AnythingNER

    • huggingface.co
    Updated Oct 6, 2024
    Cite
    CascadeNER (2024). AnythingNER [Dataset]. https://huggingface.co/datasets/CascadeNER/AnythingNER
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Oct 6, 2024
    Authors
    CascadeNER
    Description

    DynamicNER: A Dynamic, Multilingual, and Fine-Grained Dataset for LLM-based Named Entity Recognition

    This repository is supplement material for the paper: DynamicNER: A Dynamic, Multilingual, and Fine-Grained Dataset for LLM-based Named Entity Recognition

      💓Update!
    

    DynamicNER is publicly released now! Please download it from Hugging Face for training and evaluation!

    We have also added more existing datasets in GEIC format, along with the format for fine-tuning and inference based on SWIFT. You can… See the full description on the dataset page: https://huggingface.co/datasets/CascadeNER/AnythingNER.

  19. h

    PII-NER

    • huggingface.co
    Updated Jul 20, 2024
    Cite
    Joseph G Flowers (2024). PII-NER [Dataset]. https://huggingface.co/datasets/Josephgflowers/PII-NER
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jul 20, 2024
    Authors
    Joseph G Flowers
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card for NER PII Extraction Dataset Dataset Summary This dataset is designed for training and evaluating Named Entity Recognition (NER) models focused on extracting Personally Identifiable Information (PII) from text. It includes a variety of entities such as names, addresses, phone numbers, email addresses, and identification numbers. The dataset is suitable for tasks that involve PII detection, compliance checks, and data anonymization. Supported Tasks and Leaderboards Named Entity… See the full description on the dataset page: https://huggingface.co/datasets/Josephgflowers/PII-NER.
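
    Since the summary names data anonymization as a target task, here is a minimal sketch of masking PII once character-offset entity spans have been predicted; the offsets and label names are invented for illustration and are not the dataset's actual annotation format:

    ```python
    def mask_pii(text, spans):
        """Replace each (start, end, label) character span with a [LABEL] tag.

        Spans are applied right-to-left so earlier offsets remain valid.
        """
        for start, end, label in sorted(spans, reverse=True):
            text = text[:start] + "[" + label + "]" + text[end:]
        return text

    # Hypothetical NER output: character offsets plus entity labels.
    text = "Contact Jane Doe at jane@example.com"
    spans = [(8, 16, "NAME"), (20, 36, "EMAIL")]
    print(mask_pii(text, spans))  # Contact [NAME] at [EMAIL]
    ```

    Applying the replacements right-to-left avoids recomputing offsets after each substitution changes the string length.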

  20. The Chilean Waiting List Corpus

    • zenodo.org
    txt, zip
    Updated Jan 24, 2023
    Cite
    Pablo Báez; Fabián Villena; Matías Rojas; Felipe Bravo-Marquez; Jocelyn Dunstan (2023). The Chilean Waiting List Corpus [Dataset]. http://doi.org/10.5281/zenodo.7555181
    Explore at:
    zip, txt. Available download formats
    Dataset updated
    Jan 24, 2023
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Pablo Báez; Fabián Villena; Matías Rojas; Felipe Bravo-Marquez; Jocelyn Dunstan
    Area covered
    Chile
    Description

    Here we describe a new clinical corpus rich in nested entities and a series of neural models to identify them. The corpus comprises de-identified referrals from the waiting list in Chilean public hospitals. A subset of 9,000 referrals (medical and dental) was manually annotated with ten types of entities, six attributes, and pairs of relations with clinical relevance. A trained medical doctor or dentist annotated these referrals and then, together with three other researchers, consolidated each of the annotations. In the annotated corpus, more than 48% of entities are embedded in another entity or contain one themselves. We use this corpus to build Named Entity Recognition (NER) models. The best results were achieved using Multiple Single-entity architectures with clinical word embeddings stacked with character and Flair contextual embeddings (refer to this paper: https://aclanthology.org/2022.coling-1.184/). The entity with the best performance is abbreviation, and the hardest to recognize is finding. NER models applied to this corpus can leverage statistics of diseases and pending procedures. This work constitutes the first annotated corpus using clinical narratives from Chile and one of the few in Spanish. The annotated corpus, clinical word embeddings, annotation guidelines, and neural models are freely released to the community. This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/4.0/.
    We are releasing the dataset in 3 formats:

    • cwlc.zip: Contains the raw text files for each document along with its annotation file in Standoff format
    • cwlc_conll-format: The corpus in CoNLL format, for training NER models.

    In addition, the dataset has been released on Hugging Face (https://huggingface.co/plncmm) to facilitate experiments with transformer-based architectures.
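
    The CoNLL release mentioned above stores one token per line with its tag, and separates sentences with blank lines. A minimal reader, assuming that layout; the sample text and the B-Finding tag spelling are invented, although "finding" is one of the entity types described:

    ```python
    def read_conll(text):
        """Parse 'token TAG' lines; blank lines separate sentences."""
        sentences, current = [], []
        for line in text.splitlines():
            line = line.strip()
            if not line:
                if current:
                    sentences.append(current)
                    current = []
            else:
                parts = line.split()
                current.append((parts[0], parts[-1]))  # token, last column = tag
        if current:
            sentences.append(current)
        return sentences

    # Invented two-sentence sample; the real corpus's tag names may differ.
    sample = "Paciente O\ncon O\nneumonía B-Finding\n\nControl O\n"
    print(read_conll(sample))
    ```

    Taking the last column as the tag keeps the reader working even if the files carry extra middle columns, as some CoNLL variants do.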
