100+ datasets found

h
kaggle-entity-annotated-corpus-ner-dataset
huggingface.co
Updated Jul 10, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Rafael Arias Calles (2022). kaggle-entity-annotated-corpus-ner-dataset [Dataset]. https://huggingface.co/datasets/rjac/kaggle-entity-annotated-corpus-ner-dataset
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jul 10, 2022
Authors
Rafael Arias Calles
License
https://choosealicense.com/licenses/odbl/https://choosealicense.com/licenses/odbl/
Description
Date: 2022-07-10 Files: ner_dataset.csv Source: Kaggle entity annotated corpus notes: The dataset only contains the tokens and ner tag labels. Labels are uppercase.

About Dataset

from Kaggle Datasets

Context

Annotated Corpus for Named Entity Recognition using GMB(Groningen Meaning Bank) corpus for entity classification with enhanced and popular features by Natural Language Processing applied to the data set. Tip: Use Pandas Dataframe to load dataset if using Python for… See the full description on the dataset page: https://huggingface.co/datasets/rjac/kaggle-entity-annotated-corpus-ner-dataset.
h
universal_ner_nordic_FL
huggingface.co
Updated Nov 14, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Kätriin Kukk (2024). universal_ner_nordic_FL [Dataset]. https://huggingface.co/datasets/K2triinK/universal_ner_nordic_FL
Explore at:
Dataset updated
Nov 14, 2024
Authors
Kätriin Kukk
Description
Universal Named Entity Recognition (UNER) aims to fill a gap in multilingual NLP: high quality NER datasets in many languages with a shared tagset. UNER is modeled after the Universal Dependencies project, in that it is intended to be a large community annotation effort with language-universal guidelines. Further, we use the same text corpora as Universal Dependencies.
h
Pile-NER-definition
huggingface.co
Updated Aug 13, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Universal-NER (2023). Pile-NER-definition [Dataset]. https://huggingface.co/datasets/Universal-NER/Pile-NER-definition
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Aug 13, 2023
Authors
Universal-NER
Description
Intro

Pile-NER-definition is a set of GPT-generated data for named entity recognition using the definition-based data construction prompt. It was collected by prompting gpt-3.5-turbo-0301 and augmented by negative sampling. Check our project page for more information.

License

Attribution-NonCommercial 4.0 International
Multilingual named entity recognition for medieval charters. Datasets and...
zenodo.org
data.niaid.nih.gov
zip
Updated Jan 16, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sergio Torres Aguilar; Sergio Torres Aguilar (2023). Multilingual named entity recognition for medieval charters. Datasets and models [Dataset]. http://doi.org/10.5281/zenodo.6463699
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.6463699
Dataset updated
Jan 16, 2023
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Sergio Torres Aguilar; Sergio Torres Aguilar
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Annotated dataset for training named entities recognition models for medieval charters in Latin, French and Spanish.

The original raw texts for all charters were collected from four charters collections

- HOME-ALCAR corpus : https://zenodo.org/record/5600884

- CBMA : http://www.cbma-project.eu

- Diplomata Belgica : https://www.diplomata-belgica.be

- CODEA corpus : https://corpuscodea.es/

We include (i) the annotated training datasets, (ii) the contextual and static embeddings trained on medieval multilingual texts and (iii) the named entity recognition models trained using two architectures: Bi-LSTM-CRF + stacked embeddings and fine-tuning on Bert-based models (mBert and RoBERTa)

Codes, datasets and notebooks used to train models can be consulted in our gitlab repository: https://gitlab.com/magistermilitum/ner_medieval_multilingual

Our best RoBERTa model is also available in the HuggingFace library: https://huggingface.co/magistermilitum/roberta-multilingual-medieval-ner
E
Data from: PyTorch model for Slovenian Named Entity Recognition SloNER 1.0
live.european-language-grid.eu
Updated Jan 26, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2023). PyTorch model for Slovenian Named Entity Recognition SloNER 1.0 [Dataset]. https://live.european-language-grid.eu/catalogue/tool-service/20980
Explore at:
Dataset updated
Jan 26, 2023
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
The SloNER is a model for Slovenian Named Entity Recognition. It is is a PyTorch neural network model, intended for usage with the HuggingFace transformers library (https://github.com/huggingface/transformers).

The model is based on the Slovenian RoBERTa contextual embeddings model SloBERTa 2.0 (http://hdl.handle.net/11356/1397). The model was trained on the SUK 1.0 training corpus (http://hdl.handle.net/11356/1747).The source code of the model is available on GitHub repository https://github.com/clarinsi/SloNER.
Z
GeoEDdA: A Gold Standard Dataset for Named Entity Recognition and Span...
data.niaid.nih.gov
explore.openaire.eu
+1more
Updated Mar 20, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Moncla, Ludovic (2024). GeoEDdA: A Gold Standard Dataset for Named Entity Recognition and Span Categorization Annotations of Diderot & d'Alembert's Encyclopédie [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10530177
Explore at:
Dataset updated
Mar 20, 2024
Dataset provided by
Vigier, Denis
McDonough, Katherine
Moncla, Ludovic
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
This repository contains a gold standard dataset for named entity recognition and span categorization annotations from Diderot & d’Alembert’s Encyclopédie entries.

The dataset is available in the following formats:

JSONL format provided by Prodigy

binary spaCy format (ready to use with the spaCy train pipeline)

The Gold Standard dataset is composed of 2,200 paragraphs out of 2,001 Encyclopédie's entries randomly selected. All paragraphs were written in 19th-century French.

The spans/entities were labeled by the project team along with using pre-labelling with early machine learning models to speed up the labelling process. A train/val/test split was used. Validation and test sets are composed of 200 paragraphs each: 100 classified under 'Géographie' and 100 from another knowledge domain. The datasets have the following breakdown of tokens and spans/entities.

Tagset

NC-Spatial: a common noun that identifies a spatial entity (nominal spatial entity) including natural features, e.g. ville, la rivière, royaume.

NP-Spatial: a proper noun identifying the name of a place (spatial named entities), e.g. France, Paris, la Chine.

ENE-Spatial: nested spatial entity , e.g. ville de France , royaume de Naples, la mer Baltique.

Relation: spatial relation, e.g. dans, sur, à 10 lieues de.

Latlong: geographic coordinates, e.g. Long. 19. 49. lat. 43. 55. 44.

NC-Person: a common noun that identifies a person (nominal spatial entity), e.g. roi, l'empereur, les auteurs.

NP-Person: a proper noun identifying the name of a person (person named entities), e.g. Louis XIV, Pline.

ENE-Person: nested people entity, e.g. le czar Pierre, roi de Macédoine.

NP-Misc: a proper noun identifying entities not classified as spatial or person, e.g. l'Eglise, 1702, Pélasgique

ENE-Misc: nested named entity not classified as spatial or person, e.g. l'ordre de S. Jacques, la déclaration du 21 Mars 1671.

Head: entry name

Domain-Mark: words indicating the knowledge domain (usually after the head and between parenthesis), e.g. Géographie, Geog., en Anatomie.

HuggingFace

The GeoEDdA dataset is available on the HuggingFace Hub: https://huggingface.co/datasets/GEODE/GeoEDdA

spaCy Custom Spancat trained on Diderot & d’Alembert’s Encyclopédie entries

This dataset was used to train and evaluate a custom spancat model for French using spaCy. The model is available on HuggingFace's model hub: https://huggingface.co/GEODE/fr_spacy_custom_spancat_edda.

Acknowledgement

The authors are grateful to the ASLAN project (ANR-10-LABX-0081) of the Université de Lyon, for its financial support within the French program "Investments for the Future" operated by the National Research Agency (ANR). Data courtesy the ARTFL Encyclopédie Project, University of Chicago.
NLUCat
zenodo.org
huggingface.co
+2more
zip
Updated Mar 4, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Zenodo (2024). NLUCat [Dataset]. http://doi.org/10.5281/zenodo.10721193
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.10721193
Dataset updated
Mar 4, 2024
Dataset provided by
Zenodohttp://zenodo.org/
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
NLUCat

Dataset Description

Dataset Summary

NLUCat is a dataset of NLU in Catalan. It consists of nearly 12,000 instructions annotated with the most relevant intents and spans. Each instruction is accompanied, in addition, by the instructions received by the annotator who wrote it.

The intents taken into account are the habitual ones of a virtual home assistant (activity calendar, IOT, list management, leisure, etc.), but specific ones have also been added to take into account social and healthcare needs for vulnerable people (information on administrative procedures, menu and medication reminders, etc.).

The spans have been annotated with a tag describing the type of information they contain. They are fine-grained, but can be easily grouped to use them in robust systems.

The examples are not only written in Catalan, but they also take into account the geographical and cultural reality of the speakers of this language (geographic points, cultural references, etc.)

This dataset can be used to train models for intent classification, spans identification and examples generation.

This is the complete version of the dataset. A version prepared to train and evaluate intent classifiers has been published in HuggingFace.

In this repository you'll find the following items:

NLUCat_annotation_guidelines.docx: the guidelines provided to the annotation team

NLUCat_dataset.json: the completed NLUCat dataset

NLUCat_stats.tsv: statistics about de NLUCat dataset

dataset: folder with the dataset as published in HuggingFace, splited and prepared for training and evaluating intent classifiers

reports: folder with the reports done as feedback to the annotators during the annotation process

This dataset can be used for any purpose, whether academic or commercial, under the terms of the CC BY 4.0. Give appropriate credit , provide a link to the license, and indicate if changes were made.

Supported Tasks and Leaderboards

Intent classification, spans identification and examples generation.

Languages

The dataset is in Catalan (ca-ES).

Dataset Structure

Data Instances

Three JSON files, one for each split.

Data Fields

example: `str`. Example

annotation: `dict`. Annotation of the example

intent: `str`. Intent tag

slots: `list`. List of slots

Tag:`str`. tag to the slot

Text:`str`. Text of the slot

Start_char: `int`. First character of the span

End_char: `int`. Last character of the span

Example

An example looks as follows:

{
"example": "Demana una ambulància; la meva dona està de part.",
"annotation": {
"intent": "call_emergency",
"slots": [
{
"Tag": "service",
"Text": "ambulància",
"Start_char": 11,
"End_char": 21
},
{
"Tag": "situation",
"Text": "la meva dona està de part",
"Start_char": 23,
"End_char": 48
}
]
}
},

Data Splits

NLUCat.train: 9128 examples

NLUCat.dev: 1441 examples

NLUCat.test: 1441 examples

Dataset Creation

Curation Rationale

We created this dataset to contribute to the development of language models in Catalan, a low-resource language.

When creating this dataset, we took into account not only the language but the entire socio-cultural reality of the Catalan-speaking population. Special consideration was also given to the needs of the vulnerable population.

Source Data

Initial Data Collection and Normalization

We commissioned a company to create fictitious examples for the creation of this dataset.

Who are the source language producers?

We commissioned the writing of the examples to the company m47 labs.

Annotations

Annotation process

The elaboration of this dataset has been done in three steps, taking as a model the process followed by the NLU-Evaluation-Data dataset, as explained in the paper.
* First step: translation or elaboration of the instructions given to the annotators to write the examples.
* Second step: writing the examples. This step also includes the grammatical correction and normalization of the texts.
* Third step: recording the attempts and the slots of each example. In this step, some modifications were made to the annotation guides to adjust them to the real situations.

Who are the annotators?

The drafting of the examples and their annotation was entrusted to the company m47 labs through a public tender process.

Personal and Sensitive Information

No personal or sensitive information included.

The examples used for the preparation of this dataset are fictitious and, therefore, the information shown is not real.

Considerations for Using the Data

Social Impact of Dataset

We hope that this dataset will help the development of virtual assistants in Catalan, a language that is often not taken into account, and that it will especially help to improve the quality of life of people with special needs.

Discussion of Biases

When writing the examples, the annotators were asked to take into account the socio-cultural reality (geographic points, artists and cultural references, etc.) of the Catalan-speaking population.
Likewise, they were asked to be careful to avoid examples that reinforce the stereotypes that exist in this society. For example: be careful with the gender or origin of personal names that are associated with certain activities.

Other Known Limitations

[N/A]

Additional Information

Dataset Curators

Language Technologies Unit at the Barcelona Supercomputing Center (langtech@bsc.es)

This work has been promoted and financed by the Generalitat de Catalunya through the Aina project.

Licensing Information

This dataset can be used for any purpose, whether academic or commercial, under the terms of the CC BY 4.0.
Give appropriate credit, provide a link to the license, and indicate if changes were made.

Citation Information

DOI

Contributions

The drafting of the examples and their annotation was entrusted to the company m47 labs through a public tender process.
h
Funder-NER
huggingface.co
Updated Aug 25, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
ZBW Leibniz Information Center for Economics (2023). Funder-NER [Dataset]. http://doi.org/10.57967/hf/1011
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.57967/hf/1011
Dataset updated
Aug 25, 2023
Dataset authored and provided by
ZBW Leibniz Information Center for Economics
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
Dataset Card for Dataset Named Entity Recognition of funders of scientific research

Dataset Summary

Training/test set for automatically identifying funder entities mentioned in scientific papers. This data set is generated from Open Access documents hosted at https://econstor.eu and manually curated/labeled.

Supported Tasks and Leaderboards

The dataset is for training and testing the automatic recognition of funders as they are acknowledged in scientific… See the full description on the dataset page: https://huggingface.co/datasets/ZBWatHF/Funder-NER.
Weekly supervised Multilingual Data Set to train Named Entity Recognition...
zenodo.org
Updated Apr 16, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Izidor Mlakar; Izidor Mlakar; Rigon Sallauka; Rigon Sallauka; Umut Arioz; Umut Arioz; Matej Rojc; Matej Rojc (2025). Weekly supervised Multilingual Data Set to train Named Entity Recognition for Symptom Extraction [Dataset]. http://doi.org/10.5281/zenodo.13918009
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.13918009
Dataset updated
Apr 16, 2025
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Izidor Mlakar; Izidor Mlakar; Rigon Sallauka; Rigon Sallauka; Umut Arioz; Umut Arioz; Matej Rojc; Matej Rojc
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Data Sets were generated using the Weakly Supervised NER pipeline (https://github.com/HUMADEX/Weekly-Supervised-NER-pipline) to train the symptom extraction NER models.

Supported Languages and dataset locations for the specific language:

English (base language): https://huggingface.co/HUMADEX/english_medical_ner
German: https://huggingface.co/HUMADEX/german_medical_ner
Italian: https://huggingface.co/HUMADEX/italian_medical_ner
Spanish: https://huggingface.co/HUMADEX/spanish_medical_ner
Greek: https://huggingface.co/HUMADEX/german_medical_ner
Slovenian: https://huggingface.co/HUMADEX/slovenian_medical_ner
Polish: https://huggingface.co/HUMADEX/polish_medical_ner
Portuguese: https://huggingface.co/HUMADEX/portugese_medical_ner

Dataset Building

Data Integration and Preprocessing

Data Cleaning

Annotation with Stanza's i2b2 Clinical Model

Translation into the targeted language

Word Alignment

Data Augmentation

Acknowledgement
This dataset had been created as part of joint research of HUMADEX research group (https://www.linkedin.com/company/101563689/) and has received funding by the European Union Horizon Europe Research and Innovation Program project SMILE (grant number 101080923) and Marie Skłodowska-Curie Actions (MSCA) Doctoral Networks, project BosomShield ((rant number 101073222). Responsibility for the information and views expressed herein lies entirely with the authors.

Authors:
dr. Izidor Mlakar, Rigona Sallauka, dr. Umut Arioz, dr. Matej Rojc

Please cite as:

Article title: Weakly-Supervised Multilingual Medical NER For Symptom Extraction For Low-Resource Languages
Doi: 10.20944/preprints202504.1356.v1
Website: https://www.preprints.org/manuscript/202504.1356/v1" href="https://www.preprints.org/manuscript/202504.1356/v1">https://www.preprints.org/manuscript/202504.1356/v1
Named Entity Recognition
hub.arcgis.com
sdiinnovation-geoplatform.hub.arcgis.com
Updated May 28, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Esri (2022). Named Entity Recognition [Dataset]. https://hub.arcgis.com/content/97369a6f1200428ba060410d13dbb078
Explore at:
Dataset updated
May 28, 2022
Dataset authored and provided by
Esrihttp://esri.com/
Description
This deep learning model is used to identify or categorize entities in unstructured text. An entity may refer to a word or a sequence of words such as the name of “Organizations,” “Persons,” “Country,” or “Date” and “Time” in the text. This model detects entities from the given text and classifies them into pre-determined categories.

Named entity recognition (NER) is useful when a high-level overview of a large quantity of text is required. NER can let you know crucial and important information in text by extracting the main entities from it. The extracted entities are categorized into pre-determined classes and can help in drawing meaningful decisions and conclusions.

Using the model

Follow the guide to use the model. Before using this model, ensure that the supported deep learning libraries are installed. For more details, check the Deep Learning Libraries Installer for ArcGIS.

Fine-tuning the model

This model cannot be fine-tuned using ArcGIS tools.

Input

Text files on which named entity extraction will be performed.

Output

Classified tokens into the following pre-defined entity classes:

PERSON – People, including fictional NORP – Nationalities or religious or political groups FACILITY – Buildings, airports, highways, bridges, etc. ORGANIZATION – Companies, agencies, institutions, etc. GPE – Countries, cities, states LOCATION – Non-GPE locations, mountain ranges, bodies of water PRODUCT – Vehicles, weapons, foods, etc. (Not services) EVENT – Named hurricanes, battles, wars, sports events, etc. WORK OF ART – Titles of books, songs, etc. LAW – Named documents made into laws LANGUAGE – Any named language DATE – Absolute or relative dates or periods TIME – Times smaller than a day PERCENT – Percentage (including “%”) MONEY – Monetary values, including unit QUANTITY – Measurements, as of weight or distance ORDINAL – “first,” “second” CARDINAL – Numerals that do not fall under another type

Model architecture

This model uses the XLM-RoBERTa architecture implemented in Hugging Face transformers using the TNER library.

Accuracy metrics

This model has an accuracy of 91.6 percent.

Training dataThe model has been trained on the OntoNotes Release 5.0 dataset.

Sample resultsHere are a few results from the model.

Citations

Weischedel, Ralph, et al. OntoNotes Release 5.0 LDC2013T19. Web Download. Philadelphia: Linguistic Data Consortium, 2013. Asahi Ushio and Jose Camacho-Collados. 2021. TNER: An all-round Python library for transformer based named entity recognition. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, pages 53–62, Online. Association for Computational Linguistics.
Wikipedia corpus for synthetic data made for Handwritten Text Recognition...
zenodo.org
txt, zip
Updated Jul 3, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Thomas CONSTUM; Thomas CONSTUM (2025). Wikipedia corpus for synthetic data made for Handwritten Text Recognition and Named Entity Recognition [Dataset]. http://doi.org/10.1007/s10032-024-00511-9
Explore at:
zip, txtAvailable download formats
Unique identifier
https://doi.org/10.1007/s10032-024-00511-9
Dataset updated
Jul 3, 2025
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Thomas CONSTUM; Thomas CONSTUM
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
This repository contains the corpus necessary for the synthetic data generation of the DANIEL which is available on GitHub and described in the paper DANIEL: a fast document attention network for information extraction and labelling of handwritten documents, authored by Thomas Constum, Pierrick Tranouez, and Thierry Paquet (LITIS, University of Rouen Normandie).

The paper has been accepted for publication in the International Journal on Document Analysis and Recognition (IJDAR) and is also accessible on arXiv.

The contents of the archive should be placed in the Datasets/raw directory of the DANIEL codebase.

Contents of the archive:

wiki_en: An English text corpus stored in the Hugging Face datasets library format. Each entry contains the full text of a Wikipedia article.

wiki_en_ner: An English text corpus enriched with named entity annotations following the OntoNotes v5 ontology. Named entities are encoded using special symbols. The corpus is stored in the Hugging Face datasets format, and each entry corresponds to a Wikipedia article with annotated entities.

wiki_fr: A French text corpus for synthetic data generation, also stored in the Hugging Face datasets format. Each entry contains the full text of a French Wikipedia article.

wiki_de.txt: A German text corpus in plain text format, with one sentence per line. The content originates from the Wortschatz Leipzig repository and has been normalized to match the vocabulary used in DANIEL.

Data format for corpora in Hugging Face datasets structure:

Each record in the datasets follows the dictionary structure below:

{
"id": "
h
german-ler
huggingface.co
opendatalab.com
Updated Nov 2, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Elena Leitner (2024). german-ler [Dataset]. http://doi.org/10.57967/hf/0046
Explore at:
Unique identifier
https://doi.org/10.57967/hf/0046
Dataset updated
Nov 2, 2024
Authors
Elena Leitner
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Dataset Card for "German LER"

Dataset Summary

A dataset of Legal Documents from German federal court decisions for Named Entity Recognition. The dataset is human-annotated with 19 fine-grained entity classes. The dataset consists of approx. 67,000 sentences and contains 54,000 annotated entities. NER tags use the BIO tagging scheme. The dataset includes two different versions of annotations, one with a set of 19 fine-grained semantic classes (ner_tags) and another one… See the full description on the dataset page: https://huggingface.co/datasets/elenanereiss/german-ler.
Few-NERD
opendatalab.com
huggingface.co
zip
Updated Sep 22, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Alibaba Damo Academy (2022). Few-NERD [Dataset]. https://opendatalab.com/OpenDataLab/Few-NERD
Explore at:
zip(2843586339 bytes)Available download formats
Dataset updated
Sep 22, 2022
Dataset provided by
达摩院
Tsinghua University
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
Few-NERD is a large-scale, fine-grained manually annotated named entity recognition dataset, which contains 8 coarse-grained types, 66 fine-grained types, 188,200 sentences, 491,711 entities, and 4,601,223 tokens. Three benchmark tasks are built, one is supervised (Few-NERD (SUP)) and the other two are few-shot (Few-NERD (INTRA) and Few-NERD (INTER)).
h
azerbaijani-ner-dataset
huggingface.co
Updated Jun 13, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
LocalDoc (2024). azerbaijani-ner-dataset [Dataset]. http://doi.org/10.57967/hf/2484
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.57967/hf/2484
Dataset updated
Jun 13, 2024
Dataset authored and provided by
LocalDoc
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Azerbaijani Named Entity Recognition (NER) Dataset

This repository contains the dataset for training and evaluating Named Entity Recognition (NER) models in the Azerbaijani language. The dataset includes annotated text data with various named entities.

Dataset Description

The dataset includes the following entity types:

0: O: Outside any named entity 1: PERSON: Names of individuals 2: LOCATION: Geographical locations, both man-made and natural 3: ORGANISATION: Names of… See the full description on the dataset page: https://huggingface.co/datasets/LocalDoc/azerbaijani-ner-dataset.
Data from: Multi-head CRF classifier for biomedical multi-class Named Entity...
zenodo.org
data.niaid.nih.gov
tsv, zip
Updated May 10, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Richard A A Jonker; Richard A A Jonker; Tiago Almeida; Tiago Almeida; Rui Antunes; Rui Antunes; João Rafael Almeida; João Rafael Almeida; Sérgio Matos; Sérgio Matos (2024). Multi-head CRF classifier for biomedical multi-class Named Entity Recognition on Spanish clinical notes [Dataset]. http://doi.org/10.5281/zenodo.11174163
Explore at:
tsv, zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.11174163
Dataset updated
May 10, 2024
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Richard A A Jonker; Richard A A Jonker; Tiago Almeida; Tiago Almeida; Rui Antunes; Rui Antunes; João Rafael Almeida; João Rafael Almeida; Sérgio Matos; Sérgio Matos
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
May 6, 2024
Description
This contains the merged dataset as described in the work "Multi-head CRF classifier for biomedical multi-class Named Entity Recognition on Spanish clinical notes".

This dataset consists of 4 seperate datasets:

MedProcNer

DisTEMIST

PharmaCoNER

SympTEMIST

The dataset contains two tasks:

Task 1: This task is related to multi-class Named Entity Recognition. This dataset contains 5 possible classes: SYMPTOM, PROCEDURE, DISEASE, CHEMICAL and PROTEIN.

Task 2: This task is related to Named Entity Linking, where each code corresponds to a code within the SNOMED-CT corpus. The exact corpus used can be obtained here. Further for the MedProcNER, SympTEMIST and DisTEMIST datasets, a gazetteer is provided in the original datasets.

For more information on the construction of the dataset, aswell as dataloaders, we refer you to our GitHub repository.

Further this also contains the embeddings from the SapBERT model.

Please, cite:

@article{jonker2024a, title = {Multi-head {{CRF}} classifier for biomedical multi-class named entity recognition on {{Spanish}} clinical notes}, author = {Jonker, Richard A. A. and Almeida, Tiago and Antunes, Rui and Almeida, Jo{\~a}o R. and Matos, S{\'e}rgio}, year = {2024}, journal = {Database}, publisher = {Oxford University Press} }

Jonker, R. A. A., Almeida, T., Antunes, R., Almeida, J. R., & Matos, S. (2024). Multi-head CRF classifier for biomedical multi-class named entity recognition on Spanish clinical notes. (Submitted.)

License

This work is licensed under a Creative Commons Attribution 4.0 International License.
T
wikiann
tensorflow.org
huggingface.co
Updated Jan 4, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2023). wikiann [Dataset]. https://www.tensorflow.org/datasets/catalog/wikiann
Explore at:
Dataset updated
Jan 4, 2023
Description
WikiANN (sometimes called PAN-X) is a multilingual named entity recognition dataset consisting of Wikipedia articles annotated with LOC (location), PER (person), and ORG (organisation) tags in the IOB2 format. This version corresponds to the balanced train, dev, and test splits of Rahimi et al. (2019), which supports 176 of the 282 languages from the original WikiANN corpus.

To use this dataset:

import tensorflow_datasets as tfds ds = tfds.load('wikiann', split='train') for ex in ds.take(4): print(ex)

See the guide for more informations on tensorflow_datasets.
h
arabic-iahlt-NER
huggingface.co
Updated May 19, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Israel National NLP Program (2025). arabic-iahlt-NER [Dataset]. https://huggingface.co/datasets/HebArabNlpProject/arabic-iahlt-NER
Explore at:
Dataset updated
May 19, 2025
Dataset authored and provided by
Israel National NLP Program
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
IAHLT Named Entities Dataset (Arabic Subset)

האיגוד הישראלי לטכנולוגיות שפת אנושالرابطة الإسرائيلية لتكنولوجيا اللغة البشريةThe Israeli Association of Human Language Technologieshttps://www.iahlt.org This dataset contains named entity annotations for Arabic texts from various sources, curated as part of the IAHLT multilingual NER project. The Arabic portion is provided here as a cleaned subset intended for training and evaluation in named entity recognition tasks.

Files… See the full description on the dataset page: https://huggingface.co/datasets/HebArabNlpProject/arabic-iahlt-NER.
h
ancora-ca-ner
huggingface.co
Updated Nov 1, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Projecte Aina (2021). ancora-ca-ner [Dataset]. https://huggingface.co/datasets/projecte-aina/ancora-ca-ner
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Nov 1, 2021
Dataset authored and provided by
Projecte Aina
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Dataset Card for AnCora-Ca-NER

Dataset Summary

This is a dataset for Named Entity Recognition (NER) in Catalan. It adapts AnCora corpus for Machine Learning and Language Model evaluation purposes. This dataset was developed by BSC TeMU as part of the Projecte AINA, to enrich the Catalan Language Understanding Benchmark (CLUB).

Supported Tasks and Leaderboards

Named Entities Recognition, Language Model

Languages

The dataset is in Catalan… See the full description on the dataset page: https://huggingface.co/datasets/projecte-aina/ancora-ca-ner.
T
conll2003
tensorflow.org
opendatalab.com
+1more
Updated Dec 22, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
(2022). conll2003 [Dataset]. https://www.tensorflow.org/datasets/catalog/conll2003
Explore at:
Dataset updated
Dec 22, 2022
Description
The shared task of CoNLL-2003 concerns language-independent named entity recognition and concentrates on four types of named entities: persons, locations, organizations and names of miscellaneous entities that do not belong to the previous three groups.

To use this dataset:

import tensorflow_datasets as tfds ds = tfds.load('conll2003', split='train') for ex in ds.take(4): print(ex)

See the guide for more informations on tensorflow_datasets.
h
bioleaflets-biomedical-ner
huggingface.co
Updated May 7, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ruslan Yermak (2023). bioleaflets-biomedical-ner [Dataset]. https://huggingface.co/datasets/ruslan/bioleaflets-biomedical-ner
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
May 7, 2023
Authors
Ruslan Yermak
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Dataset Card for BioLeaflets Dataset

Dataset Summary

BioLeaflets is a biomedical dataset for Data2Text generation. It is a corpus of 1,336 package leaflets of medicines authorised in Europe, which were obtained by scraping the European Medicines Agency (EMA) website. Package leaflets are included in the packaging of medicinal products and contain information to help patients use the product safely and appropriately. This dataset comprises the large majority (∼ 90%) of… See the full description on the dataset page: https://huggingface.co/datasets/ruslan/bioleaflets-biomedical-ner.

Facebook

Twitter

Click to copy link

Link copied

Cite

Rafael Arias Calles (2022). kaggle-entity-annotated-corpus-ner-dataset [Dataset]. https://huggingface.co/datasets/rjac/kaggle-entity-annotated-corpus-ner-dataset

kaggle-entity-annotated-corpus-ner-dataset

rjac/kaggle-entity-annotated-corpus-ner-dataset

Explore at:

CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.

Dataset updated

Jul 10, 2022

Authors

Rafael Arias Calles

License

https://choosealicense.com/licenses/odbl/https://choosealicense.com/licenses/odbl/

Description

Date: 2022-07-10 Files: ner_dataset.csv Source: Kaggle entity annotated corpus notes: The dataset only contains the tokens and ner tag labels. Labels are uppercase.

  About Dataset

from Kaggle Datasets

  Context

Annotated Corpus for Named Entity Recognition using GMB(Groningen Meaning Bank) corpus for entity classification with enhanced and popular features by Natural Language Processing applied to the data set. Tip: Use Pandas Dataframe to load dataset if using Python for… See the full description on the dataset page: https://huggingface.co/datasets/rjac/kaggle-entity-annotated-corpus-ner-dataset.

Clear search

Close search

Google apps

Main menu

kaggle-entity-annotated-corpus-ner-dataset

universal_ner_nordic_FL

Pile-NER-definition

Multilingual named entity recognition for medieval charters. Datasets and...

Data from: PyTorch model for Slovenian Named Entity Recognition SloNER 1.0

GeoEDdA: A Gold Standard Dataset for Named Entity Recognition and Span...

NLUCat

NLUCat

Dataset Description

Dataset Summary

Supported Tasks and Leaderboards

Languages

Dataset Structure

Data Instances

Data Fields

Data Splits

Dataset Creation

Curation Rationale

Source Data

Annotations

Personal and Sensitive Information

Considerations for Using the Data

Social Impact of Dataset

Discussion of Biases

Other Known Limitations

Additional Information

Dataset Curators

Licensing Information

Citation Information

Contributions

Funder-NER

Weekly supervised Multilingual Data Set to train Named Entity Recognition...

Named Entity Recognition

Wikipedia corpus for synthetic data made for Handwritten Text Recognition...

german-ler

Few-NERD

azerbaijani-ner-dataset

Data from: Multi-head CRF classifier for biomedical multi-class Named Entity...

wikiann

arabic-iahlt-NER

ancora-ca-ner

conll2003

bioleaflets-biomedical-ner

kaggle-entity-annotated-corpus-ner-dataset

rjac/kaggle-entity-annotated-corpus-ner-dataset