https://choosealicense.com/licenses/odbl/
Date: 2022-07-10. Files: ner_dataset.csv. Source: Kaggle entity annotated corpus. Notes: the dataset contains only the tokens and NER tag labels; labels are uppercase.
About Dataset
from Kaggle Datasets
Context
Annotated corpus for named entity recognition, built from the GMB (Groningen Meaning Bank) corpus for entity classification, with enhanced and popular natural-language-processing features applied to the dataset. Tip: use a pandas DataFrame to load the dataset if using Python for… See the full description on the dataset page: https://huggingface.co/datasets/rjac/kaggle-entity-annotated-corpus-ner-dataset.
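The loading tip above can be sketched as follows. The column layout ("Sentence #", "Word", "POS", "Tag"), the latin-1 encoding, and the partially blank sentence-id column are assumptions based on the common Kaggle ner_dataset.csv layout; verify them against your copy of the file.

```python
import io
import pandas as pd

# Sample rows in the layout commonly used by ner_dataset.csv:
# the "Sentence #" column is only filled on the first token of each sentence.
csv_text = """Sentence #,Word,POS,Tag
Sentence: 1,Thousands,NNS,O
,of,IN,O
,demonstrators,NNS,O
Sentence: 2,London,NNP,B-geo
,is,VBZ,O
"""

# For the real file, something like:
#   df = pd.read_csv("ner_dataset.csv", encoding="latin1")
df = pd.read_csv(io.StringIO(csv_text))

# Forward-fill sentence ids, then group tokens back into sentences.
df["Sentence #"] = df["Sentence #"].ffill()
sentences = df.groupby("Sentence #")["Word"].apply(list).to_dict()
print(sentences["Sentence: 1"])  # ['Thousands', 'of', 'demonstrators']
```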
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Universal Named Entity Recognition (UNER) aims to fill a gap in multilingual NLP: high quality NER datasets in many languages with a shared tagset.
UNER is modeled after the Universal Dependencies project, in that it is intended to be a large community annotation effort with language-universal guidelines. Further, we use the same text corpora as Universal Dependencies.
Intro
Pile-NER-definition is a set of GPT-generated data for named entity recognition using the definition-based data construction prompt. It was collected by prompting gpt-3.5-turbo-0301 and augmented by negative sampling. Check our project page for more information.
License
Attribution-NonCommercial 4.0 International
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Annotated dataset for training named entity recognition models for medieval charters in Latin, French and Spanish.
The original raw texts for all charters were collected from four charter collections:
- HOME-ALCAR corpus : https://zenodo.org/record/5600884
- CBMA : http://www.cbma-project.eu
- Diplomata Belgica : https://www.diplomata-belgica.be
- CODEA corpus : https://corpuscodea.es/
We include (i) the annotated training datasets, (ii) the contextual and static embeddings trained on medieval multilingual texts, and (iii) the named entity recognition models trained using two architectures: Bi-LSTM-CRF with stacked embeddings, and fine-tuned BERT-based models (mBERT and RoBERTa).
Code, datasets and notebooks used to train the models can be consulted in our GitLab repository: https://gitlab.com/magistermilitum/ner_medieval_multilingual
Our best RoBERTa model is also available on the Hugging Face Hub: https://huggingface.co/magistermilitum/roberta-multilingual-medieval-ner
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for aeroBERT-NER
Dataset Summary
This dataset contains sentences from the aerospace requirements domain. The sentences are tagged for five NER categories (SYS, VAL, ORG, DATETIME, and RES) using the BIO tagging scheme.
There are a total of 1,432 sentences. The creation of this dataset is aimed at:
(1) Making available an open-source dataset for aerospace requirements which are often proprietary
(2) Fine-tuning language models for token identification… See the full description on the dataset page: https://huggingface.co/datasets/archanatikayatray/aeroBERT-NER.
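The BIO scheme mentioned above marks the first token of an entity as B-<category> and each continuation token as I-<category>, with O for tokens outside any entity. A minimal sketch with a made-up sentence (the tokens, spans, and helper below are illustrative, not taken from the dataset):

```python
def bio_tags(tokens, spans):
    """Turn token-level spans into BIO tags.

    spans: list of (start_idx, end_idx_exclusive, label) over token indices.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # continuation tokens
    return tags

tokens = ["The", "aircraft", "shall", "operate", "below", "10000", "ft"]
# Hypothetical spans: "aircraft" as a system (SYS), "10000 ft" as a value (VAL).
spans = [(1, 2, "SYS"), (5, 7, "VAL")]
print(list(zip(tokens, bio_tags(tokens, spans))))
```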
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
SloNER is a model for Slovenian Named Entity Recognition. It is a PyTorch neural network model, intended for use with the HuggingFace transformers library (https://github.com/huggingface/transformers).
The model is based on the Slovenian RoBERTa contextual embeddings model SloBERTa 2.0 (http://hdl.handle.net/11356/1397). The model was trained on the SUK 1.0 training corpus (http://hdl.handle.net/11356/1747). The source code of the model is available in the GitHub repository https://github.com/clarinsi/SloNER.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This repository contains a gold standard dataset for named entity recognition and span categorization annotations from Diderot & d’Alembert’s Encyclopédie entries.
The dataset is available in the following formats:
- JSONL format provided by Prodigy
- binary spaCy format (ready to use with the spaCy train pipeline)
The Gold Standard dataset is composed of 2,200 paragraphs drawn from 2,001 randomly selected Encyclopédie entries. All paragraphs were written in 18th-century French.
The spans/entities were labelled by the project team, with pre-labelling by early machine learning models used to speed up the labelling process. A train/val/test split was used. The validation and test sets each contain 200 paragraphs: 100 classified under 'Géographie' and 100 from other knowledge domains. The datasets have the following breakdown of tokens and spans/entities.
Tagset
NC-Spatial: a common noun that identifies a spatial entity (nominal spatial entity) including natural features, e.g. ville, la rivière, royaume.
NP-Spatial: a proper noun identifying the name of a place (spatial named entities), e.g. France, Paris, la Chine.
ENE-Spatial: nested spatial entity, e.g. ville de France, royaume de Naples, la mer Baltique.
Relation: spatial relation, e.g. dans, sur, à 10 lieues de.
Latlong: geographic coordinates, e.g. Long. 19. 49. lat. 43. 55. 44.
NC-Person: a common noun that identifies a person (nominal person entity), e.g. roi, l'empereur, les auteurs.
NP-Person: a proper noun identifying the name of a person (person named entities), e.g. Louis XIV, Pline.
ENE-Person: nested people entity, e.g. le czar Pierre, roi de Macédoine.
NP-Misc: a proper noun identifying entities not classified as spatial or person, e.g. l'Eglise, 1702, Pélasgique.
ENE-Misc: nested named entity not classified as spatial or person, e.g. l'ordre de S. Jacques, la déclaration du 21 Mars 1671.
Head: entry name
Domain-Mark: words indicating the knowledge domain (usually after the head and between parentheses), e.g. Géographie, Geog., en Anatomie.
HuggingFace
The GeoEDdA dataset is available on the HuggingFace Hub: https://huggingface.co/datasets/GEODE/GeoEDdA
spaCy Custom Spancat trained on Diderot & d’Alembert’s Encyclopédie entries
This dataset was used to train and evaluate a custom spancat model for French using spaCy. The model is available on HuggingFace's model hub: https://huggingface.co/GEODE/fr_spacy_custom_spancat_edda.
Acknowledgement
The authors are grateful to the ASLAN project (ANR-10-LABX-0081) of the Université de Lyon for its financial support within the French program "Investments for the Future" operated by the National Research Agency (ANR). Data courtesy of the ARTFL Encyclopédie Project, University of Chicago.
The GermEval dataset is a valuable resource for natural language processing (NLP) tasks, specifically named entity recognition (NER), conducted in the German language. Here are some key details about this dataset:
- Task: token classification (specifically, named entity recognition)
- Language: German
- Size: 100K < n < 1M tokens
- Source: sampled from German Wikipedia and news corpora, comprising a collection of citations
- Annotations: created through crowdsourcing efforts
- License: cc-by-4.0
- Content: over 31,000 sentences, corresponding to more than 590,000 tokens
- Purpose: training and evaluating NER models for German text
You can find more information and explore the dataset on the Hugging Face Datasets page ¹.
(1) germeval_14 · Datasets at Hugging Face. https://huggingface.co/datasets/germeval_14
(2) GermEval-2018 Corpus (DE) - Empirical Linguistics - heiDATA. https://heidata.uni-heidelberg.de/dataset.xhtml?persistentId=doi:10.11588/data/0B5VML
(3) GermEval 2014 Named Entity Recognition Shared Task - Data and Task Setup. https://sites.google.com/site/germeval2014ner/data
(4) 6 Best German Language Datasets of 2022 | Twine Blog. https://www.twine.net/blog/best-german-language-datasets/
(5) germeval_14 | TensorFlow Datasets. https://www.tensorflow.org/datasets/community_catalog/huggingface/germeval_14
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
NLUCat is a dataset for natural language understanding (NLU) in Catalan. It consists of nearly 12,000 instructions annotated with the most relevant intents and spans. Each instruction is additionally accompanied by the instructions the annotator received when writing it.
The intents taken into account are the habitual ones of a virtual home assistant (activity calendar, IOT, list management, leisure, etc.), but specific ones have also been added to take into account social and healthcare needs for vulnerable people (information on administrative procedures, menu and medication reminders, etc.).
The spans have been annotated with a tag describing the type of information they contain. They are fine-grained, but can be easily grouped to use them in robust systems.
The examples are not only written in Catalan; they also take into account the geographical and cultural reality of the speakers of this language (geographic points, cultural references, etc.).
This dataset can be used to train models for intent classification, spans identification and examples generation.
This is the complete version of the dataset. A version prepared for training and evaluating intent classifiers has been published on Hugging Face.
In this repository you'll find the following items:
This dataset can be used for any purpose, whether academic or commercial, under the terms of CC BY 4.0: give appropriate credit, provide a link to the license, and indicate if changes were made.
Intent classification, spans identification and examples generation.
The dataset is in Catalan (ca-ES).
Three JSON files, one for each split.
Example
An example looks as follows:
{
"example": "Demana una ambulància; la meva dona està de part.",
"annotation": {
"intent": "call_emergency",
"slots": [
{
"Tag": "service",
"Text": "ambulància",
"Start_char": 11,
"End_char": 21
},
{
"Tag": "situation",
"Text": "la meva dona està de part",
"Start_char": 23,
"End_char": 48
}
]
}
}
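As the example above suggests, Start_char and End_char index into the example string, with End_char exclusive in the usual Python-slice convention (an assumption worth checking against the full dataset). A quick sketch that validates the offsets:

```python
example = {
    "example": "Demana una ambulància; la meva dona està de part.",
    "annotation": {
        "intent": "call_emergency",
        "slots": [
            {"Tag": "service", "Text": "ambulància",
             "Start_char": 11, "End_char": 21},
            {"Tag": "situation", "Text": "la meva dona està de part",
             "Start_char": 23, "End_char": 48},
        ],
    },
}

text = example["example"]
for slot in example["annotation"]["slots"]:
    # End_char is treated as exclusive, matching Python slicing.
    assert text[slot["Start_char"]:slot["End_char"]] == slot["Text"]
    print(slot["Tag"], "->", slot["Text"])
```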
We created this dataset to contribute to the development of language models in Catalan, a low-resource language.
When creating this dataset, we took into account not only the language but the entire socio-cultural reality of the Catalan-speaking population. Special consideration was also given to the needs of the vulnerable population.
Initial Data Collection and Normalization
We commissioned a company to create fictitious examples for the creation of this dataset.
Who are the source language producers?
We commissioned the writing of the examples to the company m47 labs.
Annotation process
The elaboration of this dataset has been done in three steps, taking as a model the process followed by the NLU-Evaluation-Data dataset, as explained in the paper.
* First step: translation or elaboration of the instructions given to the annotators to write the examples.
* Second step: writing the examples. This step also includes the grammatical correction and normalization of the texts.
* Third step: recording the intents and the slots of each example. In this step, some modifications were made to the annotation guidelines to adjust them to real situations.
Who are the annotators?
The drafting of the examples and their annotation was entrusted to the company m47 labs through a public tender process.
No personal or sensitive information is included.
The examples used for the preparation of this dataset are fictitious and, therefore, the information shown is not real.
We hope that this dataset will help the development of virtual assistants in Catalan, a language that is often not taken into account, and that it will especially help to improve the quality of life of people with special needs.
When writing the examples, the annotators were asked to take into account the socio-cultural reality (geographic points, artists and cultural references, etc.) of the Catalan-speaking population.
Likewise, they were asked to be careful to avoid examples that reinforce the stereotypes that exist in this society. For example: be careful with the gender or origin of personal names that are associated with certain activities.
Language Technologies Unit at the Barcelona Supercomputing Center (langtech@bsc.es)
This work has been promoted and financed by the Generalitat de Catalunya through the Aina project.
https://choosealicense.com/licenses/unknown/
Polyglot-NER: a training dataset automatically generated from Wikipedia and Freebase for the task of named entity recognition. The dataset contains the basic Wikipedia-based training data (with coreference resolution) for the 40 languages we cover. The details of the generation procedure are outlined in Section 3 of the paper (https://arxiv.org/abs/1410.3791). Each config contains the data corresponding to a different language; for example, "es" includes only Spanish examples.
This deep learning model is used to identify or categorize entities in unstructured text. An entity may refer to a word or a sequence of words such as the name of “Organizations,” “Persons,” “Country,” or “Date” and “Time” in the text. This model detects entities from the given text and classifies them into pre-determined categories.
Named entity recognition (NER) is useful when a high-level overview of a large quantity of text is required. NER can let you know crucial and important information in text by extracting the main entities from it. The extracted entities are categorized into pre-determined classes and can help in drawing meaningful decisions and conclusions.
Using the model
Follow the guide to use the model. Before using this model, ensure that the supported deep learning libraries are installed. For more details, check the Deep Learning Libraries Installer for ArcGIS.
Fine-tuning the model
This model cannot be fine-tuned using ArcGIS tools.
Input
Text files on which named entity extraction will be performed.
Output
Classified tokens into the following pre-defined entity classes:
- PERSON – People, including fictional
- NORP – Nationalities or religious or political groups
- FACILITY – Buildings, airports, highways, bridges, etc.
- ORGANIZATION – Companies, agencies, institutions, etc.
- GPE – Countries, cities, states
- LOCATION – Non-GPE locations, mountain ranges, bodies of water
- PRODUCT – Vehicles, weapons, foods, etc. (not services)
- EVENT – Named hurricanes, battles, wars, sports events, etc.
- WORK OF ART – Titles of books, songs, etc.
- LAW – Named documents made into laws
- LANGUAGE – Any named language
- DATE – Absolute or relative dates or periods
- TIME – Times smaller than a day
- PERCENT – Percentage (including "%")
- MONEY – Monetary values, including unit
- QUANTITY – Measurements, as of weight or distance
- ORDINAL – "first," "second"
- CARDINAL – Numerals that do not fall under another type
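A small sketch, in plain Python rather than the ArcGIS API, of post-filtering classified tokens down to a subset of these classes (the token/class pairs below are made up):

```python
# (token, predicted_class) pairs as a hypothetical model output.
predictions = [
    ("Paris", "GPE"),
    ("visited", "O"),
    ("Marie", "PERSON"),
    ("in", "O"),
    ("July", "DATE"),
]

# Keep only the entity classes of interest.
wanted = {"PERSON", "GPE"}
entities = [(tok, cls) for tok, cls in predictions if cls in wanted]
print(entities)  # [('Paris', 'GPE'), ('Marie', 'PERSON')]
```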
Model architecture
This model uses the XLM-RoBERTa architecture implemented in Hugging Face transformers using the TNER library.
Accuracy metrics
This model has an accuracy of 91.6 percent.
Training data
The model has been trained on the OntoNotes Release 5.0 dataset.
Sample results
Here are a few results from the model.
Citations
Weischedel, Ralph, et al. OntoNotes Release 5.0 LDC2013T19. Web Download. Philadelphia: Linguistic Data Consortium, 2013.
Asahi Ushio and Jose Camacho-Collados. 2021. T-NER: An All-Round Python Library for Transformer-based Named Entity Recognition. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, pages 53–62, Online. Association for Computational Linguistics.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This is the synthetic dataset used for training https://huggingface.co/urchade/gliner_multi_pii-v1. You can get it by browsing the files and downloading the data.json file.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Azerbaijani Named Entity Recognition (NER) Dataset
This repository contains the dataset for training and evaluating Named Entity Recognition (NER) models in the Azerbaijani language. The dataset includes annotated text data with various named entities.
Dataset Description
The dataset includes the following entity types:
- 0 (O): outside any named entity
- 1 (PERSON): names of individuals
- 2 (LOCATION): geographical locations, both man-made and natural
- 3 (ORGANISATION): names of… See the full description on the dataset page: https://huggingface.co/datasets/LocalDoc/azerbaijani-ner-dataset.
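The id-to-label mapping above can be captured as a dict for decoding model output; this sketch uses only the four ids visible in the truncated description (the full tagset on the dataset page has more classes):

```python
# Only the label ids visible in the description above; the full tagset
# on the dataset page contains additional classes.
ID2LABEL = {0: "O", 1: "PERSON", 2: "LOCATION", 3: "ORGANISATION"}
LABEL2ID = {label: i for i, label in ID2LABEL.items()}

# Decode a hypothetical sequence of predicted tag ids back to labels.
tag_ids = [1, 0, 2]
labels = [ID2LABEL[i] for i in tag_ids]
print(labels)  # ['PERSON', 'O', 'LOCATION']
```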
https://choosealicense.com/licenses/unknown/
Tags: PER (person name), LOC (location name), GPE (administrative region name), ORG (organization name)
- PER.NAM: personal names (e.g. 张三 "Zhang San"); PER.NOM: generic references or category terms for people (e.g. 穷人 "the poor")
- LOC.NAM: specific location names (e.g. 紫玉山庄 "Ziyu Villa"); LOC.NOM: generic location terms (e.g. 大峡谷 "grand canyon", 宾馆 "hotel")
- GPE.NAM: names of administrative regions (e.g. 北京 "Beijing")
- ORG.NAM: specific organization names (e.g. 通惠医院 "Tonghui Hospital"); ORG.NOM: generic or collective organization terms (e.g. 文艺公司 "arts company")
https://choosealicense.com/licenses/other/
dxiao/requirements-ner dataset hosted on Hugging Face and contributed by the HF Datasets community
(NER) ontonotes-v5-eng-v4
This dataset is a subset of the original conll2012_ontonotesv5 dataset.
Language: English. Version: v4.
Dataset examples:
- Training: 75,187
- Testing: 9,479
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for NER PII Extraction Dataset
Dataset Summary
This dataset is designed for training and evaluating Named Entity Recognition (NER) models focused on extracting Personally Identifiable Information (PII) from text. It includes a variety of entities such as names, addresses, phone numbers, email addresses, and identification numbers. The dataset is suitable for tasks that involve PII detection, compliance checks, and data anonymization.
Supported Tasks and Leaderboards
Named Entity… See the full description on the dataset page: https://huggingface.co/datasets/Josephgflowers/PII-NER.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Groceries Named Entity Recognition (NER) Dataset
A specialized dataset for identifying food and grocery items in natural language text using Named Entity Recognition (NER).
Entity Types
The dataset includes the following grocery categories:
- Fruits, Vegetables: fresh produce (e.g., apples, spinach)
- Lactose, Dairy, Eggs, Cheese, Yoghurt: dairy products and eggs
- Meat, Fish, Seafood: protein sources
- Frozen, Prepared Meals: ready-to-eat and frozen meals
- Baking, Cooking: baking… See the full description on the dataset page: https://huggingface.co/datasets/empathyai/grocery-ner-dataset.
https://choosealicense.com/licenses/bsd/
Movie NER dataset