Database Contents License (DbCL) v1.0: http://opendatacommons.org/licenses/dbcl/1.0/
Named Entity Recognition (NER) is the task of categorizing the entities in a text into categories such as names of persons, locations, organizations, etc.
Each row in the CSV file contains a complete sentence, a list of POS tags for each word in the sentence, and a list of NER tags for each word in the sentence.
You can use a pandas DataFrame to read and manipulate this dataset.
Since each row in the CSV file contains lists, if we read the file with pandas.read_csv() and index into a tag column, we get a string rather than a list:
```
>>> data['tag'][0]
"['O', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'B-gpe', 'O', 'O', 'O', 'O', 'O']"
>>> type(data['tag'][0])
<class 'str'>
```
You can use ast.literal_eval to convert it back to a list:
```
>>> from ast import literal_eval
>>> literal_eval(data['tag'][0])
['O', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'B-gpe', 'O', 'O', 'O', 'O', 'O']
>>> type(literal_eval(data['tag'][0]))
<class 'list'>
```
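If you want every row parsed at once, here is a minimal sketch; the CSV path and the 'pos' column name are assumptions (only 'tag' appears above):
```
import pandas as pd
from ast import literal_eval

# Hypothetical path; substitute the actual CSV file.
data = pd.read_csv('ner.csv')

# Parse the stringified lists in each annotation column back into real lists.
for col in ('pos', 'tag'):
    data[col] = data[col].apply(literal_eval)

print(type(data['tag'][0]))  # <class 'list'>
```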
This dataset is derived from the Annotated Corpus for Named Entity Recognition by Abhinav Walia and has been further processed. That corpus is an annotated corpus for Named Entity Recognition built on the GMB (Groningen Meaning Bank) corpus, with enhanced and popular NLP features applied to the data set for entity classification.
Essential info about entities:
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Turkish Organization Named Entity Recognition (NER) Dataset
This dataset is designed for Named Entity Recognition (NER) tasks in the Turkish language, specifically focusing on organization entities. It contains sentences annotated with the ORGANIZATION label, making it a valuable resource for training and evaluating NER models.
Usage
To use this dataset, you can load it directly with Hugging Face's datasets library:
```
from datasets import load_dataset

dataset = …
```
See the full description on the dataset page: https://huggingface.co/datasets/STNM-NLPhoenix/turkish-org-ner.
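The snippet above is cut off; a minimal sketch of the likely call, assuming the repository id matches the page URL and that a "train" split exists:
```
from datasets import load_dataset

# Repository id taken from the dataset page URL above.
dataset = load_dataset("STNM-NLPhoenix/turkish-org-ner")
print(dataset)              # show the available splits
print(dataset["train"][0])  # inspect one annotated example ("train" split assumed)
```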
dev-ner-ontonotes
Validation set of the OntoNotes5 NER dataset, created with Argilla for an Argilla Meetup talk.
Usage
Load with Argilla
To load with Argilla, you'll just need to install Argilla as pip install argilla --upgrade and then use the following code:
```
import argilla as rg

ds = rg.FeedbackDataset.from_huggingface("louisguitton/dev-ner-ontonotes")
```
Load with datasets
To load this dataset with datasets, you'll just need to install datasets as pip… See the full description on the dataset page: https://huggingface.co/datasets/louisguitton/dev-ner-ontonotes.
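The instructions above are cut off; a minimal sketch, assuming the standard datasets API applies to this repository:
```
from datasets import load_dataset

# Repository id taken from the dataset page URL above.
ds = load_dataset("louisguitton/dev-ner-ontonotes")
print(ds)  # show the available splits and features
```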
WikiANN (sometimes called PAN-X) is a multilingual named entity recognition dataset consisting of Wikipedia articles annotated with LOC (location), PER (person), and ORG (organisation) tags in the IOB2 format. This version corresponds to the balanced train, dev, and test splits of Rahimi et al. (2019), which supports 176 of the 282 languages from the original WikiANN corpus.
To use this dataset:
```
import tensorflow_datasets as tfds

ds = tfds.load('wikiann', split='train')
for ex in ds.take(4):
    print(ex)
```
See the guide for more information on tensorflow_datasets.
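Since WikiANN is multilingual, you will usually want a specific language subset. A minimal sketch, assuming TFDS exposes per-language configs named by language code (e.g. 'en'):
```
import tensorflow_datasets as tfds

# Config name 'en' is an assumption based on WikiANN's language codes.
ds = tfds.load('wikiann/en', split='train')
for ex in ds.take(1):
    print(ex)
```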
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
HealthE contains 3,400 pieces of health advice gathered (1) from public health websites (WebMD.com, MedlinePlus.gov, CDC.gov, and MayoClinic.org) and (2) from the publicly available Preclude dataset. Each sample was hand-labeled for health entity recognition by a team of 14 annotators at the author's institution. Automatic recognition of health entities will enable further research in large-scale modeling of texts from online health communities.
The data is provided in two parts. Both are formatted with the popular, free Python pickle library and require the popular, free pandas library. healthe.pkl is a pandas.DataFrame object containing the 3,400 health-advice statements with hand-labeled health entities. non_advice.pkl is a pandas.DataFrame object containing the 2,256 non-advice statements.
To load the files in Python, use the following code block:
```
import pandas as pd

# pandas handles the unpickling for us.
healthe_df = pd.read_pickle('healthe.pkl')
non_advice_df = pd.read_pickle('non_advice.pkl')
```
healthe_df has four columns:
* text contains the health-advice statement text.
* entities contains a Python list of (entity, class) tuples.
* tokenized_text contains a list of tokens obtained by tokenizing the health-advice statement text.
* labels contains a list of the same length as tokenized_text, where each token is mapped to a class label.
non_advice_df has one column, text, containing each non-health-advice statement.
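A minimal sketch of reading the annotations back out, assuming the column layout described above:
```
# Inspect one health-advice statement and its annotations.
row = healthe_df.iloc[0]
print(row['text'])
print(row['entities'])  # list of (entity, class) tuples

# tokenized_text and labels are aligned, token by token.
for token, label in zip(row['tokenized_text'], row['labels']):
    print(token, label)
```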
The shared task of CoNLL-2003 concerns language-independent named entity recognition and concentrates on four types of named entities: persons, locations, organizations and names of miscellaneous entities that do not belong to the previous three groups.
To use this dataset:
```
import tensorflow_datasets as tfds

ds = tfds.load('conll2003', split='train')
for ex in ds.take(4):
    print(ex)
```
See the guide for more information on tensorflow_datasets.
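To map the integer NER tags back to their IOB label strings, a hedged sketch (the feature key 'ner' and the ClassLabel layout are assumptions about this TFDS build):
```
import tensorflow_datasets as tfds

# with_info=True returns dataset metadata alongside the data.
ds, info = tfds.load('conll2003', split='train', with_info=True)
ner_names = info.features['ner'].feature.names  # assumed feature key

for ex in ds.take(1):
    tokens = [t.decode('utf-8') for t in ex['tokens'].numpy()]
    tags = [ner_names[i] for i in ex['ner'].numpy()]
    print(list(zip(tokens, tags)))
```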
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset information
A dataset concatenating open-source French NER datasets covering 3 entities (LOC, PER, ORG). There are a total of 420,264 rows, of which 346,071 are for training, 32,951 for validation and 41,242 for testing. Our methodology is described in a blog post available in English or French.
Usage
```
from datasets import load_dataset

dataset = load_dataset("CATIE-AQ/frenchNER_3entities")
```
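As a quick follow-up, a minimal sketch for checking the split sizes quoted above (the split names train/validation/test are assumed from those counts):
```
from datasets import load_dataset

dataset = load_dataset("CATIE-AQ/frenchNER_3entities")
# Expected sizes per the description:
# train 346,071 / validation 32,951 / test 41,242.
for split in ("train", "validation", "test"):
    print(split, len(dataset[split]))
```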
Dataset
Details of rows… See the full description on the dataset page: https://huggingface.co/datasets/CATIE-AQ/frenchNER_3entities.
Large Weak Labelled NER corpus
Dataset Summary
The dataset is generated through weak labelling of a scraped and preprocessed news corpus (Bloomberg news), so it is intended for research purposes only. For tokenization, articles were split into sentences using nltk's PunktSentenceTokenizer, so the tokenization may occasionally be imperfect.
Usage
```
from datasets import load_dataset

articles_ds = load_dataset("imvladikon/english_news_weak_ner", "articles")  # just…
```
See the full description on the dataset page: https://huggingface.co/datasets/imvladikon/english_news_weak_ner.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset information
A dataset concatenating open-source French NER datasets covering 4 entities (LOC, PER, ORG, MISC). There are a total of 384,773 rows, of which 328,757 are for training, 24,131 for validation and 31,885 for testing. Our methodology is described in a blog post available in English or French.
Usage
```
from datasets import load_dataset

dataset = load_dataset("CATIE-AQ/frenchNER_4entities")
```
Dataset
Details of rows… See the full description on the dataset page: https://huggingface.co/datasets/CATIE-AQ/frenchNER_4entities.