Database Contents License (DbCL) v1.0: http://opendatacommons.org/licenses/dbcl/1.0/
Named Entity Recognition (NER) is the task of categorizing the entities in a text into categories such as names of persons, locations, organizations, etc.
Each row in the CSV file contains a complete sentence, a list of POS tags for each word in the sentence, and a list of NER tags for each word in the sentence.
You can use a pandas DataFrame to read and manipulate this dataset.
Since each row in the CSV file contains lists, if we read the file with pandas.read_csv() and index into a tag column, we get a string rather than a list:
```
>>> data['tag'][0]
"['O', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'B-gpe', 'O', 'O', 'O', 'O', 'O']"
>>> type(data['tag'][0])
<class 'str'>
```
You can use ast.literal_eval to convert it back to a list:
```
>>> from ast import literal_eval
>>> literal_eval(data['tag'][0])
['O', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'B-gpe', 'O', 'O', 'O', 'O', 'O']
>>> type(literal_eval(data['tag'][0]))
<class 'list'>
```
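If you want every row parsed at once, here is a minimal sketch; the CSV path and the 'pos' column name are assumptions (only 'tag' appears above):
```
import pandas as pd
from ast import literal_eval

# Hypothetical path; substitute the actual CSV file.
data = pd.read_csv('ner.csv')

# Parse the stringified lists in each annotation column back into real lists.
for col in ('pos', 'tag'):
    data[col] = data[col].apply(literal_eval)

print(type(data['tag'][0]))  # <class 'list'>
```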
This dataset is derived from the Annotated Corpus for Named Entity Recognition by Abhinav Walia and has been further processed. That corpus is an annotated corpus for Named Entity Recognition built on the GMB (Groningen Meaning Bank) corpus, with enhanced and popular NLP features applied to the data set for entity classification.
Essential info about entities:
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Turkish Organization Named Entity Recognition (NER) Dataset
This dataset is designed for Named Entity Recognition (NER) tasks in the Turkish language, specifically focusing on organization entities. It contains sentences annotated with the ORGANIZATION label, making it a valuable resource for training and evaluating NER models.
Usage
To use this dataset, you can load it directly with Hugging Face's datasets library:
```
from datasets import load_dataset

dataset = …
```
See the full description on the dataset page: https://huggingface.co/datasets/STNM-NLPhoenix/turkish-org-ner.
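The snippet above is cut off; a minimal sketch of the likely call, assuming the repository id matches the page URL and that a "train" split exists:
```
from datasets import load_dataset

# Repository id taken from the dataset page URL above.
dataset = load_dataset("STNM-NLPhoenix/turkish-org-ner")
print(dataset)              # show the available splits
print(dataset["train"][0])  # inspect one annotated example ("train" split assumed)
```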
dev-ner-ontonotes
Validation set of the OntoNotes5 NER dataset, created with Argilla for an Argilla Meetup talk.
Usage
Load with Argilla
To load with Argilla, you'll just need to install Argilla as pip install argilla --upgrade and then use the following code:
```
import argilla as rg

ds = rg.FeedbackDataset.from_huggingface("louisguitton/dev-ner-ontonotes")
```
Load with datasets
To load this dataset with datasets, you'll just need to install datasets as pip… See the full description on the dataset page: https://huggingface.co/datasets/louisguitton/dev-ner-ontonotes.
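The instructions above are cut off; a minimal sketch, assuming the standard datasets API applies to this repository:
```
from datasets import load_dataset

# Repository id taken from the dataset page URL above.
ds = load_dataset("louisguitton/dev-ner-ontonotes")
print(ds)  # show the available splits and features
```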
WikiANN (sometimes called PAN-X) is a multilingual named entity recognition dataset consisting of Wikipedia articles annotated with LOC (location), PER (person), and ORG (organisation) tags in the IOB2 format. This version corresponds to the balanced train, dev, and test splits of Rahimi et al. (2019), which supports 176 of the 282 languages from the original WikiANN corpus.
To use this dataset:
```
import tensorflow_datasets as tfds

ds = tfds.load('wikiann', split='train')
for ex in ds.take(4):
    print(ex)
```
See the guide for more information on tensorflow_datasets.
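Since WikiANN is multilingual, you will usually want a specific language subset. A minimal sketch, assuming TFDS exposes per-language configs named by language code (e.g. 'en'):
```
import tensorflow_datasets as tfds

# Config name 'en' is an assumption based on WikiANN's language codes.
ds = tfds.load('wikiann/en', split='train')
for ex in ds.take(1):
    print(ex)
```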
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
HealthE contains 3,400 pieces of health advice gathered (1) from public health websites (WebMD.com, MedlinePlus.gov, CDC.gov, and MayoClinic.org) and (2) from the publicly available Preclude dataset. Each sample was hand-labeled for health entity recognition by a team of 14 annotators at the author's institution. Automatic recognition of health entities will enable further research in large-scale modeling of texts from online health communities.
The data is provided in two parts. Both are formatted with the popular, free Python pickle library and require the popular, free pandas library. healthe.pkl is a pandas.DataFrame object containing the 3,400 health-advice statements with hand-labeled health entities. non_advice.pkl is a pandas.DataFrame object containing the 2,256 non-advice statements.
To load the files in Python, use the following code block:
```
import pandas as pd

# pandas handles the unpickling for us.
healthe_df = pd.read_pickle('healthe.pkl')
non_advice_df = pd.read_pickle('non_advice.pkl')
```
healthe_df has four columns:
* text contains the health-advice statement text.
* entities contains a Python list of (entity, class) tuples.
* tokenized_text contains a list of tokens obtained by tokenizing the health-advice statement text.
* labels contains a list of the same length as tokenized_text, where each token is mapped to a class label.
non_advice_df has one column, text, containing each non-health-advice statement.
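A minimal sketch of reading the annotations back out, assuming the column layout described above:
```
# Inspect one health-advice statement and its annotations.
row = healthe_df.iloc[0]
print(row['text'])
print(row['entities'])  # list of (entity, class) tuples

# tokenized_text and labels are aligned, token by token.
for token, label in zip(row['tokenized_text'], row['labels']):
    print(token, label)
```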
The shared task of CoNLL-2003 concerns language-independent named entity recognition and concentrates on four types of named entities: persons, locations, organizations and names of miscellaneous entities that do not belong to the previous three groups.
To use this dataset:
```
import tensorflow_datasets as tfds

ds = tfds.load('conll2003', split='train')
for ex in ds.take(4):
    print(ex)
```
See the guide for more information on tensorflow_datasets.
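To map the integer NER tags back to their IOB label strings, a hedged sketch (the feature key 'ner' and the ClassLabel layout are assumptions about this TFDS build):
```
import tensorflow_datasets as tfds

# with_info=True returns dataset metadata alongside the data.
ds, info = tfds.load('conll2003', split='train', with_info=True)
ner_names = info.features['ner'].feature.names  # assumed feature key

for ex in ds.take(1):
    tokens = [t.decode('utf-8') for t in ex['tokens'].numpy()]
    tags = [ner_names[i] for i in ex['ner'].numpy()]
    print(list(zip(tokens, tags)))
```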
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset information
A dataset concatenating open-source French NER datasets covering 3 entities (LOC, PER, ORG). There are a total of 420,264 rows, of which 346,071 are for training, 32,951 for validation and 41,242 for testing. Our methodology is described in a blog post available in English or French.
Usage
```
from datasets import load_dataset

dataset = load_dataset("CATIE-AQ/frenchNER_3entities")
```
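As a quick follow-up, a minimal sketch for checking the split sizes quoted above (the split names train/validation/test are assumed from those counts):
```
from datasets import load_dataset

dataset = load_dataset("CATIE-AQ/frenchNER_3entities")
# Expected sizes per the description:
# train 346,071 / validation 32,951 / test 41,242.
for split in ("train", "validation", "test"):
    print(split, len(dataset[split]))
```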
Dataset
Details of rows… See the full description on the dataset page: https://huggingface.co/datasets/CATIE-AQ/frenchNER_3entities.
Large Weak Labelled NER corpus
Dataset Summary
The dataset is generated through weak labelling of a scraped and preprocessed news corpus (Bloomberg news), so it is intended for research purposes only. For tokenization, articles were split into sentences using nltk's PunktSentenceTokenizer, so the tokenization may occasionally be imperfect.
Usage
```
from datasets import load_dataset

articles_ds = load_dataset("imvladikon/english_news_weak_ner", "articles")  # just…
```
See the full description on the dataset page: https://huggingface.co/datasets/imvladikon/english_news_weak_ner.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset information
A dataset concatenating open-source French NER datasets covering 4 entities (LOC, PER, ORG, MISC). There are a total of 384,773 rows, of which 328,757 are for training, 24,131 for validation and 31,885 for testing. Our methodology is described in a blog post available in English or French.
Usage
```
from datasets import load_dataset

dataset = load_dataset("CATIE-AQ/frenchNER_4entities")
```
Dataset
Details of rows… See the full description on the dataset page: https://huggingface.co/datasets/CATIE-AQ/frenchNER_4entities.