100+ datasets found

h
kaggle-entity-annotated-corpus-ner-dataset
huggingface.co
Updated Jul 10, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Rafael Arias Calles (2022). kaggle-entity-annotated-corpus-ner-dataset [Dataset]. https://huggingface.co/datasets/rjac/kaggle-entity-annotated-corpus-ner-dataset
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jul 10, 2022
Authors
Rafael Arias Calles
License
https://choosealicense.com/licenses/odbl/https://choosealicense.com/licenses/odbl/
Description
Date: 2022-07-10 Files: ner_dataset.csv Source: Kaggle entity annotated corpus notes: The dataset only contains the tokens and ner tag labels. Labels are uppercase.

About Dataset

from Kaggle Datasets

Context

Annotated Corpus for Named Entity Recognition using GMB(Groningen Meaning Bank) corpus for entity classification with enhanced and popular features by Natural Language Processing applied to the data set. Tip: Use Pandas Dataframe to load dataset if using Python for… See the full description on the dataset page: https://huggingface.co/datasets/rjac/kaggle-entity-annotated-corpus-ner-dataset.
h
Pile-NER-type
huggingface.co
Updated Aug 9, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Universal-NER (2023). Pile-NER-type [Dataset]. https://huggingface.co/datasets/Universal-NER/Pile-NER-type
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Aug 9, 2023
Authors
Universal-NER
Description
Intro

Pile-NER-type is a set of GPT-generated data for named entity recognition using the type-based data construction prompt. It was collected by prompting gpt-3.5-turbo-0301 and augmented by negative sampling. Check our project page for more information.

License

Attribution-NonCommercial 4.0 International
WIESP2022-NER
huggingface.co
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
SAO/NASA Astrophysics Data System, WIESP2022-NER [Dataset]. https://huggingface.co/datasets/adsabs/WIESP2022-NER
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset provided by
Astrophysics Data Systemhttp://www.adsabs.harvard.edu/
Authors
SAO/NASA Astrophysics Data System
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Dataset for the first Workshop on Information Extraction from Scientific Publications (WIESP/2022).

Dataset Description

Datasets with text fragments from astrophysics papers, provided by the NASA Astrophysical Data System with manually tagged astronomical facilities and other entities of interest (e.g., celestial objects).Datasets are in JSON Lines format (each line is a json dictionary).The datasets are formatted similarly to the CONLL2003 format. Each token is… See the full description on the dataset page: https://huggingface.co/datasets/adsabs/WIESP2022-NER.
h
Financial-NER-NLP
huggingface.co
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Joseph G Flowers, Financial-NER-NLP [Dataset]. https://huggingface.co/datasets/Josephgflowers/Financial-NER-NLP
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Authors
Joseph G Flowers
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Dataset Card for Financial-NER-NLP Dataset Summary The Financial-NER-NLP Dataset is a derivative of the FiNER-139 dataset, which consists of 1.1 million sentences annotated with 139 XBRL tags. This new dataset transforms the original structured data into natural language prompts suitable for training language models. The dataset is designed to enhance models’ abilities in tasks such as named entity recognition (NER), summarization, and information extraction in the financial domain. The… See the full description on the dataset page: https://huggingface.co/datasets/Josephgflowers/Financial-NER-NLP.
kpwr-ner
huggingface.co
Updated Apr 1, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
CLARIN-PL (2022). kpwr-ner [Dataset]. https://huggingface.co/datasets/clarin-pl/kpwr-ner
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Apr 1, 2022
Dataset authored and provided by
CLARIN-PL
License
Attribution 3.0 (CC BY 3.0)https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
Description
KPWR-NER tagging dataset.
h
AnythingNER
huggingface.co
Updated Oct 6, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
CascadeNER (2024). AnythingNER [Dataset]. https://huggingface.co/datasets/CascadeNER/AnythingNER
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Oct 6, 2024
Authors
CascadeNER
Description
DynamicNER: A Dynamic, Multilingual, and Fine-Grained Dataset for LLM-based Named Entity Recognition

This repository is supplement material for the paper: DynamicNER: A Dynamic, Multilingual, and Fine-Grained Dataset for LLM-based Named Entity Recognition

💓Update!

DynamicNER is disclosed now! Please download it from Huggingface for train and evaluation!

We add more GEIC format existing datasets and also the format for fine-tuning and inferrence based on SWIFT! You can… See the full description on the dataset page: https://huggingface.co/datasets/CascadeNER/AnythingNER.
h
PII-NER
huggingface.co
Updated Jul 20, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Joseph G Flowers (2024). PII-NER [Dataset]. https://huggingface.co/datasets/Josephgflowers/PII-NER
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jul 20, 2024
Authors
Joseph G Flowers
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
Dataset Card for NER PII Extraction Dataset Dataset Summary This dataset is designed for training and evaluating Named Entity Recognition (NER) models focused on extracting Personally Identifiable Information (PII) from text. It includes a variety of entities such as names, addresses, phone numbers, email addresses, and identification numbers. The dataset is suitable for tasks that involve PII detection, compliance checks, and data anonymization. Supported Tasks and Leaderboards Named Entity… See the full description on the dataset page: https://huggingface.co/datasets/Josephgflowers/PII-NER.
h
aeroBERT-NER
huggingface.co
Updated Apr 7, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Archana Tikayat Ray (2023). aeroBERT-NER [Dataset]. http://doi.org/10.57967/hf/0470
Explore at:
Unique identifier
https://doi.org/10.57967/hf/0470
Dataset updated
Apr 7, 2023
Authors
Archana Tikayat Ray
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Dataset Card for aeroBERT-NER

Dataset Summary

This dataset contains sentences from the aerospace requirements domain. The sentences are tagged for five NER categories (SYS, VAL, ORG, DATETIME, and RES) using the BIO tagging scheme. There are a total of 1432 sentences. The creation of this dataset is aimed at -
(1) Making available an open-source dataset for aerospace requirements which are often proprietary
(2) Fine-tuning language models for token identification… See the full description on the dataset page: https://huggingface.co/datasets/archanatikayatray/aeroBERT-NER.
h
few-nerd
huggingface.co
opendatalab.com
+1more
Updated Dec 31, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Speech and Language Technology, DFKI (2021). few-nerd [Dataset]. https://huggingface.co/datasets/DFKI-SLT/few-nerd
Explore at:
Dataset updated
Dec 31, 2021
Dataset authored and provided by
Speech and Language Technology, DFKI
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
Dataset Card for "Few-NERD"

dataset-description)

Dataset Summary Supported Tasks and Leaderboards Languages Dataset Structure Data Instances Data Fields Data Splits

Dataset Creation Curation Rationale Source Data Annotations Personal and Sensitive InformationConsiderations for Using the Data Social Impact of Dataset Discussion of Biases Other Known Limitations

Additional Information Dataset Curators Licensing Information Citation Information Contributions

Dataset… See the full description on the dataset page: https://huggingface.co/datasets/DFKI-SLT/few-nerd.
h
HiNER-original
huggingface.co
Updated May 2, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Computation for Indian Language Technology (2022). HiNER-original [Dataset]. https://huggingface.co/datasets/cfilt/HiNER-original
Explore at:
Dataset updated
May 2, 2022
Dataset authored and provided by
Computation for Indian Language Technology
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
This is the dataset repository for HiNER Dataset accepted to be published at LREC 2022. The dataset can help build sequence labelling models for the task Named Entity Recognitin for the Hindi language.
h
bioleaflets-biomedical-ner
huggingface.co
Updated May 7, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Ruslan Yermak (2023). bioleaflets-biomedical-ner [Dataset]. https://huggingface.co/datasets/ruslan/bioleaflets-biomedical-ner
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
May 7, 2023
Authors
Ruslan Yermak
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Dataset Card for BioLeaflets Dataset

Dataset Summary

BioLeaflets is a biomedical dataset for Data2Text generation. It is a corpus of 1,336 package leaflets of medicines authorised in Europe, which were obtained by scraping the European Medicines Agency (EMA) website. Package leaflets are included in the packaging of medicinal products and contain information to help patients use the product safely and appropriately. This dataset comprises the large majority (∼ 90%) of… See the full description on the dataset page: https://huggingface.co/datasets/ruslan/bioleaflets-biomedical-ner.
h
grocery-ner-dataset
huggingface.co
Updated May 13, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
empathy.ai (2025). grocery-ner-dataset [Dataset]. https://huggingface.co/datasets/empathyai/grocery-ner-dataset
Explore at:
Dataset updated
May 13, 2025
Dataset provided by
empathy.ai
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
Groceries Named Entity Recognition (NER) Dataset

A specialized dataset for identifying food and grocery items in natural language text using Named Entity Recognition (NER).

Entity Types

The dataset includes the following grocery categories:

Fruits Vegetables: Fresh produce (e.g., apples, spinach) Lactose, Diary, Eggs, Cheese, Yoghurt: Dairy products and eggs Meat, Fish, Seafood: Protein sources Frozen, Prepared Meals: Ready-to-eat and frozen meals Baking, Cooking: Baking… See the full description on the dataset page: https://huggingface.co/datasets/empathyai/grocery-ner-dataset.
h
finer-139
huggingface.co
opendatalab.com
Updated May 9, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
AUEB NLP Group (2022). finer-139 [Dataset]. https://huggingface.co/datasets/nlpaueb/finer-139
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
May 9, 2022
Authors
AUEB NLP Group
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
FiNER-139 is a named entity recognition dataset consisting of 10K annual and quarterly English reports (filings) of publicly traded companies downloaded from the U.S. Securities and Exchange Commission (SEC) annotated with 139 XBRL tags in the IOB2 format.
h
Annotated_NER_PDF_Resumes
huggingface.co
Updated Jul 22, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
MehyarMlaweh (2024). Annotated_NER_PDF_Resumes [Dataset]. https://huggingface.co/datasets/Mehyaar/Annotated_NER_PDF_Resumes
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jul 22, 2024
Authors
MehyarMlaweh
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
IT Skills Named Entity Recognition (NER) Dataset

Description:

This dataset includes 5,029 curriculum vitae (CV) samples, each annotated with IT skills using Named Entity Recognition (NER). The skills are manually labeled and extracted from PDFs, and the data is provided in JSON format. This dataset is ideal for training and evaluating NER models, especially for extracting IT skills from CVs.

Highlights:

5,029 CV samples with annotated IT skills Manual annotations for… See the full description on the dataset page: https://huggingface.co/datasets/Mehyaar/Annotated_NER_PDF_Resumes.
h
RWCS-NER-DC
huggingface.co
Updated Mar 24, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Shilin (2025). RWCS-NER-DC [Dataset]. https://huggingface.co/datasets/zsLin/RWCS-NER-DC
Explore at:
Dataset updated
Mar 24, 2025
Authors
Shilin
Description
zsLin/RWCS-NER-DC dataset hosted on Hugging Face and contributed by the HF Datasets community
h
synthetic-pii-ner-mistral-v1
huggingface.co
Updated Apr 20, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Urchade Zaratiana (2024). synthetic-pii-ner-mistral-v1 [Dataset]. https://huggingface.co/datasets/urchade/synthetic-pii-ner-mistral-v1
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Apr 20, 2024
Authors
Urchade Zaratiana
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
This the synthetic dataset used for training https://huggingface.co/urchade/gliner_multi_pii-v1. You can get it by browsing the files and dowloading the data.json file.
h
Persian-Text-NER
huggingface.co
Updated Nov 18, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Seyed Ali Mir Mohammad Hosseini (2023). Persian-Text-NER [Dataset]. https://huggingface.co/datasets/SeyedAli/Persian-Text-NER
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Nov 18, 2023
Authors
Seyed Ali Mir Mohammad Hosseini
License
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
Description
SeyedAli/Persian-Text-NER dataset hosted on Hugging Face and contributed by the HF Datasets community
h
cross_ner
huggingface.co
Updated Apr 19, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Speech and Language Technology, DFKI (2023). cross_ner [Dataset]. https://huggingface.co/datasets/DFKI-SLT/cross_ner
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Apr 19, 2023
Dataset authored and provided by
Speech and Language Technology, DFKI
License
https://choosealicense.com/licenses/undefined/https://choosealicense.com/licenses/undefined/
Description
CrossNER is a fully-labeled collected of named entity recognition (NER) data spanning over five diverse domains (Politics, Natural Science, Music, Literature, and Artificial Intelligence) with specialized entity categories for different domains. Additionally, CrossNER also includes unlabeled domain-related corpora for the corresponding five domains.

For details, see the paper: CrossNER: Evaluating Cross-Domain Named Entity Recognition
h
product-ner
huggingface.co
Updated Sep 14, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Irsh Vijay (2024). product-ner [Dataset]. https://huggingface.co/datasets/1rsh/product-ner
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Sep 14, 2024
Authors
Irsh Vijay
Description
1rsh/product-ner dataset hosted on Hugging Face and contributed by the HF Datasets community
h
ancora-ca-ner
huggingface.co
Updated Nov 1, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Projecte Aina (2021). ancora-ca-ner [Dataset]. https://huggingface.co/datasets/projecte-aina/ancora-ca-ner
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Nov 1, 2021
Dataset authored and provided by
Projecte Aina
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Dataset Card for AnCora-Ca-NER

Dataset Summary

This is a dataset for Named Entity Recognition (NER) in Catalan. It adapts AnCora corpus for Machine Learning and Language Model evaluation purposes. This dataset was developed by BSC TeMU as part of the Projecte AINA, to enrich the Catalan Language Understanding Benchmark (CLUB).

Supported Tasks and Leaderboards

Named Entities Recognition, Language Model

Languages

The dataset is in Catalan… See the full description on the dataset page: https://huggingface.co/datasets/projecte-aina/ancora-ca-ner.

Facebook

Twitter

Click to copy link

Link copied

Cite

Rafael Arias Calles (2022). kaggle-entity-annotated-corpus-ner-dataset [Dataset]. https://huggingface.co/datasets/rjac/kaggle-entity-annotated-corpus-ner-dataset

kaggle-entity-annotated-corpus-ner-dataset

rjac/kaggle-entity-annotated-corpus-ner-dataset

Explore at:

CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.

Dataset updated

Jul 10, 2022

Authors

Rafael Arias Calles

License

https://choosealicense.com/licenses/odbl/https://choosealicense.com/licenses/odbl/

Description

Date: 2022-07-10 Files: ner_dataset.csv Source: Kaggle entity annotated corpus notes: The dataset only contains the tokens and ner tag labels. Labels are uppercase.

  About Dataset

from Kaggle Datasets

  Context

Annotated Corpus for Named Entity Recognition using GMB(Groningen Meaning Bank) corpus for entity classification with enhanced and popular features by Natural Language Processing applied to the data set. Tip: Use Pandas Dataframe to load dataset if using Python for… See the full description on the dataset page: https://huggingface.co/datasets/rjac/kaggle-entity-annotated-corpus-ner-dataset.

Clear search

Close search

Google apps

Main menu

kaggle-entity-annotated-corpus-ner-dataset

Pile-NER-type

WIESP2022-NER

Financial-NER-NLP

kpwr-ner

AnythingNER

PII-NER

aeroBERT-NER

few-nerd

dataset-description)

HiNER-original

bioleaflets-biomedical-ner

grocery-ner-dataset

finer-139

Annotated_NER_PDF_Resumes

RWCS-NER-DC

synthetic-pii-ner-mistral-v1

Persian-Text-NER

cross_ner

product-ner

ancora-ca-ner

kaggle-entity-annotated-corpus-ner-dataset

rjac/kaggle-entity-annotated-corpus-ner-dataset