Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Universal Named Entity Recognition (UNER) aims to fill a gap in multilingual NLP: high-quality NER datasets in many languages with a shared tagset.
UNER is modeled after the Universal Dependencies project, in that it is intended to be a large community annotation effort with language-universal guidelines. Further, we use the same text corpora as Universal Dependencies.
https://choosealicense.com/licenses/odbl/
Date: 2022-07-10
Files: ner_dataset.csv
Source: Kaggle entity annotated corpus
Notes: The dataset only contains the tokens and NER tag labels. Labels are uppercase.
About Dataset
from Kaggle Datasets
Context
Annotated corpus for named entity recognition, built on the Groningen Meaning Bank (GMB) corpus, for entity classification with enhanced and popular natural language processing features applied to the data set. Tip: Use a pandas DataFrame to load the dataset if using Python for… See the full description on the dataset page: https://huggingface.co/datasets/rjac/kaggle-entity-annotated-corpus-ner-dataset.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Annotated dataset for training named entity recognition models for medieval charters in Latin, French and Spanish.
The original raw texts for all charters were collected from four charter collections:
- HOME-ALCAR corpus : https://zenodo.org/record/5600884
- CBMA : http://www.cbma-project.eu
- Diplomata Belgica : https://www.diplomata-belgica.be
- CODEA corpus : https://corpuscodea.es/
We include (i) the annotated training datasets, (ii) the contextual and static embeddings trained on medieval multilingual texts and (iii) the named entity recognition models trained using two architectures: Bi-LSTM-CRF with stacked embeddings, and fine-tuning of BERT-based models (mBERT and RoBERTa).
Code, datasets and notebooks used to train the models are available in our GitLab repository: https://gitlab.com/magistermilitum/ner_medieval_multilingual
Our best RoBERTa model is also available in the HuggingFace library: https://huggingface.co/magistermilitum/roberta-multilingual-medieval-ner
Intro
Pile-NER-type is a set of GPT-generated data for named entity recognition using the type-based data construction prompt. It was collected by prompting gpt-3.5-turbo-0301 and augmented by negative sampling. Check our project page for more information.
License
Attribution-NonCommercial 4.0 International
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for aeroBERT-NER
Dataset Summary
This dataset contains sentences from the aerospace requirements domain. The sentences are tagged for five NER categories (SYS, VAL, ORG, DATETIME, and RES) using the BIO tagging scheme.
There are a total of 1432 sentences. The creation of this dataset is aimed at -
(1) Making available an open-source dataset for aerospace requirements which are often proprietary
(2) Fine-tuning language models for token identification… See the full description on the dataset page: https://huggingface.co/datasets/archanatikayatray/aeroBERT-NER.
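As an illustration of the BIO tagging scheme mentioned in the summary, here is a minimal sketch that collapses token-level BIO tags into entity spans. The example sentence, its tags, and the function name `bio_to_spans` are made up for illustration and are not taken from the dataset; only the category names SYS and VAL come from the card.

```python
from typing import List, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collapse BIO tags into (entity_type, text) pairs."""
    spans: List[Tuple[str, str]] = []
    current_type = None
    current_tokens: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new entity, closing any open one first.
            if current_type is not None:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type continues the open entity.
            current_tokens.append(token)
        else:
            # "O" or a stray I- tag closes any open entity; the token is skipped.
            if current_type is not None:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_type is not None:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


# Hypothetical requirement sentence, not drawn from aeroBERT-NER.
tokens = ["The", "engine", "shall", "start", "within", "5", "seconds"]
tags = ["O", "B-SYS", "O", "O", "O", "B-VAL", "I-VAL"]
example_spans = bio_to_spans(tokens, tags)
```

Treating a stray I- tag as "outside" is one possible convention; some evaluation tools instead promote it to the start of a new entity.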
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The datasets were generated using the Weakly Supervised NER pipeline (https://github.com/HUMADEX/Weekly-Supervised-NER-pipline) to train the symptom extraction NER models.
Supported Languages and dataset locations for the specific language:
English (base language): https://huggingface.co/HUMADEX/english_medical_ner
German: https://huggingface.co/HUMADEX/german_medical_ner
Italian: https://huggingface.co/HUMADEX/italian_medical_ner
Spanish: https://huggingface.co/HUMADEX/spanish_medical_ner
Greek: https://huggingface.co/HUMADEX/greek_medical_ner
Slovenian: https://huggingface.co/HUMADEX/slovenian_medical_ner
Polish: https://huggingface.co/HUMADEX/polish_medical_ner
Portuguese: https://huggingface.co/HUMADEX/portugese_medical_ner
Dataset Building
Acknowledgement
This dataset was created as part of joint research of the HUMADEX research group (https://www.linkedin.com/company/101563689/) and has received funding from the European Union Horizon Europe Research and Innovation Program project SMILE (grant number 101080923) and the Marie Skłodowska-Curie Actions (MSCA) Doctoral Networks project BosomShield (grant number 101073222). Responsibility for the information and views expressed herein lies entirely with the authors.
Authors:
dr. Izidor Mlakar, Rigona Sallauka, dr. Umut Arioz, dr. Matej Rojc
Please cite as:
Article title: Weakly-Supervised Multilingual Medical NER For Symptom Extraction For Low-Resource Languages
Doi: 10.20944/preprints202504.1356.v1
Website: https://www.preprints.org/manuscript/202504.1356/v1
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Dataset for training and evaluating Indian Legal Named Entity Recognition model.
Paper details
Named Entity Recognition in Indian court judgments Arxiv
Label Scheme
View label scheme (14 labels for 1 components)
ENTITY        BELONGS TO
LAWYER        PREAMBLE
COURT         PREAMBLE, JUDGEMENT
JUDGE         PREAMBLE, JUDGEMENT
PETITIONER    PREAMBLE, JUDGEMENT
RESPONDENT    PREAMBLE, JUDGEMENT
CASE_NUMBER   JUDGEMENT
GPE           JUDGEMENT
DATE          JUDGEMENT
ORG           JUDGEMENT
STATUTE       JUDGEMENT…
See the full description on the dataset page: https://huggingface.co/datasets/opennyaiorg/InLegalNER.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
IT Skills Named Entity Recognition (NER) Dataset
Description:
This dataset includes 5,029 curriculum vitae (CV) samples, each annotated with IT skills using Named Entity Recognition (NER). The skills are manually labeled and extracted from PDFs, and the data is provided in JSON format. This dataset is ideal for training and evaluating NER models, especially for extracting IT skills from CVs.
Highlights:
- 5,029 CV samples with annotated IT skills
- Manual annotations for… See the full description on the dataset page: https://huggingface.co/datasets/Mehyaar/Annotated_NER_PDF_Resumes.
https://choosealicense.com/licenses/odc-by/
Dataset Card for Climate Change NER
Climate Change NER is an English-language dataset containing 534 abstracts of climate-related papers, sourced from the Semantic Scholar Academic Graph "abstracts" dataset. The abstracts have been manually annotated by classifying climate-related tokens into a set of 13 categories.
Dataset Details
Dataset Description
We introduce a comprehensive dataset for developing and evaluating NLP models tailored towards… See the full description on the dataset page: https://huggingface.co/datasets/ibm-research/Climate-Change-NER.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for NER PII Extraction Dataset
Dataset Summary
This dataset is designed for training and evaluating Named Entity Recognition (NER) models focused on extracting Personally Identifiable Information (PII) from text. It includes a variety of entities such as names, addresses, phone numbers, email addresses, and identification numbers. The dataset is suitable for tasks that involve PII detection, compliance checks, and data anonymization.
Supported Tasks and Leaderboards
Named Entity… See the full description on the dataset page: https://huggingface.co/datasets/Josephgflowers/PII-NER.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for BioLeaflets Dataset
Dataset Summary
BioLeaflets is a biomedical dataset for Data2Text generation. It is a corpus of 1,336 package leaflets of medicines authorised in Europe, which were obtained by scraping the European Medicines Agency (EMA) website. Package leaflets are included in the packaging of medicinal products and contain information to help patients use the product safely and appropriately. This dataset comprises the large majority (∼ 90%) of… See the full description on the dataset page: https://huggingface.co/datasets/ruslan/bioleaflets-biomedical-ner.
Here we describe a new clinical corpus rich in nested entities and a series of neural models to identify them. The corpus comprises de-identified referrals from the waiting list in Chilean public hospitals. A subset of 9,000 referrals (medical and dental) was manually annotated with ten types of entities, six attributes, and pairs of relations with clinical relevance. A trained medical doctor or dentist annotated these referrals and then, together with three other researchers, consolidated each of the annotations. In the annotated corpus, more than 48% of entities are embedded in other entities or contain another entity.
We use this corpus to build Named Entity Recognition (NER) models. The best results were achieved using Multiple Single-entity architectures with clinical word embeddings stacked with character and Flair contextual embeddings (refer to this paper: https://aclanthology.org/2022.coling-1.184/). The entity with the best performance is abbreviation, and the hardest to recognize is finding. NER models applied to this corpus can be used to derive statistics on diseases and pending procedures. This work constitutes the first annotated corpus using clinical narratives from Chile and one of the few in Spanish. The annotated corpus, clinical word embeddings, annotation guidelines, and neural models are freely released to the community.
This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/4.0/.
We are releasing the dataset in 3 formats:
- cwlc.zip: Contains the raw text files for each document along with its annotation file in Standoff format
- cwlc_conll-format: CoNLL format for training NER models.
In addition, the dataset has been released on Hugging Face (https://huggingface.co/plncmm) to facilitate experiments with transformer-based architectures.
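To make the nesting statistic concrete, the following minimal sketch computes the share of entities that are either contained in another entity or contain one, given character-offset spans. The span tuples, the labels TYPE_A and TYPE_B, and the function names are assumptions for illustration; this is not the authors' own procedure, and the corpus's Standoff files would need to be parsed into such tuples first.

```python
from typing import List, Tuple

# (start, end_exclusive, entity_type); offsets and labels below are hypothetical.
Span = Tuple[int, int, str]


def contains(outer: Span, inner: Span) -> bool:
    """True if inner lies fully within outer (identical extents count)."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def nested_fraction(spans: List[Span]) -> float:
    """Share of entities that contain, or are contained in, another entity."""
    if not spans:
        return 0.0
    involved = 0
    for i, a in enumerate(spans):
        if any(
            i != j and (contains(a, b) or contains(b, a))
            for j, b in enumerate(spans)
        ):
            involved += 1
    return involved / len(spans)


# Four made-up entities in one document: the second sits inside the first.
example = [
    (0, 5, "Finding"),
    (3, 5, "TYPE_A"),
    (7, 9, "TYPE_B"),
    (10, 11, "Abbreviation"),
]
fraction = nested_fraction(example)
```

The pairwise comparison is quadratic in the number of entities per document, which is usually acceptable for short referrals; spans that only partially overlap are not counted as nested here.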
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Azerbaijani Named Entity Recognition (NER) Dataset
This repository contains the dataset for training and evaluating Named Entity Recognition (NER) models in the Azerbaijani language. The dataset includes annotated text data with various named entities.
Dataset Description
The dataset includes the following entity types:
0: O: Outside any named entity
1: PERSON: Names of individuals
2: LOCATION: Geographical locations, both man-made and natural
3: ORGANISATION: Names of… See the full description on the dataset page: https://huggingface.co/datasets/LocalDoc/azerbaijani-ner-dataset.
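A small sketch of how integer tag ids like those listed above can be mapped back to label names. Only ids 0 to 3 are visible in this excerpt, so the mapping below is deliberately partial; the names `ID2LABEL` and `decode_tags` and the sample id sequence are hypothetical.

```python
from typing import Dict, List

# Partial mapping: only the ids shown in the excerpt above are included.
ID2LABEL: Dict[int, str] = {
    0: "O",
    1: "PERSON",
    2: "LOCATION",
    3: "ORGANISATION",
}


def decode_tags(tag_ids: List[int]) -> List[str]:
    """Map integer tag ids to label names, flagging ids missing from the mapping."""
    return [ID2LABEL.get(tag_id, f"UNKNOWN_{tag_id}") for tag_id in tag_ids]


# Made-up tag sequence; id 7 stands in for a label not listed in the excerpt.
decoded = decode_tags([0, 1, 0, 2, 3, 7])
```

Before real use, the mapping should be completed from the full dataset page rather than guessed.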
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Card for AnCora-Ca-NER
Dataset Summary
This is a dataset for Named Entity Recognition (NER) in Catalan. It adapts AnCora corpus for Machine Learning and Language Model evaluation purposes. This dataset was developed by BSC TeMU as part of the Projecte AINA, to enrich the Catalan Language Understanding Benchmark (CLUB).
Supported Tasks and Leaderboards
Named Entities Recognition, Language Model
Languages
The dataset is in Catalan… See the full description on the dataset page: https://huggingface.co/datasets/projecte-aina/ancora-ca-ner.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Dataset Card for Dataset Named Entity Recognition of funders of scientific research
Dataset Summary
Training/test set for automatically identifying funder entities mentioned in scientific papers. This data set is generated from Open Access documents hosted at https://econstor.eu and manually curated/labeled.
Supported Tasks and Leaderboards
The dataset is for training and testing the automatic recognition of funders as they are acknowledged in scientific… See the full description on the dataset page: https://huggingface.co/datasets/ZBWatHF/Funder-NER.
Polyglot-NER builds massive multilingual annotators with minimal human expertise and intervention.
dxiao/requirements-ner-id dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Dataset Card for RaTE-NER Dataset
Dataset Summary
RaTE-NER dataset is a large-scale, radiological named entity recognition (NER) dataset, including 13,235 manually annotated sentences from 1,816 reports within the MIMIC-IV database, that spans 9 imaging modalities and 23 anatomical regions, ensuring comprehensive coverage. Additionally, we further enriched the dataset with 33,605 sentences from the 17,432 reports available on Radiopaedia, by… See the full description on the dataset page: https://huggingface.co/datasets/Angelakeke/RaTE-NER.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
FiNER-139 is a named entity recognition dataset consisting of 10K annual and quarterly English reports (filings) of publicly traded companies downloaded from the U.S. Securities and Exchange Commission (SEC) annotated with 139 XBRL tags in the IOB2 format.
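Because the annotations use the IOB2 format, every entity must begin with a B- tag, and an I- tag may only continue an entity of the same type. The sketch below checks a tag sequence against that rule; the tag name `TagA` and the function name are placeholders, not actual XBRL tags or dataset tooling.

```python
from typing import List


def is_valid_iob2(tags: List[str]) -> bool:
    """Check that a tag sequence is well-formed IOB2."""
    previous = "O"
    for tag in tags:
        if tag == "O":
            pass  # Outside tags are always allowed.
        elif tag.startswith("B-") and len(tag) > 2:
            pass  # A B- tag may start an entity anywhere.
        elif tag.startswith("I-") and len(tag) > 2:
            # An I- tag must follow a B- or I- tag of the same type.
            if previous == "O" or previous[2:] != tag[2:]:
                return False
        else:
            return False  # Not an O, B-<type>, or I-<type> tag.
        previous = tag
    return True


# Placeholder tag names for illustration only.
valid_example = is_valid_iob2(["O", "B-TagA", "I-TagA", "O"])
invalid_example = is_valid_iob2(["O", "I-TagA"])
```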
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Groceries Named Entity Recognition (NER) Dataset
A specialized dataset for identifying food and grocery items in natural language text using Named Entity Recognition (NER).
Entity Types
The dataset includes the following grocery categories:
Fruits Vegetables: Fresh produce (e.g., apples, spinach)
Lactose, Diary, Eggs, Cheese, Yoghurt: Dairy products and eggs
Meat, Fish, Seafood: Protein sources
Frozen, Prepared Meals: Ready-to-eat and frozen meals
Baking, Cooking: Baking… See the full description on the dataset page: https://huggingface.co/datasets/empathyai/grocery-ner-dataset.