https://choosealicense.com/licenses/odbl/
Date: 2022-07-10
Files: ner_dataset.csv
Source: Kaggle entity annotated corpus
Notes: The dataset only contains the tokens and NER tag labels. Labels are uppercase.
About Dataset
from Kaggle Datasets
Context
Annotated corpus for Named Entity Recognition, built on the GMB (Groningen Meaning Bank) corpus, for entity classification with enhanced and popular features produced by Natural Language Processing applied to the data set. Tip: Use a Pandas DataFrame to load the dataset if using Python for… See the full description on the dataset page: https://huggingface.co/datasets/rjac/kaggle-entity-annotated-corpus-ner-dataset.
Intro
Pile-NER-definition is a set of GPT-generated data for named entity recognition using the definition-based data construction prompt. It was collected by prompting gpt-3.5-turbo-0301 and augmented by negative sampling. Check our project page for more information.
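The card does not spell out how negative sampling is applied. One plausible reading, sketched below under that assumption, is that each passage is also queried for entity types that do not occur in it, with an empty answer as the target. The function name, the type vocabulary and the sample annotations are hypothetical and are not taken from the dataset.

```python
import random


def add_negative_types(annotations, type_vocabulary, num_negatives, seed=0):
    """Return (entity_type, mentions) pairs for one passage, plus sampled negatives.

    annotations: dict mapping entity type -> list of mentions found in the passage.
    type_vocabulary: all entity types seen across the corpus (hypothetical input).
    """
    rng = random.Random(seed)
    # Types that do not occur in this passage are the candidates for negatives.
    absent = sorted(t for t in type_vocabulary if t not in annotations)
    sampled = rng.sample(absent, min(num_negatives, len(absent)))
    pairs = [(etype, list(mentions)) for etype, mentions in annotations.items()]
    # A negative query has an empty answer: nothing of that type is in the passage.
    pairs += [(etype, []) for etype in sampled]
    return pairs


example = add_negative_types(
    {"person": ["Ada Lovelace"], "location": ["London"]},
    {"person", "location", "disease", "organization", "date"},
    num_negatives=2,
)
```

Fixing the seed keeps the sampled negatives reproducible between runs, which helps when comparing training configurations.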
License
Attribution-NonCommercial 4.0 International
Attribution 3.0 (CC BY 3.0) https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
KPWR-NER tagging dataset.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for Financial-NER-NLP
Dataset Summary
The Financial-NER-NLP Dataset is a derivative of the FiNER-139 dataset, which consists of 1.1 million sentences annotated with 139 XBRL tags. This new dataset transforms the original structured data into natural language prompts suitable for training language models. The dataset is designed to enhance models’ abilities in tasks such as named entity recognition (NER), summarization, and information extraction in the financial domain. The… See the full description on the dataset page: https://huggingface.co/datasets/Josephgflowers/Financial-NER-NLP.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This file contains the pre-trained BERT-Base-Multilingual-Cased model files, originally provided by Google on Hugging Face. The model supports 104 languages and is ideal for a wide range of Natural Language Processing (NLP) tasks, including text classification, named entity recognition (NER), and question answering.
This model is particularly helpful for multilingual NLP applications due to its ability to process cased text (case-sensitive input). Key details:
Source: Hugging Face
Architecture: 12-layer Transformer with 110M parameters
Tasks: Text classification, NER, question answering, etc.
To install, run the line below:
pip install "/kaggle/input/bert-base-multilingual-cased/Google Bert Multilingual"
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
MultiNERD Named Entity Recognition dataset: https://huggingface.co/datasets/Babelscape/multinerd
Paper: https://aclanthology.org/2022.findings-naacl.60.pdf
Sample token classification script: https://github.com/huggingface/notebooks/blob/main/examples/token_classification.ipynb
RoBERTa-base pretrained model link: https://huggingface.co/docs/transformers/model_doc/roberta
GLiNER is a Named Entity Recognition (NER) model capable of identifying any entity type using a bidirectional transformer encoder (BERT-like). It provides a practical alternative to traditional NER models, which are limited to predefined entities, and Large Language Models (LLMs) that, despite their flexibility, are costly and large for resource-constrained scenarios.
- … data/ directory, covering both Pile-NER 📚 and NuNER 📘 datasets.
- Focal Loss (see this paper) instead of BCE to handle class imbalance, as some entity types are more frequent than others.
- Threshold (model.predict_entities(text, labels, threshold=0.5)): allows the model to predict more or fewer entities, depending on the context. In practice, the model tends to predict fewer entities where the entity type or the domain is not well represented in the training data.
To use this model, you must install the GLiNER Python library:
!pip install gliner
Once you've downloaded the GLiNER library, you can import the GLiNER class. You can then load this model using GLiNER.from_pretrained and predict entities with predict_entities.
from gliner import GLiNER
model = GLiNER.from_pretrained("urchade/gliner_base")
text = """
Cristiano Ronaldo dos Santos Aveiro (Portuguese pronunciation: [kɾiʃˈtjɐnu ʁɔˈnaldu]; born 5 February 1985) is a Portuguese professional footballer who plays as a forward for and captains both Saudi Pro League club Al Nassr and the Portugal national team. Widely regarded as one of the greatest players of all time, Ronaldo has won five Ballon d'Or awards,[note 3] a record three UEFA Men's Player of the Year Awards, and four European Golden Shoes, the most by a European player. He has won 33 trophies in his career, including seven league titles, five UEFA Champions Leagues, the UEFA European Championship and the UEFA Nations League. Ronaldo holds the records for most appearances (183), goals (140) and assists (42) in the Champions League, goals in the European Championship (14), international goals (128) and international appearances (205). He is one of the few players to have made over 1,200 professional career appearances, the most by an outfield player, and has scored over 850 offici...
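The GLiNER notes above mention Focal Loss in place of BCE to handle class imbalance. The sketch below shows the general idea on a single binary prediction and is not GLiNER's actual training code. The gamma and alpha values are the commonly cited defaults from the focal loss literature, used here as assumptions.

```python
import math


def binary_cross_entropy(prob, target):
    """Standard BCE for one prediction; prob is the predicted P(label = 1)."""
    p_t = prob if target == 1 else 1.0 - prob
    return -math.log(p_t)


def binary_focal_loss(prob, target, gamma=2.0, alpha=0.25):
    """Focal loss: BCE scaled by alpha_t * (1 - p_t) ** gamma.

    Well-classified examples (p_t close to 1) are down-weighted, so frequent,
    easy entity types contribute less to the total loss than rare, hard ones.
    """
    p_t = prob if target == 1 else 1.0 - prob
    alpha_t = alpha if target == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# Compare how strongly an easy and a hard positive example are scaled down.
easy_ratio = binary_focal_loss(0.95, 1) / binary_cross_entropy(0.95, 1)
hard_ratio = binary_focal_loss(0.30, 1) / binary_cross_entropy(0.30, 1)
```

With gamma set to 0 the modulating factor disappears and the loss reduces to alpha-weighted BCE, which is a convenient sanity check.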
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for NER PII Extraction Dataset
Dataset Summary
This dataset is designed for training and evaluating Named Entity Recognition (NER) models focused on extracting Personally Identifiable Information (PII) from text. It includes a variety of entities such as names, addresses, phone numbers, email addresses, and identification numbers. The dataset is suitable for tasks that involve PII detection, compliance checks, and data anonymization.
Supported Tasks and Leaderboards
Named Entity… See the full description on the dataset page: https://huggingface.co/datasets/Josephgflowers/PII-NER.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Annotated dataset for training named entity recognition models for medieval charters in Latin, French and Spanish.
The original raw texts for all charters were collected from four charter collections:
- HOME-ALCAR corpus : https://zenodo.org/record/5600884
- CBMA : http://www.cbma-project.eu
- Diplomata Belgica : https://www.diplomata-belgica.be
- CODEA corpus : https://corpuscodea.es/
We include (i) the annotated training datasets, (ii) the contextual and static embeddings trained on medieval multilingual texts and (iii) the named entity recognition models trained using two architectures: Bi-LSTM-CRF + stacked embeddings and fine-tuning on BERT-based models (mBERT and RoBERTa).
Code, datasets and notebooks used to train the models can be consulted in our GitLab repository: https://gitlab.com/magistermilitum/ner_medieval_multilingual
Our best RoBERTa model is also available in the HuggingFace library: https://huggingface.co/magistermilitum/roberta-multilingual-medieval-ner
DynamicNER: A Dynamic, Multilingual, and Fine-Grained Dataset for LLM-based Named Entity Recognition
This repository is supplementary material for the paper: DynamicNER: A Dynamic, Multilingual, and Fine-Grained Dataset for LLM-based Named Entity Recognition
💓Update!
DynamicNER is now released! Please download it from Hugging Face for training and evaluation!
We have added more existing datasets in GEIC format, as well as the format for fine-tuning and inference based on SWIFT! You can… See the full description on the dataset page: https://huggingface.co/datasets/CascadeNER/AnythingNER.
https://creativecommons.org/publicdomain/zero/1.0/
By harem (from Hugging Face)
The dataset is available in two versions: a complete version with 10 named entity classes (Person, Organization, Location, Value, Date, Title, Thing, Event, Abstraction, and Other) and a selective version with only 5 classes (Person, Organization, Location, Value, and Date). The selective version focuses on the most commonly recognized named entity types.
The original HAREM dataset had two levels of NER detail: Category and Sub-type. However, the processed version of the corpus presented in this Kaggle dataset only includes information up to the Category level.
Each entry in this dataset consists of tokenized words from the original text along with their corresponding NER tags assigned through annotation. The tokens column contains the individual words or tokens extracted from the text, and a duplicate tokens column is provided for consistency purposes.
The ner_tags column contains the class label assigned to each token, indicating its named entity class, such as Person or Organization. A duplicate ner_tags column serves the same consistency purpose in datasets where both columns co-occur.
This Kaggle dataset contains three separate CSV files: train.csv for training data, validation.csv for validating NER model performance on Portuguese texts, and test.csv, another subset of the HAREM corpus with tokenized words alongside their respective NER tags. The separate files let users train, validate, and test NER models on Portuguese texts.
Introduction:
Dataset Overview:
Dataset Files:
a) train.csv - Contains the training data with tokens (individual words or tokens) and their corresponding named entity recognition (NER) tags.
b) validation.csv - Provides a subset of the corpus for validating model performance in identifying named entities.
c) test.csv - Contains tokenized words from the corpus along with their respective NER tags.
Named Entity Classes: The dataset includes 10 different named entity classes: Person, Organization, Location, Value, Date, Title, Thing, Abstraction, Event, and Other.
Understanding the Columns:
a) tokens - This column comprises individual tokens or words extracted from the text.
b) ner_tags - This column lists the named entity recognition tag assigned to each token, indicating its class.
Training and Evaluation: To use this dataset for training a NER model, you can utilize the train.csv file. The tokens column will provide you with the words or tokens, while the ner_tags column will guide you in labeling the named entities within your training data.
For evaluating your model's performance, the validation.csv file can be used. Similar to the train.csv file, it contains tokenized words and their corresponding NER tags.
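As a minimal sketch of working with the tokens and ner_tags columns described above, the example below parses a small inline CSV and counts the tags. It assumes each cell stores a Python-style list literal, which the description does not confirm. The sample rows and tag strings are invented for illustration and are not taken from the corpus.

```python
import ast
import csv
import io
from collections import Counter

# Hypothetical stand-in for train.csv; the real cell encoding may differ.
SAMPLE_CSV = '''tokens,ner_tags
"['Maria', 'mora', 'em', 'Lisboa']","['Person', 'O', 'O', 'Location']"
"['A', 'Petrobras', 'cresceu']","['O', 'Organization', 'O']"
'''


def read_rows(handle):
    """Parse each CSV record into a (tokens, tags) pair of equal-length lists."""
    rows = []
    for record in csv.DictReader(handle):
        tokens = ast.literal_eval(record["tokens"])
        tags = ast.literal_eval(record["ner_tags"])
        if len(tokens) != len(tags):
            raise ValueError("tokens and ner_tags must align one-to-one")
        rows.append((tokens, tags))
    return rows


rows = read_rows(io.StringIO(SAMPLE_CSV))
# Count how often each tag occurs, a quick way to spot class imbalance.
tag_counts = Counter(tag for _, tags in rows for tag in tags)
```

To read an actual file, the same function can be called on a handle such as open("train.csv", newline="", encoding="utf-8"), provided the columns follow the assumed encoding.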
- Applying Pretrained Models: You can also use this dataset to fine-tune or evaluate pretrained NER models in Portuguese. By utilizing transfer learning techniques on this corpus, you may improve their performance on relevant named entity recognition tasks specific
- Entity Recognition and Classification: This dataset can be used to train and evaluate models for named entity recognition (NER) tasks in Portuguese. The NER tags provided in the dataset can serve as labels for training models to accurately identify and classify entities such as person names, organization names, locations, dates, etc.
- Cross-lingual Transfer Learning: The dataset can also be leveraged for cross-lingual transfer learning tasks by training models on this dataset and then using the trained model to extract named entities from other languages as well. This would enable NER tasks in multiple languages using a single trained model by leveraging knowledge gained from this rich resource of labeled data in Portuguese
If you use this dataset in your research, please credit the original authors.
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, mod...
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset for the first Workshop on Information Extraction from Scientific Publications (WIESP/2022).
Dataset Description
Datasets with text fragments from astrophysics papers, provided by the NASA Astrophysical Data System with manually tagged astronomical facilities and other entities of interest (e.g., celestial objects). Datasets are in JSON Lines format (each line is a JSON dictionary). The datasets are formatted similarly to the CONLL2003 format. Each token is… See the full description on the dataset page: https://huggingface.co/datasets/adsabs/WIESP2022-NER.
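A minimal sketch of reading a JSON Lines file like the one described above, using only the standard library. The field names tokens and ner_tags and the tag strings in the sample are assumptions for illustration. The dataset page is the authority on the actual schema.

```python
import io
import json

# Hypothetical two-record sample; one JSON dictionary per line.
SAMPLE_JSONL = (
    '{"tokens": ["Observed", "with", "Hubble"], "ner_tags": ["O", "O", "B-Telescope"]}\n'
    '{"tokens": ["M31", "is", "bright"], "ner_tags": ["B-CelestialObject", "O", "O"]}\n'
)


def read_jsonl(handle):
    """Parse one JSON dictionary per non-empty line."""
    records = []
    for line in handle:
        line = line.strip()
        if line:
            records.append(json.loads(line))
    return records


def to_conll(record):
    """Render a record in a CoNLL-like layout: one 'token tag' pair per line."""
    pairs = zip(record["tokens"], record["ner_tags"])
    return "\n".join(f"{token} {tag}" for token, tag in pairs)


records = read_jsonl(io.StringIO(SAMPLE_JSONL))
first_sentence = to_conll(records[0])
```

Reading line by line keeps memory use flat for large files, since no record needs to be held until it is parsed.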
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Card for AnCora-Ca-NER
Dataset Summary
This is a dataset for Named Entity Recognition (NER) in Catalan. It adapts the AnCora corpus for Machine Learning and Language Model evaluation purposes. This dataset was developed by BSC TeMU as part of the Projecte AINA, to enrich the Catalan Language Understanding Benchmark (CLUB).
Supported Tasks and Leaderboards
Named Entities Recognition, Language Model
Languages
The dataset is in Catalan… See the full description on the dataset page: https://huggingface.co/datasets/projecte-aina/ancora-ca-ner.
zsLin/RWCS-NER-DC dataset hosted on Hugging Face and contributed by the HF Datasets community
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
IT Skills Named Entity Recognition (NER) Dataset
Description:
This dataset includes 5,029 curriculum vitae (CV) samples, each annotated with IT skills using Named Entity Recognition (NER). The skills are manually labeled and extracted from PDFs, and the data is provided in JSON format. This dataset is ideal for training and evaluating NER models, especially for extracting IT skills from CVs.
Highlights:
- 5,029 CV samples with annotated IT skills
- Manual annotations for… See the full description on the dataset page: https://huggingface.co/datasets/Mehyaar/Annotated_NER_PDF_Resumes.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for aeroBERT-NER
Dataset Summary
This dataset contains sentences from the aerospace requirements domain. The sentences are tagged for five NER categories (SYS, VAL, ORG, DATETIME, and RES) using the BIO tagging scheme.
There are a total of 1432 sentences. This dataset was created with the following aims:
(1) Making available an open-source dataset for aerospace requirements, which are often proprietary
(2) Fine-tuning language models for token identification… See the full description on the dataset page: https://huggingface.co/datasets/archanatikayatray/aeroBERT-NER.
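The BIO tagging scheme mentioned in the summary marks the first token of an entity with B-, continuation tokens with I-, and everything else with O. The sketch below turns such a tag sequence into labelled spans. The example sentence and its tags are invented for illustration, and the handling of a stray I- tag (treated as outside any entity) is a design choice of this sketch, not something the dataset card specifies.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (label, text) spans."""
    spans = []
    label, words = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always closes any open span and starts a new one.
            if label is not None:
                spans.append((label, " ".join(words)))
            label, words = tag[2:], [token]
        elif tag.startswith("I-") and label == tag[2:]:
            # An I- tag continues the open span only if the labels match.
            words.append(token)
        else:
            # "O" or a stray I- tag: close the open span, if any.
            if label is not None:
                spans.append((label, " ".join(words)))
            label, words = None, []
    if label is not None:
        spans.append((label, " ".join(words)))
    return spans


tokens = ["The", "flight", "computer", "shall", "respond", "within", "5", "seconds"]
tags = ["O", "B-SYS", "I-SYS", "O", "O", "O", "B-VAL", "I-VAL"]
spans = bio_to_spans(tokens, tags)
```

Span-level output like this is what entity-level precision and recall are usually computed on, rather than per-token accuracy.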
https://choosealicense.com/licenses/unknown/
Dataset Card for "NER Model Tune"
Dataset Summary
[More Information Needed]
Supported Tasks and Leaderboards
[More Information Needed]
Languages
[More Information Needed]
Dataset Structure
Data Instances
[More Information Needed]
Data Fields
[More Information Needed]
Data Splits
[More Information Needed]
Dataset Creation
Curation Rationale
[More Information Needed]
Source Data… See the full description on the dataset page: https://huggingface.co/datasets/ayuhamaro/ner-model-tune.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for Dataset Name
This dataset aims to improve the representation of underrepresented topics and entities in entity extractors, thereby improving entity extraction accuracy and generalization, especially on the latest news events (dataset represents broad news coverage between February 20-March 31, 2024). The dataset is a collection of news article summaries, translated and summarized with Llama2, and then entities extracted with Llama3. The distribution of data origin… See the full description on the dataset page: https://huggingface.co/datasets/EmergentMethods/AskNews-NER-v0.
MedMentions BioNER (Custom Processed)
This dataset is a custom preprocessed version of the MedMentions dataset for biomedical Named Entity Recognition (NER) tasks. It is compatible with Hugging Face Datasets and can be used directly for fine-tuning BERT-based models such as BERT or Bio_ClinicalBERT.
Dataset Summary
Task: Named Entity Recognition (NER) in biomedical text
Source: MedMentions
Language: English
Entity Types: 128 entity classes derived from UMLS semantic… See the full description on the dataset page: https://huggingface.co/datasets/geraldamasi/medmentions-ner.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for BioLeaflets Dataset
Dataset Summary
BioLeaflets is a biomedical dataset for Data2Text generation. It is a corpus of 1,336 package leaflets of medicines authorised in Europe, which were obtained by scraping the European Medicines Agency (EMA) website. Package leaflets are included in the packaging of medicinal products and contain information to help patients use the product safely and appropriately. This dataset comprises the large majority (∼ 90%) of… See the full description on the dataset page: https://huggingface.co/datasets/ruslan/bioleaflets-biomedical-ner.