100+ datasets found
  1. kaggle-entity-annotated-corpus-ner-dataset

    • huggingface.co
    Updated Jul 10, 2022
    Cite
    Rafael Arias Calles (2022). kaggle-entity-annotated-corpus-ner-dataset [Dataset]. https://huggingface.co/datasets/rjac/kaggle-entity-annotated-corpus-ner-dataset
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jul 10, 2022
    Authors
    Rafael Arias Calles
    License

    https://choosealicense.com/licenses/odbl/

    Description

    Date: 2022-07-10. Files: ner_dataset.csv. Source: Kaggle entity annotated corpus. Notes: the dataset only contains the tokens and NER tag labels. Labels are uppercase.

      About Dataset
    

    from Kaggle Datasets

      Context
    

    Annotated corpus for Named Entity Recognition, built from the GMB (Groningen Meaning Bank) corpus for entity classification, with enhanced and popular Natural Language Processing features applied to the data set. Tip: use a Pandas DataFrame to load the dataset if using Python for… See the full description on the dataset page: https://huggingface.co/datasets/rjac/kaggle-entity-annotated-corpus-ner-dataset.
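The tip above about Pandas reflects the CSV layout of the original Kaggle corpus, where a sparse sentence-marker column groups token rows into sentences. A minimal stdlib sketch; the column names ("Sentence #", "Word", "POS", "Tag") are assumptions based on that corpus, so check the actual header of ner_dataset.csv:

```python
import csv
import io

# A few rows in the layout used by the Kaggle entity-annotated corpus.
# Column names are assumptions; adjust them to ner_dataset.csv's header.
SAMPLE = """Sentence #,Word,POS,Tag
Sentence: 1,Thousands,NNS,O
,of,IN,O
,demonstrators,NNS,O
,marched,VBN,O
,through,IN,O
,London,NNP,B-GEO
Sentence: 2,Families,NNS,O
,of,IN,O
,soldiers,NNS,O
"""

def read_sentences(fileobj):
    """Group (word, tag) rows into sentences; a filled 'Sentence #' cell starts a new one."""
    sentences = []
    for row in csv.DictReader(fileobj):
        if row["Sentence #"]:          # non-empty marker -> new sentence begins
            sentences.append([])
        sentences[-1].append((row["Word"], row["Tag"]))
    return sentences

sentences = read_sentences(io.StringIO(SAMPLE))
print(len(sentences))        # 2
print(sentences[0][-1])      # ('London', 'B-GEO')
```

The same grouping is what a pandas `groupby` on a forward-filled sentence column would produce.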

  2. Financial-NER-NLP

    • huggingface.co
    Cite
    Joseph G Flowers, Financial-NER-NLP [Dataset]. https://huggingface.co/datasets/Josephgflowers/Financial-NER-NLP
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Authors
    Joseph G Flowers
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Dataset Card for Financial-NER-NLP Dataset Summary The Financial-NER-NLP Dataset is a derivative of the FiNER-139 dataset, which consists of 1.1 million sentences annotated with 139 XBRL tags. This new dataset transforms the original structured data into natural language prompts suitable for training language models. The dataset is designed to enhance models’ abilities in tasks such as named entity recognition (NER), summarization, and information extraction in the financial domain. The… See the full description on the dataset page: https://huggingface.co/datasets/Josephgflowers/Financial-NER-NLP.

  3. Pile-NER-type

    • huggingface.co
    Updated Aug 9, 2023
    + more versions
    Cite
    Universal-NER (2023). Pile-NER-type [Dataset]. https://huggingface.co/datasets/Universal-NER/Pile-NER-type
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Aug 9, 2023
    Authors
    Universal-NER
    Description

    Intro

    Pile-NER-type is a set of GPT-generated data for named entity recognition using the type-based data construction prompt. It was collected by prompting gpt-3.5-turbo-0301 and augmented by negative sampling. Check our project page for more information.

      License
    

    Attribution-NonCommercial 4.0 International

  4. Data from: PyTorch model for Slovenian Named Entity Recognition SloNER 1.0

    • live.european-language-grid.eu
    Updated Jan 26, 2023
    Cite
    (2023). PyTorch model for Slovenian Named Entity Recognition SloNER 1.0 [Dataset]. https://live.european-language-grid.eu/catalogue/tool-service/20980
    Explore at:
    Dataset updated
    Jan 26, 2023
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    The SloNER is a model for Slovenian Named Entity Recognition. It is a PyTorch neural network model, intended for use with the Hugging Face transformers library (https://github.com/huggingface/transformers).

    The model is based on the Slovenian RoBERTa contextual embeddings model SloBERTa 2.0 (http://hdl.handle.net/11356/1397) and was trained on the SUK 1.0 training corpus (http://hdl.handle.net/11356/1747). The source code of the model is available in the GitHub repository https://github.com/clarinsi/SloNER.

  5. MultiNERD NER models

    • kaggle.com
    Updated Dec 5, 2023
    Cite
    Jayant Yadav (2023). MultiNERD NER models [Dataset]. https://www.kaggle.com/datasets/jayantyadav/multinerd-ner-models/versions/5
    Explore at:
    Available download formats: zip (2588704751 bytes)
    Dataset updated
    Dec 5, 2023
    Authors
    Jayant Yadav
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description
  6. Multilingual NER Dataset

    • kaggle.com
    Updated Dec 5, 2023
    Cite
    The Devastator (2023). Multilingual NER Dataset [Dataset]. https://www.kaggle.com/thedevastator/multilingual-ner-dataset
    Explore at:
    Available download formats: zip (72419294 bytes)
    Dataset updated
    Dec 5, 2023
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Multilingual NER Dataset

    Multilingual NER Dataset for Named Entity Recognition

    By Babelscape (From Huggingface) [source]

    About this dataset

    The Babelscape/wikineural NER Dataset is a comprehensive and diverse collection of multilingual text data specifically designed for the task of Named Entity Recognition (NER). It offers an extensive range of labeled sentences in nine different languages: French, German, Portuguese, Spanish, Polish, Dutch, Russian, English, and Italian.

    Each sentence in the dataset contains tokens (words or characters) that have been labeled with named entity recognition tags. These tags provide valuable information about the type of named entity each token represents. The dataset also includes a language column to indicate the language in which each sentence is written.

    This dataset serves as an invaluable resource for developing and evaluating NER models across multiple languages. It encompasses various domains and contexts to ensure diversity and representativeness. Researchers and practitioners can utilize this dataset to train and test their NER models in real-world scenarios.

    By using this dataset for NER tasks, users can enhance their understanding of how named entities are recognized across different languages. Furthermore, it enables benchmarking performance comparisons between various NER models developed for specific languages or trained on multiple languages simultaneously.

    Whether you are an experienced researcher or a beginner exploring multilingual NER tasks, the Babelscape/wikineural NER Dataset provides a highly informative and versatile resource that can contribute to advancements in natural language processing and information extraction applications on a global scale.

    How to use the dataset

    • Understand the Data Structure:

      • The dataset consists of labeled sentences in nine different languages: French (fr), German (de), Portuguese (pt), Spanish (es), Polish (pl), Dutch (nl), Russian (ru), English (en), and Italian (it).
      • Each sentence is represented by three columns: tokens, ner_tags, and lang.
      • The tokens column contains the individual words or characters in each labeled sentence.
      • The ner_tags column provides named entity recognition tags for each token, indicating their entity types.
      • The lang column specifies the language of each sentence.
    • Explore Different Languages:

      • Since this dataset covers multiple languages, you can choose to focus on a specific language or perform cross-lingual analysis.
      • Analyzing multiple languages can help uncover patterns and differences in named entities across various linguistic contexts.
    • Preprocessing and Cleaning:

      • Before training your NER models or applying any NLP techniques to this dataset, it's essential to preprocess and clean the data.
      • Consider removing any unnecessary punctuation marks or special characters unless they carry significant meaning in certain languages.
    • Training Named Entity Recognition Models:
      4a. Data Splitting: divide the dataset into training, validation, and testing sets based on your requirements, using appropriate ratios.
      4b. Feature Extraction: prepare input features from tokenized text data, such as word embeddings or character-level representations, depending on your model choice.
      4c. Model Training: utilize state-of-the-art NER models (e.g., LSTM-CRF or Transformer-based models) to train on the labeled sentences and ner_tags columns.
      4d. Evaluation: evaluate your trained model's performance using the provided validation dataset or test datasets specific to each language.

    • Applying Pretrained Models:

      • Instead of training a model from scratch, you can leverage existing pretrained NER models like BERT, GPT-2, or SpaCy's named entity recognition capabilities.
      • Fine-tune these pre-trained models on your specific NER task using the labeled
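Step 4a (data splitting) can be sketched with the standard library alone; the 80/10/10 ratios and the fixed seed below are arbitrary illustrative choices, and the string sentences stand in for real (tokens, ner_tags, lang) rows:

```python
import random

# Stand-ins for (tokens, ner_tags, lang) rows from the dataset.
sentences = [f"sentence-{i}" for i in range(100)]

rng = random.Random(42)      # fixed seed so the split is reproducible
shuffled = sentences[:]
rng.shuffle(shuffled)

n = len(shuffled)
train = shuffled[: int(0.8 * n)]
val = shuffled[int(0.8 * n): int(0.9 * n)]
test = shuffled[int(0.9 * n):]

print(len(train), len(val), len(test))   # 80 10 10
```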

    Research Ideas

    • Training NER models: This dataset can be used to train NER models in multiple languages. By providing labeled sentences and their corresponding named entity recognition tags, the dataset can help train models to accurately identify and classify named entities in different languages.
    • Evaluating NER performance: The dataset can be used as a benchmark to evaluate the performance of pre-trained or custom-built NER models. By using the labeled sentences as test data, developers and researchers can measure the accuracy, precision, recall, and F1-score of their models across multiple languages.
    • Cross-lingual analysis: With labeled sentences available in nine different languages, researchers can perform cross-lingual analysis...
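Training and evaluation on this data both start from the tokens/ner_tags pairing described above, so a helper that collapses BIO-style tags into entity spans is a common first step. The sketch below assumes string tags; in the released dataset, ner_tags may be integer ids that first need mapping to label strings (an assumption to verify against the dataset's feature definitions):

```python
def bio_to_spans(tokens, tags):
    """Collapse BIO tags (e.g. B-PER, I-PER, O) into (entity_text, label) spans."""
    spans, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)
        else:
            if current:
                spans.append((" ".join(current), label))
            current, label = [], None
    if current:
        spans.append((" ".join(current), label))
    return spans

tokens = ["Angela", "Merkel", "visited", "Paris", "."]
tags = ["B-PER", "I-PER", "O", "B-LOC", "O"]
print(bio_to_spans(tokens, tags))  # [('Angela Merkel', 'PER'), ('Paris', 'LOC')]
```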
  7. GeoEDdA: A Gold Standard Dataset for Named Entity Recognition and Span Categorization Annotations of Diderot & d'Alembert's Encyclopédie

    • data.niaid.nih.gov
    Updated Mar 20, 2024
    Cite
    Moncla, Ludovic; Vigier, Denis; McDonough, Katherine (2024). GeoEDdA: A Gold Standard Dataset for Named Entity Recognition and Span Categorization Annotations of Diderot & d'Alembert's Encyclopédie [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10530177
    Explore at:
    Dataset updated
    Mar 20, 2024
    Dataset provided by
    Lancaster University
    Laboratoire d'Informatique en Images et Systèmes d'Information
    Interactions, Corpus, Apprentissages, Représentations
    Authors
    Moncla, Ludovic; Vigier, Denis; McDonough, Katherine
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This repository contains a gold standard dataset for named entity recognition and span categorization annotations from Diderot & d’Alembert’s Encyclopédie entries.

    The dataset is available in the following formats:

    JSONL format provided by Prodigy

    binary spaCy format (ready to use with the spaCy train pipeline)

    The Gold Standard dataset is composed of 2,200 paragraphs from 2,001 randomly selected Encyclopédie entries. All paragraphs are written in 18th-century French.

    The spans/entities were labelled by the project team, aided by pre-labelling with early machine learning models to speed up the labelling process. A train/val/test split was used. The validation and test sets are composed of 200 paragraphs each: 100 classified under 'Géographie' and 100 from another knowledge domain. The datasets have the following breakdown of tokens and spans/entities.

    Tagset

    NC-Spatial: a common noun that identifies a spatial entity (nominal spatial entity) including natural features, e.g. ville, la rivière, royaume.

    NP-Spatial: a proper noun identifying the name of a place (spatial named entities), e.g. France, Paris, la Chine.

    ENE-Spatial: nested spatial entity, e.g. ville de France, royaume de Naples, la mer Baltique.

    Relation: spatial relation, e.g. dans, sur, à 10 lieues de.

    Latlong: geographic coordinates, e.g. Long. 19. 49. lat. 43. 55. 44.

    NC-Person: a common noun that identifies a person (nominal person entity), e.g. roi, l'empereur, les auteurs.

    NP-Person: a proper noun identifying the name of a person (person named entities), e.g. Louis XIV, Pline.

    ENE-Person: nested people entity, e.g. le czar Pierre, roi de Macédoine.

    NP-Misc: a proper noun identifying entities not classified as spatial or person, e.g. l'Eglise, 1702, Pélasgique.

    ENE-Misc: nested named entity not classified as spatial or person, e.g. l'ordre de S. Jacques, la déclaration du 21 Mars 1671.

    Head: entry name

    Domain-Mark: words indicating the knowledge domain (usually after the head and between parenthesis), e.g. Géographie, Geog., en Anatomie.

    HuggingFace

    The GeoEDdA dataset is available on the HuggingFace Hub: https://huggingface.co/datasets/GEODE/GeoEDdA

    spaCy Custom Spancat trained on Diderot & d’Alembert’s Encyclopédie entries

    This dataset was used to train and evaluate a custom spancat model for French using spaCy. The model is available on HuggingFace's model hub: https://huggingface.co/GEODE/fr_spacy_custom_spancat_edda.
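As a rough sketch of consuming the Prodigy JSONL format mentioned above, each line is one JSON record holding the annotated text and its spans. The field names and both example records below are assumptions for illustration, not actual GeoEDdA data, so verify them against the released files:

```python
import io
import json

# Hypothetical records in a Prodigy-style span layout ("text" plus a
# "spans" list of start/end/label); verify against the real files.
SAMPLE = io.StringIO(
    '{"text": "ALBI, ville de France.", "spans": [{"start": 6, "end": 21, "label": "ENE-Spatial"}]}\n'
    '{"text": "Louis XIV, roi de France.", "spans": [{"start": 0, "end": 9, "label": "NP-Person"}]}\n'
)

for line in SAMPLE:
    record = json.loads(line)
    for span in record["spans"]:
        # Character offsets index directly into the raw text.
        surface = record["text"][span["start"]:span["end"]]
        print(surface, "->", span["label"])
```

The span labels used here (ENE-Spatial, NP-Person) follow the tagset listed above.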

    Acknowledgement

    The authors are grateful to the ASLAN project (ANR-10-LABX-0081) of the Université de Lyon for its financial support within the French program "Investments for the Future" operated by the National Research Agency (ANR). Data courtesy of the ARTFL Encyclopédie Project, University of Chicago.

  8. GLiNER Github Repo

    • kaggle.com
    Updated Oct 26, 2025
    Cite
    Darien Schettler (2025). GLiNER Github Repo [Dataset]. https://www.kaggle.com/dschettler8845/gliner-github-repo
    Explore at:
    Available download formats: zip (545226 bytes)
    Dataset updated
    Oct 26, 2025
    Authors
    Darien Schettler
    Description

    GLiNER : Generalist and Lightweight model for Named Entity Recognition

    GLiNER is a Named Entity Recognition (NER) model capable of identifying any entity type using a bidirectional transformer encoder (BERT-like). It provides a practical alternative to traditional NER models, which are limited to predefined entities, and to Large Language Models (LLMs), which, despite their flexibility, are costly and too large for resource-constrained scenarios.


    Models Status

    📢 Updates

    • 📝 Finetuning notebook is available: examples/finetune.ipynb
    • 🗂 Training dataset preprocessing scripts are now available in the data/ directory, covering both Pile-NER 📚 and NuNER 📘 datasets.

    Available Models on Hugging Face

    To Release

    • [ ] ⏳ GLiNER-Multiv2
    • [ ] ⏳ GLiNER-Sup (trained on mixture of NER datasets)

    Area of improvements / research

    • [ ] Allow longer context (e.g. train with long-context transformers such as Longformer, LED, etc.)
    • [ ] Use a bi-encoder (entity encoder and span encoder), allowing entity embeddings to be precomputed
    • [ ] Add a filtering mechanism to reduce the number of spans before final classification, to save memory and computation when the number of entity types is large
    • [ ] Improve understanding of more detailed prompts/instructions, e.g. "Find the first name of the person in the text"
    • [ ] Better loss function: for instance, use Focal Loss (see this paper) instead of BCE to handle class imbalance, as some entity types are more frequent than others
    • [ ] Improve multilingual capabilities: train on more languages, and use multilingual training data
    • [ ] Decoding: allow a span to have multiple labels, e.g. "Cristiano Ronaldo" is both a "person" and a "football player"
    • [ ] Dynamic thresholding (in model.predict_entities(text, labels, threshold=0.5)): allow the model to predict more or fewer entities depending on the context. Currently, the model tends to predict fewer entities when the entity type or the domain is not well represented in the training data.
    • [ ] Train with EMAs (Exponential Moving Averages) or merge multiple checkpoints to improve model robustness (see this paper)
    • [ ] Extend the model to relation extraction, which needs a dataset with relation annotations; see our preliminary work, ATG.

    Installation

    To use this model, you must install the GLiNER Python library: !pip install gliner

    Usage

    Once you've downloaded the GLiNER library, you can import the GLiNER class. You can then load this model using GLiNER.from_pretrained and predict entities with predict_entities.

    from gliner import GLiNER

    model = GLiNER.from_pretrained("urchade/gliner_base")

    text = """
    Cristiano Ronaldo dos Santos Aveiro (Portuguese pronunciation: [kɾiʃˈtjɐnu ʁɔˈnaldu]; born 5 February 1985) is a Portuguese professional footballer who plays as a forward for and captains both Saudi Pro League club Al Nassr and the Portugal national team. Widely regarded as one of the greatest players of all time, Ronaldo has won five Ballon d'Or awards,[note 3] a record three UEFA Men's Player of the Year Awards, and four European Golden Shoes, the most by a European player. He has won 33 trophies in his career, including seven league titles, five UEFA Champions Leagues, the UEFA European Championship and the UEFA Nations League. Ronaldo holds the records for most appearances (183), goals (140) and assists (42) in the Champions League, goals in the European Championship (14), international goals (128) and international appearances (205). He is one of the few players to have made over 1,200 professional career appearances, the most by an outfield player, and has scored over 850 offici...
    """

    # Labels can be any strings describing the entity types of interest.
    labels = ["person", "award", "date", "competitions", "teams"]

    entities = model.predict_entities(text, labels)

    for entity in entities:
        print(entity["text"], "=>", entity["label"])

  9. HiNER-original

    • huggingface.co
    Updated May 2, 2022
    Cite
    Computation for Indian Language Technology (2022). HiNER-original [Dataset]. https://huggingface.co/datasets/cfilt/HiNER-original
    Explore at:
    Dataset updated
    May 2, 2022
    Dataset authored and provided by
    Computation for Indian Language Technology
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This is the dataset repository for the HiNER dataset, accepted for publication at LREC 2022. The dataset can help build sequence labelling models for the task of Named Entity Recognition for the Hindi language.

  10. Multilingual named entity recognition for medieval charters. Datasets and models

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 16, 2023
    Cite
    Torres Aguilar, Sergio (2023). Multilingual named entity recognition for medieval charters. Datasets and models [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6463698
    Explore at:
    Dataset updated
    Jan 16, 2023
    Dataset provided by
    École nationale des chartes
    Authors
    Torres Aguilar, Sergio
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Annotated dataset for training named entity recognition models for medieval charters in Latin, French and Spanish.

    The original raw texts for all charters were collected from four charter collections.

    We include (i) the annotated training datasets, (ii) the contextual and static embeddings trained on medieval multilingual texts, and (iii) the named entity recognition models trained using two architectures: Bi-LSTM-CRF with stacked embeddings, and fine-tuning of BERT-based models (mBERT and RoBERTa).

    Codes, datasets and notebooks used to train models can be consulted in our gitlab repository: https://gitlab.com/magistermilitum/ner_medieval_multilingual

    Our best RoBERTa model is also available in the HuggingFace library: https://huggingface.co/magistermilitum/roberta-multilingual-medieval-ner

  11. PyVulDet-NER

    • data.mendeley.com
    • huggingface.co
    Updated Sep 19, 2023
    Cite
    Melanie Ehrenberg (2023). PyVulDet-NER [Dataset]. http://doi.org/10.17632/h22kxj6ydt.1
    Explore at:
    Dataset updated
    Sep 19, 2023
    Authors
    Melanie Ehrenberg
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The data in this repository is associated with a manuscript entitled "Python Source Code Vulnerability Detection with Named Entity Recognition". The paper has been submitted to the "DevSecOps: Advances for Secure Software Development" special issue of the "Computers & Security" journal. This research is part of an in-progress dissertation at George Washington University. In addition to the data shown in this repository, the following NER models were created with this data to identify 6 vulnerability types in Python source code:

    • https://huggingface.co/mmeberg/RoRo_PyVulDet_NER
    • https://huggingface.co/mmeberg/RoCo_PyVulDet_NER
    • https://huggingface.co/mmeberg/DiDi_PyVulDet_NER
    • https://huggingface.co/mmeberg/CoRo_PyVulDet_NER
    • https://huggingface.co/mmeberg/CoCo_PyVulDet_NER

  12. NLUCat

    • zenodo.org
    • huggingface.co
    Updated Mar 5, 2024
    Cite
    Zenodo (2024). NLUCat [Dataset]. http://doi.org/10.5281/zenodo.10362026
    Explore at:
    Dataset updated
    Mar 5, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    NLUCat is a dataset for NLU in Catalan. It consists of nearly 12,000 instructions annotated with the most relevant intents and spans. Each instruction is also accompanied by the instructions the annotator received when writing it.

    The intents taken into account are the habitual ones of a virtual home assistant (activity calendar, IOT, list management, leisure, etc.), but specific ones have also been added to take into account social and healthcare needs for vulnerable people (information on administrative procedures, menu and medication reminders, etc.).

    The spans have been annotated with a tag describing the type of information they contain. They are fine-grained, but can be easily grouped to use them in robust systems.

    The examples are not only written in Catalan; they also take into account the geographical and cultural reality of the speakers of this language (geographic points, cultural references, etc.).

    This dataset can be used to train models for intent classification, span identification and example generation.

    This is the complete version of the dataset. A version prepared to train and evaluate intent classifiers has been published in HuggingFace.

    This work is licensed under a CC0 International License.

    In this repository you'll find the following items:

    • NLUCat_annotation_guidelines.docx: the guidelines provided to the annotation team
    • NLUCat_dataset.json: the complete NLUCat dataset
    • NLUCat_stats.tsv: statistics about the NLUCat dataset
    • dataset: folder with the dataset as published in HuggingFace (https://huggingface.co/datasets/projecte-aina/NLUCat), split and prepared for training and evaluating intent classifiers
    • reports: folder with the reports given as feedback to the annotators during the annotation process

    This work was funded by the Departament de la Vicepresidència i de Polítiques Digitals i Territori de la Generalitat de Catalunya within the framework of Projecte AINA.

  13. kpwr-ner

    • huggingface.co
    Updated Apr 1, 2022
    Cite
    CLARIN-PL (2022). kpwr-ner [Dataset]. https://huggingface.co/datasets/clarin-pl/kpwr-ner
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Apr 1, 2022
    Dataset authored and provided by
    CLARIN-PL
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    KPWR-NER tagging dataset.

  14. Named Entity Recognition

    • sdiinnovation-geoplatform.hub.arcgis.com
    Updated May 28, 2022
    Cite
    Esri (2022). Named Entity Recognition [Dataset]. https://sdiinnovation-geoplatform.hub.arcgis.com/content/97369a6f1200428ba060410d13dbb078
    Explore at:
    Dataset updated
    May 28, 2022
    Dataset authored and provided by
    Esri (http://esri.com/)
    Description

    This deep learning model is used to identify or categorize entities in unstructured text. An entity may refer to a word or a sequence of words such as the name of “Organizations,” “Persons,” “Country,” or “Date” and “Time” in the text. This model detects entities from the given text and classifies them into pre-determined categories. Named entity recognition (NER) is useful when a high-level overview of a large quantity of text is required. NER can surface crucial and important information in text by extracting the main entities from it. The extracted entities are categorized into pre-determined classes and can help in drawing meaningful decisions and conclusions.

    Using the model

    Follow the guide to use the model. Before using this model, ensure that the supported deep learning libraries are installed. For more details, check the Deep Learning Libraries Installer for ArcGIS.

    Fine-tuning the model

    This model cannot be fine-tuned using ArcGIS tools.

    Input

    Text files on which named entity extraction will be performed.

    Output

    Tokens classified into the following pre-defined entity classes:

    • PERSON – People, including fictional
    • NORP – Nationalities or religious or political groups
    • FACILITY – Buildings, airports, highways, bridges, etc.
    • ORGANIZATION – Companies, agencies, institutions, etc.
    • GPE – Countries, cities, states
    • LOCATION – Non-GPE locations, mountain ranges, bodies of water
    • PRODUCT – Vehicles, weapons, foods, etc. (not services)
    • EVENT – Named hurricanes, battles, wars, sports events, etc.
    • WORK OF ART – Titles of books, songs, etc.
    • LAW – Named documents made into laws
    • LANGUAGE – Any named language
    • DATE – Absolute or relative dates or periods
    • TIME – Times smaller than a day
    • PERCENT – Percentage (including “%”)
    • MONEY – Monetary values, including unit
    • QUANTITY – Measurements, as of weight or distance
    • ORDINAL – “first,” “second”
    • CARDINAL – Numerals that do not fall under another type

    Model architecture

    This model uses the XLM-RoBERTa architecture implemented in Hugging Face transformers using the TNER library.

    Accuracy metrics

    This model has an accuracy of 91.6 percent.

    Training data

    The model has been trained on the OntoNotes Release 5.0 dataset.

    Sample results

    Here are a few results from the model.

    Citations

    Weischedel, Ralph, et al. OntoNotes Release 5.0 LDC2013T19. Web Download. Philadelphia: Linguistic Data Consortium, 2013.

    Asahi Ushio and Jose Camacho-Collados. 2021. TNER: An all-round Python library for transformer-based named entity recognition. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, pages 53–62, Online. Association for Computational Linguistics.

  15. Few-NERD

    • kaggle.com
    • opendatalab.com
    • +1 more
    Updated Jun 3, 2021
    Cite
    Nicholas Broad (2021). Few-NERD [Dataset]. https://www.kaggle.com/nbroad/fewnerd
    Explore at:
    Available download formats: zip (49275322 bytes)
    Dataset updated
    Jun 3, 2021
    Authors
    Nicholas Broad
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Few-NERD is a large-scale, fine-grained, manually annotated named entity recognition dataset, which contains 8 coarse-grained types, 66 fine-grained types, 188,200 sentences, 491,711 entities and 4,601,223 tokens. Three benchmark tasks are built: one supervised (Few-NERD (SUP)) and two few-shot (Few-NERD (INTRA) and Few-NERD (INTER)). Few-NERD is collected by researchers from Tsinghua University and DAMO Academy, Alibaba Group.

    For more details about Few-NERD, please refer to the ACL-IJCNLP 2021 paper: https://arxiv.org/abs/2105.07464

    The official Few-NERD website is here: https://ningding97.github.io/fewnerd/

  16. german-ler

    • huggingface.co
    • opendatalab.com
    Updated Nov 2, 2024
    Cite
    Elena Leitner (2024). german-ler [Dataset]. http://doi.org/10.57967/hf/0046
    Explore at:
    Dataset updated
    Nov 2, 2024
    Authors
    Elena Leitner
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Card for "German LER"

      Dataset Summary
    

    A dataset of legal documents from German federal court decisions for Named Entity Recognition. The dataset is human-annotated with 19 fine-grained entity classes. It consists of approx. 67,000 sentences and contains 54,000 annotated entities. NER tags use the BIO tagging scheme. The dataset includes two different versions of annotations, one with a set of 19 fine-grained semantic classes (ner_tags) and another one… See the full description on the dataset page: https://huggingface.co/datasets/elenanereiss/german-ler.

  17. HAREM Portuguese NER Corpus

    • kaggle.com
    Updated Dec 2, 2023
    Cite
    The Devastator (2023). HAREM Portuguese NER Corpus [Dataset]. https://www.kaggle.com/thedevastator/harem-portuguese-ner-corpus
    Explore at:
    Available download formats: zip (258157 bytes)
    Dataset updated
    Dec 2, 2023
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    HAREM Portuguese NER Corpus

    Portuguese NER Corpus with 10 Classes

    By harem (From Huggingface) [source]

    About this dataset

    The dataset is available in two versions: a complete version with 10 different named entity classes, including Person, Organization, Location, Value, Date, Title, Thing, Event, Abstraction, and Other; and a selective version with only 5 classes (Person, Organization, Location, Value, and Date). The selective version focuses on the most commonly recognized named entity types.

    It's worth noting that the original HAREM dataset had two levels of NER detail: Category and Sub-type. However, the processed version of the corpus presented in this Kaggle dataset only includes information up to the Category level.

    Each entry in this dataset consists of tokenized words from the original text along with their corresponding NER tags assigned through annotation. The tokens column contains the individual words or tokens extracted from the text; a duplicate of this column is provided for consistency.

    Furthermore, the ner_tags column assigns each token a class label indicating its named entity class, such as Person or Organization. An identical duplicate of ner_tags is likewise included to keep datasets consistent where both columns co-occur.

    This particular Kaggle dataset also contains three separate CSV files: train.csv for training data; validation.csv, a subset used for validating NER model performance on Portuguese texts; and test.csv, another subset of the HAREM corpus with tokenized words alongside their respective NER tags. The availability of separate files enables users to efficiently train, test, and validate NER models on Portuguese texts using reliable sources.

    How to use the dataset


    • Dataset Files: a) train.csv - Contains the training data with tokens (individual words or tokens) and their corresponding named entity recognition (NER) tags. b) validation.csv - Provides a subset of the corpus for validating model performance in identifying named entities. c) test.csv - Contains tokenized words from the corpus along with their respective NER tags.

    • Named Entity Classes: The dataset includes 10 different named entity classes: Person, Organization, Location, Value, Date, Title, Thing, Event, Abstraction, and Other.

    • Understanding the Columns: a) tokens - This column comprises the individual tokens or words extracted from the text. b) ner_tags - This column lists the named entity recognition tag assigned to each token, indicating its class.

    • Training and Evaluation: To use this dataset for training a NER model, you can utilize the train.csv file. The tokens column will provide you with the words or tokens, while the ner_tags column will guide you in labeling the named entities within your training data.

    For evaluating your model's performance, the validation.csv file can be used. Similar to the train.csv file, it contains tokenized words and their corresponding NER tags.

    • Applying Pretrained Models: You can also use this dataset to fine-tune or evaluate pretrained NER models in Portuguese. By applying transfer learning techniques to this corpus, you may improve their performance on named entity recognition tasks specific to Portuguese.
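
    As a sketch of the loading step described above: the column names tokens and ner_tags come from the dataset description, but the sample row and its tag spellings are invented for illustration, so inspect the real train.csv before relying on them.

    ```python
    import ast
    import csv
    import io

    # Hypothetical excerpt mimicking the described train.csv layout:
    # each row stores one sentence's tokens and ner_tags as list literals.
    sample = io.StringIO(
        'tokens,ner_tags\n'
        '"[\'Lisboa\', \'é\', \'linda\']","[\'B-Location\', \'O\', \'O\']"\n'
    )

    for row in csv.DictReader(sample):
        tokens = ast.literal_eval(row["tokens"])   # list of words
        tags = ast.literal_eval(row["ner_tags"])   # parallel list of tags
        labeled = list(zip(tokens, tags))
        print(labeled)
    ```

    For the real files, replace the io.StringIO sample with open('train.csv', encoding='utf-8').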

    Research Ideas

    • Entity Recognition and Classification: This dataset can be used to train and evaluate models for named entity recognition (NER) tasks in Portuguese. The NER tags provided in the dataset can serve as labels for training models to accurately identify and classify entities such as person names, organization names, locations, dates, etc.
    • Cross-lingual Transfer Learning: The dataset can also be leveraged for cross-lingual transfer learning by training models on it and then using the trained model to extract named entities from other languages. This would enable NER in multiple languages with a single trained model, leveraging knowledge gained from this rich resource of labeled data in Portuguese.

    Acknowledgements

    If you use this dataset in your research, please credit the original authors (Data Source).

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, mod...

  18. h

    AnythingNER

    • huggingface.co
    Updated Oct 6, 2024
    Cite
    CascadeNER (2024). AnythingNER [Dataset]. https://huggingface.co/datasets/CascadeNER/AnythingNER
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Oct 6, 2024
    Authors
    CascadeNER
    Description

    DynamicNER: A Dynamic, Multilingual, and Fine-Grained Dataset for LLM-based Named Entity Recognition

    This repository is supplement material for the paper: DynamicNER: A Dynamic, Multilingual, and Fine-Grained Dataset for LLM-based Named Entity Recognition

      💓Update!
    

    DynamicNER is publicly released now! Please download it from Hugging Face for training and evaluation!

    We have also added more existing datasets in GEIC format, along with the format for fine-tuning and inference based on SWIFT. You can… See the full description on the dataset page: https://huggingface.co/datasets/CascadeNER/AnythingNER.

  19. h

    PII-NER

    • huggingface.co
    Updated Jul 20, 2024
    Cite
    Joseph G Flowers (2024). PII-NER [Dataset]. https://huggingface.co/datasets/Josephgflowers/PII-NER
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jul 20, 2024
    Authors
    Joseph G Flowers
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card for NER PII Extraction Dataset Dataset Summary This dataset is designed for training and evaluating Named Entity Recognition (NER) models focused on extracting Personally Identifiable Information (PII) from text. It includes a variety of entities such as names, addresses, phone numbers, email addresses, and identification numbers. The dataset is suitable for tasks that involve PII detection, compliance checks, and data anonymization. Supported Tasks and Leaderboards Named Entity… See the full description on the dataset page: https://huggingface.co/datasets/Josephgflowers/PII-NER.
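
    Since the summary names data anonymization as a target task, here is a minimal sketch of masking PII once character-offset entity spans have been predicted; the offsets and label names are invented for illustration and are not the dataset's actual annotation format:

    ```python
    def mask_pii(text, spans):
        """Replace each (start, end, label) character span with a [LABEL] tag.

        Spans are applied right-to-left so earlier offsets remain valid.
        """
        for start, end, label in sorted(spans, reverse=True):
            text = text[:start] + "[" + label + "]" + text[end:]
        return text

    # Hypothetical NER output: character offsets plus entity labels.
    text = "Contact Jane Doe at jane@example.com"
    spans = [(8, 16, "NAME"), (20, 36, "EMAIL")]
    print(mask_pii(text, spans))  # Contact [NAME] at [EMAIL]
    ```

    Applying the replacements right-to-left avoids recomputing offsets after each substitution changes the string length.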

  20. The Chilean Waiting List Corpus

    • zenodo.org
    txt, zip
    Updated Jan 24, 2023
    Cite
    Pablo Báez; Fabián Villena; Matías Rojas; Felipe Bravo-Marquez; Jocelyn Dunstan (2023). The Chilean Waiting List Corpus [Dataset]. http://doi.org/10.5281/zenodo.7555181
    Explore at:
    zip, txt. Available download formats
    Dataset updated
    Jan 24, 2023
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Pablo Báez; Fabián Villena; Matías Rojas; Felipe Bravo-Marquez; Jocelyn Dunstan
    Area covered
    Chile
    Description

    Here we describe a new clinical corpus rich in nested entities and a series of neural models to identify them. The corpus comprises de-identified referrals from the waiting list in Chilean public hospitals. A subset of 9,000 referrals (medical and dental) was manually annotated with ten types of entities, six attributes, and pairs of relations with clinical relevance. A trained medical doctor or dentist annotated these referrals and then, together with three other researchers, consolidated each of the annotations. In the annotated corpus, more than 48% of entities are embedded in another entity or contain one themselves. We use this corpus to build Named Entity Recognition (NER) models. The best results were achieved using Multiple Single-entity architectures with clinical word embeddings stacked with character and Flair contextual embeddings (refer to this paper: https://aclanthology.org/2022.coling-1.184/). The entity with the best performance is abbreviation, and the hardest to recognize is finding. NER models applied to this corpus can leverage statistics of diseases and pending procedures. This work constitutes the first annotated corpus using clinical narratives from Chile and one of the few in Spanish. The annotated corpus, clinical word embeddings, annotation guidelines, and neural models are freely released to the community. This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/4.0/.
    We are releasing the dataset in 3 formats:

    • cwlc.zip: Contains the raw text files for each document along with its annotation file in Standoff format
    • cwlc_conll-format: The corpus in CoNLL format, for training NER models.

    In addition, the dataset has been released on Hugging Face (https://huggingface.co/plncmm) to facilitate experiments with transformer-based architectures.
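
    The CoNLL release mentioned above stores one token per line with its tag, and separates sentences with blank lines. A minimal reader, assuming that layout; the sample text and the B-Finding tag spelling are invented, although "finding" is one of the entity types described:

    ```python
    def read_conll(text):
        """Parse 'token TAG' lines; blank lines separate sentences."""
        sentences, current = [], []
        for line in text.splitlines():
            line = line.strip()
            if not line:
                if current:
                    sentences.append(current)
                    current = []
            else:
                parts = line.split()
                current.append((parts[0], parts[-1]))  # token, last column = tag
        if current:
            sentences.append(current)
        return sentences

    # Invented two-sentence sample; the real corpus's tag names may differ.
    sample = "Paciente O\ncon O\nneumonía B-Finding\n\nControl O\n"
    print(read_conll(sample))
    ```

    Taking the last column as the tag keeps the reader working even if the files carry extra middle columns, as some CoNLL variants do.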
