100+ datasets found
  1. kaggle-entity-annotated-corpus-ner-dataset

    • huggingface.co
    Updated Jul 10, 2022
    Cite
    Rafael Arias Calles (2022). kaggle-entity-annotated-corpus-ner-dataset [Dataset]. https://huggingface.co/datasets/rjac/kaggle-entity-annotated-corpus-ner-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Jul 10, 2022
    Authors
    Rafael Arias Calles
    License

    https://choosealicense.com/licenses/odbl/

    Description

    Date: 2022-07-10. Files: ner_dataset.csv. Source: Kaggle entity annotated corpus. Notes: the dataset only contains the tokens and NER tag labels. Labels are uppercase.

      About Dataset
    

    from Kaggle Datasets

      Context
    

    Annotated corpus for named entity recognition using the GMB (Groningen Meaning Bank) corpus for entity classification, with enhanced and popular features by natural language processing applied to the data set. Tip: use a Pandas DataFrame to load the dataset if using Python for… See the full description on the dataset page: https://huggingface.co/datasets/rjac/kaggle-entity-annotated-corpus-ner-dataset.
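Following that tip, here is a minimal sketch of loading the token/tag layout with pandas. The column names (Sentence #, Word, POS, Tag) and the in-memory sample are assumptions based on the common Kaggle entity-annotated-corpus layout, not taken from the dataset card:

```python
import io
import pandas as pd

# In the Kaggle layout, the "Sentence #" column is only filled on the first
# token of each sentence; the sample below mimics that (hypothetical rows).
csv_sample = io.StringIO(
    "Sentence #,Word,POS,Tag\n"
    "Sentence: 1,Thousands,NNS,O\n"
    ",of,IN,O\n"
    ",demonstrators,NNS,O\n"
    ",London,NNP,B-GEO\n"
)

df = pd.read_csv(csv_sample)  # the real file may need encoding="latin-1"
df["Sentence #"] = df["Sentence #"].ffill()  # propagate sentence ids down the rows

# group tokens back into sentences as (word, tag) pairs
sentences = df.groupby("Sentence #").apply(
    lambda g: list(zip(g["Word"], g["Tag"]))
)
print(sentences.iloc[0])
```

The `ffill` step is what turns the sparse "Sentence #" column back into a usable grouping key.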

  2. universal_ner

    • huggingface.co
    Updated Sep 3, 2024
    + more versions
    Cite
    Universal NER (2024). universal_ner [Dataset]. https://huggingface.co/datasets/universalner/universal_ner
    Explore at:
    Dataset updated
    Sep 3, 2024
    Dataset authored and provided by
    Universal NER
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Universal Named Entity Recognition (UNER) aims to fill a gap in multilingual NLP: high-quality NER datasets in many languages with a shared tagset.

    UNER is modeled after the Universal Dependencies project, in that it is intended to be a large community annotation effort with language-universal guidelines. Further, we use the same text corpora as Universal Dependencies.

  3. Pile-NER-definition

    • huggingface.co
    Updated Aug 13, 2023
    + more versions
    Cite
    Universal-NER (2023). Pile-NER-definition [Dataset]. https://huggingface.co/datasets/Universal-NER/Pile-NER-definition
    Explore at:
    Croissant
    Dataset updated
    Aug 13, 2023
    Authors
    Universal-NER
    Description

    Intro

    Pile-NER-definition is a set of GPT-generated data for named entity recognition using the definition-based data construction prompt. It was collected by prompting gpt-3.5-turbo-0301 and augmented by negative sampling. Check our project page for more information.

      License
    

    Attribution-NonCommercial 4.0 International

  4. Multilingual named entity recognition for medieval charters. Datasets and models

    • zenodo.org
    zip
    Updated Jan 16, 2023
    Cite
    Sergio Torres Aguilar (2023). Multilingual named entity recognition for medieval charters. Datasets and models [Dataset]. http://doi.org/10.5281/zenodo.6463699
    Explore at:
    Available download formats: zip
    Dataset updated
    Jan 16, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Sergio Torres Aguilar
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Annotated dataset for training named entities recognition models for medieval charters in Latin, French and Spanish.

    The original raw texts for all charters were collected from four charters collections

    - HOME-ALCAR corpus : https://zenodo.org/record/5600884

    - CBMA : http://www.cbma-project.eu

    - Diplomata Belgica : https://www.diplomata-belgica.be

    - CODEA corpus : https://corpuscodea.es/

    We include (i) the annotated training datasets, (ii) the contextual and static embeddings trained on medieval multilingual texts, and (iii) the named entity recognition models trained using two architectures: Bi-LSTM-CRF with stacked embeddings, and fine-tuning of BERT-based models (mBERT and RoBERTa).

    Codes, datasets and notebooks used to train the models can be consulted in our GitLab repository: https://gitlab.com/magistermilitum/ner_medieval_multilingual

    Our best RoBERTa model is also available on the Hugging Face Hub: https://huggingface.co/magistermilitum/roberta-multilingual-medieval-ner

  5. Data from: PyTorch model for Slovenian Named Entity Recognition SloNER 1.0

    • live.european-language-grid.eu
    Updated Jan 26, 2023
    Cite
    (2023). PyTorch model for Slovenian Named Entity Recognition SloNER 1.0 [Dataset]. https://live.european-language-grid.eu/catalogue/tool-service/20980
    Explore at:
    Dataset updated
    Jan 26, 2023
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    SloNER is a model for Slovenian Named Entity Recognition. It is a PyTorch neural network model, intended for use with the HuggingFace transformers library (https://github.com/huggingface/transformers).

    The model is based on the Slovenian RoBERTa contextual embeddings model SloBERTa 2.0 (http://hdl.handle.net/11356/1397). The model was trained on the SUK 1.0 training corpus (http://hdl.handle.net/11356/1747). The source code of the model is available in the GitHub repository https://github.com/clarinsi/SloNER.

  6. aeroBERT-NER

    • huggingface.co
    Updated Apr 7, 2023
    Cite
    Archana Tikayat Ray (2023). aeroBERT-NER [Dataset]. http://doi.org/10.57967/hf/0470
    Explore at:
    Dataset updated
    Apr 7, 2023
    Authors
    Archana Tikayat Ray
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Card for aeroBERT-NER

      Dataset Summary
    

    This dataset contains sentences from the aerospace requirements domain. The sentences are tagged for five NER categories (SYS, VAL, ORG, DATETIME, and RES) using the BIO tagging scheme. There are a total of 1432 sentences. The creation of this dataset is aimed at:
    (1) making available an open-source dataset for aerospace requirements, which are often proprietary;
    (2) fine-tuning language models for token identification… See the full description on the dataset page: https://huggingface.co/datasets/archanatikayatray/aeroBERT-NER.
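Since the card specifies BIO tagging over SYS, VAL, ORG, DATETIME and RES, a small sketch of decoding such tags back into labelled spans may be useful. The tokens and tags below are invented for illustration, not drawn from the dataset:

```python
def bio_to_spans(tokens, tags):
    """Collect (entity_type, entity_text) spans from a BIO-tagged sequence."""
    spans, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):            # start a new entity
            if current:
                spans.append(current)
            current = (tag[2:], [token])
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(token)        # continue the open entity
        else:                               # O tag, or an I- without a matching B-
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(label, " ".join(words)) for label, words in spans]

# Hypothetical aerospace-style sentence using the card's SYS and VAL labels
tokens = ["The", "system", "shall", "operate", "below", "150", "degrees"]
tags   = ["O", "B-SYS", "O", "O", "O", "B-VAL", "I-VAL"]
print(bio_to_spans(tokens, tags))  # [('SYS', 'system'), ('VAL', '150 degrees')]
```

Note that a stray I- tag with no preceding B- of the same type is dropped rather than starting a span, which is the conventional strict-BIO reading.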

  7. GeoEDdA: A Gold Standard Dataset for Named Entity Recognition and Span Categorization Annotations of Diderot & d'Alembert's Encyclopédie

    • data.niaid.nih.gov
    • zenodo.org
    Updated Mar 20, 2024
    Cite
    Moncla, Ludovic (2024). GeoEDdA: A Gold Standard Dataset for Named Entity Recognition and Span Categorization Annotations of Diderot & d'Alembert's Encyclopédie [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10530177
    Explore at:
    Dataset updated
    Mar 20, 2024
    Dataset provided by
    Moncla, Ludovic
    McDonough, Katherine
    Vigier, Denis
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This repository contains a gold standard dataset for named entity recognition and span categorization annotations from Diderot & d’Alembert’s Encyclopédie entries.

    The dataset is available in the following formats:

    JSONL format provided by Prodigy

    binary spaCy format (ready to use with the spaCy train pipeline)

    The Gold Standard dataset is composed of 2,200 paragraphs from 2,001 randomly selected Encyclopédie entries. All paragraphs were written in 18th-century French.

    The spans/entities were labelled by the project team, using pre-labelling with early machine-learning models to speed up the labelling process. A train/val/test split was used. Validation and test sets are composed of 200 paragraphs each: 100 classified under 'Géographie' and 100 from another knowledge domain. The datasets have the following breakdown of tokens and spans/entities.

    Tagset

    NC-Spatial: a common noun that identifies a spatial entity (nominal spatial entity) including natural features, e.g. ville, la rivière, royaume.

    NP-Spatial: a proper noun identifying the name of a place (spatial named entities), e.g. France, Paris, la Chine.

    ENE-Spatial: nested spatial entity, e.g. ville de France, royaume de Naples, la mer Baltique.

    Relation: spatial relation, e.g. dans, sur, à 10 lieues de.

    Latlong: geographic coordinates, e.g. Long. 19. 49. lat. 43. 55. 44.

    NC-Person: a common noun that identifies a person (nominal person entity), e.g. roi, l'empereur, les auteurs.

    NP-Person: a proper noun identifying the name of a person (person named entities), e.g. Louis XIV, Pline.

    ENE-Person: nested people entity, e.g. le czar Pierre, roi de Macédoine.

    NP-Misc: a proper noun identifying entities not classified as spatial or person, e.g. l'Eglise, 1702, Pélasgique.

    ENE-Misc: nested named entity not classified as spatial or person, e.g. l'ordre de S. Jacques, la déclaration du 21 Mars 1671.

    Head: entry name

    Domain-Mark: words indicating the knowledge domain (usually after the head and in parentheses), e.g. Géographie, Geog., en Anatomie.

    HuggingFace

    The GeoEDdA dataset is available on the HuggingFace Hub: https://huggingface.co/datasets/GEODE/GeoEDdA

    spaCy Custom Spancat trained on Diderot & d’Alembert’s Encyclopédie entries

    This dataset was used to train and evaluate a custom spancat model for French using spaCy. The model is available on HuggingFace's model hub: https://huggingface.co/GEODE/fr_spacy_custom_spancat_edda.

    Acknowledgement

    The authors are grateful to the ASLAN project (ANR-10-LABX-0081) of the Université de Lyon for its financial support within the French program "Investments for the Future" operated by the National Research Agency (ANR). Data courtesy of the ARTFL Encyclopédie Project, University of Chicago.

  8. GermEval Dataset

    • paperswithcode.com
    Updated Sep 6, 2021
    Cite
    (2021). GermEval Dataset [Dataset]. https://paperswithcode.com/dataset/germeval-2021-toxic-engaging-fact-claiming
    Explore at:
    Dataset updated
    Sep 6, 2021
    Description

    The GermEval dataset is a valuable resource for natural language processing (NLP) tasks, specifically named entity recognition (NER), conducted in the German language. Here are some key details about this dataset:

    • Task: token classification (specifically, named entity recognition)
    • Language: German
    • Size: 100K < n < 1M tokens
    • Source: sampled from German Wikipedia and news corpora, comprising a collection of citations
    • Annotations: created through crowdsourcing
    • License: cc-by-4.0
    • Content: over 31,000 sentences, corresponding to more than 590,000 tokens
    • Purpose: training and evaluating NER models for German text

    You can find more information and explore the dataset on the Hugging Face Datasets page ¹.

    (1) germeval_14 · Datasets at Hugging Face. https://huggingface.co/datasets/germeval_14
    (2) GermEval-2018 Corpus (DE) - Empirical Linguistics and ... - heiDATA. https://heidata.uni-heidelberg.de/dataset.xhtml?persistentId=doi:10.11588/data/0B5VML
    (3) GermEval 2014 Named Entity Recognition Shared Task - Data and Task Setup. https://sites.google.com/site/germeval2014ner/data
    (4) 6 Best German Language Datasets of 2022 | Twine - Twine Blog. https://www.twine.net/blog/best-german-language-datasets/
    (5) germeval_14 | TensorFlow Datasets. https://www.tensorflow.org/datasets/community_catalog/huggingface/germeval_14

  9. NLUCat

    • zenodo.org
    • huggingface.co
    • +1more
    zip
    Updated Mar 4, 2024
    Cite
    Zenodo (2024). NLUCat [Dataset]. http://doi.org/10.5281/zenodo.10721193
    Explore at:
    Available download formats: zip
    Dataset updated
    Mar 4, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    NLUCat

    Dataset Description

    Dataset Summary

    NLUCat is a dataset of NLU in Catalan. It consists of nearly 12,000 instructions annotated with the most relevant intents and spans. Each instruction is also accompanied by the instructions the annotator received when writing it.

    The intents taken into account are the habitual ones of a virtual home assistant (activity calendar, IoT, list management, leisure, etc.), but specific ones have also been added to cover the social and healthcare needs of vulnerable people (information on administrative procedures, menu and medication reminders, etc.).

    The spans have been annotated with a tag describing the type of information they contain. They are fine-grained, but can be easily grouped to use them in robust systems.

    The examples are not only written in Catalan, but they also take into account the geographical and cultural reality of the speakers of this language (geographic points, cultural references, etc.).

    This dataset can be used to train models for intent classification, spans identification and examples generation.

    This is the complete version of the dataset. A version prepared to train and evaluate intent classifiers has been published in HuggingFace.

    In this repository you'll find the following items:

    • NLUCat_annotation_guidelines.docx: the guidelines provided to the annotation team
    • NLUCat_dataset.json: the complete NLUCat dataset
    • NLUCat_stats.tsv: statistics about the NLUCat dataset
    • dataset: folder with the dataset as published in HuggingFace, split and prepared for training and evaluating intent classifiers
    • reports: folder with the reports provided as feedback to the annotators during the annotation process

    This dataset can be used for any purpose, whether academic or commercial, under the terms of CC BY 4.0: give appropriate credit, provide a link to the license, and indicate if changes were made.

    Supported Tasks and Leaderboards

    Intent classification, spans identification and examples generation.

    Languages

    The dataset is in Catalan (ca-ES).

    Dataset Structure

    Data Instances

    Three JSON files, one for each split.

    Data Fields

    • example: `str`. Example
    • annotation: `dict`. Annotation of the example
    • intent: `str`. Intent tag
    • slots: `list`. List of slots
    • Tag: `str`. Tag of the slot
    • Text: `str`. Text of the slot
    • Start_char: `int`. First character of the span
    • End_char: `int`. Last character of the span

    Example


    An example looks as follows:

    {
      "example": "Demana una ambulància; la meva dona està de part.",
      "annotation": {
        "intent": "call_emergency",
        "slots": [
          {
            "Tag": "service",
            "Text": "ambulància",
            "Start_char": 11,
            "End_char": 21
          },
          {
            "Tag": "situation",
            "Text": "la meva dona està de part",
            "Start_char": 23,
            "End_char": 48
          }
        ]
      }
    }
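Reading the example, Start_char/End_char appear to be zero-based character offsets with an exclusive end, Python-slice style; this is an inference from the numbers above, not something the card states. A quick sketch verifying that assumption against the example:

```python
# Check that each slot's character offsets recover its text, assuming
# End_char is exclusive (Python-slice convention; an inference, see above).
example = {
    "example": "Demana una ambulància; la meva dona està de part.",
    "annotation": {
        "intent": "call_emergency",
        "slots": [
            {"Tag": "service", "Text": "ambulància", "Start_char": 11, "End_char": 21},
            {"Tag": "situation", "Text": "la meva dona està de part", "Start_char": 23, "End_char": 48},
        ],
    },
}

text = example["example"]
for slot in example["annotation"]["slots"]:
    span = text[slot["Start_char"]:slot["End_char"]]
    assert span == slot["Text"], (span, slot["Text"])
print("all slot offsets verified")
```

If the offsets were inclusive instead, the second assertion would fail, so this check is a cheap way to confirm the convention before training a span-extraction model on the data.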


    Data Splits

    • NLUCat.train: 9128 examples
    • NLUCat.dev: 1441 examples
    • NLUCat.test: 1441 examples

    Dataset Creation

    Curation Rationale

    We created this dataset to contribute to the development of language models in Catalan, a low-resource language.

    When creating this dataset, we took into account not only the language but the entire socio-cultural reality of the Catalan-speaking population. Special consideration was also given to the needs of the vulnerable population.

    Source Data

    Initial Data Collection and Normalization

    We commissioned a company to create fictitious examples for the creation of this dataset.

    Who are the source language producers?

    We commissioned the writing of the examples to the company m47 labs.

    Annotations

    Annotation process

    The elaboration of this dataset was done in three steps, taking as a model the process followed by the NLU-Evaluation-Data dataset, as explained in the paper.
    * First step: translation or elaboration of the instructions given to the annotators to write the examples.
    * Second step: writing the examples. This step also includes the grammatical correction and normalization of the texts.
    * Third step: annotating the intents and the slots of each example. In this step, some modifications were made to the annotation guides to adjust them to the real situations.

    Who are the annotators?

    The drafting of the examples and their annotation was entrusted to the company m47 labs through a public tender process.

    Personal and Sensitive Information

    No personal or sensitive information is included.

    The examples used for the preparation of this dataset are fictitious and, therefore, the information shown is not real.

    Considerations for Using the Data

    Social Impact of Dataset

    We hope that this dataset will help the development of virtual assistants in Catalan, a language that is often not taken into account, and that it will especially help to improve the quality of life of people with special needs.

    Discussion of Biases

    When writing the examples, the annotators were asked to take into account the socio-cultural reality (geographic points, artists and cultural references, etc.) of the Catalan-speaking population.
    Likewise, they were asked to be careful to avoid examples that reinforce the stereotypes that exist in this society. For example: be careful with the gender or origin of personal names that are associated with certain activities.

    Other Known Limitations

    [N/A]

    Additional Information

    Dataset Curators

    Language Technologies Unit at the Barcelona Supercomputing Center (langtech@bsc.es)

    This work has been promoted and financed by the Generalitat de Catalunya through the Aina project.

    Licensing Information

    This dataset can be used for any purpose, whether academic or commercial, under the terms of the CC BY 4.0.
    Give appropriate credit, provide a link to the license, and indicate if changes were made.

    Citation Information


    Contributions

    The drafting of the examples and their annotation was entrusted to the company m47 labs through a public tender process.

  10. polyglot_ner

    • huggingface.co
    • paperswithcode.com
    • +1more
    Updated May 17, 2024
    Cite
    Rami Al-Rfou (2024). polyglot_ner [Dataset]. https://huggingface.co/datasets/rmyeid/polyglot_ner
    Explore at:
    Dataset updated
    May 17, 2024
    Authors
    Rami Al-Rfou
    License

    https://choosealicense.com/licenses/unknown/

    Description

    Polyglot-NER: a training dataset automatically generated from Wikipedia and Freebase for the task of named entity recognition. The dataset contains the basic Wikipedia-based training data for the 40 languages covered (with coreference resolution). The details of the procedure for generating it are outlined in Section 3 of the paper (https://arxiv.org/abs/1410.3791). Each config contains the data corresponding to a different language. For example, "es" includes only Spanish examples.

  11. bioleaflets-biomedical-ner

    • huggingface.co
    Updated May 7, 2023
    + more versions
    Cite
    Ruslan Yermak (2023). bioleaflets-biomedical-ner [Dataset]. https://huggingface.co/datasets/ruslan/bioleaflets-biomedical-ner
    Explore at:
    Croissant
    Dataset updated
    May 7, 2023
    Authors
    Ruslan Yermak
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Card for BioLeaflets Dataset

      Dataset Summary
    

    BioLeaflets is a biomedical dataset for Data2Text generation. It is a corpus of 1,336 package leaflets of medicines authorised in Europe, which were obtained by scraping the European Medicines Agency (EMA) website. Package leaflets are included in the packaging of medicinal products and contain information to help patients use the product safely and appropriately. This dataset comprises the large majority (∼ 90%) of… See the full description on the dataset page: https://huggingface.co/datasets/ruslan/bioleaflets-biomedical-ner.

  12. Wikipedia corpus for synthetic data made for Handwritten Text Recognition and Named Entity Recognition

    • zenodo.org
    txt, zip
    Updated Jul 3, 2025
    Cite
    Thomas CONSTUM; Thomas CONSTUM (2025). Wikipedia corpus for synthetic data made for Handwritten Text Recognition and Named Entity Recognition [Dataset]. http://doi.org/10.1007/s10032-024-00511-9
    Explore at:
    Available download formats: zip, txt
    Dataset updated
    Jul 3, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Thomas CONSTUM; Thomas CONSTUM
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This repository contains the corpus necessary for the synthetic data generation of DANIEL, which is available on GitHub and described in the paper "DANIEL: a fast document attention network for information extraction and labelling of handwritten documents", authored by Thomas Constum, Pierrick Tranouez, and Thierry Paquet (LITIS, University of Rouen Normandie).

    The paper has been accepted for publication in the International Journal on Document Analysis and Recognition (IJDAR) and is also accessible on arXiv.

    The contents of the archive should be placed in the Datasets/raw directory of the DANIEL codebase.

    Contents of the archive:

    • wiki_en: An English text corpus stored in the Hugging Face datasets library format. Each entry contains the full text of a Wikipedia article.

    • wiki_en_ner: An English text corpus enriched with named entity annotations following the OntoNotes v5 ontology. Named entities are encoded using special symbols. The corpus is stored in the Hugging Face datasets format, and each entry corresponds to a Wikipedia article with annotated entities.

    • wiki_fr: A French text corpus for synthetic data generation, also stored in the Hugging Face datasets format. Each entry contains the full text of a French Wikipedia article.

    • wiki_de.txt: A German text corpus in plain text format, with one sentence per line. The content originates from the Wortschatz Leipzig repository and has been normalized to match the vocabulary used in DANIEL.

    Data format for corpora in Hugging Face datasets structure:

    Each record in the datasets follows the dictionary structure below:

    {
    "id": "

  13. ESG-DLT-NER

    • huggingface.co
    • paperswithcode.com
    Updated Aug 16, 2024
    Cite
    Exponential Science (2024). ESG-DLT-NER [Dataset]. https://huggingface.co/datasets/ExponentialScience/ESG-DLT-NER
    Explore at:
    Dataset updated
    Aug 16, 2024
    Dataset authored and provided by
    Exponential Science
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Card for ESG/DLT Named Entity Recognition Dataset

    This dataset contains named entities related to Distributed Ledger Technology (DLT) and Environmental, Social, and Governance (ESG) topics created to support research in these areas and at the intersection of these domains.

      Dataset Details

      Dataset Sources

    Repository: https://github.com/dlt-science/ESG-DLT-LitReview
    Paper: https://arxiv.org/abs/2308.12420

      Use
    

    This dataset can be used for… See the full description on the dataset page: https://huggingface.co/datasets/ExponentialScience/ESG-DLT-NER.

  14. synthetic-pii-ner-mistral-v1

    • huggingface.co
    Updated Apr 20, 2024
    Cite
    Urchade Zaratiana (2024). synthetic-pii-ner-mistral-v1 [Dataset]. https://huggingface.co/datasets/urchade/synthetic-pii-ner-mistral-v1
    Explore at:
    Croissant
    Dataset updated
    Apr 20, 2024
    Authors
    Urchade Zaratiana
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This is the synthetic dataset used for training https://huggingface.co/urchade/gliner_multi_pii-v1. You can get it by browsing the files and downloading the data.json file.

  15. AttackER: NER Attack Attribution

    • zenodo.org
    bin, json, zip
    Updated Aug 15, 2024
    Cite
    Anonymous; Anonymous (2024). AttackER: NER Attack Attribution [Dataset]. http://doi.org/10.5281/zenodo.10276922
    Explore at:
    Available download formats: bin, zip, json
    Dataset updated
    Aug 15, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anonymous; Anonymous
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The folder contains 8 files. The files with the .spacy extension can be used to train models using spaCy, and the .json files can be used to train NER models using Huggingface transformers. new_model.zip contains the fine-tuned transformer model, trained using the .spacy files, that can be used for NER tasks on cyber attack attribution. The spacy_run_script.ipynb file can be used to view the contents of the .spacy files as well as to run the model inside the .zip file; the script contains the necessary guidelines. Since NER tasks in Huggingface transformers require a JSON format, this folder contains the necessary train, test and dev files in .json format.

  16. ancora-ca-ner

    • huggingface.co
    Updated Nov 1, 2021
    Cite
    Projecte Aina (2021). ancora-ca-ner [Dataset]. https://huggingface.co/datasets/projecte-aina/ancora-ca-ner
    Explore at:
    Croissant
    Dataset updated
    Nov 1, 2021
    Dataset authored and provided by
    Projecte Aina
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Card for AnCora-Ca-NER

      Dataset Summary
    

    This is a dataset for Named Entity Recognition (NER) in Catalan. It adapts the AnCora corpus for machine learning and language model evaluation purposes. This dataset was developed by BSC TeMU as part of Projecte AINA, to enrich the Catalan Language Understanding Benchmark (CLUB).

      Supported Tasks and Leaderboards
    

    Named Entity Recognition, Language Model

      Languages
    

    The dataset is in Catalan… See the full description on the dataset page: https://huggingface.co/datasets/projecte-aina/ancora-ca-ner.

  17. Funder-NER

    • huggingface.co
    Updated Aug 25, 2023
    Cite
    ZBW Leibniz Information Center for Economics (2023). Funder-NER [Dataset]. http://doi.org/10.57967/hf/1011
    Explore at:
    Croissant
    Dataset updated
    Aug 25, 2023
    Dataset authored and provided by
    ZBW Leibniz Information Center for Economics
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Dataset Card for Named Entity Recognition of Funders of Scientific Research

      Dataset Summary
    

    Training/test set for automatically identifying funder entities mentioned in scientific papers. This data set is generated from Open Access documents hosted at https://econstor.eu and manually curated/labeled.

      Supported Tasks and Leaderboards
    

    The dataset is for training and testing the automatic recognition of funders as they are acknowledged in scientific… See the full description on the dataset page: https://huggingface.co/datasets/ZBWatHF/Funder-NER.

  18. grocery-ner-dataset

    • huggingface.co
    Updated May 13, 2025
    Cite
    empathy.ai (2025). grocery-ner-dataset [Dataset]. https://huggingface.co/datasets/empathyai/grocery-ner-dataset
    Explore at:
    Dataset updated
    May 13, 2025
    Dataset provided by
    empathy.ai
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Groceries Named Entity Recognition (NER) Dataset

    A specialized dataset for identifying food and grocery items in natural language text using Named Entity Recognition (NER).

      Entity Types
    

    The dataset includes the following grocery categories:

    • Fruits, Vegetables: fresh produce (e.g., apples, spinach)
    • Lactose, Dairy, Eggs, Cheese, Yoghurt: dairy products and eggs
    • Meat, Fish, Seafood: protein sources
    • Frozen, Prepared Meals: ready-to-eat and frozen meals
    • Baking, Cooking: baking… See the full description on the dataset page: https://huggingface.co/datasets/empathyai/grocery-ner-dataset.

  19. weibo_ner

    • huggingface.co
    Updated May 29, 2024
    Cite
    JHU Human Language Technology Center of Excellence (2024). weibo_ner [Dataset]. https://huggingface.co/datasets/hltcoe/weibo_ner
    Explore at:
    Dataset updated
    May 29, 2024
    Dataset authored and provided by
    JHU Human Language Technology Center of Excellence
    License

    https://choosealicense.com/licenses/unknown/

    Description

    Tags: PER (person names), LOC (location names), GPE (administrative/geo-political names), ORG (organization names)

    • PER — PER.NAM: specific personal name (e.g. 张三); PER.NOM: generic or categorical reference to a person (e.g. 穷人, "the poor")
    • LOC — LOC.NAM: specific place name (e.g. 紫玉山庄); LOC.NOM: generic place reference (e.g. 大峡谷 "canyon", 宾馆 "hotel")
    • GPE — GPE.NAM: administrative region name (e.g. 北京, "Beijing")
    • ORG — ORG.NAM: specific organization name (e.g. 通惠医院); ORG.NOM: generic organization reference (e.g. 文艺公司)

  20. PII-NER

    • huggingface.co
    Updated Jul 20, 2024
    Cite
    Joseph G Flowers (2024). PII-NER [Dataset]. https://huggingface.co/datasets/Josephgflowers/PII-NER
    Explore at:
    Croissant
    Dataset updated
    Jul 20, 2024
    Authors
    Joseph G Flowers
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card for NER PII Extraction Dataset

      Dataset Summary

    This dataset is designed for training and evaluating Named Entity Recognition (NER) models focused on extracting Personally Identifiable Information (PII) from text. It includes a variety of entities such as names, addresses, phone numbers, email addresses, and identification numbers. The dataset is suitable for tasks that involve PII detection, compliance checks, and data anonymization.

      Supported Tasks and Leaderboards

    Named Entity… See the full description on the dataset page: https://huggingface.co/datasets/Josephgflowers/PII-NER.
