100+ datasets found
  1. universal_ner

    • huggingface.co
    Updated Sep 3, 2024
    Cite
    Universal NER (2024). universal_ner [Dataset]. https://huggingface.co/datasets/universalner/universal_ner
    Dataset updated
    Sep 3, 2024
    Dataset authored and provided by
    Universal NER
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Universal Named Entity Recognition (UNER) aims to fill a gap in multilingual NLP: high quality NER datasets in many languages with a shared tagset.

    UNER is modeled after the Universal Dependencies project, in that it is intended to be a large community annotation effort with language-universal guidelines. Further, we use the same text corpora as Universal Dependencies.

  2. kaggle-entity-annotated-corpus-ner-dataset

    • huggingface.co
    Updated Jul 10, 2022
    Cite
    Rafael Arias Calles (2022). kaggle-entity-annotated-corpus-ner-dataset [Dataset]. https://huggingface.co/datasets/rjac/kaggle-entity-annotated-corpus-ner-dataset
    Dataset updated
    Jul 10, 2022
    Authors
    Rafael Arias Calles
    License

    https://choosealicense.com/licenses/odbl/

    Description

    Date: 2022-07-10
    Files: ner_dataset.csv
    Source: Kaggle entity annotated corpus
    Notes: The dataset only contains the tokens and NER tag labels. Labels are uppercase.

      About Dataset
    

    from Kaggle Datasets

      Context
    

    Annotated corpus for named entity recognition, built on the Groningen Meaning Bank (GMB) corpus, for entity classification with enhanced and popular natural language processing features applied to the data set. Tip: Use a pandas DataFrame to load the dataset if using Python for… See the full description on the dataset page: https://huggingface.co/datasets/rjac/kaggle-entity-annotated-corpus-ner-dataset.
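A minimal sketch of reading such a token-and-tag file, under stated assumptions: the column names (`Sentence #`, `Word`, `Tag`), the convention that the sentence marker appears only on the first token of each sentence, and the sample rows are all hypothetical and not confirmed by the listing. The listing's tip suggests pandas; this sketch swaps in the standard-library `csv` module to stay dependency-free.

```python
import csv
import io

# Hypothetical sample in the assumed layout: "Sentence #" is filled only on
# the first token of each sentence, and tags are uppercase.
SAMPLE = """Sentence #,Word,Tag
Sentence: 1,London,B-GEO
,is,O
,rainy,O
Sentence: 2,Obama,B-PER
,spoke,O
"""


def read_sentences(handle):
    """Group CSV rows into sentences, each a list of (token, tag) pairs."""
    sentences = []
    for row in csv.DictReader(handle):
        # A non-empty marker starts a new sentence; rows that appear before
        # any marker are collected into an initial sentence.
        if row["Sentence #"] or not sentences:
            sentences.append([])
        sentences[-1].append((row["Word"], row["Tag"]))
    return sentences


sentences = read_sentences(io.StringIO(SAMPLE))
```

For the real file, the same function could be given an open file handle instead of the in-memory sample, after checking the actual header row.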

  3. Multilingual named entity recognition for medieval charters. Datasets and...

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Jan 16, 2023
    Cite
    Sergio Torres Aguilar (2023). Multilingual named entity recognition for medieval charters. Datasets and models [Dataset]. http://doi.org/10.5281/zenodo.6463699
    Available download formats: zip
    Dataset updated
    Jan 16, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Sergio Torres Aguilar
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Annotated dataset for training named entity recognition models for medieval charters in Latin, French and Spanish.

    The original raw texts for all charters were collected from four charter collections:

    - HOME-ALCAR corpus : https://zenodo.org/record/5600884

    - CBMA : http://www.cbma-project.eu

    - Diplomata Belgica : https://www.diplomata-belgica.be

    - CODEA corpus : https://corpuscodea.es/

    We include (i) the annotated training datasets, (ii) the contextual and static embeddings trained on medieval multilingual texts, and (iii) the named entity recognition models trained using two architectures: Bi-LSTM-CRF with stacked embeddings, and fine-tuned BERT-based models (mBERT and RoBERTa).

    The code, datasets and notebooks used to train the models are available in our GitLab repository: https://gitlab.com/magistermilitum/ner_medieval_multilingual

    Our best RoBERTa model is also available on Hugging Face: https://huggingface.co/magistermilitum/roberta-multilingual-medieval-ner

  4. Pile-NER-type

    • huggingface.co
    Updated Aug 9, 2023
    Cite
    Universal-NER (2023). Pile-NER-type [Dataset]. https://huggingface.co/datasets/Universal-NER/Pile-NER-type
    Dataset updated
    Aug 9, 2023
    Authors
    Universal-NER
    Description

    Intro

    Pile-NER-type is a set of GPT-generated data for named entity recognition using the type-based data construction prompt. It was collected by prompting gpt-3.5-turbo-0301 and augmented by negative sampling. Check our project page for more information.

      License
    

    Attribution-NonCommercial 4.0 International

  5. aeroBERT-NER

    • huggingface.co
    Updated Apr 7, 2023
    Cite
    Archana Tikayat Ray (2023). aeroBERT-NER [Dataset]. http://doi.org/10.57967/hf/0470
    Dataset updated
    Apr 7, 2023
    Authors
    Archana Tikayat Ray
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Card for aeroBERT-NER

      Dataset Summary
    

    This dataset contains sentences from the aerospace requirements domain. The sentences are tagged for five NER categories (SYS, VAL, ORG, DATETIME, and RES) using the BIO tagging scheme. There are 1432 sentences in total. The dataset was created with the following aims:
    (1) Making available an open-source dataset for aerospace requirements, which are often proprietary
    (2) Fine-tuning language models for token identification… See the full description on the dataset page: https://huggingface.co/datasets/archanatikayatray/aeroBERT-NER.
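To illustrate the BIO tagging scheme mentioned above, here is a minimal sketch that groups token-level tags into entity spans. The category names SYS and VAL come from the description; the example sentence, its tags, and the function name `bio_to_spans` are hypothetical and not taken from the dataset.

```python
def bio_to_spans(tokens, tags):
    """Collect (label, text) spans from parallel token and BIO tag lists.

    A "B-X" tag opens a span of type X, and "I-X" extends an open span of
    the same type. Any other tag (including a stray "I-X" with no matching
    open span) closes the current span and is itself skipped.
    """
    spans = []
    current_label, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current_label is not None:
                spans.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_label == tag[2:]:
            current_tokens.append(token)
        else:
            if current_label is not None:
                spans.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = None, []
    if current_label is not None:
        spans.append((current_label, " ".join(current_tokens)))
    return spans


# Hypothetical requirement sentence with hypothetical tags.
tokens = ["The", "flight", "computer", "shall", "respond", "within", "5", "ms"]
tags = ["O", "B-SYS", "I-SYS", "O", "O", "O", "B-VAL", "I-VAL"]
spans = bio_to_spans(tokens, tags)
```

Joining tokens with a single space is a simplification; recovering the exact surface text would need character offsets, which this sketch does not assume are available.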

  6. Weekly supervised Multilingual Data Set to train Named Entity Recognition...

    • zenodo.org
    Updated Apr 16, 2025
    Cite
    Izidor Mlakar; Rigon Sallauka; Umut Arioz; Matej Rojc (2025). Weekly supervised Multilingual Data Set to train Named Entity Recognition for Symptom Extraction [Dataset]. http://doi.org/10.5281/zenodo.13918009
    Dataset updated
    Apr 16, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Izidor Mlakar; Rigon Sallauka; Umut Arioz; Matej Rojc
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data Sets were generated using the Weakly Supervised NER pipeline (https://github.com/HUMADEX/Weekly-Supervised-NER-pipline) to train the symptom extraction NER models.

    Supported Languages and dataset locations for the specific language:

    English (base language): https://huggingface.co/HUMADEX/english_medical_ner
    German: https://huggingface.co/HUMADEX/german_medical_ner
    Italian: https://huggingface.co/HUMADEX/italian_medical_ner
    Spanish: https://huggingface.co/HUMADEX/spanish_medical_ner
    Greek: https://huggingface.co/HUMADEX/german_medical_ner
    Slovenian: https://huggingface.co/HUMADEX/slovenian_medical_ner
    Polish: https://huggingface.co/HUMADEX/polish_medical_ner
    Portuguese: https://huggingface.co/HUMADEX/portugese_medical_ner

    Dataset Building

    • Data Integration and Preprocessing
    • Data Cleaning
    • Annotation with Stanza's i2b2 Clinical Model
    • Translation into the targeted language
    • Word Alignment
    • Data Augmentation

    Acknowledgement
    This dataset was created as part of joint research of the HUMADEX research group (https://www.linkedin.com/company/101563689/) and has received funding from the European Union Horizon Europe Research and Innovation Program project SMILE (grant number 101080923) and the Marie Skłodowska-Curie Actions (MSCA) Doctoral Networks project BosomShield (grant number 101073222). Responsibility for the information and views expressed herein lies entirely with the authors.

    Authors:
    dr. Izidor Mlakar, Rigona Sallauka, dr. Umut Arioz, dr. Matej Rojc

    Please cite as:

    Article title: Weakly-Supervised Multilingual Medical NER For Symptom Extraction For Low-Resource Languages
    Doi: 10.20944/preprints202504.1356.v1
    Website: https://www.preprints.org/manuscript/202504.1356/v1

  7. InLegalNER

    • huggingface.co
    Updated Apr 17, 2024
    Cite
    OpenNyAI (2024). InLegalNER [Dataset]. https://huggingface.co/datasets/opennyaiorg/InLegalNER
    Dataset updated
    Apr 17, 2024
    Dataset authored and provided by
    OpenNyAI
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset for training and evaluating an Indian legal named entity recognition model.

      Paper details
    

    Named Entity Recognition in Indian court judgments Arxiv

      Label Scheme
    

    Label scheme (14 labels for 1 component)

    ENTITY        BELONGS TO
    LAWYER        PREAMBLE
    COURT         PREAMBLE, JUDGEMENT
    JUDGE         PREAMBLE, JUDGEMENT
    PETITIONER    PREAMBLE, JUDGEMENT
    RESPONDENT    PREAMBLE, JUDGEMENT
    CASE_NUMBER   JUDGEMENT
    GPE           JUDGEMENT
    DATE          JUDGEMENT
    ORG           JUDGEMENT
    STATUTE       JUDGEMENT
    … See the full description on the dataset page: https://huggingface.co/datasets/opennyaiorg/InLegalNER.

  8. Annotated_NER_PDF_Resumes

    • huggingface.co
    Updated Jul 22, 2024
    Cite
    MehyarMlaweh (2024). Annotated_NER_PDF_Resumes [Dataset]. https://huggingface.co/datasets/Mehyaar/Annotated_NER_PDF_Resumes
    Dataset updated
    Jul 22, 2024
    Authors
    MehyarMlaweh
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    IT Skills Named Entity Recognition (NER) Dataset

      Description:
    

    This dataset includes 5,029 curriculum vitae (CV) samples, each annotated with IT skills using Named Entity Recognition (NER). The skills are manually labeled and extracted from PDFs, and the data is provided in JSON format. This dataset is ideal for training and evaluating NER models, especially for extracting IT skills from CVs.

      Highlights:
    

    5,029 CV samples with annotated IT skills Manual annotations for… See the full description on the dataset page: https://huggingface.co/datasets/Mehyaar/Annotated_NER_PDF_Resumes.

  9. Climate-Change-NER

    • huggingface.co
    Updated Oct 11, 2024
    Cite
    IBM Research (2024). Climate-Change-NER [Dataset]. https://huggingface.co/datasets/ibm-research/Climate-Change-NER
    Dataset updated
    Oct 11, 2024
    Dataset provided by
    IBM (http://ibm.com/)
    IBM Research
    Authors
    IBM Research
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    Dataset Card for Climate Change NER

    Climate Change NER is an English-language dataset containing 534 abstracts of climate-related papers. They have been sourced from the Semantic Scholar Academic Graph "abstracts" dataset. The abstracts have been manually annotated by classifying climate-related tokens into a set of 13 categories.

      Dataset Details
    
    
    
    
    
      Dataset Description
    

    We introduce a comprehensive dataset for developing and evaluating NLP models tailored towards… See the full description on the dataset page: https://huggingface.co/datasets/ibm-research/Climate-Change-NER.

  10. PII-NER

    • huggingface.co
    Updated Jul 20, 2024
    Cite
    Joseph G Flowers (2024). PII-NER [Dataset]. https://huggingface.co/datasets/Josephgflowers/PII-NER
    Dataset updated
    Jul 20, 2024
    Authors
    Joseph G Flowers
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card for NER PII Extraction Dataset

      Dataset Summary

    This dataset is designed for training and evaluating Named Entity Recognition (NER) models focused on extracting Personally Identifiable Information (PII) from text. It includes a variety of entities such as names, addresses, phone numbers, email addresses, and identification numbers. The dataset is suitable for tasks that involve PII detection, compliance checks, and data anonymization.

      Supported Tasks and Leaderboards

    Named Entity… See the full description on the dataset page: https://huggingface.co/datasets/Josephgflowers/PII-NER.

  11. bioleaflets-biomedical-ner

    • huggingface.co
    Updated May 7, 2023
    Cite
    Ruslan Yermak (2023). bioleaflets-biomedical-ner [Dataset]. https://huggingface.co/datasets/ruslan/bioleaflets-biomedical-ner
    Dataset updated
    May 7, 2023
    Authors
    Ruslan Yermak
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Card for BioLeaflets Dataset

      Dataset Summary
    

    BioLeaflets is a biomedical dataset for Data2Text generation. It is a corpus of 1,336 package leaflets of medicines authorised in Europe, which were obtained by scraping the European Medicines Agency (EMA) website. Package leaflets are included in the packaging of medicinal products and contain information to help patients use the product safely and appropriately. This dataset comprises the large majority (∼ 90%) of… See the full description on the dataset page: https://huggingface.co/datasets/ruslan/bioleaflets-biomedical-ner.

  12. The Chilean Waiting List Corpus

    • explore.openaire.eu
    • zenodo.org
    Updated Jul 1, 2020
    Cite
    Pablo Báez; Fabián Villena; Matías Rojas; Felipe Bravo-Marquez; Jocelyn Dunstan (2020). The Chilean Waiting List Corpus [Dataset]. http://doi.org/10.5281/zenodo.3926704
    Dataset updated
    Jul 1, 2020
    Authors
    Pablo Báez; Fabián Villena; Matías Rojas; Felipe Bravo-Marquez; Jocelyn Dunstan
    Area covered
    Chile
    Description

    Here we describe a new clinical corpus rich in nested entities and a series of neural models to identify them. The corpus comprises de-identified referrals from the waiting list in Chilean public hospitals. A subset of 9,000 referrals (medical and dental) was manually annotated with ten types of entities, six attributes, and pairs of relations with clinical relevance. A trained medical doctor or dentist annotated these referrals and then, together with three other researchers, consolidated each of the annotations. In the annotated corpus, more than 48% of entities are embedded in other entities or contain another entity.

    We use this corpus to build Named Entity Recognition (NER) models. The best results were achieved using Multiple Single-entity architectures with clinical word embeddings stacked with character and Flair contextual embeddings (refer to this paper: https://aclanthology.org/2022.coling-1.184/). The entity with the best performance is abbreviation, and the hardest to recognize is finding. NER models applied to this corpus can leverage statistics of diseases and pending procedures. This work constitutes the first annotated corpus using clinical narratives from Chile and one of the few in Spanish. The annotated corpus, clinical word embeddings, annotation guidelines, and neural models are freely released to the community.

    This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/4.0/.

    We are releasing the dataset in 3 formats:

    - cwlc.zip: Contains the raw text files for each document along with its annotation file in Standoff format
    - cwlc_conll-format: CoNLL format for training NER models

    In addition, the dataset has been released on Hugging Face (https://huggingface.co/plncmm) to facilitate experiments with transformer-based architectures.

  13. azerbaijani-ner-dataset

    • huggingface.co
    Updated Jun 13, 2024
    Cite
    LocalDoc (2024). azerbaijani-ner-dataset [Dataset]. http://doi.org/10.57967/hf/2484
    Dataset updated
    Jun 13, 2024
    Dataset authored and provided by
    LocalDoc
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Azerbaijani Named Entity Recognition (NER) Dataset

    This repository contains the dataset for training and evaluating Named Entity Recognition (NER) models in the Azerbaijani language. The dataset includes annotated text data with various named entities.

      Dataset Description
    

    The dataset includes the following entity types:

    0: O: Outside any named entity
    1: PERSON: Names of individuals
    2: LOCATION: Geographical locations, both man-made and natural
    3: ORGANISATION: Names of… See the full description on the dataset page: https://huggingface.co/datasets/LocalDoc/azerbaijani-ner-dataset.
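A minimal sketch of how integer tag ids like these could be mapped back to label names. Only the four ids visible in the truncated listing are included; the remaining ids are deliberately not guessed, and the function name `decode_tags` and the `UNKNOWN_` placeholder are hypothetical.

```python
# Only the ids shown in the (truncated) listing; the full label set is on the
# dataset page and is not reproduced here.
ID2LABEL = {0: "O", 1: "PERSON", 2: "LOCATION", 3: "ORGANISATION"}


def decode_tags(tag_ids, id2label=ID2LABEL):
    """Map integer tag ids to label names, flagging ids missing from the map."""
    return [id2label.get(tag_id, f"UNKNOWN_{tag_id}") for tag_id in tag_ids]


# Hypothetical tag sequence; id 7 stands in for a label not listed above.
labels = decode_tags([1, 0, 0, 2, 7])
```

Flagging unmapped ids, rather than raising an error, keeps the sketch usable with the partial mapping; a complete mapping would make the fallback unnecessary.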

  14. ancora-ca-ner

    • huggingface.co
    Updated Nov 1, 2021
    Cite
    Projecte Aina (2021). ancora-ca-ner [Dataset]. https://huggingface.co/datasets/projecte-aina/ancora-ca-ner
    Dataset updated
    Nov 1, 2021
    Dataset authored and provided by
    Projecte Aina
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Card for AnCora-Ca-NER

      Dataset Summary
    

    This is a dataset for Named Entity Recognition (NER) in Catalan. It adapts the AnCora corpus for Machine Learning and Language Model evaluation purposes. This dataset was developed by BSC TeMU as part of the Projecte AINA, to enrich the Catalan Language Understanding Benchmark (CLUB).

      Supported Tasks and Leaderboards
    

    Named Entities Recognition, Language Model

      Languages
    

    The dataset is in Catalan… See the full description on the dataset page: https://huggingface.co/datasets/projecte-aina/ancora-ca-ner.

  15. Funder-NER

    • huggingface.co
    Updated Aug 25, 2023
    Cite
    ZBW Leibniz Information Center for Economics (2023). Funder-NER [Dataset]. http://doi.org/10.57967/hf/1011
    Dataset updated
    Aug 25, 2023
    Dataset authored and provided by
    ZBW Leibniz Information Center for Economics
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Dataset Card for Dataset Named Entity Recognition of funders of scientific research

      Dataset Summary
    

    Training/test set for automatically identifying funder entities mentioned in scientific papers. This data set is generated from Open Access documents hosted at https://econstor.eu and manually curated/labeled.

      Supported Tasks and Leaderboards
    

    The dataset is for training and testing the automatic recognition of funders as they are acknowledged in scientific… See the full description on the dataset page: https://huggingface.co/datasets/ZBWatHF/Funder-NER.

  16. Polyglot-NER

    • opendatalab.com
    • huggingface.co
    zip
    Updated Apr 7, 2023
    Cite
    Stony Brook University (2023). Polyglot-NER [Dataset]. https://opendatalab.com/OpenDataLab/Polyglot-NER
    Available download formats: zip (3575536533 bytes)
    Dataset updated
    Apr 7, 2023
    Dataset provided by
    Stony Brook University
    Description

    Polyglot-NER builds massive multilingual annotators with minimal human expertise and intervention.

  17. requirements-ner-id

    • huggingface.co
    Updated Jul 12, 2023
    Cite
    Dekai Xiao (2023). requirements-ner-id [Dataset]. https://huggingface.co/datasets/dxiao/requirements-ner-id
    Dataset updated
    Jul 12, 2023
    Authors
    Dekai Xiao
    Description

    dxiao/requirements-ner-id dataset hosted on Hugging Face and contributed by the HF Datasets community

  18. RaTE-NER

    • huggingface.co
    Updated Jun 21, 2024
    Cite
    Weike Zhao (2024). RaTE-NER [Dataset]. https://huggingface.co/datasets/Angelakeke/RaTE-NER
    Dataset updated
    Jun 21, 2024
    Authors
    Weike Zhao
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Dataset Card for RaTE-NER Dataset

    GitHub | Paper

      Dataset Summary
    

    RaTE-NER dataset is a large-scale, radiological named entity recognition (NER) dataset, including 13,235 manually annotated sentences from 1,816 reports within the MIMIC-IV database, that spans 9 imaging modalities and 23 anatomical regions, ensuring comprehensive coverage. Additionally, we further enriched the dataset with 33,605 sentences from the 17,432 reports available on Radiopaedia, by… See the full description on the dataset page: https://huggingface.co/datasets/Angelakeke/RaTE-NER.

  19. finer-139

    • huggingface.co
    • opendatalab.com
    Updated May 9, 2022
    Cite
    AUEB NLP Group (2022). finer-139 [Dataset]. https://huggingface.co/datasets/nlpaueb/finer-139
    Dataset updated
    May 9, 2022
    Authors
    AUEB NLP Group
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    FiNER-139 is a named entity recognition dataset consisting of 10K annual and quarterly English reports (filings) of publicly traded companies downloaded from the U.S. Securities and Exchange Commission (SEC) annotated with 139 XBRL tags in the IOB2 format.
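As a rough illustration of the IOB2 convention mentioned above, in which every entity begins with a `B-` tag and `I-` tags may only continue an entity of the same type, the following sketch reports positions that break that rule. The function name `iob2_violations` and the tag names in the example are hypothetical and not taken from the dataset's 139 XBRL tags.

```python
def iob2_violations(tags):
    """Return the indices of I- tags that do not continue a same-type entity.

    Under IOB2, an "I-X" tag is valid only directly after "B-X" or "I-X".
    """
    violations = []
    previous = "O"
    for index, tag in enumerate(tags):
        if tag.startswith("I-"):
            # Invalid after "O" or after a tag of a different entity type.
            if previous == "O" or previous[2:] != tag[2:]:
                violations.append(index)
        previous = tag
    return violations


# Hypothetical tag sequences with made-up entity types.
valid = iob2_violations(["O", "B-Revenues", "I-Revenues", "O"])
invalid = iob2_violations(["O", "I-Revenues", "B-Debt", "I-Revenues"])
```

A check like this can be run over a tagged corpus before training to catch sequences that were written in a different tagging variant.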

  20. grocery-ner-dataset

    • huggingface.co
    Updated May 13, 2025
    Cite
    empathy.ai (2025). grocery-ner-dataset [Dataset]. https://huggingface.co/datasets/empathyai/grocery-ner-dataset
    Dataset updated
    May 13, 2025
    Dataset provided by
    empathy.ai
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Groceries Named Entity Recognition (NER) Dataset

    A specialized dataset for identifying food and grocery items in natural language text using Named Entity Recognition (NER).

      Entity Types
    

    The dataset includes the following grocery categories:

    Fruits Vegetables: Fresh produce (e.g., apples, spinach)
    Lactose, Diary, Eggs, Cheese, Yoghurt: Dairy products and eggs
    Meat, Fish, Seafood: Protein sources
    Frozen, Prepared Meals: Ready-to-eat and frozen meals
    Baking, Cooking: Baking… See the full description on the dataset page: https://huggingface.co/datasets/empathyai/grocery-ner-dataset.
