100+ datasets found
  1. kaggle-entity-annotated-corpus-ner-dataset

    • huggingface.co
    Updated Jul 10, 2022
    Cite
    Rafael Arias Calles (2022). kaggle-entity-annotated-corpus-ner-dataset [Dataset]. https://huggingface.co/datasets/rjac/kaggle-entity-annotated-corpus-ner-dataset
    Dataset updated
    Jul 10, 2022
    Authors
    Rafael Arias Calles
    License

    https://choosealicense.com/licenses/odbl/

    Description

    Date: 2022-07-10
    Files: ner_dataset.csv
    Source: Kaggle entity annotated corpus
    Notes: The dataset only contains the tokens and NER tag labels. Labels are uppercase.

      About Dataset
    

    from Kaggle Datasets

      Context
    

    Annotated corpus for Named Entity Recognition, built on the GMB (Groningen Meaning Bank) corpus for entity classification, with enhanced and popular features produced by Natural Language Processing applied to the data set. Tip: Use a Pandas DataFrame to load the dataset if using Python for… See the full description on the dataset page: https://huggingface.co/datasets/rjac/kaggle-entity-annotated-corpus-ner-dataset.

  2. Pile-NER-type

    • huggingface.co
    Updated Aug 9, 2023
    + more versions
    Cite
    Universal-NER (2023). Pile-NER-type [Dataset]. https://huggingface.co/datasets/Universal-NER/Pile-NER-type
    Dataset updated
    Aug 9, 2023
    Authors
    Universal-NER
    Description

    Intro

    Pile-NER-type is a set of GPT-generated data for named entity recognition using the type-based data construction prompt. It was collected by prompting gpt-3.5-turbo-0301 and augmented by negative sampling. Check our project page for more information.

      License
    

    Attribution-NonCommercial 4.0 International

  3. WIESP2022-NER

    • huggingface.co
    Cite
    SAO/NASA Astrophysics Data System, WIESP2022-NER [Dataset]. https://huggingface.co/datasets/adsabs/WIESP2022-NER
    Dataset provided by
    Astrophysics Data System: http://www.adsabs.harvard.edu/
    Authors
    SAO/NASA Astrophysics Data System
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset for the first Workshop on Information Extraction from Scientific Publications (WIESP/2022).

      Dataset Description
    

    Datasets with text fragments from astrophysics papers, provided by the NASA Astrophysical Data System, with manually tagged astronomical facilities and other entities of interest (e.g., celestial objects). Datasets are in JSON Lines format (each line is a JSON dictionary). The datasets are formatted similarly to the CoNLL-2003 format. Each token is… See the full description on the dataset page: https://huggingface.co/datasets/adsabs/WIESP2022-NER.
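
As a rough illustration of the JSON Lines layout described above, the following minimal sketch reads one dictionary per line and prints token/tag pairs in a CoNLL-style column layout. The field names `tokens` and `ner_tags`, the label names, and the sample records are assumptions made for the example; the truncated description does not confirm them, so check the dataset page for the actual schema.

```python
import io
import json

# Two hypothetical records. The field names "tokens" and "ner_tags" and the
# label names are assumptions for illustration, not taken from the listing.
SAMPLE_JSONL = (
    '{"tokens": ["Observed", "with", "Hubble"], '
    '"ner_tags": ["O", "O", "B-Telescope"]}\n'
    '{"tokens": ["near", "M31"], "ner_tags": ["O", "B-CelestialObject"]}\n'
)


def read_jsonl(handle):
    """Yield one dictionary per non-empty line of a JSON Lines stream."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)


records = list(read_jsonl(io.StringIO(SAMPLE_JSONL)))

for record in records:
    # Pair each token with its tag, one pair per row, as in CoNLL-style files.
    for token, tag in zip(record["tokens"], record["ner_tags"]):
        print(f"{token}\t{tag}")
    print()  # blank line separates sentences
```

To read a real file instead of the in-memory sample, pass an opened file object to `read_jsonl` in place of the `io.StringIO` wrapper.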

  4. Financial-NER-NLP

    • huggingface.co
    Cite
    Joseph G Flowers, Financial-NER-NLP [Dataset]. https://huggingface.co/datasets/Josephgflowers/Financial-NER-NLP
    Authors
    Joseph G Flowers
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Card for Financial-NER-NLP

      Dataset Summary
    

    The Financial-NER-NLP Dataset is a derivative of the FiNER-139 dataset, which consists of 1.1 million sentences annotated with 139 XBRL tags. This new dataset transforms the original structured data into natural language prompts suitable for training language models. The dataset is designed to enhance models’ abilities in tasks such as named entity recognition (NER), summarization, and information extraction in the financial domain. The… See the full description on the dataset page: https://huggingface.co/datasets/Josephgflowers/Financial-NER-NLP.

  5. kpwr-ner

    • huggingface.co
    Updated Apr 1, 2022
    Cite
    CLARIN-PL (2022). kpwr-ner [Dataset]. https://huggingface.co/datasets/clarin-pl/kpwr-ner
    Dataset updated
    Apr 1, 2022
    Dataset authored and provided by
    CLARIN-PL
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    KPWR-NER tagging dataset.

  6. AnythingNER

    • huggingface.co
    Updated Oct 6, 2024
    Cite
    CascadeNER (2024). AnythingNER [Dataset]. https://huggingface.co/datasets/CascadeNER/AnythingNER
    Dataset updated
    Oct 6, 2024
    Authors
    CascadeNER
    Description

    DynamicNER: A Dynamic, Multilingual, and Fine-Grained Dataset for LLM-based Named Entity Recognition

    This repository is supplementary material for the paper: DynamicNER: A Dynamic, Multilingual, and Fine-Grained Dataset for LLM-based Named Entity Recognition

      💓Update!
    

    DynamicNER is now released! Please download it from Hugging Face for training and evaluation!

    We have added more existing datasets in GEIC format, as well as the format for fine-tuning and inference based on SWIFT! You can… See the full description on the dataset page: https://huggingface.co/datasets/CascadeNER/AnythingNER.

  7. PII-NER

    • huggingface.co
    Updated Jul 20, 2024
    Cite
    Joseph G Flowers (2024). PII-NER [Dataset]. https://huggingface.co/datasets/Josephgflowers/PII-NER
    Dataset updated
    Jul 20, 2024
    Authors
    Joseph G Flowers
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card for NER PII Extraction Dataset

      Dataset Summary
    

    This dataset is designed for training and evaluating Named Entity Recognition (NER) models focused on extracting Personally Identifiable Information (PII) from text. It includes a variety of entities such as names, addresses, phone numbers, email addresses, and identification numbers. The dataset is suitable for tasks that involve PII detection, compliance checks, and data anonymization.

      Supported Tasks and Leaderboards
    

    Named Entity… See the full description on the dataset page: https://huggingface.co/datasets/Josephgflowers/PII-NER.

  8. aeroBERT-NER

    • huggingface.co
    Updated Apr 7, 2023
    Cite
    Archana Tikayat Ray (2023). aeroBERT-NER [Dataset]. http://doi.org/10.57967/hf/0470
    Dataset updated
    Apr 7, 2023
    Authors
    Archana Tikayat Ray
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Card for aeroBERT-NER

      Dataset Summary
    

    This dataset contains sentences from the aerospace requirements domain. The sentences are tagged for five NER categories (SYS, VAL, ORG, DATETIME, and RES) using the BIO tagging scheme. There are a total of 1432 sentences. The dataset was created with the following aims:
    (1) Making available an open-source dataset for aerospace requirements which are often proprietary
    (2) Fine-tuning language models for token identification… See the full description on the dataset page: https://huggingface.co/datasets/archanatikayatray/aeroBERT-NER.
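
To make the BIO tagging scheme mentioned above concrete, here is a minimal sketch that groups BIO-tagged tokens into labeled entity spans. The example sentence and its tags are invented for illustration (only the category names SYS and VAL come from the description), and the function name `bio_to_spans` is hypothetical rather than part of the dataset or any library.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (label, text) entity spans.

    A "B-X" tag opens a span of type X, "I-X" continues an open span of the
    same type, and "O" (or an I- tag that does not match the open span)
    closes whatever span is currently open.
    """
    spans = []
    label, words = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if label is not None:
                spans.append((label, " ".join(words)))
            label, words = tag[2:], [token]
        elif tag.startswith("I-") and label == tag[2:]:
            words.append(token)
        else:
            # "O" or a stray I- tag: close the open span; the token is dropped.
            if label is not None:
                spans.append((label, " ".join(words)))
            label, words = None, []
    if label is not None:
        spans.append((label, " ".join(words)))
    return spans


# Invented example sentence; the tag assignments are assumptions.
tokens = ["The", "flight", "control", "system", "shall", "respond",
          "within", "5", "seconds", "."]
tags = ["O", "B-SYS", "I-SYS", "I-SYS", "O", "O",
        "O", "B-VAL", "I-VAL", "O"]

spans = bio_to_spans(tokens, tags)
print(spans)
```

Treating a mismatched `I-` tag as a span boundary is one possible design choice; a stricter variant could raise an error instead.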

  9. few-nerd

    • huggingface.co
    • opendatalab.com
    • +1 more
    Updated Dec 31, 2021
    Cite
    Speech and Language Technology, DFKI (2021). few-nerd [Dataset]. https://huggingface.co/datasets/DFKI-SLT/few-nerd
    Dataset updated
    Dec 31, 2021
    Dataset authored and provided by
    Speech and Language Technology, DFKI
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dataset Card for "Few-NERD"


      Dataset… See the full description on the dataset page: https://huggingface.co/datasets/DFKI-SLT/few-nerd.
    
  10. HiNER-original

    • huggingface.co
    Updated May 2, 2022
    Cite
    Computation for Indian Language Technology (2022). HiNER-original [Dataset]. https://huggingface.co/datasets/cfilt/HiNER-original
    Dataset updated
    May 2, 2022
    Dataset authored and provided by
    Computation for Indian Language Technology
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This is the dataset repository for the HiNER dataset, accepted for publication at LREC 2022. The dataset can help build sequence labelling models for the task of Named Entity Recognition for the Hindi language.

  11. bioleaflets-biomedical-ner

    • huggingface.co
    Updated May 7, 2023
    Cite
    Ruslan Yermak (2023). bioleaflets-biomedical-ner [Dataset]. https://huggingface.co/datasets/ruslan/bioleaflets-biomedical-ner
    Dataset updated
    May 7, 2023
    Authors
    Ruslan Yermak
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Card for BioLeaflets Dataset

      Dataset Summary
    

    BioLeaflets is a biomedical dataset for Data2Text generation. It is a corpus of 1,336 package leaflets of medicines authorised in Europe, which were obtained by scraping the European Medicines Agency (EMA) website. Package leaflets are included in the packaging of medicinal products and contain information to help patients use the product safely and appropriately. This dataset comprises the large majority (∼ 90%) of… See the full description on the dataset page: https://huggingface.co/datasets/ruslan/bioleaflets-biomedical-ner.

  12. grocery-ner-dataset

    • huggingface.co
    Updated May 13, 2025
    Cite
    empathy.ai (2025). grocery-ner-dataset [Dataset]. https://huggingface.co/datasets/empathyai/grocery-ner-dataset
    Dataset updated
    May 13, 2025
    Dataset provided by
    empathy.ai
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Groceries Named Entity Recognition (NER) Dataset

    A specialized dataset for identifying food and grocery items in natural language text using Named Entity Recognition (NER).

      Entity Types
    

    The dataset includes the following grocery categories:

    Fruits Vegetables: Fresh produce (e.g., apples, spinach)
    Lactose, Diary, Eggs, Cheese, Yoghurt: Dairy products and eggs
    Meat, Fish, Seafood: Protein sources
    Frozen, Prepared Meals: Ready-to-eat and frozen meals
    Baking, Cooking: Baking… See the full description on the dataset page: https://huggingface.co/datasets/empathyai/grocery-ner-dataset.

  13. finer-139

    • huggingface.co
    • opendatalab.com
    Updated May 9, 2022
    Cite
    AUEB NLP Group (2022). finer-139 [Dataset]. https://huggingface.co/datasets/nlpaueb/finer-139
    Dataset updated
    May 9, 2022
    Authors
    AUEB NLP Group
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    FiNER-139 is a named entity recognition dataset consisting of 10K annual and quarterly English reports (filings) of publicly traded companies downloaded from the U.S. Securities and Exchange Commission (SEC) annotated with 139 XBRL tags in the IOB2 format.
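
In the IOB2 format mentioned above, every entity starts with a `B-` tag, and an `I-` tag may only continue an entity of the same type. The sketch below checks a tag sequence against that rule. The function name `iob2_violations` and the tag names in the examples are illustrative assumptions, not drawn from the dataset itself.

```python
def iob2_violations(tags):
    """Return the indices of tags that break the IOB2 convention.

    In IOB2, an "I-X" tag is valid only directly after "B-X" or "I-X".
    """
    violations = []
    previous = "O"  # treat the start of the sequence as outside any entity
    for index, tag in enumerate(tags):
        if tag.startswith("I-"):
            label = tag[2:]
            if previous not in (f"B-{label}", f"I-{label}"):
                violations.append(index)
        previous = tag
    return violations


# Hypothetical tag sequences; the label names are placeholders.
valid = ["O", "B-Revenues", "I-Revenues", "O"]
invalid = ["O", "I-Revenues", "O", "B-Assets", "I-Revenues"]

print(iob2_violations(valid))
print(iob2_violations(invalid))
```

A check like this can be useful before training, since token classification pipelines often assume well-formed tag sequences.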

  14. Annotated_NER_PDF_Resumes

    • huggingface.co
    Updated Jul 22, 2024
    Cite
    MehyarMlaweh (2024). Annotated_NER_PDF_Resumes [Dataset]. https://huggingface.co/datasets/Mehyaar/Annotated_NER_PDF_Resumes
    Dataset updated
    Jul 22, 2024
    Authors
    MehyarMlaweh
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    IT Skills Named Entity Recognition (NER) Dataset

      Description:
    

    This dataset includes 5,029 curriculum vitae (CV) samples, each annotated with IT skills using Named Entity Recognition (NER). The skills are manually labeled and extracted from PDFs, and the data is provided in JSON format. This dataset is ideal for training and evaluating NER models, especially for extracting IT skills from CVs.

      Highlights:
    

    5,029 CV samples with annotated IT skills Manual annotations for… See the full description on the dataset page: https://huggingface.co/datasets/Mehyaar/Annotated_NER_PDF_Resumes.

  15. RWCS-NER-DC

    • huggingface.co
    Updated Mar 24, 2025
    + more versions
    Cite
    Shilin (2025). RWCS-NER-DC [Dataset]. https://huggingface.co/datasets/zsLin/RWCS-NER-DC
    Dataset updated
    Mar 24, 2025
    Authors
    Shilin
    Description

    zsLin/RWCS-NER-DC dataset hosted on Hugging Face and contributed by the HF Datasets community

  16. synthetic-pii-ner-mistral-v1

    • huggingface.co
    Updated Apr 20, 2024
    Cite
    Urchade Zaratiana (2024). synthetic-pii-ner-mistral-v1 [Dataset]. https://huggingface.co/datasets/urchade/synthetic-pii-ner-mistral-v1
    Dataset updated
    Apr 20, 2024
    Authors
    Urchade Zaratiana
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This is the synthetic dataset used for training https://huggingface.co/urchade/gliner_multi_pii-v1. You can get it by browsing the files and downloading the data.json file.

  17. Persian-Text-NER

    • huggingface.co
    Updated Nov 18, 2023
    Cite
    Seyed Ali Mir Mohammad Hosseini (2023). Persian-Text-NER [Dataset]. https://huggingface.co/datasets/SeyedAli/Persian-Text-NER
    Dataset updated
    Nov 18, 2023
    Authors
    Seyed Ali Mir Mohammad Hosseini
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    SeyedAli/Persian-Text-NER dataset hosted on Hugging Face and contributed by the HF Datasets community

  18. cross_ner

    • huggingface.co
    Updated Apr 19, 2023
    + more versions
    Cite
    Speech and Language Technology, DFKI (2023). cross_ner [Dataset]. https://huggingface.co/datasets/DFKI-SLT/cross_ner
    Dataset updated
    Apr 19, 2023
    Dataset authored and provided by
    Speech and Language Technology, DFKI
    License

    https://choosealicense.com/licenses/undefined/

    Description

    CrossNER is a fully labeled collection of named entity recognition (NER) data spanning five diverse domains (Politics, Natural Science, Music, Literature, and Artificial Intelligence) with specialized entity categories for different domains. Additionally, CrossNER also includes unlabeled domain-related corpora for the corresponding five domains.

    For details, see the paper: CrossNER: Evaluating Cross-Domain Named Entity Recognition

  19. product-ner

    • huggingface.co
    Updated Sep 14, 2024
    Cite
    Irsh Vijay (2024). product-ner [Dataset]. https://huggingface.co/datasets/1rsh/product-ner
    Dataset updated
    Sep 14, 2024
    Authors
    Irsh Vijay
    Description

    1rsh/product-ner dataset hosted on Hugging Face and contributed by the HF Datasets community

  20. ancora-ca-ner

    • huggingface.co
    Updated Nov 1, 2021
    Cite
    Projecte Aina (2021). ancora-ca-ner [Dataset]. https://huggingface.co/datasets/projecte-aina/ancora-ca-ner
    Dataset updated
    Nov 1, 2021
    Dataset authored and provided by
    Projecte Aina
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Card for AnCora-Ca-NER

      Dataset Summary
    

    This is a dataset for Named Entity Recognition (NER) in Catalan. It adapts AnCora corpus for Machine Learning and Language Model evaluation purposes. This dataset was developed by BSC TeMU as part of the Projecte AINA, to enrich the Catalan Language Understanding Benchmark (CLUB).

      Supported Tasks and Leaderboards
    

    Named Entities Recognition, Language Model

      Languages
    

    The dataset is in Catalan… See the full description on the dataset page: https://huggingface.co/datasets/projecte-aina/ancora-ca-ner.
