9 datasets found
  1. Named Entity Recognition (NER) Corpus

    • kaggle.com
    Updated Jan 14, 2022
    Cite
    Naser Al-qaydeh (2022). Named Entity Recognition (NER) Corpus [Dataset]. https://www.kaggle.com/datasets/naseralqaydeh/named-entity-recognition-ner-corpus
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Jan 14, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Naser Al-qaydeh
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    Task

    Named Entity Recognition (NER) is the task of categorizing the entities in a text into categories such as names of persons, locations, and organizations.

    Dataset

    Each row in the CSV file contains a complete sentence, a list of POS tags for each word in the sentence, and a list of NER tags for each word in the sentence.

    You can use a pandas DataFrame to read and manipulate this dataset.

    Since each row in the CSV file contains lists, if we read the file with pandas.read_csv() and try to get a tag list by indexing, the value we get back is a string rather than a list:

    ```
    data['tag'][0]
    # "['O', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'B-gpe', 'O', 'O', 'O', 'O', 'O']"

    type(data['tag'][0])
    # str
    ```

    You can use the following to convert it back to list type:

    ```
    from ast import literal_eval

    literal_eval(data['tag'][0])
    # ['O', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'B-geo', 'O', 'O', 'O', 'O', 'O', 'B-gpe', 'O', 'O', 'O', 'O', 'O']

    type(literal_eval(data['tag'][0]))
    # list
    ```
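    As a convenience, the stringified list columns can be parsed at read time by passing pandas' converters argument to read_csv. A minimal sketch, assuming a list column named tag; the in-memory CSV here stands in for the real file:

```python
from ast import literal_eval
from io import StringIO

import pandas as pd

# Toy CSV mimicking the dataset's layout: a list column stored as a string.
csv_text = '''sentence,tag
"London is big","['B-geo', 'O', 'O']"
'''

# converters parses each cell with literal_eval while reading,
# so no separate conversion pass is needed afterwards.
data = pd.read_csv(StringIO(csv_text), converters={"tag": literal_eval})
print(type(data["tag"][0]))  # <class 'list'>
```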

    Acknowledgements

    This dataset is taken from the Annotated Corpus for Named Entity Recognition dataset by Abhinav Walia and then processed.

    The Annotated Corpus for Named Entity Recognition is an annotated corpus for named entity recognition built on the GMB (Groningen Meaning Bank) corpus, enhanced with popular natural-language-processing features for entity classification.

    Essential info about entities:

    • geo = Geographical Entity
    • org = Organization
    • per = Person
    • gpe = Geopolitical Entity
    • tim = Time indicator
    • art = Artifact
    • eve = Event
    • nat = Natural Phenomenon
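    The tag prefixes follow the common BIO convention: B- begins an entity of the given type, I- continues it, and O marks tokens outside any entity. A small illustrative helper (not part of the dataset) that turns parallel token/tag lists into entity spans:

```python
def bio_to_spans(tokens, tags):
    """Collect (entity_text, entity_type) pairs from BIO-tagged tokens."""
    spans, current, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:  # close any entity already in progress
                spans.append((" ".join(current), current_type))
            current, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(token)
        else:  # "O" (or a stray I- tag) ends any open span
            if current:
                spans.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        spans.append((" ".join(current), current_type))
    return spans

print(bio_to_spans(["Protests", "hit", "New", "York"],
                   ["O", "O", "B-geo", "I-geo"]))
# [('New York', 'geo')]
```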
  2. turkish-org-ner

    • huggingface.co
    Updated Aug 7, 2024
    Cite
    STNM - NLPhoenix (2024). turkish-org-ner [Dataset]. https://huggingface.co/datasets/STNM-NLPhoenix/turkish-org-ner
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Aug 7, 2024
    Authors
    STNM - NLPhoenix
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Turkish Organization Named Entity Recognition (NER) Dataset

    This dataset is designed for Named Entity Recognition (NER) tasks in the Turkish language, specifically focusing on organization entities. It contains sentences annotated with the ORGANIZATION label, making it a valuable resource for training and evaluating NER models.

      Usage
    

    To use this dataset, you can load it directly with Hugging Face's datasets library:

    ```
    from datasets import load_dataset

    dataset =…
    ```

    See the full description on the dataset page: https://huggingface.co/datasets/STNM-NLPhoenix/turkish-org-ner.

  3. dev-ner-ontonotes

    • huggingface.co
    Updated Aug 14, 2024
    Cite
    Louis Guitton (2024). dev-ner-ontonotes [Dataset]. https://huggingface.co/datasets/louisguitton/dev-ner-ontonotes
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Aug 14, 2024
    Authors
    Louis Guitton
    Description

    dev-ner-ontonotes

    Validation set of the NER dataset OntoNotes5, created with Argilla for an Argilla Meetup talk.

      Usage

      Load with Argilla
    

    To load with Argilla, you'll just need to install Argilla (pip install argilla --upgrade) and then use the following code:

    ```
    import argilla as rg

    ds = rg.FeedbackDataset.from_huggingface("louisguitton/dev-ner-ontonotes")
    ```

      Load with datasets
    

    To load this dataset with datasets, you'll just need to install datasets as pip… See the full description on the dataset page: https://huggingface.co/datasets/louisguitton/dev-ner-ontonotes.

  4. wikiann

    • tensorflow.org
    • huggingface.co
    Updated Jan 4, 2023
    Cite
    (2023). wikiann [Dataset]. https://www.tensorflow.org/datasets/catalog/wikiann
    Explore at:
    Dataset updated
    Jan 4, 2023
    Description

    WikiANN (sometimes called PAN-X) is a multilingual named entity recognition dataset consisting of Wikipedia articles annotated with LOC (location), PER (person), and ORG (organisation) tags in the IOB2 format. This version corresponds to the balanced train, dev, and test splits of Rahimi et al. (2019), which supports 176 of the 282 languages from the original WikiANN corpus.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('wikiann', split='train')
    for ex in ds.take(4):
        print(ex)
    

    See the guide for more information on tensorflow_datasets.
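    In the IOB2 format used here, an I- tag is only legal immediately after a B- or I- tag of the same entity type. A small validity check, written as an illustrative sketch rather than part of the dataset's tooling:

```python
def is_valid_iob2(tags):
    """True if the tag sequence obeys IOB2: I-X may only follow B-X or I-X."""
    prev = "O"
    for tag in tags:
        if tag.startswith("I-"):
            entity = tag[2:]
            if prev not in (f"B-{entity}", f"I-{entity}"):
                return False
        prev = tag
    return True

print(is_valid_iob2(["B-PER", "I-PER", "O", "B-LOC"]))  # True
print(is_valid_iob2(["O", "I-LOC"]))                    # False
```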

  5. Data from: HealthE

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 16, 2023
    Cite
    Parker Seegmiller (2023). HealthE [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7539391
    Explore at:
    Dataset updated
    Jan 16, 2023
    Dataset provided by
    Garrett Johnston
    Parker Seegmiller
    Madhusudan Basak
    Sarah Masud Preum
    Joseph Gatto
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    HealthE Dataset

    HealthE contains 3,400 pieces of health advice gathered 1) from public health websites (i.e., WebMD.com, MedlinePlus.gov, CDC.gov, and MayoClinic.org) and 2) from the publicly available Preclude dataset. Each sample was hand-labeled for health entity recognition by a team of 14 annotators at the authors' institution. Automatic recognition of health entities will enable further research in large-scale modeling of texts from online health communities.

    The data is provided in two parts. Both are serialized with Python's pickle module and require the pandas library to load.

    healthe.pkl is a pandas.DataFrame object containing the 3,400 health-advice statements with hand-labeled health entities.

    non_advice.pkl is a pandas.DataFrame object containing the 2,256 non-advice statements.

    To load the files in Python, use the following code block:

    ```
    import pandas as pd

    healthe_df = pd.read_pickle('healthe.pkl')
    non_advice_df = pd.read_pickle('non_advice.pkl')
    ```

    healthe_df has four columns:

    • text contains the health advice statement text
    • entities contains a python list of (entity, class) tuples
    • tokenized_text contains a list of tokens obtained by tokenizing the health advice statement text
    • labels contains a list of the same length as tokenized_text, where each token is mapped to a class label

    non_advice_df has one column, text, referring to each non-health-advice-statement.
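    Since tokenized_text and labels are parallel lists, token/label pairs fall out of a simple zip. A toy sketch with an invented DataFrame mimicking the described schema (the rows and the B-food label are made up for illustration):

```python
import pandas as pd

# Invented row in the schema described above; not real HealthE data.
healthe_df = pd.DataFrame({
    "text": ["Drink plenty of water"],
    "tokenized_text": [["Drink", "plenty", "of", "water"]],
    "labels": [["O", "O", "O", "B-food"]],
})

# Pair each token with its class label.
row = healthe_df.iloc[0]
pairs = list(zip(row["tokenized_text"], row["labels"]))
print(pairs)
# [('Drink', 'O'), ('plenty', 'O'), ('of', 'O'), ('water', 'B-food')]
```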

  6. conll2003

    • tensorflow.org
    • opendatalab.com
    • +1 more
    Updated Dec 22, 2022
    Cite
    (2022). conll2003 [Dataset]. https://www.tensorflow.org/datasets/catalog/conll2003
    Explore at:
    Dataset updated
    Dec 22, 2022
    Description

    The shared task of CoNLL-2003 concerns language-independent named entity recognition and concentrates on four types of named entities: persons, locations, organizations and names of miscellaneous entities that do not belong to the previous three groups.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('conll2003', split='train')
    for ex in ds.take(4):
        print(ex)
    

    See the guide for more information on tensorflow_datasets.

  7. frenchNER_3entities

    • huggingface.co
    Updated Feb 7, 2024
    Cite
    CATIE (2024). frenchNER_3entities [Dataset]. https://huggingface.co/datasets/CATIE-AQ/frenchNER_3entities
    Explore at:
    Dataset updated
    Feb 7, 2024
    Dataset authored and provided by
    CATIE
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset information

    Dataset concatenating NER datasets, available in French and open-source, for 3 entities (LOC, PER, ORG). There are a total of 420,264 rows, of which 346,071 are for training, 32,951 for validation and 41,242 for testing. Our methodology is described in a blog post available in English or French.

      Usage
    

    ```
    from datasets import load_dataset

    dataset = load_dataset("CATIE-AQ/frenchNER_3entities")
    ```

      Dataset

      Details of rows… See the full description on the dataset page: https://huggingface.co/datasets/CATIE-AQ/frenchNER_3entities.
  8. english_news_weak_ner

    • huggingface.co
    Cite
    Vladimir Gurevich, english_news_weak_ner [Dataset]. https://huggingface.co/datasets/imvladikon/english_news_weak_ner
    Explore at:
    Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Authors
    Vladimir Gurevich
    Description

    Large Weak Labelled NER corpus

      Dataset Summary
    

    The dataset was generated through weak labelling of a scraped and preprocessed news corpus (Bloomberg news), so it is intended for research purposes only. For tokenization, the news articles were split into sentences using nltk's PunktSentenceTokenizer, so the tokenization may occasionally be imperfect.

      Usage
    

    ```
    from datasets import load_dataset

    articles_ds = load_dataset("imvladikon/english_news_weak_ner", "articles")  # just…
    ```

    See the full description on the dataset page: https://huggingface.co/datasets/imvladikon/english_news_weak_ner.

  9. frenchNER_4entities

    • huggingface.co
    Updated Feb 8, 2024
    Cite
    CATIE (2024). frenchNER_4entities [Dataset]. http://doi.org/10.57967/hf/1751
    Explore at:
    Dataset updated
    Feb 8, 2024
    Dataset authored and provided by
    CATIE
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset information

    Dataset concatenating NER datasets, available in French and open-source, for 4 entities (LOC, PER, ORG, MISC). There are a total of 384,773 rows, of which 328,757 are for training, 24,131 for validation and 31,885 for testing. Our methodology is described in a blog post available in English or French.

      Usage
    

    ```
    from datasets import load_dataset

    dataset = load_dataset("CATIE-AQ/frenchNER_4entities")
    ```

      Dataset

      Details of rows… See the full description on the dataset page: https://huggingface.co/datasets/CATIE-AQ/frenchNER_4entities.
  10. Not seeing a result you expected?
    Learn how you can add new datasets to our index.
