7 datasets found
  1. wikineural

    • huggingface.co
    Cite
    TNER, wikineural [Dataset]. https://huggingface.co/datasets/tner/wikineural
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    119 scholarly articles cite this dataset (Google Scholar)
    Dataset authored and provided by
    TNER
    Description
  2. Babelscape-wikineural-joined-small

    • huggingface.co
    Updated Apr 3, 2023
    Cite
    David Martín (2023). Babelscape-wikineural-joined-small [Dataset]. https://huggingface.co/datasets/dmargutierrez/Babelscape-wikineural-joined-small
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Authors
    David Martín
    Description

    Dataset Card for "Babelscape-wikineural-joined-small"

    More Information needed

  3. WikiNEuRal

    • opendatalab.com
    zip
    Updated Mar 17, 2023
    Cite
    Babelscape (2023). WikiNEuRal [Dataset]. https://opendatalab.com/OpenDataLab/WikiNEuRal
    Explore at:
    zip (577,384,924 bytes)
    Dataset provided by
    University of Rome
    Babelscape
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    WikiNEuRal is a high-quality automatically-generated dataset for Multilingual Named Entity Recognition.

  4. Multilingual NER Dataset

    • kaggle.com
    Updated Dec 5, 2023
    Cite
    The Devastator (2023). Multilingual NER Dataset [Dataset]. https://www.kaggle.com/datasets/thedevastator/multilingual-ner-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    The Devastator
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)

    Description

    Multilingual NER Dataset for Named Entity Recognition

    By Babelscape (from Hugging Face) [source]

    About this dataset

    The Babelscape/wikineural NER Dataset is a comprehensive and diverse collection of multilingual text data specifically designed for the task of Named Entity Recognition (NER). It offers an extensive range of labeled sentences in nine different languages: French, German, Portuguese, Spanish, Polish, Dutch, Russian, English, and Italian.

    Each sentence in the dataset contains tokens (words or characters) that have been labeled with named entity recognition tags. These tags provide valuable information about the type of named entity each token represents. The dataset also includes a language column to indicate the language in which each sentence is written.

    This dataset serves as an invaluable resource for developing and evaluating NER models across multiple languages. It encompasses various domains and contexts to ensure diversity and representativeness. Researchers and practitioners can utilize this dataset to train and test their NER models in real-world scenarios.

    By using this dataset for NER tasks, users can enhance their understanding of how named entities are recognized across different languages. Furthermore, it enables benchmarking performance comparisons between various NER models developed for specific languages or trained on multiple languages simultaneously.

    Whether you are an experienced researcher or a beginner exploring multilingual NER tasks, the Babelscape/wikineural NER Dataset provides a highly informative and versatile resource that can contribute to advancements in natural language processing and information extraction applications on a global scale.

    How to use the dataset

    • Understand the Data Structure:

      • The dataset consists of labeled sentences in nine different languages: French (fr), German (de), Portuguese (pt), Spanish (es), Polish (pl), Dutch (nl), Russian (ru), English (en), and Italian (it).
      • Each sentence is represented by three columns: tokens, ner_tags, and lang.
      • The tokens column contains the individual words or characters in each labeled sentence.
      • The ner_tags column provides named entity recognition tags for each token, indicating their entity types.
      • The lang column specifies the language of each sentence.
    • Explore Different Languages:

      • Since this dataset covers multiple languages, you can choose to focus on a specific language or perform cross-lingual analysis.
      • Analyzing multiple languages can help uncover patterns and differences in named entities across various linguistic contexts.
    • Preprocessing and Cleaning:

      • Before training your NER models or applying any NLP techniques to this dataset, it's essential to preprocess and clean the data.
      • Consider removing any unnecessary punctuation marks or special characters unless they carry significant meaning in certain languages.
    • Training Named Entity Recognition Models (see the loading-and-splitting sketch after this list):

      • Data Splitting: Divide the dataset into training, validation, and testing sets based on your requirements, using appropriate ratios.
      • Feature Extraction: Prepare input features from the tokenized text data, such as word embeddings or character-level representations, depending on your model choice.
      • Model Training: Utilize state-of-the-art NER models (e.g., LSTM-CRF, Transformer-based models) to train on the labeled sentences and ner_tags columns.
      • Evaluation: Evaluate your trained model's performance using the provided validation dataset or test datasets specific to each language.

    • Applying Pretrained Models:

      • Instead of training a model from scratch, you can leverage existing pretrained NER models like BERT, GPT-2, or spaCy's named entity recognition capabilities.
      • Fine-tune these pretrained models on your specific NER task using the labeled sentences and ner_tags in this dataset (a fine-tuning sketch follows the Research Ideas list below).
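
    As a concrete starting point, the sketch below shows one way to load and split the data with pandas and scikit-learn. It assumes the Kaggle download contains a single CSV with the columns described above; the file name "wikineural.csv" and the stringified-list storage of the tokens and ner_tags columns are assumptions, not confirmed by the dataset card.

      # Hedged sketch: load the multilingual NER CSV and make an 80/10/10 split.
      # Assumed: a file "wikineural.csv" with columns tokens, ner_tags, lang,
      # where the list-valued columns are stored as stringified Python lists.
      import ast

      import pandas as pd
      from sklearn.model_selection import train_test_split

      df = pd.read_csv("wikineural.csv")  # hypothetical file name

      # Parse stringified lists back into Python lists (storage format assumed).
      df["tokens"] = df["tokens"].apply(ast.literal_eval)
      df["ner_tags"] = df["ner_tags"].apply(ast.literal_eval)

      # Keep a single language (e.g. English), or drop the filter for
      # cross-lingual experiments.
      df_en = df[df["lang"] == "en"]

      # 80/10/10 train/validation/test split.
      train_df, rest_df = train_test_split(df_en, test_size=0.2, random_state=42)
      val_df, test_df = train_test_split(rest_df, test_size=0.5, random_state=42)

      print(len(train_df), len(val_df), len(test_df))
      print(train_df.iloc[0][["tokens", "ner_tags", "lang"]])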

    Research Ideas

    • Training NER models: This dataset can be used to train NER models in multiple languages. By providing labeled sentences and their corresponding named entity recognition tags, the dataset can help train models to accurately identify and classify named entities in different languages.
    • Evaluating NER performance: The dataset can be used as a benchmark to evaluate the performance of pre-trained or custom-built NER models. By using the labeled sentences as test data, developers and researchers can measure the accuracy, precision, recall, and F1-score of their models across multiple languages.
    • Cross-lingual analysis: With labeled sentences available in nine different languages, researchers can perform cross-lingual analysis...
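
    For the training and evaluation ideas above, a minimal fine-tuning sketch with Hugging Face Transformers might look like the following. The base model, label inventory, and hyperparameters are illustrative assumptions; building train_dataset/eval_dataset and aligning ner_tags to word pieces are omitted for brevity.

      # Hedged sketch: token-classification fine-tuning with Transformers.
      # The label set assumes a BIO scheme over PER/ORG/LOC/MISC; verify it
      # against the actual tag inventory of the dataset before training.
      from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                                Trainer, TrainingArguments)

      labels = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG",
                "B-LOC", "I-LOC", "B-MISC", "I-MISC"]
      tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
      model = AutoModelForTokenClassification.from_pretrained(
          "bert-base-multilingual-cased", num_labels=len(labels))

      args = TrainingArguments(output_dir="wikineural-ner",
                               num_train_epochs=3,
                               per_device_train_batch_size=16)

      # train_dataset / eval_dataset must be built from the tokens and ner_tags
      # columns, with labels aligned to the tokenizer's word pieces.
      # trainer = Trainer(model=model, args=args,
      #                   train_dataset=train_dataset, eval_dataset=eval_dataset)
      # trainer.train()
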
  5. multinerd

    • huggingface.co
    Updated Apr 21, 2023
    Cite
    Babelscape (2023). multinerd [Dataset]. https://huggingface.co/datasets/Babelscape/multinerd
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset authored and provided by
    Babelscape
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Dataset Card for MultiNERD dataset

      Description
    

    Summary: In a nutshell, MultiNERD is the first language-agnostic methodology for automatically creating multilingual, multi-genre and fine-grained annotations for Named Entity Recognition and Entity Disambiguation. Specifically, it can be seen as an extension of the combination of two prior works from our research group: WikiNEuRal, from which we took inspiration for the state-of-the-art silver-data creation methodology… See the full description on the dataset page: https://huggingface.co/datasets/Babelscape/multinerd.

  6. wikineural_fr_prompt_ner

    • huggingface.co
    Updated Aug 12, 2025
    + more versions
    Cite
    CATIE (2025). wikineural_fr_prompt_ner [Dataset]. https://huggingface.co/datasets/CATIE-AQ/wikineural_fr_prompt_ner
    Explore at:
    Dataset updated
    Aug 12, 2025
    Dataset authored and provided by
    CATIE
    License

    License unknown (https://choosealicense.com/licenses/unknown/)

    Description

    wikineural_fr_prompt_ner

      Summary
    

    wikineural_fr_prompt_ner is a subset of the Dataset of French Prompts (DFP). It contains 2,647,638 rows that can be used for a named entity recognition task. The original data (without prompts) comes from the dataset wikineural by Tedeschi et al., where only the French part has been kept. A list of prompts (see below) was then applied in order to build the input and target columns and thus obtain the same format as the xP3 dataset by… See the full description on the dataset page: https://huggingface.co/datasets/CATIE-AQ/wikineural_fr_prompt_ner.
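
    A minimal loading sketch with the Hugging Face datasets library is below; the split name "train" is an assumption, and the expected columns follow the input/target format described above.

      # Hedged sketch: load the French prompt-formatted NER data.
      from datasets import load_dataset

      ds = load_dataset("CATIE-AQ/wikineural_fr_prompt_ner", split="train")  # split name assumed
      print(ds)      # the card reports 2,647,638 rows
      print(ds[0])   # one prompt (input) / target pair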

  7. es-ner-massive

    • huggingface.co
    Updated Mar 25, 2024
    Cite
    Héctor López Hidalgo (2024). es-ner-massive [Dataset]. https://huggingface.co/datasets/hlhdatscience/es-ner-massive
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Authors
    Héctor López Hidalgo
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Dataset Card for es-ner-massive

      Dataset Details

      Dataset Description

    The es-ner-massive dataset is a combination of three datasets: tner/wikineural, conll2002, and polyglot_ner. It is designed for Named Entity Recognition (NER) tasks. Tags are curated to be span-based and encoded according to the following convention:

    encodings_dictionary = {"O": 0, "PER": 1, "ORG": 2, "LOC": 3, "MISC": 4}
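
    As an illustration (not from the dataset card), the convention above can be inverted to decode predicted label ids back into tag names; the predicted_ids values below are hypothetical.

      # Hedged sketch: invert the encoding convention to decode model output.
      encodings_dictionary = {"O": 0, "PER": 1, "ORG": 2, "LOC": 3, "MISC": 4}
      id2label = {v: k for k, v in encodings_dictionary.items()}

      predicted_ids = [0, 1, 1, 0, 3]               # hypothetical model output
      print([id2label[i] for i in predicted_ids])   # ['O', 'PER', 'PER', 'O', 'LOC']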

      Dataset Details… See the full description on the dataset page: https://huggingface.co/datasets/hlhdatscience/es-ner-massive.
    