22 datasets found
  1. conll2003

    • huggingface.co
    Updated Mar 1, 2003
    Cite
    conll2003 [Dataset]. https://huggingface.co/datasets/eriktks/conll2003
    Dataset updated
    Mar 1, 2003
    Authors
    Erik Tjong Kim Sang
    License

    https://choosealicense.com/licenses/other/

    Description

    The shared task of CoNLL-2003 concerns language-independent named entity recognition. We will concentrate on four types of named entities: persons, locations, organizations and names of miscellaneous entities that do not belong to the previous three groups.

    The CoNLL-2003 shared task data files contain four columns separated by a single space. Each word is on a separate line, and there is an empty line after each sentence. The first item on each line is a word, the second a part-of-speech (POS) tag, the third a syntactic chunk tag, and the fourth the named entity tag. In the original data (the IOB1 scheme), the chunk tags and the named entity tags have the format I-TYPE, which means that the word is inside a phrase of type TYPE. Only if two phrases of the same type immediately follow each other does the first word of the second phrase get the tag B-TYPE, to show that it starts a new phrase. A word with tag O is not part of a phrase. Note that this version of the dataset uses the IOB2 tagging scheme, in which every phrase starts with B-TYPE, whereas the original dataset uses IOB1.

    For more details see https://www.clips.uantwerpen.be/conll2003/ner/ and https://www.aclweb.org/anthology/W03-0419
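The four-column layout described above can be read with a few lines of Python. This is only a minimal sketch: the function name `read_conll2003` and the inline sample are illustrative, and skipping `-DOCSTART-` marker lines is an assumption about the raw files that the description above does not mention.

```python
from typing import Iterable, Iterator, List, Tuple

# One token row: (word, POS tag, chunk tag, named entity tag).
Token = Tuple[str, str, str, str]


def read_conll2003(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield one sentence at a time as a list of four-column token rows."""
    sentence: List[Token] = []
    for raw in lines:
        line = raw.strip()
        # An empty line ends a sentence. Treating "-DOCSTART-" rows as
        # separators is an assumption, not part of the description above.
        if not line or line.startswith("-DOCSTART-"):
            if sentence:
                yield sentence
                sentence = []
            continue
        fields = line.split()
        if len(fields) != 4:
            raise ValueError(f"expected 4 columns, got {len(fields)}: {line!r}")
        sentence.append((fields[0], fields[1], fields[2], fields[3]))
    if sentence:
        yield sentence


# Hypothetical sample in the original (IOB1-style) tagging.
sample = """\
EU NNP I-NP I-ORG
rejects VBZ I-VP O
German JJ I-NP I-MISC
call NN I-NP O

Peter NNP I-NP I-PER
Blackburn NNP I-NP I-PER
"""

sentences = list(read_conll2003(sample.splitlines()))
for sent in sentences:
    print([(word, ner) for word, _pos, _chunk, ner in sent])
```

Yielding sentences lazily keeps memory use flat when the same function is pointed at a full training file.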

  2. conll2003

    • tensorflow.org
    • opendatalab.com
    • +1 more
    Updated Dec 22, 2022
    Cite
    (2022). conll2003 [Dataset]. https://www.tensorflow.org/datasets/catalog/conll2003
    Dataset updated
    Dec 22, 2022
    Description

    The shared task of CoNLL-2003 concerns language-independent named entity recognition and concentrates on four types of named entities: persons, locations, organizations and names of miscellaneous entities that do not belong to the previous three groups.

    To use this dataset:

    import tensorflow_datasets as tfds

    # Load the training split and print the first four examples.
    ds = tfds.load('conll2003', split='train')
    for ex in ds.take(4):
        print(ex)
    

    See the guide for more information on tensorflow_datasets.

  3. favs_bot

    • huggingface.co
    Updated Apr 14, 2023
    + more versions
    Cite
    Thien Tran (2023). favs_bot [Dataset]. https://huggingface.co/datasets/thientran/favs_bot
    Dataset updated
    Apr 14, 2023
    Authors
    Thien Tran
    License

    https://choosealicense.com/licenses/other/

    Description

    The shared task of CoNLL-2003 concerns language-independent named entity recognition. We will concentrate on four types of named entities: persons, locations, organizations and names of miscellaneous entities that do not belong to the previous three groups.

    The CoNLL-2003 shared task data files contain four columns separated by a single space. Each word is on a separate line, and there is an empty line after each sentence. The first item on each line is a word, the second a part-of-speech (POS) tag, the third a syntactic chunk tag, and the fourth the named entity tag. In the original data (the IOB1 scheme), the chunk tags and the named entity tags have the format I-TYPE, which means that the word is inside a phrase of type TYPE. Only if two phrases of the same type immediately follow each other does the first word of the second phrase get the tag B-TYPE, to show that it starts a new phrase. A word with tag O is not part of a phrase. Note that this version of the dataset uses the IOB2 tagging scheme, in which every phrase starts with B-TYPE, whereas the original dataset uses IOB1.

    For more details see https://www.clips.uantwerpen.be/conll2003/ner/ and https://www.aclweb.org/anthology/W03-0419
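The difference between the two tagging schemes named above can be made concrete with a small converter. The sketch below assumes well-formed IOB1 input for a single sentence; the function name `iob1_to_iob2` and the example tags are illustrative and not taken from any of the listed datasets.

```python
from typing import List


def iob1_to_iob2(tags: List[str]) -> List[str]:
    """Convert one sentence of IOB1 tags to IOB2.

    In IOB1 a phrase may open with I-TYPE; in IOB2 every phrase opens
    with B-TYPE. An I-TYPE tag therefore becomes B-TYPE whenever the
    previous tag is O or belongs to a different type.
    """
    converted: List[str] = []
    previous = "O"
    for tag in tags:
        if tag == "O":
            converted.append(tag)
        else:
            prefix, entity_type = tag.split("-", 1)
            previous_type = None if previous == "O" else previous.split("-", 1)[1]
            if prefix == "I" and previous_type != entity_type:
                converted.append("B-" + entity_type)
            else:
                # B-TYPE stays as it is; I-TYPE continuing a phrase stays I-TYPE.
                converted.append(tag)
        previous = tag
    return converted


iob1 = ["I-ORG", "O", "I-PER", "I-PER", "B-PER", "O", "I-LOC", "I-MISC"]
iob2 = iob1_to_iob2(iob1)
print(iob2)
```

The comparison uses the previous original tag rather than the previous converted tag; both carry the same entity type, so either choice gives the same result.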

  4. CoNLL Dataset

    • paperswithcode.com
    • library.toponeai.link
    Updated Mar 4, 2024
    Cite
    (2024). CoNLL Dataset [Dataset]. https://paperswithcode.com/dataset/conll-1
    Dataset updated
    Mar 4, 2024
    Description

    The CoNLL dataset is a widely used resource in the field of natural language processing (NLP). The term “CoNLL” stands for Conference on Computational Natural Language Learning; the dataset originates from a series of shared tasks organized at that conference.

  5. CoNLL 2003

    • kaggle.com
    Updated Mar 14, 2021
    + more versions
    Cite
    GONG ZEQUN (2021). CoNLL 2003 [Dataset]. https://www.kaggle.com/gongzequn/conll-2003/discussion
    Dataset updated
    Mar 14, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    GONG ZEQUN
    Description

    This dataset was created by GONG ZEQUN.

  6. NameTag 3 Multilingual CoNLL Model

    • lindat.cz
    • live.european-language-grid.eu
    • +1 more
    Updated Mar 20, 2025
    Cite
    Jana Straková (2025). NameTag 3 Multilingual CoNLL Model [Dataset]. https://lindat.cz/repository/xmlui/handle/11234/1-5678?show=full
    Dataset updated
    Mar 20, 2025
    Authors
    Jana Straková
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    This is a trained model for the supervised machine learning tool NameTag 3 (https://ufal.mff.cuni.cz/nametag/3/), trained jointly on several NE corpora: English CoNLL-2003, German CoNLL-2003, Dutch CoNLL-2002, Spanish CoNLL-2002, Ukrainian Lang-uk, and Czech CNEC 2.0, all harmonized to flat NEs with 4 labels PER, ORG, LOC, and MISC. NameTag 3 is an open-source tool for both flat and nested named entity recognition (NER). NameTag 3 identifies proper names in text and classifies them into a set of predefined categories, such as names of persons, locations, organizations, etc. The model documentation can be found at https://ufal.mff.cuni.cz/nametag/3/models#multilingual-conll.

  7. NameTag 3 Multilingual CoNLL Model - Dataset - B2FIND

    • b2find.dkrz.de
    Updated Jan 21, 2025
    + more versions
    Cite
    (2025). NameTag 3 Multilingual CoNLL Model - Dataset - B2FIND [Dataset]. https://b2find.dkrz.de/dataset/8f2a7714-87a0-52e1-8a08-f69e9a15a827
    Dataset updated
    Jan 21, 2025
    Description

    This is a trained model for the supervised machine learning tool NameTag 3 (https://ufal.mff.cuni.cz/nametag/3/), trained jointly on several NE corpora: English CoNLL-2003, German CoNLL-2003, Dutch CoNLL-2002, Spanish CoNLL-2002, Ukrainian Lang-uk, and Czech CNEC 2.0, all harmonized to flat NEs with 4 labels PER, ORG, LOC, and MISC. NameTag 3 is an open-source tool for both flat and nested named entity recognition (NER). NameTag 3 identifies proper names in text and classifies them into a set of predefined categories, such as names of persons, locations, organizations, etc. The model documentation can be found at https://ufal.mff.cuni.cz/nametag/3/models#multilingual-conll.

  8. conll2003

    • huggingface.co
    + more versions
    Cite
    Sikka, conll2003 [Dataset]. https://huggingface.co/datasets/Ritu2/conll2003
    Authors
    Sikka
    Description

    Ritu2/conll2003 dataset hosted on Hugging Face and contributed by the HF Datasets community

  9. FIN Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Sep 25, 2023
    Cite
    Julio Cesar Salinas Alvarado; Karin Verspoor; Timothy Baldwin (2023). FIN Dataset [Dataset]. https://paperswithcode.com/dataset/fin
    Dataset updated
    Sep 25, 2023
    Authors
    Julio Cesar Salinas Alvarado; Karin Verspoor; Timothy Baldwin
    Description

    A dataset of financial agreements made public through U.S. Securities and Exchange Commission (SEC) filings. Eight documents (totalling 54,256 words) were randomly selected for manual annotation, based on the four NE types provided in the CoNLL-2003 dataset: LOCATION (LOC), ORGANISATION (ORG), PERSON (PER), and MISCELLANEOUS (MISC).

  10. Data from: Learning multilingual named entity recognition from Wikipedia

    • figshare.com
    bz2
    Updated May 30, 2023
    Cite
    Joel Nothman; Nicky Ringland; Will Radford; Tara Murphy; James R Curran (2023). Learning multilingual named entity recognition from Wikipedia [Dataset]. http://doi.org/10.6084/m9.figshare.5462500.v1
    Available download formats: bz2
    Dataset updated
    May 30, 2023
    Dataset provided by
    figshare
    Authors
    Joel Nothman; Nicky Ringland; Will Radford; Tara Murphy; James R Curran
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the data associated with Joel Nothman, Nicky Ringland, Will Radford, Tara Murphy and James R. Curran (2013), "Learning multilingual named entity recognition from Wikipedia", Artificial Intelligence 194 (DOI: 10.1016/j.artint.2012.03.006). A preprint is included here as wikiner-preprint.pdf.

    This data was originally available at http://schwa.org/resources (which linked to http://schwa.org/projects/resources/wiki/Wikiner).

    • The .bz2 files are NER training corpora produced as reported in the Artificial Intelligence paper. wp2 and wp3 are differentiated by wp3 using a higher level of link inference. They use a pipe-delimited format that can be converted to CoNLL 2003 format with system2conll.pl.
    • nothman08types.tsv is a manual classification of articles first used in Joel Nothman, James R. Curran and Tara Murphy (2008), "Transforming Wikipedia into Named Entity Training Data", in Proceedings of the Australasian Language Technology Association Workshop 2008. http://aclanthology.coli.uni-saarland.de/pdf/U/U08/U08-1016.pdf
    • popular.tsv and random.tsv are manual article classifications developed for the Artificial Intelligence paper based on different strategies for sampling articles from Wikipedia in order to account for Wikipedia's biased distribution (see that paper).
    • scheme.tsv maps these fine-grained labels to coarser annotations including CoNLL 2003-style.
    • wikigold.conll.txt is a manual NER annotation of some Wikipedia text as presented in Dominic Balasuriya and Nicky Ringland and Joel Nothman and Tara Murphy and James R. Curran (2009), in Proceedings of the 2009 Workshop on The People's Web Meets NLP: Collaboratively Constructed Semantic Resources (http://www.aclweb.org/anthology/W/W09/W09-3302).

    See also corpora produced similarly in an enhanced version of this work (Pan et al., "Cross-lingual Name Tagging and Linking for 282 Languages", ACL 2017) at http://nlp.cs.rpi.edu/wikiann/.
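To illustrate the kind of conversion the pipe-delimited corpora mentioned above call for, here is a rough Python sketch. It is not a port of system2conll.pl. It assumes each input line holds one sentence whose space-separated tokens look like `word|POS|NER`, and it fills the chunk column with a hypothetical placeholder `_` because that assumed input carries no chunk tag. Both the layout and the placeholder are assumptions to check against the actual files.

```python
from typing import List


def pipe_to_conll(sentence_line: str, chunk_placeholder: str = "_") -> List[str]:
    """Turn one pipe-delimited sentence into four-column CoNLL-style rows.

    Assumed input layout (not confirmed by the description above):
    space-separated tokens, each written as word|POS|NER.
    """
    rows: List[str] = []
    for token in sentence_line.split():
        # Split from the right so a word that itself contains "|" stays intact.
        word, pos, ner = token.rsplit("|", 2)
        rows.append(" ".join((word, pos, chunk_placeholder, ner)))
    return rows


# Hypothetical example sentence in the assumed layout.
example = "Oxford|NNP|I-ORG is|VBZ|O in|IN|O England|NNP|I-LOC .|.|O"
conll_rows = pipe_to_conll(example)
print("\n".join(conll_rows))
```

When writing several sentences to a file, an empty line after each sentence's rows would match the CoNLL 2003 convention.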

  11. CoNLL2003 Dataset

    • kaggle.com
    zip
    Updated Jan 28, 2022
    Cite
    Julian Garratt (2022). CoNLL2003 Dataset [Dataset]. https://www.kaggle.com/datasets/juliangarratt/conll2003-dataset/suggestions?status=pending&yourSuggestions=true
    Available download formats: zip (982780 bytes)
    Dataset updated
    Jan 28, 2022
    Authors
    Julian Garratt
    Description

    This dataset was created by Julian Garratt.

  12. conll2003-eng

    • kaggle.com
    zip
    Updated Nov 30, 2023
    Cite
    vbichphuong (2023). conll2003-eng [Dataset]. https://www.kaggle.com/datasets/vbichphuong/conll2003-eng/discussion
    Available download formats: zip (1090095 bytes)
    Dataset updated
    Nov 30, 2023
    Authors
    vbichphuong
    Description

    This dataset was created by vbichphuong.

  13. autoeval-staging-eval-project-conll2003-70dc316d-10775449

    • huggingface.co
    + more versions
    Cite
    autoeval-staging-eval-project-conll2003-70dc316d-10775449 [Dataset]. https://huggingface.co/datasets/autoevaluate/autoeval-staging-eval-project-conll2003-70dc316d-10775449
    Dataset authored and provided by
    Evaluation on the Hub
    Description

    Dataset Card for AutoTrain Evaluator

    This repository contains model predictions generated by AutoTrain for the following task and dataset:

    • Task: Token Classification
    • Model: sarahmiller137/distilbert-base-uncased-ft-conll2003
    • Dataset: conll2003
    • Config: conll2003
    • Split: test

    To run new evaluation jobs, visit Hugging Face's automatic model evaluator.

      Contributions
    

    Thanks to @sarahmiller137 for evaluating this model.

  14. DaNE (Danish Dependency Treebank)

    • opendatalab.com
    zip
    Updated Sep 22, 2022
    + more versions
    Cite
    Alexandra Institute (2022). DaNE (Danish Dependency Treebank) [Dataset]. https://opendatalab.com/OpenDataLab/DaNE
    Available download formats: zip (6986356 bytes)
    Dataset updated
    Sep 22, 2022
    Dataset provided by
    University of Copenhagen
    Alexandra Institute
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    DaNE is a named entity annotation of the Danish Universal Dependencies treebank (the Danish Dependency Treebank), using the CoNLL-2003 annotation scheme.

  15. Indian Legal data (2003 conll format)

    • kaggle.com
    zip
    Updated Jan 18, 2021
    + more versions
    Cite
    shweta sharma (2021). Indian Legal data (2003 conll format) [Dataset]. https://www.kaggle.com/datasets/shweta2407/merged-conll-data
    Available download formats: zip (237689 bytes)
    Dataset updated
    Jan 18, 2021
    Authors
    shweta sharma
    Description

    This dataset was created by shweta sharma.

  16. Indian Court Decision Annotated Corpus.xlsx

    • figshare.com
    xlsx
    Updated Aug 22, 2022
    Cite
    Pooja Harde; Pariskhit Kamat; Suraj Suresh; Shubham Kalson; Sarika Jain; Nandana Mihindukulasooriya (2022). Indian Court Decision Annotated Corpus.xlsx [Dataset]. http://doi.org/10.6084/m9.figshare.19719088.v4
    Available download formats: xlsx
    Dataset updated
    Aug 22, 2022
    Dataset provided by
    figshare
    Authors
    Pooja Harde; Pariskhit Kamat; Suraj Suresh; Shubham Kalson; Sarika Jain; Nandana Mihindukulasooriya
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    India
    Description

    The dataset contains 50 Supreme Court of India decisions annotated for named entity recognition, using three different encoding schemes: IOB, IOBES, and BILOU. The dataset is created using the CoNLL-2003 format.
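The three encoding schemes named above mark the same entity spans with different prefixes. As a hedged illustration, the sketch below converts IOB2-style tags (every phrase opens with B-) into IOBES or BILOU. The function name `iob2_to_iobes`, its `bilou` flag, and the example tags are illustrative, and reading the "IOB" above as IOB2 is an assumption about this corpus.

```python
from typing import List


def iob2_to_iobes(tags: List[str], bilou: bool = False) -> List[str]:
    """Re-encode one sentence of IOB2 tags as IOBES (or BILOU if bilou=True).

    IOBES marks the last token of a multi-token phrase with E- and a
    single-token phrase with S-; BILOU uses L- and U- for the same roles.
    """
    end_prefix, single_prefix = ("L", "U") if bilou else ("E", "S")
    encoded: List[str] = []
    for i, tag in enumerate(tags):
        if tag == "O":
            encoded.append(tag)
            continue
        prefix, entity_type = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        # The phrase continues only if the next tag is I- of the same type.
        continues = next_tag == "I-" + entity_type
        if prefix == "B":
            new_prefix = "B" if continues else single_prefix
        else:
            new_prefix = "I" if continues else end_prefix
        encoded.append(new_prefix + "-" + entity_type)
    return encoded


iob2_tags = ["B-PER", "I-PER", "I-PER", "O", "B-LOC", "B-ORG", "I-ORG"]
iobes_tags = iob2_to_iobes(iob2_tags)
bilou_tags = iob2_to_iobes(iob2_tags, bilou=True)
print(iobes_tags)
print(bilou_tags)
```

Only the prefixes change between the two target schemes, so a single function with a flag covers both.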

  17. Linguistically annotated multilingual comparable corpora of parliamentary...

    • b2find.dkrz.de
    Updated Oct 24, 2023
    + more versions
    Cite
    (2023). Linguistically annotated multilingual comparable corpora of parliamentary debates ParlaMint.ana 3.0 - Dataset - B2FIND [Dataset]. https://b2find.dkrz.de/dataset/45bd5c1e-1870-5336-bc2d-1d4624197326
    Dataset updated
    Oct 24, 2023
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ParlaMint 3.0 is a multilingual set of 26 comparable corpora containing parliamentary debates mostly starting in 2015 and extending to mid-2022, with the individual corpora being between 9 and 125 million words in size. The corpora have extensive metadata, including aspects of the parliament; the speakers (name, gender, MP status, party affiliation, party coalition/opposition); are structured into time-stamped terms, sessions and meetings; and with speeches being marked by the speaker and their role (e.g. chair, regular speaker). The speeches also contain marked-up transcriber comments, such as gaps in the transcription, interruptions, applause, etc. Note that some corpora have further information, e.g. the year of birth of the speakers, links to their Wikipedia articles, their membership in various committees, etc. The corpora are also marked to the subcorpus they belong to ("reference", until 2020-01-30, "covid", from 2020-01-31, and "war", from 2022-02-24).

    This entry contains the linguistically marked-up version of the corpora, while the text version is available at http://hdl.handle.net/11356/1486. The ParlaMint.ana linguistic annotation includes tokenization, sentence segmentation, lemmatisation, Universal Dependencies part-of-speech, morphological features, and syntactic dependencies, and the 4-class CoNLL-2003 named entities. Some corpora also have further linguistic annotations, such as PoS tagging or named entities according to language-specific schemes, with their corpus TEI headers giving further details on the annotation vocabularies and tools.

    The compressed files include the ParlaMint.ana XML TEI-encoded linguistically annotated corpora; the derived corpora in CoNLL-U with TSV speech metadata; and the vertical files (with registry file), suitable for use with CQP-based concordancers, such as CWB, noSketch Engine or KonText. Also included is the 3.0 release of the data and scripts available at the GitHub repository of the ParlaMint project.

    As opposed to the previous version 2.1, this version corrects some errors in various corpora and adds the information on upper / lower house for bicameral parliaments. The vertical files have also been changed to make them easier to use in the concordancers.

  18. autoeval-eval-conll2003-conll2003-623e8b-1865063749

    • huggingface.co
    + more versions
    Cite
    Evaluation on the Hub, autoeval-eval-conll2003-conll2003-623e8b-1865063749 [Dataset]. https://huggingface.co/datasets/autoevaluate/autoeval-eval-conll2003-conll2003-623e8b-1865063749
    Dataset authored and provided by
    Evaluation on the Hub
    Description

    autoevaluate/autoeval-eval-conll2003-conll2003-623e8b-1865063749 dataset hosted on Hugging Face and contributed by the HF Datasets community

  19. autoeval-eval-conll2003-conll2003-11847a-96327146637

    • huggingface.co
    Updated Nov 21, 1996
    + more versions
    Cite
    Evaluation on the Hub (1996). autoeval-eval-conll2003-conll2003-11847a-96327146637 [Dataset]. https://huggingface.co/datasets/autoevaluate/autoeval-eval-conll2003-conll2003-11847a-96327146637
    Dataset updated
    Nov 21, 1996
    Dataset authored and provided by
    Evaluation on the Hub
    Description

    Dataset Card for AutoTrain Evaluator

    This repository contains model predictions generated by AutoTrain for the following task and dataset:

    • Task: Token Classification
    • Model: AIventurer/bert-finetuned-ner
    • Dataset: conll2003
    • Config: conll2003
    • Split: test

    To run new evaluation jobs, visit Hugging Face's automatic model evaluator.

      Contributions
    

    Thanks to @Anmol-Hexaware for evaluating this model.

  20. autoeval-staging-eval-project-conll2003-8cabc0e2-10785450

    • huggingface.co
    Updated Sep 1, 2003
    + more versions
    Cite
    autoeval-staging-eval-project-conll2003-8cabc0e2-10785450 [Dataset]. https://huggingface.co/datasets/autoevaluate/autoeval-staging-eval-project-conll2003-8cabc0e2-10785450
    Dataset updated
    Sep 1, 2003
    Dataset authored and provided by
    Evaluation on the Hub
    Description

    autoevaluate/autoeval-staging-eval-project-conll2003-8cabc0e2-10785450 dataset hosted on Hugging Face and contributed by the HF Datasets community
