https://choosealicense.com/licenses/other/
The shared task of CoNLL-2003 concerns language-independent named entity recognition. We will concentrate on four types of named entities: persons, locations, organizations and names of miscellaneous entities that do not belong to the previous three groups.
The CoNLL-2003 shared task data files contain four columns separated by a single space. Each word has been put on a separate line and there is an empty line after each sentence. The first item on each line is a word, the second a part-of-speech (POS) tag, the third a syntactic chunk tag and the fourth the named entity tag. The chunk tags and the named entity tags have the format I-TYPE, which means that the word is inside a phrase of type TYPE. In the original IOB1 scheme, the first word of a phrase receives the tag B-TYPE only when two phrases of the same type immediately follow each other; this version of the dataset instead uses the IOB2 scheme, in which the first word of every phrase is tagged B-TYPE. A word with tag O is not part of a phrase.
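For illustration, a minimal Python sketch of reading this four-column format; the file name below is a placeholder, and document-start markers are simply skipped:

# Minimal sketch of reading the four-column CoNLL-2003 format; the file name
# is a placeholder, not part of the official distribution.
def read_conll(path):
    sentences, current = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if line.startswith("-DOCSTART-"):   # document-start marker, no annotation
                continue
            if not line:                        # blank line ends the sentence
                if current:
                    sentences.append(current)
                    current = []
                continue
            word, pos, chunk, ner = line.split(" ")
            current.append((word, pos, chunk, ner))
    if current:
        sentences.append(current)
    return sentences

sentences = read_conll("eng.train")  # placeholder file name
print(sentences[0])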
For more details see https://www.clips.uantwerpen.be/conll2003/ner/ and https://www.aclweb.org/anthology/W03-0419
The shared task of CoNLL-2003 concerns language-independent named entity recognition and concentrates on four types of named entities: persons, locations, organizations and names of miscellaneous entities that do not belong to the previous three groups.
To use this dataset:
import tensorflow_datasets as tfds

# Load the training split and print a few examples.
ds = tfds.load('conll2003', split='train')
for ex in ds.take(4):
    print(ex)
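A slightly expanded variant loads the split together with its metadata, which is useful for mapping the integer label IDs back to tag names; this is a sketch using the standard with_info flag, and info.features should be checked for the exact feature keys in this dataset:

import tensorflow_datasets as tfds

# Load the split together with the DatasetInfo object.
ds, info = tfds.load('conll2003', split='train', with_info=True)

# Inspect the feature structure to see the exact keys and label vocabularies.
print(info.features)

for ex in ds.take(1):
    print(ex)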
See the guide for more information on tensorflow_datasets.
The CoNLL datasets are a widely used resource in the field of natural language processing (NLP). The name "CoNLL" stands for the Conference on Computational Natural Language Learning; the datasets originate from the shared tasks organized at that conference.
This dataset was created by GONG ZEQUN
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This is a trained model for the supervised machine learning tool NameTag 3 (https://ufal.mff.cuni.cz/nametag/3/), trained jointly on several NE corpora: English CoNLL-2003, German CoNLL-2003, Dutch CoNLL-2002, Spanish CoNLL-2002, Ukrainian Lang-uk, and Czech CNEC 2.0, all harmonized to flat NEs with 4 labels PER, ORG, LOC, and MISC. NameTag 3 is an open-source tool for both flat and nested named entity recognition (NER). NameTag 3 identifies proper names in text and classifies them into a set of predefined categories, such as names of persons, locations, organizations, etc. The model documentation can be found at https://ufal.mff.cuni.cz/nametag/3/models#multilingual-conll.
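For orientation only, a hedged sketch of querying the LINDAT NameTag web service from Python; the endpoint URL, the form field names and the model identifier below are assumptions, so the documentation linked above remains the authoritative reference:

# Hedged sketch: calling a NameTag recognition endpoint over HTTP with requests.
# The URL, the form field names and the model identifier are assumptions and
# should be checked against https://ufal.mff.cuni.cz/nametag/3/.
import requests

URL = "https://lindat.mff.cuni.cz/services/nametag/api/recognize"  # assumed endpoint
response = requests.post(URL, data={
    "data": "Charles University is located in Prague.",
    "model": "nametag3-multilingual-conll",  # assumed model identifier
})
response.raise_for_status()
print(response.json())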
A dataset of financial agreements made public through U.S. Securities and Exchange Commission (SEC) filings. Eight documents (totalling 54,256 words) were randomly selected for manual annotation, based on the four NE types provided in the CoNLL-2003 dataset: LOCATION (LOC), ORGANISATION (ORG), PERSON (PER), and MISCELLANEOUS (MISC).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the data associated with Joel Nothman, Nicky Ringland, Will Radford, Tara Murphy and James R. Curran (2013), "Learning multilingual named entity recognition from Wikipedia", Artificial Intelligence 194 (DOI: 10.1016/j.artint.2012.03.006). A preprint is included here as wikiner-preprint.pdf. This data was originally available at http://schwa.org/resources (which linked to http://schwa.org/projects/resources/wiki/Wikiner).
The .bz2 files are NER training corpora produced as reported in the Artificial Intelligence paper. wp2 and wp3 are differentiated by wp3 using a higher level of link inference. They use a pipe-delimited format that can be converted to CoNLL 2003 format with system2conll.pl.
nothman08types.tsv is a manual classification of articles first used in Joel Nothman, James R. Curran and Tara Murphy (2008), "Transforming Wikipedia into Named Entity Training Data", in Proceedings of the Australasian Language Technology Association Workshop 2008 (http://aclanthology.coli.uni-saarland.de/pdf/U/U08/U08-1016.pdf).
popular.tsv and random.tsv are manual article classifications developed for the Artificial Intelligence paper based on different strategies for sampling articles from Wikipedia, in order to account for Wikipedia's biased distribution (see that paper). scheme.tsv maps these fine-grained labels to coarser annotations, including CoNLL 2003-style ones.
wikigold.conll.txt is a manual NER annotation of some Wikipedia text as presented in Dominic Balasuriya, Nicky Ringland, Joel Nothman, Tara Murphy and James R. Curran (2009), in Proceedings of the 2009 Workshop on The People's Web Meets NLP: Collaboratively Constructed Semantic Resources (http://www.aclweb.org/anthology/W/W09/W09-3302).
See also corpora produced similarly in an enhanced version of this work (Pan et al., "Cross-lingual Name Tagging and Linking for 282 Languages", ACL 2017) at http://nlp.cs.rpi.edu/wikiann/.
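For orientation, a hedged Python sketch of the pipe-to-CoNLL conversion mentioned above, assuming each line of the .bz2 corpora is one sentence of space-separated token|POS|NER triples; the bundled system2conll.pl script is the authoritative converter:

# Hedged sketch of converting pipe-delimited WikiNER lines to CoNLL-style columns.
# The token|POS|NER layout is an assumption; see system2conll.pl for the real rules.
import bz2

def wikiner_to_conll(in_path, out_path):
    with bz2.open(in_path, "rt", encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            line = line.strip()
            if not line:
                continue
            for token in line.split(" "):
                word, pos, ner = token.rsplit("|", 2)
                fout.write(f"{word} {pos} {ner}\n")
            fout.write("\n")  # blank line between sentences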
This dataset was created by Julian Garratt
This dataset was created by vbichphuong
Dataset Card for AutoTrain Evaluator
This repository contains model predictions generated by AutoTrain for the following task and dataset:
Task: Token Classification
Model: sarahmiller137/distilbert-base-uncased-ft-conll2003
Dataset: conll2003
Config: conll2003
Split: test
To run new evaluation jobs, visit Hugging Face's automatic model evaluator.
Contributions
Thanks to @sarahmiller137 for evaluating this model.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
DaNE is a named entity annotation of the Danish Universal Dependencies treebank (based on the Danish Dependency Treebank), following the CoNLL-2003 annotation scheme.
This dataset was created by shweta sharma
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset contains 50 Supreme Court of India decisions annotated for Named Entity Recognition, with the annotations provided in three different encoding schemes: IOB, IOBES, and BILOU. The dataset follows the CoNLL-2003 format.
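For reference, converting between these schemes is mechanical; below is a minimal sketch mapping IOB2-style tags to IOBES (BILOU is the same transformation with L in place of E and U in place of S):

# Minimal sketch: convert one sentence's IOB2 tags to IOBES.
# BILOU uses the same logic, with L instead of E and U instead of S.
def iob2_to_iobes(tags):
    out = []
    for i, tag in enumerate(tags):
        if tag == "O":
            out.append(tag)
            continue
        prefix, etype = tag.split("-", 1)
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = nxt.startswith("I-") and nxt.split("-", 1)[1] == etype
        if prefix == "B":
            out.append(("B-" if continues else "S-") + etype)
        else:  # "I"
            out.append(("I-" if continues else "E-") + etype)
    return out

print(iob2_to_iobes(["B-PER", "I-PER", "O", "B-ORG"]))  # ['B-PER', 'E-PER', 'O', 'S-ORG']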
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ParlaMint 3.0 is a multilingual set of 26 comparable corpora containing parliamentary debates, mostly starting in 2015 and extending to mid-2022, with the individual corpora being between 9 and 125 million words in size. The corpora have extensive metadata, covering the parliament and the speakers (name, gender, MP status, party affiliation, party coalition/opposition); they are structured into time-stamped terms, sessions and meetings, with each speech marked by its speaker and their role (e.g. chair, regular speaker). The speeches also contain marked-up transcriber comments, such as gaps in the transcription, interruptions, applause, etc. Some corpora have further information, e.g. the year of birth of the speakers, links to their Wikipedia articles, their membership in various committees, etc. The corpora are also marked for the subcorpus they belong to ("reference", until 2020-01-30; "covid", from 2020-01-31; and "war", from 2022-02-24).
This entry contains the linguistically marked-up version of the corpora, while the text version is available at http://hdl.handle.net/11356/1486. The ParlaMint.ana linguistic annotation includes tokenization, sentence segmentation, lemmatisation, Universal Dependencies part-of-speech tags, morphological features and syntactic dependencies, as well as the 4-class CoNLL-2003 named entities. Some corpora also have further linguistic annotations, such as PoS tagging or named entities according to language-specific schemes, with their corpus TEI headers giving further details on the annotation vocabularies and tools.
The compressed files include the ParlaMint.ana XML TEI-encoded linguistically annotated corpora; the derived corpora in CoNLL-U with TSV speech metadata; and the vertical files (with registry file), suitable for use with CQP-based concordancers such as CWB, noSketch Engine or KonText. Also included is the 3.0 release of the data and scripts available at the GitHub repository of the ParlaMint project. As opposed to the previous version 2.1, this version corrects some errors in various corpora and adds information on the upper/lower house for bicameral parliaments. The vertical files have also been changed to make them easier to use in the concordancers.
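For the CoNLL-U files, a minimal Python sketch using the third-party conllu package; the file name is a placeholder, and how named entities are encoded (e.g. in the MISC column) should be verified against the corpus documentation:

# Minimal sketch: stream one ParlaMint CoNLL-U file with the `conllu` package.
# The file name is a placeholder; verify the named-entity encoding against the
# corpus TEI headers and documentation.
from conllu import parse_incr

with open("ParlaMint-GB_sample.conllu", encoding="utf-8") as f:
    for sentence in parse_incr(f):
        for token in sentence:
            print(token["form"], token["lemma"], token["upos"], token.get("misc"))
        break  # just the first sentence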
autoevaluate/autoeval-eval-conll2003-conll2003-623e8b-1865063749 dataset hosted on Hugging Face and contributed by the HF Datasets community
Dataset Card for AutoTrain Evaluator
This repository contains model predictions generated by AutoTrain for the following task and dataset:
Task: Token Classification
Model: AIventurer/bert-finetuned-ner
Dataset: conll2003
Config: conll2003
Split: test
To run new evaluation jobs, visit Hugging Face's automatic model evaluator.
Contributions
Thanks to @Anmol-Hexaware for evaluating this model.
autoevaluate/autoeval-staging-eval-project-conll2003-8cabc0e2-10785450 dataset hosted on Hugging Face and contributed by the HF Datasets community