22 datasets found

P
legal_NER Dataset
paperswithcode.com
Updated Oct 23, 2023
(2023). legal_NER Dataset [Dataset]. https://paperswithcode.com/dataset/legal-ner
Explore at:
Dataset updated
Oct 23, 2023
Description
legal_NER is a corpus of 46,545 annotated legal named entities mapped to 14 legal entity types. It is designed for named entity recognition in Indian court judgments.
h
InLegalNER
huggingface.co
Updated Apr 17, 2024
OpenNyAI (2024). InLegalNER [Dataset]. https://huggingface.co/datasets/opennyaiorg/InLegalNER
Explore at:
Dataset updated
Apr 17, 2024
Dataset authored and provided by
OpenNyAI
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
Dataset for training and evaluating Indian legal named entity recognition models.

Paper details

Named Entity Recognition in Indian court judgments Arxiv

Label Scheme

View label scheme (14 labels for 1 components)

ENTITY        BELONGS TO
LAWYER        PREAMBLE
COURT         PREAMBLE, JUDGEMENT
JUDGE         PREAMBLE, JUDGEMENT
PETITIONER    PREAMBLE, JUDGEMENT
RESPONDENT    PREAMBLE, JUDGEMENT
CASE_NUMBER   JUDGEMENT
GPE           JUDGEMENT
DATE          JUDGEMENT
ORG           JUDGEMENT
STATUTE       JUDGEMENT
… See the full description on the dataset page: https://huggingface.co/datasets/opennyaiorg/InLegalNER.
P
E-NER Dataset
paperswithcode.com
Ting Wai Terence Au; Ingemar J. Cox; Vasileios Lampos, E-NER Dataset [Dataset]. https://paperswithcode.com/dataset/e-ner
Explore at:
Authors
Ting Wai Terence Au; Ingemar J. Cox; Vasileios Lampos
Description
E-NER is a publicly available legal Named Entity Recognition (NER) dataset. It contains 52 filings from the US SEC EDGAR database. The named entity tags are hand-annotated.
h
german-ler
huggingface.co
opendatalab.com
Updated Mar 15, 2025
Elena Leitner (2025). german-ler [Dataset]. http://doi.org/10.57967/hf/0046
Explore at:
Unique identifier
https://doi.org/10.57967/hf/0046
Dataset updated
Mar 15, 2025
Authors
Elena Leitner
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
A dataset of Legal Documents from German federal court decisions for Named Entity Recognition. The dataset is human-annotated with 19 fine-grained entity classes. The dataset consists of approx. 67,000 sentences and contains 54,000 annotated entities.
h
legalnero
huggingface.co
Updated Aug 26, 2022
Joel Niklaus (2022). legalnero [Dataset]. https://huggingface.co/datasets/joelniklaus/legalnero
Explore at:
Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
Dataset updated
Aug 26, 2022
Authors
Joel Niklaus
License
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Description
Dataset Card for Romanian Named Entity Recognition in the Legal domain (LegalNERo)

Dataset Summary

LegalNERo is a manually annotated corpus for named entity recognition in the Romanian legal domain. It provides gold annotations for organizations, locations, persons, time and legal resources mentioned in legal documents. Additionally, it offers GEONAMES codes for the named entities annotated as location (where a link could be established).

Supported Tasks and… See the full description on the dataset page: https://huggingface.co/datasets/joelniklaus/legalnero.
h
greek_legal_ner
huggingface.co
Updated May 31, 2013
Joel Niklaus (2013). greek_legal_ner [Dataset]. https://huggingface.co/datasets/joelniklaus/greek_legal_ner
Explore at:
Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
Dataset updated
May 31, 2013
Authors
Joel Niklaus
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
Dataset Card for Greek Legal Named Entity Recognition

Dataset Summary

This dataset contains an annotated corpus for named entity recognition in Greek legislations. It is the first of its kind for the Greek language in such an extended form and one of the few that examines legal text in a full spectrum entity recognition.

Supported Tasks and Leaderboards

The dataset supports the task of named entity recognition.

Languages

The language in the dataset… See the full description on the dataset page: https://huggingface.co/datasets/joelniklaus/greek_legal_ner.
Z
Romanian Named Entity Recognition in the Legal domain (LegalNERo)
data.niaid.nih.gov
zenodo.org
Updated Aug 26, 2022
Gasan, Carol Luca (2022). Romanian Named Entity Recognition in the Legal domain (LegalNERo) [Dataset]. https://data.niaid.nih.gov/resources?id=ZENODO_4772094
Explore at:
Dataset updated
Aug 26, 2022
Dataset provided by
Onuț, Andrei
Ianov, Alexandru
Păiș, Vasile
Mitrofan, Maria
Ghiță, Corvin
Gasan, Carol Luca
Coneschi, Vlad Silviu
License
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Description
LegalNERo is a manually annotated corpus for named entity recognition in the Romanian legal domain. It provides gold annotations for organizations, locations, persons, time and legal resources mentioned in legal documents. Additionally, it offers GEONAMES codes for the named entities annotated as location (where a link could be established).

The LegalNERo corpus is available in different formats: span-based, token-based and RDF. The Linguistic Linked Open Data (LLOD) version is provided in RDF-Turtle format.

CONLLUP files conform to the CoNLL-U Plus format https://universaldependencies.org/ext-format.html . Part-of-speech tagging was realized using UDPIPE. Named entity annotations are placed in the column "RELATE:NE" (the 11th column) as defined in the "global.columns" metadata field. Similarly GEONAMES references are in the column "RELATE:GEONAMES" (the 12th column, last). Automatic processing was performed through the RELATE platform (https://relate.racai.ro).
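As a rough illustration of the column layout described above, the following sketch reads the "global.columns" header of a CoNLL-U Plus file and pulls out the "RELATE:NE" column by name. The sample rows, the tag values (B-ORG, B-LOC) and the GEONAMES code are invented for the example; the actual tag encoding used in the corpus files is not stated here and should be checked against the data.

```python
def read_conllup(text):
    """Parse CoNLL-U Plus text into (column names, list of sentences).

    Each sentence is a list of dicts mapping column name -> cell value.
    """
    columns = None
    sentences, current = [], []
    for line in text.splitlines():
        if line.startswith("# global.columns"):
            # Header looks like: "# global.columns = ID FORM ... RELATE:NE"
            columns = line.split("=", 1)[1].split()
        elif line.startswith("#"):
            continue  # other comment / metadata lines
        elif not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
        else:
            if columns is None:
                raise ValueError("missing '# global.columns' header")
            current.append(dict(zip(columns, line.split("\t"))))
    if current:
        sentences.append(current)
    return columns, sentences


# Hypothetical sample: two tokens, twelve tab-separated columns each.
HEADER = ("# global.columns = ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL "
          "DEPS MISC RELATE:NE RELATE:GEONAMES")
ROWS = [
    ["1", "Primaria", "_", "_", "_", "_", "_", "_", "_", "_", "B-ORG", "_"],
    ["2", "Cluj", "_", "_", "_", "_", "_", "_", "_", "_", "B-LOC", "12345"],
]
SAMPLE = "\n".join([HEADER] + ["\t".join(row) for row in ROWS]) + "\n"

columns, sentences = read_conllup(SAMPLE)
ne_tags = [token["RELATE:NE"] for token in sentences[0]]
print(columns.index("RELATE:NE") + 1, ne_tags)
```

Looking columns up by name rather than by fixed position keeps the reader working for both folder variants, since only some of them carry the GEONAMES column.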

ANN files conform to BRAT format (https://brat.nlplab.org/).

The archive contains:

ann_LEGAL_PER_LOC_ORG_TIME_overlap: folder of .ann files containing annotations of legal resources mentioned, persons, locations, organizations and time. Overlapping annotations of organizations and time entities inside legal references were allowed.

ann_LEGAL_PER_LOC_ORG_TIME: folder of .ann files containing annotations of legal resources mentioned, persons, locations, organizations and time. Overlapping annotations were not allowed and only the longest named entities were annotated.

ann_PER_LOC_ORG_TIME: folder of .ann files containing annotations of persons, locations, organizations and time. There are no overlapping annotations.

conllup_LEGAL_PER_LOC_ORG_TIME: folder of .conllup files containing annotations of legal resources mentioned, persons, locations, organizations and time. Overlapping annotations were not allowed and only the longest named entities were annotated. The annotation of these files was enhanced with GEONAMES codes (where linking was possible).

conllup_PER_LOC_ORG_TIME: folder of .conllup files containing annotations of persons, locations, organizations and time. Overlapping annotations were not allowed and only the longest named entities were annotated. The annotation of these files was enhanced with GEONAMES codes (where linking was possible).

rdf: folder containing the corpus in RDF-Turtle format. All the annotations are available here in both span and token format.

text: folder containing the raw texts.

NER System

A NER model generated using the LegalNERo corpus can be used online in the RELATE platform: https://relate.racai.ro/index.php?path=ner/demo

This system was described in: Păiș, Vasile and Mitrofan, Maria and Gasan, Carol Luca and Coneschi, Vlad and Ianov, Alexandru. Named Entity Recognition in the Romanian Legal Domain. In Proceedings of the Natural Legal Language Processing Workshop 2021. Association for Computational Linguistics, Punta Cana, Dominican Republic, pp. 9-18, November 2021.

LICENSING

This work is provided under the license CC BY-NC-ND 4.0 (Attribution-NonCommercial-NoDerivatives 4.0 International). The license can be viewed online here: https://creativecommons.org/licenses/by-nc-nd/4.0/ and the full text here: https://creativecommons.org/licenses/by-nc-nd/4.0/legalcode .

CONTACT

Research Institute for Artificial Intelligence "Mihai Draganescu", Romanian Academy Web: http://www.racai.ro Contact emails: vasile@racai.ro , maria@racai.ro
o
data-augmentation-ner-results
explore.openaire.eu
data.niaid.nih.gov
Updated Aug 3, 2022
Robin Erd; Leila Feddoul; Clara Lachenmaier; Marianne Jana Mauch (2022). data-augmentation-ner-results [Dataset]. http://doi.org/10.5281/zenodo.6956508
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.6956508
Dataset updated
Aug 3, 2022
Authors
Robin Erd; Leila Feddoul; Clara Lachenmaier; Marianne Jana Mauch
Description
Model evaluation results produced in the context of evaluating data augmentation for Named Entity Recognition over the German legal domain. Detailed information can be found on the GitHub page.
h
legal-ner
huggingface.co
daishen, legal-ner [Dataset]. https://huggingface.co/datasets/daishen/legal-ner
Explore at:
Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
Authors
daishen
Description
daishen/legal-ner dataset hosted on Hugging Face and contributed by the HF Datasets community
f
Indian Court Decision Annotated Corpus.xlsx
figshare.com
xlsx
Updated Aug 22, 2022
Pooja Harde; Pariskhit Kamat; Suraj Suresh; Shubham Kalson; Sarika Jain; Nandana Mihindukulasooriya (2022). Indian Court Decision Annotated Corpus.xlsx [Dataset]. http://doi.org/10.6084/m9.figshare.19719088.v4
Explore at:
Available download formats: xlsx
Unique identifier
https://doi.org/10.6084/m9.figshare.19719088.v4
Dataset updated
Aug 22, 2022
Dataset provided by
figshare
Authors
Pooja Harde; Pariskhit Kamat; Suraj Suresh; Shubham Kalson; Sarika Jain; Nandana Mihindukulasooriya
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
India
Description
The dataset contains 50 Supreme Court of India decisions annotated for named entity recognition, using three different encoding schemes: IOB, IOBES, and BILOU. The dataset was created using the CoNLL-2003 format.
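The three encoding schemes differ only in how they mark entity boundaries, so one can be derived from another. The sketch below converts a tag sequence from IOB to IOBES or BILOU, assuming the IOB2 variant (every entity starts with a B- tag); the tag sequence and label names are illustrative and are not taken from the corpus.

```python
def convert_iob2(tags, scheme="IOBES"):
    """Convert IOB2 tags to IOBES or BILOU.

    IOBES marks entity ends with E- and single-token entities with S-;
    BILOU uses L- (last) and U- (unit) for the same two roles.
    """
    end, single = {"IOBES": ("E", "S"), "BILOU": ("L", "U")}[scheme]
    converted = []
    for i, tag in enumerate(tags):
        if tag == "O":
            converted.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        # The entity continues only if the next tag is I- with the same label.
        continues = next_tag == f"I-{label}"
        if prefix == "B":
            converted.append(f"B-{label}" if continues else f"{single}-{label}")
        else:  # prefix == "I"
            converted.append(f"I-{label}" if continues else f"{end}-{label}")
    return converted


# Hypothetical tag sequence: a three-token entity and a one-token entity.
iob = ["B-PER", "I-PER", "I-PER", "O", "B-ORG", "O"]
iobes = convert_iob2(iob, "IOBES")
bilou = convert_iob2(iob, "BILOU")
print(iobes)
print(bilou)
```

IOBES and BILOU carry the same information under different letter names, which is why a single function with a small lookup table covers both.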
h
lener_br
huggingface.co
Updated Sep 12, 2023
Pedro Henrique Luz de Araujo (2023). lener_br [Dataset]. https://huggingface.co/datasets/peluz/lener_br
Explore at:
Dataset updated
Sep 12, 2023
Authors
Pedro Henrique Luz de Araujo
License
https://choosealicense.com/licenses/unknown/
Description
LeNER-Br is a Portuguese language dataset for named entity recognition applied to legal documents. LeNER-Br consists entirely of manually annotated legislation and legal case texts and contains tags for persons, locations, time entities, organizations, legislation and legal cases. To compose the dataset, 66 legal documents from several Brazilian Courts were collected. Courts of superior and state levels were considered, such as Supremo Tribunal Federal, Superior Tribunal de Justiça, Tribunal de Justiça de Minas Gerais and Tribunal de Contas da União. In addition, four legislation documents were collected, such as "Lei Maria da Penha", giving a total of 70 documents.
openthesaurus_dump_20220530
zenodo.org
data.niaid.nih.gov
bz2
Updated May 30, 2023
Robin Erd; Leila Feddoul; Clara Lachenmaier; Marianne Jana Mauch (2023). openthesaurus_dump_20220530 [Dataset]. http://doi.org/10.5281/zenodo.6956563
Explore at:
Available download formats: bz2
Unique identifier
https://doi.org/10.5281/zenodo.6956563
Dataset updated
May 30, 2023
Dataset provided by
Zenodo: http://zenodo.org/
Authors
Robin Erd; Leila Feddoul; Clara Lachenmaier; Marianne Jana Mauch
License
https://www.gnu.org/licenses/old-licenses/lgpl-2.1-standalone.html
Description
OpenThesaurus dump version used in the context of evaluating data augmentation for Named Entity Recognition over the German legal domain (detailed information can be found on the GitHub page).

This OpenThesaurus MySQL dump was downloaded on 30 May 2022 from: https://www.openthesaurus.de/about/download .

This data is made available under the Creative Commons Attribution-ShareAlike 4.0 or the GNU Lesser General Public License.
O
LeNER-Br
opendatalab.com
paperswithcode.com
zip
Updated Sep 21, 2022
University of Brasilia (2022). LeNER-Br [Dataset]. https://opendatalab.com/OpenDataLab/LeNER-Br
Explore at:
Available download formats: zip (24756792 bytes)
Dataset updated
Sep 21, 2022
Dataset provided by
University of Brasilia
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
LeNER-Br is a dataset for named entity recognition (NER) in Brazilian Legal Text.
h
uk_ner_contracts
huggingface.co
Updated Nov 20, 2023
Law Insider (2023). uk_ner_contracts [Dataset]. https://huggingface.co/datasets/lawinsider/uk_ner_contracts
Explore at:
Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
Dataset updated
Nov 20, 2023
Dataset authored and provided by
Law Insider
Description
Dataset Description

Legal Contracts Dataset for Training NER Model

This repository contains a specially curated dataset consisting of legal contracts. It is designed for training a Named Entity Recognition (NER) model to recognize and classify four types of entities in the text: Contract Type, Clause Title, Clause Number, and Definition Title. The dataset includes a broad variety of legal contracts, covering diverse domains such as employment, real estate… See the full description on the dataset page: https://huggingface.co/datasets/lawinsider/uk_ner_contracts.
h
mapa
huggingface.co
opendatalab.com
Updated Apr 9, 2023
Joel Niklaus (2023). mapa [Dataset]. https://huggingface.co/datasets/joelniklaus/mapa
Explore at:
Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
Dataset updated
Apr 9, 2023
Authors
Joel Niklaus
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Dataset Card for Multilingual European Datasets for Sensitive Entity Detection in the Legal Domain

Dataset Summary

The dataset consists of 12 documents (9 for Spanish due to parsing errors) taken from EUR-Lex, a multilingual corpus of court decisions and legal dispositions in the 24 official languages of the European Union. The documents have been annotated for named entities following the guidelines of the MAPA project, which foresees two annotation levels, a general and a… See the full description on the dataset page: https://huggingface.co/datasets/joelniklaus/mapa.
Z
Romanian micro-blogging named entity recognition (MicroBloggingNERo)
data.niaid.nih.gov
Updated Jul 27, 2022
Florea, Bianca (2022). Romanian micro-blogging named entity recognition (MicroBloggingNERo) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6905234
Explore at:
Dataset updated
Jul 27, 2022
Dataset provided by
Badila, Ana
Barbu-Mititelu, Verginica
Dicusar, Maria
Florea, Bianca
Marin, Laura
Irimia, Elena
Micu, Roxana
Păiș, Vasile
Mitrofan, Maria
Gasan, Carol Luca
License
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Description
MicroBloggingNERo is a manually annotated corpus for named entity recognition in Romanian micro-blogging texts. It provides gold annotations for organizations, locations, persons, time expressions, legal references, medical devices, chemicals, anatomical parts and disorders found in micro-blogging texts. The text was anonymized, by replacing all URLs with , user references with , person names, specific locations and organizations with new randomized names. Anonymization was realized in the same way, regardless of the micro-blogging platform specific format.

Since names were replaced with new random ones, any resemblance to real individuals is by pure chance of the random names generator. No real person is depicted in the included messages.

DATA

The MicroBloggingNERo corpus is available in different formats: text, span-based, and token-based.

Text files are in the folder "text" with .txt extension, in UTF-8 encoding.

Span-based annotations are given in BRAT (https://brat.nlplab.org/) ann format. These annotations can be found in folders starting with "ann_".
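For readers unfamiliar with the BRAT standoff layout, here is a minimal sketch of reading entity ("text-bound") annotations from an .ann file. It relies only on the general BRAT convention of tab-separated lines of the form ID, then type with character offsets, then surface text; the sample annotations and label names below are hypothetical and not copied from the corpus.

```python
from typing import List, NamedTuple, Tuple


class Entity(NamedTuple):
    ann_id: str
    label: str
    spans: List[Tuple[int, int]]  # (start, end) character offsets
    text: str


def parse_brat_entities(ann_text: str) -> List[Entity]:
    """Extract text-bound annotations (lines whose ID starts with 'T')."""
    entities = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue  # skip relations, notes and other annotation kinds
        ann_id, type_and_offsets, surface = line.split("\t", 2)
        label, offsets = type_and_offsets.split(" ", 1)
        # Discontinuous annotations separate fragments with ';'.
        spans = [tuple(int(n) for n in fragment.split())
                 for fragment in offsets.split(";")]
        entities.append(Entity(ann_id, label, spans, surface))
    return entities


# Hypothetical .ann content: two entities and one annotator note.
SAMPLE_ANN = "\n".join([
    "T1\tORG 0 8\tACME SRL",
    "T2\tTIME 12 16\t2020",
    "#1\tAnnotatorNotes T1\texample note",
])

entities = parse_brat_entities(SAMPLE_ANN)
for entity in entities:
    print(entity.label, entity.spans, entity.text)
```

The offsets index into the matching raw text file, so an entity's surface form can be cross-checked by slicing that text with each (start, end) pair.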

Token-based annotations are given in CONLLUP files, following the CoNLL-U Plus format https://universaldependencies.org/ext-format.html . Part-of-speech tagging was realized using UDPIPE. Named entity annotations are placed in the column "RELATE:NE" (the 11th column) as defined in the "global.columns" metadata field. Automatic processing was performed through the RELATE platform (https://relate.racai.ro).

The archive contains:

ann_EVERYTHING: folder of .ann files containing annotations of legal references, persons, locations, organizations, time, chemicals, medical devices, anatomical parts and disorders. Overlapping annotations of organizations and time entities inside legal references were allowed.

ann_EVERYTHING_LARGEST_SPAN: folder of .ann files containing annotations of legal references, persons, locations, organizations, time, chemicals, medical devices, anatomical parts and disorders. Overlapping annotations were not allowed and only the longest named entities were annotated. This affects primarily the legal references class.

ann_LEGAL_PER_LOC_ORG_TIME: folder of .ann files containing annotations of legal references, persons, locations, organizations and time. There are no overlapping annotations.

ann_PER_LOC_ORG_TIME: folder of .ann files containing annotations of persons, locations, organizations and time. There are no overlapping annotations.

ann_BIOMEDICAL: folder of .ann files containing annotations of medical devices, chemicals, anatomical parts and disorders. There are no overlapping annotations.

conllup_EVERYTHING_LARGEST_SPAN: folder of .conllup files containing annotations of legal references, persons, locations, organizations, time, chemicals, medical devices, anatomical parts and disorders. There are no overlapping annotations.

conllup_LEGAL_PER_LOC_ORG_TIME: folder of .conllup files containing annotations of legal references, persons, locations, organizations and time. Overlapping annotations were not allowed and only the longest named entities were annotated.

conllup_PER_LOC_ORG_TIME: folder of .conllup files containing annotations of persons, locations, organizations and time. Overlapping annotations were not allowed and only the longest named entities were annotated.

conllup_BIOMEDICAL: folder of .conllup files containing annotations of medical devices, chemicals, anatomical parts and disorders. Overlapping annotations were not allowed and only the longest named entities were annotated.

text: folder containing the raw texts.

splits.tsv: proposed splits into train, test and valid, following a 70-15-15% distribution for each entity class, based on the ann_EVERYTHING_LARGEST_SPAN folder.

LICENSING

This work is provided under the license CC BY-NC-ND 4.0 (Attribution-NonCommercial-NoDerivatives 4.0 International). The license can be viewed online here: https://creativecommons.org/licenses/by-nc-nd/4.0/ and the full text here: https://creativecommons.org/licenses/by-nc-nd/4.0/legalcode .

CONTACT

Research Institute for Artificial Intelligence "Mihai Draganescu", Romanian Academy Web: http://www.racai.ro Contact emails: vasile@racai.ro , maria@racai.ro , vergi@racai.ro , elena@racai.ro
D
Data from: To NER or not to NER? A case study of low-resource deontic...
dataverse.nl
Updated Dec 17, 2024
Shashank Chakravarthy; Gijs van Dijck; Anna Wilbik (2024). To NER or not to NER? A case study of low-resource deontic modalities in EU legislation? [Dataset]. http://doi.org/10.34894/D9AKUS
Explore at:
Available download formats: application/x-ipynb+json (343282), csv (54271), application/x-ipynb+json (220670), csv (688377), application/x-ipynb+json (220604)
Unique identifier
https://doi.org/10.34894/D9AKUS
Dataset updated
Dec 17, 2024
Dataset provided by
DataverseNL
Authors
Shashank Chakravarthy; Gijs van Dijck; Anna Wilbik
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
European Union
Description
Deontic modality (obligation, permission, prohibition) in legal documents can convey critical information, and identification of deontic modalities is often performed using Natural Language Processing (NLP) techniques as a 'Deontic Modality Classification' (DMC) text classification task. As deontic modalities in legal text are not mutually exclusive, a key challenge with DMC is that it classifies the provided text into a single modality while in reality it might have multiple deontic modalities. To address this, this study analyzes the feasibility of performing deontic modality identification as a Named Entity Recognition (NER) task over DMC task approaches in a low-resource data setting with EU legislation. Low-resource NLP approaches can offer solutions to tackle the problem of scarce data. In this paper, we use a rule-based approach with modal verbs and a Decision Tree classifier for the DMC task. For NER, we utilize Conditional Random Fields (CRFs) in a low-resource setting and report on the reliability and precision for identification of deontic modality. Our experiments reveal that simpler models, like decision trees, outperform larger models in the low-resource setting of DMC, obtaining a macro-F1 score of 0.83. For the NER task, the CRF models show consistent performance for 'obligation' labels with an F1-score of 0.51 but have wavering results for other classes, with a max F1-score of 0.26 for 'permission' and 0.08 for 'prohibition'.
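Since the description reports both per-class F1 and macro-F1, a short worked example of how the two relate may help: macro-F1 is the unweighted mean of the per-class F1 scores, so a weak class pulls the average down regardless of how rare it is. The counts below are invented purely for illustration and do not reproduce the study's results.

```python
def f1_score(tp, fp, fn):
    """F1 from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(counts):
    """Unweighted mean of per-class F1; counts maps label -> (tp, fp, fn)."""
    return sum(f1_score(*c) for c in counts.values()) / len(counts)


# Hypothetical counts per deontic label (not from the dataset).
counts = {
    "obligation": (8, 2, 2),   # precision 0.8, recall 0.8
    "permission": (5, 5, 5),   # precision 0.5, recall 0.5
    "prohibition": (0, 1, 3),  # no true positives
}

per_class = {label: f1_score(*c) for label, c in counts.items()}
overall = macro_f1(counts)
print(per_class, round(overall, 4))
```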
h
Regulations_NER
huggingface.co
SecureFinAI Lab, Regulations_NER [Dataset]. https://huggingface.co/datasets/SecureFinAI-Lab/Regulations_NER
Explore at:
Dataset authored and provided by
SecureFinAI Lab
License
https://choosealicense.com/licenses/cdla-permissive-2.0/
Description
Overview

This question set was created to evaluate LLMs' ability to perform named entity recognition (NER) in financial regulatory texts. It was developed for a task at Regulations Challenge @ COLING 2025. The objective is to accurately identify and classify entities, including organizations, legislation, dates, monetary values, and statistics. Financial regulations often require supervising and reporting on specific entities, such as organizations, financial products, and transactions, and… See the full description on the dataset page: https://huggingface.co/datasets/SecureFinAI-Lab/Regulations_NER.
m
Dataset of Named Entity Recognition for Uzbek language
data.mendeley.com
Updated Oct 22, 2024
Davlatyor Mengliev (2024). Dataset of Named Entity Recognition for Uzbek language [Dataset]. http://doi.org/10.17632/6hphxr74rp.2
Explore at:
Unique identifier
https://doi.org/10.17632/6hphxr74rp.2
Dataset updated
Oct 22, 2024
Authors
Davlatyor Mengliev
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
As part of the study, an annotated corpus of the Uzbek language was created for training and evaluating named entity recognition models. The corpus includes 2,000 sentences (25,865 words) collected from various sources:
• Legislative acts and legal documents: The bulk of the data was extracted from the publicly available lex.uz database, which contains official texts that are highly literate and have a formal language structure.
• News sites: Articles and materials from Uzbek news portals (kun.uz, gazeta.uz) were used, which made it possible to include modern language structures and relevant vocabulary.
• Manually created sentences: To increase the number of named entities in sentences and ensure diversity, author's sentences were developed containing several entities of different types. This enriched the corpus with complex structures and increased the efficiency of model training.
Data annotation was carried out manually using the BIOES scheme, which provides detailed marking of boundaries and types of named entities. All abstracts were reviewed by Uzbek language experts to ensure accuracy and consistency of data.
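To show what the boundary marking in a BIOES scheme provides in practice, the sketch below decodes a BIOES tag sequence into labelled token spans. The label names and tag sequence are hypothetical, since the corpus's label inventory is not listed here; malformed sequences (for example an E- tag with no opening B-) are simply dropped in this version.

```python
def bioes_to_spans(tags):
    """Decode BIOES tags into (label, start, end) spans, end exclusive."""
    spans = []
    start, current = None, None
    for i, tag in enumerate(tags):
        if tag == "O":
            start, current = None, None
            continue
        prefix, label = tag.split("-", 1)
        if prefix == "S":
            # Single-token entity.
            spans.append((label, i, i + 1))
            start, current = None, None
        elif prefix == "B":
            start, current = i, label
        elif prefix == "E":
            # Close the entity only if it matches the one that was opened.
            if start is not None and label == current:
                spans.append((label, start, i + 1))
            start, current = None, None
        elif prefix == "I":
            # An inside tag that does not continue the open entity is invalid.
            if start is None or label != current:
                start, current = None, None
    return spans


# Hypothetical tags: a three-token entity followed by a one-token entity.
tags = ["B-ORG", "I-ORG", "E-ORG", "O", "S-LOC"]
spans = bioes_to_spans(tags)
print(spans)
```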
HOME-Alcar: Aligned and Annotated Cartularies
zenodo.org
explore.openaire.eu
bin, json, pdf, zip
Updated Jul 17, 2024
Dominique Stutzmann; Sergio Torres Aguilar; Paul Chaffenet (2024). HOME-Alcar: Aligned and Annotated Cartularies [Dataset]. http://doi.org/10.5281/zenodo.5600884
Explore at:
Available download formats: pdf, zip, json, bin
Unique identifier
https://doi.org/10.5281/zenodo.5600884
Dataset updated
Jul 17, 2024
Dataset provided by
Zenodo: http://zenodo.org/
Authors
Dominique Stutzmann; Sergio Torres Aguilar; Paul Chaffenet
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The HOME-Alcar (Aligned and Annotated Cartularies) corpus was produced as part of the European research project HOME History of Medieval Europe (https://www.heritageresearch-hub.eu/project/home/), led under the coordination of Institut de Recherche et d'Histoire des Textes (PI: D. Stutzmann), with the Universitat Politecnica de Valencia (PI: E. Vidal), the National Archives of the Czech Republic in Prague (PI: J. Kreckova), and Teklia SAS (PI: C. Kermorvant).
The HOME-Alcar (Aligned and Annotated Cartularies) corpus is a resource created to train Handwritten Text Recognition (HTR) and Named Entity Recognition (NER), and presents a collection of
(i) digital images of 17 medieval manuscripts;
(ii) scholarly editions thereof;
(iii) coordinates linking images and text at line level;
(iv) annotations of Named Entities (place and person names).
The 17 medieval manuscripts in this corpus are cartularies, i.e. books copying charters and legal acts, produced between the 12th and 14th centuries.

legal_NER Dataset: 10 scholarly articles cite this dataset.
