legal_NER is a corpus of 46,545 annotated legal named entities mapped to 14 legal entity types. It is designed for named entity recognition in Indian court judgments.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Dataset for training and evaluating Indian legal named entity recognition models.
Paper details
Named Entity Recognition in Indian Court Judgments (arXiv)
Label Scheme
14 labels for 1 component

ENTITY        BELONGS TO
LAWYER        PREAMBLE
COURT         PREAMBLE, JUDGEMENT
JUDGE         PREAMBLE, JUDGEMENT
PETITIONER    PREAMBLE, JUDGEMENT
RESPONDENT    PREAMBLE, JUDGEMENT
CASE_NUMBER   JUDGEMENT
GPE           JUDGEMENT
DATE          JUDGEMENT
ORG           JUDGEMENT
STATUTE       JUDGEMENT…

See the full description on the dataset page: https://huggingface.co/datasets/opennyaiorg/InLegalNER.
E-NER is a publicly available legal Named Entity Recognition (NER) data set. It contains 52 filings from the US SEC EDGAR database. The named entity tags are hand annotated.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A dataset of Legal Documents from German federal court decisions for Named Entity Recognition. The dataset is human-annotated with 19 fine-grained entity classes. The dataset consists of approx. 67,000 sentences and contains 54,000 annotated entities.
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0) https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Dataset Card for Romanian Named Entity Recognition in the Legal domain (LegalNERo)
Dataset Summary
LegalNERo is a manually annotated corpus for named entity recognition in the Romanian legal domain. It provides gold annotations for organizations, locations, persons, time and legal resources mentioned in legal documents. Additionally it offers GEONAMES codes for the named entities annotated as location (where a link could be established).
Supported Tasks and… See the full description on the dataset page: https://huggingface.co/datasets/joelniklaus/legalnero.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Dataset Card for Greek Legal Named Entity Recognition
Dataset Summary
This dataset contains an annotated corpus for named entity recognition in Greek legislation. It is the first of its kind for the Greek language in such an extended form and one of the few that examine legal text with full-spectrum entity recognition.
Supported Tasks and Leaderboards
The dataset supports the task of named entity recognition.
Languages
The language in the dataset… See the full description on the dataset page: https://huggingface.co/datasets/joelniklaus/greek_legal_ner.
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0) https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
LegalNERo is a manually annotated corpus for named entity recognition in the Romanian legal domain. It provides gold annotations for organizations, locations, persons, time and legal resources mentioned in legal documents. Additionally it offers GEONAMES codes for the named entities annotated as location (where a link could be established).
The LegalNERo corpus is available in different formats: span-based, token-based and RDF. The Linguistic Linked Open Data (LLOD) version is provided in RDF-Turtle format.
CONLLUP files conform to the CoNLL-U Plus format (https://universaldependencies.org/ext-format.html). Part-of-speech tagging was performed using UDPipe. Named entity annotations are placed in the column "RELATE:NE" (the 11th column), as defined in the "global.columns" metadata field. Similarly, GEONAMES references are in the column "RELATE:GEONAMES" (the 12th and last column). Automatic processing was performed through the RELATE platform (https://relate.racai.ro).
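A minimal sketch of how such a file might be read, assuming tab-separated token lines and a "# global.columns = ..." header as described in the CoNLL-U Plus specification. The function name, the sample tokens, and the tag values (B-ORG, I-ORG, O) are illustrative assumptions, not taken from the corpus.

```python
from typing import Dict, Iterable, Iterator, List


def read_conllup(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield sentences as lists of {column name: value} dicts."""
    columns: List[str] = []
    sentence: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("# global.columns"):
            # Column names come from the metadata field, not hard-coded positions.
            columns = line.split("=", 1)[1].split()
        elif line.startswith("#"):
            continue  # other comment / metadata lines
        elif not line.strip():
            if sentence:  # a blank line closes the current sentence
                yield sentence
                sentence = []
        else:
            sentence.append(dict(zip(columns, line.split("\t"))))
    if sentence:
        yield sentence


# Invented sample; real files carry POS and dependency data in the other columns.
SAMPLE = (
    "# global.columns = ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC "
    "RELATE:NE RELATE:GEONAMES\n"
    "1\tCurtea\t_\t_\t_\t_\t_\t_\t_\t_\tB-ORG\t_\n"
    "2\tSuprema\t_\t_\t_\t_\t_\t_\t_\t_\tI-ORG\t_\n"
    "3\tdecide\t_\t_\t_\t_\t_\t_\t_\t_\tO\t_\n"
    "\n"
)

for sent in read_conllup(SAMPLE.splitlines()):
    print([(tok["FORM"], tok["RELATE:NE"]) for tok in sent])
```

Looking columns up by name rather than by index keeps the reader usable for files that declare a different column set in "global.columns".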
ANN files conform to BRAT format (https://brat.nlplab.org/).
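As a rough illustration of the span-based format, the sketch below parses text-bound annotations (lines whose ID starts with "T") from BRAT standoff content. The names `Entity` and `parse_ann`, the sample text, and the sample annotations are hypothetical; only the general "ID, type and offsets, covered text" tab-separated layout comes from the BRAT standoff format.

```python
from typing import List, NamedTuple


class Entity(NamedTuple):
    id: str
    label: str
    start: int
    end: int
    text: str


def parse_ann(content: str) -> List[Entity]:
    """Collect text-bound annotations from BRAT standoff content."""
    entities: List[Entity] = []
    for line in content.splitlines():
        if not line.startswith("T"):
            continue  # relations, attributes, notes, etc. are skipped here
        ann_id, type_and_span, text = line.split("\t", 2)
        label, span = type_and_span.split(" ", 1)
        # Discontinuous spans look like "s1 e1;s2 e2"; keep the overall extent.
        offsets = [int(x) for frag in span.split(";") for x in frag.split()]
        entities.append(Entity(ann_id, label, min(offsets), max(offsets), text))
    return entities


# Invented document text and annotations, for illustration only.
DOC = "Curtea de Apel Cluj, 2020"
SAMPLE_ANN = "T1\tORG 0 19\tCurtea de Apel Cluj\nT2\tTIME 21 25\t2020\n"

for ent in parse_ann(SAMPLE_ANN):
    # The offsets index into the raw text file that accompanies the .ann file.
    print(ent.label, repr(DOC[ent.start:ent.end]))
```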
The archive contains:
ann_LEGAL_PER_LOC_ORG_TIME_overlap: Folder in which all files are in .ann format, containing annotations of legal resources mentioned, persons, locations, organizations and time. Overlapping annotations of organizations and time entities inside legal references were allowed.
ann_LEGAL_PER_LOC_ORG_TIME: Folder in which all files are in .ann format, containing annotations of legal resources mentioned, persons, locations, organizations and time. Overlapping annotations were not allowed and only the longest named entities were annotated.
ann_PER_LOC_ORG_TIME: Folder in which all files are in .ann format, containing annotations of persons, locations, organizations and time. There are no overlapping annotations.
conllup_LEGAL_PER_LOC_ORG_TIME: Folder in which all files are in .conllup format, containing annotations of legal resources mentioned, persons, locations, organizations and time. Overlapping annotations were not allowed and only the longest named entities were annotated. The annotation of these files was enhanced with GEONAMES codes (where linking was possible).
conllup_PER_LOC_ORG_TIME: Folder in which all files are in .conllup format, containing annotations of persons, locations, organizations and time. Overlapping annotations were not allowed and only the longest named entities were annotated. The annotation of these files was enhanced with GEONAMES codes (where linking was possible).
rdf: Folder containing the corpus in RDF-Turtle format. All the annotations are available here in both span and token format.
text: Folder containing the raw texts.
NER System
A NER model generated using the LegalNERo corpus can be used online in the RELATE platform: https://relate.racai.ro/index.php?path=ner/demo
This system was described in: Păiș, Vasile; Mitrofan, Maria; Gasan, Carol Luca; Coneschi, Vlad; Ianov, Alexandru. Named Entity Recognition in the Romanian Legal Domain. In Proceedings of the Natural Legal Language Processing Workshop 2021. Association for Computational Linguistics, Punta Cana, Dominican Republic, pp. 9-18, November 2021.
LICENSING
This work is provided under the license CC BY-NC-ND 4.0 (Attribution-NonCommercial-NoDerivatives 4.0 International). The license can be viewed online here: https://creativecommons.org/licenses/by-nc-nd/4.0/ and the full text here: https://creativecommons.org/licenses/by-nc-nd/4.0/legalcode .
CONTACT
Research Institute for Artificial Intelligence "Mihai Draganescu", Romanian Academy Web: http://www.racai.ro Contact emails: vasile@racai.ro , maria@racai.ro
Model evaluation results produced while evaluating data augmentation for Named Entity Recognition in the German legal domain. Detailed information can be found on the GitHub page.
daishen/legal-ner dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset contains 50 Supreme Court of India decisions annotated for named entity recognition, provided in three encoding schemes: IOB, IOBES, and BILOU. The dataset is created using the CoNLL-2003 format.
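The three schemes encode the same spans and differ only in how span boundaries are marked. The sketch below converts between them, assuming the IOB input follows the IOB2 convention (every entity starts with a B- tag); the function names and the example tags are illustrative and not part of the dataset.

```python
from typing import List


def iob2_to_iobes(tags: List[str]) -> List[str]:
    """Convert IOB2 tags to IOBES: mark span ends (E-) and single tokens (S-)."""
    out: List[str] = []
    for i, tag in enumerate(tags):
        if tag == "O":
            out.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = nxt == f"I-{label}"  # does the same entity go on?
        if prefix == "B":
            out.append(f"B-{label}" if continues else f"S-{label}")
        else:  # prefix == "I"
            out.append(f"I-{label}" if continues else f"E-{label}")
    return out


def iobes_to_bilou(tags: List[str]) -> List[str]:
    """BILOU renames E- (end) to L- (last) and S- (single) to U- (unit)."""
    rename = {"E": "L", "S": "U"}
    return [t if t == "O" else rename.get(t[0], t[0]) + t[1:] for t in tags]


iob = ["B-ORG", "I-ORG", "I-ORG", "O", "B-DATE"]
iobes = iob2_to_iobes(iob)
bilou = iobes_to_bilou(iobes)
print(iobes)  # ['B-ORG', 'I-ORG', 'E-ORG', 'O', 'S-DATE']
print(bilou)  # ['B-ORG', 'I-ORG', 'L-ORG', 'O', 'U-DATE']
```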
https://choosealicense.com/licenses/unknown/
LeNER-Br is a Portuguese-language dataset for named entity recognition applied to legal documents. LeNER-Br consists entirely of manually annotated legislation and legal case texts and contains tags for persons, locations, time entities, organizations, legislation and legal cases. To compose the dataset, 66 legal documents from several Brazilian courts were collected. Courts of superior and state levels were considered, such as Supremo Tribunal Federal, Superior Tribunal de Justiça, Tribunal de Justiça de Minas Gerais and Tribunal de Contas da União. In addition, four legislation documents were collected, such as "Lei Maria da Penha", giving a total of 70 documents.
https://www.gnu.org/licenses/old-licenses/lgpl-2.1-standalone.html
OpenThesaurus dump version used while evaluating data augmentation for Named Entity Recognition in the German legal domain. Detailed information can be found on the GitHub page.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
LeNER-Br is a dataset for named entity recognition (NER) in Brazilian Legal Text.
Dataset Description
Legal Contracts Dataset for Training NER Model
This repository contains a specially curated dataset consisting of legal contracts. It is designed for training a Named Entity Recognition (NER) model to recognize and classify four types of entities in the text: Contract Type, Clause Title, Clause Number, and Definition Title. The dataset includes a broad variety of legal contracts, covering diverse domains such as employment, real estate… See the full description on the dataset page: https://huggingface.co/datasets/lawinsider/uk_ner_contracts.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Card for Multilingual European Datasets for Sensitive Entity Detection in the Legal Domain
Dataset Summary
The dataset consists of 12 documents (9 for Spanish due to parsing errors) taken from EUR-Lex, a multilingual corpus of court decisions and legal dispositions in the 24 official languages of the European Union. The documents have been annotated for named entities following the guidelines of the MAPA project, which foresees two annotation levels, a general and a… See the full description on the dataset page: https://huggingface.co/datasets/joelniklaus/mapa.
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0) https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
MicroBloggingNERo is a manually annotated corpus for named entity recognition in Romanian micro-blogging texts. It provides gold annotations for organizations, locations, persons, time expressions, legal references, medical devices, chemicals, anatomical parts and disorders found in micro-blogging texts. The text was anonymized by replacing all URLs and user references with placeholders, and person names, specific locations and organizations with new randomized names. Anonymization was performed in the same way regardless of the platform-specific micro-blogging format.
Since names were replaced with new random ones, any resemblance to real individuals is by pure chance of the random names generator. No real person is depicted in the included messages.
DATA
The MicroBloggingNERo corpus is available in different formats: text, span-based, and token-based.
Text files are in the folder "text" with .txt extension, in UTF-8 encoding.
Span-based annotations are given in BRAT (https://brat.nlplab.org/) ann format. These annotations can be found in folders starting with "ann_".
Token-based annotations are given in CONLLUP files, following the CoNLL-U Plus format (https://universaldependencies.org/ext-format.html). Part-of-speech tagging was performed using UDPipe. Named entity annotations are placed in the column "RELATE:NE" (the 11th column), as defined in the "global.columns" metadata field. Automatic processing was performed through the RELATE platform (https://relate.racai.ro).
The archive contains:
ann_EVERYTHING: Folder in which all files are in .ann format, containing annotations of legal references, persons, locations, organizations, time, chemicals, medical devices, anatomical parts and disorders. Overlapping annotations of organizations and time entities inside legal references were allowed.
ann_EVERYTHING_LARGEST_SPAN: Folder in which all files are in .ann format, containing annotations of legal references, persons, locations, organizations, time, chemicals, medical devices, anatomical parts and disorders. Overlapping annotations were not allowed and only the longest named entities were annotated. This primarily affects the legal references class.
ann_LEGAL_PER_LOC_ORG_TIME: Folder in which all files are in .ann format, containing annotations of legal references, persons, locations, organizations and time. There are no overlapping annotations.
ann_PER_LOC_ORG_TIME: Folder in which all files are in .ann format, containing annotations of persons, locations, organizations and time. There are no overlapping annotations.
ann_BIOMEDICAL: Folder in which all files are in .ann format, containing annotations of medical devices, chemicals, anatomical parts and disorders. There are no overlapping annotations.
conllup_EVERYTHING_LARGEST_SPAN: Folder in which all files are in .conllup format, containing annotations of legal references, persons, locations, organizations, time, chemicals, medical devices, anatomical parts and disorders. There are no overlapping annotations.
conllup_LEGAL_PER_LOC_ORG_TIME: Folder in which all files are in .conllup format, containing annotations of legal references, persons, locations, organizations and time. Overlapping annotations were not allowed and only the longest named entities were annotated.
conllup_PER_LOC_ORG_TIME: Folder in which all files are in .conllup format, containing annotations of persons, locations, organizations and time. Overlapping annotations were not allowed and only the longest named entities were annotated.
conllup_BIOMEDICAL: Folder in which all files are in .conllup format, containing annotations of medical devices, chemicals, anatomical parts and disorders. Overlapping annotations were not allowed and only the longest named entities were annotated.
text: Folder containing the raw texts.
splits.tsv: Proposed splits into train, test and valid sets, following a 70-15-15% distribution for each entity class, based on the ann_EVERYTHING_LARGEST_SPAN folder.
LICENSING
This work is provided under the license CC BY-NC-ND 4.0 (Attribution-NonCommercial-NoDerivatives 4.0 International). The license can be viewed online here: https://creativecommons.org/licenses/by-nc-nd/4.0/ and the full text here: https://creativecommons.org/licenses/by-nc-nd/4.0/legalcode .
CONTACT
Research Institute for Artificial Intelligence "Mihai Draganescu", Romanian Academy Web: http://www.racai.ro Contact emails: vasile@racai.ro , maria@racai.ro , vergi@racai.ro , elena@racai.ro
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Deontic modality (obligation, permission, prohibition) in legal documents can convey critical information, and identification of deontic modalities is often performed using Natural Language Processing (NLP) techniques as a 'Deontic Modality Classification' (DMC) text classification task. As deontic modalities in legal text are not mutually exclusive, a key challenge with DMC is that it classifies the provided text into a single modality while in reality it might have multiple deontic modalities. To address this, this study analyzes the feasibility of performing deontic modality identification as a Named Entity Recognition (NER) task over DMC task approaches in a low-resource data setting with EU legislation. Low-resource NLP approaches can offer solutions to tackle the problem of scarce data. In this paper, we use a rule-based approach with modal verbs and a Decision Tree classifier for the DMC task. For NER, we utilize Conditional Random Fields (CRFs) in a low-resource setting and report on the reliability and precision for identification of deontic modality. Our experiments reveal that simpler models, like decision trees, outperform larger models in the low-resource setting of DMC, obtaining a macro-F1 score of 0.83. For the NER task, the CRF models show consistent performance for 'obligation' labels with an F1-score of 0.51 but have wavering results for other classes, with a max F1-score of 0.26 for 'permission' and 0.08 for 'prohibition'.
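To illustrate why span-level labeling can capture more than one modality in a single sentence, here is a minimal rule-based sketch. It is not the study's method: the cue lists, the function name, and the example sentence are assumptions made for illustration only.

```python
import re
from typing import List, Tuple

# Hypothetical modal cue lists, for illustration only; not taken from the study.
CUES = {
    "prohibition": ["shall not", "must not", "may not"],
    "obligation": ["shall", "must"],
    "permission": ["may"],
}


def find_modalities(text: str) -> List[Tuple[int, int, str]]:
    """Return (start, end, label) spans for modal cues found in the text."""
    spans: List[Tuple[int, int, str]] = []
    # Negated cues are matched first so "may not" is not also read as "may".
    for label in ("prohibition", "obligation", "permission"):
        for cue in CUES[label]:
            for m in re.finditer(rf"\b{re.escape(cue)}\b", text, re.IGNORECASE):
                overlaps = any(m.start() < e and s < m.end() for s, e, _ in spans)
                if not overlaps:
                    spans.append((m.start(), m.end(), label))
    return sorted(spans)


sentence = "Member States shall notify the Commission and may not delay publication."
for start, end, label in find_modalities(sentence):
    print(label, repr(sentence[start:end]))
```

A single-label classifier would have to pick one modality for this sentence, whereas the span output keeps both the obligation and the prohibition.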
https://choosealicense.com/licenses/cdla-permissive-2.0/
Overview
This question set was created to evaluate LLMs' ability to perform named entity recognition (NER) in financial regulatory texts. It was developed for a task at the Regulations Challenge @ COLING 2025. The objective is to accurately identify and classify entities, including organizations, legislation, dates, monetary values, and statistics. Financial regulations often require supervising and reporting on specific entities, such as organizations, financial products, and transactions, and… See the full description on the dataset page: https://huggingface.co/datasets/SecureFinAI-Lab/Regulations_NER.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
As part of the study, an annotated corpus of the Uzbek language was created for training and evaluating named entity recognition models. The corpus includes 2,000 sentences (25,865 words) collected from various sources:
• Legislative acts and legal documents: The bulk of the data was extracted from the publicly available lex.uz database, which contains official texts that are highly literate and have a formal language structure.
• News sites: Articles and materials from Uzbek news portals (kun.uz, gazeta.uz) were used, which made it possible to include modern language structures and relevant vocabulary.
• Manually created sentences: To increase the number of named entities in sentences and ensure diversity, the authors wrote sentences containing several entities of different types. This enriched the corpus with complex structures and increased the efficiency of model training.
Data annotation was carried out manually using the BIOES scheme, which provides detailed marking of boundaries and types of named entities. All abstracts were reviewed by Uzbek language experts to ensure accuracy and consistency of data.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The HOME-Alcar (Aligned and Annotated Cartularies) corpus was produced as part of the European research project HOME History of Medieval Europe (https://www.heritageresearch-hub.eu/project/home/), led under the coordination of the Institut de Recherche et d'Histoire des Textes (PI: D. Stutzmann), with the Universitat Politecnica de Valencia (PI: E. Vidal), the National Archives of the Czech Republic in Prague (PI: J. Kreckova), and Teklia SAS (PI: C. Kermorvant).
The HOME-Alcar (Aligned and Annotated Cartularies) corpus is a resource created to train Handwritten Text Recognition (HTR) and Named Entity Recognition (NER) models, and presents a collection of:
(i) digital images of 17 medieval manuscripts;
(ii) scholarly editions thereof;
(iii) coordinates linking images and text at line level;
(iv) annotations of Named Entities (place and person names).
The 17 medieval manuscripts in this corpus are cartularies, i.e. books copying charters and legal acts, produced between the 12th and 14th centuries.