This dataset was created by Pratik Pujari
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset for training and evaluating an Indian legal named entity recognition model.
Paper details
Named Entity Recognition in Indian court judgments (arXiv)
Label Scheme
View label scheme (14 labels for 1 components)
ENTITY         BELONGS TO
LAWYER         PREAMBLE
COURT          PREAMBLE, JUDGEMENT
JUDGE          PREAMBLE, JUDGEMENT
PETITIONER     PREAMBLE, JUDGEMENT
RESPONDENT     PREAMBLE, JUDGEMENT
CASE_NUMBER    JUDGEMENT
GPE            JUDGEMENT
DATE           JUDGEMENT
ORG            JUDGEMENT
STATUTE        JUDGEMENT
… See the full description on the dataset page: https://huggingface.co/datasets/opennyaiorg/InLegalNER.
https://creativecommons.org/publicdomain/zero/1.0/
By lener_br (from Hugging Face)
LeNER-Br is a comprehensive dataset specifically created for named entity recognition (NER) in the Portuguese language, particularly within the domain of legal documents. This dataset consists of manually annotated texts extracted from legislation and legal cases. Each text has undergone meticulous tagging to identify various types of named entities, including persons, locations, time entities, organizations, legislation references, and legal case references.
To curate this dataset, a total of 66 legal documents were collected from diverse Brazilian Courts encompassing both superior and state levels. Prominent courts such as the Supremo Tribunal Federal, Superior Tribunal de Justiça, Tribunal de Justiça de Minas Gerais, and Tribunal de Contas da União contributed to this collection. Additionally, four significant legislation documents like Lei Maria da Penha were also included to ensure a comprehensive representation. In total, 70 unique documents form part of this extensive dataset.
The primary purpose of LeNER-Br is to facilitate the development and evaluation of NER models specifically tailored for Portuguese legal text analysis. The labeled data provided in this dataset enables researchers and data scientists to train their NER models effectively by leveraging insights from varied legal contexts present in Brazil's jurisdiction system.
Each instance of annotated text has two columns. The tokens column holds the individual words or tokens found in the original text. The ner_tags column holds the NER tag assigned to each token, which specifies its entity type, whether a person's name, an organization's name, or any other category relevant to legal text.
Researchers may use LeNER-Br as a benchmark test set for evaluating the performance of their own NER models designed for Portuguese legal documents. Moreover, the tokens column is accompanied by the ner_tags column, which contains the NER tag assigned to each token.
In conclusion, the LeNER-Br dataset is an invaluable resource for advancing NER techniques within the Portuguese language, particularly within the legal domain. It provides a high-quality, manually annotated collection of legal texts chosen to represent Brazil's legislative landscape and the entities involved. This dataset serves as a strong foundation for training and evaluating NER models and facilitates advancements in information extraction from Portuguese legal documents.
The LeNER-Br dataset is a valuable resource for researchers and practitioners working on named entity recognition (NER) in the context of Portuguese legal documents. This guide will provide you with an overview of the dataset and how to effectively utilize it for your NER tasks.
Dataset Overview
LeNER-Br is composed of 70 manually annotated legal documents written in Portuguese. These documents were collected from various Brazilian Courts, including superior and state levels such as the Supremo Tribunal Federal, Superior Tribunal de Justiça, Tribunal de Justiça de Minas Gerais, and Tribunal de Contas da União. The dataset also includes four legislation documents, such as Lei Maria da Penha.
The dataset provides tags for different types of named entities commonly found in legal texts. These named entity types include persons, locations, time entities, organizations, legislations, and legal cases. Additionally, there are two main columns in the dataset that you should pay attention to:
- tokens: This column contains individual words or tokens present in the text of the legal documents.
- ner_tags: This column contains named entity recognition (NER) tags assigned to each token in the text. These tags indicate the type of named entity that each token represents.
Utilizing the Dataset
Here are some steps you can follow to make effective use of this dataset:
Data Exploration: Start by loading and exploring the data using your preferred programming language or data analysis tools like Python's pandas library.
- Load the train.csv file for training your NER models with manually annotated texts.
- Utilize the test.csv file as a test set for evaluating model performance.
- Use the validation.csv file for additional validation during model development.
Preprocessing:
- Perform necessary preprocess...
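As a minimal sketch of the data exploration step, the snippet below reads rows with the tokens and ner_tags columns and counts how often each tag occurs. The in-memory sample is invented for illustration, and the code assumes that each column is stored as a stringified Python list with integer tag ids; the actual CSV files may encode these columns differently, in which case the parsing needs to be adjusted.

```python
import ast
import csv
import io
from collections import Counter

# Invented two-row sample standing in for train.csv. The encoding of the
# list columns (stringified Python lists, integer tag ids) is an assumption.
SAMPLE_CSV = '''tokens,ner_tags
"['Lei', 'Maria', 'da', 'Penha']","[1, 2, 2, 2]"
"['Brasília', ',', '2006']","[3, 0, 4]"
'''


def read_rows(handle):
    """Yield (tokens, tags) pairs, parsing the stringified list columns."""
    for row in csv.DictReader(handle):
        tokens = ast.literal_eval(row["tokens"])
        tags = ast.literal_eval(row["ner_tags"])
        # Every token must carry exactly one tag.
        if len(tokens) != len(tags):
            raise ValueError("tokens and ner_tags differ in length")
        yield tokens, tags


# For the real file, replace io.StringIO(...) with open("train.csv", newline="").
rows = list(read_rows(io.StringIO(SAMPLE_CSV)))
tag_counts = Counter(tag for _, tags in rows for tag in tags)
```

Checking the length of both columns per row is a cheap way to catch parsing problems before training.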
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Card for "German LER"
Dataset Summary
A dataset of Legal Documents from German federal court decisions for Named Entity Recognition. The dataset is human-annotated with 19 fine-grained entity classes. The dataset consists of approx. 67,000 sentences and contains 54,000 annotated entities. NER tags use the BIO tagging scheme. The dataset includes two different versions of annotations, one with a set of 19 fine-grained semantic classes (ner_tags) and another one… See the full description on the dataset page: https://huggingface.co/datasets/elenanereiss/german-ler.
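To illustrate the BIO tagging scheme mentioned above, here is a minimal sketch that groups a per-token tag sequence into entity spans. The label names in the example (PER, LOC, ORG) are placeholders rather than the dataset's actual class names, and treating a stray I- tag as the start of a new span is one common convention, not something the dataset card prescribes.

```python
def bio_to_spans(tags):
    """Convert BIO tags to (label, start, end) spans; end is exclusive."""
    spans = []
    label, start = None, None
    for i, tag in enumerate(tags):
        # An I- tag continues the open span only if the label matches.
        if tag.startswith("I-") and label == tag[2:]:
            continue
        # Anything else closes the span that is currently open, if any.
        if label is not None:
            spans.append((label, start, i))
            label, start = None, None
        if tag.startswith("B-") or tag.startswith("I-"):
            # A stray I- tag is treated as the start of a new span.
            label, start = tag[2:], i
    if label is not None:
        spans.append((label, start, len(tags)))
    return spans


example_spans = bio_to_spans(["B-PER", "I-PER", "O", "B-LOC", "B-LOC", "I-ORG"])
```

Here the first two tokens form one PER span, the two adjacent B-LOC tags yield two separate LOC spans, and the mismatched I-ORG opens its own span.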
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Arabic Legal Dataset - Legal Named Entity Recognition
Dataset Description
Named entity recognition dataset for Arabic legal texts with specialized legal entity types and relationships. This dataset contains 1,046 examples of NER data derived from Egyptian legal texts, including criminal law, civil law, procedural law, and personal status law. The dataset is designed for training and evaluating Arabic legal AI models.
Dataset Summary
Language: Arabic (Egyptian… See the full description on the dataset page: https://huggingface.co/datasets/fr3on/eg-legal-ner.
This dataset was created by Shivam Kumar
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Model evaluation results produced in the context of evaluating data augmentation for Named Entity Recognition over the German legal domain.
Detailed information can be found on the GitHub page.
https://creativecommons.org/publicdomain/zero/1.0/
It is a collection of NER datasets in five languages:
- German (https://github.com/elenanereiss/Legal-Entity-Recognition)
- Greek (https://github.com/nmpartzio/elNER)
- Japanese (https://github.com/stockmarkteam/ner-wikipedia-dataset)
- Russian (https://github.com/dialogue-evaluation/factRuEval-2016/)
- Turkish (https://data.mendeley.com/datasets/cdcztymf4k/1)
The annotation was adapted to the OntoNotes standard and converted to IOB format. The main purpose of this dataset is the evaluation of XLM models.
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
LegalNERo is a manually annotated corpus for named entity recognition in the Romanian legal domain. It provides gold annotations for organizations, locations, persons, time expressions, and legal resources mentioned in legal documents. Additionally, it offers GEONAMES codes for the named entities annotated as locations (where a link could be established).
The LegalNERo corpus is available in different formats: span-based, token-based and RDF. The Linguistic Linked Open Data (LLOD) version is provided in RDF-Turtle format.
CONLLUP files conform to the CoNLL-U Plus format (https://universaldependencies.org/ext-format.html). Part-of-speech tagging was performed using UDPipe. Named entity annotations are placed in the column "RELATE:NE" (the 11th column), as defined in the "global.columns" metadata field. Similarly, GEONAMES references are in the column "RELATE:GEONAMES" (the 12th and last column). Automatic processing was performed through the RELATE platform (https://relate.racai.ro).
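The column layout described above can be read with a few lines of plain Python. This is a minimal sketch, assuming tab-separated token lines, blank lines between sentences, and a "# global.columns" header whose first ten names are the standard CoNLL-U columns; the sample tokens and their tag values are invented, not taken from the corpus.

```python
def read_conllup(lines, wanted=("FORM", "RELATE:NE")):
    """Return (columns, sentences); each sentence is a list of tuples
    holding the values of the `wanted` columns for one token."""
    columns = None
    sentence, sentences = [], []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("# global.columns"):
            # Column names are whitespace-separated after the "=" sign.
            columns = line.split("=", 1)[1].split()
        elif line.startswith("#"):
            continue  # other comment or metadata lines
        elif not line.strip():
            if sentence:  # a blank line ends the current sentence
                sentences.append(sentence)
                sentence = []
        else:
            if columns is None:
                raise ValueError("missing global.columns header")
            fields = dict(zip(columns, line.split("\t")))
            sentence.append(tuple(fields[name] for name in wanted))
    if sentence:
        sentences.append(sentence)
    return columns, sentences


# Invented sample; the header column order is an assumption.
HEADER = ("# global.columns = ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL "
          "DEPS MISC RELATE:NE RELATE:GEONAMES")
SAMPLE = [
    HEADER,
    "# sent_id = 1",
    "\t".join(["1", "Guvernul", "guvern", "NOUN"] + ["_"] * 6 + ["B-ORG", "_"]),
    "\t".join(["2", "României", "România", "PROPN"] + ["_"] * 6 + ["I-ORG", "_"]),
    "",
]
columns, sentences = read_conllup(SAMPLE)
```

Looking columns up by name rather than by fixed position keeps the reader working if a file declares its columns in a different order.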
ANN files conform to BRAT format (https://brat.nlplab.org/).
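For readers unfamiliar with BRAT standoff files, the following minimal sketch extracts text-bound annotations (lines whose id starts with "T") from .ann content. The sample lines and label names are invented for illustration; other annotation kinds such as relations or notes are simply skipped.

```python
def parse_ann(lines):
    """Parse BRAT text-bound annotations into dictionaries."""
    entities = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.startswith("T"):
            continue  # skip relations, attributes, notes, blank lines
        # Format: ID <TAB> LABEL START END[;START END...] <TAB> TEXT
        ann_id, type_and_offsets, text = line.split("\t", 2)
        label, offsets = type_and_offsets.split(" ", 1)
        # Discontinuous annotations list several fragments separated by ";".
        fragments = [tuple(int(n) for n in frag.split())
                     for frag in offsets.split(";")]
        entities.append({"id": ann_id, "label": label,
                         "fragments": fragments, "text": text})
    return entities


# Invented sample lines; the label names are placeholders.
SAMPLE_ANN = [
    "T1\tORG 0 8\tGuvernul",
    "#1\tAnnotatorNotes T1\tan example note",
    "T2\tLEGAL 10 15;20 25\tLegea nr. 1",
]
entities = parse_ann(SAMPLE_ANN)
```

Offsets are character positions into the matching raw text file, with the end offset exclusive.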
The archive contains:
- ann_LEGAL_PER_LOC_ORG_TIME_overlap: Folder in which all the files are in .ann format and contain annotations of legal resources mentioned, persons, locations, organizations and time. Overlapping annotations of organizations and time entities inside legal references were allowed.
- ann_LEGAL_PER_LOC_ORG_TIME: Folder in which all the files are in .ann format and contain annotations of legal resources mentioned, persons, locations, organizations and time. Overlapping annotations were not allowed and only the longest named entities were annotated.
- ann_PER_LOC_ORG_TIME: Folder in which all the files are in .ann format and contain annotations of persons, locations, organizations and time. There are no overlapping annotations.
- conllup_LEGAL_PER_LOC_ORG_TIME: Folder in which all the files are in .conllup format and contain annotations of legal resources mentioned, persons, locations, organizations and time. Overlapping annotations were not allowed and only the longest named entities were annotated. The annotation of these files was enhanced with GEONAMES codes (where linking was possible).
- conllup_PER_LOC_ORG_TIME: Folder in which all the files are in .conllup format and contain annotations of persons, locations, organizations and time. Overlapping annotations were not allowed and only the longest named entities were annotated. The annotation of these files was enhanced with GEONAMES codes (where linking was possible).
- rdf: Folder containing the corpus in RDF-Turtle format. All the annotations are available here in both span and token format.
- text: Folder containing the raw texts.
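The difference between the overlapping and the longest-only folders can be pictured with a small sketch. The greedy function below keeps the longest spans and drops anything that overlaps an already kept span. It is only an illustration of the idea under that assumption, not the procedure the corpus authors used, and the sample spans are invented.

```python
def keep_longest(spans):
    """Select non-overlapping spans, preferring longer ones.

    spans: iterable of (start, end, label) with end exclusive.
    """
    chosen = []
    # Visit spans from longest to shortest; ties go to the earlier start.
    for span in sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0])):
        start, end = span[0], span[1]
        # Keep the span only if it is disjoint from every kept span.
        if all(end <= c[0] or start >= c[1] for c in chosen):
            chosen.append(span)
    return sorted(chosen)


# Invented example: an ORG and a TIME nested inside a legal reference.
overlapping = [(0, 30, "LEGAL"), (5, 12, "ORG"), (20, 24, "TIME"), (40, 45, "PER")]
longest_only = keep_longest(overlapping)
```

In this example the nested ORG and TIME spans are dropped because they fall inside the longer LEGAL span, while the separate PER span is kept.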
NER System
A NER model generated using the LegalNERo corpus can be used online in the RELATE platform: https://relate.racai.ro/index.php?path=ner/demo
This system was described in: Păiș, Vasile; Mitrofan, Maria; Gasan, Carol Luca; Coneschi, Vlad; Ianov, Alexandru. Named Entity Recognition in the Romanian Legal Domain. In Proceedings of the Natural Legal Language Processing Workshop 2021, pp. 9-18. Association for Computational Linguistics, Punta Cana, Dominican Republic, November 2021.
LICENSING
This work is provided under the license CC BY-NC-ND 4.0 (Attribution-NonCommercial-NoDerivatives 4.0 International). The license can be viewed online here: https://creativecommons.org/licenses/by-nc-nd/4.0/ and the full text here: https://creativecommons.org/licenses/by-nc-nd/4.0/legalcode
CONTACT
Research Institute for Artificial Intelligence "Mihai Draganescu", Romanian Academy
Web: http://www.racai.ro
Contact emails: vasile@racai.ro, maria@racai.ro
hf-tuner/indian-legal-ner dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
The NKP_Legal_Cases dataset (Raw_Legal_Cases.json and Para_Legal_Cases.json) is a curated collection of Nepali legal case texts sourced from publicly available Nepali court documents. It was prepared for legal NLP, text summarization, and the evaluation of multilingual LLMs.
The dataset aims to support research in:
- Legal document summarization
- Named entity recognition (NER)
- Legal information retrieval
- Document classification
- Multilingual NLP for low-resource languages (specifically Nepali)
This dataset addresses the significant gap in publicly accessible legal datasets for Nepali, enabling researchers, students, and practitioners to explore legal AI applications in low-resource contexts.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset was created by Ishaan Bhattacharjee
Released under MIT
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Homepage: https://www.darrow.ai/
Repository: https://github.com/darrow-labs/LegalLens
Paper: https://arxiv.org/pdf/2402.04335.pdf
Point of Contact: Dor Bernsohn, Gil Semo
Overview
LegalLensNER is a dedicated dataset designed for Named Entity Recognition (NER) in the legal domain, with a specific emphasis on detecting legal violations in unstructured texts.
Data Fields
- id: (int) A unique identifier for each record.
- word: (str) The specific word or token in the… See the full description on the dataset page: https://huggingface.co/datasets/darrow-ai/LegalLensNER.
daishen/legal-ner dataset hosted on Hugging Face and contributed by the HF Datasets community
Dataset Description
Legal Contracts Dataset for Training spaCy NER Model
This repository contains a specially curated dataset consisting of legal contracts. It is designed for training a Named Entity Recognition (NER) model using spaCy, with the aim of recognizing and classifying four types of entities in the text: Contract Type, Clause Title, Clause Number, and Definition Title. The dataset includes a broad variety of legal contracts, covering diverse domains such as… See the full description on the dataset page: https://huggingface.co/datasets/lawinsider/uk_ner_contracts_spacy.
Replication materials for "Power in Text: Implementing Networks and Institutional Complexity in American Law". Contains webscrapers, scraped text, fit NER models, network extraction code, and Bayesian modeling code/results. All data were originally collected in late 2018, so re-scraped data may differ. For details, see comments in individual scripts, as well as the included README file. If at all possible, maintain the original file structure of this repository for easier replication.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Dataset Card for Greek Legal Named Entity Recognition
Dataset Summary
This dataset contains an annotated corpus for named entity recognition in Greek legislation. It is the first of its kind for the Greek language in such an extended form and one of the few that examine legal text with a full spectrum of entity types.
Supported Tasks and Leaderboards
The dataset supports the task of named entity recognition.
Languages
The language in the dataset… See the full description on the dataset page: https://huggingface.co/datasets/joelniklaus/greek_legal_ner.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The HOME-Alcar (Aligned and Annotated Cartularies) corpus was produced as part of the European research project HOME History of Medieval Europe (https://www.heritageresearch-hub.eu/project/home/), led under the coordination of the Institut de Recherche et d'Histoire des Textes (PI: D. Stutzmann), with the Universitat Politecnica de Valencia (PI: E. Vidal), the National Archives of the Czech Republic in Prague (PI: J. Kreckova), and Teklia SAS (PI: C. Kermorvant).
The HOME-Alcar (Aligned and Annotated Cartularies) corpus is a resource created to train Handwritten Text Recognition (HTR) and Named Entity Recognition (NER), and presents a collection of
(i) digital images of 17 medieval manuscripts;
(ii) scholarly editions thereof;
(iii) coordinates linking images and text at line level;
(iv) annotations of Named Entities (place and person names).
The 17 medieval manuscripts in this corpus are cartularies, i.e. books copying charters and legal acts, produced between the 12th and 14th centuries.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Arabic Legal Dataset - Multi-Task Legal Learning
Dataset Description
Multi-task learning dataset combining classification, QA, NER, and summarization tasks in a unified format. This dataset contains 1,046 examples of multi_task data derived from Egyptian legal texts, including criminal law, civil law, procedural law, and personal status law. The dataset is designed for training and evaluating Arabic legal AI models.
Dataset Summary
Language: Arabic (Egyptian… See the full description on the dataset page: https://huggingface.co/datasets/fr3on/eg-legal-multi-task.
This dataset was created by Pratik Pujari