Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Model evaluation results produced in the context of evaluating data augmentation for Named Entity Recognition over the German legal domain.
Detailed information can be found on the GitHub page.
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0) https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
LegalNERo is a manually annotated corpus for named entity recognition in the Romanian legal domain. It provides gold annotations for organizations, locations, persons, time expressions and legal resources mentioned in legal documents. Additionally, it offers GEONAMES codes for the named entities annotated as locations (where a link could be established).
The LegalNERo corpus is available in different formats: span-based, token-based and RDF. The Linguistic Linked Open Data (LLOD) version is provided in RDF-Turtle format.
CONLLUP files conform to the CoNLL-U Plus format (https://universaldependencies.org/ext-format.html). Part-of-speech tagging was performed using UDPipe. Named entity annotations are placed in the column "RELATE:NE" (the 11th column), as defined in the "global.columns" metadata field. Similarly, GEONAMES references are in the column "RELATE:GEONAMES" (the 12th and last column). Automatic processing was performed through the RELATE platform (https://relate.racai.ro).
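As a rough illustration of the column layout described above, the following Python sketch reads a CoNLL-U Plus file by taking the column names from the "global.columns" line and exposing each token as a dictionary. The function name read_conllup, the sample tokens, the tag values (B-ORG, B-LOC) and the GEONAMES value 123456 are all invented for this example; the actual tag scheme used in the corpus files should be checked against the data.

```python
from typing import Dict, Iterable, Iterator, List


def read_conllup(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield sentences as lists of {column name: value} dictionaries."""
    columns: List[str] = []
    sentence: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("# global.columns"):
            # Column names are whitespace-separated after the "=" sign.
            columns = line.split("=", 1)[1].split()
        elif line.startswith("#"):
            continue  # other comment or metadata lines are ignored here
        elif not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
        else:
            sentence.append(dict(zip(columns, line.split("\t"))))
    if sentence:
        yield sentence


# Hypothetical two-token sample; values are placeholders, not corpus data.
HEADER = ("# global.columns = ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL "
          "DEPS MISC RELATE:NE RELATE:GEONAMES")
ROWS = [
    ["1", "Primaria", "primarie", "NOUN", "_", "_", "0", "root", "_", "_", "B-ORG", "_"],
    ["2", "Bucuresti", "Bucuresti", "PROPN", "_", "_", "1", "nmod", "_", "_", "B-LOC", "123456"],
]
SAMPLE = "\n".join([HEADER] + ["\t".join(row) for row in ROWS]) + "\n"

sentences = list(read_conllup(SAMPLE.splitlines()))
for token in sentences[0]:
    print(token["FORM"], token["RELATE:NE"], token["RELATE:GEONAMES"])
```

Reading the column names from the header, rather than hard-coding positions 11 and 12, keeps the sketch usable if a file declares its columns in a different order.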
ANN files conform to BRAT format (https://brat.nlplab.org/).
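For readers unfamiliar with BRAT standoff files, here is a minimal Python sketch that extracts text-bound annotations (lines whose ID starts with "T") and checks whether two annotations overlap, which is the property that distinguishes the "overlap" folder from the others. The names Entity, parse_ann and overlaps, the label names ORG and LOC, and the sample content are assumptions made for illustration; other BRAT line types (relations, events, notes) are simply skipped.

```python
from typing import List, NamedTuple, Tuple


class Entity(NamedTuple):
    id: str
    label: str
    spans: List[Tuple[int, int]]  # one (start, end) pair per fragment
    text: str


def parse_ann(content: str) -> List[Entity]:
    """Parse text-bound annotations from the content of a BRAT .ann file."""
    entities: List[Entity] = []
    for line in content.splitlines():
        if not line.startswith("T"):
            continue  # skip relations, events, attributes, notes
        ann_id, type_and_offsets, text = line.split("\t", 2)
        label, offsets = type_and_offsets.split(" ", 1)
        spans: List[Tuple[int, int]] = []
        # Discontinuous annotations separate their fragments with ";".
        for fragment in offsets.split(";"):
            start, end = fragment.split()
            spans.append((int(start), int(end)))
        entities.append(Entity(ann_id, label, spans, text))
    return entities


def overlaps(a: Entity, b: Entity) -> bool:
    """Return True if any fragment of a shares characters with one of b."""
    return any(s1 < e2 and s2 < e1 for s1, e1 in a.spans for s2, e2 in b.spans)


# Hypothetical annotations over the text "Primaria Bucuresti".
ANN = "T1\tORG 0 18\tPrimaria Bucuresti\nT2\tLOC 9 18\tBucuresti\n"
entities = parse_ann(ANN)
print(entities)
print(overlaps(entities[0], entities[1]))
```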
The archive contains:
ann_LEGAL_PER_LOC_ORG_TIME_overlap Folder in which all the files are in .ann format and contain annotations of: legal resources mentioned, persons, locations, organizations and time. Overlapping annotations of organizations and time entities inside legal references were allowed.
ann_LEGAL_PER_LOC_ORG_TIME Folder in which all the files are in .ann format and contain annotations of: legal resources mentioned, persons, locations, organizations and time. Overlapping annotations were not allowed and only the longest named entities were annotated.
ann_PER_LOC_ORG_TIME Folder in which all the files are in .ann format and contain annotations of: persons, locations, organizations and time. There are no overlapping annotations.
conllup_LEGAL_PER_LOC_ORG_TIME Folder in which all the files are in .conllup format and contain annotations of: legal resources mentioned, persons, locations, organizations and time. Overlapping annotations were not allowed and only the longest named entities were annotated. The annotation of these files was enhanced with GEONAMES codes (where linking was possible).
conllup_PER_LOC_ORG_TIME Folder in which all the files are in .conllup format and contain annotations of: persons, locations, organizations and time. Overlapping annotations were not allowed and only the longest named entities were annotated. The annotation of these files was enhanced with GEONAMES codes (where linking was possible).
rdf Folder containing the corpus in RDF-Turtle format. All the annotations are available here in both span and token format.
text Folder containing the raw texts.
NER System
A NER model generated using the LegalNERo corpus can be used online in the RELATE platform: https://relate.racai.ro/index.php?path=ner/demo
This system was described in: Păiș, Vasile and Mitrofan, Maria and Gasan, Carol Luca and Coneschi, Vlad and Ianov, Alexandru. Named Entity Recognition in the Romanian Legal Domain. In Proceedings of the Natural Legal Language Processing Workshop 2021. Association for Computational Linguistics, Punta Cana, Dominican Republic, pp. 9-18, November 2021.
LICENSING
This work is provided under the license CC BY-NC-ND 4.0 (Attribution-NonCommercial-NoDerivatives 4.0 International). The license can be viewed online here: https://creativecommons.org/licenses/by-nc-nd/4.0/ and the full text here: https://creativecommons.org/licenses/by-nc-nd/4.0/legalcode .
CONTACT
Research Institute for Artificial Intelligence "Mihai Draganescu", Romanian Academy Web: http://www.racai.ro Contact emails: vasile@racai.ro , maria@racai.ro
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Our latest project involved applying Named Entity Recognition (NER) to legal documents.
daishen/legal-ner dataset hosted on Hugging Face and contributed by the HF Datasets community
This deep learning model is used to identify or categorize entities in unstructured text. An entity may refer to a word or a sequence of words such as the name of “Organizations,” “Persons,” “Country,” or “Date” and “Time” in the text. This model detects entities from the given text and classifies them into pre-determined categories.
Named entity recognition (NER) is useful when a high-level overview of a large quantity of text is required. NER surfaces the key information in a text by extracting its main entities. The extracted entities are categorized into pre-determined classes and can support decisions and conclusions drawn from the text.
Using the model
Follow the guide to use the model. Before using this model, ensure that the supported deep learning libraries are installed. For more details, check the Deep Learning Libraries Installer for ArcGIS.
Fine-tuning the model
This model cannot be fine-tuned using ArcGIS tools.
Input
Text files on which named entity extraction will be performed.
Output
Tokens classified into the following pre-defined entity classes:
PERSON – People, including fictional
NORP – Nationalities or religious or political groups
FACILITY – Buildings, airports, highways, bridges, etc.
ORGANIZATION – Companies, agencies, institutions, etc.
GPE – Countries, cities, states
LOCATION – Non-GPE locations, mountain ranges, bodies of water
PRODUCT – Vehicles, weapons, foods, etc. (Not services)
EVENT – Named hurricanes, battles, wars, sports events, etc.
WORK OF ART – Titles of books, songs, etc.
LAW – Named documents made into laws
LANGUAGE – Any named language
DATE – Absolute or relative dates or periods
TIME – Times smaller than a day
PERCENT – Percentage (including “%”)
MONEY – Monetary values, including unit
QUANTITY – Measurements, as of weight or distance
ORDINAL – “first,” “second”
CARDINAL – Numerals that do not fall under another type
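Token-level class labels usually need to be merged into multi-word entities before they are useful. The sketch below shows one common way to do this, assuming the per-token output uses a BIO scheme (B-PERSON, I-PERSON, O); that scheme is an assumption, as the description above does not state how the model encodes entity boundaries. The function name bio_to_spans and the sample sentence are hypothetical.

```python
from typing import List, Optional, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge BIO-tagged tokens into (label, entity text) pairs."""
    spans: List[Tuple[str, str]] = []
    label: Optional[str] = None
    buffer: List[str] = []

    def flush() -> None:
        # Emit the entity collected so far, if any.
        if buffer and label is not None:
            spans.append((label, " ".join(buffer)))

    for token, tag in zip(tokens, tags):
        if tag.startswith("I-") and tag[2:] == label:
            buffer.append(token)  # continue the current entity
        elif tag.startswith(("B-", "I-")):
            # A B- tag, or an I- tag with no matching open entity,
            # starts a new entity.
            flush()
            label, buffer = tag[2:], [token]
        else:
            flush()  # an "O" tag closes any open entity
            label, buffer = None, []
    flush()
    return spans


# Hypothetical tagged sentence using class names from the list above.
tokens = ["Angela", "Merkel", "visited", "Paris", "in", "2015"]
tags = ["B-PERSON", "I-PERSON", "O", "B-GPE", "O", "B-DATE"]
print(bio_to_spans(tokens, tags))
```

Treating a stray I- tag as the start of a new entity is a lenient design choice; a stricter decoder could discard such tokens instead.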
Model architecture
This model uses the XLM-RoBERTa architecture implemented in Hugging Face transformers using the TNER library.
Accuracy metrics
This model has an accuracy of 91.6 percent.
Training data
The model has been trained on the OntoNotes Release 5.0 dataset.
Sample results
Here are a few results from the model.
Citations
Weischedel, Ralph, et al. OntoNotes Release 5.0 LDC2013T19. Web Download. Philadelphia: Linguistic Data Consortium, 2013.
Asahi Ushio and Jose Camacho-Collados. 2021. TNER: An all-round Python library for transformer based named entity recognition. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, pages 53–62, Online. Association for Computational Linguistics.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
LeNER-Br is a dataset for named entity recognition (NER) in Brazilian Legal Text.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A dataset of Austrian court decisions in German, prepared by Christian Sageder from Cybly, in JSON-LD format compliant with the LynxDocument schema (https://lynx-project.eu/doc/lkg/): folder "original_json".
Additionally, named entity annotations (Per, Loc, Org, Misc) produced by the DFKI team with a BERT-based transformer trained on the WikiNER corpus, in N3 RDF notation compliant with NIF 2.1 (https://github.com/NLP2RDF/documentation/blob/f63715b951d03324390edbbd3e84babdf43bc60e/docs/index.rst): folder "ner_annotations_nif".
Additionally, a manually annotated sample covering 9 fine-grained named entity types (folder "manual/manually_annotated"; see the file names for the NE types), and manually verified predictions from a classifier trained on that sample (folder "manual/manually_verified"). Manual work was done by all authors.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
NERCat Dataset
Dataset Summary
The NERCat dataset is a manually annotated collection of Catalan-language television transcriptions, designed to improve Named Entity Recognition (NER) performance for the Catalan language. The dataset covers diverse domains such as politics, sports, and culture, and includes 9,242 sentences with 13,732 named entities annotated across eight categories: Person, Facility, Organization, Location, Product, Event, Date, and Law. The dataset was… See the full description on the dataset page: https://huggingface.co/datasets/Ugiat/ner-cat.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Replication materials for "Power in Text: Implementing Networks and Institutional Complexity in American Law". Contains webscrapers, scraped text, fit NER models, network extraction code, and Bayesian modeling code/results. All data were originally collected in late 2018, so re-scraped data may differ. For details, see comments in individual scripts, as well as the included README file. If at all possible, maintain the original file structure of this repository for easier replication.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction: The landscape of drug-drug interactions (DDIs) has evolved significantly over the past 60 years, necessitating a retrospective analysis to identify research trends and under-explored areas. While methodologies like bibliometric analysis provide valuable quantitative perspectives on DDI research, they have not successfully delineated the complex interrelations between drugs. Understanding these intricate relationships is essential for deciphering the evolving architecture and progressive transformation of DDI research structures over time. We utilize network analysis to unearth the multifaceted relationships between drugs, offering a richer, more nuanced comprehension of shifts in research focus within the DDI landscape.
Methods: This groundbreaking investigation employs natural language processing techniques, specifically Named Entity Recognition (NER) via ScispaCy and the information extraction model SciFive, to extract pharmacokinetic (PK) and pharmacodynamic (PD) DDI evidence from PubMed articles spanning January 1962 to July 2023. It reveals key trends and patterns through an innovative network analysis approach. Static network analysis is deployed to discern structural patterns in DDI research, while evolving network analysis is employed to monitor changes in the DDI research trend structures over time.
Results: Our compelling results shed light on the scale-free characteristics of pharmacokinetic, pharmacodynamic, and their combined networks, exhibiting power law exponent values of 2.5, 2.82, and 2.46, respectively. In these networks, a select few drugs serve as central hubs, engaging in extensive interactions with a multitude of other drugs. Interestingly, the networks conform to a densification power law, illustrating that the number of DDIs grows exponentially as new drugs are added to the DDI network. Notably, we discovered that drugs connected in PK and PD networks predominantly belong to the same categories defined by the Anatomical Therapeutic Chemical (ATC) classification system, with fewer interactions observed between drugs from different categories.
Discussion: The finding suggests that PK and PD DDIs between drugs from different ATC categories have not been studied as extensively as those between drugs within the same categories. By unearthing these hidden patterns, our study paves the way for a deeper understanding of the DDI landscape, providing valuable information for future DDI research, clinical practice, and drug development focus areas.
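To make the densification claim concrete: in the usual sense of the term, a densification power law relates the number of edges E to the number of nodes N in a growing network as E ≈ c · N^a with a > 1, so edges grow superlinearly (as a power of N) as nodes are added. The Python sketch below estimates the exponent a by least-squares regression on log-log values. It assumes this standard definition applies to the study, and it uses synthetic numbers generated for the example, not data from the paper; the function name fit_power_law is hypothetical.

```python
import math
from typing import Sequence, Tuple


def fit_power_law(nodes: Sequence[float], edges: Sequence[float]) -> Tuple[float, float]:
    """Fit edges ~ c * nodes**a by least squares in log-log space.

    Returns (a, c), where a is the densification exponent.
    """
    xs = [math.log(n) for n in nodes]
    ys = [math.log(e) for e in edges]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    a = covariance / variance
    c = math.exp(mean_y - a * mean_x)
    return a, c


# Synthetic network snapshots generated with a = 1.5 and c = 2
# (illustrative values only).
nodes = [10, 20, 40, 80, 160]
edges = [2 * n ** 1.5 for n in nodes]

exponent, constant = fit_power_law(nodes, edges)
print(round(exponent, 3), round(constant, 3))
```

An exponent near 1 would indicate that the average number of interactions per drug stays roughly constant, while an exponent above 1 indicates that the network becomes denser as it grows.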