22 datasets found
  1. legal_NER Dataset

    • paperswithcode.com
    Updated Oct 23, 2023
    Cite
    (2023). legal_NER Dataset [Dataset]. https://paperswithcode.com/dataset/legal-ner
    Dataset updated
    Oct 23, 2023
    Description

    legal_NER is a corpus of 46,545 annotated legal named entities mapped to 14 legal entity types. It is designed for named entity recognition in Indian court judgments.

  2. InLegalNER

    • huggingface.co
    Updated Apr 17, 2024
    Cite
    OpenNyAI (2024). InLegalNER [Dataset]. https://huggingface.co/datasets/opennyaiorg/InLegalNER
    Dataset updated
    Apr 17, 2024
    Dataset authored and provided by
    OpenNyAI
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Dataset for training and evaluating Indian legal named entity recognition models.

      Paper details
    

    Named Entity Recognition in Indian court judgments Arxiv

      Label Scheme
    

    View label scheme (14 labels for 1 components)

    ENTITY        BELONGS TO
    LAWYER        PREAMBLE
    COURT         PREAMBLE, JUDGEMENT
    JUDGE         PREAMBLE, JUDGEMENT
    PETITIONER    PREAMBLE, JUDGEMENT
    RESPONDENT    PREAMBLE, JUDGEMENT
    CASE_NUMBER   JUDGEMENT
    GPE           JUDGEMENT
    DATE          JUDGEMENT
    ORG           JUDGEMENT
    STATUTE       JUDGEMENT

    … See the full description on the dataset page: https://huggingface.co/datasets/opennyaiorg/InLegalNER.

  3. E-NER Dataset

    • paperswithcode.com
    Cite
    Ting Wai Terence Au; Ingemar J. Cox; Vasileios Lampos, E-NER Dataset [Dataset]. https://paperswithcode.com/dataset/e-ner
    Authors
    Ting Wai Terence Au; Ingemar J. Cox; Vasileios Lampos
    Description

    E-NER is a publicly available legal named entity recognition (NER) dataset. It contains 52 filings from the US SEC EDGAR database, with hand-annotated named entity tags.

  4. german-ler

    • huggingface.co
    • opendatalab.com
    Updated Mar 15, 2025
    Cite
    Elena Leitner (2025). german-ler [Dataset]. http://doi.org/10.57967/hf/0046
    Dataset updated
    Mar 15, 2025
    Authors
    Elena Leitner
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A dataset of Legal Documents from German federal court decisions for Named Entity Recognition. The dataset is human-annotated with 19 fine-grained entity classes. The dataset consists of approx. 67,000 sentences and contains 54,000 annotated entities.

  5. legalnero

    • huggingface.co
    Updated Aug 26, 2022
    + more versions
    Cite
    Joel Niklaus (2022). legalnero [Dataset]. https://huggingface.co/datasets/joelniklaus/legalnero
    Dataset updated
    Aug 26, 2022
    Authors
    Joel Niklaus
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    Dataset Card for Romanian Named Entity Recognition in the Legal domain (LegalNERo)

      Dataset Summary
    

    LegalNERo is a manually annotated corpus for named entity recognition in the Romanian legal domain. It provides gold annotations for organizations, locations, persons, time and legal resources mentioned in legal documents. Additionally it offers GEONAMES codes for the named entities annotated as location (where a link could be established).

      Supported Tasks and… See the full description on the dataset page: https://huggingface.co/datasets/joelniklaus/legalnero.
    
  6. greek_legal_ner

    • huggingface.co
    Updated May 31, 2013
    Cite
    Joel Niklaus (2013). greek_legal_ner [Dataset]. https://huggingface.co/datasets/joelniklaus/greek_legal_ner
    Dataset updated
    May 31, 2013
    Authors
    Joel Niklaus
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Dataset Card for Greek Legal Named Entity Recognition

      Dataset Summary
    

    This dataset contains an annotated corpus for named entity recognition in Greek legislation. It is the first of its kind for the Greek language in such an extended form, and one of the few that examine legal text with full-spectrum entity recognition.

      Supported Tasks and Leaderboards
    

    The dataset supports the task of named entity recognition.

      Languages
    

    The language in the dataset… See the full description on the dataset page: https://huggingface.co/datasets/joelniklaus/greek_legal_ner.

  7. Romanian Named Entity Recognition in the Legal domain (LegalNERo)

    • data.niaid.nih.gov
    • zenodo.org
    Updated Aug 26, 2022
    + more versions
    Cite
    Gasan, Carol Luca (2022). Romanian Named Entity Recognition in the Legal domain (LegalNERo) [Dataset]. https://data.niaid.nih.gov/resources?id=ZENODO_4772094
    Dataset updated
    Aug 26, 2022
    Dataset provided by
    Onuț, Andrei
    Ianov, Alexandru
    Păiș, Vasile
    Mitrofan, Maria
    Ghiță, Corvin
    Gasan, Carol Luca
    Coneschi, Vlad Silviu
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    LegalNERo is a manually annotated corpus for named entity recognition in the Romanian legal domain. It provides gold annotations for organizations, locations, persons, time and legal resources mentioned in legal documents. Additionally it offers GEONAMES codes for the named entities annotated as location (where a link could be established).

    The LegalNERo corpus is available in different formats: span-based, token-based and RDF. The Linguistic Linked Open Data (LLOD) version is provided in RDF-Turtle format.

    CONLLUP files conform to the CoNLL-U Plus format (https://universaldependencies.org/ext-format.html). Part-of-speech tagging was performed using UDPipe. Named entity annotations are placed in the column "RELATE:NE" (the 11th column), as defined in the "global.columns" metadata field. Similarly, GEONAMES references are in the column "RELATE:GEONAMES" (the 12th and last column). Automatic processing was performed through the RELATE platform (https://relate.racai.ro).
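
    As a rough illustration of reading such files, the sketch below looks up a column by the name declared in the "global.columns" header rather than by a fixed position. The function name read_conllup, the three-token sample and its B-/I- tag values are invented for this example and are not taken from the corpus.

```python
def read_conllup(lines, column="RELATE:NE"):
    """Yield each sentence as a list of (FORM, value-of-column) pairs."""
    columns = None
    sentence = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("# global.columns ="):
            # The header names every column, in order.
            columns = line.split("=", 1)[1].split()
        elif line.startswith("#"):
            continue  # other comment lines (sentence ids, raw text, ...)
        elif not line.strip():
            if sentence:  # a blank line closes the current sentence
                yield sentence
                sentence = []
        else:
            if columns is None:
                raise ValueError("missing '# global.columns' header")
            row = dict(zip(columns, line.split("\t")))
            sentence.append((row["FORM"], row[column]))
    if sentence:
        yield sentence


# Invented sample: tag values and the GEONAMES placeholder are assumptions.
HEADER = ("# global.columns = ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL "
          "DEPS MISC RELATE:NE RELATE:GEONAMES")
SAMPLE = [
    HEADER,
    "\t".join(["1", "Curtea", "_", "_", "_", "_", "_", "_", "_", "_", "B-ORG", "_"]),
    "\t".join(["2", "de", "_", "_", "_", "_", "_", "_", "_", "_", "I-ORG", "_"]),
    "\t".join(["3", "Conturi", "_", "_", "_", "_", "_", "_", "_", "_", "I-ORG", "_"]),
    "",
]

sentences = list(read_conllup(SAMPLE))
print(sentences)
```

    Passing column="RELATE:GEONAMES" to the same function would return the GEONAMES column instead, since the lookup depends only on the header.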

    ANN files conform to BRAT format (https://brat.nlplab.org/).
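
    For orientation, a minimal sketch of reading the text-bound ("T") lines of a BRAT standoff file follows. The Entity class, the parse_ann helper, the English sample sentence and its offsets are all invented for illustration; the label names LEGAL and LOC are assumptions based on the folder names in this archive.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    id: str
    label: str
    spans: list  # list of (start, end) character offsets, end exclusive
    text: str


def parse_ann(lines):
    """Collect text-bound annotations; other annotation kinds are skipped."""
    entities = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.startswith("T"):
            continue  # relations, attributes, notes and so on
        ann_id, type_and_offsets, text = line.split("\t", 2)
        label, offsets = type_and_offsets.split(" ", 1)
        # Discontinuous annotations separate their fragments with ';'.
        spans = [tuple(int(n) for n in part.split()) for part in offsets.split(";")]
        entities.append(Entity(ann_id, label, spans, text))
    return entities


# Invented sample document and annotations.
TEXT = "Law no. 95/2006 was published in Bucharest."
ANN = [
    "T1\tLEGAL 0 15\tLaw no. 95/2006",
    "T2\tLOC 33 42\tBucharest",
]

entities = parse_ann(ANN)
for e in entities:
    start, end = e.spans[0]
    print(e.label, TEXT[start:end])
```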

    The archive contains:

    • ann_LEGAL_PER_LOC_ORG_TIME_overlap: Folder of .ann files containing annotations of legal resources mentioned, persons, locations, organizations and time. Overlapping annotations of organizations and time entities inside legal references were allowed.

    • ann_LEGAL_PER_LOC_ORG_TIME: Folder of .ann files containing annotations of legal resources mentioned, persons, locations, organizations and time. Overlapping annotations were not allowed and only the longest named entities were annotated.

    • ann_PER_LOC_ORG_TIME: Folder of .ann files containing annotations of persons, locations, organizations and time. There are no overlapping annotations.

    • conllup_LEGAL_PER_LOC_ORG_TIME: Folder of .conllup files containing annotations of legal resources mentioned, persons, locations, organizations and time. Overlapping annotations were not allowed and only the longest named entities were annotated. The annotation of these files was enhanced with GEONAMES codes (where linking was possible).

    • conllup_PER_LOC_ORG_TIME: Folder of .conllup files containing annotations of persons, locations, organizations and time. Overlapping annotations were not allowed and only the longest named entities were annotated. The annotation of these files was enhanced with GEONAMES codes (where linking was possible).

    • rdf: Folder containing the corpus in RDF-Turtle format. All the annotations are available here in both span and token format.

    • text: Folder containing the raw texts.

    NER System

    A NER model generated using the LegalNERo corpus can be used online in the RELATE platform: https://relate.racai.ro/index.php?path=ner/demo

    This system was described in: Păiș, Vasile and Mitrofan, Maria and Gasan, Carol Luca and Coneschi, Vlad and Ianov, Alexandru. Named Entity Recognition in the Romanian Legal Domain. In Proceedings of the Natural Legal Language Processing Workshop 2021. Association for Computational Linguistics, Punta Cana, Dominican Republic, pp. 9-18, November 2021.

    LICENSING

    This work is provided under the license CC BY-NC-ND 4.0 (Attribution-NonCommercial-NoDerivatives 4.0 International). The license can be viewed online here: https://creativecommons.org/licenses/by-nc-nd/4.0/ and the full text here: https://creativecommons.org/licenses/by-nc-nd/4.0/legalcode .

    CONTACT

    Research Institute for Artificial Intelligence "Mihai Draganescu", Romanian Academy Web: http://www.racai.ro Contact emails: vasile@racai.ro , maria@racai.ro

  8. data-augmentation-ner-results

    • explore.openaire.eu
    • data.niaid.nih.gov
    • +1more
    Updated Aug 3, 2022
    Cite
    Robin Erd; Leila Feddoul; Clara Lachenmaier; Marianne Jana Mauch (2022). data-augmentation-ner-results [Dataset]. http://doi.org/10.5281/zenodo.6956508
    Dataset updated
    Aug 3, 2022
    Authors
    Robin Erd; Leila Feddoul; Clara Lachenmaier; Marianne Jana Mauch
    Description

    Model evaluation results produced while evaluating data augmentation for named entity recognition in the German legal domain. Detailed information can be found on the GitHub page.

  9. legal-ner

    • huggingface.co
    + more versions
    Cite
    daishen, legal-ner [Dataset]. https://huggingface.co/datasets/daishen/legal-ner
    Authors
    daishen
    Description

    daishen/legal-ner dataset hosted on Hugging Face and contributed by the HF Datasets community

  10. Indian Court Decision Annotated Corpus.xlsx

    • figshare.com
    xlsx
    Updated Aug 22, 2022
    Cite
    Pooja Harde; Pariskhit Kamat; Suraj Suresh; Shubham Kalson; Sarika Jain; Nandana Mihindukulasooriya (2022). Indian Court Decision Annotated Corpus.xlsx [Dataset]. http://doi.org/10.6084/m9.figshare.19719088.v4
    Available download formats: xlsx
    Dataset updated
    Aug 22, 2022
    Dataset provided by
    figshare
    Authors
    Pooja Harde; Pariskhit Kamat; Suraj Suresh; Shubham Kalson; Sarika Jain; Nandana Mihindukulasooriya
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    India
    Description

    The dataset contains 50 Supreme Court of India decisions annotated for named entity recognition, using three different encoding schemes: IOB, IOBES and BILOU. The dataset was created using the CoNLL-2003 format.
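
    To show how the three encodings relate, here is a small sketch that converts one tag sequence between them. It assumes the IOB2 variant (every entity starts with a B- tag); the function names and the ORG/DATE labels are invented and may not match the labels used in this corpus.

```python
def iob2_to_iobes(tags):
    """Convert IOB2 tags to IOBES by marking entity ends and singletons."""
    out = []
    for i, tag in enumerate(tags):
        if tag == "O":
            out.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = nxt == f"I-{label}"  # does the same entity go on?
        if prefix == "B":
            out.append(f"B-{label}" if continues else f"S-{label}")
        else:  # prefix == "I"
            out.append(f"I-{label}" if continues else f"E-{label}")
    return out


def iobes_to_bilou(tags):
    """BILOU only renames two IOBES prefixes: E becomes L, S becomes U."""
    swap = {"E": "L", "S": "U"}
    return [t if t == "O" else swap.get(t[0], t[0]) + t[1:] for t in tags]


iob = ["B-ORG", "I-ORG", "I-ORG", "O", "B-DATE", "O"]
iobes = iob2_to_iobes(iob)
bilou = iobes_to_bilou(iobes)
print(iobes)
print(bilou)
```

    The three encodings carry the same span information; IOBES and BILOU simply make the last token of an entity and single-token entities explicit.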

  11. lener_br

    • huggingface.co
    Updated Sep 12, 2023
    Cite
    Pedro Henrique Luz de Araujo (2023). lener_br [Dataset]. https://huggingface.co/datasets/peluz/lener_br
    Dataset updated
    Sep 12, 2023
    Authors
    Pedro Henrique Luz de Araujo
    License

    https://choosealicense.com/licenses/unknown/

    Description

    LeNER-Br is a Portuguese-language dataset for named entity recognition applied to legal documents. LeNER-Br consists entirely of manually annotated legislation and legal case texts and contains tags for persons, locations, time entities, organizations, legislation and legal cases. To compose the dataset, 66 legal documents from several Brazilian courts were collected. Courts of superior and state levels were considered, such as Supremo Tribunal Federal, Superior Tribunal de Justiça, Tribunal de Justiça de Minas Gerais and Tribunal de Contas da União. In addition, four legislation documents were collected, such as "Lei Maria da Penha", giving a total of 70 documents.

  12. openthesaurus_dump_20220530

    • zenodo.org
    • data.niaid.nih.gov
    bz2
    Updated May 30, 2023
    Cite
    Robin Erd; Leila Feddoul; Clara Lachenmaier; Marianne Jana Mauch (2023). openthesaurus_dump_20220530 [Dataset]. http://doi.org/10.5281/zenodo.6956563
    Available download formats: bz2
    Dataset updated
    May 30, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Robin Erd; Leila Feddoul; Clara Lachenmaier; Marianne Jana Mauch
    License

    https://www.gnu.org/licenses/old-licenses/lgpl-2.1-standalone.html

    Description

    OpenThesaurus dump version used in evaluating data augmentation for named entity recognition in the German legal domain. Detailed information can be found on the GitHub page.

  13. LeNER-Br

    • opendatalab.com
    • paperswithcode.com
    zip
    Updated Sep 21, 2022
    Cite
    University of Brasilia (2022). LeNER-Br [Dataset]. https://opendatalab.com/OpenDataLab/LeNER-Br
    Available download formats: zip (24756792 bytes)
    Dataset updated
    Sep 21, 2022
    Dataset provided by
    University of Brasilia
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    LeNER-Br is a dataset for named entity recognition (NER) in Brazilian Legal Text.

  14. uk_ner_contracts

    • huggingface.co
    Updated Nov 20, 2023
    Cite
    Law Insider (2023). uk_ner_contracts [Dataset]. https://huggingface.co/datasets/lawinsider/uk_ner_contracts
    Dataset updated
    Nov 20, 2023
    Dataset authored and provided by
    Law Insider
    Description

    Dataset Description

    Legal Contracts Dataset for Training NER Model. This repository contains a specially curated dataset consisting of legal contracts. It is designed for training a Named Entity Recognition (NER) model to recognize and classify four types of entities in the text: Contract Type, Clause Title, Clause Number and Definition Title. The dataset includes a broad variety of legal contracts, covering diverse domains such as employment, real estate… See the full description on the dataset page: https://huggingface.co/datasets/lawinsider/uk_ner_contracts.

  15. mapa

    • huggingface.co
    • opendatalab.com
    Updated Apr 9, 2023
    Cite
    Joel Niklaus (2023). mapa [Dataset]. https://huggingface.co/datasets/joelniklaus/mapa
    Dataset updated
    Apr 9, 2023
    Authors
    Joel Niklaus
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Card for Multilingual European Datasets for Sensitive Entity Detection in the Legal Domain

      Dataset Summary
    

    The dataset consists of 12 documents (9 for Spanish due to parsing errors) taken from EUR-Lex, a multilingual corpus of court decisions and legal dispositions in the 24 official languages of the European Union. The documents have been annotated for named entities following the guidelines of the MAPA project, which foresees two annotation levels, a general and a… See the full description on the dataset page: https://huggingface.co/datasets/joelniklaus/mapa.

  16. Romanian micro-blogging named entity recognition (MicroBloggingNERo)

    • data.niaid.nih.gov
    Updated Jul 27, 2022
    Cite
    Florea, Bianca (2022). Romanian micro-blogging named entity recognition (MicroBloggingNERo) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6905234
    Dataset updated
    Jul 27, 2022
    Dataset provided by
    Badila, Ana
    Barbu-Mititelu, Verginica
    Dicusar, Maria
    Florea, Bianca
    Marin, Laura
    Irimia, Elena
    Micu, Roxana
    Păiș, Vasile
    Mitrofan, Maria
    Gasan, Carol Luca
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    MicroBloggingNERo is a manually annotated corpus for named entity recognition in Romanian micro-blogging texts. It provides gold annotations for organizations, locations, persons, time expressions, legal references, medical devices, chemicals, anatomical parts and disorders found in micro-blogging texts. The text was anonymized, by replacing all URLs with , user references with , person names, specific locations and organizations with new randomized names. Anonymization was realized in the same way, regardless of the micro-blogging platform specific format.

    Since names were replaced with new random ones, any resemblance to real individuals is by pure chance of the random names generator. No real person is depicted in the included messages.

    DATA

    The MicroBloggingNERo corpus is available in different formats: text, span-based, and token-based.

    Text files are in the folder "text" with .txt extension, in UTF-8 encoding.

    Span-based annotations are given in BRAT (https://brat.nlplab.org/) ann format. These annotations can be found in folders starting with "ann_".

    Token-based annotations are given in CONLLUP files, following the CoNLL-U Plus format (https://universaldependencies.org/ext-format.html). Part-of-speech tagging was performed using UDPipe. Named entity annotations are placed in the column "RELATE:NE" (the 11th column), as defined in the "global.columns" metadata field. Automatic processing was performed through the RELATE platform (https://relate.racai.ro).

    The archive contains:

    • ann_EVERYTHING: Folder of .ann files containing annotations of legal references, persons, locations, organizations, time, chemicals, medical devices, anatomical parts and disorders. Overlapping annotations of organizations and time entities inside legal references were allowed.

    • ann_EVERYTHING_LARGEST_SPAN: Folder of .ann files containing annotations of legal references, persons, locations, organizations, time, chemicals, medical devices, anatomical parts and disorders. Overlapping annotations were not allowed and only the longest named entities were annotated. This primarily affects the legal references class.

    • ann_LEGAL_PER_LOC_ORG_TIME: Folder of .ann files containing annotations of legal references, persons, locations, organizations and time. There are no overlapping annotations.

    • ann_PER_LOC_ORG_TIME: Folder of .ann files containing annotations of persons, locations, organizations and time. There are no overlapping annotations.

    • ann_BIOMEDICAL: Folder of .ann files containing annotations of medical devices, chemicals, anatomical parts and disorders. There are no overlapping annotations.

    • conllup_EVERYTHING_LARGEST_SPAN: Folder of .conllup files containing annotations of legal references, persons, locations, organizations, time, chemicals, medical devices, anatomical parts and disorders. There are no overlapping annotations.

    • conllup_LEGAL_PER_LOC_ORG_TIME: Folder of .conllup files containing annotations of legal references, persons, locations, organizations and time. Overlapping annotations were not allowed and only the longest named entities were annotated.

    • conllup_PER_LOC_ORG_TIME: Folder of .conllup files containing annotations of persons, locations, organizations and time. Overlapping annotations were not allowed and only the longest named entities were annotated.

    • conllup_BIOMEDICAL: Folder of .conllup files containing annotations of medical devices, chemicals, anatomical parts and disorders. Overlapping annotations were not allowed and only the longest named entities were annotated.

    • text: Folder containing the raw texts.

    • splits.tsv: Proposed splits into train, test and valid sets, following a 70-15-15% distribution for each entity class, based on the ann_EVERYTHING_LARGEST_SPAN folder.

    LICENSING

    This work is provided under the license CC BY-NC-ND 4.0 (Attribution-NonCommercial-NoDerivatives 4.0 International). The license can be viewed online here: https://creativecommons.org/licenses/by-nc-nd/4.0/ and the full text here: https://creativecommons.org/licenses/by-nc-nd/4.0/legalcode .

    CONTACT

    Research Institute for Artificial Intelligence "Mihai Draganescu", Romanian Academy Web: http://www.racai.ro Contact emails: vasile@racai.ro , maria@racai.ro , vergi@racai.ro , elena@racai.ro

  17. Data from: To NER or not to NER? A case study of low-resource deontic...

    • dataverse.nl
    Updated Dec 17, 2024
    Cite
    Shashank Chakravarthy; Gijs van Dijck; Anna Wilbik (2024). To NER or not to NER? A case study of low-resource deontic modalities in EU legislation? [Dataset]. http://doi.org/10.34894/D9AKUS
    Available download formats: application/x-ipynb+json (343282), csv (54271), application/x-ipynb+json (220670), csv (688377), application/x-ipynb+json (220604)
    Dataset updated
    Dec 17, 2024
    Dataset provided by
    DataverseNL
    Authors
    Shashank Chakravarthy; Gijs van Dijck; Anna Wilbik
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    European Union
    Description

    Deontic modality (obligation, permission, prohibition) in legal documents can convey critical information, and identification of deontic modalities is often performed using Natural Language Processing (NLP) techniques as a 'Deontic Modality Classification' (DMC) text classification task. As deontic modalities in legal text are not mutually exclusive, a key challenge with DMC is that it classifies the provided text into a single modality while in reality it might have multiple deontic modalities. To address this, this study analyzes the feasibility of performing deontic modality identification as a Named Entity Recognition (NER) task over DMC task approaches in a low-resource data setting with EU legislation. Low-resource NLP approaches can offer solutions to tackle the problem of scarce data. In this paper, we use a rule-based approach with modal verbs and a Decision Tree classifier for the DMC task. For NER, we utilize Conditional Random Fields (CRFs) in a low-resource setting and report on the reliability and precision for identification of deontic modality. Our experiments reveal that simpler models, like decision trees, outperform larger models in the low-resource setting of DMC, obtaining a macro-F1 score of 0.83. For the NER task, the CRF models show consistent performance for 'obligation' labels with an F1-score of 0.51 but have wavering results for other classes, with a max F1-score of 0.26 for 'permission' and 0.08 for 'prohibition'.
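
    For readers unfamiliar with the metric quoted above, macro-F1 is the unweighted mean of the per-class F1 scores, so rare classes count as much as frequent ones. The sketch below computes it for a single-label classification setting; the function name macro_f1 and the six toy gold and predicted labels are invented and unrelated to the study's data or results.

```python
def macro_f1(gold, pred, labels):
    """Average the per-class F1 scores with equal weight per class."""
    scores = []
    for label in labels:
        pairs = list(zip(gold, pred))
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores)


LABELS = ["obligation", "permission", "prohibition"]
# Toy data, invented for this example only.
gold = ["obligation", "obligation", "permission", "prohibition", "permission", "obligation"]
pred = ["obligation", "permission", "permission", "prohibition", "obligation", "obligation"]

score = macro_f1(gold, pred, LABELS)
print(round(score, 4))
```

    In this toy case the per-class F1 values are 2/3, 1/2 and 1, so the macro average is 13/18.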

  18. Regulations_NER

    • huggingface.co
    Cite
    SecureFinAI Lab, Regulations_NER [Dataset]. https://huggingface.co/datasets/SecureFinAI-Lab/Regulations_NER
    Dataset authored and provided by
    SecureFinAI Lab
    License

    https://choosealicense.com/licenses/cdla-permissive-2.0/

    Description

    Overview

    This question set was created to evaluate LLMs' ability to perform named entity recognition (NER) in financial regulatory texts. It was developed for a task at Regulations Challenge @ COLING 2025. The objective is to accurately identify and classify entities, including organizations, legislation, dates, monetary values, and statistics. Financial regulations often require supervising and reporting on specific entities, such as organizations, financial products, and transactions, and… See the full description on the dataset page: https://huggingface.co/datasets/SecureFinAI-Lab/Regulations_NER.

  19. Dataset of Named Entity Recognition for Uzbek language

    • data.mendeley.com
    Updated Oct 22, 2024
    + more versions
    Cite
    Davlatyor Mengliev (2024). Dataset of Named Entity Recognition for Uzbek language [Dataset]. http://doi.org/10.17632/6hphxr74rp.2
    Dataset updated
    Oct 22, 2024
    Authors
    Davlatyor Mengliev
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    As part of the study, an annotated corpus of the Uzbek language was created for training and evaluating named entity recognition models. The corpus includes 2,000 sentences (25,865 words) collected from various sources:

    • Legislative acts and legal documents: The bulk of the data was extracted from the publicly available lex.uz database, which contains official texts that are highly literate and have a formal language structure.

    • News sites: Articles and materials from Uzbek news portals (kun.uz, gazeta.uz) were used, which made it possible to include modern language structures and relevant vocabulary.

    • Manually created sentences: To increase the number of named entities in sentences and ensure diversity, author's sentences were developed containing several entities of different types. This enriched the corpus with complex structures and increased the efficiency of model training.

    Data annotation was carried out manually using the BIOES scheme, which provides detailed marking of boundaries and types of named entities. All abstracts were reviewed by Uzbek language experts to ensure accuracy and consistency of data.
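
    As a sketch of how BIOES tags mark entity boundaries, the function below decodes a tag sequence into (label, start, end) token spans, with end exclusive. The function name bioes_to_spans and the ORG/DATE labels are invented for illustration; the corpus's actual label set is not described here. Sequences that break the scheme, such as an E- tag with no preceding B- tag, are silently dropped in this sketch.

```python
def bioes_to_spans(tags):
    """Decode BIOES tags into (label, start, end) spans, end exclusive."""
    spans = []
    open_entity = None  # (label, start index) of the entity being read
    for i, tag in enumerate(tags):
        if tag == "O":
            open_entity = None
            continue
        prefix, label = tag.split("-", 1)
        if prefix == "S":  # single-token entity
            spans.append((label, i, i + 1))
            open_entity = None
        elif prefix == "B":  # start of a multi-token entity
            open_entity = (label, i)
        elif prefix == "I" and open_entity and open_entity[0] == label:
            continue  # still inside the same entity
        elif prefix == "E" and open_entity and open_entity[0] == label:
            spans.append((label, open_entity[1], i + 1))
            open_entity = None
        else:
            open_entity = None  # malformed sequence, discard
    return spans


tags = ["B-ORG", "I-ORG", "E-ORG", "O", "S-DATE"]
spans = bioes_to_spans(tags)
print(spans)
```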

  20. HOME-Alcar: Aligned and Annotated Cartularies

    • zenodo.org
    • explore.openaire.eu
    • +1more
    bin, json, pdf, zip
    Updated Jul 17, 2024
    Cite
    Dominique Stutzmann; Sergio Torres Aguilar; Paul Chaffenet (2024). HOME-Alcar: Aligned and Annotated Cartularies [Dataset]. http://doi.org/10.5281/zenodo.5600884
    Available download formats: pdf, zip, json, bin
    Dataset updated
    Jul 17, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Dominique Stutzmann; Sergio Torres Aguilar; Paul Chaffenet
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The HOME-Alcar (Aligned and Annotated Cartularies) corpus was produced as part of the European research project HOME History of Medieval Europe (https://www.heritageresearch-hub.eu/project/home/), under the coordination of the Institut de Recherche et d'Histoire des Textes (PI: D. Stutzmann), with the Universitat Politecnica de Valencia (PI: E. Vidal), the National Archives of the Czech Republic in Prague (PI: J. Kreckova), and Teklia SAS (PI: C. Kermorvant).
    The HOME-Alcar (Aligned and Annotated Cartularies) corpus is a resource created to train Handwritten Text Recognition (HTR) and Named Entity Recognition (NER), and presents a collection of
    (i) digital images of 17 medieval manuscripts;
    (ii) scholarly editions thereof;
    (iii) coordinates linking images and text at line level;
    (iv) annotations of Named Entities (place and person names).
    The 17 medieval manuscripts in this corpus are cartularies, i.e. books copying charters and legal acts, produced between the 12th and 14th centuries.
