100+ datasets found
  1. Corpora of legal text

    • live.european-language-grid.eu
    • data.europa.eu
    xml
    Updated Aug 30, 2022
    Cite
    (2022). Corpora of legal text [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/18873
    Available download formats: xml
    Dataset updated: Aug 30, 2022
    License: U.S. Government Works (https://www.usa.gov/government-works)
    License information was derived automatically

    Description

    Corpora of legal texts supplied by Dr. Claudia Foti. The texts include:

    1) Report of the Special Rapporteur on violence against women, its causes and consequences, Ms. Rashida Manjoo
    2) General Recommendation
    3) Third Evaluation Round Evaluation Report on Italy Incriminations (ETS 173 and 191, GPC 2)
    4) Law No. 190 of 6 November 2012, Provisions on preventing and combating corruption and other illegal activities in the Public Administration
    5) Guidelines on Justice in Matters involving Child Victims and Witnesses of Crime
    6) Certificate of the State of Enforcement of a Criminal Sentence
    7) Extradition request for non-EU countries
    8) Request for international cooperation made on 26.11.2015 by the Court of Matera
    9) Video-conference link for the examination of ZW at the hearing of 27 October 2014 at 9.30 am and other subsequent hearings, if necessary
    10) State performance with letter accompanied

  2. Cambridge Law Corpus, 1550-2023 - Dataset - B2FIND

    • b2find.eudat.eu
    Updated Feb 23, 2024
    Cite
    (2024). Cambridge Law Corpus, 1550-2023 - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/a4a45a66-767a-57fb-9200-6dfe91963059
    Dataset updated: Feb 23, 2024
    Description

    The Cambridge Law Corpus (CLC) is a corpus designed for legal AI research. It consists of over 250,000 court cases from the UK. Most cases are from the 21st century, but the corpus includes cases as old as the 16th century. Together with the corpus, annotations on case outcomes for 638 cases, done by legal experts, are provided.

    The original cases were supplied by the legal technology company CourtCorrect in raw form, including Microsoft Word and PDF files. The Word files were cleaned and transformed into an XML format. PDF files were converted to text via optical character recognition (OCR), and the resulting text files were then converted to the XML standard format.

    Because of legal and ethical considerations, the full CLC is available only for research purposes, under restrictions, via Related Resources. A smaller dataset of 15 selected cases from the CLC is available on the University of Cambridge Apollo Data Repository, which can also be accessed via Related Resources.

    The corpus was funded by the research project Legal Systems and Artificial Intelligence, which was jointly supported by the UK’s Economic and Social Research Council, part of UKRI, and the Japan Science and Technology Agency (JST), and involved collaboration between Cambridge University (the Centre for Business Research, Department of Computer Science and Faculty of Law) and Hitotsubashi University, Tokyo (the Graduate Schools of Law and Business Administration).

  3. Legal Text Classification Dataset

    • kaggle.com
    Updated Oct 17, 2023
    Cite
    A.Mohan kumar (2023). Legal Text Classification Dataset [Dataset]. https://www.kaggle.com/datasets/amohankumar/legal-text-classification-dataset
    Croissant: a format for machine-learning datasets (see mlcommons.org/croissant)
    Dataset updated: Oct 17, 2023
    Dataset provided by: Kaggle (http://kaggle.com/)
    Authors: A.Mohan kumar
    License: Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    The dataset contains a total of 25,000 legal cases in the form of text documents. Each document has been annotated with catchphrases, citation sentences, citation catchphrases, and citation classes. Citation classes indicate the type of treatment given to the cases cited by the present case.

  4. Multi_Legal_Pile

    • huggingface.co
    Updated Oct 23, 2023
    Cite
    Joel Niklaus (2023). Multi_Legal_Pile [Dataset]. https://huggingface.co/datasets/joelniklaus/Multi_Legal_Pile
    Dataset updated: Oct 23, 2023
    Authors: Joel Niklaus
    License: Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/)
    License information was derived automatically

    Description

    Multi Legal Pile is a dataset of legal documents in the 24 EU languages.

  5. open-australian-legal-corpus

    • huggingface.co
    Updated Mar 10, 2025
    Cite
    Isaacus (2025). open-australian-legal-corpus [Dataset]. http://doi.org/10.57967/hf/2833
    Croissant: a format for machine-learning datasets (see mlcommons.org/croissant)
    Dataset updated: Mar 10, 2025
    Dataset authored and provided by: Isaacus
    License: https://choosealicense.com/licenses/other/
    Area covered: Australia
    Description

    Open Australian Legal Corpus ‍⚖️

    The Open Australian Legal Corpus by Isaacus is the first and only multijurisdictional open corpus of Australian legislative and judicial documents. Comprised of 229,122 texts totalling over 60 million lines and 1.4 billion tokens, the Corpus includes every in force statute and regulation in the Commonwealth, New South Wales, Queensland, Western Australia, South Australia, Tasmania and Norfolk Island, in addition to thousands of bills and hundreds of… See the full description on the dataset page: https://huggingface.co/datasets/isaacus/open-australian-legal-corpus.

  6. open-australian-legal-embeddings

    • huggingface.co
    Updated Nov 15, 2023
    Cite
    Isaacus (2023). open-australian-legal-embeddings [Dataset]. http://doi.org/10.57967/hf/1347
    Croissant: a format for machine-learning datasets (see mlcommons.org/croissant)
    Dataset updated: Nov 15, 2023
    Dataset authored and provided by: Isaacus
    License: https://choosealicense.com/licenses/other/
    Area covered: Australia
    Description

    Open Australian Legal Embeddings ‍⚖️

    The Open Australian Legal Embeddings are the first open-source embeddings of Australian legislative and judicial documents. Trained on the largest open database of Australian law, the Open Australian Legal Corpus, the Embeddings consist of roughly 5.2 million 384-dimensional vectors embedded with BAAI/bge-small-en-v1.5. The Embeddings open the door to a wide range of possibilities in the field of Australian legal AI, including the development of… See the full description on the dataset page: https://huggingface.co/datasets/isaacus/open-australian-legal-embeddings.

  7. English-French Parallel Corpus for the Legal Domain

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). English-French Parallel Corpus for the Legal Domain [Dataset]. https://www.futurebeeai.com/dataset/parallel-corpora/french-english-translated-parallel-corpus-for-legal-domain
    Available download formats: wav
    Dataset updated: Aug 1, 2022
    Dataset provided by: FutureBeeAI
    Authors: FutureBee AI
    License: https://www.futurebeeai.com/policies/ai-data-license-agreement
    Area covered: French
    Dataset funded by: FutureBeeAI
    Description

    Introduction

    The English-French Legal Parallel Corpus is a high-quality bilingual dataset designed to support the development of multilingual legal language models, machine translation systems, and text-based AI tools. With over 50,000 carefully translated sentence pairs, this dataset serves as a critical resource for anyone working on cross-lingual legal technology or NLP applications in the legal field.

    Dataset Content

    Volume and Translator Diversity

    • Sentence Count: Over 50,000 bilingual sentence pairs
    • Translator Base: More than 200 native French linguists with domain familiarity contributed to the translation process
    • Dataset Origin: Built from scratch with legal use cases in mind, ensuring domain relevance and application readiness

    Sentence Variety

    • Length Range: Sentences contain 7 to 25 words
    • Grammatical Structures: Includes simple, compound, and complex sentences
    • Form Types: Covers questions, commands, affirmations, and negations
    • Voice Representation: Balanced use of active and passive sentence constructions
    • Cross Translation: Dataset includes both English-to-French and French-to-English segments to ensure bidirectional support
    • Linguistic Features:
      • Idiomatic expressions and legal jargon
      • Sentence connectors and discourse markers to preserve argument structure and legal reasoning

    Legal Domain Specialization

    Legal Terminology Coverage

    This dataset includes terminology across a wide range of legal subdomains such as:

    • Contracts, agreements, and commercial law
    • Criminal and civil litigation
    • Legal procedures, rulings, and statutory interpretation
    • Administrative, constitutional, and regulatory terms
    • Courtroom dialogue, judgments, and legal advisories

    Contextual Diversity

    Sentence pairs are drawn from realistic legal content types, including:

    • Legal briefs, affidavits, and memoranda
    • Terms of service and data protection policies
    • Research articles and legal scholarship
    • Standard forms and templates
    • Legislative, policy, and compliance language

    Cross-Domain Elements

    To reflect the multidisciplinary nature of legal texts, the dataset also includes content that touches on:

    • Government policy
    • Business and finance
    • Technology, IP, and cybersecurity law

    Format and Structure

    • Available Formats: Delivered in Excel, with optional conversions to TMX, JSON, XML, XLIFF, or other localization formats
    • Included Fields:
      • Serial Number
      • Unique ID
      • Source Sentence and Word Count
      • Target Sentence and Word Count

    Use Cases and Applications

    • Legal Machine Translation: Build accurate translation engines for contracts, laws, and compliance documentation
    • Multilingual NLP Tools: Develop legal summarization tools, AI writing assistants, and terminology alignment engines

  8. Document corpus of court case judgments

    • figshare.com
    application/gzip
    Updated Jun 1, 2023
    Cite
    rupali wagh (2023). Document corpus of court case judgments [Dataset]. http://doi.org/10.6084/m9.figshare.8063186.v2
    Available download formats: application/gzip
    Dataset updated: Jun 1, 2023
    Dataset provided by: Figshare (http://figshare.com/)
    Authors: rupali wagh
    License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    The data set contains Indian Supreme Court judgments, extracted with a web crawler from the website www.judis.nic.in. As preliminary pre-processing, header and metadata information has been removed from the documents.

  9. OpenLegalData (2022 - Corpus)

    • live.european-language-grid.eu
    binary format
    Updated Jul 30, 2023
    Cite
    (2023). OpenLegalData (2022 - Corpus) [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/22980
    Available download formats: binary format
    Dataset updated: Jul 30, 2023
    License: Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/)
    License information was derived automatically

    Description

    OpenLegalData is a free and open platform that makes legal documents and information available to the public. The aim of this platform is to improve the transparency of jurisprudence with the help of open data and to help people without legal training to understand the justice system. The project is committed to the Open Data principles and the Free Access to Justice Movement.

    OpenLegalData's dump as of 2022-10-18 was used to create this corpus. The data was cleaned, automatically annotated (TreeTagger: POS and lemma) and grouped based on the metadata: jurisdiction, BundeslandID, and sub-corpus number if applicable. For example, in Verwaltungsgerichtsbarkeit_11_05.cec6.gz the jurisdiction is administrative jurisdiction, the BundeslandID is 11, and the sub-corpus is 05. Sub-corpora are randomly split into parts of 50 MB each.

    Corpus data is available in CEC6 format. This can be converted into many different corpus formats - use the software www.CorpusExplorer.de if necessary.
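
The file naming scheme described for this corpus (jurisdiction, BundeslandID, optional sub-corpus number, then the .cec6.gz extension) can be split mechanically. The following is a minimal sketch, not part of the dataset's own tooling: the function and class names are hypothetical, and it assumes that jurisdiction names contain no underscore and that the sub-corpus part may be absent, neither of which the listing confirms.

```python
import re
from typing import NamedTuple, Optional


class SubCorpusName(NamedTuple):
    """Parts of a sub-corpus file name (hypothetical helper type)."""
    jurisdiction: str
    bundesland_id: int
    sub_corpus: Optional[str]  # kept as text so leading zeros survive


# Assumed layout: <jurisdiction>_<BundeslandID>[_<sub-corpus>].cec6.gz
_NAME_PATTERN = re.compile(
    r"^(?P<jurisdiction>[^_]+)_(?P<bundesland>\d+)(?:_(?P<sub>\d+))?\.cec6\.gz$"
)


def parse_subcorpus_name(filename: str) -> SubCorpusName:
    """Split a file name into jurisdiction, BundeslandID and sub-corpus."""
    match = _NAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename!r}")
    return SubCorpusName(
        jurisdiction=match["jurisdiction"],
        bundesland_id=int(match["bundesland"]),
        sub_corpus=match["sub"],  # None when the optional part is missing
    )


parsed = parse_subcorpus_name("Verwaltungsgerichtsbarkeit_11_05.cec6.gz")
print(parsed.jurisdiction, parsed.bundesland_id, parsed.sub_corpus)
```

Grouping the downloaded files by the parsed jurisdiction or BundeslandID is then a dictionary lookup; reading the CEC6 content itself still requires the CorpusExplorer software mentioned in the description.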

  10. Pile of Law

    • opendatalab.com
    • huggingface.co
    zip
    Updated Mar 24, 2023
    Cite
    Stanford University (2023). Pile of Law [Dataset]. https://opendatalab.com/OpenDataLab/Pile_of_Law
    Available download formats: zip
    Dataset updated: Mar 24, 2023
    Dataset provided by: Stanford University
    License: Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/)
    License information was derived automatically

    Description

    We curate a large corpus of legal and administrative data. The utility of this data is twofold: (1) to aggregate legal and administrative data sources that demonstrate different norms and legal standards for data filtering; (2) to collect a dataset that can be used in the future for pretraining legal-domain language models, a key direction in access-to-justice initiatives.

  11. English-Gujarati Parallel Corpus for the Legal Domain

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). English-Gujarati Parallel Corpus for the Legal Domain [Dataset]. https://www.futurebeeai.com/dataset/parallel-corpora/gujarati-english-translated-parallel-corpus-for-legal-domain
    Available download formats: wav
    Dataset updated: Aug 1, 2022
    Dataset provided by: FutureBeeAI
    Authors: FutureBee AI
    License: https://www.futurebeeai.com/policies/ai-data-license-agreement
    Dataset funded by: FutureBeeAI
    Description

    Introduction

    The English-Gujarati Legal Parallel Corpus is a high-quality bilingual dataset designed to support the development of multilingual legal language models, machine translation systems, and text-based AI tools. With over 50,000 carefully translated sentence pairs, this dataset serves as a critical resource for anyone working on cross-lingual legal technology or NLP applications in the legal field.

    Dataset Content

    Volume and Translator Diversity

    • Sentence Count: Over 50,000 bilingual sentence pairs
    • Translator Base: More than 200 native Gujarati linguists with domain familiarity contributed to the translation process
    • Dataset Origin: Built from scratch with legal use cases in mind, ensuring domain relevance and application readiness

    Sentence Variety

    • Length Range: Sentences contain 7 to 25 words
    • Grammatical Structures: Includes simple, compound, and complex sentences
    • Form Types: Covers questions, commands, affirmations, and negations
    • Voice Representation: Balanced use of active and passive sentence constructions
    • Cross Translation: Dataset includes both English-to-Gujarati and Gujarati-to-English segments to ensure bidirectional support
    • Linguistic Features:
      • Idiomatic expressions and legal jargon
      • Sentence connectors and discourse markers to preserve argument structure and legal reasoning

    Legal Domain Specialization

    Legal Terminology Coverage

    This dataset includes terminology across a wide range of legal subdomains such as:

    • Contracts, agreements, and commercial law
    • Criminal and civil litigation
    • Legal procedures, rulings, and statutory interpretation
    • Administrative, constitutional, and regulatory terms
    • Courtroom dialogue, judgments, and legal advisories

    Contextual Diversity

    Sentence pairs are drawn from realistic legal content types, including:

    • Legal briefs, affidavits, and memoranda
    • Terms of service and data protection policies
    • Research articles and legal scholarship
    • Standard forms and templates
    • Legislative, policy, and compliance language

    Cross-Domain Elements

    To reflect the multidisciplinary nature of legal texts, the dataset also includes content that touches on:

    • Government policy
    • Business and finance
    • Technology, IP, and cybersecurity law

    Format and Structure

    • Available Formats: Delivered in Excel, with optional conversions to TMX, JSON, XML, XLIFF, or other localization formats
    • Included Fields:
      • Serial Number
      • Unique ID
      • Source Sentence and Word Count
      • Target Sentence and Word Count

    Use Cases and Applications

    • Legal Machine Translation: Build accurate translation engines for contracts, laws, and compliance documentation
    • Multilingual NLP Tools: Develop legal summarization tools, AI writing assistants, and terminology alignment engines

  12. EU Law Labour Market Corpus - Dataset - B2FIND

    • b2find.eudat.eu
    Updated May 18, 2024
    Cite
    (2024). EU Law Labour Market Corpus - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/be849415-bb5d-5629-aac0-528a80d29233
    Dataset updated: May 18, 2024
    Area covered: European Union
    Description

    The datasets consist of a corpus of legal sources in the EU legal system mentioning the expression “labour market” (1968 – March 2023). Data were collected from the Eur-Lex website and then coded by hand according to pre-established definitions of “labour market”. The database contains information on the year, subject matter, and institutional author of the legal text, as well as hyperlinks to the original documents.

  13. The legal consultation data and corpus of the thesis from China law...

    • opendata.pku.edu.cn
    Updated Jun 7, 2018
    Cite
    Peking University Open Research Data Platform (2018). The legal consultation data and corpus of the thesis from China law network. Replication Data for: Design and research of legal consultation text classification system. [Dataset]. http://doi.org/10.18170/DVN/OLO4G8
    Available download formats: text/plain; charset=utf-8 (324707151), text/plain; charset=utf-8 (145265567), text/plain; charset=utf-8 (262774345), zip (1803167), text/plain; charset=utf-8 (163247882)
    Dataset updated: Jun 7, 2018
    Dataset provided by: Peking University Open Research Data Platform
    License: CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically
    Area covered: China
    Description

    Data source: the complete question list of the "legal consultation" section of the China law network, together with all answers to the answered questions.
    Data collection time: December 2017 to February 2018.
    Data collection: a Python crawler script was written to acquire the text automatically, in compliance with the site's robots protocol.
    Processing method: stored in a MongoDB database and exported.
    Data format: CSV and JSON.
    Data description: Question_1 and question_2 contain around 1.2 million records from the question list of the "legal advice" section of the China law network, including whom the consultant belongs to, the content of the consultation question, and the field the question belongs to. Answer_1 and answer_2 contain around 2.1 million records covering all lawyers' answers in the list of resolved questions of the "legal advice" section, including details such as the content of the lawyer's answer, the answering lawyer, and the time of the answer.
    Instructions: mainly the question text was extracted and sorted into a corpus.

  14. Spanish Legal Domain Word & Sub-Word Embeddings

    • zenodo.org
    • data.niaid.nih.gov
    bin, txt
    Updated Nov 4, 2022
    Cite
    Asier Gutiérrez-Fandiño (2022). Spanish Legal Domain Word & Sub-Word Embeddings [Dataset]. http://doi.org/10.5281/zenodo.5036147
    Available download formats: bin, txt
    Dataset updated: Nov 4, 2022
    Dataset provided by: Zenodo (http://zenodo.org/)
    Authors: Asier Gutiérrez-Fandiño
    License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Spanish Legal Word and Sub-word Embeddings in FastText

    These embeddings were generated from the largest corpus (9 GB) of Spanish legal resources compiled to date.

    More legal domain resources: https://github.com/PlanTL-GOB-ES/lm-legal-es

    Citation

    @misc{gutierrezfandino2021legal,
       title={Spanish Legalese Language Model and Corpora}, 
       author={Asier Gutiérrez-Fandiño and Jordi Armengol-Estapé and Aitor Gonzalez-Agirre and Marta Villegas},
       year={2021},
       eprint={2110.12201},
       archivePrefix={arXiv},
       primaryClass={cs.CL}
    }

    Copyright

    Copyright (c) 2021 Secretaría de Estado de Digitalización e Inteligencia Artificial

  15. English and Polish judicial Eurolects: grammar patterns

    • danebadawcze.uw.edu.pl
    zip
    Updated Jul 3, 2025
    Cite
    Koźbiał, Dariusz (2025). English and Polish judicial Eurolects: grammar patterns [Dataset]. http://doi.org/10.58132/GRQRLB
    Available download formats: zip (10307175)
    Dataset updated: Jul 3, 2025
    Dataset provided by: Dane Badawcze UW
    Authors: Koźbiał, Dariusz
    License: CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    The corpus has been compiled to examine the phraseological profiles of the English and Polish judicial Eurolects, using grammar patterns from a comparative, corpus- and genre-based perspective. The English and Polish judicial Eurolects are exemplified by two interconnected genres: Advocate Generals’ (AGs’) opinions and judgments issued by the Court of Justice of the European Union (CJEU).

    The focus corpus is a bilingual (English and Polish) genre-based corpus made up of four subcorpora, each containing 55 texts:
    (1) English language versions of AGs’ opinions (807 547 tokens),
    (2) English language versions of Court of Justice (CJ) judgments issued following an AG’s opinion (756 829 tokens),
    (3) Polish language versions of AGs’ opinions (703 469 tokens),
    (4) Polish language versions of CJ judgments issued following an AG’s opinion (641 188 tokens).
    To limit the scope of the study and to increase its feasibility, only those AGs’ opinions and judgments given in cases concerning actions for annulment of an EU act were included in the corpus.

    There are two reference corpora:
    (1) a sample of the British Law Report Corpus (BLaRC) (cf. Pérez and Rizzo 2012), consisting of 50 judgments handed down by the UK Supreme Court (UKSC) between 2008 and 2010 (735 338 tokens), which serves as a benchmark of non-translated English judicial language against which translated EU judgments are compared, and
    (2) a corpus of 56 judgments issued by the Constitutional Tribunal of the Republic of Poland (1 089 817 tokens).

    All documents in the focus corpus were issued between 2020 and 2022 (calendar years); the only exception is the UKSC judgments, which were included in a premade corpus available on Sketch Engine (Kilgarriff et al. 2014) and handed down between 2008 and 2010.

  16. Indian Court Decision Annotated Corpus.xlsx

    • figshare.com
    xlsx
    Updated Aug 22, 2022
    Cite
    Pooja Harde; Pariskhit Kamat; Suraj Suresh; Shubham Kalson; Sarika Jain; Nandana Mihindukulasooriya (2022). Indian Court Decision Annotated Corpus.xlsx [Dataset]. http://doi.org/10.6084/m9.figshare.19719088.v4
    Available download formats: xlsx
    Dataset updated: Aug 22, 2022
    Dataset provided by: figshare
    Authors: Pooja Harde; Pariskhit Kamat; Suraj Suresh; Shubham Kalson; Sarika Jain; Nandana Mihindukulasooriya
    License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically
    Area covered: India
    Description

    The dataset contains 50 Supreme Court of India decisions annotated for named entity recognition, using three different encoding schemes: IOB, IOBES, and BILOU. The dataset is created using the CoNLL-2003 format.
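
To illustrate how the three encoding schemes named in this description relate to one another, here is a minimal sketch that converts a tag sequence from IOB to BILOU. It is not taken from the dataset: the function name is hypothetical, the entity labels (ORG, PER) are placeholders, and the input is assumed to be in the IOB2 variant, where every entity starts with a B- tag. IOBES follows the same logic with E- in place of L- and S- in place of U-.

```python
from typing import List


def iob2_to_bilou(tags: List[str]) -> List[str]:
    """Convert IOB2 tags to BILOU (hypothetical helper, assumes valid IOB2 input)."""
    converted = []
    for index, tag in enumerate(tags):
        if tag == "O":
            converted.append(tag)
            continue
        prefix, _, label = tag.partition("-")
        next_tag = tags[index + 1] if index + 1 < len(tags) else "O"
        # The entity continues only if the next token is an I- tag of the same label.
        continues = next_tag == f"I-{label}"
        if prefix == "B":
            converted.append(("B-" if continues else "U-") + label)
        else:  # prefix == "I"
            converted.append(("I-" if continues else "L-") + label)
    return converted


example = ["B-ORG", "I-ORG", "I-ORG", "O", "B-PER"]
print(iob2_to_bilou(example))
```

Single-token entities become U- (unit) tags and the final token of a longer entity becomes L- (last), which is the extra boundary information BILOU carries over IOB.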

  17. Corpus of Law, Academic, and News

    • abacus.library.ubc.ca
    iso, txt
    Updated Mar 18, 2022
    Cite
    Abacus Data Network (2022). Corpus of Law, Academic, and News [Dataset]. https://abacus.library.ubc.ca/dataset.xhtml?persistentId=hdl:11272.1/AB2/VMWYC0
    Available download formats: txt (1308), iso (4886528)
    Dataset updated: Mar 18, 2022
    Dataset provided by: Abacus Data Network
    Description

    Abstract

    Introduction: Corpus of Law, Academic, and News consists of 400 Persian documents divided into three genres: legal, academic, and news. The legal section contains texts from official publications, including the civil penal code, the criminal penal code, and the constitution of the Islamic Republic of Iran. The academic sub-corpus is comprised of published academic abstracts in various disciplinary areas, such as Art and Humanities, Social Sciences, and Natural Sciences. The news sub-corpus was extracted from an archive of ten Iranian news outlets spanning the period 2010-2020.

    Data: The document and token counts are as follows: 48 legal documents, 88,170 tokens; 274 academic documents, 85,765 tokens; and 78 news documents, 101,055 tokens. Each document contains metadata in the file's header with information such as specific text type, dates and source, and also contains annotations marking title and body paragraphs. All documents are presented as UTF-8 encoded XML with internal DTDs.

  18. Pretrain-1-Legal-Corpus

    • huggingface.co
    Cite
    Syed hasan, Pretrain-1-Legal-Corpus [Dataset]. https://huggingface.co/datasets/Syed-Hasan-8503/Pretrain-1-Legal-Corpus
    Authors: Syed hasan
    License: Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Syed-Hasan-8503/Pretrain-1-Legal-Corpus dataset hosted on Hugging Face and contributed by the HF Datasets community

  19. Classification of Culture and Process in U.S. Supreme Court Language,...

    • explore.openaire.eu
    • data.niaid.nih.gov
    Updated Nov 18, 2020
    Cite
    David S. Tanenhaus; Eric C. Nystrom (2020). Classification of Culture and Process in U.S. Supreme Court Language, 1793-2011 [Dataset]. http://doi.org/10.5281/zenodo.4279858
    Dataset updated: Nov 18, 2020
    Authors: David S. Tanenhaus; Eric C. Nystrom
    Area covered: United States
    Description

    In an effort to distantly read U.S. Supreme Court opinions, the authors extracted and classified the noun or noun phrase following every usage of the possessive pronoun "our" in every court opinion from 1793 through the end of the 2011 term. All terms were classified by legal historian Professor David S. Tanenhaus, generally with reference to the original opinion text for context, and were re-checked by Tanenhaus after a period of several months. Thus the classifications represent his qualitative, expert opinion. Each term was classified as representing a "process" oriented usage (e.g. the judicial process), a "culture" or heritage usage, or, in some cases, either "ambiguous" or "unclassified" uses. "Ambiguous" uses are those where classification of the individual uses of each term revealed that no consensus emerged about the generalized meaning of the term (in other words, some were process, and some were culture). For more details about the classification process, please see the data paper. The data is contained in a TSV file containing three fields, with no header. The first field is the number of times this particular follower-term appeared in the entire corpus. The second field is the follower term itself (lowercased), which may be a phrase. The third field is the classification of that term. As of the version 11 update, the file contains 9527 unique follower-terms, representing the terms following "our" in 79,693 uses across the corpus of U.S. Supreme Court opinions from 1793-2011.
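
The file layout described above (tab-separated, three fields, no header: count, follower term, classification) can be read with Python's standard csv module. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, the three sample rows are invented placeholders rather than values from the dataset, and the exact spelling of the classification labels in the real file is not confirmed by this listing.

```python
import csv
import io
from collections import Counter
from typing import Iterable, List, TextIO, Tuple

Row = Tuple[int, str, str]  # (count, follower term, classification)


def load_follower_terms(handle: TextIO) -> List[Row]:
    """Read the headerless three-column TSV into typed tuples."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [(int(count), term, label) for count, term, label in reader]


def uses_by_class(rows: Iterable[Row]) -> Counter:
    """Sum the usage counts per classification label."""
    totals: Counter = Counter()
    for count, _term, label in rows:
        totals[label] += count
    return totals


# Invented sample rows, standing in for the real file contents.
sample = "12\tconstitution\tculture\n7\tdecision\tprocess\n3\tcases\tprocess\n"
rows = load_follower_terms(io.StringIO(sample))
totals = uses_by_class(rows)
print(totals)
```

With the real file, passing `open(path, encoding="utf-8", newline="")` instead of the in-memory sample would give the per-class totals for the whole corpus; the path is left to the reader because the listing does not name the file.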

  20. French Monolingual legal corpus from Official Journal of France

    • live.european-language-grid.eu
    • data.europa.eu
    txt
    Updated Aug 31, 2022
    Cite
    (2022). French Monolingual legal corpus from Official Journal of France [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/19397
    Available download formats: txt
    Dataset updated: Aug 31, 2022
    License: Licence Ouverte / Open Licence 2.0 (https://www.etalab.gouv.fr/wp-content/uploads/2018/11/open-licence.pdf)
    License information was derived automatically
    Area covered: France, French
    Description

    French monolingual legal corpus from the Official Journal of France, collected from the https://www.legifrance.gouv.fr/ website.

Corpora of legal text: 92 scholarly articles cite this dataset.