U.S. Government Works: https://www.usa.gov/government-works
License information was derived automatically
Corpora of legal texts supplied by Dr. Claudia Foti. The texts include:
1) Report of the Special Rapporteur on violence against women, its causes and consequences, Ms. Rashida Manjoo
2) General Recommendation
3) Third Evaluation Round Evaluation Report on Italy Incriminations (ETS 173 and 191, GPC 2)
4) Law No. 190 of 6 November 2012, Provisions on preventing and combating corruption and other illegal activities in the Public Administration
5) Guidelines on Justice in Matters involving Child Victims and Witnesses of Crime
6) Certificate of the State of Enforcement of a Criminal Sentence
7) Extradition request for non-EU countries
8) Request for international cooperation made on 26.11.2015 by the Court of Matera
9) Video-conference link for the examination of ZW at the hearing of 27 October 2014 at 9.30 am and other subsequent hearings, if necessary.
10) State performance with letter accompanied.
The Cambridge Law Corpus (CLC) is a corpus designed for legal AI research. It consists of over 250,000 court cases from the UK. Most cases are from the 21st century, but the corpus includes cases dating back to the 16th century. Together with the corpus, annotations on case outcomes for 638 cases, made by legal experts, are provided. The original cases were supplied by the legal technology company CourtCorrect in raw form, including Microsoft Word and PDF files. The Word files were cleaned and transformed into an XML format. PDF files were converted to text via optical character recognition (OCR), and the resulting text files were then converted to the standard XML format. Because of legal and ethical considerations, the full CLC is available only for research purposes, under restrictions, via Related Resources. A smaller dataset of 15 selected cases from the CLC is available on the University of Cambridge Apollo Data Repository, which can also be accessed via Related Resources. The corpus was funded by the research project Legal Systems and Artificial Intelligence, which was jointly supported by the UK’s Economic and Social Research Council, part of UKRI, and the Japan Science and Technology Agency (JST), and involved collaboration between Cambridge University (the Centre for Business Research, Department of Computer Science and Faculty of Law) and Hitotsubashi University, Tokyo (the Graduate Schools of Law and Business Administration).
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
The dataset contains a total of 25,000 legal cases in the form of text documents. Each document has been annotated with catchphrases, citation sentences, citation catchphrases, and citation classes. Citation classes indicate the type of treatment given to the cases cited by the present case.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Multi Legal Pile is a dataset of legal documents in the 24 EU languages.
https://choosealicense.com/licenses/other/
Open Australian Legal Corpus ⚖️
The Open Australian Legal Corpus by Isaacus is the first and only multijurisdictional open corpus of Australian legislative and judicial documents. Comprised of 229,122 texts totalling over 60 million lines and 1.4 billion tokens, the Corpus includes every in force statute and regulation in the Commonwealth, New South Wales, Queensland, Western Australia, South Australia, Tasmania and Norfolk Island, in addition to thousands of bills and hundreds of… See the full description on the dataset page: https://huggingface.co/datasets/isaacus/open-australian-legal-corpus.
https://choosealicense.com/licenses/other/
Open Australian Legal Embeddings ⚖️
The Open Australian Legal Embeddings are the first open-source embeddings of Australian legislative and judicial documents. Trained on the largest open database of Australian law, the Open Australian Legal Corpus, the Embeddings consist of roughly 5.2 million 384-dimensional vectors embedded with BAAI/bge-small-en-v1.5. The Embeddings open the door to a wide range of possibilities in the field of Australian legal AI, including the development of… See the full description on the dataset page: https://huggingface.co/datasets/isaacus/open-australian-legal-embeddings.
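One common use of embeddings like these is nearest-neighbour search by cosine similarity. The sketch below is illustrative only: it uses tiny 3-dimensional toy vectors with made-up document IDs rather than the 384-dimensional vectors of the dataset, and it does not reflect how the dataset's files are structured or loaded.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)

def top_k(query, corpus, k=2):
    """Return the IDs of the k vectors most similar to the query."""
    scored = [(cosine_similarity(query, vec), doc_id) for doc_id, vec in corpus.items()]
    scored.sort(reverse=True)
    return [doc_id for _score, doc_id in scored[:k]]

# Hypothetical toy vectors; real entries would have 384 dimensions.
TOY_CORPUS = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.0, 1.0, 0.0],
    "doc_c": [0.7, 0.7, 0.0],
}

ranked = top_k([1.0, 0.1, 0.0], TOY_CORPUS, k=2)
print(ranked)
```

At the scale of millions of vectors, a brute-force loop like this would normally be replaced by a vectorised or approximate nearest-neighbour index; the loop is kept here only to make the ranking logic explicit.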
https://www.futurebeeai.com/policies/ai-data-license-agreement
The English-French Legal Parallel Corpus is a high-quality bilingual dataset designed to support the development of multilingual legal language models, machine translation systems, and text-based AI tools. With over 50,000 carefully translated sentence pairs, this dataset serves as a critical resource for anyone working on cross-lingual legal technology or NLP applications in the legal field.
This dataset includes terminology across a wide range of legal subdomains such as:
Sentence pairs are drawn from realistic legal content types, including:
To reflect the multidisciplinary nature of legal texts, the dataset also includes content that touches on:
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset contains Indian Supreme Court judgments. The cases were extracted with a web crawler from the website www.judis.nic.in. As preliminary pre-processing, header and metadata information was removed from the documents.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
OpenLegalData is a free and open platform that makes legal documents and information available to the public. The aim of this platform is to improve the transparency of jurisprudence with the help of open data and to help people without legal training to understand the justice system. The project is committed to the Open Data principles and the Free Access to Justice Movement.
OpenLegalData's DUMP as of 2022-10-18 was used to create this corpus. The data was cleaned, automatically annotated (TreeTagger: POS & Lemma) and grouped based on the metadata (jurisdiction - BundeslandID - sub-size if applicable - ex: Verwaltungsgerichtsbarkeit_11_05.cec6.gz - jurisdiction: administrative jurisdiction, BundeslandID = 11 - sub-corpus = 05). Sub-corpora are randomly split into 50 MB each.
Corpus data is available in CEC6 format. This can be converted into many different corpus formats - use the software www.CorpusExplorer.de if necessary.
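The file-name convention described above (jurisdiction, BundeslandID, optional sub-corpus number) can be unpacked with a small parser. This is a minimal sketch: the regular expression is inferred from the single example name given in the description, and the function name and dictionary keys are hypothetical. It does not read the CEC6 content itself.

```python
import re

# Pattern inferred from the example "Verwaltungsgerichtsbarkeit_11_05.cec6.gz":
# <jurisdiction>_<BundeslandID>[_<sub-corpus>].cec6.gz
NAME_PATTERN = re.compile(
    r"^(?P<jurisdiction>.+?)_(?P<bundesland>\d+)(?:_(?P<sub>\d+))?\.cec6\.gz$"
)

def parse_subcorpus_name(filename):
    """Split a sub-corpus file name into its metadata parts, or return None."""
    match = NAME_PATTERN.match(filename)
    if match is None:
        return None
    return {
        "jurisdiction": match.group("jurisdiction"),
        "bundesland_id": int(match.group("bundesland")),
        # Kept as a string so leading zeros such as "05" are preserved;
        # None when the file is not split into sub-corpora.
        "sub_corpus": match.group("sub"),
    }

info = parse_subcorpus_name("Verwaltungsgerichtsbarkeit_11_05.cec6.gz")
print(info)
```

Files that do not follow the assumed pattern yield None, so a caller can log and skip them instead of failing.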
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
We curate a large corpus of legal and administrative data. The utility of this data is twofold: (1) to aggregate legal and administrative data sources that demonstrate different norms and legal standards for data filtering; (2) to collect a dataset that can be used in the future for pretraining legal-domain language models, a key direction in access-to-justice initiatives.
https://www.futurebeeai.com/policies/ai-data-license-agreement
The English-Gujarati Legal Parallel Corpus is a high-quality bilingual dataset designed to support the development of multilingual legal language models, machine translation systems, and text-based AI tools. With over 50,000 carefully translated sentence pairs, this dataset serves as a critical resource for anyone working on cross-lingual legal technology or NLP applications in the legal field.
This dataset includes terminology across a wide range of legal subdomains such as:
Sentence pairs are drawn from realistic legal content types, including:
To reflect the multidisciplinary nature of legal texts, the dataset also includes content that touches on:
The datasets consist of a corpus of legal sources in the EU legal system mentioning the expression “labour market” (1968 – March 2023). Data were collected from the Eur-Lex website and then coded by hand according to pre-established definitions of “labour market”. The database contains information on the year, subject matter, and institutional author of the legal text, as well as hyperlinks to the original documents.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Data source: the complete question list of the "legal consultation" section of the China Law Network website, together with all answers posted to those questions. Data collection period: December 2017 to February 2018. Data collection: a Python crawler script gathered the text automatically, in compliance with the site's robots protocol. Processing: the data were stored in a MongoDB database and then exported. Data format: CSV and JSON. Data description: Question_1 and question_2 contain around 1.2 million records from the question list of the "legal consultation" section, including information on the person asking, the content of the question, and the field the question belongs to; Answer_1 and answer_2 contain around 2.1 million records of lawyers' answers to the questions in that section, including the content and time of each answer. Note: the corpus was compiled mainly from the extracted question text.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Spanish Legal Word and Sub-word Embeddings in FastText
These embeddings were generated from the largest corpus (9GB) of Spanish legal resources compiled to date.
More legal domain resources: https://github.com/PlanTL-GOB-ES/lm-legal-es
Citation
@misc{gutierrezfandino2021legal,
title={Spanish Legalese Language Model and Corpora},
author={Asier Gutiérrez-Fandiño and Jordi Armengol-Estapé and Aitor Gonzalez-Agirre and Marta Villegas},
year={2021},
eprint={2110.12201},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Copyright
Copyright (c) 2021 Secretaría de Estado de Digitalización e Inteligencia Artificial
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The corpus has been compiled to examine the phraseological profiles of the English and Polish judicial Eurolects, using grammar patterns from a comparative, corpus- and genre-based perspective. The English and Polish judicial Eurolects are exemplified by two interconnected genres: Advocate Generals’ (AGs’) opinions and judgments issued by the Court of Justice of the European Union (CJEU). The focus corpus is a bilingual (English and Polish) genre-based corpus made up of four subcorpora, each made up of 55 texts: (1) English language versions of AGs’ opinions (807 547 tokens), (2) English language versions of Court of Justice (CJ) judgments issued following an AG’s opinion (756 829 tokens), (3) Polish language versions of AGs’ opinions (703 469 tokens), (4) Polish language versions of CJ judgments issued following an AG’s opinion (641 188 tokens). To limit the scope of the study and to increase its feasibility, only those AGs’ opinions and judgments given in cases concerning actions for annulment of an EU act were included in the corpus. There are two reference corpora: (1) a sample of the British Law Report Corpus (BLaRC) (cf. Pérez and Rizzo 2012), which consists of 50 judgments handed down by the UK Supreme Court (UKSC) between 2008 and 2010 (735 338 tokens) and serves as a benchmark of non-translated English judicial language against which translated EU judgments are compared, and (2) a corpus of 56 judgments issued by the Constitutional Tribunal of the Republic of Poland (1 089 817 tokens). All documents in the focus corpus were issued between 2020 and 2022 (calendar years); the only exception is the UKSC judgments, which were included in a premade corpus available on Sketch Engine (Kilgarriff et al. 2014) and handed down between 2008 and 2010.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset contains 50 Supreme Court of India decisions annotated for named entity recognition, with the case documents tagged in three different encoding schemes: IOB, IOBES, and BILOU. The dataset is created using the CoNLL-2003 format.
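The three encoding schemes mark the same entity spans with different boundary tags. The sketch below converts one to another, assuming the IOB variant in which every entity starts with a B- tag (often called IOB2). The tokens and the JUDGE and COURT labels are invented for illustration and are not taken from the dataset.

```python
def iob2_to_bilou(tags):
    """Convert IOB2 tags (B-X, I-X, O) to BILOU tags (B-, I-, L-, U-, O)."""
    bilou = []
    for i, tag in enumerate(tags):
        if tag == "O":
            bilou.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = next_tag == "I-" + label  # does the entity extend further?
        if prefix == "B":
            # A one-token entity becomes U- (unit); otherwise it stays B-.
            bilou.append(("B-" if continues else "U-") + label)
        else:
            # The final inside token becomes L- (last); otherwise it stays I-.
            bilou.append(("I-" if continues else "L-") + label)
    return bilou

def bilou_to_iobes(tags):
    """IOBES uses E- (end) and S- (single) where BILOU uses L- and U-."""
    rename = {"L": "E", "U": "S"}
    return [
        tag if tag == "O" else rename.get(tag[0], tag[0]) + tag[1:]
        for tag in tags
    ]

# Hypothetical sentence: "Justice Example Name of Court"
iob2 = ["B-JUDGE", "I-JUDGE", "I-JUDGE", "O", "B-COURT"]
bilou = iob2_to_bilou(iob2)
iobes = bilou_to_iobes(bilou)
print(bilou)
print(iobes)
```

In CoNLL-2003-style files each token sits on its own line with its tag in the last column, so a tag list like the one above corresponds to one sentence read column-wise.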
Abstract
Introduction: Corpus of Law, Academic, and News consists of 400 Persian documents divided into three genres: legal, academic, and news. The legal section contains texts from official publications, including the civil penal code, the criminal penal code, and the constitution of the Islamic Republic of Iran. The academic sub-corpus is comprised of published academic abstracts in various disciplinary areas, such as Art and Humanities, Social Sciences, and Natural Sciences. The news sub-corpus was extracted from an archive of ten Iranian news outlets spanning the period 2010-2020.
Data: The document and token counts are as follows: 48 legal documents, 88,170 tokens; 274 academic documents, 85,765 tokens; and 78 news documents, 101,055 tokens. Each document contains metadata in the file's header with information such as specific text type, dates and source, and also contains annotations marking title and body paragraphs. All documents are presented as UTF-8 encoded XML with internal DTDs.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Syed-Hasan-8503/Pretrain-1-Legal-Corpus dataset hosted on Hugging Face and contributed by the HF Datasets community
In an effort to distantly read U.S. Supreme Court opinions, the authors extracted and classified the noun or noun phrase following every usage of the possessive pronoun "our" in every court opinion from 1793 through the end of the 2011 term. All terms were classified by legal historian Professor David S. Tanenhaus, generally with reference to the original opinion text for context, and were re-checked by Tanenhaus after a period of several months. Thus the classifications represent his qualitative, expert opinion. Each term was classified as representing a "process" oriented usage (e.g. the judicial process), a "culture" or heritage usage, or, in some cases, either "ambiguous" or "unclassified" uses. "Ambiguous" uses are those where classification of the individual uses of each term revealed that no consensus emerged about the generalized meaning of the term (in other words, some were process, and some were culture). For more details about the classification process, please see the data paper. The data is contained in a TSV file containing three fields, with no header. The first field is the number of times this particular follower-term appeared in the entire corpus. The second field is the follower term itself (lowercased), which may be a phrase. The third field is the classification of that term. As of the version 11 update, the file contains 9527 unique follower-terms, representing the terms following "our" in 79,693 uses across the corpus of U.S. Supreme Court opinions from 1793-2011.
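The three-field, headerless TSV layout described above can be read with the standard csv module. The sketch below is illustrative: the sample rows are invented, and the exact spelling of the classification labels in the real file is an assumption based on the category names in the description.

```python
import csv
import io
from collections import Counter

# Hypothetical rows in the described layout: count, follower term, classification.
SAMPLE_TSV = "120\tconstitution\tculture\n45\tdecision\tprocess\n3\tview\tambiguous\n"

def load_follower_terms(handle):
    """Read headerless three-field TSV rows into (count, term, label) tuples."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for record in reader:
        if len(record) != 3:
            continue  # skip malformed lines rather than guess at their meaning
        count, term, label = record
        rows.append((int(count), term, label))
    return rows

def uses_per_class(rows):
    """Sum the occurrence counts of all terms within each classification."""
    totals = Counter()
    for count, _term, label in rows:
        totals[label] += count
    return totals

rows = load_follower_terms(io.StringIO(SAMPLE_TSV))
totals = uses_per_class(rows)
print(totals)
```

For the real file, the io.StringIO sample would be replaced by a handle from open(path, newline="", encoding="utf-8"); the encoding is also an assumption.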
Licence Ouverte / Open Licence 2.0: https://www.etalab.gouv.fr/wp-content/uploads/2018/11/open-licence.pdf
License information was derived automatically
French monolingual legal corpus from the Official Journal of France, collected from the https://www.legifrance.gouv.fr/ website.