100+ datasets found

Corpora of legal text
live.european-language-grid.eu
data.europa.eu
xml
Updated Aug 30, 2022
Cite
(2022). Corpora of legal text [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/18873
Available download formats: xml
Dataset updated
Aug 30, 2022
License
U.S. Government Works: https://www.usa.gov/government-works
License information was derived automatically
Description
Corpora of legal texts supplied by Dr. Claudia Foti. The texts include:

1) Report of the Special Rapporteur on violence against women, its causes and consequences, Ms. Rashida Manjoo
2) General Recommendation
3) Third Evaluation Round Evaluation Report on Italy Incriminations (ETS 173 and 191, GPC 2)
4) Law No. 190 of 6 November 2012, Provisions on preventing and combating corruption and other illegal activities in the Public Administration
5) Guidelines on Justice in Matters involving Child Victims and Witnesses of Crime
6) Certificate of the State of Enforcement of a Criminal Sentence
7) Extradition request for non-EU countries
8) Request for international cooperation made on 26.11.2015 by the Court of Matera
9) Video-conference link for the examination of ZW at the hearing of 27 October 2014 at 9.30 am and other subsequent hearings, if necessary.
10) State performance with letter accompanied.
Cambridge Law Corpus, 1550-2023 - Dataset - B2FIND
b2find.eudat.eu
Updated Feb 23, 2024
Cite
(2024). Cambridge Law Corpus, 1550-2023 - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/a4a45a66-767a-57fb-9200-6dfe91963059
Dataset updated
Feb 23, 2024
Description
The Cambridge Law Corpus (CLC) is a corpus designed for legal AI research. It consists of over 250,000 court cases from the UK. Most cases are from the 21st century, but the corpus includes cases as old as the 16th century. Annotations on case outcomes for 638 cases, made by legal experts, are provided together with the corpus.

The original cases were supplied by the legal technology company CourtCorrect in raw form, including Microsoft Word and PDF files. The Word files were cleaned and transformed into an XML format. PDF files were converted to text via optical character recognition (OCR), and the resulting text files were then converted to XML.

Because of legal and ethical considerations, the full CLC is available only for research purposes under restrictions, via Related Resources. A smaller dataset of 15 selected cases from the CLC is available on the University of Cambridge Apollo Data Repository, which can also be accessed via Related Resources.

The corpus was funded by the research project Legal Systems and Artificial Intelligence, which was jointly supported by the UK’s Economic and Social Research Council, part of UKRI, and the Japan Science and Technology Agency (JST). The project involved collaboration between Cambridge University (the Centre for Business Research, Department of Computer Science and Faculty of Law) and Hitotsubashi University, Tokyo (the Graduate Schools of Law and Business Administration).
Legal Text Classification Dataset
kaggle.com
Updated Oct 17, 2023
Cite
A.Mohan kumar (2023). Legal Text Classification Dataset [Dataset]. https://www.kaggle.com/datasets/amohankumar/legal-text-classification-dataset
Croissant metadata available. Croissant is a format for machine-learning datasets; learn more at mlcommons.org/croissant.
Dataset updated
Oct 17, 2023
Dataset provided by
Kaggle: http://kaggle.com/
Authors
A.Mohan kumar
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
The dataset contains a total of 25,000 legal cases in the form of text documents. Each document has been annotated with catchphrases, citation sentences, citation catchphrases, and citation classes. Citation classes indicate the type of treatment given to the cases cited by the present case.
Multi_Legal_Pile
huggingface.co
Updated Oct 23, 2023
Cite
Joel Niklaus (2023). Multi_Legal_Pile [Dataset]. https://huggingface.co/datasets/joelniklaus/Multi_Legal_Pile
Dataset updated
Oct 23, 2023
Authors
Joel Niklaus
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
Multi Legal Pile is a dataset of legal documents in the 24 EU languages.
open-australian-legal-corpus
huggingface.co
Updated Mar 10, 2025
Cite
Isaacus (2025). open-australian-legal-corpus [Dataset]. http://doi.org/10.57967/hf/2833
Croissant metadata available (see mlcommons.org/croissant).
Unique identifier
https://doi.org/10.57967/hf/2833
Dataset updated
Mar 10, 2025
Dataset authored and provided by
Isaacus
License
https://choosealicense.com/licenses/other/
Area covered
Australia
Description
Open Australian Legal Corpus ⚖️

The Open Australian Legal Corpus by Isaacus is the first and only multijurisdictional open corpus of Australian legislative and judicial documents. Comprised of 229,122 texts totalling over 60 million lines and 1.4 billion tokens, the Corpus includes every in force statute and regulation in the Commonwealth, New South Wales, Queensland, Western Australia, South Australia, Tasmania and Norfolk Island, in addition to thousands of bills and hundreds of… See the full description on the dataset page: https://huggingface.co/datasets/isaacus/open-australian-legal-corpus.
open-australian-legal-embeddings
huggingface.co
Updated Nov 15, 2023
Cite
Isaacus (2023). open-australian-legal-embeddings [Dataset]. http://doi.org/10.57967/hf/1347
Croissant metadata available (see mlcommons.org/croissant).
Unique identifier
https://doi.org/10.57967/hf/1347
Dataset updated
Nov 15, 2023
Dataset authored and provided by
Isaacus
License
https://choosealicense.com/licenses/other/
Area covered
Australia
Description
Open Australian Legal Embeddings ⚖️

The Open Australian Legal Embeddings are the first open-source embeddings of Australian legislative and judicial documents. Trained on the largest open database of Australian law, the Open Australian Legal Corpus, the Embeddings consist of roughly 5.2 million 384-dimensional vectors embedded with BAAI/bge-small-en-v1.5. The Embeddings open the door to a wide range of possibilities in the field of Australian legal AI, including the development of… See the full description on the dataset page: https://huggingface.co/datasets/isaacus/open-australian-legal-embeddings.
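A common use of vectors like these is nearest-neighbour search by cosine similarity. The sketch below shows that technique in plain Python. Only the 384-dimension figure comes from the description above; the helper names `cosine_similarity` and `top_k`, the toy one-hot vectors, and the in-memory list are assumptions made for illustration and are not part of the dataset or its tooling.

```python
import math
from typing import List, Sequence, Tuple

DIM = 384  # vector width stated in the dataset description


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float],
          corpus: List[Sequence[float]],
          k: int = 3) -> List[Tuple[int, float]]:
    """Rank corpus vectors by similarity to the query, best first."""
    scored = [(i, cosine_similarity(query, v)) for i, v in enumerate(corpus)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def unit(index: int) -> List[float]:
    """Hypothetical toy vector: 1.0 at one position, 0.0 elsewhere."""
    vec = [0.0] * DIM
    vec[index] = 1.0
    return vec


# Three stand-in "document" vectors and a query that leans towards the second.
corpus = [unit(0), unit(1), unit(2)]
query = unit(1)
query[2] = 0.5
best = top_k(query, corpus, k=2)
print(best)
```

A brute-force scan like this is only practical for small samples; at millions of vectors an approximate nearest-neighbour index would normally replace the loop.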
English-French Parallel Corpus for the Legal Domain
futurebeeai.com
wav
Updated Aug 1, 2022
Cite
FutureBee AI (2022). English-French Parallel Corpus for the Legal Domain [Dataset]. https://www.futurebeeai.com/dataset/parallel-corpora/french-english-translated-parallel-corpus-for-legal-domain
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Area covered
French
Dataset funded by
FutureBeeAI
Description
Introduction
The English-French Legal Parallel Corpus is a high-quality bilingual dataset designed to support the development of multilingual legal language models, machine translation systems, and text-based AI tools. With over 50,000 carefully translated sentence pairs, this dataset serves as a critical resource for anyone working on cross-lingual legal technology or NLP applications in the legal field.
Dataset Content
•Volume and Translator Diversity
•Sentence Count: Over 50,000 bilingual sentence pairs
•Translator Base: More than 200 native French linguists with domain familiarity contributed to the translation process
•Dataset Origin: Built from scratch with legal use cases in mind, ensuring domain relevance and application readiness
•Sentence Variety
•Length Range: Sentences contain 7 to 25 words
•Grammatical Structures: Includes simple, compound, and complex sentences
•Form Types: Covers questions, commands, affirmations, and negations
•Voice Representation: Balanced use of active and passive sentence constructions
•Cross Translation: Dataset includes both English-to-French and French-to-English segments to ensure bidirectional support
•Linguistic Features:
•Idiomatic expressions and legal jargon
•Sentence connectors and discourse markers to preserve argument structure and legal reasoning
Legal Domain Specialization
•Legal Terminology Coverage
This dataset includes terminology across a wide range of legal subdomains such as:
•Contracts, agreements, and commercial law
•Criminal and civil litigation
•Legal procedures, rulings, and statutory interpretation
•Administrative, constitutional, and regulatory terms
•Courtroom dialogue, judgments, and legal advisories
•Contextual Diversity
Sentence pairs are drawn from realistic legal content types, including:
•Legal briefs, affidavits, and memoranda
•Terms of service and data protection policies
•Research articles and legal scholarship
•Standard forms and templates
•Legislative, policy, and compliance language
•Cross-Domain Elements
To reflect the multidisciplinary nature of legal texts, the dataset also includes content that touches on:
•Government policy
•Business and finance
•Technology, IP, and cybersecurity law
Format and Structure
•Available Formats: Delivered in Excel, with optional conversions to TMX, JSON, XML, XLIFF, or other localization formats
•Included Fields:
•Serial Number
•Unique ID
•Source Sentence and Word Count
•Target Sentence and Word Count
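The listed fields and the stated length range of 7 to 25 words suggest a simple per-row sanity check. The sketch below is illustrative only: the `SentencePair` class, its attribute names, the whitespace-based word count, and the sample sentences are assumptions, not the vendor's schema or tooling.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class SentencePair:
    """One row, with hypothetical names mirroring the listed fields."""
    serial_number: int
    unique_id: str
    source_sentence: str
    source_word_count: int
    target_sentence: str
    target_word_count: int


def check_pair(pair: SentencePair,
               min_words: int = 7,
               max_words: int = 25) -> List[str]:
    """Return a list of problems found in one row (empty if none)."""
    problems = []
    sides = (
        ("source", pair.source_sentence, pair.source_word_count),
        ("target", pair.target_sentence, pair.target_word_count),
    )
    for side, text, declared in sides:
        actual = len(text.split())  # assumption: words are whitespace-separated
        if actual != declared:
            problems.append(f"{side}: declared {declared} words, counted {actual}")
        if not min_words <= actual <= max_words:
            problems.append(f"{side}: {actual} words outside {min_words}-{max_words}")
    return problems


# Invented sample row; it does not come from the dataset.
good = SentencePair(
    serial_number=1,
    unique_id="hypothetical-0001",
    source_sentence="The tenant shall pay the rent on the first day of each month.",
    source_word_count=13,
    target_sentence="Le locataire paie le loyer le premier jour de chaque mois.",
    target_word_count=11,
)
bad = replace(good, source_word_count=12)  # deliberately wrong declared count

print(check_pair(good))
print(check_pair(bad))
```

Whitespace splitting is a rough proxy; the dataset's own word counts may follow a different tokenization rule.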
Use Cases and Applications
•Legal Machine Translation: Build accurate translation engines for contracts, laws, and compliance documentation
•Multilingual NLP Tools: Develop legal summarization tools, AI writing assistants, and terminology alignment engines
Document corpus of court case judgments
figshare.com
application/gzip
Updated Jun 1, 2023
Cite
rupali wagh (2023). Document corpus of court case judgments [Dataset]. http://doi.org/10.6084/m9.figshare.8063186.v2
Available download formats: application/gzip
Unique identifier
https://doi.org/10.6084/m9.figshare.8063186.v2
Dataset updated
Jun 1, 2023
Dataset provided by
Figshare: http://figshare.com/
Authors
rupali wagh
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The data set contains Indian Supreme Court judgments. The cases were extracted with a web crawler from the website www.judis.nic.in. Preliminary pre-processing removed header and metadata information from each document.
OpenLegalData (2022 - Corpus)
live.european-language-grid.eu
binary format
Updated Jul 30, 2023
Cite
(2023). OpenLegalData (2022 - Corpus) [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/22980
Available download formats: binary format
Dataset updated
Jul 30, 2023
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
OpenLegalData is a free and open platform that makes legal documents and information available to the public. The aim of this platform is to improve the transparency of jurisprudence with the help of open data and to help people without legal training to understand the justice system. The project is committed to the Open Data principles and the Free Access to Justice Movement.

The OpenLegalData dump of 2022-10-18 was used to create this corpus. The data was cleaned, automatically annotated (TreeTagger: POS and lemma) and grouped by metadata: jurisdiction, BundeslandID and, where applicable, a sub-corpus number. For example, Verwaltungsgerichtsbarkeit_11_05.cec6.gz stands for jurisdiction = administrative jurisdiction, BundeslandID = 11, sub-corpus = 05. Sub-corpora are randomly split into parts of 50 MB each.

Corpus data is available in CEC6 format. This can be converted into many different corpus formats - use the software www.CorpusExplorer.de if necessary.
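The file-naming convention described above can be unpacked with a small parser. This is a minimal sketch based only on the single example name in the description: the `SubCorpusName` type, the `parse_subcorpus_name` function, and the assumption that the sub-corpus number is an optional trailing numeric segment are illustrative, not part of CorpusExplorer or OpenLegalData.

```python
import re
from typing import NamedTuple, Optional


class SubCorpusName(NamedTuple):
    """Parts of a file name such as Verwaltungsgerichtsbarkeit_11_05.cec6.gz."""
    jurisdiction: str
    bundesland_id: int
    sub_corpus: Optional[int]  # None when the file was not split further


# Assumed layout: <jurisdiction>_<BundeslandID>[_<sub-corpus>].cec6.gz
_NAME_PATTERN = re.compile(
    r"^(?P<jurisdiction>.+?)_(?P<land>\d+)(?:_(?P<part>\d+))?\.cec6\.gz$"
)


def parse_subcorpus_name(filename: str) -> SubCorpusName:
    """Split a sub-corpus file name into its metadata fields."""
    match = _NAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename!r}")
    part = match.group("part")
    return SubCorpusName(
        jurisdiction=match.group("jurisdiction"),
        bundesland_id=int(match.group("land")),
        sub_corpus=int(part) if part is not None else None,
    )


example = parse_subcorpus_name("Verwaltungsgerichtsbarkeit_11_05.cec6.gz")
print(example)
```

Keeping the sub-corpus field optional reflects the "where applicable" wording; real file names may deviate from this pattern and should be checked against the actual download.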
Pile of Law
opendatalab.com
huggingface.co
zip
Updated Mar 24, 2023
Cite
Stanford University (2023). Pile of Law [Dataset]. https://opendatalab.com/OpenDataLab/Pile_of_Law
Available download formats: zip
Dataset updated
Mar 24, 2023
Dataset provided by
Stanford University
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
We curate a large corpus of legal and administrative data. The utility of this data is twofold: (1) to aggregate legal and administrative data sources that demonstrate different norms and legal standards for data filtering; (2) to collect a dataset that can be used in the future for pretraining legal-domain language models, a key direction in access-to-justice initiatives.
English-Gujarati Parallel Corpus for the Legal Domain
futurebeeai.com
wav
Updated Aug 1, 2022
Cite
FutureBee AI (2022). English-Gujarati Parallel Corpus for the Legal Domain [Dataset]. https://www.futurebeeai.com/dataset/parallel-corpora/gujarati-english-translated-parallel-corpus-for-legal-domain
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
Introduction
The English-Gujarati Legal Parallel Corpus is a high-quality bilingual dataset designed to support the development of multilingual legal language models, machine translation systems, and text-based AI tools. With over 50,000 carefully translated sentence pairs, this dataset serves as a critical resource for anyone working on cross-lingual legal technology or NLP applications in the legal field.
Dataset Content
•Volume and Translator Diversity
•Sentence Count: Over 50,000 bilingual sentence pairs
•Translator Base: More than 200 native Gujarati linguists with domain familiarity contributed to the translation process
•Dataset Origin: Built from scratch with legal use cases in mind, ensuring domain relevance and application readiness
•Sentence Variety
•Length Range: Sentences contain 7 to 25 words
•Grammatical Structures: Includes simple, compound, and complex sentences
•Form Types: Covers questions, commands, affirmations, and negations
•Voice Representation: Balanced use of active and passive sentence constructions
•Cross Translation: Dataset includes both English-to-Gujarati and Gujarati-to-English segments to ensure bidirectional support
•Linguistic Features:
•Idiomatic expressions and legal jargon
•Sentence connectors and discourse markers to preserve argument structure and legal reasoning
Legal Domain Specialization
•Legal Terminology Coverage
This dataset includes terminology across a wide range of legal subdomains such as:
•Contracts, agreements, and commercial law
•Criminal and civil litigation
•Legal procedures, rulings, and statutory interpretation
•Administrative, constitutional, and regulatory terms
•Courtroom dialogue, judgments, and legal advisories
•Contextual Diversity
Sentence pairs are drawn from realistic legal content types, including:
•Legal briefs, affidavits, and memoranda
•Terms of service and data protection policies
•Research articles and legal scholarship
•Standard forms and templates
•Legislative, policy, and compliance language
•Cross-Domain Elements
To reflect the multidisciplinary nature of legal texts, the dataset also includes content that touches on:
•Government policy
•Business and finance
•Technology, IP, and cybersecurity law
Format and Structure
•Available Formats: Delivered in Excel, with optional conversions to TMX, JSON, XML, XLIFF, or other localization formats
•Included Fields:
•Serial Number
•Unique ID
•Source Sentence and Word Count
•Target Sentence and Word Count
Use Cases and Applications
•Legal Machine Translation: Build accurate translation engines for contracts, laws, and compliance documentation
•Multilingual NLP Tools: Develop legal summarization tools, AI writing assistants, and terminology alignment engines
EU Law Labour Market Corpus - Dataset - B2FIND
b2find.eudat.eu
Updated May 18, 2024
Cite
(2024). EU Law Labour Market Corpus - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/be849415-bb5d-5629-aac0-528a80d29233
Dataset updated
May 18, 2024
Area covered
European Union
Description
The datasets consist of a corpus of legal sources in the EU legal system mentioning the expression “labour market” (1968 – March 2023). Data were collected from the Eur-Lex website and then coded by hand according to pre-established definitions of “labour market”. The database contains information on the year, subject matter, and institutional author of the legal text, as well as hyperlinks to the original documents.
The legal consultation data and corpus of the thesis from China law...
opendata.pku.edu.cn
Updated Jun 7, 2018
Cite
Peking University Open Research Data Platform (2018). The legal consultation data and corpus of the thesis from China law network.Replication Data for: Design and research of legal consultation text classification system. [Dataset]. http://doi.org/10.18170/DVN/OLO4G8
Available download formats: text/plain; charset=utf-8 (324707151), text/plain; charset=utf-8 (145265567), text/plain; charset=utf-8 (262774345), zip (1803167), text/plain; charset=utf-8 (163247882)
Unique identifier
https://doi.org/10.18170/DVN/OLO4G8
Dataset updated
Jun 7, 2018
Dataset provided by
Peking University Open Research Data Platform
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Area covered
China
Description
Data source: the full question list of the "legal consultation" section of China law network, together with all answers to the questions that were answered. Data collection time: December 2017 to February 2018. Data collection: a Python crawler script gathered the text automatically and complied with the site's robots protocol. Processing method: stored in a MongoDB database and exported. Data format: CSV and JSON. Data description: Question_1 and question_2 contain around 1.2 million records from the question list of the "legal consultation" section, including information about the person consulting, the content of the question, and the field the question belongs to. Answer_1 and answer_2 contain around 2.1 million records covering the lawyers' answers to the resolved questions in the same section, including the answer content, the answering lawyer, and the answer time. Instructions: the question text was extracted and organized into a corpus.
Spanish Legal Domain Word & Sub-Word Embeddings
zenodo.org
data.niaid.nih.gov
bin, txt
Updated Nov 4, 2022
Cite
Asier Gutiérrez-Fandiño (2022). Spanish Legal Domain Word & Sub-Word Embeddings [Dataset]. http://doi.org/10.5281/zenodo.5036147
Available download formats: bin, txt
Unique identifier
https://doi.org/10.5281/zenodo.5036147
Dataset updated
Nov 4, 2022
Dataset provided by
Zenodo: http://zenodo.org/
Authors
Asier Gutiérrez-Fandiño
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Spanish Legal Word and Sub-word Embeddings in FastText

These embeddings were generated from the largest corpus (9 GB) compiled to date from Spanish legal resources.

More legal domain resources: https://github.com/PlanTL-GOB-ES/lm-legal-es

Citation

@misc{gutierrezfandino2021legal,
  title={Spanish Legalese Language Model and Corpora},
  author={Asier Gutiérrez-Fandiño and Jordi Armengol-Estapé and Aitor Gonzalez-Agirre and Marta Villegas},
  year={2021},
  eprint={2110.12201},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

Copyright

Copyright (c) 2021 Secretaría de Estado de Digitalización e Inteligencia Artificial
English and Polish judicial Eurolects: grammar patterns
danebadawcze.uw.edu.pl
zip
Updated Jul 3, 2025
Cite
Koźbiał, Dariusz (2025). English and Polish judicial Eurolects: grammar patterns [Dataset]. http://doi.org/10.58132/GRQRLB
Available download formats: zip (10307175)
Unique identifier
https://doi.org/10.58132/GRQRLB
Dataset updated
Jul 3, 2025
Dataset provided by
Dane Badawcze UW
Authors
Koźbiał, Dariusz
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
The corpus has been compiled to examine the phraseological profiles of the English and Polish judicial Eurolects, using grammar patterns from a comparative, corpus- and genre-based perspective. The English and Polish judicial Eurolects are exemplified by two interconnected genres: Advocate Generals’ (AGs’) opinions and judgments issued by the Court of Justice of the European Union (CJEU).

The focus corpus is a bilingual (English and Polish) genre-based corpus made up of four subcorpora, each made up of 55 texts:
(1) English language versions of AGs’ opinions (807 547 tokens),
(2) English language versions of Court of Justice (CJ) judgments issued following an AG’s opinion (756 829 tokens),
(3) Polish language versions of AGs’ opinions (703 469 tokens),
(4) Polish language versions of CJ judgments issued following an AG’s opinion (641 188 tokens).

To limit the scope of the study and to increase its feasibility, the corpus includes only those AGs’ opinions and judgments which were given in cases concerning actions for annulment of an EU act.

There are two reference corpora:
(1) a sample of the British Law Report Corpus (BLaRC) (cf. Pérez and Rizzo 2012), which consists of 50 judgments handed down by the UK Supreme Court (UKSC) between 2008 and 2010 (735 338 tokens) and serves as a benchmark of non-translated English judicial language against which translated EU judgments are compared, and
(2) a corpus of 56 judgments issued by the Constitutional Tribunal of the Republic of Poland (1 089 817 tokens).

All documents in the focus corpus were issued between 2020 and 2022 (calendar years). The only exception is the UKSC judgments, which were included in a premade corpus available on Sketch Engine (Kilgarriff et al. 2014) and handed down between 2008 and 2010.
Indian Court Decision Annotated Corpus.xlsx
figshare.com
xlsx
Updated Aug 22, 2022
Cite
Pooja Harde; Pariskhit Kamat; Suraj Suresh; Shubham Kalson; Sarika Jain; Nandana Mihindukulasooriya (2022). Indian Court Decision Annotated Corpus.xlsx [Dataset]. http://doi.org/10.6084/m9.figshare.19719088.v4
Available download formats: xlsx
Unique identifier
https://doi.org/10.6084/m9.figshare.19719088.v4
Dataset updated
Aug 22, 2022
Dataset provided by
figshare
Authors
Pooja Harde; Pariskhit Kamat; Suraj Suresh; Shubham Kalson; Sarika Jain; Nandana Mihindukulasooriya
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
India
Description
The dataset contains 50 Supreme Court of India decisions annotated for named entity recognition, with the case documents tagged under three different encoding schemes: IOB, IOBES, and BILOU. The dataset is created using the CoNLL-2003 format.
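The three encoding schemes differ only in how they mark entity boundaries, so one can be derived from another. The sketch below converts a tag sequence to BILOU and then to IOBES. It assumes the IOB variant is IOB2 (every entity starts with a `B-` tag), and the entity labels `COURT` and `JUDGE` are invented for the example; neither assumption is confirmed by the dataset description.

```python
from typing import List


def iob2_to_bilou(tags: List[str]) -> List[str]:
    """Convert IOB2 tags to BILOU (Begin, Inside, Last, Outside, Unit)."""
    converted = []
    for i, tag in enumerate(tags):
        if tag == "O":
            converted.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = next_tag == f"I-{label}"  # does the entity go on?
        if prefix == "B":
            converted.append(f"B-{label}" if continues else f"U-{label}")
        elif prefix == "I":
            converted.append(f"I-{label}" if continues else f"L-{label}")
        else:
            raise ValueError(f"not an IOB2 tag: {tag!r}")
    return converted


def bilou_to_iobes(tags: List[str]) -> List[str]:
    """BILOU and IOBES differ only in letters: L becomes E, U becomes S."""
    swap = {"L": "E", "U": "S"}
    return [t if t == "O" else swap.get(t[0], t[0]) + t[1:] for t in tags]


# Hypothetical tag sequence: a three-token entity, then a one-token entity.
iob2 = ["B-COURT", "I-COURT", "I-COURT", "O", "B-JUDGE", "O"]
bilou = iob2_to_bilou(iob2)
iobes = bilou_to_iobes(bilou)
print(bilou)
print(iobes)
```

Since the dataset already ships all three encodings, a converter like this is mainly useful as a consistency check between them.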
Corpus of Law, Academic, and News
abacus.library.ubc.ca
iso, txt
Updated Mar 18, 2022
Cite
Abacus Data Network (2022). Corpus of Law, Academic, and News [Dataset]. https://abacus.library.ubc.ca/dataset.xhtml?persistentId=hdl:11272.1/AB2/VMWYC0
Available download formats: txt (1308), iso (4886528)
Dataset updated
Mar 18, 2022
Dataset provided by
Abacus Data Network
Description
Abstract
Introduction: Corpus of Law, Academic, and News consists of 400 Persian documents divided into three genres: legal, academic, and news. The legal section contains texts from official publications, including the civil penal code, the criminal penal code, and the constitution of the Islamic Republic of Iran. The academic sub-corpus is comprised of published academic abstracts in various disciplinary areas, such as Art and Humanities, Social Sciences, and Natural Sciences. The news sub-corpus was extracted from an archive of ten Iranian news outlets spanning the period 2010-2020.
Data: The document and token counts are as follows: 48 legal documents, 88,170 tokens; 274 academic documents, 85,765 tokens; and 78 news documents, 101,055 tokens. Each document contains metadata in the file's header with information such as specific text type, dates and source, and also contains annotations marking title and body paragraphs. All documents are presented as UTF-8 encoded XML with internal DTDs.
Pretrain-1-Legal-Corpus
huggingface.co
Cite
Syed hasan, Pretrain-1-Legal-Corpus [Dataset]. https://huggingface.co/datasets/Syed-Hasan-8503/Pretrain-1-Legal-Corpus
Authors
Syed hasan
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Syed-Hasan-8503/Pretrain-1-Legal-Corpus dataset hosted on Hugging Face and contributed by the HF Datasets community
Classification of Culture and Process in U.S. Supreme Court Language,...
explore.openaire.eu
data.niaid.nih.gov
Updated Nov 18, 2020
Cite
David S. Tanenhaus; Eric C. Nystrom (2020). Classification of Culture and Process in U.S. Supreme Court Language, 1793-2011 [Dataset]. http://doi.org/10.5281/zenodo.4279858
Unique identifier
https://doi.org/10.5281/zenodo.4279858
Dataset updated
Nov 18, 2020
Authors
David S. Tanenhaus; Eric C. Nystrom
Area covered
United States
Description
In an effort to distantly read U.S. Supreme Court opinions, the authors extracted and classified the noun or noun phrase following every usage of the possessive pronoun "our" in every court opinion from 1793 through the end of the 2011 term. All terms were classified by legal historian Professor David S. Tanenhaus, generally with reference to the original opinion text for context, and were re-checked by Tanenhaus after a period of several months. Thus the classifications represent his qualitative, expert opinion. Each term was classified as representing a "process" oriented usage (e.g. the judicial process), a "culture" or heritage usage, or, in some cases, either "ambiguous" or "unclassified" uses. "Ambiguous" uses are those where classification of the individual uses of each term revealed that no consensus emerged about the generalized meaning of the term (in other words, some were process, and some were culture). For more details about the classification process, please see the data paper. The data is contained in a TSV file containing three fields, with no header. The first field is the number of times this particular follower-term appeared in the entire corpus. The second field is the follower term itself (lowercased), which may be a phrase. The third field is the classification of that term. As of the version 11 update, the file contains 9527 unique follower-terms, representing the terms following "our" in 79,693 uses across the corpus of U.S. Supreme Court opinions from 1793-2011.
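The three-field, headerless TSV layout described above is straightforward to read with the standard library. The sketch below parses such rows and totals the uses per classification. The sample rows, the class label strings, and the names `FollowerTerm`, `read_follower_terms`, and `uses_by_class` are assumptions for illustration; the actual file contents and label spellings should be checked against the data paper.

```python
import csv
import io
from collections import Counter
from typing import Iterable, List, NamedTuple, TextIO


class FollowerTerm(NamedTuple):
    """One row: use count, follower term, classification."""
    count: int
    term: str
    classification: str


def read_follower_terms(handle: TextIO) -> List[FollowerTerm]:
    """Read a headerless, tab-separated file with exactly three fields per row."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for record in reader:
        if not record:  # skip blank lines
            continue
        count, term, classification = record
        rows.append(FollowerTerm(int(count), term, classification))
    return rows


def uses_by_class(rows: Iterable[FollowerTerm]) -> Counter:
    """Sum the use counts for each classification."""
    totals: Counter = Counter()
    for row in rows:
        totals[row.classification] += row.count
    return totals


# Invented sample rows standing in for the real file.
sample = io.StringIO(
    "12\tconstitution\tculture\n"
    "7\tdecision\tprocess\n"
    "3\tcases\tprocess\n"
)
rows = read_follower_terms(sample)
totals = uses_by_class(rows)
print(len(rows), dict(totals))
```

For the real file, replace the `io.StringIO` sample with `open(path, newline="", encoding="utf-8")`, where `path` points at the downloaded TSV.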
French Monolingual legal corpus from Official Journal of France
live.european-language-grid.eu
data.europa.eu
txt
Updated Aug 31, 2022
Cite
(2022). French Monolingual legal corpus from Official Journal of France [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/19397
Available download formats: txt
Dataset updated
Aug 31, 2022
License
Licence Ouverte / Open Licence 2.0: https://www.etalab.gouv.fr/wp-content/uploads/2018/11/open-licence.pdf
License information was derived automatically
Area covered
France, French
Description
French monolingual legal corpus from the Official Journal of France, collected from the https://www.legifrance.gouv.fr/ website.

Corpora of legal text: 92 scholarly articles cite this dataset (per Google Scholar).
