https://choosealicense.com/licenses/cc0-1.0/
The Caselaw Access Project
In collaboration with Ravel Law, Harvard Law Library digitized over 40 million pages of U.S. court decisions, comprising 6.7 million cases from the last 360 years, into a widely accessible dataset. Access a bulk download of the data through the Caselaw Access Project API (CAPAPI): https://case.law/caselaw/ Find more information about accessing state and federal written court decisions of common law through the bulk data service documentation here:… See the full description on the dataset page: https://huggingface.co/datasets/free-law/Caselaw_Access_Project_FAISS_index.
The Open Australian Legal Embeddings are the first open-source embeddings of Australian legislative and judicial documents.
Trained on the largest open database of Australian law, the Open Australian Legal Corpus, the Embeddings consist of roughly 5.2 million 384-dimensional vectors embedded with BAAI/bge-small-en-v1.5.
The Embeddings open the door to a wide range of possibilities in the field of Australian legal AI, including the development of document classifiers, search engines and chatbots.
To ensure their accessibility to as wide an audience as possible, the Embeddings are distributed under the same licence as the Open Australian Legal Corpus.
The code snippet below illustrates how the Embeddings may be loaded and queried via the Hugging Face Datasets Python library:
```python
import itertools
import sklearn.metrics.pairwise

from datasets import load_dataset
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('BAAI/bge-small-en-v1.5')
instruction = 'Represent this sentence for searching relevant passages: '

# Set streaming to False if you wish to load the entire dataset into memory
# (unadvised unless you have at least 64 GB of RAM).
oale = load_dataset('umarbutler/open_australian_legal_embeddings', split='train', streaming=True)

sample = list(itertools.islice(oale, 100000))

query = model.encode(instruction + 'Who is the Governor-General of Australia?', normalize_embeddings=True)

similarities = sklearn.metrics.pairwise.cosine_similarity([query], [embedding['embedding'] for embedding in sample])
most_similar_index = similarities.argmax()
most_similar = sample[most_similar_index]

print(most_similar['text'])
```
To speed up the loading of the Embeddings, you may wish to install orjson.
The Embeddings are stored in data/embeddings.jsonl, a JSON Lines file where each line is a list of 384 32-bit floating point numbers. Associated metadata is stored in data/metadatas.jsonl and the corresponding texts are located in data/texts.jsonl.
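The file layout described above can be read back in lockstep. The following is a minimal sketch, assuming local copies of the three files in one directory, that line i of each file describes the same chunk, and that each line of the texts file is a JSON-encoded string. The helper name `iter_records` is hypothetical, and the standard `json` module stands in for orjson.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_records(data_dir: str) -> Iterator[dict]:
    """Yield one dict per chunk by reading the three JSON Lines files together.

    Assumption: the files are line-aligned, so line i of each file
    refers to the same chunk.
    """
    base = Path(data_dir)
    with open(base / 'embeddings.jsonl', encoding='utf-8') as emb_file, \
         open(base / 'metadatas.jsonl', encoding='utf-8') as meta_file, \
         open(base / 'texts.jsonl', encoding='utf-8') as text_file:
        for emb_line, meta_line, text_line in zip(emb_file, meta_file, text_file):
            yield {
                'embedding': json.loads(emb_line),  # list of floats
                'metadata': json.loads(meta_line),  # dict of metadata fields
                'text': json.loads(text_line),      # assumed to be a JSON string
            }
```

Reading lazily this way avoids holding all of the vectors in memory at once; wrap the generator in `itertools.islice` to take a sample.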
The metadata fields are the same as those used for the Open Australian Legal Corpus, barring the text field, which was removed, and with the addition of the is_last_chunk key, which is a boolean flag for whether a text is the last chunk of a document (used to detect and remove corrupted documents when creating and updating the Embeddings).
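The description does not spell out how is_last_chunk is used to detect corrupted documents. One plausible reading, sketched below, is that a document whose chunks never include a flagged last chunk was only partially written and can be dropped. The function names are hypothetical, and the `version_id` key used to group chunks by document is an assumption about the metadata schema rather than something stated here.

```python
from typing import Iterable


def complete_document_ids(metadatas: Iterable[dict], id_key: str = 'version_id') -> set:
    """Return the ids of documents that have a chunk flagged as the last one.

    `id_key` is an assumed per-document identifier in the metadata.
    """
    return {meta[id_key] for meta in metadatas if meta.get('is_last_chunk')}


def drop_incomplete(metadatas: list, id_key: str = 'version_id') -> list:
    """Keep only the chunks of documents that reached their last chunk."""
    complete = complete_document_ids(metadatas, id_key)
    return [meta for meta in metadatas if meta[id_key] in complete]
```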
All documents in the Open Australian Legal Corpus were split into semantically meaningful chunks up to 512 tokens long (as determined by bge-small-en-v1.5's tokeniser) with the semchunk Python library. These chunks included a header embedding documents' titles, jurisdictions and types in the following format:
Title: {title}
Jurisdiction: {jurisdiction}
Type: {type}
{text}
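The header format above can be sketched as a pair of small helpers. This is illustrative only: the function names are hypothetical, and the exact whitespace between the header and the text is assumed to be a single newline, as the format appears here.

```python
def add_header(title: str, jurisdiction: str, doc_type: str, text: str) -> str:
    """Prefix a chunk with the three-line header shown above."""
    return (
        f'Title: {title}\n'
        f'Jurisdiction: {jurisdiction}\n'
        f'Type: {doc_type}\n'
        f'{text}'
    )


def strip_header(chunk: str) -> str:
    """Remove the three header lines, returning only the chunk text."""
    # Split off exactly three header lines; the remainder is the text.
    return chunk.split('\n', 3)[3]


# Made-up example values, not taken from the corpus.
chunk = add_header('Example Act 2000', 'commonwealth', 'primary_legislation', 'Section 1.')
```

Because the texts are stored with their headers removed, a helper like `add_header` would be needed to rebuild the exact string that was embedded.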
The chunks were then vectorised by bge-small-en-v1.5 on a single GeForce RTX 2080 Ti with a batch size of 32 via the SentenceTransformers library.
The resulting embeddings were serialised as JSON-encoded lists of floats by orjson and stored in data/embeddings.jsonl. The corresponding metadata and texts (with their headers removed) were saved to data/metadatas.jsonl and data/texts.jsonl, respectively.
The code used to create and update the Embeddings may be found [here](https://github.com/umarbutler/open-australian-legal-embeddings-...
CaselawQA is a benchmark comprising legal classification tasks, drawing from the Supreme Court and Songer Court of Appeals legal databases. The majority of its 10,000 questions are multiple-choice, with 5,000 sourced from each database. The questions are randomly selected from the test sets of the Lawma tasks. From a technical machine learning perspective, these tasks provide highly non-trivial classification problems where even the best models leave much room for improvement. From a… See the full description on the dataset page: https://huggingface.co/datasets/ricdomolm/caselawqa-8k.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Russian Financial Statements Database (RFSD)
The Russian Financial Statements Database (RFSD) is an open, harmonized collection of annual unconsolidated financial statements of the universe of Russian firms:
🔓 First open data set with information on every active firm in Russia.
🗂️ First open financial statements data set that includes non-filing firms.
🏛️ Sourced from two official data providers: the Rosstat and the Federal Tax Service.
📅 Covers 2011-2023, will be… See the full description on the dataset page: https://huggingface.co/datasets/irlspbru/RFSD.
https://choosealicense.com/licenses/cdla-permissive-2.0/
Overview
This question set was created to assess the ability of LLMs to retrieve and provide exact links to specific regulations. It is for the link retrieval task at Regulations Challenge @ COLING 2025. The objective is to evaluate LLMs' effectiveness in navigating complex legal databases to find and reference the correct documents. Financial product contracts, financial reports, and compliance documents require references or citations to specific legal provisions. Quickly finding… See the full description on the dataset page: https://huggingface.co/datasets/SecureFinAI-Lab/Regulations_Link_Retrieval.
https://choosealicense.com/licenses/eupl-1.1/
Czech Constitutional Court Decisions Dataset
This dataset contains decisions from the Constitutional Court of the Czech Republic scraped from NALUS, the official database of Constitutional Court decisions.
Data Usage
This dataset can be utilized for:
Training language models on legal texts
Creating synthetic legal datasets
Building vector databases for Retrieval Augmented Generation (RAG)
Legal text analysis and research
NLP tasks focused on Czech legal domain…
See the full description on the dataset page: https://huggingface.co/datasets/roslein/CZE_constitutional_court_decisions.
https://choosealicense.com/licenses/eupl-1.1/
Czech Supreme Administrative Court Decisions Dataset
This dataset contains decisions from the Supreme Administrative Court of the Czech Republic scraped from their official search interface.
Data Usage
This dataset can be utilized for:
Training language models on administrative law texts
Creating synthetic legal datasets
Building vector databases for Retrieval Augmented Generation (RAG)
Administrative law text analysis and research
NLP tasks focused on Czech…
See the full description on the dataset page: https://huggingface.co/datasets/roslein/CZE_supreme_administrative_court_decisions.
https://choosealicense.com/licenses/eupl-1.1/
Czech Supreme Court Decisions Dataset
This dataset contains decisions from the Supreme Court of the Czech Republic scraped from their official collection database.
Data Usage
This dataset is ideal for:
Building legal vector databases for RAG (Retrieval Augmented Generation)
Training language models on Czech civil and criminal law
Creating synthetic legal datasets
Legal text analysis and research
NLP tasks focused on Czech judicial domain
Legal Status
The… See the full description on the dataset page: https://huggingface.co/datasets/roslein/CZE_Supreme_Court_Decision.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
📘 Indonesian Regulation QA Dataset (From Public Regulation Sources)
This dataset consists of automatically generated legal question-answer pairs, where questions are crafted from basic legal inquiries, and answers are mapped directly to parsed Indonesian regulation articles from alternative public legal repositories.
📌 Dataset Overview
Source: Public Indonesian regulation databases and portals
QA Generation:
Basic legal questions (template-driven or commonly asked)… See the full description on the dataset page: https://huggingface.co/datasets/Azzindani/Indonesian_Regulation_QA.