9 datasets found

Caselaw_Access_Project_FAISS_index
huggingface.co
Updated Mar 2, 2024
Cite
FreeLaw (2024). Caselaw_Access_Project_FAISS_index [Dataset]. https://huggingface.co/datasets/free-law/Caselaw_Access_Project_FAISS_index
Dataset updated
Mar 2, 2024
Dataset authored and provided by
FreeLaw
License
https://choosealicense.com/licenses/cc0-1.0/
Description
The Caselaw Access Project

In collaboration with Ravel Law, Harvard Law Library digitized over 40 million pages of U.S. court decisions, comprising 6.7 million cases from the last 360 years, into a widely accessible dataset. Access a bulk download of the data through the Caselaw Access Project API (CAPAPI): https://case.law/caselaw/. Find more information about accessing state and federal written court decisions of common law through the bulk data service documentation here:… See the full description on the dataset page: https://huggingface.co/datasets/free-law/Caselaw_Access_Project_FAISS_index.
Open Australian Legal Embeddings
kaggle.com
Updated Nov 15, 2023
Cite
Umar Butler (2023). Open Australian Legal Embeddings [Dataset]. https://www.kaggle.com/datasets/umarbutler/open-australian-legal-embeddings
Dataset updated
Nov 15, 2023
Dataset provided by
Kaggle
Authors
Umar Butler
Area covered
Australia
Description
Open Australian Legal Embeddings ‍⚖️

The Open Australian Legal Embeddings are the first open-source embeddings of Australian legislative and judicial documents.

Trained on the largest open database of Australian law, the Open Australian Legal Corpus, the Embeddings consist of roughly 5.2 million 384-dimensional vectors embedded with BAAI/bge-small-en-v1.5.

The Embeddings open the door to a wide range of possibilities in the field of Australian legal AI, including the development of document classifiers, search engines and chatbots.

To ensure their accessibility to as wide an audience as possible, the Embeddings are distributed under the same licence as the Open Australian Legal Corpus.

Usage 👩‍💻

The below code snippet illustrates how the Embeddings may be loaded and queried via the Hugging Face Datasets Python library:

```python
import itertools
import sklearn.metrics.pairwise

from datasets import load_dataset
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('BAAI/bge-small-en-v1.5')
instruction = 'Represent this sentence for searching relevant passages: '

oale = load_dataset('umarbutler/open_australian_legal_embeddings', split='train', streaming=True) # Set streaming to False if you wish to load the entire dataset into memory (unadvised unless you have at least 64 GB of RAM).

# Sample the first 100,000 embeddings.
sample = list(itertools.islice(oale, 100000))

# Embed a query.
query = model.encode(instruction + 'Who is the Governor-General of Australia?', normalize_embeddings=True)

# Identify the most similar embedding to the query.
similarities = sklearn.metrics.pairwise.cosine_similarity([query], [embedding['embedding'] for embedding in sample])
most_similar_index = similarities.argmax()
most_similar = sample[most_similar_index]

# Print the most similar text.
print(most_similar['text'])
```
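The snippet above returns only the single best match. A minimal sketch of a top-k variant follows, written in pure Python so that it runs without the model or the dataset. The helper names `cosine_similarity` and `top_k` and the toy 3-dimensional records are illustrative assumptions, not part of the Embeddings' documentation; real records would carry 384-dimensional vectors under the `embedding` key, as in the snippet above.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # Treat zero vectors as dissimilar to everything.
    return dot / (norm_a * norm_b)


def top_k(query, records, k=3):
    """Return the k records most similar to the query, best first."""
    return heapq.nlargest(
        k, records, key=lambda record: cosine_similarity(query, record['embedding'])
    )


# Toy records standing in for entries of `sample` (hypothetical data).
toy_records = [
    {'text': 'a', 'embedding': [1.0, 0.0, 0.0]},
    {'text': 'b', 'embedding': [0.0, 1.0, 0.0]},
    {'text': 'c', 'embedding': [0.9, 0.1, 0.0]},
]

best = top_k([1.0, 0.0, 0.0], toy_records, k=2)
print([record['text'] for record in best])
```

For a sample of 100,000 vectors, a vectorised approach such as the `cosine_similarity` call in the original snippet is likely to be much faster; this version only illustrates the ranking step.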

To speed up the loading of the Embeddings, you may wish to install orjson.

Structure 🗂️

The Embeddings are stored in data/embeddings.jsonl, a json lines file where each line is a list of 384 32-bit floating point numbers. Associated metadata is stored in data/metadatas.jsonl and the corresponding texts are located in data/texts.jsonl.
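As a rough illustration of this layout, the three files can be read in lockstep with the standard library. The sketch assumes that line i of each file describes the same chunk and that each line of texts.jsonl is a JSON-encoded string; the `iter_records` helper and the toy rows written to a temporary directory are hypothetical and use 2-dimensional vectors for brevity.

```python
import json
import os
import tempfile

FILENAMES = ('embeddings.jsonl', 'metadatas.jsonl', 'texts.jsonl')


def iter_records(data_dir):
    """Yield one record per line, pairing each embedding with its metadata and text."""
    paths = [os.path.join(data_dir, name) for name in FILENAMES]
    with open(paths[0], encoding='utf-8') as embeddings, \
         open(paths[1], encoding='utf-8') as metadatas, \
         open(paths[2], encoding='utf-8') as texts:
        for e_line, m_line, t_line in zip(embeddings, metadatas, texts):
            yield {
                'embedding': json.loads(e_line),
                'metadata': json.loads(m_line),
                'text': json.loads(t_line),
            }


# Write toy files (hypothetical content) and read them back.
with tempfile.TemporaryDirectory() as data_dir:
    toy = {
        'embeddings.jsonl': [[0.1, 0.2], [0.3, 0.4]],
        'metadatas.jsonl': [{'is_last_chunk': False}, {'is_last_chunk': True}],
        'texts.jsonl': ['first chunk', 'second chunk'],
    }
    for name, rows in toy.items():
        with open(os.path.join(data_dir, name), 'w', encoding='utf-8') as f:
            for row in rows:
                f.write(json.dumps(row) + '\n')
    records = list(iter_records(data_dir))

print(records[1]['text'])
```

Because the generator reads line by line, it avoids holding all three files in memory at once.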

The metadata fields are the same as those used for the Open Australian Legal Corpus, barring the text field, which was removed, and with the addition of the is_last_chunk key, which is a boolean flag for whether a text is the last chunk of a document (used to detect and remove corrupted documents when creating and updating the Embeddings).
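One way such a flag could be used is sketched below; this is not the author's actual procedure. It assumes that the chunks of a document are stored consecutively and in order, and that each metadata entry carries a document identifier. The key name `version_id` and the function `complete_chunk_indices` are assumptions made for illustration.

```python
def complete_chunk_indices(metadatas, id_key='version_id'):
    """Return indices of chunks belonging to documents whose last chunk is present.

    Assumes chunks of one document are consecutive and ordered (an assumption,
    not something the dataset description guarantees).
    """
    kept = []
    pending = []       # Indices of the document currently being read.
    pending_id = None  # Identifier of that document.
    for index, metadata in enumerate(metadatas):
        doc_id = metadata[id_key]
        if pending and doc_id != pending_id:
            # The previous document ended without a last chunk: discard it.
            pending = []
        pending_id = doc_id
        pending.append(index)
        if metadata['is_last_chunk']:
            kept.extend(pending)
            pending = []
    # Anything still pending at the end never reached its last chunk.
    return kept


# Toy metadata (hypothetical): document 'b' is missing its final chunk.
toy_metadatas = [
    {'version_id': 'a', 'is_last_chunk': False},
    {'version_id': 'a', 'is_last_chunk': True},
    {'version_id': 'b', 'is_last_chunk': False},
    {'version_id': 'c', 'is_last_chunk': False},
    {'version_id': 'c', 'is_last_chunk': True},
]

kept = complete_chunk_indices(toy_metadatas)
print(kept)
```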

Creation 🧪

All documents in the Open Australian Legal Corpus were split into semantically meaningful chunks up to 512-tokens-long (as determined by bge-small-en-v1.5's tokeniser) with the semchunk Python library. These chunks included a header embedding documents' titles, jurisdictions and types in the following format:

Title: {title}
Jurisdiction: {jurisdiction}
Type: {type}
{text}
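A minimal sketch of building and stripping such a header is shown below. It assumes the four parts are separated by single newlines and that titles, jurisdictions and types contain no newlines themselves; the function names `add_header` and `strip_header` and the sample values are hypothetical.

```python
def add_header(title, jurisdiction, doc_type, text):
    """Prefix a chunk with its title, jurisdiction and type."""
    return f'Title: {title}\nJurisdiction: {jurisdiction}\nType: {doc_type}\n{text}'


def strip_header(chunk):
    """Remove the three header lines, returning the original chunk text."""
    # Split at most three times so newlines inside the text are preserved.
    return chunk.split('\n', 3)[3]


# Hypothetical sample values, for illustration only.
chunk = add_header(
    'Example Act 1900',
    'commonwealth',
    'primary_legislation',
    'Section 1.\nThis Act may be cited as the Example Act 1900.',
)
print(chunk)
```

Stripping the header recovers the chunk text, which mirrors how the texts are described as being stored with their headers removed.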

The chunks were then vectorised by bge-small-en-v1.5 on a single GeForce RTX 2080 Ti with a batch size of 32 via the SentenceTransformers library.

The resulting embeddings were serialised as json-encoded lists of floats by orjson and stored in data/embeddings.jsonl. The corresponding metadata and texts (with their headers removed) were saved to data/metadatas.jsonl and data/texts.jsonl, respectively.

The code used to create and update the Embeddings may be found [here](https://github.com/umarbutler/open-australian-legal-embeddings-...
caselawqa-8k
huggingface.co
Cite
Ricardo, caselawqa-8k [Dataset]. https://huggingface.co/datasets/ricdomolm/caselawqa-8k
Authors
Ricardo
Description
CaselawQA is a benchmark comprising legal classification tasks, drawing from the Supreme Court and Songer Court of Appeals legal databases. The majority of its 10,000 questions are multiple-choice, with 5,000 sourced from each database. The questions are randomly selected from the test sets of the Lawma tasks. From a technical machine learning perspective, these tasks provide highly non-trivial classification problems where even the best models leave much room for improvement. From a… See the full description on the dataset page: https://huggingface.co/datasets/ricdomolm/caselawqa-8k.
RFSD
huggingface.co
Cite
Institute for the Rule of Law at European University at St. Petersburg, RFSD [Dataset]. http://doi.org/10.57967/hf/4574
Unique identifier
https://doi.org/10.57967/hf/4574
Dataset authored and provided by
Institute for the Rule of Law at European University at St. Petersburg
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The Russian Financial Statements Database (RFSD)

The Russian Financial Statements Database (RFSD) is an open, harmonized collection of annual unconsolidated financial statements of the universe of Russian firms:

🔓 First open data set with information on every active firm in Russia.

🗂️ First open financial statements data set that includes non-filing firms.

🏛️ Sourced from two official data providers: the Rosstat and the Federal Tax Service.

📅 Covers 2011-2023, will be… See the full description on the dataset page: https://huggingface.co/datasets/irlspbru/RFSD.
Regulations_Link_Retrieval
huggingface.co
Updated Jul 12, 2025
Cite
SecureFinAI Lab (2025). Regulations_Link_Retrieval [Dataset]. https://huggingface.co/datasets/SecureFinAI-Lab/Regulations_Link_Retrieval
Dataset updated
Jul 12, 2025
Dataset authored and provided by
SecureFinAI Lab
License
https://choosealicense.com/licenses/cdla-permissive-2.0/
Description
Overview

This question set was created to assess the ability of LLMs to retrieve and provide exact links to specific regulations. It is for the link retrieval task at Regulations Challenge @ COLING 2025. The objective is to evaluate LLMs' effectiveness in navigating complex legal databases to find and reference the correct documents. Financial product contracts, financial reports, and compliance documents require references or citations to specific legal provisions. Quickly finding… See the full description on the dataset page: https://huggingface.co/datasets/SecureFinAI-Lab/Regulations_Link_Retrieval.
CZE_constitutional_court_decisions
huggingface.co
Updated Apr 22, 2025
Cite
Jan Röslein (2025). CZE_constitutional_court_decisions [Dataset]. https://huggingface.co/datasets/roslein/CZE_constitutional_court_decisions
Dataset updated
Apr 22, 2025
Authors
Jan Röslein
License
https://choosealicense.com/licenses/eupl-1.1/
Description
Czech Constitutional Court Decisions Dataset

This dataset contains decisions from the Constitutional Court of the Czech Republic scraped from NALUS, the official database of Constitutional Court decisions.

Data Usage

This dataset can be utilized for:

- Training language models on legal texts
- Creating synthetic legal datasets
- Building vector databases for Retrieval Augmented Generation (RAG)
- Legal text analysis and research
- NLP tasks focused on Czech legal domain…

See the full description on the dataset page: https://huggingface.co/datasets/roslein/CZE_constitutional_court_decisions.
CZE_supreme_administrative_court_decisions
huggingface.co
Updated Feb 1, 2017
Cite
Jan Röslein (2017). CZE_supreme_administrative_court_decisions [Dataset]. https://huggingface.co/datasets/roslein/CZE_supreme_administrative_court_decisions
Dataset updated
Feb 1, 2017
Authors
Jan Röslein
License
https://choosealicense.com/licenses/eupl-1.1/
Description
Czech Supreme Administrative Court Decisions Dataset

This dataset contains decisions from the Supreme Administrative Court of the Czech Republic scraped from their official search interface.

Data Usage

This dataset can be utilized for:

- Training language models on administrative law texts
- Creating synthetic legal datasets
- Building vector databases for Retrieval Augmented Generation (RAG)
- Administrative law text analysis and research
- NLP tasks focused on Czech…

See the full description on the dataset page: https://huggingface.co/datasets/roslein/CZE_supreme_administrative_court_decisions.
CZE_Supreme_Court_Decision
huggingface.co
Updated Dec 31, 2024
Cite
Jan Röslein (2024). CZE_Supreme_Court_Decision [Dataset]. https://huggingface.co/datasets/roslein/CZE_Supreme_Court_Decision
Dataset updated
Dec 31, 2024
Authors
Jan Röslein
License
https://choosealicense.com/licenses/eupl-1.1/
Description
Czech Supreme Court Decisions Dataset

This dataset contains decisions from the Supreme Court of the Czech Republic scraped from their official collection database.

Data Usage

This dataset is ideal for:

- Building legal vector databases for RAG (Retrieval Augmented Generation)
- Training language models on Czech civil and criminal law
- Creating synthetic legal datasets
- Legal text analysis and research
- NLP tasks focused on Czech judicial domain

Legal Status

The… See the full description on the dataset page: https://huggingface.co/datasets/roslein/CZE_Supreme_Court_Decision.
Indonesian_Regulation_QA
huggingface.co
Updated Jun 2, 2025
Cite
Azzindani (2025). Indonesian_Regulation_QA [Dataset]. https://huggingface.co/datasets/Azzindani/Indonesian_Regulation_QA
Dataset updated
Jun 2, 2025
Authors
Azzindani
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Area covered
Indonesia
Description
📘 Indonesian Regulation QA Dataset (From Public Regulation Sources)

This dataset consists of automatically generated legal question-answer pairs, where questions are crafted from basic legal inquiries, and answers are mapped directly to parsed Indonesian regulation articles from alternative public legal repositories.

📌 Dataset Overview

Source: Public Indonesian regulation databases and portals

QA Generation:

Basic legal questions (template-driven or commonly asked)… See the full description on the dataset page: https://huggingface.co/datasets/Azzindani/Indonesian_Regulation_QA.