9 datasets found
  1. Caselaw_Access_Project_FAISS_index

    • huggingface.co
    Updated Mar 2, 2024
    Cite
    FreeLaw (2024). Caselaw_Access_Project_FAISS_index [Dataset]. https://huggingface.co/datasets/free-law/Caselaw_Access_Project_FAISS_index
    Dataset updated
    Mar 2, 2024
    Dataset authored and provided by
    FreeLaw
    License

    https://choosealicense.com/licenses/cc0-1.0/

    Description

    The Caselaw Access Project

    In collaboration with Ravel Law, Harvard Law Library digitized over 40 million pages of U.S. court decisions, comprising 6.7 million cases from the last 360 years, into a widely accessible dataset. Access a bulk download of the data through the Caselaw Access Project API (CAPAPI): https://case.law/caselaw/ Find more information about accessing state and federal written court decisions of common law through the bulk data service documentation here:… See the full description on the dataset page: https://huggingface.co/datasets/free-law/Caselaw_Access_Project_FAISS_index.

  2. Open Australian Legal Embeddings

    • kaggle.com
    Updated Nov 15, 2023
    Cite
    Umar Butler (2023). Open Australian Legal Embeddings [Dataset]. https://www.kaggle.com/datasets/umarbutler/open-australian-legal-embeddings
    Dataset updated
    Nov 15, 2023
    Dataset provided by
    Kaggle
    Authors
    Umar Butler
    Area covered
    Australia
    Description

    Open Australian Legal Embeddings ‍⚖️

    The Open Australian Legal Embeddings are the first open-source embeddings of Australian legislative and judicial documents.

    Trained on the largest open database of Australian law, the Open Australian Legal Corpus, the Embeddings consist of roughly 5.2 million 384-dimensional vectors embedded with BAAI/bge-small-en-v1.5.

    The Embeddings open the door to a wide range of possibilities in the field of Australian legal AI, including the development of document classifiers, search engines and chatbots.

    To ensure their accessibility to as wide an audience as possible, the Embeddings are distributed under the same licence as the Open Australian Legal Corpus.

    Usage 👩‍💻

    The code snippet below illustrates how the Embeddings may be loaded and queried via the Hugging Face Datasets Python library:

    ```python
    import itertools
    import sklearn.metrics.pairwise

    from datasets import load_dataset
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer('BAAI/bge-small-en-v1.5')
    instruction = 'Represent this sentence for searching relevant passages: '

    oale = load_dataset('umarbutler/open_australian_legal_embeddings', split='train', streaming=True) # Set streaming to False if you wish to load the entire dataset into memory (unadvised unless you have at least 64 GB of RAM).

    # Sample the first 100,000 embeddings.
    sample = list(itertools.islice(oale, 100000))

    # Embed a query.
    query = model.encode(instruction + 'Who is the Governor-General of Australia?', normalize_embeddings=True)

    # Identify the most similar embedding to the query.
    similarities = sklearn.metrics.pairwise.cosine_similarity([query], [embedding['embedding'] for embedding in sample])
    most_similar_index = similarities.argmax()
    most_similar = sample[most_similar_index]

    # Print the most similar text.
    print(most_similar['text'])
    ```

    To speed up the loading of the Embeddings, you may wish to install orjson.
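
The usage snippet returns only the single best match. As a minimal sketch of how the same cosine-similarity search might be extended to return the top k matches, the following uses only the standard library. The function names `cosine_similarity` and `top_k_similar` and the toy two-dimensional records are hypothetical, introduced here for illustration; the `embedding` and `text` keys mirror those used in the usage snippet.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_similar(query, records, k=3):
    """Return (score, record) pairs for the k most similar records, best first."""
    scored = ((cosine_similarity(query, r['embedding']), i) for i, r in enumerate(records))
    best = heapq.nlargest(k, scored)
    return [(score, records[i]) for score, i in best]


# Toy 2-dimensional records standing in for the 384-dimensional embeddings.
records = [
    {'text': 'a', 'embedding': [1.0, 0.0]},
    {'text': 'b', 'embedding': [0.0, 1.0]},
    {'text': 'c', 'embedding': [0.7, 0.7]},
]
results = top_k_similar([1.0, 0.1], records, k=2)
top_texts = [record['text'] for _, record in results]
print(top_texts)
```

A pure-Python loop like this is only practical for small samples; for the 100,000-embedding sample in the usage snippet, a vectorised routine such as the scikit-learn call shown there is likely the better choice.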

    Structure 🗂️

    The Embeddings are stored in data/embeddings.jsonl, a json lines file where each line is a list of 384 32-bit floating point numbers. Associated metadata is stored in data/metadatas.jsonl and the corresponding texts are located in data/texts.jsonl.

    The metadata fields are the same as those used for the Open Australian Legal Corpus, barring the text field, which was removed, and with the addition of the is_last_chunk key, which is a boolean flag for whether a text is the last chunk of a document (used to detect and remove corrupted documents when creating and updating the Embeddings).
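
A minimal sketch of reading the three files together, using only the standard library. It assumes that the files are line-aligned (line n of each file describes the same chunk), that each line of data/texts.jsonl is a JSON-encoded string, that each line of data/metadatas.jsonl is a JSON object, and that the chunks of a document are stored consecutively. None of these is confirmed beyond what the description above states. The helper names `iter_records` and `iter_complete_documents` and the toy files are hypothetical.

```python
import json
import os
import tempfile


def iter_records(data_dir):
    """Yield (embedding, metadata, text) triples, one per line of the three files."""
    names = ('embeddings.jsonl', 'metadatas.jsonl', 'texts.jsonl')
    files = [open(os.path.join(data_dir, name), encoding='utf-8') for name in names]
    try:
        for emb_line, meta_line, text_line in zip(*files):
            yield json.loads(emb_line), json.loads(meta_line), json.loads(text_line)
    finally:
        for f in files:
            f.close()


def iter_complete_documents(records):
    """Group consecutive chunks into documents, dropping any trailing group
    that never reaches a chunk flagged is_last_chunk (assumed incomplete)."""
    chunks = []
    for _embedding, metadata, text in records:
        chunks.append(text)
        if metadata['is_last_chunk']:
            yield chunks
            chunks = []


# Toy files standing in for the real data directory.
data_dir = tempfile.mkdtemp()
rows = [
    ([0.1, 0.2], {'is_last_chunk': False}, 'first chunk'),
    ([0.3, 0.4], {'is_last_chunk': True}, 'second chunk'),
    ([0.5, 0.6], {'is_last_chunk': False}, 'orphan chunk'),
]
for name, column in (('embeddings.jsonl', 0), ('metadatas.jsonl', 1), ('texts.jsonl', 2)):
    with open(os.path.join(data_dir, name), 'w', encoding='utf-8') as f:
        for row in rows:
            f.write(json.dumps(row[column]) + '\n')

documents = list(iter_complete_documents(iter_records(data_dir)))
print(documents)
```

Streaming the files line by line keeps memory use flat, which matters given the size of the full dataset.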

    Creation 🧪

    All documents in the Open Australian Legal Corpus were split into semantically meaningful chunks of up to 512 tokens (as determined by bge-small-en-v1.5's tokeniser) with the semchunk Python library. Each chunk included a header embedding the document's title, jurisdiction and type in the following format:

        Title: {title}
        Jurisdiction: {jurisdiction}
        Type: {type}
        {text}
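
The header format can be sketched as a pair of small helpers, one to prepend the header before embedding and one to strip it again before the text is saved. This is an illustration only, assuming the header fields are separated by newlines; `add_header`, `strip_header` and the example values are hypothetical and are not taken from the project's own code.

```python
def add_header(title, jurisdiction, doc_type, text):
    """Prefix a chunk with a Title/Jurisdiction/Type header (newline-separated, by assumption)."""
    return f'Title: {title}\nJurisdiction: {jurisdiction}\nType: {doc_type}\n{text}'


def strip_header(chunk):
    """Remove the three header lines, returning only the chunk text."""
    return chunk.split('\n', 3)[3]


# Hypothetical example values, for illustration only.
chunk = add_header('Example Act 1990', 'commonwealth', 'primary_legislation', 'Section 1 ...')
print(chunk)
print(strip_header(chunk))
```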

    The chunks were then vectorised by bge-small-en-v1.5 on a single GeForce RTX 2080 Ti with a batch size of 32 via the SentenceTransformers library.

    The resulting embeddings were serialised as json-encoded lists of floats by orjson and stored in data/embeddings.jsonl. The corresponding metadata and texts (with their headers removed) were saved to data/metadatas.jsonl and data/texts.jsonl, respectively.

    The code used to create and update the Embeddings may be found [here](https://github.com/umarbutler/open-australian-legal-embeddings-...

  3. caselawqa-8k

    • huggingface.co
    Cite
    Ricardo, caselawqa-8k [Dataset]. https://huggingface.co/datasets/ricdomolm/caselawqa-8k
    Authors
    Ricardo
    Description

    CaselawQA is a benchmark comprising legal classification tasks, drawing from the Supreme Court and Songer Court of Appeals legal databases. The majority of its 10,000 questions are multiple-choice, with 5,000 sourced from each database. The questions are randomly selected from the test sets of the Lawma tasks. From a technical machine learning perspective, these tasks provide highly non-trivial classification problems where even the best models leave much room for improvement. From a… See the full description on the dataset page: https://huggingface.co/datasets/ricdomolm/caselawqa-8k.

  4. RFSD

    • huggingface.co
    Cite
    Institute for the Rule of Law at European University at St. Petersburg, RFSD [Dataset]. http://doi.org/10.57967/hf/4574
    Dataset authored and provided by
    Institute for the Rule of Law at European University at St. Petersburg
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Russian Financial Statements Database (RFSD)

    The Russian Financial Statements Database (RFSD) is an open, harmonized collection of annual unconsolidated financial statements of the universe of Russian firms:

    🔓 First open data set with information on every active firm in Russia.

    🗂️ First open financial statements data set that includes non-filing firms.

    🏛️ Sourced from two official data providers: the Rosstat and the Federal Tax Service.

    📅 Covers 2011-2023, will be… See the full description on the dataset page: https://huggingface.co/datasets/irlspbru/RFSD.

  5. Regulations_Link_Retrieval

    • huggingface.co
    Updated Jul 12, 2025
    Cite
    SecureFinAI Lab (2025). Regulations_Link_Retrieval [Dataset]. https://huggingface.co/datasets/SecureFinAI-Lab/Regulations_Link_Retrieval
    Dataset updated
    Jul 12, 2025
    Dataset authored and provided by
    SecureFinAI Lab
    License

    https://choosealicense.com/licenses/cdla-permissive-2.0/

    Description

    Overview

    This question set was created to assess the ability of LLMs to retrieve and provide exact links to specific regulations. It is for the link retrieval task at Regulations Challenge @ COLING 2025. The objective is to evaluate LLMs’ effectiveness in navigating complex legal databases to find and reference the correct documents. Financial product contracts, financial reports, and compliance documents require references or citations to specific legal provisions. Quickly finding… See the full description on the dataset page: https://huggingface.co/datasets/SecureFinAI-Lab/Regulations_Link_Retrieval.

  6. CZE_constitutional_court_decisions

    • huggingface.co
    Updated Apr 22, 2025
    Cite
    Jan Röslein (2025). CZE_constitutional_court_decisions [Dataset]. https://huggingface.co/datasets/roslein/CZE_constitutional_court_decisions
    Dataset updated
    Apr 22, 2025
    Authors
    Jan Röslein
    License

    https://choosealicense.com/licenses/eupl-1.1/

    Description

    Czech Constitutional Court Decisions Dataset

    This dataset contains decisions from the Constitutional Court of the Czech Republic scraped from NALUS, the official database of Constitutional Court decisions.

    Data Usage

    This dataset can be utilized for:

    • Training language models on legal texts
    • Creating synthetic legal datasets
    • Building vector databases for Retrieval Augmented Generation (RAG)
    • Legal text analysis and research
    • NLP tasks focused on Czech legal domain…

    See the full description on the dataset page: https://huggingface.co/datasets/roslein/CZE_constitutional_court_decisions.

  7. CZE_supreme_administrative_court_decisions

    • huggingface.co
    Updated Feb 1, 2017
    Cite
    Jan Röslein (2017). CZE_supreme_administrative_court_decisions [Dataset]. https://huggingface.co/datasets/roslein/CZE_supreme_administrative_court_decisions
    Dataset updated
    Feb 1, 2017
    Authors
    Jan Röslein
    License

    https://choosealicense.com/licenses/eupl-1.1/

    Description

    Czech Supreme Administrative Court Decisions Dataset

    This dataset contains decisions from the Supreme Administrative Court of the Czech Republic scraped from their official search interface.

    Data Usage

    This dataset can be utilized for:

    • Training language models on administrative law texts
    • Creating synthetic legal datasets
    • Building vector databases for Retrieval Augmented Generation (RAG)
    • Administrative law text analysis and research
    • NLP tasks focused on Czech…

    See the full description on the dataset page: https://huggingface.co/datasets/roslein/CZE_supreme_administrative_court_decisions.

  8. CZE_Supreme_Court_Decision

    • huggingface.co
    Updated Dec 31, 2024
    Cite
    Jan Röslein (2024). CZE_Supreme_Court_Decision [Dataset]. https://huggingface.co/datasets/roslein/CZE_Supreme_Court_Decision
    Dataset updated
    Dec 31, 2024
    Authors
    Jan Röslein
    License

    https://choosealicense.com/licenses/eupl-1.1/

    Description

    Czech Supreme Court Decisions Dataset

    This dataset contains decisions from the Supreme Court of the Czech Republic scraped from their official collection database.

    Data Usage

    This dataset is ideal for:

    • Building legal vector databases for RAG (Retrieval Augmented Generation)
    • Training language models on Czech civil and criminal law
    • Creating synthetic legal datasets
    • Legal text analysis and research
    • NLP tasks focused on Czech judicial domain

    Legal Status

    The… See the full description on the dataset page: https://huggingface.co/datasets/roslein/CZE_Supreme_Court_Decision.

  9. Indonesian_Regulation_QA

    • huggingface.co
    Updated Jun 2, 2025
    Cite
    Azzindani (2025). Indonesian_Regulation_QA [Dataset]. https://huggingface.co/datasets/Azzindani/Indonesian_Regulation_QA
    Dataset updated
    Jun 2, 2025
    Authors
    Azzindani
    License

    Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Area covered
    Indonesia
    Description

    📘 Indonesian Regulation QA Dataset (From Public Regulation Sources)

    This dataset consists of automatically generated legal question-answer pairs, where questions are crafted from basic legal inquiries, and answers are mapped directly to parsed Indonesian regulation articles from alternative public legal repositories.

    📌 Dataset Overview

    Source: Public Indonesian regulation databases and portals

    QA Generation:

    Basic legal questions (template-driven or commonly asked)… See the full description on the dataset page: https://huggingface.co/datasets/Azzindani/Indonesian_Regulation_QA.

