Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
We curate a large corpus of legal and administrative data. The utility of this data is twofold: (1) to aggregate legal and administrative data sources that demonstrate different norms and legal standards for data filtering; (2) to collect a dataset that can be used in the future for pretraining legal-domain language models, a key direction in access-to-justice initiatives.
This dataset is generated by Lilac for a HuggingFace Space: huggingface.co/spaces/lilacai/lilac. Original dataset: https://huggingface.co/datasets/pile-of-law/pile-of-law Lilac dataset config: - {embedding: gte-small, path: text} name: pile-of-law-r-legaladvice namespace: lilac settings: preferred_embedding: gte-small ui: media_paths: [text] signals: - path: text signal: {signal_name: near_dup} - path: text signal: {signal_name: text_statistics} - path: text signal: {signal_name:… See the full description on the dataset page: https://huggingface.co/datasets/lilacai/lilac-pile-of-law-r-legaladvice.
Dataset Card for Law Stack Exchange Dataset
Dataset Summary
Dataset from the Law Stack Exchange, as used in "Parameter-Efficient Legal Domain Adaptation".
Citation Information
@inproceedings{li-etal-2022-parameter, title = "Parameter-Efficient Legal Domain Adaptation", author = "Li, Jonathan and Bhambhoria, Rohan and Zhu, Xiaodan", booktitle = "Proceedings of the Natural Legal Language Processing Workshop 2022", month = dec… See the full description on the dataset page: https://huggingface.co/datasets/jonathanli/law-stack-exchange.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Multi Legal Pile is a dataset of legal documents in the 24 EU languages.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
LRAGE: Legal Retrieval Augmented Generation Evaluation Tool
LRAGE (Legal Retrieval Augmented Generation Evaluation) is an open-source toolkit designed to evaluate Large Language Models (LLMs) in a Retrieval-Augmented Generation (RAG) setting, specifically tailored for the legal domain.
This repository facilitates evaluating LLM performance on legal tasks without cumbersome engineering overhead.
Code: https://github.com/hoorangyee/LRAGE
For more details, please refer to the LRAGE… See the full description on the dataset page: https://huggingface.co/datasets/hoorangyee/pile-of-law-bm25.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
LRAGE: Legal Retrieval Augmented Generation Evaluation Tool
LRAGE (Legal Retrieval Augmented Generation Evaluation, pronounced as 'large') is an open-source toolkit designed to evaluate Large Language Models (LLMs) in a Retrieval-Augmented Generation (RAG) setting, specifically tailored for the legal domain. This repository contains pointers to datasets and code used in LRAGE: Legal Retrieval Augmented Generation Evaluation. Code: https://github.com/hoorangyee/LRAGE… See the full description on the dataset page: https://huggingface.co/datasets/hoorangyee/pile-of-law-chunked.
Description
Regulations.gov is an online platform operated by the U.S. General Services Administration that collates newly proposed rules and regulations from federal agencies along with comments and feedback from the general public. This dataset includes all plain-text regulatory documents published by a variety of U.S. federal agencies on this platform, acquired via the bulk download interface provided by Regulations.gov. These agencies include the Bureau of Industry and… See the full description on the dataset page: https://huggingface.co/datasets/common-pile/regulations.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
PubText
Welcome to the Open License Corpus (OLC), a 228B token corpus for training permissively-licensed language models. Disclaimer: OLC should not be considered a universally safe-to-use dataset. We encourage users of OLC to consult a legal professional on the suitability of each data source for their application.
Dataset Summary
Domain | Sources | Specific License
Legal | Case Law, Pile of Law (PD subset) | Public…
See the full description on the dataset page: https://huggingface.co/datasets/kernelmachine/open-license-corpus.
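pile-of-law/pile-of-law: a corpus that collects and organizes legal and administrative documents at large scale. File size and sources: about 256G of open-source, English-language legal and administrative material, including court opinions, contracts, administrative rules, statutes, and exam outlines. Of the 35 subsets, the 8 that are large, significant, and relevant to QA were selected for use (about 120G).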
"courtlistener_opinions": U.S. court opinions from CourtListener (synchronized as of 12/31/2022). "cc_casebooks": Educational Casebooks released under open CC licenses. "exam_outlines": Bar Exam outlines available openly on the web. "uscode": The United States Code (laws). "cfr": U.S. Code of Federal Regulations… See the full description on the dataset page: https://huggingface.co/datasets/SKIML-ICL/pile-of-law.