9 datasets found

pile-of-law
huggingface.co
opendatalab.com
Updated Jul 10, 2022
Cite
Pile of Law (2022). pile-of-law [Dataset]. https://huggingface.co/datasets/pile-of-law/pile-of-law
Dataset updated
Jul 10, 2022
Dataset authored and provided by
Pile of Law
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
We curate a large corpus of legal and administrative data. The utility of this data is twofold: (1) to aggregate legal and administrative data sources that demonstrate different norms and legal standards for data filtering; (2) to collect a dataset that can be used in the future for pretraining legal-domain language models, a key direction in access-to-justice initiatives.
lilac-pile-of-law-r-legaladvice
huggingface.co
Updated Aug 21, 2023
Cite
Lilac AI (2023). lilac-pile-of-law-r-legaladvice [Dataset]. https://huggingface.co/datasets/lilacai/lilac-pile-of-law-r-legaladvice
Explore at: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Aug 21, 2023
Dataset authored and provided by
Lilac AI
Description
This dataset is generated by Lilac for a HuggingFace Space: huggingface.co/spaces/lilacai/lilac.
Original dataset: https://huggingface.co/datasets/pile-of-law/pile-of-law
Lilac dataset config: - {embedding: gte-small, path: text} name: pile-of-law-r-legaladvice namespace: lilac settings: preferred_embedding: gte-small ui: media_paths: [text] signals: - path: text signal: {signal_name: near_dup} - path: text signal: {signal_name: text_statistics} - path: text signal: {signal_name:…
See the full description on the dataset page: https://huggingface.co/datasets/lilacai/lilac-pile-of-law-r-legaladvice.
law-stack-exchange
huggingface.co
Updated Mar 9, 2022
Cite
Jonathan Li (2022). law-stack-exchange [Dataset]. https://huggingface.co/datasets/jonathanli/law-stack-exchange
Explore at: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Mar 9, 2022
Authors
Jonathan Li
Description
Dataset Card for Law Stack Exchange Dataset

Dataset Summary

Dataset from the Law Stack Exchange, as used in "Parameter-Efficient Legal Domain Adaptation".

Citation Information

@inproceedings{li-etal-2022-parameter,
  title = "Parameter-Efficient Legal Domain Adaptation",
  author = "Li, Jonathan and Bhambhoria, Rohan and Zhu, Xiaodan",
  booktitle = "Proceedings of the Natural Legal Language Processing Workshop 2022",
  month = dec…
See the full description on the dataset page: https://huggingface.co/datasets/jonathanli/law-stack-exchange.
Multi_Legal_Pile
huggingface.co
Updated Oct 23, 2023
Cite
Joel Niklaus (2023). Multi_Legal_Pile [Dataset]. https://huggingface.co/datasets/joelniklaus/Multi_Legal_Pile
Dataset updated
Oct 23, 2023
Authors
Joel Niklaus
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
Multi Legal Pile is a dataset of legal documents in the 24 EU languages.
pile-of-law-bm25
huggingface.co
Updated Mar 11, 2025
Cite
Minhu, Park (2025). pile-of-law-bm25 [Dataset]. https://huggingface.co/datasets/hoorangyee/pile-of-law-bm25
Dataset updated
Mar 11, 2025
Authors
Minhu, Park
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
LRAGE: Legal Retrieval Augmented Generation Evaluation Tool

LRAGE (Legal Retrieval Augmented Generation Evaluation) is an open-source toolkit designed to evaluate Large Language Models (LLMs) in a Retrieval-Augmented Generation (RAG) setting, specifically tailored for the legal domain.
This repository facilitates evaluating LLM performance on legal tasks without cumbersome engineering overhead. Code: https://github.com/hoorangyee/LRAGE For more details, please refer to the LRAGE… See the full description on the dataset page: https://huggingface.co/datasets/hoorangyee/pile-of-law-bm25.
pile-of-law-chunked
huggingface.co
Updated Oct 24, 2021
Cite
Minhu, Park (2021). pile-of-law-chunked [Dataset]. https://huggingface.co/datasets/hoorangyee/pile-of-law-chunked
Dataset updated
Oct 24, 2021
Authors
Minhu, Park
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
LRAGE: Legal Retrieval Augmented Generation Evaluation Tool

LRAGE (Legal Retrieval Augmented Generation Evaluation, pronounced as 'large') is an open-source toolkit designed to evaluate Large Language Models (LLMs) in a Retrieval-Augmented Generation (RAG) setting, specifically tailored for the legal domain. This repository contains pointers to datasets and code used in LRAGE: Legal Retrieval Augmented Generation Evaluation. Code: https://github.com/hoorangyee/LRAGE… See the full description on the dataset page: https://huggingface.co/datasets/hoorangyee/pile-of-law-chunked.
regulations
huggingface.co
Cite
Common Pile, regulations [Dataset]. https://huggingface.co/datasets/common-pile/regulations
Dataset authored and provided by
Common Pile
Description
Regulations.gov is an online platform operated by the U.S. General Services Administration that collates newly proposed rules and regulations from federal agencies along with comments and feedback from the general public. This dataset includes all plain-text regulatory documents published by a variety of U.S. federal agencies on this platform, acquired via the bulk download interface provided by Regulations.gov. These agencies include the Bureau of Industry and… See the full description on the dataset page: https://huggingface.co/datasets/common-pile/regulations.
open-license-corpus
huggingface.co
Updated Oct 6, 2023
Cite
Suchin (2023). open-license-corpus [Dataset]. https://huggingface.co/datasets/kernelmachine/open-license-corpus
Explore at: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Oct 6, 2023
Authors
Suchin
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
PubText

Welcome to the Open License Corpus (OLC), a 228B token corpus for training permissively-licensed language models. Disclaimer: OLC should not be considered a universally safe-to-use dataset. We encourage users of OLC to consult a legal professional on the suitability of each data source for their application.

Dataset Summary

Domain | Sources | Specific License | BPE Tokens (in billions; GPT-NeoX tokenizer)
Legal | Case Law, Pile of Law (PD subset) | Public…

See the full description on the dataset page: https://huggingface.co/datasets/kernelmachine/open-license-corpus.
pile-of-law
huggingface.co
Cite
ICL Team @SKIML Lab, SNU GSDS, pile-of-law [Dataset]. https://huggingface.co/datasets/SKIML-ICL/pile-of-law
Dataset authored and provided by
ICL Team @SKIML Lab, SNU GSDS
Description
pile-of-law/pile-of-law: a large-scale corpus of collected and curated legal and administrative documents.
File size and sources: about 256G of open-source, English-language legal and administrative material, including court opinions, contracts, administrative rules, statutes, and exam outlines.
Of the 35 subsets, the 8 that are large, significant, and relevant to QA were selected for use (about 120G).

"courtlistener_opinions": U.S. court opinions from CourtListener (synchronized as of 12/31/2022).
"cc_casebooks": Educational Casebooks released under open CC licenses.
"exam_outlines": Bar Exam outlines available openly on the web.
"uscode": The United States Code (laws).
"cfr": U.S. Code of Federal Regulations…
See the full description on the dataset page: https://huggingface.co/datasets/SKIML-ICL/pile-of-law.

pile-of-law/pile-of-law: 5 scholarly articles cite this dataset (View in Google Scholar)