9 datasets found
  1. pile-of-law

    • huggingface.co
    • opendatalab.com
    Updated Jul 10, 2022
    Pile of Law (2022). pile-of-law [Dataset]. https://huggingface.co/datasets/pile-of-law/pile-of-law
    Dataset authored and provided by
    Pile of Law
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    We curate a large corpus of legal and administrative data. The utility of this data is twofold: (1) to aggregate legal and administrative data sources that demonstrate different norms and legal standards for data filtering; (2) to collect a dataset that can be used in the future for pretraining legal-domain language models, a key direction in access-to-justice initiatives.

  2. lilac-pile-of-law-r-legaladvice

    • huggingface.co
    Updated Aug 21, 2023
    Lilac AI (2023). lilac-pile-of-law-r-legaladvice [Dataset]. https://huggingface.co/datasets/lilacai/lilac-pile-of-law-r-legaladvice
    Dataset authored and provided by
    Lilac AI
    Description

    This dataset is generated by Lilac for a HuggingFace Space: huggingface.co/spaces/lilacai/lilac. Original dataset: https://huggingface.co/datasets/pile-of-law/pile-of-law Lilac dataset config: - {embedding: gte-small, path: text} name: pile-of-law-r-legaladvice namespace: lilac settings: preferred_embedding: gte-small ui: media_paths: [text] signals: - path: text signal: {signal_name: near_dup} - path: text signal: {signal_name: text_statistics} - path: text signal: {signal_name:… See the full description on the dataset page: https://huggingface.co/datasets/lilacai/lilac-pile-of-law-r-legaladvice.

  3. law-stack-exchange

    • huggingface.co
    Updated Mar 9, 2022
    Jonathan Li (2022). law-stack-exchange [Dataset]. https://huggingface.co/datasets/jonathanli/law-stack-exchange
    Authors
    Jonathan Li
    Description

    Dataset Card for Law Stack Exchange Dataset

      Dataset Summary
    

    Dataset from the Law Stack Exchange, as used in "Parameter-Efficient Legal Domain Adaptation".

      Citation Information
    

    @inproceedings{li-etal-2022-parameter, title = "Parameter-Efficient Legal Domain Adaptation", author = "Li, Jonathan and Bhambhoria, Rohan and Zhu, Xiaodan", booktitle = "Proceedings of the Natural Legal Language Processing Workshop 2022", month = dec… See the full description on the dataset page: https://huggingface.co/datasets/jonathanli/law-stack-exchange.

  4. Multi_Legal_Pile

    • huggingface.co
    Updated Oct 23, 2023
    + more versions
    Joel Niklaus (2023). Multi_Legal_Pile [Dataset]. https://huggingface.co/datasets/joelniklaus/Multi_Legal_Pile
    Authors
    Joel Niklaus
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Multi Legal Pile is a dataset of legal documents in the 24 EU languages.

  5. pile-of-law-bm25

    • huggingface.co
    Updated Mar 11, 2025
    + more versions
    Minhu, Park (2025). pile-of-law-bm25 [Dataset]. https://huggingface.co/datasets/hoorangyee/pile-of-law-bm25
    Authors
    Minhu, Park
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    LRAGE: Legal Retrieval Augmented Generation Evaluation Tool

    LRAGE (Legal Retrieval Augmented Generation Evaluation) is an open-source toolkit designed to evaluate Large Language Models (LLMs) in a Retrieval-Augmented Generation (RAG) setting, specifically tailored for the legal domain.
    This repository facilitates evaluating LLM performance on legal tasks without cumbersome engineering overhead. Code: https://github.com/hoorangyee/LRAGE For more details, please refer to the LRAGE… See the full description on the dataset page: https://huggingface.co/datasets/hoorangyee/pile-of-law-bm25.

  6. pile-of-law-chunked

    • huggingface.co
    Updated Oct 24, 2021
    Minhu, Park (2021). pile-of-law-chunked [Dataset]. https://huggingface.co/datasets/hoorangyee/pile-of-law-chunked
    Authors
    Minhu, Park
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    LRAGE: Legal Retrieval Augmented Generation Evaluation Tool

    LRAGE (Legal Retrieval Augmented Generation Evaluation, pronounced as 'large') is an open-source toolkit designed to evaluate Large Language Models (LLMs) in a Retrieval-Augmented Generation (RAG) setting, specifically tailored for the legal domain. This repository contains pointers to datasets and code used in LRAGE: Legal Retrieval Augmented Generation Evaluation. Code: https://github.com/hoorangyee/LRAGE… See the full description on the dataset page: https://huggingface.co/datasets/hoorangyee/pile-of-law-chunked.

  7. regulations

    • huggingface.co
    + more versions
    Common Pile, regulations [Dataset]. https://huggingface.co/datasets/common-pile/regulations
    Dataset authored and provided by
    Common Pile
    Description

    Regulations.gov is an online platform operated by the U.S. General Services Administration that collates newly proposed rules and regulations from federal agencies along with comments and feedback from the general public. This dataset includes all plain-text regulatory documents published by a variety of U.S. federal agencies on this platform, acquired via the bulk download interface provided by Regulations.gov. These agencies include the Bureau of Industry and… See the full description on the dataset page: https://huggingface.co/datasets/common-pile/regulations.

  8. open-license-corpus

    • huggingface.co
    Updated Oct 6, 2023
    Suchin (2023). open-license-corpus [Dataset]. https://huggingface.co/datasets/kernelmachine/open-license-corpus
    Authors
    Suchin
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    PubText

    Welcome to the Open License Corpus (OLC), a 228B token corpus for training permissively-licensed language models. Disclaimer: OLC should not be considered a universally safe-to-use dataset. We encourage users of OLC to consult a legal professional on the suitability of each data source for their application.

      Dataset Summary
    

    Domain | Sources | Specific License | BPE Tokens (in billions; GPT-NeoX tokenizer)
    Legal | Case Law, Pile of Law (PD subset) | Public…

    See the full description on the dataset page: https://huggingface.co/datasets/kernelmachine/open-license-corpus.

  9. pile-of-law

    • huggingface.co
    ICL Team @SKIML Lab, SNU GSDS, pile-of-law [Dataset]. https://huggingface.co/datasets/SKIML-ICL/pile-of-law
    Dataset authored and provided by
    ICL Team @SKIML Lab, SNU GSDS
    Description

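    pile-of-law/pile-of-law: a large-scale corpus of collected and curated legal and administrative documents. File size and sources: about 256G of open-source, English-language legal and administrative material, including court opinions, contracts, administrative rules, statutes, and exam outlines. Of the 35 subsets, the 8 that are large, important, and relevant to QA were selected for use (about 120G).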

    "courtlistener_opinions": U.S. court opinions from CourtListener (synchronized as of 12/31/2022). "cc_casebooks": Educational Casebooks released under open CC licenses. "exam_outlines": Bar Exam outlines available openly on the web. "uscode": The United States Code (laws). "cfr": U.S. Code of Federal Regulations… See the full description on the dataset page: https://huggingface.co/datasets/SKIML-ICL/pile-of-law.


pile-of-law/pile-of-law: 5 scholarly articles cite this dataset (per Google Scholar).