2 datasets found
  1. PandaLM-testset

    • opendatalab.com
    zip
    Updated Aug 1, 2023
    Cite
    Peking University (2023). PandaLM-testset [Dataset]. https://opendatalab.com/OpenDataLab/PandaLM-testset
    Available download formats: zip
    Dataset updated
    Aug 1, 2023
    Dataset provided by
    Peking University
    Microsoft Research Asia
    Westlake University
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    PandaLM aims to provide reproducible and automated comparisons between different large language models (LLMs). Given the same context, PandaLM can compare the responses of different LLMs and provide a reason for its decision, along with a reference answer. The target audience for PandaLM includes organizations that have confidential data and research labs with limited funds that need reproducible evaluations. These organizations may not want to disclose their data to third parties, or may not be able to afford the high costs, or the confidential-data-leakage risks, of using third-party APIs or hiring human annotators. With PandaLM, they can perform evaluations without compromising data security or incurring high costs, and obtain reproducible results. To demonstrate the reliability and consistency of our tool, we have created a diverse human-annotated test dataset of approximately 1,000 samples, where the contexts and the labels are all created by humans. Our results indicate that PandaLM-7B achieves 93.75% of GPT-3.5's evaluation ability and 88.28% of GPT-4's in terms of F1-score on our test dataset. More papers and features are coming soon.
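
    The relative figures above compare a judge's preference verdicts against the human annotations via F1-score. A minimal sketch of that comparison, assuming the test set has been downloaded locally as JSON; the file name and field names below are illustrative placeholders, not the dataset's actual schema:

      import json
      from sklearn.metrics import f1_score

      # Load a local copy of the PandaLM test set (file name is an assumption;
      # see the dataset page for the real files and schema).
      with open("pandalm_testset.json") as f:
          samples = json.load(f)

      # Field names are placeholders: a human preference label per sample, plus
      # the verdicts produced by the judge being evaluated and by GPT-3.5.
      human = [s["human_label"] for s in samples]
      judge = [s["judge_label"] for s in samples]
      gpt35 = [s["gpt35_label"] for s in samples]

      # Macro F1 of each judge against the human annotations.
      judge_f1 = f1_score(human, judge, average="macro")
      gpt35_f1 = f1_score(human, gpt35, average="macro")

      # "X% of GPT-3.5's evaluation ability" is read here as the ratio of F1 scores.
      print(f"judge F1 = {judge_f1:.4f}, relative to GPT-3.5 = {judge_f1 / gpt35_f1:.2%}")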

  2. bird-critic-1.0-open

    • huggingface.co
    Updated Jan 19, 2025
    + more versions
    Cite
    The BIRD Team (2025). bird-critic-1.0-open [Dataset]. https://huggingface.co/datasets/birdsql/bird-critic-1.0-open
    Dataset updated
    Jan 19, 2025
    Dataset authored and provided by
    The BIRD Team
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    BIRD-CRITIC-1.0-Open

    BIRD-Critic is the first SQL debugging benchmark designed to answer a critical question: Can large language models (LLMs) fix user issues in real-world database applications? Each task in BIRD-CRITIC has been verified by human experts on the following dimensions:

    • Reproduction of errors on the BIRD env to prevent data leakage.
    • Carefully curated test case functions for each task specifically.
    • Soft EX: this metric can evaluate SELECT-ONLY tasks.
    • Soft EX + Parsing: … See the full description on the dataset page: https://huggingface.co/datasets/birdsql/bird-critic-1.0-open.
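
    To inspect the tasks, the dataset can be pulled straight from the Hugging Face Hub. A minimal sketch, assuming the default configuration and a "train" split; the actual split and column names should be checked on the dataset page:

      from datasets import load_dataset

      # Repository id taken from the citation above; the split name is an assumption.
      ds = load_dataset("birdsql/bird-critic-1.0-open", split="train")

      print(ds)     # row count and column names
      print(ds[0])  # one task: user issue, SQL, test cases, etc. (fields vary)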

