14 datasets found
  1. DSEval-Kaggle Dataset

    • paperswithcode.com
    Updated Feb 26, 2024
    Cite
    Yuge Zhang; Qiyang Jiang; Xingyu Han; Nan Chen; Yuqing Yang; Kan Ren (2024). DSEval-Kaggle Dataset [Dataset]. https://paperswithcode.com/dataset/dseval
    2 scholarly articles cite this dataset (View in Google Scholar)
    Authors
    Yuge Zhang; Qiyang Jiang; Xingyu Han; Nan Chen; Yuqing Yang; Kan Ren
    Description

    In this paper, we introduce a novel benchmarking framework designed specifically for evaluating data science agents. Our contributions are three-fold. First, we propose DSEval, an evaluation paradigm that enlarges the evaluation scope to the full lifecycle of LLM-based data science agents, covering aspects including, but not limited to, the quality of the derived analytical solutions or machine learning models, as well as potential side effects such as unintentional changes to the original data. Second, we incorporate a novel bootstrapped annotation process that lets LLMs themselves generate and annotate the benchmarks with a human in the loop. A novel language (DSEAL) has been proposed, and the four derived benchmarks significantly improve benchmark scalability and coverage with greatly reduced human labor. Third, based on DSEval and the four benchmarks, we conduct a comprehensive evaluation of various data science agents from different aspects. Our findings reveal the common challenges and limitations of current works, providing useful insights and shedding light on future research on LLM-based data science agents.

    This is one of the DSEval benchmarks.
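
    The paper's focus on side effects (unintentional changes to the original data) can be made concrete with a minimal sketch. The harness below is a hypothetical illustration, not DSEval's actual API: it snapshots the input DataFrame before the agent runs and verifies afterwards that the source data was left untouched.

```python
# Minimal sketch of a lifecycle-aware evaluation check (hypothetical,
# not DSEval's actual API): detect unintended mutation of the input data.
import pandas as pd

def run_with_side_effect_check(agent_fn, df: pd.DataFrame):
    """Run an agent on df and flag any in-place modification of it."""
    snapshot = df.copy(deep=True)    # state before the agent runs
    answer = agent_fn(df)            # the agent may mutate df by accident
    unchanged = df.equals(snapshot)  # state after: must match the snapshot
    return answer, unchanged

# Example: an "agent" that accidentally drops rows in place.
def careless_agent(df):
    df.dropna(inplace=True)          # unintended side effect
    return df["y"].mean()

frame = pd.DataFrame({"y": [1.0, None, 3.0]})
result, ok = run_with_side_effect_check(careless_agent, frame)
print(result, ok)  # ok is False: the original data was altered
```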

  2. Latency by Model

    • artificialanalysis.ai
    Updated Feb 19, 2025
    Cite
    Artificial Analysis (2025). Latency by Model [Dataset]. https://artificialanalysis.ai/
    Dataset authored and provided by
    Artificial Analysis
    Description

    Comparison of seconds to first token received by model (lower is better)
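
    For context, "seconds to first token" is typically measured by timing a streaming response until the first token arrives. A minimal sketch, where stream_tokens is a hypothetical stand-in for any client that yields tokens as they arrive, not a specific vendor API:

```python
# Minimal sketch of measuring time to first token (TTFT) on a streaming
# response; stream_tokens is a hypothetical stand-in, not a vendor API.
import time

def measure_ttft(stream_tokens, prompt: str) -> float:
    """Seconds from sending the request to receiving the first token."""
    start = time.perf_counter()
    first = next(stream_tokens(prompt), None)  # blocks until first token
    if first is None:
        raise RuntimeError("stream produced no tokens")
    return time.perf_counter() - start

# Toy stream that simulates latency before the first token appears.
def toy_stream(prompt):
    time.sleep(0.25)
    yield from prompt.split()

print(f"TTFT: {measure_ttft(toy_stream, 'hello streaming world'):.2f}s")
```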

  3. NYC Municipal Building Energy Benchmarking Results

    • kaggle.com
    Updated Jan 1, 2021
    Cite
    City of New York (2021). NYC Municipal Building Energy Benchmarking Results [Dataset]. https://www.kaggle.com/new-york-city/nyc-municipal-building-energy-benchmarking-results
    Available download formats: zip (182,030 bytes)
    Dataset authored and provided by
    City of New York
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    New York
    Description

    Content

    This is a list of New York City municipal buildings over 10,000 square feet by borough, block, lot, and agency, identifying each building’s energy intensity (kBtu/sq. ft.), Portfolio Manager benchmarking rating, where available, and the total GHG emissions for the calendar years 2010 - 2013.
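
    As a quick illustration of how a table like this might be explored, the sketch below aggregates energy intensity by borough; the file name and column names are assumptions for illustration and may not match the dataset's actual schema.

```python
# Sketch of exploring a benchmarking table like this one with pandas.
# The file name and column names are assumed for illustration.
import pandas as pd

df = pd.read_csv("nyc_energy_benchmarking.csv")  # hypothetical filename
summary = (
    df.groupby("Borough")["Energy Intensity (kBtu/sq ft)"]
      .agg(["count", "median", "mean"])
      .sort_values("median", ascending=False)
)
print(summary)
```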

    Context

    This is a dataset hosted by the City of New York. The city has an open data platform, and it updates the information there as new data comes in. Explore New York City using Kaggle and all of the data sources available through the City of New York organization page!

    • Update Frequency: This dataset is updated annually.

    Acknowledgements

    This dataset is maintained using Socrata's API and Kaggle's API. Socrata has assisted countless organizations with hosting their open data and has been an integral part of the process of bringing more data to the public.

    Cover photo by Scott Webb on Unsplash
    Unsplash Images are distributed under a unique Unsplash License.

  4. Llama 3.3 70B Output Speed by Provider

    • artificialanalysis.ai
    Updated Feb 19, 2025
    Cite
    Artificial Analysis (2025). Llama 3.3 70B Output Speed by Provider [Dataset]. https://artificialanalysis.ai/
    Dataset authored and provided by
    Artificial Analysis
    Description

    Comparison of output speed (output tokens per second) by provider

  5. Math Index by Model

    • artificialanalysis.ai
    Updated Feb 19, 2025
    Cite
    Artificial Analysis (2025). Math Index by Model [Dataset]. https://artificialanalysis.ai/
    Dataset authored and provided by
    Artificial Analysis
    Description

    Comparison of the average of math benchmarks in the Artificial Analysis Intelligence Index (AIME 2024 & Math-500) by model

  6. Intelligence vs. Output Speed by Model

    • artificialanalysis.ai
    Updated Feb 19, 2025
    Cite
    Artificial Analysis (2025). Intelligence vs. Output Speed by Model [Dataset]. https://artificialanalysis.ai/
    Dataset authored and provided by
    Artificial Analysis
    Description

    Comparison of the Artificial Analysis Intelligence Index vs. output speed (output tokens per second) by model

  7. Coding Index by Model

    • artificialanalysis.ai
    Updated Feb 19, 2025
    Cite
    Artificial Analysis (2025). Coding Index by Model [Dataset]. https://artificialanalysis.ai/
    Dataset authored and provided by
    Artificial Analysis
    Description

    Comparison of the average of coding benchmarks in the Artificial Analysis Intelligence Index (LiveCodeBench & SciCode) by model

  8. Intelligence vs. Price by Model

    • artificialanalysis.ai
    Updated Feb 19, 2025
    Cite
    Artificial Analysis (2025). Intelligence vs. Price by Model [Dataset]. https://artificialanalysis.ai/
    Dataset authored and provided by
    Artificial Analysis
    Description

    Comparison of the Artificial Analysis Intelligence Index vs. price (USD per 1M tokens) by model

  9. MMLU Dataset

    • paperswithcode.com
    Updated Jan 5, 2025
    Cite
    MMLU Dataset [Dataset]. https://paperswithcode.com/dataset/mmlu
    Authors
    Dan Hendrycks; Collin Burns; Steven Basart; Andy Zou; Mantas Mazeika; Dawn Song; Jacob Steinhardt
    Description

    MMLU (Massive Multitask Language Understanding) is a new benchmark designed to measure knowledge acquired during pretraining by evaluating models exclusively in zero-shot and few-shot settings. This makes the benchmark more challenging and more similar to how we evaluate humans. The benchmark covers 57 subjects across STEM, the humanities, the social sciences, and more. It ranges in difficulty from an elementary level to an advanced professional level, and it tests both world knowledge and problem-solving ability. Subjects range from traditional areas, such as mathematics and history, to more specialized areas like law and ethics. The granularity and breadth of the subjects make the benchmark ideal for identifying a model's blind spots.
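
    MMLU items are four-option multiple-choice questions, and few-shot evaluation prepends worked examples to the test question. A minimal sketch of one common prompt layout (the exact template varies across papers; this one is illustrative):

```python
# Sketch of building a few-shot prompt for an MMLU-style multiple-choice
# item. The template is illustrative; exact formats vary across papers.
CHOICE_LABELS = ["A", "B", "C", "D"]

def format_item(question, choices, answer=None):
    lines = [question]
    lines += [f"{lab}. {c}" for lab, c in zip(CHOICE_LABELS, choices)]
    lines.append(f"Answer: {answer}" if answer else "Answer:")
    return "\n".join(lines)

def build_prompt(subject, shots, test_item):
    header = f"The following are multiple choice questions about {subject}.\n"
    demos = "\n\n".join(format_item(*s) for s in shots)
    return header + "\n" + demos + "\n\n" + format_item(*test_item)

shots = [("2 + 2 = ?", ["3", "4", "5", "6"], "B")]
print(build_prompt("mathematics", shots,
                   ("3 * 3 = ?", ["6", "9", "12", "8"], None)))
```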

  10. Output Speed by Model

    • artificialanalysis.ai
    Updated Feb 19, 2025
    Cite
    Artificial Analysis (2025). Output Speed by Model [Dataset]. https://artificialanalysis.ai/
    Dataset authored and provided by
    Artificial Analysis
    Description

    Comparison of output tokens per second by model (higher is better)

  11. Pricing: Input and Output by Model

    • artificialanalysis.ai
    Updated Feb 19, 2025
    Cite
    Artificial Analysis (2025). Pricing: Input and Output by Model [Dataset]. https://artificialanalysis.ai/
    Dataset authored and provided by
    Artificial Analysis
    Description

    Comparison of price (USD per 1M tokens) by model
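
    Per-1M-token prices convert to a request cost with simple arithmetic, as in the worked example below; the prices shown are placeholders, not any provider's actual rates.

```python
# Worked example of turning USD-per-1M-token prices into a request cost.
# The prices used here are placeholders, not real provider rates.
def request_cost(input_tokens, output_tokens,
                 input_price_per_m, output_price_per_m):
    return (input_tokens / 1_000_000) * input_price_per_m \
         + (output_tokens / 1_000_000) * output_price_per_m

# e.g. 3,000 input tokens and 500 output tokens at $0.50 / $1.50 per 1M:
print(f"${request_cost(3_000, 500, 0.50, 1.50):.5f}")  # $0.00225
```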

  12. Llama 3.3 70B (FP8) Pricing: Input and Output by Provider

    • artificialanalysis.ai
    Updated Feb 19, 2025
    Cite
    Artificial Analysis (2025). Llama 3.3 70B (FP8) Pricing: Input and Output by Provider [Dataset]. https://artificialanalysis.ai/
    Dataset authored and provided by
    Artificial Analysis
    Description

    Comparison of price (USD per 1M tokens) by provider (lower is better)

  13. Average by Model

    • artificialanalysis.ai
    Updated Feb 19, 2025
    Cite
    Artificial Analysis (2025). Average by Model [Dataset]. https://artificialanalysis.ai/
    Dataset authored and provided by
    Artificial Analysis
    Description

    Comparison by model

  14. Intelligence Index by Model

    • artificialanalysis.ai
    Updated Feb 19, 2025
    Cite
    Artificial Analysis (2025). Intelligence Index by Model [Dataset]. https://artificialanalysis.ai/
    Dataset authored and provided by
    Artificial Analysis
    Description

    Comparison of the Intelligence Index, which incorporates 7 evaluations spanning reasoning, knowledge, math & coding, by model
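
    A composite index of this kind is an aggregate of per-benchmark scores. The sketch below shows one plausible aggregation; the equal-weight default and the example scores are assumptions, and the actual weighting of the Artificial Analysis Intelligence Index may differ.

```python
# Sketch of a composite index as a (possibly weighted) mean over
# per-benchmark scores. Equal weights and the scores are assumptions;
# the real Intelligence Index weighting may differ.
def composite_index(scores: dict[str, float],
                    weights: dict[str, float] | None = None) -> float:
    weights = weights or {name: 1.0 for name in scores}
    total = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total

scores = {"reasoning": 71.0, "knowledge": 64.5, "math": 80.2, "coding": 58.3}
print(f"index: {composite_index(scores):.1f}")  # unweighted mean of the four
```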

  15. Not seeing a result you expected?
    Learn how you can add new datasets to our index.
