In this paper, we introduce a novel benchmarking framework designed specifically for evaluating data science agents. Our contributions are three-fold. First, we propose DSEval, an evaluation paradigm that broadens the evaluation scope to the full lifecycle of LLM-based data science agents. The evaluation covers not only the quality of the derived analytical solutions or machine learning models, but also potential side effects such as unintentional changes to the original data. Second, we introduce a bootstrapped annotation process in which LLMs generate and annotate the benchmarks themselves, with a ``human in the loop''. We also propose a novel language, DSEAL, and derive four benchmarks with it, significantly improving benchmark scalability and coverage while greatly reducing human labor. Third, based on DSEval and the four benchmarks, we conduct a comprehensive evaluation of various data science agents from multiple aspects. Our findings reveal common challenges and limitations of current works, providing useful insights and shedding light on future research on LLM-based data science agents.
This is one of the DSEval benchmarks.
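As an illustration of the kind of side-effect check the abstract alludes to (unintentional changes to the original data), here is a minimal, hypothetical Python sketch; the function names and the fingerprinting approach are illustrative assumptions, not DSEval's actual API.

```python
import hashlib
from pathlib import Path


def file_fingerprint(path: Path) -> str:
    """Return a SHA-256 digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def snapshot(data_dir: Path) -> dict[str, str]:
    """Record a fingerprint for every file under the dataset directory."""
    return {str(p): file_fingerprint(p)
            for p in sorted(data_dir.rglob("*")) if p.is_file()}


def unintended_changes(before: dict[str, str], after: dict[str, str]) -> list[str]:
    """List files that were modified, deleted, or newly created."""
    issues = []
    for name, digest in before.items():
        if name not in after:
            issues.append(f"deleted: {name}")
        elif after[name] != digest:
            issues.append(f"modified: {name}")
    issues.extend(f"created: {name}" for name in after if name not in before)
    return issues


# Hypothetical usage: fingerprint the data, run the agent, then diff.
# before = snapshot(Path("data/"))
# run_agent(task)                # placeholder for the agent under evaluation
# after = snapshot(Path("data/"))
# assert not unintended_changes(before, after), "agent altered the original data"
```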
Comparison of Seconds to First Token Received by Model (lower is better)
https://creativecommons.org/publicdomain/zero/1.0/
This is a list of New York City municipal buildings over 10,000 square feet by borough, block, lot, and agency, identifying each building’s energy intensity (kBtu/sq. ft.), Portfolio Manager benchmarking rating, where available, and the total GHG emissions for the calendar years 2010 - 2013.
This is a dataset hosted by the City of New York. The city maintains an open data platform and updates the information as new data is brought in. Explore New York City using Kaggle and all of the data sources available through the City of New York organization page!
This dataset is maintained using Socrata's API and Kaggle's API. Socrata has assisted countless organizations with hosting their open data and has been an integral part of the process of bringing more data to the public.
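As a rough illustration of pulling such a dataset programmatically, the sketch below queries a Socrata (SODA) JSON endpoint with Python's requests library; the resource identifier is a placeholder, since the actual ID for this dataset is not given here.

```python
import requests

# Hypothetical Socrata (SODA) endpoint; "xxxx-xxxx" is a placeholder for the
# actual resource identifier on data.cityofnewyork.us.
ENDPOINT = "https://data.cityofnewyork.us/resource/xxxx-xxxx.json"

# SODA endpoints return JSON and accept simple query parameters such as $limit.
response = requests.get(ENDPOINT, params={"$limit": 100}, timeout=30)
response.raise_for_status()
rows = response.json()

print(f"Fetched {len(rows)} rows; columns: {sorted(rows[0].keys()) if rows else []}")
```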
Comparison of Output Speed (Output Tokens per Second) by Provider
Comparison of the average of math benchmarks in the Artificial Analysis Intelligence Index (AIME 2024 & Math-500) by Model
Comparison of Artificial Analysis Intelligence Index vs. Output Speed (Output Tokens per Second) by Model
Comparison of the average of coding benchmarks in the Artificial Analysis Intelligence Index (LiveCodeBench & SciCode) by Model
Comparison of Artificial Analysis Intelligence Index vs. Price (USD per 1M Tokens) by Model
MMLU (Massive Multitask Language Understanding) is a new benchmark designed to measure knowledge acquired during pretraining by evaluating models exclusively in zero-shot and few-shot settings. This makes the benchmark more challenging and more similar to how we evaluate humans. The benchmark covers 57 subjects across STEM, the humanities, the social sciences, and more. It ranges in difficulty from an elementary level to an advanced professional level, and it tests both world knowledge and problem-solving ability. Subjects range from traditional areas, such as mathematics and history, to more specialized areas like law and ethics. The granularity and breadth of the subjects make the benchmark ideal for identifying a model's blind spots.
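To make the zero-/few-shot setup concrete, here is a small, self-contained Python sketch of how a few-shot MMLU-style prompt can be assembled; the example questions are invented for illustration, and the formatting is one common convention rather than the official evaluation harness.

```python
# Minimal sketch of a few-shot multiple-choice prompt: k solved examples
# followed by the test question, with the model expected to answer A-D.
LETTERS = "ABCD"


def format_item(question, choices, answer=None):
    lines = [question]
    lines += [f"{LETTERS[i]}. {c}" for i, c in enumerate(choices)]
    lines.append(f"Answer: {LETTERS[answer]}" if answer is not None else "Answer:")
    return "\n".join(lines)


# Illustrative items (not taken from the real MMLU test set).
dev_examples = [
    ("What is 2 + 2?", ["3", "4", "5", "6"], 1),
    ("Which planet is known as the Red Planet?",
     ["Venus", "Mars", "Jupiter", "Saturn"], 1),
]
test_question = ("Which gas do plants primarily absorb for photosynthesis?",
                 ["Oxygen", "Nitrogen", "Carbon dioxide", "Hydrogen"])

prompt = "The following are multiple choice questions (with answers).\n\n"
prompt += "\n\n".join(format_item(q, c, a) for q, c, a in dev_examples)
prompt += "\n\n" + format_item(*test_question)
print(prompt)
```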
Comparison of Output Tokens per Second by Model (higher is better)
Comparison of Price (USD per 1M Tokens) by Model
Comparison of Price (USD per 1M Tokens) by Provider (lower is better)
Comparison of the Artificial Analysis Intelligence Index (incorporating 7 evaluations spanning reasoning, knowledge, math & coding) by Model