In this paper, we introduce a novel benchmarking framework designed specifically for evaluating data science agents. Our contributions are three-fold. First, we propose DSEval, an evaluation paradigm that enlarges the evaluation scope to the full lifecycle of LLM-based data science agents. It covers aspects including, but not limited to, the quality of the derived analytical solutions or machine learning models, as well as potential side effects such as unintentional changes to the original data. Second, we incorporate a novel bootstrapped annotation process that lets LLMs themselves generate and annotate the benchmarks with a ``human in the loop''. A novel language (i.e., DSEAL) is proposed, and the four derived benchmarks significantly improve benchmark scalability and coverage while largely reducing human labor. Third, based on DSEval and the four benchmarks, we conduct a comprehensive evaluation of various data science agents from different aspects. Our findings reveal common challenges and limitations of current works, providing useful insights and shedding light on future research on LLM-based data science agents.
This is one of the DSEval benchmarks.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
This dataset contains information about high school students and their actual and predicted performance on an exam. Most of the information, including some general information about high school students and their grade for an exam, was based on an already existing dataset, while the predicted exam performance was based on a human experiment. In this experiment, participants were shown short descriptions of the students (based on the information in the original data) and had to rank and grade them according to their expected performance. Prior to this task, some participants were exposed to some "Stereotype Activation", suggesting that boys perform worse in school than girls.
Based on this dataset (which is also available on Kaggle), we extracted a number of student profiles that participants had to make grade predictions for. For more information about this dataset, we refer to the corresponding Kaggle page: https://www.kaggle.com/datasets/uciml/student-alcohol-consumption
Note that we performed some preprocessing on the original data:
The original data consisted of two parts: the information about students following a Maths course and the information about students following a Portuguese course. Since both datasets recorded the same type of information, we merged them and added a column "subject" to show which course each student belongs to.
We excluded all data where G3 = 0 (i.e., the grade for the last exam = 0).
From original_data.csv we randomly sampled 856 students that participants in our study had to make grade predictions for.
index - this column corresponds to the indices in the file "original_data.csv". Through these indices, it is possible to add columns from the original data to the dataset with the grade predictions (see the merge sketch after the column list)
ParticipantID - the ID of the participant who made the performance predictions for the corresponding student. Predictions needed to be made for 856 students, and each participant made 8 predictions total. Thus there are 107 different participant IDs
name - to make the prediction task more engaging for participants, each of the 8 student profiles that participants had to grade and rank was randomly matched to one of four boys' or girls' names (depending on the sex of the student)
sex - the sex of each student, either female (F) or male (M). For benchmarking fair ML algorithms, this can be used as the sensitive attribute. We assume that in the fair version of the decision variable ("Pass"), no sex discrimination occurs. The biased versions of the variable ("Predicted Pass") are mostly discriminatory towards male students.
studytime - this variable is taken from the original dataset and denotes how long a student studied for their exam. In the original data this variable consisted of four levels (less than 2 hours vs. 2-5 hours vs. 5-10 hours vs. more than 10 hours). We binned the latter two levels together and encoded this column numerically from 1-3.
freetime - Originally, this variable ranged from 1 (very low) to 5 (very high). We binned this variable into three categories, where level 1 and 2 are binned, as well as level 4 and 5.
romantic - Binary variable, denoting whether the student is in a romantic relationship or not.
Walc - This variable shows how much alcohol each student consumes in the weekend. Originally it ranged from 1 to 5 (5 corresponding to the highest alcohol consumption), but we binned the last two levels together.
goout - This variable shows how often a student goes out in a week. Originally it ranged from 1 to 5 (5 corresponding to going out very often), but we binned the last two levels together.
Parents_edu - This variable was not present in the original dataset. Instead, the original dataset consisted of two variables "mum_edu" and "dad_edu". We obtained "Parents_edu" by taking the higher of the two. The variable consists of 4 levels, where 4 = highest level of education.
absences - This variable shows the number of absences per student. Originally it ranged from 0 - 93, but because large numbers of absences were infrequent, we binned all absences of >=7 into one level.
reason - The reason why a student chose to go to the school in question. The levels are close to home, school's reputation, course preference, and other
G3 - The actual grade each student received for the final exam of the course, ranging from 0-20.
Pass - A binary variable showing whether G3 is a passing grade (i.e. >=10) or not.
Predicted Grade - The grade the student was predicted to receive in our experiment
Predicted Rank - In our ex...
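A minimal merge sketch showing how the index column can link the two files. The prediction file name and the extra columns pulled in ("famrel", "health") are assumptions for illustration; only "original_data.csv" and the documented columns come from the description above.

```python
import pandas as pd

# original_data.csv is documented above; the prediction file name is assumed.
original = pd.read_csv("original_data.csv")
predictions = pd.read_csv("grade_predictions.csv")

# "index" in the prediction file points back to rows of original_data.csv,
# so further columns from the original data can be joined in directly.
enriched = predictions.merge(
    original[["famrel", "health"]],  # assumed: original columns not yet extracted
    left_on="index", right_index=True, how="left",
)

# Sanity check implied by the description: "Pass" should equal (G3 >= 10).
assert (enriched["Pass"] == (enriched["G3"] >= 10)).all()
```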
https://creativecommons.org/publicdomain/zero/1.0/
Benchmarks allow for easy comparison between multiple RAM kits by scoring their performance on a standardized series of tests, and they are useful in many instances, such as when buying or building a new PC.
Newest data as of May 3rd, 2022. This dataset contains benchmarks of DDR4 memory models.
Data scraped from PassMark.
https://creativecommons.org/publicdomain/zero/1.0/
This repository contains the results of three benchmarks that compare natural language understanding services:
1. Built-in intents (Apple's SiriKit, Amazon's Alexa, Microsoft's Luis, Google's API.ai, and Snips.ai) on a selection of various intents. This benchmark was performed in December 2016. Its results are described at length in the following post.
2. Custom intent engines (Google's API.ai, Facebook's Wit, Microsoft's Luis, Amazon's Alexa, and Snips' NLU) for seven chosen intents. This benchmark was performed in June 2017. Its results are described in a paper and a blog post.
3. An extension of Braun et al., 2017 (Google's API.AI, Microsoft's Luis, IBM's Watson, Rasa). This experiment replicates the analysis made by Braun et al., 2017, published in "Evaluating Natural Language Understanding Services for Conversational Question Answering Systems" as part of the SIGDIAL 2017 proceedings. Snips and Rasa are added. Details are available in a paper and a blog post.
The data is provided for each benchmark and more details about the methods are available in the README file in each folder.
Any publication based on these datasets must include a full citation to the following paper in which the results were published by the Snips Team:
Coucke A. et al., "Snips Voice Platform: an embedded Spoken Language Understanding system for private-by-design voice interfaces", 2018, https://arxiv.org/abs/1805.10190,
accepted for a spotlight presentation at the Privacy in Machine Learning and Artificial Intelligence workshop colocated with ICML 2018.
The Snips team joined Sonos in November 2019. These open datasets remain available, and their access is now managed by the Sonos Voice Experience Team. Please email sve-research@sonos.com with any questions.
https://spdx.org/licenses/CC0-1.0.html
The diamond is 58 times harder than any other mineral in the world, and its elegance as a jewel has long been appreciated. Forecasting diamond prices is challenging due to nonlinearity in important features such as carat, cut, clarity, table, and depth. Against this backdrop, this study conducted a comparative analysis of the performance of multiple supervised machine learning models (regressors and classifiers) in predicting diamond prices. Eight supervised machine learning algorithms were evaluated in this work: Multiple Linear Regression, Linear Discriminant Analysis, eXtreme Gradient Boosting, Random Forest, k-Nearest Neighbors, Support Vector Machines, Boosted Regression and Classification Trees, and Multi-Layer Perceptron. The analysis is based on data preprocessing, exploratory data analysis (EDA), training the aforementioned models, assessing their accuracy, and interpreting their results. Based on the performance metric values and analysis, eXtreme Gradient Boosting was found to be the best-performing algorithm in both classification and regression, with an R2 score of 97.45% and an accuracy of 74.28%. As a result, eXtreme Gradient Boosting was recommended as the optimal regressor and classifier for forecasting the price of a diamond specimen.
Methods
Kaggle, a data repository with thousands of datasets, was used in the investigation. It is an online community for machine learning practitioners and data scientists, as well as a robust, well-researched, and sufficient resource for analyzing various data sources. On Kaggle, users can search for and publish various datasets. In a web-based data-science environment, they can study datasets and construct models.
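A hedged sketch of the regression pipeline described above. The file and column names follow the classic Kaggle diamonds dataset and are assumptions, as are the hyperparameters; this is an illustration, not the study's exact code.

```python
import pandas as pd
from xgboost import XGBRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Encode the ordinal gradings as integer codes: one simple preprocessing choice.
df = pd.read_csv("diamonds.csv")
for col in ["cut", "color", "clarity"]:
    df[col] = df[col].astype("category").cat.codes

X, y = df.drop(columns=["price"]), df["price"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# eXtreme Gradient Boosting, the study's recommended regressor.
model = XGBRegressor(n_estimators=300, learning_rate=0.1, random_state=0)
model.fit(X_tr, y_tr)
print("Test R^2:", r2_score(y_te, model.predict(X_te)))
```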
This dataset was created by Anthony Goldbloom
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Analysis of ‘STUDENTS PERFORMANCE DATASET’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/balavashan/students-performance-dataset on 28 January 2022.
--- Dataset description provided by original source is as follows ---
This dataset contains information about students and their performance, which one can try to predict using simple machine learning algorithms.
It contains specific information about the students' schooling, family issues, personal relationships, Internet access, and so on.
These data were gathered from numerous sources; many thanks to @uci_repository.
Try to find out the issues affecting the students, as they are future personalities, in order to safeguard their schooling and youth.
--- Original source retains full ownership of the source dataset ---
This data set contains 30 million chess positions along with a label that indicates whether the position is not check (0), check (1), or checkmate (2). In addition, we provide 3 reference explanations per data point consisting of 8×8 bit masks that highlight certain squares that are relevant for the decision. For each class, we identified one explanation type that characterizes it most accurately:
- No check (0): All squares that are controlled by the enemy player, i.e., all squares that can be reached or captured on by any enemy piece.
- Check (1): All squares (origin or target) of legal moves. As a checkmate is a check where the player under attack has no more legal moves, highlighting legal moves is sufficient to disprove a checkmate.
- Checkmate (2): All squares with pieces that are essential for creating the checkmate. This includes attackers, friendly pieces blocking the King, enemy pieces guarding escape squares, and enemy pieces protecting attackers.
The data is saved as a CSV file containing the chess positions in Forsyth–Edwards Notation (FEN) and the label (0-2) as columns.
The FEN string can be read by most chess software packages and encodes the current piece setup, whose turn it is and some more game-specific information (castling rights, en-passant squares).
The explanations are saved as 64-bit unsigned integers, which can be converted to SquareSet objects from the chess library.
We provide code for converting between different data and explanation representations.
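For instance, here is a minimal sketch of reading one row and decoding its explanation with the python-chess package; the CSV file and column names ("positions.csv", "fen", "label", "explanation") are assumptions based on the description above.

```python
import chess
import pandas as pd

df = pd.read_csv("positions.csv")  # file name assumed
row = df.iloc[0]

board = chess.Board(row["fen"])    # piece setup, side to move, castling rights, ...
label = int(row["label"])          # 0 = no check, 1 = check, 2 = checkmate

# Each explanation is a 64-bit unsigned integer bit mask over the 8x8 board;
# python-chess interprets such a mask directly as a set of squares.
highlighted = chess.SquareSet(int(row["explanation"]))
print(label, [chess.square_name(sq) for sq in highlighted])
```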
Our data set is based on the Lichess open database, which contains records of over 3 billion games of chess played online by human players on the free chess website Lichess. To read and process the games and to create the explanations, we used the Python package chess. We selected only those games that end in checkmate, excluding those that end by timeout or resignation. We also skip the first ten moves, as they lead to many duplicate positions.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Analysis of ‘Body performance Data’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/kukuroo3/body-performance-data on 28 January 2022.
--- Dataset description provided by original source is as follows ---
This dataset records a performance grade together with age and several exercise performance measurements.
Data shape: (13393, 12)
Source link: Korea Sports Promotion Foundation. Some post-processing and filtering has been done on the raw data.
--- Original source retains full ownership of the source dataset ---
http://opendatacommons.org/licenses/dbcl/1.0/
Clustering benchmark datasets published by School of Computing, University of Eastern Finland
2D scatter points and labels; the file formatting needs to be processed first.
find more in https://cs.joensuu.fi/sipu/datasets/
@misc{ClusteringDatasets,
  author = {Pasi Fränti and others},
  title = {Clustering datasets},
  year = {2015},
  url = {http://cs.uef.fi/sipu/datasets/}
}
With these standard, well-known benchmarks, various clustering algorithms can be run and compared through a number of kernels.
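A minimal comparison sketch under stated assumptions: the file names are placeholders for one of the UEF sets (e.g. the S-sets), and the ground-truth partition files are assumed to have already had their headers stripped, which is the formatting step noted above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

X = np.loadtxt("s1.txt")                         # whitespace-separated 2D points
y_true = np.loadtxt("s1_labels.txt", dtype=int)  # one cluster id per point

# S1 has 15 ground-truth clusters; adjust n_clusters per dataset.
y_pred = KMeans(n_clusters=15, n_init=10, random_state=0).fit_predict(X)
print("Adjusted Rand Index:", adjusted_rand_score(y_true, y_pred))
```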
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This data approaches student achievement in secondary education at two Portuguese schools. The data attributes include student grades, demographic, social, and school-related features, and were collected using school reports and questionnaires. Two datasets are provided regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification and regression tasks. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final year grade (issued in the 3rd period), while G1 and G2 correspond to the 1st and 2nd period grades. It is more difficult to predict G3 without G2 and G1, but such a prediction is much more useful (see the paper source for more details, and the modeling sketch after the column tables below).
Columns | Description |
---|---|
school | student's school (binary: 'GP' - Gabriel Pereira or 'MS' - Mousinho da Silveira) |
sex | student's sex (binary: 'F' - female or 'M' - male) |
age | student's age (numeric: from 15 to 22) |
address | student's home address type (binary: 'U' - urban or 'R' - rural) |
famsize | family size (binary: 'LE3' - less or equal to 3 or 'GT3' - greater than 3) |
Pstatus | parent's cohabitation status (binary: 'T' - living together or 'A' - apart) |
Medu | mother's education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education) |
Fedu | father's education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education) |
Mjob | mother's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other') |
Fjob | father's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other') |
reason | reason to choose this school (nominal: close to 'home', school 'reputation', 'course' preference or 'other') |
guardian | student's guardian (nominal: 'mother', 'father' or 'other') |
traveltime | home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour) |
studytime | weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours) |
failures | number of past class failures (numeric: n if 1<=n<3, else 4) |
schoolsup | extra educational support (binary: yes or no) |
famsup | family educational support (binary: yes or no) |
paid | extra paid classes within the course subject (Math or Portuguese) (binary: yes or no) |
activities | extra-curricular activities (binary: yes or no) |
nursery | attended nursery school (binary: yes or no) |
higher | wants to take higher education (binary: yes or no) |
internet | Internet access at home (binary: yes or no) |
romantic | with a romantic relationship (binary: yes or no) |
famrel | quality of family relationships (numeric: from 1 - very bad to 5 - excellent) |
freetime | free time after school (numeric: from 1 - very low to 5 - very high) |
goout | going out with friends (numeric: from 1 - very low to 5 - very high) |
Dalc | workday alcohol consumption (numeric: from 1 - very low to 5 - very high) |
Walc | weekend alcohol consumption (numeric: from 1 - very low to 5 - very high) |
health | current health status (numeric: from 1 - very bad to 5 - very good) |
absences | number of school absences (numeric: from 0 to 93) |
Grade | Description |
---|---|
G1 | first period grade (numeric: from 0 to 20) |
G2 | second period grade (numeric: from 0 to 20) |
G3 | final grade (numeric: from 0 to 20, output target) |
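A small sketch of the note above about G1/G2, comparing cross-validated prediction of G3 with and without them. The semicolon-separated file name follows the UCI distribution and is an assumption, and the model choice is illustrative.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

df = pd.read_csv("student-mat.csv", sep=";")  # file name assumed
y = df["G3"]

# Predicting G3 is much easier with the correlated period grades G1 and G2.
for dropped in ([], ["G1", "G2"]):
    X = pd.get_dummies(df.drop(columns=["G3"] + dropped))
    r2 = cross_val_score(RandomForestRegressor(random_state=0), X, y, cv=5).mean()
    print(f"dropping {dropped or 'nothing'}: mean CV R^2 = {r2:.3f}")
```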
https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains information on student engagement with Tableau, including quizzes, exams, and lessons. The data includes the course title, the rating of the course, the date the course was rated, the exam category, the exam duration, whether the answer was correct or not, the number of quizzes completed, the number of exams completed, the number of lessons completed, the date engaged, the exam result, and more.
The 'Student Engagement with Tableau' dataset offers insights into student engagement with the Tableau software. The data includes information on courses, exams, quizzes, and student learning.
This dataset can be used to examine how students use Tableau, what kind of engagement leads to better learning outcomes, and whether certain course or exam characteristics are associated with student engagement. Possible analyses include the following (a loading sketch follows the file listings below):
- Creating a heat map of student engagement by course and location
- Determining which courses are most popular among students from different countries
- Identifying patterns in students' exam results
License
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: 365_course_info.csv
| Column name | Description |
|:------------|:------------|
| course_title | The title of the course. (String) |

File: 365_course_ratings.csv
| Column name | Description |
|:------------|:------------|
| course_rating | The rating given to the course by the student. (Numeric) |
| date_rated | The date on which the course was rated. (Date) |

File: 365_exam_info.csv
| Column name | Description |
|:------------|:------------|
| exam_category | The category of the exam. (Categorical) |
| exam_duration | The duration of the exam in minutes. (Numerical) |

File: 365_quiz_info.csv
| Column name | Description |
|:------------|:------------|
| answer_correct | Whether or not the student answered the question correctly. (Boolean) |

File: 365_student_engagement.csv
| Column name | Description |
|:------------|:------------|
| engagement_quizzes | The number of times a student has engaged with quizzes. (Numeric) |
| engagement_exams | The number of times a student has engaged with exams. (Numeric) |
| engagement_lessons | The number of times a student has engaged with lessons. (Numeric) |
| date_engaged | The date of the student's engagement. (Date) |

File: 365_student_exams.csv
| Column name | Description |
|:------------|:------------|
| exam_result | The result of the exam. (Categorical) |
| exam_completion_time | The time it took to complete the exam. (Numerical) |
| date_exam_completed | The date the exam was completed. (Date) |

File: 365_student_hub_questions.csv
| Column name | Description |
|:------------|:------------|
| date_question_asked | The date the question was asked. (Date) |

File: 365_student_info.csv
| Column name | Description |
|:------------|:------------|
| student_country | The country of the student. (Categorical) |
| date_registered | The date the student registered for the course. (Date) |

File: 365_student_learning.csv
| Column name | Description |
|:------------|:------------|
...
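A loading sketch under a loud assumption: the files are joined on a shared student_id key, which is not visible in the truncated column listings above and may differ in the actual files.

```python
import pandas as pd

# student_id is assumed to be the join key across files.
engagement = pd.read_csv("365_student_engagement.csv", parse_dates=["date_engaged"])
info = pd.read_csv("365_student_info.csv", parse_dates=["date_registered"])
merged = engagement.merge(info, on="student_id", how="left")

# Example use case: total engagement per country.
by_country = (merged
              .groupby("student_country")[["engagement_quizzes",
                                           "engagement_exams",
                                           "engagement_lessons"]]
              .sum()
              .sort_values("engagement_lessons", ascending=False))
print(by_country.head())
```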
MIT License: https://opensource.org/licenses/MIT
Relevant links:
* leaderboard: Coming soon
* implementation:
* publication: https://arxiv.org/abs/2504.12516
* original repository: https://github.com/openai/simple-evals/tree/main
Abstract
We present BrowseComp, a simple yet challenging benchmark for measuring the ability of agents to browse the web. BrowseComp comprises 1,266 questions that require persistently navigating the internet in search of hard-to-find, entangled information. Despite the difficulty of the questions, BrowseComp is simple and easy to use, as predicted answers are short and easily verifiable against reference answers. BrowseComp for browsing agents can be seen as analogous to how programming competitions are an incomplete but useful benchmark for coding agents. While BrowseComp sidesteps challenges of a true user query distribution, like generating long answers or resolving ambiguity, it measures the important core capability of exercising persistence and creativity in finding information.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Visual Question Answering (VQA) poses the problem of answering a natural language question about a visual context. Bangla, despite being a widely spoken language, is considered low-resource in the realm of VQA due to the lack of a proper benchmark dataset. The absence of such datasets challenges models that are known to be performant in other languages. Furthermore, existing Bangla VQA datasets offer little cultural relevance and are largely adapted from their foreign counterparts. To address these challenges, we introduce a large-scale Bangla VQA dataset titled ChitroJera, totaling over 15k samples, where diverse and locally relevant data sources are used. We assess the performance of text encoders, image encoders, multimodal models, and our novel dual-encoder models. The experiments reveal that the pretrained dual-encoders outperform other models of their scale. We also evaluate the performance of large language models (LLMs) using prompt-based techniques, with LLMs achieving the best performance. Given the underdeveloped state of existing datasets, we envision ChitroJera expanding the scope of Vision-Language tasks in Bangla.
This dataset was created by Dmitry Sokolov
https://creativecommons.org/publicdomain/zero/1.0/
The M4 competition is a continuation of the Makridakis Competitions for forecasting and was conducted in 2018. This competition includes the prediction of both point forecasts and prediction intervals.
The paper describing the competition and the various benchmarks and approaches was published in a special issue of the International Journal of Forecasting; it is open access and can be found here.
The code for various benchmarks on this dataset can be found at the following GitHub repository.
The data is available at both the GitHub link and the official website of the MOFC.
This dataset was created by KoHanE
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
AbRank is a large-scale benchmark and evaluation framework that reframes affinity prediction as a pairwise ranking problem. It aggregates over 380,000 binding assays from nine heterogeneous sources, spanning diverse antibodies, antigens, and experimental conditions, and introduces standardized data splits that systematically increase distribution shift, from local perturbations such as point mutations to broad generalization across novel antigens and antibodies. To ensure robust supervision, AbRank defines a 10-confident ranking framework by filtering out comparisons with marginal affinity differences, focusing training on pairs with at least a 10-fold difference in measured binding strength.
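A toy sketch of the filtering rule described above; the column names and affinity values here are hypothetical, not the AbRank schema.

```python
import itertools
import pandas as pd

assays = pd.DataFrame({"assay_id": [1, 2, 3],
                       "affinity_kd_nM": [0.5, 80.0, 2.0]})

# Keep only pairs whose measured binding strengths differ by at least 10x,
# so the ranking label is confidently determined.
confident_pairs = [
    (a.assay_id, b.assay_id)
    for a, b in itertools.combinations(assays.itertuples(index=False), 2)
    if max(a.affinity_kd_nM, b.affinity_kd_nM)
       / min(a.affinity_kd_nM, b.affinity_kd_nM) >= 10.0
]
print(confident_pairs)  # [(1, 2), (2, 3)]; the (1, 3) pair is only 4x apart
```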
This dataset was created by Lisa Sharapova
MIT License: https://opensource.org/licenses/MIT
The "ZeroShot LLM4TS Benchmark" dataset is designed to evaluate the performance of large language models (LLMs) in zero-shot time series forecasting tasks. The dataset contains various time series data from different domains, providing a comprehensive benchmark for testing LLM capabilities in forecasting without prior training on the specific data.