Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
SAS-Bench: A Fine-Grained Benchmark for Evaluating Short Answer Scoring with Large Language Models
Dataset | 中文 (Chinese) | Paper | Code
Overview
SAS-Bench is the first specialized benchmark for evaluating Large Language Models (LLMs) on Short Answer Scoring (SAS) tasks. Built on authentic questions from China's National College Entrance Examination (Gaokao), our benchmark offers:
- 1,030 questions spanning 9 academic disciplines
- 4,109 expert-annotated student responses
- Step-wise scoring with step-wise error analysis
- Multi-dimensional evaluation (holistic scoring, step-wise scoring, and error diagnosis consistency)
See the full description on the dataset page: https://huggingface.co/datasets/aleversn/SAS-Bench.