Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
C-Eval is a comprehensive Chinese evaluation suite for foundation models. It consists of 13,948 multiple-choice questions spanning 52 diverse disciplines and four difficulty levels. Please visit our website and GitHub or check our paper for more details. Each subject consists of three splits: dev, val, and test. The dev set per subject consists of five exemplars with explanations for few-shot evaluation. The val set is intended to be used for hyperparameter tuning. And the test set is for model… See the full description on the dataset page: https://huggingface.co/datasets/ceval/ceval-exam.
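As a quick orientation, a minimal sketch of loading one subject with the Hugging Face datasets library and inspecting its splits (the subject "computer_network" is just an example):
from datasets import load_dataset

# Each C-Eval subject is a separate configuration with dev/val/test splits.
dataset = load_dataset("ceval/ceval-exam", name="computer_network")

print(dataset)            # sizes of the dev, val, and test splits
print(dataset["dev"][0])  # one exemplar: question, options A-D, answer, explanation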
Adds a 'choices' column to the original dataset.
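A hedged sketch of how such a column could be derived from the original A/B/C/D option fields (the exact construction used for this derived dataset is not documented here, so treat this as illustrative only):
from datasets import load_dataset

base = load_dataset("ceval/ceval-exam", name="computer_network")

def add_choices(example):
    # Gather the four option columns into a single list-valued 'choices' field.
    example["choices"] = [example["A"], example["B"], example["C"], example["D"]]
    return example

with_choices = base.map(add_choices)
print(with_choices["val"][0]["choices"])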
Citation
If you use the C-Eval benchmark or the code in your research, please cite their paper: @article{huang2023ceval, title={C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models}, author={Huang, Yuzhen and Bai, Yuzhuo and Zhu, Zhihao and Zhang, Junlei and Zhang, Jinghan and Su, Tangjun and Liu, Junteng and Lv, Chuancheng and Zhang, Yikai and Lei, Jiayi and Fu, Yao and Sun, Maosong and He, Junxian}… See the full description on the dataset page: https://huggingface.co/datasets/zacharyxxxxcr/ceval-exam.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset was created by GinRawin
Released under Apache 2.0
https://choosealicense.com/licenses/cc/
Dataset Card for "ceval-exam-zhtw"
C-Eval is a comprehensive Chinese evaluation suite for foundation models. It consists of 13,948 multiple-choice questions covering 52 different disciplines and four difficulty levels. See the original website and GitHub, or the paper, for more details. C-Eval is written mainly in Simplified Chinese and was designed to evaluate Simplified-Chinese LLMs; this dataset uses OpenCC to convert the text from Simplified to Traditional Chinese, mainly to make development and evaluation of Traditional-Chinese LLMs easier.
Download
Load the dataset directly with Hugging Face datasets:
from datasets import load_dataset

dataset = load_dataset("erhwenkuo/ceval-exam-zhtw", name="computer_network")
print(dataset['val'][0])
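The Simplified-to-Traditional conversion described above can be reproduced with OpenCC; a rough sketch, assuming the opencc Python package and the "s2twp" conversion profile (both assumptions, since the card does not state which profile was used):
from datasets import load_dataset
from opencc import OpenCC  # e.g. pip install opencc-python-reimplemented (assumed package)

cc = OpenCC("s2twp")  # Simplified -> Traditional Chinese (Taiwan standard with common phrases)

def to_traditional(example):
    # Convert the text fields of one C-Eval record; the answer letter needs no conversion.
    for key in ("question", "A", "B", "C", "D", "explanation"):
        example[key] = cc.convert(example[key])
    return example

simplified = load_dataset("ceval/ceval-exam", name="computer_network")
traditional = simplified.map(to_traditional)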
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
A leaderboard of the latest large language model (LLM) performance on the C-Eval benchmark, including each model's score, releasing organization, release date, and other data.
liangzid/robench-eval-Time4-c dataset hosted on Hugging Face and contributed by the HF Datasets community
HumanEval-X is a benchmark for evaluating the multilingual ability of code generation models. It consists of 820 high-quality, human-crafted data samples (each with test cases) in Python, C++, Java, JavaScript, and Go, and can be used for various tasks such as code generation and translation.
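A minimal sketch of pulling the per-language subsets with the datasets library, assuming the THUDM/humaneval-x hosting and its per-language configuration names (an assumption on my part):
from datasets import load_dataset

# 820 samples total across the five languages, i.e. 164 problems per language.
for lang in ["python", "cpp", "java", "js", "go"]:
    subset = load_dataset("THUDM/humaneval-x", name=lang, split="test")
    print(lang, len(subset))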
Subscribers can find export and import data for 23 countries by HS code or product name. This demo is helpful for market analysis.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
SV‑TrustEval‑C 🚨🔒
🔍 Overview
SV‑TrustEval‑C is the first reasoning‑based benchmark designed to rigorously evaluate Large Language Models (LLMs) on both structure (control/data flow) and semantic reasoning for vulnerability analysis in C source code. Unlike existing benchmarks that focus solely on pattern recognition, SV‑TrustEval‑C measures logical consistency, adaptability to code transformations, and real‑world security reasoning across six core tasks. Our… See the full description on the dataset page: https://huggingface.co/datasets/LLMs4CodeSecurity/SV-TrustEval-C-1.0.
CodeFuseEval is a code generation benchmark that combines the multi-tasking scenarios of the CodeFuse model with the HumanEval-X and MBPP benchmarks. It is designed to evaluate the performance of models on various tasks, including code completion, code generation from natural language, test case generation, cross-language code translation, and code generation from Chinese commands, among others.
The evaluation of the generated codes involves compiling and running in multiple programming languages. The versions of the programming language environments and packages we use are as follows:
| Dependency | Version |
| --- | --- |
| Python | 3.10.9 |
| JDK | 18.0.2.1 |
| Node.js | 16.14.0 |
| js-md5 | 0.7.3 |
| C++ | 11 |
| g++ | 7.5.0 |
| Boost | 1.75.0 |
| OpenSSL | 3.0.0 |
| go | 1.18.4 |
| cargo | 1.71.1 |
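As an illustration of the compile-and-run step described above (a hedged sketch, not the benchmark's own harness), a generated C++ candidate could be compiled with the g++/C++11 toolchain from the table and executed in a scratch directory:
import os
import subprocess
import tempfile

def run_cpp_candidate(source_code: str, timeout_s: int = 30) -> bool:
    """Compile a generated C++ candidate with g++ -std=c++11 and run the binary."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "candidate.cpp")
        binary = os.path.join(tmp, "candidate")
        with open(src, "w") as f:
            f.write(source_code)
        compiled = subprocess.run(["g++", "-std=c++11", src, "-o", binary],
                                  capture_output=True, timeout=timeout_s)
        if compiled.returncode != 0:
            return False  # a compilation error counts as a failed sample
        ran = subprocess.run([binary], capture_output=True, timeout=timeout_s)
        return ran.returncode == 0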
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
This dataset is about: (Appendix C) Carbonate, organic carbon, and Rock-Eval pyrolysis at DSDP Hole 77-535. Please consult parent dataset @ https://doi.org/10.1594/PANGAEA.809136 for more information.
RedEval is a safety evaluation benchmark designed to assess the robustness of large language models (LLMs) against harmful prompts. It simulates and evaluates LLM applications across various scenarios, all while eliminating the need for human intervention. Here are the key aspects of RedEval:
Purpose: RedEval aims to evaluate LLM safety using a technique called Chain of Utterances (CoU)-based prompts. CoU prompts are effective at breaking the safety guardrails of various LLMs, including GPT-4, ChatGPT, and open-source models.
Safety Assessment: RedEval provides simple scripts to evaluate both closed-source systems (such as ChatGPT and GPT-4) and open-source LLMs on its benchmark. The evaluation focuses on harmful questions and computes the Attack Success Rate (ASR).
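The Attack Success Rate is simply the fraction of harmful prompts whose responses are judged unsafe; a minimal sketch, where judge_is_unsafe is a hypothetical stand-in for the repository's LLM-as-a-judge step:
def attack_success_rate(responses, judge_is_unsafe):
    # ASR = (# responses judged harmful) / (# harmful prompts attempted)
    unsafe = sum(1 for response in responses if judge_is_unsafe(response))
    return unsafe / len(responses) if responses else 0.0

# Hypothetical example: if 37 of 200 responses were judged unsafe, ASR would be 0.185.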
Question Banks:
HarmfulQA: Consists of 1,960 harmful questions covering 10 topics and approximately 10 subtopics each.
DangerousQA: Contains 200 harmful questions across 6 adjectives: racist, stereotypical, sexist, illegal, toxic, and harmful.
CategoricalQA: Includes 11 categories of harm, each with 5 sub-categories, available in English, Chinese, and Vietnamese.
AdversarialQA: Provides a set of 500 instructions to tease out harmful behaviors from the model.
Safety Alignment: RedEval also offers code to perform safety alignment of LLMs. For instance, it aligns Vicuna-7B on HarmfulQA, resulting in a safer version of Vicuna that is more robust against the harmful prompts in RedEval.
Installation:
Create a conda environment: conda create --name redeval -c conda-forge python=3.11
Activate the environment: conda activate redeval
Install required packages: pip install -r requirements.txt
Store API keys in the api_keys directory for use by the LLM-as-a-judge and the generate_responses.py script for closed-source models.
Prompt Templates:
Choose a prompt template for red-teaming:
Chain of Utterances (CoU): effective at breaking safety guardrails.
Chain of Thoughts (CoT)
Standard prompt
Suffix prompt
Note: Different LLMs may require slight variations in the prompt template.
How to Perform Red-Teaming:
Step 0: Decide on the prompt template.
Step 1: Generate model outputs on harmful questions by providing a path to the question bank and the red-teaming prompt.
RepoEval is a benchmark specifically designed for evaluating repository-level code auto-completion systems. While existing benchmarks mainly focus on single-file tasks, RepoEval addresses the assessment gap for more complex, real-world, multi-file programming scenarios. Here are the key details about RepoEval:
Tasks:
RepoBench-R (Retrieval): Measures the system's ability to retrieve the most relevant code snippets from other files as cross-file context (a naive baseline is sketched after this list).
RepoBench-C (Code Completion): Evaluates the system's capability to predict the next line of code with cross-file and in-file context.
RepoBench-P (Pipeline): Handles complex tasks that require a combination of both retrieval and next-line prediction¹.
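For RepoBench-R, a deliberately naive lexical baseline (not the benchmark's own retriever) illustrates what retrieving cross-file context means in practice:
def jaccard(a: str, b: str) -> float:
    # Token-set overlap between two code fragments.
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def retrieve_cross_file_context(in_file_context: str, candidate_snippets: list[str], top_k: int = 3) -> list[str]:
    """Rank snippets from other files by lexical similarity to the current file's context."""
    ranked = sorted(candidate_snippets,
                    key=lambda snippet: jaccard(in_file_context, snippet),
                    reverse=True)
    return ranked[:top_k]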
Languages Supported:
RepoEval supports both Python and Java¹.
Purpose:
RepoEval aims to facilitate a more complete comparison of performance and encourage continuous improvement in auto-completion systems¹.
Availability:
RepoEval is publicly available for use; see the GitHub repository in the references below¹.
In summary, RepoEval provides a comprehensive evaluation framework for assessing the effectiveness of repository-level code auto-completion systems, enabling researchers and developers to enhance code productivity and quality.
(1) [2306.03091] RepoBench: Benchmarking Repository-Level Code Auto…. https://arxiv.org/abs/2306.03091
(2) [2303.12570] RepoCoder: Repository-Level Code Completion Through…. https://arxiv.org/abs/2303.12570
(3) [2306.03091] RepoBench: Benchmarking Repository-Level Code Auto…. https://ar5iv.labs.arxiv.org/html/2306.03091
(4) GitHub - Leolty/repobench: RepoBench: Benchmarking Repository-Level…. https://github.com/Leolty/repobench
(5) https://doi.org/10.48550/arXiv.2306.03091
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
ZhangRC/chinese-multi-choice-ceval-validation-glm4-explanation dataset hosted on Hugging Face and contributed by the HF Datasets community
Dataset Card for AutoTrain Evaluator
This repository contains model predictions generated by AutoTrain for the following task and dataset:
Task: Summarization
Model: pszemraj/pegasus-x-large-book-summary-C-r2
Dataset: kmfoda/booksum
Config: kmfoda--booksum
Split: test
To run new evaluation jobs, visit Hugging Face's automatic model evaluator.
Contributions
Thanks to @pszemraj for evaluating this model.
thethinkmachine/GPT4-Mixtral-GSM8K-MMLU-Preference-16K-Eval-Complexity dataset hosted on Hugging Face and contributed by the HF Datasets community
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for "agieval-lsat-lr"
Dataset taken from https://github.com/microsoft/AGIEval and processed as in that repo. Raw dataset: https://github.com/zhongwanjun/AR-LSAT MIT License Copyright (c) 2022 Wanjun Zhong Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish… See the full description on the dataset page: https://huggingface.co/datasets/dmayhem93/agieval-lsat-lr.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for "agieval-sat-math"
Dataset taken from https://github.com/microsoft/AGIEval and processed as in that repo. MIT License Copyright (c) Microsoft Corporation. Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of… See the full description on the dataset page: https://huggingface.co/datasets/dmayhem93/agieval-sat-math.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for cmmlu_dpo_pairs
Preference pairs derived from the dev split of cmmlu and the val split of ceval-exam. A brute-force way to align the LLM's output distribution toward the multiple-choice style and increase scores on MMLU and C-Eval.
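A rough sketch of how such pairs could be built from a single ceval-exam val item (the prompt format and the chosen/rejected field names are assumptions, not necessarily what this dataset uses):
from datasets import load_dataset

def to_preference_pair(example):
    # Prompt shows the question plus the four options; the correct letter is the "chosen" reply.
    prompt = (example["question"]
              + "\nA. " + example["A"] + "\nB. " + example["B"]
              + "\nC. " + example["C"] + "\nD. " + example["D"]
              + "\nAnswer:")
    correct = example["answer"]                       # e.g. "C"
    wrong = next(x for x in "ABCD" if x != correct)   # any incorrect letter as the rejected reply
    return {"prompt": prompt, "chosen": " " + correct, "rejected": " " + wrong}

val = load_dataset("ceval/ceval-exam", name="computer_network", split="val")
pairs = val.map(to_preference_pair, remove_columns=val.column_names)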