Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
C-Eval is a comprehensive Chinese evaluation suite for foundation models. It consists of 13,948 multiple-choice questions spanning 52 diverse disciplines and four difficulty levels. Please visit our website and GitHub or check our paper for more details. Each subject consists of three splits: dev, val, and test. The dev set per subject consists of five exemplars with explanations for few-shot evaluation. The val set is intended to be used for hyperparameter tuning. And the test set is for model… See the full description on the dataset page: https://huggingface.co/datasets/ceval/ceval-exam.
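As a quick orientation, a minimal sketch of loading one subject with the Hugging Face datasets library and inspecting its splits (the subject "computer_network" is just an example):
from datasets import load_dataset

# Each C-Eval subject is a separate configuration with dev/val/test splits.
dataset = load_dataset("ceval/ceval-exam", name="computer_network")

print(dataset)            # sizes of the dev, val, and test splits
print(dataset["dev"][0])  # one exemplar: question, options A-D, answer, explanation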
Adds a 'choices' column to the original dataset.
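A hedged sketch of how such a column could be derived from the original A/B/C/D option fields (the exact construction used for this derived dataset is not documented here, so treat this as illustrative only):
from datasets import load_dataset

base = load_dataset("ceval/ceval-exam", name="computer_network")

def add_choices(example):
    # Gather the four option columns into a single list-valued 'choices' field.
    example["choices"] = [example["A"], example["B"], example["C"], example["D"]]
    return example

with_choices = base.map(add_choices)
print(with_choices["val"][0]["choices"])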
Citation
If you use the C-Eval benchmark or the code in your research, please cite their paper: @article{huang2023ceval, title={C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models}, author={Huang, Yuzhen and Bai, Yuzhuo and Zhu, Zhihao and Zhang, Junlei and Zhang, Jinghan and Su, Tangjun and Liu, Junteng and Lv, Chuancheng and Zhang, Yikai and Lei, Jiayi and Fu, Yao and Sun, Maosong and He, Junxian}… See the full description on the dataset page: https://huggingface.co/datasets/zacharyxxxxcr/ceval-exam.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset was created by GinRawin
Released under Apache 2.0
https://choosealicense.com/licenses/cc/
Dataset Card for "ceval-exam-zhtw"
C-Eval is a comprehensive Chinese evaluation suite for foundation models. It consists of 13,948 multiple-choice questions covering 52 different disciplines and four difficulty levels. See the original website and GitHub, or the paper, for more details. C-Eval is written mainly in Simplified Chinese and was designed to evaluate Simplified-Chinese LLMs; this dataset uses OpenCC to convert the text from Simplified to Traditional Chinese, mainly to make development and evaluation of Traditional-Chinese LLMs easier.
Download
Load the dataset directly with Hugging Face datasets:
from datasets import load_dataset

dataset = load_dataset("erhwenkuo/ceval-exam-zhtw", name="computer_network")
print(dataset['val'][0])
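The Simplified-to-Traditional conversion described above can be reproduced with OpenCC; a rough sketch, assuming the opencc Python package and the "s2twp" conversion profile (both assumptions, since the card does not state which profile was used):
from datasets import load_dataset
from opencc import OpenCC  # e.g. pip install opencc-python-reimplemented (assumed package)

cc = OpenCC("s2twp")  # Simplified -> Traditional Chinese (Taiwan standard with common phrases)

def to_traditional(example):
    # Convert the text fields of one C-Eval record; the answer letter needs no conversion.
    for key in ("question", "A", "B", "C", "D", "explanation"):
        example[key] = cc.convert(example[key])
    return example

simplified = load_dataset("ceval/ceval-exam", name="computer_network")
traditional = simplified.map(to_traditional)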
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
A leaderboard of the latest large language model (LLM) performance on the C-Eval benchmark, including each model's score, releasing organization, release date, and other data.
liangzid/robench-eval-Time4-c dataset hosted on Hugging Face and contributed by the HF Datasets community
HumanEval-X is a benchmark for evaluating the multilingual ability of code generation models. It consists of 820 high-quality, human-crafted data samples (each with test cases) in Python, C++, Java, JavaScript, and Go, and can be used for various tasks such as code generation and translation.
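A minimal sketch of pulling the per-language subsets with the datasets library, assuming the THUDM/humaneval-x hosting and its per-language configuration names (an assumption on my part):
from datasets import load_dataset

# 820 samples total across the five languages, i.e. 164 problems per language.
for lang in ["python", "cpp", "java", "js", "go"]:
    subset = load_dataset("THUDM/humaneval-x", name=lang, split="test")
    print(lang, len(subset))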
Subscribers can find export and import data for 23 countries by HS code or product name. This demo is helpful for market analysis.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
SV‑TrustEval‑C 🚨🔒
🔍 Overview
SV‑TrustEval‑C is the first reasoning‑based benchmark designed to rigorously evaluate Large Language Models (LLMs) on both structure (control/data flow) and semantic reasoning for vulnerability analysis in C source code. Unlike existing benchmarks that focus solely on pattern recognition, SV‑TrustEval‑C measures logical consistency, adaptability to code transformations, and real‑world security reasoning across six core tasks. Our… See the full description on the dataset page: https://huggingface.co/datasets/LLMs4CodeSecurity/SV-TrustEval-C-1.0.
CodeFuseEval is a code generation benchmark that combines the multi-tasking scenarios of the CodeFuse model with the HumanEval-X and MBPP benchmarks. It is designed to evaluate the performance of models on various tasks, including code completion, code generation from natural language, test case generation, cross-language code translation, and code generation from Chinese commands, among others.
The evaluation of the generated codes involves compiling and running in multiple programming languages. The versions of the programming language environments and packages we use are as follows:
| Dependency | Version |
| --- | --- |
| Python | 3.10.9 |
| JDK | 18.0.2.1 |
| Node.js | 16.14.0 |
| js-md5 | 0.7.3 |
| C++ | 11 |
| g++ | 7.5.0 |
| Boost | 1.75.0 |
| OpenSSL | 3.0.0 |
| go | 1.18.4 |
| cargo | 1.71.1 |
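As an illustration of the compile-and-run step described above (a hedged sketch, not the benchmark's own harness), a generated C++ candidate could be compiled with the g++/C++11 toolchain from the table and executed in a scratch directory:
import os
import subprocess
import tempfile

def run_cpp_candidate(source_code: str, timeout_s: int = 30) -> bool:
    """Compile a generated C++ candidate with g++ -std=c++11 and run the binary."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "candidate.cpp")
        binary = os.path.join(tmp, "candidate")
        with open(src, "w") as f:
            f.write(source_code)
        compiled = subprocess.run(["g++", "-std=c++11", src, "-o", binary],
                                  capture_output=True, timeout=timeout_s)
        if compiled.returncode != 0:
            return False  # a compilation error counts as a failed sample
        ran = subprocess.run([binary], capture_output=True, timeout=timeout_s)
        return ran.returncode == 0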
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
This dataset is about: (Appendix C) Carbonate, organic carbon, and Rock-Eval pyrolysis at DSDP Hole 77-535. Please consult parent dataset @ https://doi.org/10.1594/PANGAEA.809136 for more information.
RedEval is a safety evaluation benchmark designed to assess the robustness of large language models (LLMs) against harmful prompts. It simulates and evaluates LLM applications across various scenarios, all while eliminating the need for human intervention. Here are the key aspects of RedEval:
Purpose: RedEval aims to evaluate LLM safety using a technique called Chain of Utterances (CoU)-based prompts. CoU prompts are effective at breaking the safety guardrails of various LLMs, including GPT-4, ChatGPT, and open-source models.
Safety Assessment: RedEval provides simple scripts to evaluate both closed-source systems (such as ChatGPT and GPT-4) and open-source LLMs on its benchmark. The evaluation focuses on harmful questions and computes the Attack Success Rate (ASR).
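The Attack Success Rate is simply the fraction of harmful prompts whose responses are judged unsafe; a minimal sketch, where judge_is_unsafe is a hypothetical stand-in for the repository's LLM-as-a-judge step:
def attack_success_rate(responses, judge_is_unsafe):
    # ASR = (# responses judged harmful) / (# harmful prompts attempted)
    unsafe = sum(1 for response in responses if judge_is_unsafe(response))
    return unsafe / len(responses) if responses else 0.0

# Hypothetical example: if 37 of 200 responses were judged unsafe, ASR would be 0.185.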
Question Banks:
HarmfulQA: Consists of 1,960 harmful questions covering 10 topics and approximately 10 subtopics each.
DangerousQA: Contains 200 harmful questions across 6 adjectives: racist, stereotypical, sexist, illegal, toxic, and harmful.
CategoricalQA: Includes 11 categories of harm, each with 5 sub-categories, available in English, Chinese, and Vietnamese.
AdversarialQA: Provides a set of 500 instructions to tease out harmful behaviors from the model.
Safety Alignment: RedEval also offers code to perform safety alignment of LLMs. For instance, it aligns Vicuna-7B on HarmfulQA, resulting in a safer version of Vicuna that is more robust against the harmful prompts in RedEval.
Installation:
Create a conda environment: conda create --name redeval -c conda-forge python=3.11
Activate the environment: conda activate redeval
Install required packages: pip install -r requirements.txt
Store API keys in the api_keys directory for use by the LLM-as-a-judge and the generate_responses.py script for closed-source models.
Prompt Templates:
Choose a prompt template for red-teaming:
Chain of Utterances (CoU): effective at breaking safety guardrails.
Chain of Thoughts (CoT)
Standard prompt
Suffix prompt
Note: Different LLMs may require slight variations in the prompt template.
How to Perform Red-Teaming:
Step 0: Decide on the prompt template.
Step 1: Generate model outputs on harmful questions by providing a path to the question bank and the red-teaming prompt.
RepoEval is a benchmark specifically designed for evaluating repository-level code auto-completion systems. While existing benchmarks mainly focus on single-file tasks, RepoEval addresses the assessment gap for more complex, real-world, multi-file programming scenarios. Here are the key details about RepoEval:
Tasks:
RepoBench-R (Retrieval): Measures the system's ability to retrieve the most relevant code snippets from other files as cross-file context (a naive baseline is sketched after this list).
RepoBench-C (Code Completion): Evaluates the system's capability to predict the next line of code with cross-file and in-file context.
RepoBench-P (Pipeline): Handles complex tasks that require a combination of both retrieval and next-line prediction¹.
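For RepoBench-R, a deliberately naive lexical baseline (not the benchmark's own retriever) illustrates what retrieving cross-file context means in practice:
def jaccard(a: str, b: str) -> float:
    # Token-set overlap between two code fragments.
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def retrieve_cross_file_context(in_file_context: str, candidate_snippets: list[str], top_k: int = 3) -> list[str]:
    """Rank snippets from other files by lexical similarity to the current file's context."""
    ranked = sorted(candidate_snippets,
                    key=lambda snippet: jaccard(in_file_context, snippet),
                    reverse=True)
    return ranked[:top_k]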
Languages Supported:
RepoEval supports both Python and Java¹.
Purpose:
RepoEval aims to facilitate a more complete comparison of performance and encourage continuous improvement in auto-completion systems¹.
Availability:
RepoEval is publicly available for use; see the GitHub repository in the references below¹.
In summary, RepoEval provides a comprehensive evaluation framework for assessing the effectiveness of repository-level code auto-completion systems, enabling researchers and developers to enhance code productivity and quality.
(1) [2306.03091] RepoBench: Benchmarking Repository-Level Code Auto…. https://arxiv.org/abs/2306.03091
(2) [2303.12570] RepoCoder: Repository-Level Code Completion Through…. https://arxiv.org/abs/2303.12570
(3) [2306.03091] RepoBench: Benchmarking Repository-Level Code Auto…. https://ar5iv.labs.arxiv.org/html/2306.03091
(4) GitHub - Leolty/repobench: RepoBench: Benchmarking Repository-Level…. https://github.com/Leolty/repobench
(5) https://doi.org/10.48550/arXiv.2306.03091
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
ZhangRC/chinese-multi-choice-ceval-validation-glm4-explanation dataset hosted on Hugging Face and contributed by the HF Datasets community
Dataset Card for AutoTrain Evaluator
This repository contains model predictions generated by AutoTrain for the following task and dataset:
Task: Summarization
Model: pszemraj/pegasus-x-large-book-summary-C-r2
Dataset: kmfoda/booksum
Config: kmfoda--booksum
Split: test
To run new evaluation jobs, visit Hugging Face's automatic model evaluator.
Contributions
Thanks to @pszemraj for evaluating this model.
thethinkmachine/GPT4-Mixtral-GSM8K-MMLU-Preference-16K-Eval-Complexity dataset hosted on Hugging Face and contributed by the HF Datasets community
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for "agieval-lsat-lr"
Dataset taken from https://github.com/microsoft/AGIEval and processed as in that repo. Raw dataset: https://github.com/zhongwanjun/AR-LSAT MIT License Copyright (c) 2022 Wanjun Zhong Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish… See the full description on the dataset page: https://huggingface.co/datasets/dmayhem93/agieval-lsat-lr.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for "agieval-sat-math"
Dataset taken from https://github.com/microsoft/AGIEval and processed as in that repo. MIT License Copyright (c) Microsoft Corporation. Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of… See the full description on the dataset page: https://huggingface.co/datasets/dmayhem93/agieval-sat-math.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for cmmlu_dpo_pairs
Preference pairs derived from the dev split of cmmlu and the val split of ceval-exam. A brute-force way to align the LLM's output distribution toward the multiple-choice style and increase scores on MMLU and C-Eval.
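A rough sketch of how such pairs could be built from a single ceval-exam val item (the prompt format and the chosen/rejected field names are assumptions, not necessarily what this dataset uses):
from datasets import load_dataset

def to_preference_pair(example):
    # Prompt shows the question plus the four options; the correct letter is the "chosen" reply.
    prompt = (example["question"]
              + "\nA. " + example["A"] + "\nB. " + example["B"]
              + "\nC. " + example["C"] + "\nD. " + example["D"]
              + "\nAnswer:")
    correct = example["answer"]                       # e.g. "C"
    wrong = next(x for x in "ABCD" if x != correct)   # any incorrect letter as the rejected reply
    return {"prompt": prompt, "chosen": " " + correct, "rejected": " " + wrong}

val = load_dataset("ceval/ceval-exam", name="computer_network", split="val")
pairs = val.map(to_preference_pair, remove_columns=val.column_names)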