19 datasets found
  1. ceval-exam

    • huggingface.co
    • opendatalab.com
    Updated Jan 8, 2022
    Cite
    ceval (2022). ceval-exam [Dataset]. https://huggingface.co/datasets/ceval/ceval-exam
    Explore at:
    17 scholarly articles cite this dataset (View in Google Scholar)
    Dataset updated
    Jan 8, 2022
    Dataset authored and provided by
    ceval
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    C-Eval is a comprehensive Chinese evaluation suite for foundation models. It consists of 13,948 multiple-choice questions spanning 52 diverse disciplines and four difficulty levels. Please visit our website and GitHub or check our paper for more details. Each subject consists of three splits: dev, val, and test. The dev set per subject consists of five exemplars with explanations for few-shot evaluation. The val set is intended to be used for hyperparameter tuning. And the test set is for model… See the full description on the dataset page: https://huggingface.co/datasets/ceval/ceval-exam.
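    For example, one subject's splits can be loaded with the Hugging Face datasets library (a minimal sketch following the load pattern documented on the dataset page; each of the 52 subjects is a separate config):

        from datasets import load_dataset

        # Each subject is a config; dev/val/test splits are returned together.
        dataset = load_dataset("ceval/ceval-exam", name="computer_network")
        print(dataset["dev"][0])  # one of the five few-shot exemplars with an explanation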

  2. ceval-exam

    • huggingface.co
    Updated Jan 8, 2022
    Cite
    oooooz (2022). ceval-exam [Dataset]. https://huggingface.co/datasets/zacharyxxxxcr/ceval-exam
    Available download formats: Croissant (a machine-learning dataset format; see mlcommons.org/croissant)
    Dataset updated
    Jan 8, 2022
    Authors
    oooooz
    Description

    Adds a 'choices' column to the original dataset.
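    A plausible way to build such a column (a sketch, assuming the upstream rows carry the four options in fields named "A" through "D"):

        from datasets import load_dataset

        def add_choices(example):
            # Gather the four option fields into a single list column.
            example["choices"] = [example[k] for k in ("A", "B", "C", "D")]
            return example

        dataset = load_dataset("ceval/ceval-exam", name="computer_network")
        dataset = dataset.map(add_choices)  # adds 'choices' to every split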

      Citation
    

    If you use the C-Eval benchmark or the code in your research, please cite their paper:

        @article{huang2023ceval,
          title={C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models},
          author={Huang, Yuzhen and Bai, Yuzhuo and Zhu, Zhihao and Zhang, Junlei and Zhang, Jinghan and Su, Tangjun and Liu, Junteng and Lv, Chuancheng and Zhang, Yikai and Lei, Jiayi and Fu, Yao and Sun, Maosong and He, Junxian}…

    See the full description on the dataset page: https://huggingface.co/datasets/zacharyxxxxcr/ceval-exam.

  3. ceval-exam

    • kaggle.com
    Updated Dec 2, 2024
    Cite
    GinRawin (2024). ceval-exam [Dataset]. https://www.kaggle.com/datasets/ginrawin/ceval-exam
    Available download formats: Croissant
    Dataset updated
    Dec 2, 2024
    Dataset provided by
    Kaggle: http://kaggle.com/
    Authors
    GinRawin
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This dataset was created by GinRawin and released under Apache 2.0.

  4. ceval-exam-zhtw

    • huggingface.co
    Cite
    Erhwen Kuo, ceval-exam-zhtw [Dataset]. https://huggingface.co/datasets/erhwenkuo/ceval-exam-zhtw
    Available download formats: Croissant
    Authors
    Erhwen Kuo
    License

    https://choosealicense.com/licenses/cc/

    Description

    Dataset Card for "ceval-exam-zhtw"

    C-Eval is a comprehensive Chinese evaluation suite for foundation models, consisting of 13,948 multiple-choice questions that cover 52 disciplines and four difficulty levels; see the original website and GitHub, or the paper, for more details. The original C-Eval data is written in Simplified Chinese and is designed to evaluate Simplified-Chinese LLMs. This dataset uses OpenCC to convert it to Traditional Chinese, mainly to support the development and evaluation of Traditional-Chinese LLMs.

      Download

    Load the dataset directly with Hugging Face datasets:

        from datasets import load_dataset

        dataset = load_dataset("erhwenkuo/ceval-exam-zhtw", name="computer_network")
        print(dataset["val"][0])

    {'id': 0, 'question':… See the full description on the dataset page: https://huggingface.co/datasets/erhwenkuo/ceval-exam-zhtw.

  5. C-Eval 大模型评测基准排行榜 (C-Eval LLM Benchmark Leaderboard)

    • datalearner.com
    Updated Mar 21, 2025
    Cite
    数据学习 (DataLearner) (2025). C-Eval 大模型评测基准排行榜 [Dataset]. https://www.datalearner.com/ai-benchmarks/c-eval
    Dataset updated
    Mar 21, 2025
    Dataset authored and provided by
    数据学习 (DataLearner)
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    An up-to-date leaderboard of large language model (LLM) performance on the C-Eval benchmark, including each model's score, releasing organization, and release date.

  6. robench-eval-Time4-c

    • huggingface.co
    Updated Nov 27, 2024
    + more versions
    Cite
    Zi Liang (2024). robench-eval-Time4-c [Dataset]. https://huggingface.co/datasets/liangzid/robench-eval-Time4-c
    Available download formats: Croissant
    Dataset updated
    Nov 27, 2024
    Authors
    Zi Liang
    Description

    liangzid/robench-eval-Time4-c dataset hosted on Hugging Face and contributed by the HF Datasets community

  7. HumanEval-X Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Jun 9, 2025
    + more versions
    Cite
    Qinkai Zheng; Xiao Xia; Xu Zou; Yuxiao Dong; Shan Wang; Yufei Xue; Zihan Wang; Lei Shen; Andi Wang; Yang Li; Teng Su; Zhilin Yang; Jie Tang (2025). HumanEval-X Dataset [Dataset]. https://paperswithcode.com/dataset/humaneval-x
    Dataset updated
    Jun 9, 2025
    Authors
    Qinkai Zheng; Xiao Xia; Xu Zou; Yuxiao Dong; Shan Wang; Yufei Xue; Zihan Wang; Lei Shen; Andi Wang; Yang Li; Teng Su; Zhilin Yang; Jie Tang
    Description

    HumanEval-X is a benchmark for evaluating the multilingual ability of code generative models. It consists of 820 high-quality human-crafted data samples (each with test cases) in Python, C++, Java, JavaScript, and Go, and can be used for various tasks, such as code generation and translation.
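    Benchmarks in this family are commonly scored with the unbiased pass@k estimator from the original HumanEval paper; a sketch of that standard formula (not code from HumanEval-X itself):

        import math

        def pass_at_k(n: int, c: int, k: int) -> float:
            # Unbiased estimator 1 - C(n-c, k) / C(n, k): the chance that at
            # least one of k sampled completions passes, given that c of the
            # n generated samples passed all test cases.
            if n - c < k:
                return 1.0
            return 1.0 - math.comb(n - c, k) / math.comb(n, k)

        print(pass_at_k(n=20, c=5, k=1))  # 0.25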

  8. Fiat chrysler automobiles c/o ceval USA Import & Buyer Data

    • seair.co.in
    Updated May 1, 2017
    + more versions
    Cite
    Seair Exim (2017). Fiat chrysler automobiles c/o ceval USA Import & Buyer Data [Dataset]. https://www.seair.co.in
    Available download formats: .bin, .xml, .csv, .xls
    Dataset updated
    May 1, 2017
    Dataset provided by
    Seair Exim Solutions
    Authors
    Seair Exim
    Area covered
    United States
    Description

    Subscribers can look up export and import data for 23 countries by HS code or product name. This demo is helpful for market analysis.

  9. SV-TrustEval-C-1.0

    • huggingface.co
    Updated Jun 23, 2025
    Cite
    Yansong Li (2025). SV-TrustEval-C-1.0 [Dataset]. https://huggingface.co/datasets/LLMs4CodeSecurity/SV-TrustEval-C-1.0
    Dataset updated
    Jun 23, 2025
    Authors
    Yansong Li
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    SV‑TrustEval‑C 🚨🔒

      🔍 Overview
    

    SV‑TrustEval‑C is the first reasoning‑based benchmark designed to rigorously evaluate Large Language Models (LLMs) on both structure (control/data flow) and semantic reasoning for vulnerability analysis in C source code. Unlike existing benchmarks that focus solely on pattern recognition, SV‑TrustEval‑C measures logical consistency, adaptability to code transformations, and real‑world security reasoning across six core tasks. Our… See the full description on the dataset page: https://huggingface.co/datasets/LLMs4CodeSecurity/SV-TrustEval-C-1.0.

  10. CodeFuseEval Dataset

    • paperswithcode.com
    Updated Oct 26, 2023
    + more versions
    Cite
    (2023). CodeFuseEval Dataset [Dataset]. https://paperswithcode.com/dataset/codefuseeval
    Dataset updated
    Oct 26, 2023
    Description

    CodeFuseEval is a Code Generation benchmark that combines the multi-tasking scenarios of CodeFuse Model with the benchmarks of HumanEval-x and MBPP. This benchmark is designed to evaluate the performance of models in various multi-tasking tasks, including code completion, code generation from natural language, test case generation, cross-language code translation, and code generation from Chinese commands, among others.

    Evaluating the generated code involves compiling and running it in multiple programming languages. The versions of the language environments and packages used are as follows:

    Dependency   Version
    Python       3.10.9
    JDK          18.0.2.1
    Node.js      16.14.0
    js-md5       0.7.3
    C++          11
    g++          7.5.0
    Boost        1.75.0
    OpenSSL      3.0.0
    go           1.18.4
    cargo        1.71.1
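    As a rough illustration of this compile-and-run style of scoring (a sketch, not CodeFuseEval's actual harness), a generated Python sample can be checked by executing it together with its test cases:

        import os
        import subprocess
        import tempfile

        def run_candidate(code: str, tests: str, timeout: float = 10.0) -> bool:
            # Write the candidate solution plus its tests to a temp file and
            # execute it; the sample passes if the process exits cleanly.
            with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
                f.write(code + "\n" + tests)
                path = f.name
            try:
                result = subprocess.run(["python", path], capture_output=True, timeout=timeout)
                return result.returncode == 0
            except subprocess.TimeoutExpired:
                return False
            finally:
                os.remove(path)
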
  11. (Appendix C) Carbonate, organic carbon, and Rock-Eval pyrolysis at DSDP Hole...

    • doi.pangaea.de
    html, tsv
    Updated 1984
    + more versions
    Cite
    Pierre Cotillon; Michel Rio (1984). (Appendix C) Carbonate, organic carbon, and Rock-Eval pyrolysis at DSDP Hole 77-535 [Dataset]. http://doi.org/10.1594/PANGAEA.809134
    Available download formats: html, tsv
    Dataset updated
    1984
    Dataset provided by
    PANGAEA
    Authors
    Pierre Cotillon; Michel Rio
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Time period covered
    Dec 29, 1980
    Variables measured
    Ratio, Carbon, Calcium carbonate, Sample code/label, DEPTH, sediment/rock, Carbon, organic, total, Carbon, pyrolysis mineral, Lithology/composition/facies, Production index, S1/(S1+S2), Pyrolysis temperature maximum, and 1 more
    Description

    This dataset is about: (Appendix C) Carbonate, organic carbon, and Rock-Eval pyrolysis at DSDP Hole 77-535. Please consult parent dataset @ https://doi.org/10.1594/PANGAEA.809136 for more information.

  12. RedEval Dataset

    • paperswithcode.com
    Updated Mar 11, 2024
    Cite
    Rishabh Bhardwaj; Soujanya Poria (2024). RedEval Dataset [Dataset]. https://paperswithcode.com/dataset/redeval
    Dataset updated
    Mar 11, 2024
    Authors
    Rishabh Bhardwaj; Soujanya Poria
    Description

    RedEval is a safety evaluation benchmark designed to assess the robustness of large language models (LLMs) against harmful prompts. It simulates and evaluates LLM applications across various scenarios, all while eliminating the need for human intervention. Here are the key aspects of RedEval:

    Purpose: RedEval aims to evaluate LLM safety using a technique called Chain of Utterances (CoU)-based prompts. CoU prompts are effective at breaking the safety guardrails of various LLMs, including GPT-4, ChatGPT, and open-source models.

    Safety Assessment: RedEval provides simple scripts to evaluate both closed-source systems (such as ChatGPT and GPT-4) and open-source LLMs on its benchmark. The evaluation focuses on harmful questions and computes the Attack Success Rate (ASR).
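    As a rough illustration, ASR is the fraction of harmful prompts whose responses a judge labels unsafe (a minimal sketch; the label values are hypothetical):

        def attack_success_rate(judgments: list[str]) -> float:
            # One judge label per harmful prompt: "unsafe" means the attack
            # broke the model's guardrails.
            unsafe = sum(1 for label in judgments if label == "unsafe")
            return unsafe / len(judgments) if judgments else 0.0

        print(attack_success_rate(["unsafe", "safe", "unsafe", "unsafe"]))  # 0.75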

    Question Banks:

    HarmfulQA: Consists of 1,960 harmful questions covering 10 topics and approximately 10 subtopics each.
    DangerousQA: Contains 200 harmful questions across 6 adjectives: racist, stereotypical, sexist, illegal, toxic, and harmful.
    CategoricalQA: Includes 11 categories of harm, each with 5 sub-categories, available in English, Chinese, and Vietnamese.
    AdversarialQA: Provides a set of 500 instructions to tease out harmful behaviors from the model.

    Safety Alignment: RedEval also offers code to perform safety alignment of LLMs. For instance, it aligns Vicuna-7B on HarmfulQA, resulting in a safer version of Vicuna that is more robust against RedEval.

    Installation:

    Create a conda environment: conda create --name redeval -c conda-forge python=3.11
    Activate the environment: conda activate redeval
    Install required packages: pip install -r requirements.txt
    Store API keys in the api_keys directory; they are used by the LLM judge and by the generate_responses.py script for closed-source models.

    Prompt Templates:

    Choose a prompt template for red-teaming:
    Chain of Utterances (CoU): effective at breaking safety guardrails.
    Chain of Thoughts (CoT)
    Standard prompt
    Suffix prompt
    Note: different LLMs may require slight variations in the prompt template.

    How to Perform Red-Teaming:

    Step 0: Decide on the prompt template.
    Step 1: Generate model outputs on harmful questions by providing a path to the question bank and the red-teaming prompt.

  13. RepoEval Dataset

    • paperswithcode.com
    Cite
    Fengji Zhang; Bei Chen; Yue Zhang; Jacky Keung; Jin Liu; Daoguang Zan; Yi Mao; Jian-Guang Lou; Weizhu Chen, RepoEval Dataset [Dataset]. https://paperswithcode.com/dataset/repoeval
    Authors
    Fengji Zhang; Bei Chen; Yue Zhang; Jacky Keung; Jin Liu; Daoguang Zan; Yi Mao; Jian-Guang Lou; Weizhu Chen
    Description

    RepoEval is a benchmark specifically designed for evaluating repository-level code auto-completion systems. While existing benchmarks mainly focus on single-file tasks, RepoEval addresses the assessment gap for more complex, real-world, multi-file programming scenarios. Here are the key details about RepoEval:

    Tasks:

    RepoBench-R (Retrieval): measures the system's ability to retrieve the most relevant code snippets from other files as cross-file context (a toy sketch follows this list).
    RepoBench-C (Code Completion): evaluates the system's capability to predict the next line of code with cross-file and in-file context.
    RepoBench-P (Pipeline): handles complex tasks that require a combination of both retrieval and next-line prediction¹.
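    As a toy illustration of the retrieval step (not RepoEval's actual method), candidate cross-file snippets can be ranked by a simple lexical similarity to the in-file context:

        def jaccard(a: str, b: str) -> float:
            # Token-set Jaccard similarity between two code snippets.
            ta, tb = set(a.split()), set(b.split())
            return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

        def retrieve(query: str, snippets: list[str], top_k: int = 3) -> list[str]:
            # Rank candidate cross-file snippets against the in-file context.
            return sorted(snippets, key=lambda s: jaccard(query, s), reverse=True)[:top_k]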

    Languages Supported:

    RepoEval supports both Python and Java¹.

    Purpose:

    RepoEval aims to facilitate a more complete comparison of performance and encourage continuous improvement in auto-completion systems¹.

    Availability:

    RepoEval is publicly available¹.

    In summary, RepoEval provides a comprehensive evaluation framework for assessing the effectiveness of repository-level code auto-completion systems, enabling researchers and developers to enhance code productivity and quality.

    (1) RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems. https://arxiv.org/abs/2306.03091
    (2) RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation. https://arxiv.org/abs/2303.12570
    (3) RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems (HTML version). https://ar5iv.labs.arxiv.org/html/2306.03091
    (4) GitHub - Leolty/repobench: RepoBench benchmark repository. https://github.com/Leolty/repobench
    (5) https://doi.org/10.48550/arXiv.2306.03091

  14. chinese-multi-choice-ceval-validation-glm4-explanation

    • huggingface.co
    Cite
    Zhang Ruichong, chinese-multi-choice-ceval-validation-glm4-explanation [Dataset]. https://huggingface.co/datasets/ZhangRC/chinese-multi-choice-ceval-validation-glm4-explanation
    Available download formats: Croissant
    Authors
    Zhang Ruichong
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    ZhangRC/chinese-multi-choice-ceval-validation-glm4-explanation dataset hosted on Hugging Face and contributed by the HF Datasets community

  15. autoeval-eval-kmfoda_booksum-kmfoda_booksum-66d70e-2296872703

    • huggingface.co
    Updated Aug 21, 2023
    + more versions
    Cite
    Evaluation on the Hub (2023). autoeval-eval-kmfoda_booksum-kmfoda_booksum-66d70e-2296872703 [Dataset]. https://huggingface.co/datasets/autoevaluate/autoeval-eval-kmfoda_booksum-kmfoda_booksum-66d70e-2296872703
    Available download formats: Croissant
    Dataset updated
    Aug 21, 2023
    Dataset authored and provided by
    Evaluation on the Hub
    Description

    Dataset Card for AutoTrain Evaluator

    This repository contains model predictions generated by AutoTrain for the following task and dataset:

    Task: Summarization
    Model: pszemraj/pegasus-x-large-book-summary-C-r2
    Dataset: kmfoda/booksum
    Config: kmfoda--booksum
    Split: test

    To run new evaluation jobs, visit Hugging Face's automatic model evaluator.

      Contributions
    

    Thanks to @pszemraj for evaluating this model.

  16. GPT4-Mixtral-GSM8K-MMLU-Preference-16K-Eval-Complexity

    • huggingface.co
    Updated Jan 12, 2025
    + more versions
    Cite
    Shreyan C (2025). GPT4-Mixtral-GSM8K-MMLU-Preference-16K-Eval-Complexity [Dataset]. https://huggingface.co/datasets/thethinkmachine/GPT4-Mixtral-GSM8K-MMLU-Preference-16K-Eval-Complexity
    Available download formats: Croissant
    Dataset updated
    Jan 12, 2025
    Authors
    Shreyan C
    Description

    thethinkmachine/GPT4-Mixtral-GSM8K-MMLU-Preference-16K-Eval-Complexity dataset hosted on Hugging Face and contributed by the HF Datasets community

  17. agieval-lsat-lr

    • huggingface.co
    Updated Jun 18, 2023
    + more versions
    Cite
    dmayhem93 (2023). agieval-lsat-lr [Dataset]. https://huggingface.co/datasets/dmayhem93/agieval-lsat-lr
    Available download formats: Croissant
    Dataset updated
    Jun 18, 2023
    Authors
    dmayhem93
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card for "agieval-lsat-lr"

    Dataset taken from https://github.com/microsoft/AGIEval and processed as in that repo. Raw dataset: https://github.com/zhongwanjun/AR-LSAT MIT License Copyright (c) 2022 Wanjun Zhong Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish… See the full description on the dataset page: https://huggingface.co/datasets/dmayhem93/agieval-lsat-lr.

  18. agieval-sat-math

    • huggingface.co
    Updated Jun 18, 2023
    + more versions
    Cite
    dmayhem93 (2023). agieval-sat-math [Dataset]. https://huggingface.co/datasets/dmayhem93/agieval-sat-math
    Available download formats: Croissant
    Dataset updated
    Jun 18, 2023
    Authors
    dmayhem93
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card for "agieval-sat-math"

    Dataset taken from https://github.com/microsoft/AGIEval and processed as in that repo. MIT License Copyright (c) Microsoft Corporation. Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of… See the full description on the dataset page: https://huggingface.co/datasets/dmayhem93/agieval-sat-math.

  19. cmmlu_dpo_pairs

    • huggingface.co
    Updated Apr 18, 2024
    Cite
    Belandros Pan (2024). cmmlu_dpo_pairs [Dataset]. https://huggingface.co/datasets/wenbopan/cmmlu_dpo_pairs
    Available download formats: Croissant
    Dataset updated
    Apr 18, 2024
    Authors
    Belandros Pan
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card for cmmlu_dpo_pairs

    Preference pairs derived from the dev split of cmmlu and the valid split of ceval-exam. A brute-force way to align the LLM's output distribution toward the multiple-choice style, increasing scores on mmlu and ceval.
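    The general shape of such a pair is a prompt with a preferred and a dispreferred completion (a sketch with hypothetical field contents, not the author's exact construction):

        def to_dpo_pair(question: str, options: dict[str, str], correct: str, wrong: str) -> dict:
            # Format one multiple-choice question as a DPO preference record:
            # the correct option letter is "chosen", a wrong one is "rejected".
            prompt = question + "\n" + "\n".join(f"{k}. {v}" for k, v in sorted(options.items()))
            return {"prompt": prompt, "chosen": correct, "rejected": wrong}

        pair = to_dpo_pair(
            "Which planet is known as the Red Planet?",
            {"A": "Venus", "B": "Mars", "C": "Jupiter", "D": "Saturn"},
            correct="B",
            wrong="C",
        )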

