21 datasets found
  1. ceval-exam

    • huggingface.co
    • opendatalab.com
    Updated Jan 8, 2022
    Cite
    ceval (2022). ceval-exam [Dataset]. https://huggingface.co/datasets/ceval/ceval-exam
    Explore at:
    Dataset updated
    Jan 8, 2022
    Dataset authored and provided by
    ceval
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    C-Eval is a comprehensive Chinese evaluation suite for foundation models. It consists of 13,948 multiple-choice questions spanning 52 diverse disciplines and four difficulty levels. Please visit our website and GitHub, or check our paper, for more details. Each subject has three splits: dev, val, and test. The dev split for each subject contains five exemplars with explanations for few-shot evaluation. The val split is intended for hyperparameter tuning, and the test split is for model… See the full description on the dataset page: https://huggingface.co/datasets/ceval/ceval-exam.
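
    A minimal loading sketch with the Hugging Face datasets library, assuming each subject is exposed as its own config name (for example "computer_network", as in the zhtw variant below); the exact config list is on the dataset card:

    from datasets import load_dataset

    # Each subject is its own config; "computer_network" is one example.
    # Depending on your datasets version, trust_remote_code=True may be required.
    dataset = load_dataset("ceval/ceval-exam", name="computer_network")

    # dev: five exemplars with explanations for few-shot prompting; val: labeled, for tuning.
    print(len(dataset["dev"]), dataset["dev"][0])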

  2. ceval-exam

    • huggingface.co
    Updated Jan 8, 2022
    Cite
    oooooz (2022). ceval-exam [Dataset]. https://huggingface.co/datasets/zacharyxxxxcr/ceval-exam
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Jan 8, 2022
    Authors
    oooooz
    Description

    Adds a 'choices' column to the original dataset.

      Citation
    

    If you use the C-Eval benchmark or the code in your research, please cite their paper:

    @article{huang2023ceval,
      title={C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models},
      author={Huang, Yuzhen and Bai, Yuzhuo and Zhu, Zhihao and Zhang, Junlei and Zhang, Jinghan and Su, Tangjun and Liu, Junteng and Lv, Chuancheng and Zhang, Yikai and Lei, Jiayi and Fu, Yao and Sun, Maosong and He, Junxian}… See the full description on the dataset page: https://huggingface.co/datasets/zacharyxxxxcr/ceval-exam.

  3. ceval-exam-zhtw

    • huggingface.co
    Cite
    Erhwen, Kuo, ceval-exam-zhtw [Dataset]. https://huggingface.co/datasets/erhwenkuo/ceval-exam-zhtw
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Authors
    Erhwen, Kuo
    License

    https://choosealicense.com/licenses/cc/

    Description

    Dataset Card for "ceval-exam-zhtw"

    C-Eval is a comprehensive Chinese evaluation suite for foundation models, consisting of 13,948 multiple-choice questions spanning 52 disciplines and four difficulty levels; see the original website and GitHub, or the paper, for more details. The original C-Eval data is written in Simplified Chinese and is designed to evaluate Simplified-Chinese LLMs. This dataset uses OpenCC to convert it from Simplified to Traditional Chinese, primarily to support the development and evaluation of Traditional-Chinese LLMs.

      Download
    

    Load the dataset directly with Hugging Face datasets:

    from datasets import load_dataset

    dataset = load_dataset("erhwenkuo/ceval-exam-zhtw", name="computer_network")

    print(dataset['val'][0])

    {'id': 0, 'question':… See the full description on the dataset page: https://huggingface.co/datasets/erhwenkuo/ceval-exam-zhtw.

  4. Neotech epic technologies c/o ceval logisitics950 loma verde USA Import &...

    • seair.co.in
    Cite
    Seair Exim, Neotech epic technologies c/o ceval logisitics950 loma verde USA Import & Buyer Data [Dataset]. https://www.seair.co.in
    Explore at:
    Available download formats: .bin, .xml, .csv, .xls
    Dataset provided by
    Seair Exim Solutions
    Authors
    Seair Exim
    Area covered
    United States
    Description

    Subscribers can look up the export and import data of 23 countries by HS code or product name. This demo is helpful for market analysis.

  5. robench-eval-Time17-c

    • huggingface.co
    Updated Nov 27, 2024
    + more versions
    Cite
    Zi Liang (2024). robench-eval-Time17-c [Dataset]. https://huggingface.co/datasets/liangzid/robench-eval-Time17-c
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Nov 27, 2024
    Authors
    Zi Liang
    Description

    liangzid/robench-eval-Time17-c dataset hosted on Hugging Face and contributed by the HF Datasets community

  6. HumanEval-X Dataset

    • paperswithcode.com
    • opendatalab.com
    Updated Jun 9, 2025
    + more versions
    Cite
    Qinkai Zheng; Xiao Xia; Xu Zou; Yuxiao Dong; Shan Wang; Yufei Xue; Zihan Wang; Lei Shen; Andi Wang; Yang Li; Teng Su; Zhilin Yang; Jie Tang (2025). HumanEval-X Dataset [Dataset]. https://paperswithcode.com/dataset/humaneval-x
    Explore at:
    Dataset updated
    Jun 9, 2025
    Authors
    Qinkai Zheng; Xiao Xia; Xu Zou; Yuxiao Dong; Shan Wang; Yufei Xue; Zihan Wang; Lei Shen; Andi Wang; Yang Li; Teng Su; Zhilin Yang; Jie Tang
    Description

    HumanEval-X is a benchmark for evaluating the multilingual ability of code generative models. It consists of 820 high-quality human-crafted data samples (each with test cases) in Python, C++, Java, JavaScript, and Go, and can be used for various tasks, such as code generation and translation.
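
    A minimal loading sketch with the Hugging Face datasets library, assuming the mirror THUDM/humaneval-x with per-language configs ("python", "cpp", "java", "js", "go") and a "test" split; these names are assumptions, so check the dataset card:

    from datasets import load_dataset

    # Config and field names are assumed from the HumanEval family of benchmarks.
    # Depending on your datasets version, trust_remote_code=True may be required.
    humaneval_x = load_dataset("THUDM/humaneval-x", name="python", split="test")
    sample = humaneval_x[0]
    print(sample["task_id"])       # unique identifier of the problem
    print(sample["prompt"][:200])  # function signature and docstring to complete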

  7. (Appendix C) Carbonate, organic carbon, and Rock-Eval pyrolysis at DSDP Hole...

    • doi.pangaea.de
    html, tsv
    Updated 1984
    + more versions
    Cite
    Pierre Cotillon; Michel Rio (1984). (Appendix C) Carbonate, organic carbon, and Rock-Eval pyrolysis at DSDP Hole 77-535 [Dataset]. http://doi.org/10.1594/PANGAEA.809134
    Explore at:
    Available download formats: html, tsv
    Dataset updated
    1984
    Dataset provided by
    PANGAEA
    Authors
    Pierre Cotillon; Michel Rio
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Time period covered
    Dec 29, 1980
    Area covered
    Variables measured
    Ratio; Carbon; Calcium carbonate; Sample code/label; DEPTH, sediment/rock; Carbon, organic, total; Carbon, pyrolysis mineral; Lithology/composition/facies; Production index, S1/(S1+S2); Pyrolysis temperature maximum; and 1 more
    Description

    This dataset is about: (Appendix C) Carbonate, organic carbon, and Rock-Eval pyrolysis at DSDP Hole 77-535. Please consult parent dataset @ https://doi.org/10.1594/PANGAEA.809136 for more information.

  8. RedEval Dataset

    • paperswithcode.com
    Updated Apr 10, 2024
    Cite
    Rishabh Bhardwaj; Soujanya Poria (2024). RedEval Dataset [Dataset]. https://paperswithcode.com/dataset/redeval
    Explore at:
    Dataset updated
    Apr 10, 2024
    Authors
    Rishabh Bhardwaj; Soujanya Poria
    Description

    RedEval is a safety evaluation benchmark designed to assess the robustness of large language models (LLMs) against harmful prompts. It simulates and evaluates LLM applications across various scenarios, all while eliminating the need for human intervention. Here are the key aspects of RedEval:

    Purpose: RedEval aims to evaluate LLM safety using a technique called Chain of Utterances (CoU)-based prompts. CoU prompts are effective at breaking the safety guardrails of various LLMs, including GPT-4, ChatGPT, and open-source models.

    Safety Assessment: RedEval provides simple scripts to evaluate both closed-source systems (such as ChatGPT and GPT-4) and open-source LLMs on its benchmark. The evaluation focuses on harmful questions and computes the Attack Success Rate (ASR).
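
    As a rough illustration (not RedEval's actual scripts), the Attack Success Rate is the fraction of harmful prompts for which the judge labels the model's response as unsafe; a minimal sketch with hypothetical judge labels:

    # Illustrative ASR computation; the judge labels below are hypothetical.
    def attack_success_rate(judgements):
        """judgements: list of booleans, True if the judged response was unsafe."""
        return sum(judgements) / len(judgements) if judgements else 0.0

    judgements = [True, False, True, True]   # e.g. 3 of 4 harmful prompts succeeded
    print(f"ASR = {attack_success_rate(judgements):.2f}")  # ASR = 0.75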

    Question Banks:

    • HarmfulQA: Consists of 1,960 harmful questions covering 10 topics and approximately 10 subtopics each.
    • DangerousQA: Contains 200 harmful questions across 6 adjectives: racist, stereotypical, sexist, illegal, toxic, and harmful.
    • CategoricalQA: Includes 11 categories of harm, each with 5 sub-categories, available in English, Chinese, and Vietnamese.
    • AdversarialQA: Provides a set of 500 instructions to tease out harmful behaviors from the model.

    Safety Alignment: RedEval also offers code to perform safety alignment of LLMs. For instance, it aligns Vicuna-7B on HarmfulQA, resulting in a safer version of Vicuna that is more robust against RedEval.

    Installation:

    • Create a conda environment: conda create --name redeval -c conda-forge python=3.11
    • Activate the environment: conda activate redeval
    • Install required packages: pip install -r requirements.txt
    • Store API keys in the api_keys directory for use by the LLM-as-a-judge and by the generate_responses.py script for closed-source models.

    Prompt Templates:

    Choose a prompt template for red-teaming:
    • Chain of Utterances (CoU): Effective at breaking safety guardrails.
    • Chain of Thoughts (CoT)
    • Standard prompt
    • Suffix prompt
    Note: Different LLMs may require slight variations in the prompt template.

    How to Perform Red-Teaming:

    Step 0: Decide on the prompt template.
    Step 1: Generate model outputs on harmful questions by providing a path to the question bank and the red-teaming prompt.

  9. CodeFuseEval Dataset

    • paperswithcode.com
    Updated Oct 26, 2023
    + more versions
    Cite
    (2023). CodeFuseEval Dataset [Dataset]. https://paperswithcode.com/dataset/codefuseeval
    Explore at:
    Dataset updated
    Oct 26, 2023
    Description

    CodeFuseEval is a Code Generation benchmark that combines the multi-tasking scenarios of CodeFuse Model with the benchmarks of HumanEval-x and MBPP. This benchmark is designed to evaluate the performance of models in various multi-tasking tasks, including code completion, code generation from natural language, test case generation, cross-language code translation, and code generation from Chinese commands, among others.

    The evaluation of the generated code involves compiling and running it in multiple programming languages (a generic sketch of such a compile-and-run check follows the version table below). The versions of the programming language environments and packages we use are as follows:

    Dependency    Version
    Python        3.10.9
    JDK           18.0.2.1
    Node.js       16.14.0
    js-md5        0.7.3
    C++           11
    g++           7.5.0
    Boost         1.75.0
    OpenSSL       3.0.0
    go            1.18.4
    cargo         1.71.1
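
    A generic compile-and-run check in this spirit might look as follows; this is only a sketch (it assumes a local g++ toolchain), not CodeFuseEval's actual evaluation harness:

    import os
    import subprocess
    import tempfile

    def run_cpp_sample(source_code: str, timeout: int = 10) -> bool:
        """Compile a C++ source string with g++ and run the binary.
        Returns True if it compiles and exits with status 0 (tests passed)."""
        with tempfile.TemporaryDirectory() as tmp:
            src = os.path.join(tmp, "sample.cpp")
            binary = os.path.join(tmp, "sample")
            with open(src, "w") as f:
                f.write(source_code)
            compiled = subprocess.run(["g++", "-std=c++11", src, "-o", binary],
                                      capture_output=True)
            if compiled.returncode != 0:
                return False  # compilation error
            try:
                result = subprocess.run([binary], capture_output=True, timeout=timeout)
            except subprocess.TimeoutExpired:
                return False
            return result.returncode == 0

    # Hypothetical generated solution concatenated with an assert-based test harness.
    sample = """
    #include <cassert>
    int add(int a, int b) { return a + b; }
    int main() { assert(add(2, 3) == 5); return 0; }
    """
    print(run_cpp_sample(sample))  # True if the generated code passes its tests
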
  10. Data from: MedMNIST-C: Comprehensive benchmark and improved classifier...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 31, 2024
    Cite
    Di Salvo, Francesco (2024). MedMNIST-C: Comprehensive benchmark and improved classifier robustness by simulating realistic image corruptions [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_11471503
    Explore at:
    Dataset updated
    Jul 31, 2024
    Dataset provided by
    Doerrich, Sebastian
    Ledig, Christian
    Di Salvo, Francesco
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Abstract: The integration of neural-network-based systems into clinical practice is limited by challenges related to domain generalization and robustness. The computer vision community established benchmarks such as ImageNet-C as a fundamental prerequisite to measure progress towards those challenges. Similar datasets are largely absent in the medical imaging community which lacks a comprehensive benchmark that spans across imaging modalities and applications. To address this gap, we create and open-source MedMNIST-C, a benchmark dataset based on the MedMNIST+ collection, covering 12 datasets and 9 imaging modalities. We simulate task and modality-specific image corruptions of varying severity to comprehensively evaluate the robustness of established algorithms against real-world artifacts and distribution shifts. We further provide quantitative evidence that our simple-to-use artificial corruptions allow for highly performant, lightweight data augmentation to enhance model robustness. Unlike traditional, generic augmentation strategies, our approach leverages domain knowledge, exhibiting significantly higher robustness when compared to widely adopted methods. By introducing MedMNIST-C and open-sourcing the corresponding library allowing for targeted data augmentations, we contribute to the development of increasingly robust methods tailored to the challenges of medical imaging. The code is available at github.com/francescodisalvo05/medmnistc-api.

    This work has been accepted at the Workshop on Advancing Data Solutions in Medical Imaging AI @ MICCAI 2024 [preprint].

    Note: Due to space constraints, we have uploaded all datasets except TissueMNIST-C. However, it can be reproduced via our APIs.

    Usage: We recommend using the demo code and tutorials available on our GitHub repository.

    Citation: If you find this work useful, please consider citing us:

    @article{disalvo2024medmnist,
      title={MedMNIST-C: Comprehensive benchmark and improved classifier robustness by simulating realistic image corruptions},
      author={Di Salvo, Francesco and Doerrich, Sebastian and Ledig, Christian},
      journal={arXiv preprint arXiv:2406.17536},
      year={2024}
    }

    Disclaimer: This repository is inspired by MedMNIST APIs and the ImageNet-C repository. Thus, please also consider citing MedMNIST, the respective source datasets (described here), and ImageNet-C.

  11. Park Scores 2005-2014

    • data.wu.ac.at
    • data.sfgov.org
    • +4more
    csv, json, rdf, xml
    Updated Jun 10, 2016
    + more versions
    Cite
    City of San Francisco (2016). Park Scores 2005-2014 [Dataset]. https://data.wu.ac.at/schema/data_gov/NzA4Mzk1NDYtYzQyZS00NmRhLTk5YjgtZTY5ZjljZDg2MzUz
    Explore at:
    Available download formats: rdf, xml, csv, json
    Dataset updated
    Jun 10, 2016
    Dataset provided by
    City of San Francisco
    Description

    Average quarterly park evaluation scores from Q3 FY2005 to Q4 FY2014. These scores are collected and reported pursuant to 2003's Prop C, which requires city agencies to establish and publish standards for street, sidewalk, and park maintenance. Beginning in FY2015, a new methodology was developed to evaluate parks, so these scores should not be compared directly with scores reported in FY2015 and onward. Data from FY2015 onward is published and maintained by the SF Controller's Office.

  12. SCARED-C Dataset

    • paperswithcode.com
    • data.mendeley.com
    Updated Sep 29, 2024
    + more versions
    Cite
    Ivan Reyes-Amezcua; Ricardo Espinosa; Christian Daul; Gilberto Ochoa-Ruiz; Andres Mendez-Vazquez (2024). SCARED-C Dataset [Dataset]. https://paperswithcode.com/dataset/scared-c
    Explore at:
    Dataset updated
    Sep 29, 2024
    Authors
    Ivan Reyes-Amezcua; Ricardo Espinosa; Christian Daul; Gilberto Ochoa-Ruiz; Andres Mendez-Vazquez
    Description

    The SCARED-C dataset is introduced in the context of assessing robustness in endoscopic depth prediction models. It is part of the EndoDepth benchmark, which is designed to evaluate the performance of monocular depth prediction models specifically for endoscopic scenarios. The dataset features 16 different types of image corruptions, each with five levels of severity, encompassing challenges like lens distortion, resolution alterations, specular reflection, and color changes that are typical in endoscopic imaging. The ground truth comes from the original SCARED test set.

    The purpose of SCARED-C is to test the robustness of depth estimation models by exposing them to various common endoscopic corruptions. This dataset is a valuable tool for developing and evaluating depth prediction algorithms that can handle the unique challenges presented by endoscopic procedures, ensuring more accurate and reliable outcomes in medical imaging.

  13. Dataset from An Open-Label, Multicenter, Extension Study to Evaluate the...

    • data.niaid.nih.gov
    Updated Nov 26, 2024
    + more versions
    Cite
    Medical Director (2024). Dataset from An Open-Label, Multicenter, Extension Study to Evaluate the Safety and Tolerability of Natalizumab Following Re-Initiation of Dosing in Multiple Sclerosis Subjects Who Have Completed Study C-1801, C-1802, C-1803, or C-1808 and a Dosing Suspension Safety Evaluation [Dataset]. http://doi.org/10.25934/00003229
    Explore at:
    Dataset updated
    Nov 26, 2024
    Dataset provided by
    Biogen (http://biogen.com/)
    Authors
    Medical Director
    Area covered
    Israel, Canada, Sweden, Switzerland, Netherlands, United Kingdom, Poland, Spain, Italy, Czech Republic
    Variables measured
    Expanded Disability Status Scale
    Description

    The primary objectives for the initial treatment period of this study are to further evaluate the safety of natalizumab monotherapy by evaluating the risk of hypersensitivity reactions and immunogenicity following re-exposure to natalizumab and confirming the safety of switching from interferon (IFN), glatiramer acetate, or other multiple sclerosis (MS) therapies to natalizumab. The primary objective for the long-term treatment period of this study is to evaluate the long-term impact of natalizumab monotherapy on the progression of disability measured by Expanded Disability Status Scale (EDSS) changes over time.

  14. Table_1_Development of a modified C-BARQ for evaluating behavior in working...

    • frontiersin.figshare.com
    docx
    Updated Jun 28, 2024
    + more versions
    Cite
    Elizabeth Hare; Jennifer Lynn Essler; Cynthia M. Otto; Dana Ebbecke; James A. Serpell (2024). Table_1_Development of a modified C-BARQ for evaluating behavior in working dogs.docx [Dataset]. http://doi.org/10.3389/fvets.2024.1371630.s003
    Explore at:
    Available download formats: docx
    Dataset updated
    Jun 28, 2024
    Dataset provided by
    Frontiers
    Authors
    Elizabeth Hare; Jennifer Lynn Essler; Cynthia M. Otto; Dana Ebbecke; James A. Serpell
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction: Current high demand for effective odor detection dogs calls for the development of reliable methods for measuring performance-related behavioral phenotypes in these highly specialized working animals. The Canine Behavioral Assessment & Research Questionnaire (C-BARQ) is a widely used behavioral assessment tool among working dog organizations with a demonstrated ability to predict success/failure of dogs in training. However, this instrument was developed originally to study the prevalence of behavior problems in the pet dog population, and it therefore lacks the capacity to measure specific behavioral propensities that may also be important predictors of working dog success. The current paper examines the factor structure, internal reliability, and content validity of a modified version of the C-BARQ designed to evaluate four new domains of canine behavior in addition to those encompassed by the original C-BARQ. These domains, labeled Playfulness, Impulsivity, Distractibility, and Basophobia (fear of falling), respectively, describe aspects of canine behavior or temperament which are believed to contribute substantially to working dog performance.
    Methods: Exploratory factor analysis (EFA) of owner/handler questionnaire responses based on a sample of 1,117 working odor detection dogs.
    Results: A total of 15 factors were extracted by EFA, 10 of which correspond to original C-BARQ factors. The remaining 5 comprise the four new domains (Playfulness, Impulsivity, Distractibility, and Basophobia) as well as a fifth new factor labeled Food focus.
    Discussion: The resulting Working Dog Canine Behavioral Assessment & Research Questionnaire (WDC-BARQ) successfully expands the measurement capacities of the original C-BARQ to include dimensions of behavior/temperament of particular relevance to many working dog populations.

  15. Defects in C programs

    • figshare.com
    7z
    Updated Jun 20, 2022
    Cite
    Yuan-An Xiao (2022). Defects in C programs [Dataset]. http://doi.org/10.6084/m9.figshare.20073119.v1
    Explore at:
    Available download formats: 7z
    Dataset updated
    Jun 20, 2022
    Dataset provided by
    figshare
    Authors
    Yuan-An Xiao
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    These are datasets containing software defects in C programs paired with corresponding patches and metadata, collected from public GitHub repositories.

    • GDD.7z contains 181,722 general defects
    • MDD.7z contains 48,076 memory-related defects

    [Note about compliance] These datasets are intended to help researchers evaluate the capabilities of deep learning in software engineering. They are not intended for commercial use, as the source repositories may carry their own licenses. Users of these datasets should check the license of each defect's repository on GitHub to see what is permitted. We have included the repository name of each defect in the corresponding metadata file for your convenience.

  16. Evaluation of the capacities of mouse TCR profiling from short read RNA-seq...

    • plos.figshare.com
    pdf
    Updated May 31, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Yu Bai; David Wang; Wentian Li; Ying Huang; Xuan Ye; Janelle Waite; Thomas Barry; Kurt H. Edelmann; Natasha Levenkova; Chunguang Guo; Dimitris Skokos; Yi Wei; Lynn E. Macdonald; Wen Fury (2023). Evaluation of the capacities of mouse TCR profiling from short read RNA-seq data [Dataset]. http://doi.org/10.1371/journal.pone.0207020
    Explore at:
    Available download formats: pdf
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Yu Bai; David Wang; Wentian Li; Ying Huang; Xuan Ye; Janelle Waite; Thomas Barry; Kurt H. Edelmann; Natasha Levenkova; Chunguang Guo; Dimitris Skokos; Yi Wei; Lynn E. Macdonald; Wen Fury
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Profiling T cell receptor (TCR) repertoire via short read transcriptome sequencing (RNA-Seq) has a unique advantage of probing simultaneously TCRs and the genome-wide RNA expression of other genes. However, compared to targeted amplicon approaches, the shorter read length is more prone to mapping error. In addition, only a small percentage of the genome-wide reads may cover the TCR loci and thus the repertoire could be significantly under-sampled. Although this approach has been applied in a few studies, the utility of transcriptome sequencing in probing TCR repertoires has not been evaluated extensively. Here we present a systematic assessment of RNA-Seq in TCR profiling. We evaluate the power of both Fluidigm C1 full-length single cell RNA-Seq and bulk RNA-Seq in characterizing the repertoires of different diversities under either naïve conditions or after immunogenic challenges. Standard read length and sequencing coverage were employed so that the evaluation was conducted in accord with the current RNA-Seq practices. Despite high sequencing depth in bulk RNA-Seq, we encountered difficulty quantifying TCRs with low transcript abundance (

  17. SV-TrustEval-C-1.0

    • huggingface.co
    Updated Jun 23, 2025
    Cite
    Yansong Li (2025). SV-TrustEval-C-1.0 [Dataset]. https://huggingface.co/datasets/LLMs4CodeSecurity/SV-TrustEval-C-1.0
    Explore at:
    Dataset updated
    Jun 23, 2025
    Authors
    Yansong Li
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    SV‑TrustEval‑C 🚨🔒

      🔍 Overview
    

    SV‑TrustEval‑C is the first reasoning‑based benchmark designed to rigorously evaluate Large Language Models (LLMs) on both structure (control/data flow) and semantic reasoning for vulnerability analysis in C source code. Unlike existing benchmarks that focus solely on pattern recognition, SV‑TrustEval‑C measures logical consistency, adaptability to code transformations, and real‑world security reasoning across six core tasks. Our… See the full description on the dataset page: https://huggingface.co/datasets/LLMs4CodeSecurity/SV-TrustEval-C-1.0.

  18. cmmlu_dpo_pairs

    • huggingface.co
    Updated Apr 18, 2024
    Cite
    Belandros Pan (2024). cmmlu_dpo_pairs [Dataset]. https://huggingface.co/datasets/wenbopan/cmmlu_dpo_pairs
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Apr 18, 2024
    Authors
    Belandros Pan
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card for cmmlu_dpo_pairs

    Preference pairs derived from the dev split of CMMLU and the val split of ceval-exam. A brute-force way to align an LLM's output distribution toward the multiple-choice answer style, in order to increase scores on MMLU and C-Eval.
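
    A minimal loading sketch with the Hugging Face datasets library; the split name is an assumption, and the pair fields are best inspected rather than assumed:

    from datasets import load_dataset

    # Split name "train" is an assumption; check the dataset card if it differs.
    pairs = load_dataset("wenbopan/cmmlu_dpo_pairs", split="train")
    print(pairs[0].keys())  # inspect the preference-pair fields (e.g. prompt/chosen/rejected)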

  19. c-seo-bench

    • huggingface.co
    Updated Jun 6, 2025
    + more versions
    Cite
    Parameter Lab (2025). c-seo-bench [Dataset]. https://huggingface.co/datasets/parameterlab/c-seo-bench
    Explore at:
    Dataset updated
    Jun 6, 2025
    Dataset authored and provided by
    Parameter Lab
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Summary

    C-SEO Bench is a benchmark designed to evaluate conversational search engine optimization (C-SEO) techniques across two common tasks: product recommendation and question answering. Each task spans multiple domains to assess domain-specific effects and generalization ability of C-SEO methods.

      Supported Tasks and Domains

      Product Recommendation

    This task requires an LLM to recommend the top-k products relevant to a user query, using only the… See the full description on the dataset page: https://huggingface.co/datasets/parameterlab/c-seo-bench.

  20. MediMeta-C

    • huggingface.co
    Updated May 31, 2025
    Cite
    Raza Imam (2025). MediMeta-C [Dataset]. https://huggingface.co/datasets/razaimam45/MediMeta-C
    Explore at:
    Dataset updated
    May 31, 2025
    Authors
    Raza Imam
    Description

    RobustMedCLIP: On the Robustness of Medical Vision-Language Models: Are they Truly Generalizable?

    Accepted at [Medical Image Understanding and Analysis (MIUA) 2025]

      🚀 Highlights
    

    🧠 MVLM Benchmarking: Evaluate 5 major and recent MVLMs across 5 modalities, 7 corruption types, and 5 severity levels
    📉 Corruption Evaluation: Analyze degradation under Gaussian noise, motion blur, pixelation, etc.
    🔬 MediMeta-C: A new benchmark simulating real-world OOD shifts in… See the full description on the dataset page: https://huggingface.co/datasets/razaimam45/MediMeta-C.
