Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
C-Eval is a comprehensive Chinese evaluation suite for foundation models. It consists of 13,948 multiple-choice questions spanning 52 diverse disciplines and four difficulty levels. Please visit our website and GitHub or check our paper for more details. Each subject consists of three splits: dev, val, and test. The dev set per subject consists of five exemplars with explanations for few-shot evaluation. The val set is intended to be used for hyperparameter tuning. And the test set is for model… See the full description on the dataset page: https://huggingface.co/datasets/ceval/ceval-exam.
This version adds a 'choices' column to the original dataset.
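A minimal sketch of how such a 'choices' column could be derived from the standard C-Eval option columns A, B, C, D (the exact construction used by this fork is an assumption):

```python
from datasets import load_dataset

# Load one C-Eval subject; the column names A, B, C, D follow the upstream dataset card.
ds = load_dataset("ceval/ceval-exam", name="computer_network")

def add_choices(example):
    # Hypothetical reconstruction: gather the four options into a single list column.
    example["choices"] = [example["A"], example["B"], example["C"], example["D"]]
    return example

ds_with_choices = ds.map(add_choices)
print(ds_with_choices["val"][0]["choices"])
```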
Citation
If you use the C-Eval benchmark or the code in your research, please cite the paper: @article{huang2023ceval, title={C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models}, author={Huang, Yuzhen and Bai, Yuzhuo and Zhu, Zhihao and Zhang, Junlei and Zhang, Jinghan and Su, Tangjun and Liu, Junteng and Lv, Chuancheng and Zhang, Yikai and Lei, Jiayi and Fu, Yao and Sun, Maosong and He, Junxian}… See the full description on the dataset page: https://huggingface.co/datasets/zacharyxxxxcr/ceval-exam.
https://choosealicense.com/licenses/cc/
Dataset Card for "ceval-exam-zhtw"
C-Eval is a comprehensive Chinese evaluation suite for foundation models, consisting of 13,948 multiple-choice questions that span 52 disciplines and four difficulty levels. See the original website and GitHub, or the paper, for more details. The original C-Eval data is written in Simplified Chinese and was designed to benchmark Simplified-Chinese LLMs; this dataset uses OpenCC to convert the text from Simplified to Traditional Chinese, mainly to support the development and evaluation of Traditional-Chinese LLMs.
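As a rough illustration of the Simplified-to-Traditional conversion OpenCC performs (the conversion profile actually used to build this dataset is an assumption):

```python
from opencc import OpenCC  # e.g. pip install opencc-python-reimplemented

# 's2t' converts Simplified Chinese characters to Traditional Chinese;
# whether this dataset used 's2t', 's2tw', or another profile is an assumption.
cc = OpenCC("s2t")
print(cc.convert("计算机网络"))  # expected: 計算機網絡
```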
Download
Load the dataset directly with Hugging Face datasets:

from datasets import load_dataset

dataset = load_dataset("erhwenkuo/ceval-exam-zhtw", name="computer_network")
print(dataset['val'][0])
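The dev split's five exemplars can then be turned into a few-shot prompt for evaluating on the val split; a rough sketch (column names follow the upstream C-Eval card, the prompt wording is an assumption):

```python
from datasets import load_dataset

ds = load_dataset("erhwenkuo/ceval-exam-zhtw", name="computer_network")

def format_item(item, with_answer):
    # Each item carries a question, four options A-D, and an answer letter.
    text = (f"{item['question']}\n"
            f"A. {item['A']}\nB. {item['B']}\nC. {item['C']}\nD. {item['D']}\n答案：")
    return text + (item["answer"] if with_answer else "")

# Five dev exemplars as the few-shot context, then one val question to answer.
shots = "\n\n".join(format_item(x, with_answer=True) for x in ds["dev"])
prompt = shots + "\n\n" + format_item(ds["val"][0], with_answer=False)
print(prompt)
```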
Subscribers can look up export and import data for 23 countries by HS code or product name. This demo is helpful for market analysis.
liangzid/robench-eval-Time17-c dataset hosted on Hugging Face and contributed by the HF Datasets community
HumanEval-X is a benchmark for evaluating the multilingual ability of code generative models. It consists of 820 high-quality human-crafted data samples (each with test cases) in Python, C++, Java, JavaScript, and Go, and can be used for various tasks, such as code generation and translation.
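Code-generation benchmarks of this kind are usually scored by executing the bundled test cases and reporting pass@k; a minimal sketch of the commonly used unbiased estimator (whether HumanEval-X's official harness uses exactly this implementation is not claimed here):

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate given n generated samples, c of which pass the tests."""
    if n - c < k:
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

print(pass_at_k(n=20, c=5, k=1))  # 0.25: with 5 of 20 samples passing, pass@1 is 25%
```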
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
This dataset is about: (Appendix C) Carbonate, organic carbon, and Rock-Eval pyrolysis at DSDP Hole 77-535. Please consult parent dataset @ https://doi.org/10.1594/PANGAEA.809136 for more information.
RedEval is a safety evaluation benchmark designed to assess the robustness of large language models (LLMs) against harmful prompts. It simulates and evaluates LLM applications across various scenarios, all while eliminating the need for human intervention. Here are the key aspects of RedEval:
Purpose: RedEval aims to evaluate LLM safety using a technique called Chain of Utterances (CoU)-based prompts. CoU prompts are effective at breaking the safety guardrails of various LLMs, including GPT-4, ChatGPT, and open-source models.
Safety Assessment: RedEval provides simple scripts to evaluate both closed-source systems (such as ChatGPT and GPT-4) and open-source LLMs on its benchmark. The evaluation focuses on harmful questions and computes the Attack Success Rate (ASR).
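The ASR itself is simply the fraction of harmful prompts whose responses the judge labels unsafe; a minimal sketch (the judge interface shown here is an assumption, not RedEval's actual scripts):

```python
def attack_success_rate(judge_labels):
    """judge_labels: one boolean per harmful question, True if the response was judged unsafe."""
    return sum(judge_labels) / len(judge_labels)

# e.g. 3 of 10 harmful questions elicited an unsafe answer -> ASR = 0.3
print(attack_success_rate([True, False, False, True, False,
                           False, True, False, False, False]))
```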
Question Banks:
- HarmfulQA: Consists of 1,960 harmful questions covering 10 topics and approximately 10 subtopics each.
- DangerousQA: Contains 200 harmful questions across 6 adjectives: racist, stereotypical, sexist, illegal, toxic, and harmful.
- CategoricalQA: Includes 11 categories of harm, each with 5 sub-categories, available in English, Chinese, and Vietnamese.
- AdversarialQA: Provides a set of 500 instructions to tease out harmful behaviors from the model.
Safety Alignment: RedEval also offers code to perform safety alignment of LLMs. For instance, it aligns Vicuna-7B on HarmfulQA, resulting in a safer version of Vicuna that is more robust against RedEval.
Installation:
- Create a conda environment: conda create --name redeval -c conda-forge python=3.11
- Activate the environment: conda activate redeval
- Install required packages: pip install -r requirements.txt
- Store API keys in the api_keys directory for use by the LLM-as-a-judge and by the generate_responses.py script for closed-source models.
Prompt Templates:
Choose a prompt template for red-teaming:

- Chain of Utterances (CoU): Effective at breaking safety guardrails.
- Chain of Thoughts (CoT)
- Standard prompt
- Suffix prompt

Note: Different LLMs may require slight variations in the prompt template.
How to Perform Red-Teaming:
- Step 0: Decide on the prompt template.
- Step 1: Generate model outputs on the harmful questions by providing a path to the question bank and the red-teaming prompt.
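A rough sketch of what Step 1 amounts to; the file layout, template, and model call below are placeholders, not RedEval's actual generate_responses.py:

```python
import json

def red_team(question_bank_path, cou_template, generate):
    """generate: any callable mapping a prompt string to a model response string."""
    with open(question_bank_path) as f:
        questions = json.load(f)  # assumed: a list of harmful question strings
    outputs = []
    for q in questions:
        prompt = cou_template.format(question=q)  # wrap the question in the red-teaming template
        outputs.append({"question": q, "response": generate(prompt)})
    return outputs
```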
CodeFuseEval is a Code Generation benchmark that combines the multi-tasking scenarios of CodeFuse Model with the benchmarks of HumanEval-x and MBPP. This benchmark is designed to evaluate the performance of models in various multi-tasking tasks, including code completion, code generation from natural language, test case generation, cross-language code translation, and code generation from Chinese commands, among others.
Evaluating the generated code involves compiling and running it in multiple programming languages. The versions of the programming-language environments and packages we use are as follows (a minimal compile-and-run sketch follows the table):
| Dependency | Version |
|---|---|
| Python | 3.10.9 |
| JDK | 18.0.2.1 |
| Node.js | 16.14.0 |
| js-md5 | 0.7.3 |
| C++ | 11 |
| g++ | 7.5.0 |
| Boost | 1.75.0 |
| OpenSSL | 3.0.0 |
| go | 1.18.4 |
| cargo | 1.71.1 |
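A minimal sketch of such a compile-and-run check for the C++ case (file layout and pass/fail convention are assumptions, not CodeFuseEval's actual evaluation scripts):

```python
import os
import subprocess
import tempfile

def cpp_solution_passes(source_code: str, timeout: int = 10) -> bool:
    """Compile a generated C++ solution (with its test cases embedded) and run it."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "solution.cpp")
        binary = os.path.join(tmp, "solution")
        with open(src, "w") as f:
            f.write(source_code)
        try:
            # g++ with -std=c++11, matching the toolchain versions listed above
            build = subprocess.run(["g++", "-std=c++11", src, "-o", binary],
                                   capture_output=True, timeout=timeout)
            if build.returncode != 0:
                return False
            run = subprocess.run([binary], capture_output=True, timeout=timeout)
        except subprocess.TimeoutExpired:
            return False
        return run.returncode == 0  # assumed: tests signal failure via assert / non-zero exit
```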
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract: The integration of neural-network-based systems into clinical practice is limited by challenges related to domain generalization and robustness. The computer vision community established benchmarks such as ImageNet-C as a fundamental prerequisite to measure progress towards those challenges. Similar datasets are largely absent in the medical imaging community which lacks a comprehensive benchmark that spans across imaging modalities and applications. To address this gap, we create and open-source MedMNIST-C, a benchmark dataset based on the MedMNIST+ collection, covering 12 datasets and 9 imaging modalities. We simulate task and modality-specific image corruptions of varying severity to comprehensively evaluate the robustness of established algorithms against real-world artifacts and distribution shifts. We further provide quantitative evidence that our simple-to-use artificial corruptions allow for highly performant, lightweight data augmentation to enhance model robustness. Unlike traditional, generic augmentation strategies, our approach leverages domain knowledge, exhibiting significantly higher robustness when compared to widely adopted methods. By introducing MedMNIST-C and open-sourcing the corresponding library allowing for targeted data augmentations, we contribute to the development of increasingly robust methods tailored to the challenges of medical imaging. The code is available at github.com/francescodisalvo05/medmnistc-api.
This work has been accepted at the Workshop on Advancing Data Solutions in Medical Imaging AI @ MICCAI 2024 [preprint].
Note: Due to space constraints, we have uploaded all datasets except TissueMNIST-C. However, it can be reproduced via our APIs.
Usage: We recommend using the demo code and tutorials available on our GitHub repository.
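As a rough illustration of corruption-style augmentation (a generic Gaussian-noise corruption, not the medmnistc-api interface):

```python
import numpy as np

def gaussian_noise(image: np.ndarray, severity: int = 1) -> np.ndarray:
    """Corrupt a uint8 image with Gaussian noise at one of five severity levels."""
    sigma = [0.04, 0.08, 0.12, 0.18, 0.26][severity - 1]  # illustrative severity-to-sigma mapping
    img = image.astype(np.float32) / 255.0
    noisy = img + np.random.normal(scale=sigma, size=img.shape)
    return (np.clip(noisy, 0.0, 1.0) * 255).astype(np.uint8)

# Used as augmentation: corrupt a random subset of training images with a random severity each epoch.
```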
Citation: If you find this work useful, please consider citing us:
@article{disalvo2024medmnist,
  title={MedMNIST-C: Comprehensive benchmark and improved classifier robustness by simulating realistic image corruptions},
  author={Di Salvo, Francesco and Doerrich, Sebastian and Ledig, Christian},
  journal={arXiv preprint arXiv:2406.17536},
  year={2024}
}
Disclaimer: This repository is inspired by MedMNIST APIs and the ImageNet-C repository. Thus, please also consider citing MedMNIST, the respective source datasets (described here), and ImageNet-C.
Average quarterly park evaluation scores from Q3 FY2005 to Q4 FY2014. These scores are collected and reported pursuant to 2003's Prop C, which requires city agencies to establish and publish standards for street, sidewalk, and park maintenance. Beginning in FY2015, a new methodology was developed to evaluate parks, so these scores should not be compared directly with scores reported in FY2015 and onward. Data from FY2015 onward is published and maintained by the SF Controller's Office.
The dataset SCARED-C is introduced in the context of assessing robustness in endoscopic depth prediction models. It is part of the EndoDepth benchmark, which is designed to evaluate the performance of monocular depth prediction models specifically for endoscopic scenarios. The dataset features 16 different types of image corruptions, each with five levels of severity, encompassing challenges like lens distortion, resolution alterations, specular reflection, and color changes that are typical in endoscopic imaging. The ground truth comes from the original SCARED test set.
The purpose of SCARED-C is to test the robustness of depth estimation models by exposing them to various common endoscopic corruptions. This dataset is a valuable tool for developing and evaluating depth prediction algorithms that can handle the unique challenges presented by endoscopic procedures, ensuring more accurate and reliable outcomes in medical imaging.
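Robustness on such a benchmark is typically summarized by averaging a standard depth-error metric, such as absolute relative error, over corruption types and severity levels; a small sketch (the exact aggregation used by EndoDepth is an assumption):

```python
import numpy as np

def abs_rel_error(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean absolute relative depth error over pixels with valid ground truth."""
    valid = gt > 0
    return float(np.mean(np.abs(pred[valid] - gt[valid]) / gt[valid]))

# Report, e.g., the mean abs-rel per corruption type, averaged over the five severity levels.
```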
The primary objectives for the initial treatment period of this study are to further evaluate the safety of natalizumab monotherapy by evaluating the risk of hypersensitivity reactions and immunogenicity following re-exposure to natalizumab and confirming the safety of switching from interferon (IFN), glatiramer acetate, or other multiple sclerosis (MS) therapies to natalizumab. The primary objective for the long-term treatment period of this study is to evaluate the long-term impact of natalizumab monotherapy on the progression of disability measured by Expanded Disability Status Scale (EDSS) changes over time.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction: Current high demand for effective odor detection dogs calls for the development of reliable methods for measuring performance-related behavioral phenotypes in these highly specialized working animals. The Canine Behavioral Assessment & Research Questionnaire (C-BARQ) is a widely used behavioral assessment tool among working dog organizations with a demonstrated ability to predict success/failure of dogs in training. However, this instrument was developed originally to study the prevalence of behavior problems in the pet dog population, and it therefore lacks the capacity to measure specific behavioral propensities that may also be important predictors of working dog success. The current paper examines the factor structure, internal reliability, and content validity of a modified version of the C-BARQ designed to evaluate four new domains of canine behavior in addition to those encompassed by the original C-BARQ. These domains, labeled Playfulness, Impulsivity, Distractibility, and Basophobia (fear of falling), respectively, describe aspects of canine behavior or temperament which are believed to contribute substantially to working dog performance.

Methods: Exploratory factor analysis (EFA) of owner/handler questionnaire responses based on a sample of 1,117 working odor detection dogs.

Results: A total of 15 factors were extracted by EFA, 10 of which correspond to original C-BARQ factors. The remaining 5 comprise the four new domains (Playfulness, Impulsivity, Distractibility, and Basophobia) as well as a fifth new factor labeled Food focus.

Discussion: The resulting Working Dog Canine Behavioral Assessment & Research Questionnaire (WDC-BARQ) successfully expands the measurement capacities of the original C-BARQ to include dimensions of behavior/temperament of particular relevance to many working dog populations.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These are datasets containing software defects in C programs paired with corresponding patches and metadata, collected from public GitHub repositories.
[Note about compliance] These datasets are intended to help researchers evaluate the ability of deep learning in software engineering. They are not intended for commercial use, as the source repositories may have their own licenses. Users of these datasets should check the license of each defect on GitHub to see what is permitted. We have included the repository name of each defect in the corresponding metadata file for your convenience.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Profiling the T cell receptor (TCR) repertoire via short-read transcriptome sequencing (RNA-Seq) has the unique advantage of simultaneously probing TCRs and the genome-wide RNA expression of other genes. However, compared to targeted amplicon approaches, the shorter read length is more prone to mapping error. In addition, only a small percentage of the genome-wide reads may cover the TCR loci, so the repertoire could be significantly under-sampled. Although this approach has been applied in a few studies, the utility of transcriptome sequencing in probing TCR repertoires has not been evaluated extensively. Here we present a systematic assessment of RNA-Seq in TCR profiling. We evaluate the power of both Fluidigm C1 full-length single-cell RNA-Seq and bulk RNA-Seq in characterizing repertoires of different diversities under either naïve conditions or after immunogenic challenges. Standard read length and sequencing coverage were employed so that the evaluation was conducted in accordance with current RNA-Seq practices. Despite the high sequencing depth in bulk RNA-Seq, we encountered difficulty quantifying TCRs with low transcript abundance.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
SV‑TrustEval‑C 🚨🔒
🔍 Overview
SV‑TrustEval‑C is the first reasoning‑based benchmark designed to rigorously evaluate Large Language Models (LLMs) on both structure (control/data flow) and semantic reasoning for vulnerability analysis in C source code. Unlike existing benchmarks that focus solely on pattern recognition, SV‑TrustEval‑C measures logical consistency, adaptability to code transformations, and real‑world security reasoning across six core tasks. Our… See the full description on the dataset page: https://huggingface.co/datasets/LLMs4CodeSecurity/SV-TrustEval-C-1.0.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for cmmlu_dpo_pairs
Preference pairs derived from the dev split of cmmlu and the val split of ceval-exam. A brute-force way to align an LLM's output distribution toward the multiple-choice answer style in order to increase scores on mmlu and ceval.
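A rough sketch of how one such preference pair could be built from a single multiple-choice item (column names follow ceval-exam; the prompt format and choice of rejected answer are assumptions):

```python
import random

def to_dpo_pair(item):
    """Build a (prompt, chosen, rejected) triple from one multiple-choice example."""
    prompt = (f"{item['question']}\n"
              f"A. {item['A']}\nB. {item['B']}\nC. {item['C']}\nD. {item['D']}\n答案：")
    chosen = item["answer"]                                       # the correct letter
    rejected = random.choice([c for c in "ABCD" if c != chosen])  # any incorrect letter
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```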
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Summary
C-SEO Bench is a benchmark designed to evaluate conversational search engine optimization (C-SEO) techniques across two common tasks: product recommendation and question answering. Each task spans multiple domains to assess domain-specific effects and generalization ability of C-SEO methods.
Supported Tasks and Domains
Product Recommendation
This task requires an LLM to recommend the top-k products relevant to a user query, using only the… See the full description on the dataset page: https://huggingface.co/datasets/parameterlab/c-seo-bench.
RobustMedCLIP: On the Robustness of Medical Vision-Language Models: Are they Truly Generalizable?
Accepted at Medical Image Understanding and Analysis (MIUA) 2025.
🚀 Highlights
- 🧠 MVLM Benchmarking: Evaluate 5 major and recent MVLMs across 5 modalities, 7 corruption types, and 5 severity levels
- 📉 Corruption Evaluation: Analyze degradation under Gaussian noise, motion blur, pixelation, etc.
- 🔬 MediMeta-C: A new benchmark simulating real-world OOD shifts in… See the full description on the dataset page: https://huggingface.co/datasets/razaimam45/MediMeta-C.