Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
C-Eval is a comprehensive Chinese evaluation suite for foundation models. It consists of 13,948 multiple-choice questions spanning 52 diverse disciplines and four difficulty levels. Please visit our website and GitHub or check our paper for more details. Each subject consists of three splits: dev, val, and test. The dev set per subject consists of five exemplars with explanations for few-shot evaluation. The val set is intended to be used for hyperparameter tuning. And the test set is for model… See the full description on the dataset page: https://huggingface.co/datasets/ceval/ceval-exam.
This version adds a 'choices' column to the original dataset.
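A minimal sketch of how such a 'choices' column could be derived from the standard C-Eval option columns A, B, C, D (the exact construction used by this fork is an assumption):

```python
from datasets import load_dataset

# Load one C-Eval subject; the column names A, B, C, D follow the upstream dataset card.
ds = load_dataset("ceval/ceval-exam", name="computer_network")

def add_choices(example):
    # Hypothetical reconstruction: gather the four options into a single list column.
    example["choices"] = [example["A"], example["B"], example["C"], example["D"]]
    return example

ds_with_choices = ds.map(add_choices)
print(ds_with_choices["val"][0]["choices"])
```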
Citation
If you use the C-Eval benchmark or the code in your research, please cite the paper: @article{huang2023ceval, title={C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models}, author={Huang, Yuzhen and Bai, Yuzhuo and Zhu, Zhihao and Zhang, Junlei and Zhang, Jinghan and Su, Tangjun and Liu, Junteng and Lv, Chuancheng and Zhang, Yikai and Lei, Jiayi and Fu, Yao and Sun, Maosong and He, Junxian}… See the full description on the dataset page: https://huggingface.co/datasets/zacharyxxxxcr/ceval-exam.
https://choosealicense.com/licenses/cc/
Dataset Card for "ceval-exam-zhtw"
C-Eval is a comprehensive Chinese evaluation suite for foundation models, consisting of 13,948 multiple-choice questions that span 52 disciplines and four difficulty levels. See the original website and GitHub, or the paper, for more details. The original C-Eval data is written in Simplified Chinese and was designed to benchmark Simplified-Chinese LLMs; this dataset uses OpenCC to convert the text from Simplified to Traditional Chinese, mainly to support the development and evaluation of Traditional-Chinese LLMs.
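As a rough illustration of the Simplified-to-Traditional conversion OpenCC performs (the conversion profile actually used to build this dataset is an assumption):

```python
from opencc import OpenCC  # e.g. pip install opencc-python-reimplemented

# 's2t' converts Simplified Chinese characters to Traditional Chinese;
# whether this dataset used 's2t', 's2tw', or another profile is an assumption.
cc = OpenCC("s2t")
print(cc.convert("计算机网络"))  # expected: 計算機網絡
```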
Download
Load the dataset directly with Hugging Face datasets:

from datasets import load_dataset

dataset = load_dataset("erhwenkuo/ceval-exam-zhtw", name="computer_network")
print(dataset['val'][0])
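The dev split's five exemplars can then be turned into a few-shot prompt for evaluating on the val split; a rough sketch (column names follow the upstream C-Eval card, the prompt wording is an assumption):

```python
from datasets import load_dataset

ds = load_dataset("erhwenkuo/ceval-exam-zhtw", name="computer_network")

def format_item(item, with_answer):
    # Each item carries a question, four options A-D, and an answer letter.
    text = (f"{item['question']}\n"
            f"A. {item['A']}\nB. {item['B']}\nC. {item['C']}\nD. {item['D']}\n答案：")
    return text + (item["answer"] if with_answer else "")

# Five dev exemplars as the few-shot context, then one val question to answer.
shots = "\n\n".join(format_item(x, with_answer=True) for x in ds["dev"])
prompt = shots + "\n\n" + format_item(ds["val"][0], with_answer=False)
print(prompt)
```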
Subscribers can look up export and import data for 23 countries by HS code or product name. This demo is helpful for market analysis.
liangzid/robench-eval-Time17-c dataset hosted on Hugging Face and contributed by the HF Datasets community
HumanEval-X is a benchmark for evaluating the multilingual ability of code generative models. It consists of 820 high-quality human-crafted data samples (each with test cases) in Python, C++, Java, JavaScript, and Go, and can be used for various tasks, such as code generation and translation.
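Code-generation benchmarks of this kind are usually scored by executing the bundled test cases and reporting pass@k; a minimal sketch of the commonly used unbiased estimator (whether HumanEval-X's official harness uses exactly this implementation is not claimed here):

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate given n generated samples, c of which pass the tests."""
    if n - c < k:
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

print(pass_at_k(n=20, c=5, k=1))  # 0.25: with 5 of 20 samples passing, pass@1 is 25%
```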
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
This dataset is about: (Appendix C) Carbonate, organic carbon, and Rock-Eval pyrolysis at DSDP Hole 77-535. Please consult parent dataset @ https://doi.org/10.1594/PANGAEA.809136 for more information.
RedEval is a safety evaluation benchmark designed to assess the robustness of large language models (LLMs) against harmful prompts. It simulates and evaluates LLM applications across various scenarios, all while eliminating the need for human intervention. Here are the key aspects of RedEval:
Purpose: RedEval aims to evaluate LLM safety using a technique called Chain of Utterances (CoU)-based prompts. CoU prompts are effective at breaking the safety guardrails of various LLMs, including GPT-4, ChatGPT, and open-source models.
Safety Assessment: RedEval provides simple scripts to evaluate both closed-source systems (such as ChatGPT and GPT-4) and open-source LLMs on its benchmark. The evaluation focuses on harmful questions and computes the Attack Success Rate (ASR).
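The ASR itself is simply the fraction of harmful prompts whose responses the judge labels unsafe; a minimal sketch (the judge interface shown here is an assumption, not RedEval's actual scripts):

```python
def attack_success_rate(judge_labels):
    """judge_labels: one boolean per harmful question, True if the response was judged unsafe."""
    return sum(judge_labels) / len(judge_labels)

# e.g. 3 of 10 harmful questions elicited an unsafe answer -> ASR = 0.3
print(attack_success_rate([True, False, False, True, False,
                           False, True, False, False, False]))
```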
Question Banks:
- HarmfulQA: Consists of 1,960 harmful questions covering 10 topics and approximately 10 subtopics each.
- DangerousQA: Contains 200 harmful questions across 6 adjectives: racist, stereotypical, sexist, illegal, toxic, and harmful.
- CategoricalQA: Includes 11 categories of harm, each with 5 sub-categories, available in English, Chinese, and Vietnamese.
- AdversarialQA: Provides a set of 500 instructions to tease out harmful behaviors from the model.
Safety Alignment: RedEval also offers code to perform safety alignment of LLMs. For instance, it aligns Vicuna-7B on HarmfulQA, resulting in a safer version of Vicuna that is more robust against RedEval.
Installation:
- Create a conda environment: conda create --name redeval -c conda-forge python=3.11
- Activate the environment: conda activate redeval
- Install required packages: pip install -r requirements.txt
- Store API keys in the api_keys directory for use by the LLM-as-a-judge and by the generate_responses.py script for closed-source models.
Prompt Templates:
Choose a prompt template for red-teaming:

- Chain of Utterances (CoU): Effective at breaking safety guardrails.
- Chain of Thoughts (CoT)
- Standard prompt
- Suffix prompt

Note: Different LLMs may require slight variations in the prompt template.
How to Perform Red-Teaming:
- Step 0: Decide on the prompt template.
- Step 1: Generate model outputs on the harmful questions by providing a path to the question bank and the red-teaming prompt.
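A rough sketch of what Step 1 amounts to; the file layout, template, and model call below are placeholders, not RedEval's actual generate_responses.py:

```python
import json

def red_team(question_bank_path, cou_template, generate):
    """generate: any callable mapping a prompt string to a model response string."""
    with open(question_bank_path) as f:
        questions = json.load(f)  # assumed: a list of harmful question strings
    outputs = []
    for q in questions:
        prompt = cou_template.format(question=q)  # wrap the question in the red-teaming template
        outputs.append({"question": q, "response": generate(prompt)})
    return outputs
```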
CodeFuseEval is a Code Generation benchmark that combines the multi-tasking scenarios of CodeFuse Model with the benchmarks of HumanEval-x and MBPP. This benchmark is designed to evaluate the performance of models in various multi-tasking tasks, including code completion, code generation from natural language, test case generation, cross-language code translation, and code generation from Chinese commands, among others.
Evaluating the generated code involves compiling and running it in multiple programming languages. The versions of the programming-language environments and packages we use are as follows (a minimal compile-and-run sketch follows the table):
| Dependency | Version |
|---|---|
| Python | 3.10.9 |
| JDK | 18.0.2.1 |
| Node.js | 16.14.0 |
| js-md5 | 0.7.3 |
| C++ | 11 |
| g++ | 7.5.0 |
| Boost | 1.75.0 |
| OpenSSL | 3.0.0 |
| go | 1.18.4 |
| cargo | 1.71.1 |
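A minimal sketch of such a compile-and-run check for the C++ case (file layout and pass/fail convention are assumptions, not CodeFuseEval's actual evaluation scripts):

```python
import os
import subprocess
import tempfile

def cpp_solution_passes(source_code: str, timeout: int = 10) -> bool:
    """Compile a generated C++ solution (with its test cases embedded) and run it."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "solution.cpp")
        binary = os.path.join(tmp, "solution")
        with open(src, "w") as f:
            f.write(source_code)
        try:
            # g++ with -std=c++11, matching the toolchain versions listed above
            build = subprocess.run(["g++", "-std=c++11", src, "-o", binary],
                                   capture_output=True, timeout=timeout)
            if build.returncode != 0:
                return False
            run = subprocess.run([binary], capture_output=True, timeout=timeout)
        except subprocess.TimeoutExpired:
            return False
        return run.returncode == 0  # assumed: tests signal failure via assert / non-zero exit
```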
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract: The integration of neural-network-based systems into clinical practice is limited by challenges related to domain generalization and robustness. The computer vision community established benchmarks such as ImageNet-C as a fundamental prerequisite to measure progress towards those challenges. Similar datasets are largely absent in the medical imaging community which lacks a comprehensive benchmark that spans across imaging modalities and applications. To address this gap, we create and open-source MedMNIST-C, a benchmark dataset based on the MedMNIST+ collection, covering 12 datasets and 9 imaging modalities. We simulate task and modality-specific image corruptions of varying severity to comprehensively evaluate the robustness of established algorithms against real-world artifacts and distribution shifts. We further provide quantitative evidence that our simple-to-use artificial corruptions allow for highly performant, lightweight data augmentation to enhance model robustness. Unlike traditional, generic augmentation strategies, our approach leverages domain knowledge, exhibiting significantly higher robustness when compared to widely adopted methods. By introducing MedMNIST-C and open-sourcing the corresponding library allowing for targeted data augmentations, we contribute to the development of increasingly robust methods tailored to the challenges of medical imaging. The code is available at github.com/francescodisalvo05/medmnistc-api.
This work has been accepted at the Workshop on Advancing Data Solutions in Medical Imaging AI @ MICCAI 2024 [preprint].
Note: Due to space constraints, we have uploaded all datasets except TissueMNIST-C. However, it can be reproduced via our APIs.
Usage: We recommend using the demo code and tutorials available on our GitHub repository.
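As a rough illustration of corruption-style augmentation (a generic Gaussian-noise corruption, not the medmnistc-api interface):

```python
import numpy as np

def gaussian_noise(image: np.ndarray, severity: int = 1) -> np.ndarray:
    """Corrupt a uint8 image with Gaussian noise at one of five severity levels."""
    sigma = [0.04, 0.08, 0.12, 0.18, 0.26][severity - 1]  # illustrative severity-to-sigma mapping
    img = image.astype(np.float32) / 255.0
    noisy = img + np.random.normal(scale=sigma, size=img.shape)
    return (np.clip(noisy, 0.0, 1.0) * 255).astype(np.uint8)

# Used as augmentation: corrupt a random subset of training images with a random severity each epoch.
```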
Citation: If you find this work useful, please consider citing us:
@article{disalvo2024medmnist,
  title={MedMNIST-C: Comprehensive benchmark and improved classifier robustness by simulating realistic image corruptions},
  author={Di Salvo, Francesco and Doerrich, Sebastian and Ledig, Christian},
  journal={arXiv preprint arXiv:2406.17536},
  year={2024}
}
Disclaimer: This repository is inspired by MedMNIST APIs and the ImageNet-C repository. Thus, please also consider citing MedMNIST, the respective source datasets (described here), and ImageNet-C.
Average quarterly park evaluation scores from Q3 FY2005 to Q4 FY2014. These scores are collected and reported pursuant to 2003's Prop C, which requires city agencies to establish and publish standards for street, sidewalk, and park maintenance. Beginning in FY2015, a new methodology was developed to evaluate parks, so these scores should not be compared directly with scores reported in FY2015 and onward. Data from FY2015 onward is published and maintained by the SF Controller's Office.
The dataset SCARED-C is introduced in the context of assessing robustness in endoscopic depth prediction models. It is part of the EndoDepth benchmark, which is designed to evaluate the performance of monocular depth prediction models specifically for endoscopic scenarios. The dataset features 16 different types of image corruptions, each with five levels of severity, encompassing challenges like lens distortion, resolution alterations, specular reflection, and color changes that are typical in endoscopic imaging. The ground truth comes from the original SCARED test set.
The purpose of SCARED-C is to test the robustness of depth estimation models by exposing them to various common endoscopic corruptions. This dataset is a valuable tool for developing and evaluating depth prediction algorithms that can handle the unique challenges presented by endoscopic procedures, ensuring more accurate and reliable outcomes in medical imaging.
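Robustness on such a benchmark is typically summarized by averaging a standard depth-error metric, such as absolute relative error, over corruption types and severity levels; a small sketch (the exact aggregation used by EndoDepth is an assumption):

```python
import numpy as np

def abs_rel_error(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean absolute relative depth error over pixels with valid ground truth."""
    valid = gt > 0
    return float(np.mean(np.abs(pred[valid] - gt[valid]) / gt[valid]))

# Report, e.g., the mean abs-rel per corruption type, averaged over the five severity levels.
```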
The primary objectives for the initial treatment period of this study are to further evaluate the safety of natalizumab monotherapy by evaluating the risk of hypersensitivity reactions and immunogenicity following re-exposure to natalizumab and confirming the safety of switching from interferon (IFN), glatiramer acetate, or other multiple sclerosis (MS) therapies to natalizumab. The primary objective for the long-term treatment period of this study is to evaluate the long-term impact of natalizumab monotherapy on the progression of disability measured by Expanded Disability Status Scale (EDSS) changes over time.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction: Current high demand for effective odor detection dogs calls for the development of reliable methods for measuring performance-related behavioral phenotypes in these highly specialized working animals. The Canine Behavioral Assessment & Research Questionnaire (C-BARQ) is a widely used behavioral assessment tool among working dog organizations with a demonstrated ability to predict success/failure of dogs in training. However, this instrument was developed originally to study the prevalence of behavior problems in the pet dog population, and it therefore lacks the capacity to measure specific behavioral propensities that may also be important predictors of working dog success. The current paper examines the factor structure, internal reliability, and content validity of a modified version of the C-BARQ designed to evaluate four new domains of canine behavior in addition to those encompassed by the original C-BARQ. These domains, labeled Playfulness, Impulsivity, Distractibility, and Basophobia (fear of falling), respectively, describe aspects of canine behavior or temperament which are believed to contribute substantially to working dog performance.

Methods: Exploratory factor analysis (EFA) of owner/handler questionnaire responses based on a sample of 1,117 working odor detection dogs.

Results: A total of 15 factors were extracted by EFA, 10 of which correspond to original C-BARQ factors. The remaining 5 comprise the four new domains (Playfulness, Impulsivity, Distractibility, and Basophobia) as well as a fifth new factor labeled Food focus.

Discussion: The resulting Working Dog Canine Behavioral Assessment & Research Questionnaire (WDC-BARQ) successfully expands the measurement capacities of the original C-BARQ to include dimensions of behavior/temperament of particular relevance to many working dog populations.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
These are datasets containing software defects in C programs paired with corresponding patches and metadata, collected from public GitHub repositories.
[Note about compliance] These datasets are intended to help researchers evaluate the ability of deep learning in software engineering. They are not intended for commercial use, as the source repositories may have their own licenses. Users of these datasets should check the license of each defect on GitHub to see what is permitted. We have included the repository name of each defect in the corresponding metadata file for your convenience.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Profiling the T cell receptor (TCR) repertoire via short-read transcriptome sequencing (RNA-Seq) has the unique advantage of simultaneously probing TCRs and the genome-wide RNA expression of other genes. However, compared to targeted amplicon approaches, the shorter read length is more prone to mapping error. In addition, only a small percentage of the genome-wide reads may cover the TCR loci, so the repertoire could be significantly under-sampled. Although this approach has been applied in a few studies, the utility of transcriptome sequencing in probing TCR repertoires has not been evaluated extensively. Here we present a systematic assessment of RNA-Seq in TCR profiling. We evaluate the power of both Fluidigm C1 full-length single-cell RNA-Seq and bulk RNA-Seq in characterizing repertoires of different diversities under either naïve conditions or after immunogenic challenges. Standard read length and sequencing coverage were employed so that the evaluation was conducted in accordance with current RNA-Seq practices. Despite the high sequencing depth in bulk RNA-Seq, we encountered difficulty quantifying TCRs with low transcript abundance.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
SV‑TrustEval‑C 🚨🔒
🔍 Overview
SV‑TrustEval‑C is the first reasoning‑based benchmark designed to rigorously evaluate Large Language Models (LLMs) on both structure (control/data flow) and semantic reasoning for vulnerability analysis in C source code. Unlike existing benchmarks that focus solely on pattern recognition, SV‑TrustEval‑C measures logical consistency, adaptability to code transformations, and real‑world security reasoning across six core tasks. Our… See the full description on the dataset page: https://huggingface.co/datasets/LLMs4CodeSecurity/SV-TrustEval-C-1.0.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for cmmlu_dpo_pairs
Preference pairs derived from the dev split of cmmlu and the val split of ceval-exam. A brute-force way to align an LLM's output distribution toward the multiple-choice answer style in order to increase scores on mmlu and ceval.
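A rough sketch of how one such preference pair could be built from a single multiple-choice item (column names follow ceval-exam; the prompt format and choice of rejected answer are assumptions):

```python
import random

def to_dpo_pair(item):
    """Build a (prompt, chosen, rejected) triple from one multiple-choice example."""
    prompt = (f"{item['question']}\n"
              f"A. {item['A']}\nB. {item['B']}\nC. {item['C']}\nD. {item['D']}\n答案：")
    chosen = item["answer"]                                       # the correct letter
    rejected = random.choice([c for c in "ABCD" if c != chosen])  # any incorrect letter
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```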
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Summary
C-SEO Bench is a benchmark designed to evaluate conversational search engine optimization (C-SEO) techniques across two common tasks: product recommendation and question answering. Each task spans multiple domains to assess domain-specific effects and generalization ability of C-SEO methods.
Supported Tasks and Domains
Product Recommendation
This task requires an LLM to recommend the top-k products relevant to a user query, using only the… See the full description on the dataset page: https://huggingface.co/datasets/parameterlab/c-seo-bench.
RobustMedCLIP: On the Robustness of Medical Vision-Language Models: Are they Truly Generalizable?
Accepted at Medical Image Understanding and Analysis (MIUA) 2025.
🚀 Highlights
- 🧠 MVLM Benchmarking: Evaluate 5 major and recent MVLMs across 5 modalities, 7 corruption types, and 5 severity levels
- 📉 Corruption Evaluation: Analyze degradation under Gaussian noise, motion blur, pixelation, etc.
- 🔬 MediMeta-C: A new benchmark simulating real-world OOD shifts in… See the full description on the dataset page: https://huggingface.co/datasets/razaimam45/MediMeta-C.