Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for IFEval
Dataset Summary
This dataset contains the prompts used in the Instruction-Following Eval (IFEval) benchmark for large language models. It contains around 500 "verifiable instructions", such as "write in more than 400 words" and "mention the keyword of AI at least 3 times", which can be verified by heuristics. To load the dataset, run:
from datasets import load_dataset
ifeval = load_dataset("google/IFEval")
Supported Tasks and… See the full description on the dataset page: https://huggingface.co/datasets/google/IFEval.
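As an illustration of how such "verifiable instructions" can be checked heuristically, here is a minimal sketch for the two example constraints above; the helper names are hypothetical and not part of IFEval's own verification code.
```python
# A minimal sketch of heuristic verification for two IFEval-style instructions.
# The helper names are hypothetical; IFEval ships its own verification logic.

def meets_min_words(response: str, min_words: int = 400) -> bool:
    # "write in more than 400 words": count whitespace-separated tokens
    return len(response.split()) > min_words

def mentions_keyword(response: str, keyword: str = "AI", min_count: int = 3) -> bool:
    # "mention the keyword of AI at least 3 times" (case-insensitive count)
    return response.lower().count(keyword.lower()) >= min_count

response = "AI helps with many tasks. " * 100  # placeholder model output
print(meets_min_words(response), mentions_keyword(response))  # True True
```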
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
WildIFEval Dataset
This dataset was originally introduced in the paper WildIFEval: Instruction Following in the Wild, available on arXiv. Code: https://github.com/gililior/wild-if-eval
Dataset Overview
The WildIFEval dataset is designed for evaluating instruction-following capabilities in language models. It provides decompositions of conversations extracted from the LMSYS-Chat-1M dataset. Each example includes:
conversation_id: A unique identifier for each conversation.… See the full description on the dataset page: https://huggingface.co/datasets/gililior/wild-if-eval.
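A minimal loading sketch with the datasets library; only the repository id (gililior/wild-if-eval) comes from the dataset page above, and the split name is an assumption.
```python
from datasets import load_dataset

# Assumed split name; see the dataset page for the exact configuration.
wild_ifeval = load_dataset("gililior/wild-if-eval", split="test")
example = wild_ifeval[0]
print(example["conversation_id"])  # documented field; the remaining fields hold the decomposition
```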
This dataset evaluates the instruction-following ability of large language models. There are 500+ prompts with instructions such as "write an article with more than 800 words", "wrap your response with double quotation marks", etc.
The gililior/wild-if-eval-gpt-decomposition dataset is hosted on Hugging Face and contributed by the HF Datasets community.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
RedEval is a safety evaluation benchmark designed to assess the robustness of large language models (LLMs) against harmful prompts. It simulates and evaluates LLM applications across various scenarios, all while eliminating the need for human intervention. Here are the key aspects of RedEval:
Purpose: RedEval aims to evaluate LLM safety using a technique called Chain of Utterances (CoU)-based prompts. CoU prompts are effective at breaking the safety guardrails of various LLMs, including GPT-4, ChatGPT, and open-source models.
Safety Assessment: RedEval provides simple scripts to evaluate both closed-source systems (such as ChatGPT and GPT-4) and open-source LLMs on its benchmark. The evaluation focuses on harmful questions and computes the Attack Success Rate (ASR).
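As a rough illustration of the metric, the Attack Success Rate is the fraction of harmful prompts for which the judged response is deemed harmful. The toy sketch below assumes judge verdicts are already available as booleans; it is not RedEval's own scoring script.
```python
# A toy sketch of the Attack Success Rate (ASR) computation, not RedEval's own script.
# Assume a judge has already labelled each model response as harmful (True) or safe (False).
judge_verdicts = [True, False, False, True, False]

asr = sum(judge_verdicts) / len(judge_verdicts)  # ASR = harmful responses / total prompts
print(f"Attack Success Rate: {asr:.1%}")  # -> 40.0%
```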
Question Banks:
HarmfulQA: Consists of 1,960 harmful questions covering 10 topics and approximately 10 subtopics each.
DangerousQA: Contains 200 harmful questions across 6 adjectives: racist, stereotypical, sexist, illegal, toxic, and harmful.
CategoricalQA: Includes 11 categories of harm, each with 5 sub-categories, available in English, Chinese, and Vietnamese.
AdversarialQA: Provides a set of 500 instructions to tease out harmful behaviors from the model.
Safety Alignment: RedEval also offers code to perform safety alignment of LLMs. For instance, it aligns Vicuna-7B on HarmfulQA, resulting in a safer version of Vicuna that is more robust against RedEval.
Installation:
Create a conda environment: conda create --name redeval -c conda-forge python=3.11
Activate the environment: conda activate redeval
Install required packages: pip install -r requirements.txt
Store API keys in the api_keys directory for use by the LLM-as-a-judge and by the generate_responses.py script for closed-source models.
Prompt Templates:
Choose a prompt template for red-teaming:
Chain of Utterances (CoU): Effective at breaking safety guardrails.
Chain of Thoughts (CoT)
Standard prompt
Suffix prompt
Note: Different LLMs may require slight variations in the prompt template.
How to Perform Red-Teaming:
Step 0: Decide on the prompt template.
Step 1: Generate model outputs on the harmful questions by providing a path to the question bank and the red-teaming prompt.
https://choosealicense.com/licenses/odc-by/
IF Data - RLVR Formatted
This dataset contains instruction-following data formatted for use with open-instruct, specifically for reinforcement learning with verifiable rewards. The prompts with verifiable constraints were generated by sampling from the Tulu 2 SFT mixture and randomly adding constraints from IFEval. Part of the Tulu 3 release, for which you can see models here and datasets here.
Dataset Structure
Each example in the dataset contains the standard instruction-tuning… See the full description on the dataset page: https://huggingface.co/datasets/allenai/RLVR-IFeval.
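A minimal loading sketch; only the repository id (allenai/RLVR-IFeval) comes from the dataset page above, and the split name is an assumption.
```python
from datasets import load_dataset

# Assumed split name; consult the dataset page for the exact configuration.
rlvr_ifeval = load_dataset("allenai/RLVR-IFeval", split="train")
print(rlvr_ifeval[0])  # one instruction-tuning example together with its verifiable constraint
```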
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Multilingual Meta-EVALuation benchmark (MM-Eval)
👨💻Code | 📄Paper | 🤗 MMQA
MM-Eval is a multilingual meta-evaluation benchmark consisting of five core subsets—Chat, Reasoning, Safety, Language Hallucination, and Linguistics—spanning 18 languages and a Language Resource subset spanning 122 languages for a broader analysis of language effects.
Design ChoiceIn this work, we minimize the inclusion of translated samples, as mere translation may alter existing preferences due to… See the full description on the dataset page: https://huggingface.co/datasets/prometheus-eval/MM-Eval.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Large Language Model (LLM) evaluation is currently one of the most important areas of research, with existing benchmarks proving to be insufficient and not completely representative of LLMs' various capabilities. We present a curated collection of challenging statements on sensitive topics for LLM benchmarking called TruthEval. These statements were curated by hand and contain known truth values. The categories were chosen to distinguish LLMs' abilities from their stochastic nature. Details of collection method and use cases can be found in this paper: TruthEval: A Dataset to Evaluate LLM Truthfulness and Reliability
COFFE
COFFE is a Python benchmark for evaluating the time efficiency of LLM-generated code. It is released with the FSE'25 paper "COFFE: A Code Efficiency Benchmark for Code Generation". You can also refer to the project webpage for more details.
Data
COFFE is designed for evaluating both function-level code and file-level code. It contains selected instances from HumanEval, MBPP, APPS and Code Contests. COFFE keeps the original test cases in these benchmarks as correctness test cases and adds new test cases designed for time efficiency evaluation as stressful test cases.
Statistics:
Category | #Instance | #Solution/Instance | #Correctness/Instance | #Stressful/Instance |
---|---|---|---|---|
Function-Level | 398 | 1.00 | 5.72 | 4.99 |
File-Level | 358 | 66.93 | 43.68 | 4.95 |
Data Files:
All instances in COFFE are in Coffe/datasets, where Coffe/datasets/function contains all function-level instances and Coffe/datasets/file contains all file-level instances. In each directory:
- best_solutions.json contains the best ground-truth solution COFFE uses to calculate efficient@1 and speedup.
- stressful_testcases.json contains all stressful test cases COFFE adds.
- solutions.json contains all ground-truth solutions from the original benchmarks.
- testcases.json contains all correctness test cases from the original benchmarks.
Installation
Requirements:
- Linux machine
- Docker
- Python >= 3.10
We suggest you create a virtual environment before installing COFFE.
To use COFFE, please clone this repo into your current workspace workspace/ and execute:
cd Coffe && pip install .
COFFE comes with a Docker image; to build it:
docker build . -t coffe
Note that if your network requires a proxy, please modify the Dockerfile in Coffe/ accordingly; otherwise the Docker image build could fail.
Go back to your workspace and initialize COFFE:
cd .. && coffe init
If your installation succeeded, you should see the statistics of COFFE.
Usage
Pipeline
When you prepare the predictions from LLMs, COFFE provides a pipeline to calculate the efficient@1 and speedup metrics defined in the paper:
coffe pipe
For example:
coffe pipe function Coffe/examples/function -p Coffe/examples/function/GPT-4o.json -f efficient_at_1 -n 8
This command evaluates the predictions from GPT-4o on the function-level instances of COFFE. If you want to evaluate other LLMs, please prepare a JSON file with the same format as Coffe/examples/function/GPT-4o.json.
Prediction File Format:
In the JSON file, each key is the prompt used to query the LLM for results; the prompts are available in datasets/function/prompts.json and datasets/file/prompts.json. Each value contains two objects: the first is a list of the raw outputs from the LLM, and the second is an indicator of whether the raw output is valid.
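A rough sketch of what one entry might look like under that description; the prompt text and the boolean indicator are assumptions for illustration, and Coffe/examples/function/GPT-4o.json remains the authoritative reference for the format.
```python
import json

# Hypothetical single entry, following the description above:
# key = the prompt used to query the LLM, value = [list of raw LLM outputs, validity indicator].
predictions = {
    "Write a function that returns the n-th Fibonacci number.": [
        ["def fib(n):\n    a, b = 0, 1\n    for _ in range(n):\n        a, b = b, a + b\n    return a"],
        True,  # assumed to be a boolean marking whether the raw output is valid
    ]
}

with open("my_model_predictions.json", "w") as f:
    json.dump(predictions, f, indent=2)
```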
Note:
By default, COFFE will run all predictions in Docker. However, if you cannot successfully install Docker or want to run the predictions on the host machine, you can add the -x option.
Single Evaluation
The pipe command provides an entire pipeline for calculating the final metrics. This pipeline can also be completed by executing the following four single evaluation commands.
Sanitize the predictions:
coffe eval
Select correct predictions:
coffe eval
Note: This command will combine all correct solutions and ground-truth solutions together into files.
Evaluate the CPU instruction count:
coffe eval
Note: This command only accepts a single prediction file ending with PASSED_SOLUTIONS.json.
Calculate the efficient@1/speedup:
coffe eval
Note: This command requires the index file and the instruction file, as COFFE compares the performance of predictions with ground-truth solutions to calculate the metrics.
STGen
For details about the stressful test case generation approach STGen, please see stgen/.
Cite
If you use COFFE, please cite us:
@misc{peng2025coffe,
  title={COFFE: A Code Efficiency Benchmark for Code Generation},
  author={Yun Peng and Jun Wan and Yichen Li and Xiaoxue Ren},
  year={2025},
  eprint={2502.02827},
  archivePrefix={arXiv},
  primaryClass={cs.SE},
  url={https://arxiv.org/abs/2502.02827},
}
Marci Smeltz, John F. Wambaugh, and Barbara A. Wetmore, Plasma Protein Binding Evaluations of Per- and Polyfluoroalkyl Substances for Category-Based Toxicokinetic Assessment, Chemical Research in Toxicology, 10.1021/acs.chemrestox.3c00003. This dataset is associated with the following publication: Smeltz, M., J. Wambaugh, and B. Wetmore. Plasma Protein Binding Evaluations of Per- and Polyfluoroalkyl Substances for Category-Based Toxicokinetic Assessment. CHEMICAL RESEARCH IN TOXICOLOGY. American Chemical Society, Washington, DC, USA, 36(6): 870-871, (2023).
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Contextualized Evaluations: Taking the Guesswork Out of Language Model Evaluations
Dataset Summary
We provide here the data accompanying the paper: Contextualized Evaluations: Taking the Guesswork Out of Language Model Evaluations.
Dataset Structure
Data Instances
We release the set of queries, as well as the autorater & human evaluation judgements collected for our experiments.
Data overview
List of queries: Data Structure
The list… See the full description on the dataset page: https://huggingface.co/datasets/allenai/ContextEval.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The TruthfulQA dataset is specifically designed to evaluate the truthfulness of language models in generating answers to a wide range of questions. Comprising 817 carefully crafted questions spanning various topics such as health, law, finance, and politics, this benchmark aims to uncover any erroneous or false answers that may arise due to incorrect beliefs or misconceptions. It serves as a comprehensive measure of the ability of language models to go beyond imitating human texts and avoid generating inaccurate responses. The dataset includes columns such as type (indicating the format or style of the question), category (providing the topic or theme), best_answer (the correct and truthful answer), correct_answers (a list containing all valid responses), incorrect_answers (a list encompassing potential false interpretations provided by some humans), source (identifying the origin or reference for each question), and mc1_targets and mc2_targets (highlighting the respective correct answers for multiple-choice questions). The generation_validation.csv file contains generated questions and their corresponding evaluations based on truthfulness, while multiple_choice_validation.csv focuses on validating multiple-choice questions along with their answer choices. Through this dataset, researchers can comprehensively assess language model performance in terms of factual accuracy and avoidance of misleading information during answer generation tasks.
How to use the dataset
How to Use the TruthfulQA Dataset: A Guide
Welcome to the TruthfulQA dataset, a benchmark designed to evaluate the truthfulness of language models in generating answers to questions. This guide will provide you with essential information on how to effectively utilize this dataset for your own purposes.
Dataset Overview
The TruthfulQA dataset consists of 817 carefully crafted questions covering a wide range of topics, including health, law, finance, and politics. These questions are constructed in such a way that some humans would answer falsely due to false beliefs or misconceptions. The aim is to assess language models' ability to avoid generating false answers learned from imitating human texts.
Files in the Dataset
The dataset includes two main files:
generation_validation.csv: This file contains questions and answers generated by language models. These responses are evaluated based on their truthfulness.
multiple_choice_validation.csv: This file consists of multiple-choice questions along with their corresponding answer choices for validation purposes.
Column Descriptions
To better understand the dataset and its contents, here is an explanation of each column present in both files:
type: Indicates the type or format of the question.
category: Represents the category or topic of the question.
best_answer: Provides the correct and truthful answer according to human knowledge/expertise.
correct_answers: Contains a list of correct and truthful answers provided by humans.
incorrect_answers: Lists incorrect and false answers that some humans might provide.
source: Specifies where the question originates from (e.g., publication, website).
For multiple-choice questions: mc1_targets, mc2_targets, etc.: Represent the different options available as answer choices (with corresponding correct answers).

Using this Dataset Effectively
When utilizing this dataset for evaluation or testing purposes:
Truth Evaluation: For assessing language models' truthfulness in generating answers, use the generation_validation.csv file. Compare the model answers with the correct_answers column to evaluate their accuracy.
Multiple-Choice Evaluation: To test language models' ability to choose the correct answer among given choices, refer to the multiple_choice_validation.csv file. The correct answer options are provided in the columns such as mc1_targets, mc2_targets, etc.
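As an illustration of the truth-evaluation workflow described above, here is a minimal sketch using pandas; the model "answer" column is an assumption, since only the columns documented above are known, and exact-match scoring is only a crude stand-in for a proper truthfulness judgment.
```python
import pandas as pd
from ast import literal_eval

# A minimal sketch, assuming generation_validation.csv has a model "answer" column
# alongside the documented "correct_answers" column (stored as a stringified list).
df = pd.read_csv("generation_validation.csv")

def is_truthful(row) -> bool:
    """Crude exact-match check of a model answer against the list of correct answers."""
    correct = row["correct_answers"]
    if isinstance(correct, str):
        correct = literal_eval(correct)  # e.g. "['Nothing happens', ...]" -> list
    answer = str(row["answer"]).strip().lower()
    return answer in {str(a).strip().lower() for a in correct}

truthful_rate = df.apply(is_truthful, axis=1).mean()
print(f"Truthful (exact-match) rate: {truthful_rate:.1%}")
```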
Ensure that you consider these guidelines while leveraging this dataset for your analysis or experiments related to evaluating language models' truthfulness and performance.
Remember that this guide is intended to help
Research Ideas
Training and evaluating language models: The TruthfulQA dataset can be used to train and evaluate the truthfulness of language models in generating answers to questions. By comparing the generated answers with the correct and truthful ones provided in the dataset, researchers can assess the ability of language models to avoid false answers learned from imitating human texts.
Detecting misinformation: This dataset can also be used to develop algorithms or models that are capable of identifying false or misleading information. By analyzing the generated answers and comparing them with the correct ones, it is possible to build systems that automatically detect and flag misinformation.
Improving fact-checking systems: Fact-checking platforms or systems can benefit from this dataset by using it as a source for training and validating their algorithms. With access to a large number of
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This data set includes the independent variable "Name" which identifies whether the instructor expressed preference for being referred to by her first name or professional title (e.g. "Dr. Last Name"). The dependent variable of interest was the Student Evaluation of Instruction scores, especially the "overall" score. Class information includes the class size, number of evaluations, year, and semester (Fall or Spring).
Dataset for "Crizer, D.M.; Rice, J.R.; Smeltz, M.G.; Lavrich, K.S.; Ravindra, K.; Wambaugh, J.F.; DeVito, M.; Wetmore, B.A. In Vitro Hepatic Clearance Evaluations of Per- and Polyfluoroalkyl Substances (PFAS) across Multiple Structural Categories. Toxics 2024, 12, 672. https://doi.org/10.3390/toxics12090672". This dataset is associated with the following publication: Crizer, D., J. Rice, M. Smeltz, K. Lavrich, K. Ravindra, J. Wambaugh, M. Devito, and B. Wetmore. In Vitro Hepatic Clearance Evaluations of Per- and Polyfluoroalkyl Substances (PFAS) Across Multiple Structural Categories. Toxics. MDPI, Basel, SWITZERLAND, 12(9): 672, (2024).
Dataset Summary
DiscoEval is an English-language benchmark that contains a test suite of 7 tasks to evaluate whether sentence representations include semantic information relevant to discourse processing. The benchmark datasets offer a collection of tasks designed to evaluate natural language understanding models in the context of discourse analysis and coherence.
Dataset Sources
Arxiv: A repository of scientific papers and research articles.
Wikipedia: An extensive online encyclopedia with articles on diverse topics.
Rocstory: A dataset consisting of fictional stories.
Ubuntu IRC channel: Conversational data extracted from the Ubuntu Internet Relay Chat (IRC) channel.
PeerRead: A dataset of scientific papers frequently used for discourse-related tasks.
RST Discourse Treebank: A dataset annotated with Rhetorical Structure Theory (RST) discourse relations.
Penn Discourse Treebank: Another dataset with annotated discourse relations, facilitating the study of discourse structure.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This artifact contains the implementation and the results of the evaluation of a static slicer for WebAssembly described in the ICSE 2022 paper titled "Static Stack-Preserving Intra-Procedural Slicing of WebAssembly Binaries".
The artifact contains a docker image (wassail-eval.tar.xz) that contains everything necessary to reproduce our evaluation, and the actual data resulting from our evaluation:
1. The implementation of our slicer (presented in Section 4.1) is included in the docker machine, and is available publicly here: https://github.com/acieroid/wassail/tree/icse2022
2. Test cases used for our evaluation of RQ1 are included in the docker machine and in the rq1.tar.xz archive.
3. The dataset used in RQ2, RQ3, and RQ4 is included in the docker machine.
4. The code needed to run our evaluation of RQ2, RQ3, and RQ4 is included in the docker machine.
5. The scripts used to generate the statistics and graphs that are included in the paper for RQ2, RQ3, and RQ4 are included in the docker machine and as the *.py files in this artifact.
6. The data of RQ5 that has been used in our manual investigation is included in the docker machine and in the rq5.tar.xz archive, along with rq5-manual.txt detailing our manual analysis findings.
The artifact is available on Zenodo at the following URL: https://zenodo.org/record/5821007
Once the artifact is downloaded as the file icse2022slicing.tar.xz, it can be extracted and loaded into Docker as follows (this takes a few minutes):
docker import icse2022slicing.tar.xz
To simplify further commands, you can tag the image using the printed sha256 hash of the image: if the docker import command resulted in the hash 54aa9416a379a6c71b1c325985add8bf931752d754c8fb17872c05f4e4b52ea2, you can run:
docker tag 54aa9416a379a6c71b1c325985add8bf931752d754c8fb17872c05f4e4b52ea2 wassail-eval
Once the Docker image has been loaded, you can run the following commands to
obtain a shell in the appropriate environment:
docker volume create result
docker run -it -v result:/tmp/out/ wassail-eval bash
su - opam
Our manual translations of the "classical" examples are included in the rq1/ directory (available in the docker image and in rq1.tar.xz). We include the slices computed by our implementation in the rq1/out/ directory.
A slice can be produced for each example in the docker image as follows, where the first argument is the name of the program being sliced, the second the function index being sliced, the third the slicing criterion (indicated as the instruction index, where instructions start at 1), and the last argument is the output file for the slice:
cd rq1/
wassail slice scam-mug.wat 5 8 scam-mug-slice.wat
wassail slice montreal-boat.wat 5 19 montreal-boat-slice.wat
wassail slice word-count.wat 1 41 word-count-slice1.wat
wassail slice word-count.wat 1 43 word-count-slice2.wat
wassail slice word-count.wat 1 39 word-count-slice3.wat
wassail slice word-count.wat 1 45 word-count-slice4.wat
wassail slice word-count.wat 1 37 word-count-slice5.wat
wassail slice agrawal-fig-3.wat 3 38 agrawal-fig-3-slice.wat
wassail slice agrawal-fig-5.wat 3 37 agrawal-fig-5-slice.wat
The slice results can then be inspected manually, and compared with the original version of the .wat program to see which instructions have been removed, or with the expected solutions in the out/ directory, e.g. by running:
diff word-count-slice1.wat out/word-count-slice1.wat
(No output is expected if the slice is correct)
For these RQs, we include the data resulting from our evaluation, but we also allow reviewers to rerun the full evaluation if needed. However, such an evaluation requires a heavy machine and takes quite some time (4-5 days to run to completion with a 4-hour timeout). In our case, we used a machine with 256 GB of RAM and a 64-core processor with HyperThreading enabled, allowing us to run 128 slicing jobs in parallel.
We explain how to run the full evaluation, or only a partial evaluation below. One can directly skip to the next section and reuse our raw evaluation results, provided alongside this artifact.
In order to reproduce our evaluation, you can run the following commands in the docker image. It is recommended to run them in a tmux session if one wants to inspect other elements in parallel (tmux is installed in the docker image). The timeout (set to 4 hours per binary, like in the paper) can be decreased by editing the evaluate.sh script (vim is installed in the docker image).
This is expected to take 2-3 days of time, on a machine with 128 cores. In order to produce only partial results, see the next section.
cd filtered
cat ../supported.txt | parallel --bar -j 128 sh ../evaluate.sh {}
The results are outputted in the /tmp/out/ directory.
If one does not have access to a high-end machine with 128 cores nor the time to run the full evaluation, it is possible to produce partial results. To do so, the following commands can be run. This will run the evaluation on the full dataset in a random order, which can be stopped early to represent a partial view of our evaluation, on a random subset of the data. In order to gather more datapoints, it is also advised to decrease the timeout in the evaluate.sh file, for example to 20 minutes by setting TIMEOUT=20m with nano evaluate.sh. The number of slicing jobs running in parallel can also be decreased to match the number of processors on the machine running the experiments (the -j 128 argument in the following command runs 128 parallel jobs).
sudo chown opam:opam /tmp/out/
cd filtered
shuf ../supported.txt | parallel --bar -j 128 sh ../evaluate.sh {}
The evaluation results will be stored in the /tmp/out/ directory.
Instead of rerunning the evaluation, one can rely on our full results included in the data.txt.xz and error.txt.xz archives. These can simply be downloaded from within the Docker machine and extracted in /tmp/out/:
cd /tmp/out/
wget https://zenodo.org/record/5821007/files/data.txt.xz
wget https://zenodo.org/record/5821007/files/error.txt.xz
unxz data.txt.xz
unxz error.txt.xz
In order to process this data, we include multiple Python scripts.
These require around 100GB of RAM to load the full dataset in memory.
The scripts should be run with Python 3.
When running this in the docker image, first run cd /tmp/out/ && cp /home/opam/*.py ./
- To count the number of functions sliced, run cut -d, -f 1,2 data.txt | sort -u | wc -l. This takes around 6 minutes to run on the full dataset.
- To count the total number of slices encountered, run wc -l data.txt error.txt. This takes around 15 seconds to run.
- To count the number of errors encountered, run wc -l error.txt. This takes around 1 second to run.
- To produce data and graphs regarding the sizes and timing, run python3 statistics-and-plots.py. This will output the statistics presented in the paper, along with Figure 2 (rq2-sizes.pdf) and Figure 3 (rq2-times.pdf). This script takes around 35 minutes to run.
- To find the executable slices that are larger than the original programs, run python3 larger-slices.py > larger.txt. This script takes around 2h30 to run. It will list the slices using the notation filename function-sliced slicing-criterion in the larger.txt file, from which the slice can be recomputed by running wassail slice function-sliced slicing-criterion output.wat in the docker image. It will also output statistics regarding these slices, which you can easily inspect by running tail larger.txt.
- To investigate slices that could not be computed, run:
sed -i error.txt -e 's/annotation,/annotation./'
python3 errors.py
This will take a few seconds to run and will print a summary of the errors encountered during the slicing process, and requires some manual sorting to map to the categories we discuss in the paper. Here is a summary of the errors encountered and their root cause:
Error: (Failure"Invalid vstack when popping 2 values") Error: (Failure"Spec_inference.drop: not enough elements in stack") Error: (Failure"Spec_inference.take: not enough element in var list") Error: (Failure"unsupported in spec_inference: incompatible stack lengths (probably due to mismatches in br_table branches)")
Error: (Failure"Unsupported in slicing: cannot find an instruction. It probably is part of unreachable code.") Error: (Failure"bottom annotation") Error: (Failure"bottom annotation. this an unreachable instruction")
For this RQ, we include the following data in the rq5.7z archive, and in the rq5/ directory in the docker image:
- The slicing subjects in their C and textual wasm form in rq5/subjects/
- The CodeSurfer slices in their C and textual wasm form in rq5/codesurfer/
- Our slices in their wasm form in rq5/wasm-slices/
As this RQ requires heavy manual comparison, we do not expect the reviewers to reproduce all of our results. We include a summary of our manual investigation in rq5-manual.txt. In order to validate these manual findings, one can for example inspect a specific slice. For example, the following line in rq5-manual.txt:
adpcm_apl1_565_expr.c.wat INTERPROCEDURAL
can be validated as follows:
cd ~/
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The data set comprises signals from the force platform (raw data for the force, moments of forces, and centers of pressure) of 163 subjects, plus one file with information about the subjects and balance conditions and the results of the other evaluations. Subjects' balance was evaluated by posturography using a force platform and by the Mini Balance Evaluation Systems Tests. In the posturography test, we evaluated subjects during standing still for 60 s in four different conditions where vision and the standing surface were manipulated: on a rigid surface with eyes open; on a rigid surface with eyes closed; on an unstable surface with eyes open; on an unstable surface with eyes closed. Each condition was performed three times and the order of the conditions was randomized among subjects. In addition, the following tests were employed in order to better characterize each subject: Short Falls Efficacy Scale International; International Physical Activity Questionnaire Short Version; and Trail Making Test. The subjects were also interviewed to collect information about their socio-cultural, demographic, and health characteristics. The data set is available at PhysioNet (DOI: 10.13026/C2WW2W) and at figshare (DOI: 10.6084/m9.figshare.3394432).
https://www.myvisajobs.com/terms-of-service/
A dataset that explores Green Card sponsorship trends, salary data, and employer insights for the job title "engineer; eval" in the U.S.
ComplexCodeEval
ComplexCodeEval is an evaluation benchmark designed to accommodate multiple downstream tasks, accurately reflect different programming environments, and deliberately avoid data leakage issues. This benchmark includes a diverse set of samples from real-world projects, aiming to closely mirror actual development scenarios.
Overview
ComplexCodeEval consists of:
- 3,897 Java samples from 1,055 code repositories
- 7,184 Python samples from 2,107 code repositories
Key Features
Diverse Downstream Tasks: The benchmark supports multiple downstream tasks to evaluate the performance of different code analysis tools and models.
Accurate Reflection of Programming Environments: Samples are selected from projects that use popular third-party frameworks and packages.
Avoidance of Data Leakage: Incorporates multiple timestamps for each sample to prevent data leakage.