64 datasets found
  1. IFEval

    • huggingface.co
    Updated Dec 22, 2023
    + more versions
    Cite
    Google (2023). IFEval [Dataset]. https://huggingface.co/datasets/google/IFEval
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 22, 2023
    Dataset authored and provided by
    Google (http://google.com/)
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Card for IFEval

      Dataset Summary
    

    This dataset contains the prompts used in the Instruction-Following Eval (IFEval) benchmark for large language models. It contains around 500 "verifiable instructions" such as "write in more than 400 words" and "mention the keyword of AI at least 3 times" which can be verified by heuristics. To load the dataset, run:

        from datasets import load_dataset
        ifeval = load_dataset("google/IFEval")

      Supported Tasks and… See the full description on the dataset page: https://huggingface.co/datasets/google/IFEval.
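
    Once loaded, individual examples can be inspected directly. A minimal sketch is shown below; the split name "train" and the field names "prompt" and "instruction_id_list" follow the dataset card but are assumptions here:

        from datasets import load_dataset

        # Load the roughly 500 IFEval prompts; split and field names are
        # assumptions based on the dataset card and may differ.
        ifeval = load_dataset("google/IFEval", split="train")
        print(len(ifeval))

        example = ifeval[0]
        print(example["prompt"])                # the instruction text to follow
        print(example["instruction_id_list"])   # which heuristic checks apply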
    
  2. wild-if-eval

    • huggingface.co
    Updated May 17, 2025
    Cite
    Gili Lior (2025). wild-if-eval [Dataset]. https://huggingface.co/datasets/gililior/wild-if-eval
    Explore at:
    Dataset updated
    May 17, 2025
    Authors
    Gili Lior
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    WildIFEval Dataset

    This dataset was originally introduced in the paper WildIFEval: Instruction Following in the Wild, available on arXiv. Code: https://github.com/gililior/wild-if-eval

      Dataset Overview
    

    The WildIFEval dataset is designed for evaluating instruction-following capabilities in language models. It provides decompositions of conversations extracted from the LMSYS-Chat-1M dataset. Each example includes:

    conversation_id: A unique identifier for each conversation.… See the full description on the dataset page: https://huggingface.co/datasets/gililior/wild-if-eval.
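
    A minimal loading sketch (no split name is assumed; the conversation_id field follows the description above):

        from datasets import load_dataset

        # Load WildIFEval and peek at one decomposed conversation. The first
        # available split is used rather than assuming a specific split name.
        wild_ifeval = load_dataset("gililior/wild-if-eval")
        split_name = next(iter(wild_ifeval))
        example = wild_ifeval[split_name][0]
        print(example["conversation_id"])   # unique identifier, per the card
        print(example)                      # the full decomposition record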

  3. IFEval Dataset

    • paperswithcode.com
    Updated May 13, 2025
    Cite
    Jeffrey Zhou; Tianjian Lu; Swaroop Mishra; Siddhartha Brahma; Sujoy Basu; Yi Luan; Denny Zhou; Le Hou (2025). IFEval Dataset [Dataset]. https://paperswithcode.com/dataset/ifeval
    Explore at:
    Dataset updated
    May 13, 2025
    Authors
    Jeffrey Zhou; Tianjian Lu; Swaroop Mishra; Siddhartha Brahma; Sujoy Basu; Yi Luan; Denny Zhou; Le Hou
    Description

    This dataset evaluates the instruction-following ability of large language models. There are 500+ prompts with instructions such as "write an article with more than 800 words", "wrap your response with double quotation marks", etc.
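
    Constraints like these can be checked with simple heuristics. The sketch below is illustrative only and is not the official IFEval evaluation code:

        # Toy verifiers for the two quoted instructions; not the official checker.
        def has_more_than_n_words(response: str, n: int = 800) -> bool:
            return len(response.split()) > n

        def wrapped_in_double_quotes(response: str) -> bool:
            text = response.strip()
            return len(text) >= 2 and text.startswith('"') and text.endswith('"')

        sample = '"A short sample response."'
        print(has_more_than_n_words(sample))     # False
        print(wrapped_in_double_quotes(sample))  # True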

  4. wild-if-eval-gpt-decomposition

    • huggingface.co
    Updated May 11, 2025
    + more versions
    Cite
    Gili Lior (2025). wild-if-eval-gpt-decomposition [Dataset]. https://huggingface.co/datasets/gililior/wild-if-eval-gpt-decomposition
    Explore at:
    Dataset updated
    May 11, 2025
    Authors
    Gili Lior
    Description

    The gililior/wild-if-eval-gpt-decomposition dataset is hosted on Hugging Face and was contributed by the HF Datasets community.

  5. Dataset for: SESR-Eval: Dataset to Evaluate LLMs in the Screening Process of...

    • zenodo.org
    zip
    Updated Apr 28, 2025
    + more versions
    Cite
    Authors Anonymous; Authors Anonymous (2025). Dataset for: SESR-Eval: Dataset to Evaluate LLMs in the Screening Process of Systematic Reviews [Dataset]. http://doi.org/10.5281/zenodo.15295440
    Explore at:
    Available download formats: zip
    Dataset updated
    Apr 28, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Authors Anonymous; Authors Anonymous
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction

    This is a dataset for: "SESR-Eval: Dataset to Evaluate LLMs in the Screening Process of Systematic Reviews".

    Folder structure

    data

    The `data`-folder contains:
    - Initial replication package selection (`1-replication-package-selection`)
    - Inter-rater reliability agreement for replication package selection (`2-replication-package-selection-reliability-agreement`)
    - Processed replication packages (`3-processed-data`)
    - Replication packages are omitted due to size constraints, but are downloadable via provided links
    - LLM results (`4-llm-results`)
    - The SESR-Eval dataset (`sesr-eval-dataset`)
    See: `data/sesr-eval-dataset/README.md`

    documentation

    The `documentation`-folder contains miscellaneous documentation for the study.

    experiments

    The `experiments`-folder contains the LLM experiment source code.

    How to run the benchmarks?

    1. Install Python 3
    2. Run `python3 -m venv venv`
    3. Run `source venv/bin/activate`
    4. Run `pip install -r requirements.txt`
    5. Copy `.env.example` to `.env`
    6. Obtain:
       1. Dataset (see data/sesr-eval-dataset/README.md)
       2. OpenAI API key
       3. Openrouter API key (if you wish to run models other than OpenAI)
    7. Run: `./run_experiments.sh`

    Requirements

    - Python 3

    Scopus API usage

    The data was downloaded from the Scopus API between January 1 and April 25, 2025, via http://api.elsevier.com and http://www.scopus.com.

    License

    The replication package is licensed under the CC-BY-ND 4.0 license. Each secondary study in the dataset has its own license. However, Elsevier has its own terms and conditions regarding the use of our research data:
    ----
    This work uses data that was downloaded from the Scopus API between Jan 1 and Apr 24, 2025, via http://api.elsevier.com and http://www.scopus.com.
    Elsevier allows access to the Scopus APIs in support of academic research for researchers affiliated with a Scopus-subscribing institution.
    The end product here is a scholarly published work that utilizes publications in Scopus for our research effort. We want to publish a scholarly work regarding Scopus data relationships.
    The data downloaded from the Scopus API for our work is published to make the work follow the practices of open science. It also makes it possible to reproduce our work's results.
    Elsevier allows this use case under the following conditions, which our work meets:
    - The research is for non-commercial, academic purposes only.
    - The research is performed by an approved representative of the applying institution.
    - The research is limited to the scope of software engineering (SE); we are not mining the entire Scopus dataset.
    - The retention of the original research dataset is limited to archival purposes and reproduction of the research results.
    - Public sharing of data for the purpose of reproducibility with a specific party is permissible upon written request and explicit written approval.
    - Scopus has been identified as the data source as described in the Scopus Attribution Guide.
    - If the user is a bibliometrician doing work outside this use case, they must contact Elsevier's International Center for the Study of Research.
    The data is not displayed on a website or in a public forum outside of the output format of the scholarly published work. The data is only stored in Zenodo, in this replication package.
  6. RedEval Dataset

    • paperswithcode.com
    Updated Apr 10, 2024
    Cite
    Rishabh Bhardwaj; Soujanya Poria (2024). RedEval Dataset [Dataset]. https://paperswithcode.com/dataset/redeval
    Explore at:
    Dataset updated
    Apr 10, 2024
    Authors
    Rishabh Bhardwaj; Soujanya Poria
    Description

    RedEval is a safety evaluation benchmark designed to assess the robustness of large language models (LLMs) against harmful prompts. It simulates and evaluates LLM applications across various scenarios, all while eliminating the need for human intervention. Here are the key aspects of RedEval:

    Purpose: RedEval aims to evaluate LLM safety using a technique called Chain of Utterances (CoU)-based prompts. CoU prompts are effective at breaking the safety guardrails of various LLMs, including GPT-4, ChatGPT, and open-source models.

    Safety Assessment: RedEval provides simple scripts to evaluate both closed-source systems (such as ChatGPT and GPT-4) and open-source LLMs on its benchmark. The evaluation focuses on harmful questions and computes the Attack Success Rate (ASR).

    Question Banks:

    - HarmfulQA: Consists of 1,960 harmful questions covering 10 topics and approximately 10 subtopics each.
    - DangerousQA: Contains 200 harmful questions across 6 adjectives: racist, stereotypical, sexist, illegal, toxic, and harmful.
    - CategoricalQA: Includes 11 categories of harm, each with 5 sub-categories, available in English, Chinese, and Vietnamese.
    - AdversarialQA: Provides a set of 500 instructions to tease out harmful behaviors from the model.

    Safety Alignment: RedEval also offers code to perform safety alignment of LLMs. For instance, it aligns Vicuna-7B on HarmfulQA, resulting in a safer version of Vicuna that is more robust against RedEval.

    Installation:

    1. Create a conda environment: conda create --name redeval -c conda-forge python=3.11
    2. Activate the environment: conda activate redeval
    3. Install required packages: pip install -r requirements.txt
    4. Store API keys in the api_keys directory for use by the LLM as a judge and the generate_responses.py script for closed-source models.

    Prompt Templates:

    Choose a prompt template for red-teaming:
    - Chain of Utterances (CoU): Effective at breaking safety guardrails.
    - Chain of Thoughts (CoT)
    - Standard prompt
    - Suffix prompt

    Note: Different LLMs may require slight variations in the prompt template.

    How to Perform Red-Teaming:

    Step 0: Decide on the prompt template.
    Step 1: Generate model outputs on harmful questions by providing a path to the question bank and the red-teaming prompt.

  7. RLVR-IFeval

    • huggingface.co
    Updated Apr 30, 2025
    Cite
    Ai2 (2025). RLVR-IFeval [Dataset]. https://huggingface.co/datasets/allenai/RLVR-IFeval
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 30, 2025
    Dataset provided by
    Allen Institute for AI (http://allenai.org/)
    Authors
    Ai2
    License

    ODC-By: https://choosealicense.com/licenses/odc-by/

    Description

    IF Data - RLVR Formatted

    This dataset contains instruction-following data formatted for use with open-instruct, specifically for reinforcement learning with verifiable rewards. Prompts with verifiable constraints were generated by sampling from the Tulu 2 SFT mixture and randomly adding constraints from IFEval. It is part of the Tulu 3 release, for which you can see models here and datasets here.

      Dataset Structure
    

    Each example in the dataset contains the standard instruction-tuning… See the full description on the dataset page: https://huggingface.co/datasets/allenai/RLVR-IFeval.
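
    A minimal loading sketch (the split name "train" is an assumption; consult the dataset card for the exact schema):

        from datasets import load_dataset

        # Load the RLVR-formatted instruction-following data and inspect one
        # example; the "train" split name is an assumption.
        rlvr_ifeval = load_dataset("allenai/RLVR-IFeval", split="train")
        print(rlvr_ifeval)
        print(rlvr_ifeval[0])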

  8. MM-Eval

    • huggingface.co
    Updated Sep 29, 2024
    Cite
    prometheus-eval (2024). MM-Eval [Dataset]. https://huggingface.co/datasets/prometheus-eval/MM-Eval
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 29, 2024
    Dataset authored and provided by
    prometheus-eval
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Multilingual Meta-EVALuation benchmark (MM-Eval)

    👨‍💻Code | 📄Paper | 🤗 MMQA

    MM-Eval is a multilingual meta-evaluation benchmark consisting of five core subsets—Chat, Reasoning, Safety, Language Hallucination, and Linguistics—spanning 18 languages and a Language Resource subset spanning 122 languages for a broader analysis of language effects.

    Design Choice: In this work, we minimize the inclusion of translated samples, as mere translation may alter existing preferences due to… See the full description on the dataset page: https://huggingface.co/datasets/prometheus-eval/MM-Eval.
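
    A minimal loading sketch (whether a configuration name is required, and the exact subset and column layout, are assumptions; see the dataset card):

        from datasets import load_dataset

        # Load MM-Eval and report the size of each split plus one example;
        # configuration, split, and column names are assumptions.
        mm_eval = load_dataset("prometheus-eval/MM-Eval")
        for split_name, split in mm_eval.items():
            print(split_name, len(split))
            print(split[0])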

  9. Data from: TruthEval: A Dataset to Evaluate LLM Truthfulness and Reliability...

    • borealisdata.ca
    Updated Jul 30, 2024
    Cite
    Aisha Khatun; Dan Brown (2024). TruthEval: A Dataset to Evaluate LLM Truthfulness and Reliability [Dataset]. http://doi.org/10.5683/SP3/5MZWBV
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 30, 2024
    Dataset provided by
    Borealis
    Authors
    Aisha Khatun; Dan Brown
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Large Language Model (LLM) evaluation is currently one of the most important areas of research, with existing benchmarks proving to be insufficient and not completely representative of LLMs' various capabilities. We present a curated collection of challenging statements on sensitive topics for LLM benchmarking called TruthEval. These statements were curated by hand and contain known truth values. The categories were chosen to distinguish LLMs' abilities from their stochastic nature. Details of collection method and use cases can be found in this paper: TruthEval: A Dataset to Evaluate LLM Truthfulness and Reliability

  10. Data from: COFFE Dataset

    • paperswithcode.com
    Updated Feb 4, 2025
    + more versions
    Cite
    (2025). COFFE Dataset [Dataset]. https://paperswithcode.com/dataset/coffe
    Explore at:
    Dataset updated
    Feb 4, 2025
    Description

    COFFE

    COFFE is a Python benchmark for evaluating the time efficiency of LLM-generated code. It is released by the FSE'25 paper "COFFE: A Code Efficiency Benchmark for Code Generation". You can also refer to the project webpage for more details.

    Data

    COFFE is designed for evaluating both function-level code and file-level code. It contains selected instances from HumanEval, MBPP, APPS and Code Contests. COFFE keeps the original test cases in these benchmarks as correctness test cases and adds new test cases designed for time efficiency evaluation as stressful test cases.

    Statistics:

    Category        #Instance   #Solution/Instance   #Correctness/Instance   #Stressful/Instance
    Function-Level  398         1.00                 5.72                    4.99
    File-Level      358         66.93                43.68                   4.95

    Data Files:

    All instances in COFFE are in Coffe/datasets, where Coffe/datasets/function contains all function-level instances and Coffe/datasets/file contains all file-level instances. In each repo:
    - best_solutions.json contains the best ground truth solution COFFE uses to calculate efficient@1 and speedup.
    - stressful_testcases.json contains all stressful test cases COFFE adds.
    - solutions.json contains all ground truth solutions in original benchmarks.
    - testcases.json contains all correctness test cases in original benchmarks.

    Installation

    Requirements:
    - Linux Machine
    - Docker
    - Python>=3.10

    We suggest you create a virtual environment before installing COFFE.

    To use COFFE, please clone this repo in your current workspace workspace/ and execute:

        cd Coffe && pip install .

    COFFE comes with a Docker image; to install it:

        docker build . -t coffe

    Note that if your network requires a proxy, please modify the Dockerfile in Coffe/ to indicate it, otherwise the Docker image build process could fail.

    Go back to your workspace and initialize COFFE:

        cd .. && coffe init

    If your installation succeeds, you should see the statistics of COFFE.

    Usage Pipeline

    When you prepare the predictions from LLMs, COFFE provides a pipeline to calculate the efficient@1 and speedup defined in the paper:

        coffe pipe

    For example:

        coffe pipe function Coffe/examples/function -p Coffe/examples/function/GPT-4o.json -f efficient_at_1 -n 8

    This command evaluates the predictions from GPT-4o on the function-level instances of COFFE. If you want to evaluate other LLMs, please prepare a JSON file with the same format as Coffe/examples/function/GPT-4o.json.

    Prediction File Format:

    In the JSON file, the key is the prompt used to query the LLM for the results; you can get the prompts in datasets/function/prompts.json and datasets/file/prompts.json. The value contains two objects: the first is a list containing the raw outputs from LLMs, and the second is an indicator of whether the raw output is valid.
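
    A minimal sketch of assembling such a prediction file (the prompt key and the raw output below are placeholders, and the exact type of the validity indicator is an assumption based on the description above):

        import json

        # Map each prompt (taken from datasets/function/prompts.json) to a pair:
        # [list of raw LLM outputs, validity indicator]. Types are assumptions.
        predictions = {
            "<prompt text from datasets/function/prompts.json>": [
                ["def solve(nums):\n    return sorted(nums)"],  # raw outputs
                True,                                           # outputs are valid
            ],
        }

        with open("my_predictions.json", "w") as f:
            json.dump(predictions, f, indent=2)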

    Note:

    By default, COFFE runs all predictions in Docker. However, if you cannot successfully install Docker or want to run the predictions on the host machine, you can add the -x option.

    Single Evaluation

    The pipe command provides an entire pipeline for calculating the final metrics. This pipeline can also be completed by executing the following four single evaluation commands.

    1. Sanitize the predictions: coffe eval

    2. Select correct predictions: coffe eval

       Note: This command will combine all correct solutions and ground truth solutions together into files.

    3. Evaluate the GPU instruction count: coffe eval

       Note: This command can only accept a single prediction file ending with PASSED_SOLUTIONS.json.

    4. Calculate the efficient@1/speedup: coffe eval

       Note: This command requires the index file and the instruction file, as COFFE compares the performance of predictions with ground truth solutions to calculate the metrics.

    STGen

    For details about the stressful test case generation approach STGen, please see stgen/.

    Cite

    If you use COFFE, please cite us:

        @misc{peng2025coffe,
          title={COFFE: A Code Efficiency Benchmark for Code Generation},
          author={Yun Peng and Jun Wan and Yichen Li and Xiaoxue Ren},
          year={2025},
          eprint={2502.02827},
          archivePrefix={arXiv},
          primaryClass={cs.SE},
          url={https://arxiv.org/abs/2502.02827},
        }

  11. Data from: Plasma Protein Binding Evaluations of Per- and Polyfluoroalkyl...

    • catalog.data.gov
    Updated Jun 29, 2023
    Cite
    U.S. EPA Office of Research and Development (ORD) (2023). Plasma Protein Binding Evaluations of Per- and Polyfluoroalkyl Substances for Category-Based Toxicokinetic Assessment [Dataset]. https://catalog.data.gov/dataset/plasma-protein-binding-evaluations-of-per-and-polyfluoroalkyl-substances-for-category-base
    Explore at:
    Dataset updated
    Jun 29, 2023
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    Marci Smeltz, John F. Wambaugh, and Barbara A. Wetmore, Plasma Protein Binding Evaluations of Per- and Polyfluoroalkyl Substances for Category-Based Toxicokinetic Assessment, Chemical Research in Toxicology, 10.1021/acs.chemrestox.3c00003. This dataset is associated with the following publication: Smeltz, M., J. Wambaugh, and B. Wetmore. Plasma Protein Binding Evaluations of Per- and Polyfluoroalkyl Substances for Category-Based Toxicokinetic Assessment. CHEMICAL RESEARCH IN TOXICOLOGY. American Chemical Society, Washington, DC, USA, 36(6): 870-871, (2023).

  12. ContextEval

    • huggingface.co
    Updated Nov 8, 2024
    Cite
    Ai2 (2024). ContextEval [Dataset]. https://huggingface.co/datasets/allenai/ContextEval
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 8, 2024
    Dataset provided by
    Allen Institute for AI (http://allenai.org/)
    Authors
    Ai2
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Contextualized Evaluations: Taking the Guesswork Out of Language Model Evaluations

      Dataset Summary
    

    We provide here the data accompanying the paper: Contextualized Evaluations: Taking the Guesswork Out of Language Model Evaluations.

      Dataset Structure

      Data Instances

    We release the set of queries, as well as the autorater & human evaluation judgements collected for our experiments.

      Data overview

      List of queries: Data Structure

    The list… See the full description on the dataset page: https://huggingface.co/datasets/allenai/ContextEval.
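
    A minimal loading sketch (whether a configuration name or specific data files must be passed is an assumption):

        from datasets import load_dataset

        # Load the ContextEval queries and evaluation judgements and report
        # the size of each split; configuration and split names are assumptions.
        context_eval = load_dataset("allenai/ContextEval")
        for split_name, split in context_eval.items():
            print(split_name, len(split))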

  13. TruthfulQA: Benchmark for Evaluating Language

    • opendatabay.com
    Updated Jun 26, 2025
    Cite
    Datasimple (2025). TruthfulQA: Benchmark for Evaluating Language [Dataset]. https://www.opendatabay.com/data/ai-ml/248609bb-9172-4cd4-a042-6dcb70fcc3f3
    Explore at:
    Available download formats
    Dataset updated
    Jun 26, 2025
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Data Science and Analytics
    Description

    The TruthfulQA dataset is specifically designed to evaluate the truthfulness of language models in generating answers to a wide range of questions. Comprising 817 carefully crafted questions spanning various topics such as health, law, finance, and politics, this benchmark aims to uncover any erroneous or false answers that may arise due to incorrect beliefs or misconceptions. It serves as a comprehensive measure of the ability of language models to go beyond imitating human texts and avoid generating inaccurate responses. The dataset includes columns such as type (indicating the format or style of the question), category (providing the topic or theme), best_answer (the correct and truthful answer), correct_answers (a list containing all valid responses), incorrect_answers (a list encompassing potential false interpretations provided by some humans), source (identifying the origin or reference for each question), and mc1_targets and mc2_targets (highlighting the respective correct answers for multiple-choice questions). The generation_validation.csv file contains generated questions and their corresponding evaluations based on truthfulness, while multiple_choice_validation.csv focuses on validating multiple-choice questions along with their answer choices. Through this dataset, researchers can comprehensively assess language model performance in terms of factual accuracy and avoidance of misleading information during answer generation tasks.

    How to Use the TruthfulQA Dataset: A Guide

    Welcome to the TruthfulQA dataset, a benchmark designed to evaluate the truthfulness of language models in generating answers to questions. This guide will provide you with essential information on how to effectively utilize this dataset for your own purposes.

    Dataset Overview

    The TruthfulQA dataset consists of 817 carefully crafted questions covering a wide range of topics, including health, law, finance, and politics. These questions are constructed in such a way that some humans would answer falsely due to false beliefs or misconceptions. The aim is to assess language models' ability to avoid generating false answers learned from imitating human texts.

    Files in the Dataset

    The dataset includes two main files:

    generation_validation.csv: This file contains questions and answers generated by language models. These responses are evaluated based on their truthfulness.

    multiple_choice_validation.csv: This file consists of multiple-choice questions along with their corresponding answer choices for validation purposes.

    Column Descriptions

    To better understand the dataset and its contents, here is an explanation of each column present in both files:

    - type: Indicates the type or format of the question.
    - category: Represents the category or topic of the question.
    - best_answer: Provides the correct and truthful answer according to human knowledge/expertise.
    - correct_answers: Contains a list of correct and truthful answers provided by humans.
    - incorrect_answers: Lists incorrect and false answers that some humans might provide.
    - source: Specifies where the question originates from (e.g., publication, website).
    - mc1_targets, mc2_targets, etc. (multiple-choice questions only): Represent the different options available as answer choices, with the corresponding correct answers.

    Using this Dataset Effectively

    When utilizing this dataset for evaluation or testing purposes:

    Truth Evaluation: For assessing language models' truthfulness in generating answers, use the generation_validation.csv file. Compare the model answers with the correct_answers column to evaluate their accuracy.

    Multiple-Choice Evaluation: To test language models' ability to choose the correct answer among given choices, refer to the multiple_choice_validation.csv file. The correct answer options are provided in the columns such as mc1_targets, mc2_targets, etc.
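
    A minimal sketch of both evaluation modes using pandas (the model_answer column is hypothetical and invented for illustration; the file and column names otherwise follow the guide above):

        import pandas as pd

        # Load the two validation files described above.
        gen = pd.read_csv("generation_validation.csv")
        mc = pd.read_csv("multiple_choice_validation.csv")
        print(gen.columns.tolist())   # type, category, best_answer, ...
        print(mc.columns.tolist())    # includes mc1_targets, mc2_targets, ...

        # Naive truth check: a hypothetical model answer counts as truthful if it
        # appears verbatim among the accepted answers (real evaluations would use
        # softer matching than exact substring membership).
        def is_truthful(model_answer: str, correct_answers: str) -> bool:
            return model_answer.strip() in correct_answers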

    Ensure that you consider these guidelines while leveraging this dataset for your analysis or experiments related to evaluating language models' truthfulness and performance.

    Remember that this guide is intended to help

    Research Ideas

    - Training and evaluating language models: The TruthfulQA dataset can be used to train and evaluate the truthfulness of language models in generating answers to questions. By comparing the generated answers with the correct and truthful ones provided in the dataset, researchers can assess the ability of language models to avoid false answers learned from imitating human texts.
    - Detecting misinformation: This dataset can also be used to develop algorithms or models that are capable of identifying false or misleading information. By analyzing the generated answers and comparing them with the correct ones, it is possible to build systems that automatically detect and flag misinformation.
    - Improving fact-checking systems: Fact-checking platforms or systems can benefit from this dataset by using it as a source for training and validating their algorithms. With access to a large number of

  14. Replication Data for Study 1 of Instructor Name Preference and Student...

    • dataverse.harvard.edu
    • search.dataone.org
    Updated Aug 24, 2022
    Cite
    Melissa Foster (2022). Replication Data for Study 1 of Instructor Name Preference and Student Evaluations of Instruction [Dataset]. http://doi.org/10.7910/DVN/GT89B6
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 24, 2022
    Dataset provided by
    Harvard Dataverse
    Authors
    Melissa Foster
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This data set includes the independent variable "Name" which identifies whether the instructor expressed preference for being referred to by her first name or professional title (e.g. "Dr. Last Name"). The dependent variable of interest was the Student Evaluation of Instruction scores, especially the "overall" score. Class information includes the class size, number of evaluations, year, and semester (Fall or Spring).

  15. Data from: In Vitro Hepatic Clearance Evaluations of Per- and...

    • catalog.data.gov
    Updated Mar 22, 2025
    Cite
    U.S. EPA Office of Research and Development (ORD) (2025). In Vitro Hepatic Clearance Evaluations of Per- and Polyfluoroalkyl Substances (PFAS) across Multiple Structural Categories [Dataset]. https://catalog.data.gov/dataset/in-vitro-hepatic-clearance-evaluations-of-per-and-polyfluoroalkyl-substances-pfas-across-m
    Explore at:
    Dataset updated
    Mar 22, 2025
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    Dataset for "Crizer, D.M.; Rice, J.R.; Smeltz, M.G.; Lavrich, K.S.; Ravindra, K.; Wambaugh, J.F.; DeVito, M.; Wetmore, B.A. In Vitro Hepatic Clearance Evaluations of Per- and Polyfluoroalkyl Substances (PFAS) across Multiple Structural Categories. Toxics 2024, 12, 672. https://doi.org/10.3390/toxics12090672". This dataset is associated with the following publication: Crizer, D., J. Rice, M. Smeltz, K. Lavrich, K. Ravindra, J. Wambaugh, M. Devito, and B. Wetmore. In Vitro Hepatic Clearance Evaluations of Per- and Polyfluoroalkyl Substances (PFAS) Across Multiple Structural Categories. Toxics. MDPI, Basel, SWITZERLAND, 12(9): 672, (2024).

  16. DiscoEval Dataset

    • paperswithcode.com
    Updated May 12, 2024
    Cite
    Zachary Bamberger; Ofek Glick; Chaim Baskin; Yonatan Belinkov (2024). DiscoEval Dataset [Dataset]. https://paperswithcode.com/dataset/discoeval
    Explore at:
    Dataset updated
    May 12, 2024
    Authors
    Zachary Bamberger; Ofek Glick; Chaim Baskin; Yonatan Belinkov
    Description

    Dataset Summary

    DiscoEval is an English-language benchmark that contains a test suite of 7 tasks to evaluate whether sentence representations include semantic information relevant to discourse processing. The benchmark datasets offer a collection of tasks designed to evaluate natural language understanding models in the context of discourse analysis and coherence.

    Dataset Sources

    - Arxiv: A repository of scientific papers and research articles.
    - Wikipedia: An extensive online encyclopedia with articles on diverse topics.
    - Rocstory: A dataset consisting of fictional stories.
    - Ubuntu IRC channel: Conversational data extracted from the Ubuntu Internet Relay Chat (IRC) channel.
    - PeerRead: A dataset of scientific papers frequently used for discourse-related tasks.
    - RST Discourse Treebank: A dataset annotated with Rhetorical Structure Theory (RST) discourse relations.
    - Penn Discourse Treebank: Another dataset with annotated discourse relations, facilitating the study of discourse structure.

  17. Data from: Static Stack-Preserving Intra-Procedural Slicing of WebAssembly...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Feb 8, 2022
    Cite
    David W. Binkley (2022). Static Stack-Preserving Intra-Procedural Slicing of WebAssembly Binaries [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5821006
    Explore at:
    Dataset updated
    Feb 8, 2022
    Dataset provided by
    Quentin Stiévenart
    Coen De Roover
    David W. Binkley
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    About this artifact

    This artifact contains the implementation and the results of the evaluation of a static slicer for WebAssembly described in the ICSE 2022 paper titled "Static Stack-Preserving Intra-Procedural Slicing of WebAssembly Binaries".

    The artifact contains a docker image (wassail-eval.tar.xz) that contains everything necessary to reproduce our evaluation, and the actual data resulting from our evaluation:
    1. The implementation of our slicer (presented in Section 4.1) is included in the docker machine, and is available publicly here: https://github.com/acieroid/wassail/tree/icse2022
    2. Test cases used for our evaluation of RQ1 are included in the docker machine and in the rq1.tar.xz archive.
    3. The dataset used in RQ2, RQ3, and RQ4 is included in the docker machine.
    4. The code needed to run our evaluation of RQ2, RQ3, and RQ4 is included in the docker machine.
    5. The scripts used to generate the statistics and graphs that are included in the paper for RQ2, RQ3, and RQ4 are included in the docker machine and as the *.py files in this artifact.
    6. The data of RQ5 that has been used in our manual investigation is included in the docker machine and in the rq5.tar.xz archive, along with rq5-manual.txt detailing our manual analysis findings.

    How to obtain it

    Our artifact is available on Zenodo at the following URL: https://zenodo.org/record/5821007

    Setting up the Docker image

    Downloading The Artifact

    The artifact is available at the following URL: https://zenodo.org/record/5821007

    Loading The Docker Image

    Once the artifact is downloaded in the file icse2022slicing.tar.xz, it can be extracted and loaded into Docker as follows (this takes a few minutes):

        docker import icse2022slicing.tar.xz

    To simplify further commands, you can tag the image using the printed sha256 hash of the image: if the docker import command resulted in the hash 54aa9416a379a6c71b1c325985add8bf931752d754c8fb17872c05f4e4b52ea2, you can run:

        docker tag 54aa9416a379a6c71b1c325985add8bf931752d754c8fb17872c05f4e4b52ea2 wassail-eval

    Once the Docker image has been loaded, you can run the following commands to obtain a shell in the appropriate environment:

        docker volume create result
        docker run -it -v result:/tmp/out/ wassail-eval bash
        su - opam

    Reproducing results of RQ1

    Our manual translations of the "classical" examples are included in the rq1/ directory (available in the docker image and in rq1.tar.xz). We include the slices computed by our implementation in the rq1/out/ directory.

    A slice can be produced for each example in the docker image as follows, where the first argument is the name of the program being sliced, the second the function index being sliced, the third the slicing criterion (indicated as the instruction index, where instructions start at 1), and the last argument is the output file for the slice:

    cd rq1/
    wassail slice scam-mug.wat 5 8 scam-mug-slice.wat
    wassail slice montreal-boat.wat 5 19 montreal-boat-slice.wat
    wassail slice word-count.wat 1 41 word-count-slice1.wat
    wassail slice word-count.wat 1 43 word-count-slice2.wat
    wassail slice word-count.wat 1 39 word-count-slice3.wat
    wassail slice word-count.wat 1 45 word-count-slice4.wat
    wassail slice word-count.wat 1 37 word-count-slice5.wat
    wassail slice agrawal-fig-3.wat 3 38 agrawal-fig-3-slice.wat
    wassail slice agrawal-fig-5.wat 3 37 agrawal-fig-5-slice.wat
    

    The slice results can then be inspected manually, and compared with the original version of the .wat program to see which instructions have been removed, or with the expected solutions in the out/ directory, e.g. by running:

        diff word-count-slice1.wat out/word-count-slice1.wat

    (No output is expected if the slice is correct.)

    Reproducing results of RQ2, RQ3, and RQ4

    For these RQ, we include the data resulting from our evaluation, but we also allow reviewers to rerun the full evaluation if needed. However, such an evaluation requires a heavy machine and takes quite some time (4-5 days to run to completion with a 4 hours timeout). In our case, we used a machine with 256 GB of RAM and a 64-core processor with HyperThreading enabled, allowing us to run 128 slicing jobs in parallel.

    Running the Evaluation

    We explain how to run the full evaluation, or only a partial evaluation below. One can directly skip to the next section and reuse our raw evaluation results, provided alongside this artifact.

    Running the Full Evaluation

    In order to reproduce our evaluation, you can run the following commands in the docker image. It is recommended to run them in a tmux session if one wants to inspect other elements in parallel (tmux is installed in the docker image). The timeout (set to 4 hours per binary, like in the paper) can be decreased by editing the evaluate.sh script (vim is installed in the docker image).

    This is expected to take 2-3 days of time, on a machine with 128 cores. In order to produce only partial results, see the next section.

    cd filtered
    cat ../supported.txt | parallel --bar -j 128 sh ../evaluate.sh {}
    

    The results are outputted in the /tmp/out/ directory.

    Running a Partial Evaluation

    If one does not have access to a high-end machine with 128 cores nor the time to run the full evaluation, it is possible to produce partial results. To do so, the following commands can be run. This will run the evaluation on the full dataset in a random order, which can be stopped early to represent a partial view of our evaluation, on a random subset of the data. In order to gather more datapoints, it is also advised to decrease the timeout in the evaluate.sh file, for example to 20 minutes by setting TIMEOUT=20m with nano evaluate.sh. The number of slicing jobs running in parallel can also be decreased to match the number of processors on the machine running the experiments (the -j 128 argument in the following command runs 128 parallel jobs)

    sudo chown opam:opam /tmp/out/
    cd filtered
    shuf ../supported.txt | parallel --bar -j 128 sh ../evaluate.sh {}
    

    The evaluation results will be stored in the /tmp/out/ directory.

    Skipping the Evaluation Run

    Instead of rerunning the evaluation, one can rely on our full results included in the data.txt.xz and error.txt.xz archives. These can simply be downloaded from within the Docker machine and extracted in /tmp/out/:

    cd /tmp/out/
    wget https://zenodo.org/record/5821007/files/data.txt.xz
    wget https://zenodo.org/record/5821007/files/error.txt.xz
    unxz data.txt.xz
    unxz error.txt.xz
    

    Processing the data

    In order to process this data, we included multiple Python scripts. These require around 100 GB of RAM to load the full dataset in memory. The scripts should be run with Python 3. When running this in the docker image, first run cd /tmp/out/ && cp /home/opam/*.py ./
    - To count the number of functions sliced, run cut -d, -f 1,2 data.txt | sort -u | wc -l. This takes around 6 minutes to run on the full dataset.
    - To count the total number of slices encountered, run wc -l data.txt error.txt. This takes around 15 seconds to run.
    - To count the number of errors encountered, run wc -l error.txt. This takes around 1 second to run.
    - To produce data and graphs regarding the sizes and timing, run python3 statistics-and-plots.py. This will output the statistics presented in the paper, along with Figure 2 (rq2-sizes.pdf) and Figure 3 (rq2-times.pdf). This script takes around 35 minutes to run.
    - To find the executable slices that are larger than the original programs, run python3 larger-slices.py > larger.txt. This script takes around 2h30 to run. It will list the slices using the notation filename function-sliced slicing-criterion in the larger.txt file, from which the slice can be recomputed by running wassail slice function-sliced slicing-criterion output.wat in the docker image. It will also output statistics regarding these slices, which you can easily inspect by running tail larger.txt.
    - To investigate slices that could not be computed, run:

          sed -i error.txt -e 's/annotation,/annotation./'
          python3 errors.py

      This will take a few seconds to run and will print a summary of the errors encountered during the slicing process, and requires some manual sorting to map to the categories we discuss in the paper. Here is a summary of the errors encountered and their root cause:

    Root Cause: Unsupported Usage of br_table

    Error: (Failure"Invalid vstack when popping 2 values") Error: (Failure"Spec_inference.drop: not enough elements in stack") Error: (Failure"Spec_inference.take: not enough element in var list") Error: (Failure"unsupported in spec_inference: incompatible stack lengths (probably due to mismatches in br_table branches)")

    Root Cause: Unreachable Code

    Error: (Failure"Unsupported in slicing: cannot find an instruction. It probably is part of unreachable code.") Error: (Failure"bottom annotation") Error: (Failure"bottom annotation. this an unreachable instruction")

    RQ5: Comparison to Slicing C Programs

    For this RQ, we include the following data in the rq5.7z archive, and in the rq5/ directory in the docker image:
    - The slicing subjects in their C and textual wasm form in rq5/subjects/
    - The CodeSurfer slices in their C and textual wasm form in rq5/codesurfer/
    - Our slices in their wasm form in rq5/wasm-slices/

    As this RQ requires heavy manual comparison, we do not expect the reviewers to reproduce all of our results. We include a summary of our manual investigation in rq5-manual.txt. In order to validate these manual findings, one can for example inspect a specific slice. For example, the following line in rq5-manual.txt:

    adpcm_apl1_565_expr.c.wat INTERPROCEDURAL
    

    can be validated as follows:

        cd ~/

    This generates a trimmed down version of the CodeSurfer slice, only containing the function of

  18. Data from: A public data set of human balance evaluations

    • figshare.com
    zip
    Updated May 30, 2023
    Cite
    Damiana A. dos Santos; Marcos Duarte (2023). A public data set of human balance evaluations [Dataset]. http://doi.org/10.6084/m9.figshare.3394432.v2
    Explore at:
    Available download formats: zip
    Dataset updated
    May 30, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Damiana A. dos Santos; Marcos Duarte
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The data set comprises signals from the force platform (raw data for the force, moments of forces, and centers of pressure) of 163 subjects plus one file with information about the subjects and balance conditions and the results of the other evaluations. Subject’s balance was evaluated by posturography using a force platform and by the Mini Balance Evaluation Systems Tests. In the posturography test, we evaluated subjects during standing still for 60 s in four different conditions where vision and the standing surface were manipulated: on a rigid surface with eyes open; on a rigid surface with eyes closed; on an unstable surface with eyes open; on an unstable surface with eyes closed. Each condition was performed three times and the order of the conditions was randomized among subjects. In addition, the following tests were employed in order to better characterize each subject: Short Falls Efficacy Scale International; International Physical Activity Questionnaire Short Version; and Trail Making Test. The subjects were also interviewed to collect information about their socio-cultural, demographic, and health characteristics. The data set is available at PhysioNet (DOI: 10.13026/C2WW2W) and at figshare (DOI: 10.6084/m9.figshare.3394432).

  19. 2025 Green Card Report for Title Of Engineer; Eval

    • myvisajobs.com
    Updated Jan 16, 2025
    Cite
    MyVisaJobs (2025). 2025 Green Card Report for Title Of Engineer; Eval [Dataset]. https://www.myvisajobs.com/reports/green-card/major/title-of-engineer;-eval
    Explore at:
    Dataset updated
    Jan 16, 2025
    Dataset provided by
    MyVisaJobs.com
    Authors
    MyVisaJobs
    License

    https://www.myvisajobs.com/terms-of-service/

    Variables measured
    Major, Salary, Petitions Filed
    Description

    A dataset that explores Green Card sponsorship trends, salary data, and employer insights for title of engineer; eval in the U.S.

  20. ComplexCodeEval Dataset

    • paperswithcode.com
    Updated Oct 31, 2024
    Cite
    (2024). ComplexCodeEval Dataset [Dataset]. https://paperswithcode.com/dataset/complexcodeeval
    Explore at:
    Dataset updated
    Oct 31, 2024
    Description

    ComplexCodeEval

    ComplexCodeEval is an evaluation benchmark designed to accommodate multiple downstream tasks, accurately reflect different programming environments, and deliberately avoid data leakage issues. This benchmark includes a diverse set of samples from real-world projects, aiming to closely mirror actual development scenarios.

    Overview

    ComplexCodeEval consists of:
    - 3,897 Java samples from 1,055 code repositories
    - 7,184 Python samples from 2,107 code repositories

    Key Features

    - Diverse Downstream Tasks: The benchmark supports multiple downstream tasks to evaluate the performance of different code analysis tools and models.
    - Accurate Reflection of Programming Environments: Samples are selected from projects that use popular third-party frameworks and packages.
    - Avoidance of Data Leakage: Incorporates multiple timestamps for each sample to prevent data leakage.
