Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for IFEval
Dataset Summary
This dataset contains the prompts used in the Instruction-Following Eval (IFEval) benchmark for large language models. It contains around 500 "verifiable instructions", such as "write in more than 400 words" and "mention the keyword of AI at least 3 times", which can be verified by heuristics. To load the dataset, run:
from datasets import load_dataset
ifeval = load_dataset("google/IFEval")
Supported Tasks and… See the full description on the dataset page: https://huggingface.co/datasets/google/IFEval.
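As an illustration of how such "verifiable instructions" can be checked heuristically, here is a minimal sketch for the two example constraints above; the helper names are hypothetical and not part of IFEval's own verification code.
```python
# A minimal sketch of heuristic verification for two IFEval-style instructions.
# The helper names are hypothetical; IFEval ships its own verification logic.

def meets_min_words(response: str, min_words: int = 400) -> bool:
    # "write in more than 400 words": count whitespace-separated tokens
    return len(response.split()) > min_words

def mentions_keyword(response: str, keyword: str = "AI", min_count: int = 3) -> bool:
    # "mention the keyword of AI at least 3 times" (case-insensitive count)
    return response.lower().count(keyword.lower()) >= min_count

response = "AI helps with many tasks. " * 100  # placeholder model output
print(meets_min_words(response), mentions_keyword(response))  # True True
```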
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
WildIFEval Dataset
This dataset was originally introduced in the paper WildIFEval: Instruction Following in the Wild, available on arXiv. Code: https://github.com/gililior/wild-if-eval
Dataset Overview
The WildIFEval dataset is designed for evaluating instruction-following capabilities in language models. It provides decompositions of conversations extracted from the LMSYS-Chat-1M dataset. Each example includes:
conversation_id: A unique identifier for each conversation.… See the full description on the dataset page: https://huggingface.co/datasets/gililior/wild-if-eval.
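A minimal loading sketch with the datasets library; only the repository id (gililior/wild-if-eval) comes from the dataset page above, and the split name is an assumption.
```python
from datasets import load_dataset

# Assumed split name; see the dataset page for the exact configuration.
wild_ifeval = load_dataset("gililior/wild-if-eval", split="test")
example = wild_ifeval[0]
print(example["conversation_id"])  # documented field; the remaining fields hold the decomposition
```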
This dataset evaluates the instruction-following ability of large language models. There are 500+ prompts with instructions such as "write an article with more than 800 words", "wrap your response with double quotation marks", etc.
The gililior/wild-if-eval-gpt-decomposition dataset is hosted on Hugging Face and contributed by the HF Datasets community.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
RedEval is a safety evaluation benchmark designed to assess the robustness of large language models (LLMs) against harmful prompts. It simulates and evaluates LLM applications across various scenarios, all while eliminating the need for human intervention. Here are the key aspects of RedEval:
Purpose: RedEval aims to evaluate LLM safety using a technique called Chain of Utterances (CoU)-based prompts. CoU prompts are effective at breaking the safety guardrails of various LLMs, including GPT-4, ChatGPT, and open-source models.
Safety Assessment: RedEval provides simple scripts to evaluate both closed-source systems (such as ChatGPT and GPT-4) and open-source LLMs on its benchmark. The evaluation focuses on harmful questions and computes the Attack Success Rate (ASR).
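As a rough illustration of the metric, the Attack Success Rate is the fraction of harmful prompts for which the judged response is deemed harmful. The toy sketch below assumes judge verdicts are already available as booleans; it is not RedEval's own scoring script.
```python
# A toy sketch of the Attack Success Rate (ASR) computation, not RedEval's own script.
# Assume a judge has already labelled each model response as harmful (True) or safe (False).
judge_verdicts = [True, False, False, True, False]

asr = sum(judge_verdicts) / len(judge_verdicts)  # ASR = harmful responses / total prompts
print(f"Attack Success Rate: {asr:.1%}")  # -> 40.0%
```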
Question Banks:
HarmfulQA: Consists of 1,960 harmful questions covering 10 topics and approximately 10 subtopics each.
DangerousQA: Contains 200 harmful questions across 6 adjectives: racist, stereotypical, sexist, illegal, toxic, and harmful.
CategoricalQA: Includes 11 categories of harm, each with 5 sub-categories, available in English, Chinese, and Vietnamese.
AdversarialQA: Provides a set of 500 instructions to tease out harmful behaviors from the model.
Safety Alignment: RedEval also offers code to perform safety alignment of LLMs. For instance, it aligns Vicuna-7B on HarmfulQA, resulting in a safer version of Vicuna that is more robust against RedEval.
Installation:
Create a conda environment: conda create --name redeval -c conda-forge python=3.11
Activate the environment: conda activate redeval
Install required packages: pip install -r requirements.txt
Store API keys in the api_keys directory for use by the LLM-as-a-judge and by the generate_responses.py script for closed-source models.
Prompt Templates:
Choose a prompt template for red-teaming:
Chain of Utterances (CoU): Effective at breaking safety guardrails.
Chain of Thoughts (CoT)
Standard prompt
Suffix prompt
Note: Different LLMs may require slight variations in the prompt template.
How to Perform Red-Teaming:
Step 0: Decide on the prompt template.
Step 1: Generate model outputs on the harmful questions by providing a path to the question bank and the red-teaming prompt.
https://choosealicense.com/licenses/odc-by/
IF Data - RLVR Formatted
This dataset contains instruction-following data formatted for use with open-instruct, specifically for reinforcement learning with verifiable rewards. The prompts with verifiable constraints were generated by sampling from the Tulu 2 SFT mixture and randomly adding constraints from IFEval. Part of the Tulu 3 release, for which you can see models here and datasets here.
Dataset Structure
Each example in the dataset contains the standard instruction-tuning… See the full description on the dataset page: https://huggingface.co/datasets/allenai/RLVR-IFeval.
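A minimal loading sketch; only the repository id (allenai/RLVR-IFeval) comes from the dataset page above, and the split name is an assumption.
```python
from datasets import load_dataset

# Assumed split name; consult the dataset page for the exact configuration.
rlvr_ifeval = load_dataset("allenai/RLVR-IFeval", split="train")
print(rlvr_ifeval[0])  # one instruction-tuning example together with its verifiable constraint
```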
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Multilingual Meta-EVALuation benchmark (MM-Eval)
👨💻Code | 📄Paper | 🤗 MMQA
MM-Eval is a multilingual meta-evaluation benchmark consisting of five core subsets—Chat, Reasoning, Safety, Language Hallucination, and Linguistics—spanning 18 languages and a Language Resource subset spanning 122 languages for a broader analysis of language effects.
Design ChoiceIn this work, we minimize the inclusion of translated samples, as mere translation may alter existing preferences due to… See the full description on the dataset page: https://huggingface.co/datasets/prometheus-eval/MM-Eval.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Large Language Model (LLM) evaluation is currently one of the most important areas of research, with existing benchmarks proving to be insufficient and not completely representative of LLMs' various capabilities. We present a curated collection of challenging statements on sensitive topics for LLM benchmarking called TruthEval. These statements were curated by hand and contain known truth values. The categories were chosen to distinguish LLMs' abilities from their stochastic nature. Details of collection method and use cases can be found in this paper: TruthEval: A Dataset to Evaluate LLM Truthfulness and Reliability
COFFE
COFFE is a Python benchmark for evaluating the time efficiency of LLM-generated code. It is released with the FSE'25 paper "COFFE: A Code Efficiency Benchmark for Code Generation". You can also refer to the project webpage for more details.
Data
COFFE is designed for evaluating both function-level code and file-level code. It contains selected instances from HumanEval, MBPP, APPS and Code Contests. COFFE keeps the original test cases in these benchmarks as correctness test cases and adds new test cases designed for time efficiency evaluation as stressful test cases.
Statistics:
Category | #Instance | #Solution/Instance | #Correctness/Instance | #Stressful/Instance |
---|---|---|---|---|
Function-Level | 398 | 1.00 | 5.72 | 4.99 |
File-Level | 358 | 66.93 | 43.68 | 4.95 |
Data Files:
All instances in COFFE are in Coffe/datasets, where Coffe/datasets/function contains all function-level instances and Coffe/datasets/file contains all file-level instances. In each directory:
- best_solutions.json contains the best ground-truth solution COFFE uses to calculate efficient@1 and speedup.
- stressful_testcases.json contains all stressful test cases COFFE adds.
- solutions.json contains all ground-truth solutions from the original benchmarks.
- testcases.json contains all correctness test cases from the original benchmarks.
Installation
Requirements:
- Linux machine
- Docker
- Python >= 3.10
We suggest you create a virtual environment before installing COFFE.
To use COFFE, please clone this repo into your current workspace workspace/ and execute:
cd Coffe && pip install .
COFFE comes with a Docker image; to build it:
docker build . -t coffe
Note that if your network requires a proxy, please modify the Dockerfile in Coffe/ accordingly; otherwise the Docker image build could fail.
Go back to your workspace and initialize COFFE:
cd .. && coffe init
If your installation succeeded, you should see the statistics of COFFE.
Usage
Pipeline
When you prepare the predictions from LLMs, COFFE provides a pipeline to calculate the efficient@1 and speedup metrics defined in the paper:
coffe pipe
For example:
coffe pipe function Coffe/examples/function -p Coffe/examples/function/GPT-4o.json -f efficient_at_1 -n 8
This command evaluates the predictions from GPT-4o on the function-level instances of COFFE. If you want to evaluate other LLMs, please prepare a JSON file with the same format as Coffe/examples/function/GPT-4o.json.
Prediction File Format:
In the JSON file, each key is the prompt used to query the LLM for results; the prompts are available in datasets/function/prompts.json and datasets/file/prompts.json. Each value contains two objects: the first is a list of the raw outputs from the LLM, and the second is an indicator of whether the raw output is valid.
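A rough sketch of what one entry might look like under that description; the prompt text and the boolean indicator are assumptions for illustration, and Coffe/examples/function/GPT-4o.json remains the authoritative reference for the format.
```python
import json

# Hypothetical single entry, following the description above:
# key = the prompt used to query the LLM, value = [list of raw LLM outputs, validity indicator].
predictions = {
    "Write a function that returns the n-th Fibonacci number.": [
        ["def fib(n):\n    a, b = 0, 1\n    for _ in range(n):\n        a, b = b, a + b\n    return a"],
        True,  # assumed to be a boolean marking whether the raw output is valid
    ]
}

with open("my_model_predictions.json", "w") as f:
    json.dump(predictions, f, indent=2)
```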
Note:
By default, COFFE will run all predictions in Docker. However, if you cannot successfully install Docker or want to run the predictions on the host machine, you can add the -x option.
Single Evaluation
The pipe command provides an entire pipeline for calculating the final metrics. This pipeline can also be completed by executing the following four single evaluation commands.
Sanitize the predictions:
coffe eval
Select correct predictions:
coffe eval
Note: This command will combine all correct solutions and ground-truth solutions together into files.
Evaluate the CPU instruction count:
coffe eval
Note: This command only accepts a single prediction file ending with PASSED_SOLUTIONS.json.
Calculate the efficient@1/speedup:
coffe eval
Note: This command requires the index file and the instruction file, as COFFE compares the performance of predictions with ground-truth solutions to calculate the metrics.
STGen
For details about the stressful test case generation approach STGen, please see stgen/.
Cite
If you use COFFE, please cite us:
@misc{peng2025coffe,
  title={COFFE: A Code Efficiency Benchmark for Code Generation},
  author={Yun Peng and Jun Wan and Yichen Li and Xiaoxue Ren},
  year={2025},
  eprint={2502.02827},
  archivePrefix={arXiv},
  primaryClass={cs.SE},
  url={https://arxiv.org/abs/2502.02827},
}
Marci Smeltz, John F. Wambaugh, and Barbara A. Wetmore, Plasma Protein Binding Evaluations of Per- and Polyfluoroalkyl Substances for Category-Based Toxicokinetic Assessment, Chemical Research in Toxicology, 10.1021/acs.chemrestox.3c00003. This dataset is associated with the following publication: Smeltz, M., J. Wambaugh, and B. Wetmore. Plasma Protein Binding Evaluations of Per- and Polyfluoroalkyl Substances for Category-Based Toxicokinetic Assessment. CHEMICAL RESEARCH IN TOXICOLOGY. American Chemical Society, Washington, DC, USA, 36(6): 870-871, (2023).
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Contextualized Evaluations: Taking the Guesswork Out of Language Model Evaluations
Dataset Summary
We provide here the data accompanying the paper: Contextualized Evaluations: Taking the Guesswork Out of Language Model Evaluations.
Dataset Structure
Data Instances
We release the set of queries, as well as the autorater & human evaluation judgements collected for our experiments.
Data overview
List of queries: Data Structure
The list… See the full description on the dataset page: https://huggingface.co/datasets/allenai/ContextEval.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The TruthfulQA dataset is specifically designed to evaluate the truthfulness of language models in generating answers to a wide range of questions. Comprising 817 carefully crafted questions spanning various topics such as health, law, finance, and politics, this benchmark aims to uncover any erroneous or false answers that may arise due to incorrect beliefs or misconceptions. It serves as a comprehensive measure of the ability of language models to go beyond imitating human texts and avoid generating inaccurate responses. The dataset includes columns such as type (indicating the format or style of the question), category (providing the topic or theme), best_answer (the correct and truthful answer), correct_answers (a list containing all valid responses), incorrect_answers (a list encompassing potential false interpretations provided by some humans), source (identifying the origin or reference for each question), and mc1_targets and mc2_targets (highlighting the respective correct answers for multiple-choice questions). The generation_validation.csv file contains generated questions and their corresponding evaluations based on truthfulness, while multiple_choice_validation.csv focuses on validating multiple-choice questions along with their answer choices. Through this dataset, researchers can comprehensively assess language model performance in terms of factual accuracy and avoidance of misleading information during answer generation tasks.
How to use the dataset
How to Use the TruthfulQA Dataset: A Guide
Welcome to the TruthfulQA dataset, a benchmark designed to evaluate the truthfulness of language models in generating answers to questions. This guide will provide you with essential information on how to effectively utilize this dataset for your own purposes.
Dataset Overview
The TruthfulQA dataset consists of 817 carefully crafted questions covering a wide range of topics, including health, law, finance, and politics. These questions are constructed in such a way that some humans would answer falsely due to false beliefs or misconceptions. The aim is to assess language models' ability to avoid generating false answers learned from imitating human texts.
Files in the Dataset
The dataset includes two main files:
generation_validation.csv: This file contains questions and answers generated by language models. These responses are evaluated based on their truthfulness.
multiple_choice_validation.csv: This file consists of multiple-choice questions along with their corresponding answer choices for validation purposes.
Column Descriptions
To better understand the dataset and its contents, here is an explanation of each column present in both files:
type: Indicates the type or format of the question.
category: Represents the category or topic of the question.
best_answer: Provides the correct and truthful answer according to human knowledge/expertise.
correct_answers: Contains a list of correct and truthful answers provided by humans.
incorrect_answers: Lists incorrect and false answers that some humans might provide.
source: Specifies where the question originates from (e.g., publication, website).
For multiple-choice questions: mc1_targets, mc2_targets, etc.: Represent the different options available as answer choices (with corresponding correct answers).

Using this Dataset Effectively
When utilizing this dataset for evaluation or testing purposes:
Truth Evaluation: For assessing language models' truthfulness in generating answers, use the generation_validation.csv file. Compare the model answers with the correct_answers column to evaluate their accuracy.
Multiple-Choice Evaluation: To test language models' ability to choose the correct answer among given choices, refer to the multiple_choice_validation.csv file. The correct answer options are provided in the columns such as mc1_targets, mc2_targets, etc.
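As an illustration of the truth-evaluation workflow described above, here is a minimal sketch using pandas; the model "answer" column is an assumption, since only the columns documented above are known, and exact-match scoring is only a crude stand-in for a proper truthfulness judgment.
```python
import pandas as pd
from ast import literal_eval

# A minimal sketch, assuming generation_validation.csv has a model "answer" column
# alongside the documented "correct_answers" column (stored as a stringified list).
df = pd.read_csv("generation_validation.csv")

def is_truthful(row) -> bool:
    """Crude exact-match check of a model answer against the list of correct answers."""
    correct = row["correct_answers"]
    if isinstance(correct, str):
        correct = literal_eval(correct)  # e.g. "['Nothing happens', ...]" -> list
    answer = str(row["answer"]).strip().lower()
    return answer in {str(a).strip().lower() for a in correct}

truthful_rate = df.apply(is_truthful, axis=1).mean()
print(f"Truthful (exact-match) rate: {truthful_rate:.1%}")
```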
Ensure that you consider these guidelines while leveraging this dataset for your analysis or experiments related to evaluating language models' truthfulness and performance.
Remember that this guide is intended to help
Research Ideas
Training and evaluating language models: The TruthfulQA dataset can be used to train and evaluate the truthfulness of language models in generating answers to questions. By comparing the generated answers with the correct and truthful ones provided in the dataset, researchers can assess the ability of language models to avoid false answers learned from imitating human texts.
Detecting misinformation: This dataset can also be used to develop algorithms or models that are capable of identifying false or misleading information. By analyzing the generated answers and comparing them with the correct ones, it is possible to build systems that automatically detect and flag misinformation.
Improving fact-checking systems: Fact-checking platforms or systems can benefit from this dataset by using it as a source for training and validating their algorithms. With access to a large number of
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This data set includes the independent variable "Name" which identifies whether the instructor expressed preference for being referred to by her first name or professional title (e.g. "Dr. Last Name"). The dependent variable of interest was the Student Evaluation of Instruction scores, especially the "overall" score. Class information includes the class size, number of evaluations, year, and semester (Fall or Spring).
Dataset for "Crizer, D.M.; Rice, J.R.; Smeltz, M.G.; Lavrich, K.S.; Ravindra, K.; Wambaugh, J.F.; DeVito, M.; Wetmore, B.A. In Vitro Hepatic Clearance Evaluations of Per- and Polyfluoroalkyl Substances (PFAS) across Multiple Structural Categories. Toxics 2024, 12, 672. https://doi.org/10.3390/toxics12090672". This dataset is associated with the following publication: Crizer, D., J. Rice, M. Smeltz, K. Lavrich, K. Ravindra, J. Wambaugh, M. Devito, and B. Wetmore. In Vitro Hepatic Clearance Evaluations of Per- and Polyfluoroalkyl Substances (PFAS) Across Multiple Structural Categories. Toxics. MDPI, Basel, SWITZERLAND, 12(9): 672, (2024).
Dataset Summary
DiscoEval is an English-language benchmark that contains a test suite of 7 tasks to evaluate whether sentence representations include semantic information relevant to discourse processing. The benchmark datasets offer a collection of tasks designed to evaluate natural language understanding models in the context of discourse analysis and coherence.
Dataset Sources
Arxiv: A repository of scientific papers and research articles.
Wikipedia: An extensive online encyclopedia with articles on diverse topics.
Rocstory: A dataset consisting of fictional stories.
Ubuntu IRC channel: Conversational data extracted from the Ubuntu Internet Relay Chat (IRC) channel.
PeerRead: A dataset of scientific papers frequently used for discourse-related tasks.
RST Discourse Treebank: A dataset annotated with Rhetorical Structure Theory (RST) discourse relations.
Penn Discourse Treebank: Another dataset with annotated discourse relations, facilitating the study of discourse structure.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This artifact contains the implementation and the results of the evaluation of a static slicer for WebAssembly described in the ICSE 2022 paper titled "Static Stack-Preserving Intra-Procedural Slicing of WebAssembly Binaries".
The artifact contains a docker image (wassail-eval.tar.xz) that contains everything necessary to reproduce our evaluation, and the actual data resulting from our evaluation:
1. The implementation of our slicer (presented in Section 4.1) is included in the docker machine, and is available publicly here: https://github.com/acieroid/wassail/tree/icse2022
2. Test cases used for our evaluation of RQ1 are included in the docker machine and in the rq1.tar.xz archive.
3. The dataset used in RQ2, RQ3, and RQ4 is included in the docker machine.
4. The code needed to run our evaluation of RQ2, RQ3, and RQ4 is included in the docker machine.
5. The scripts used to generate the statistics and graphs that are included in the paper for RQ2, RQ3, and RQ4 are included in the docker machine and as the *.py files in this artifact.
6. The data of RQ5 that has been used in our manual investigation is included in the docker machine and in the rq5.tar.xz archive, along with rq5-manual.txt detailing our manual analysis findings.
The artifact is available on Zenodo at the following URL: https://zenodo.org/record/5821007
Once the artifact is downloaded as the file icse2022slicing.tar.xz, it can be extracted and loaded into Docker as follows (this takes a few minutes):
docker import icse2022slicing.tar.xz
To simplify further commands, you can tag the image using the printed sha256 hash of the image: if the docker import command resulted in the hash 54aa9416a379a6c71b1c325985add8bf931752d754c8fb17872c05f4e4b52ea2, you can run:
docker tag 54aa9416a379a6c71b1c325985add8bf931752d754c8fb17872c05f4e4b52ea2 wassail-eval
Once the Docker image has been loaded, you can run the following commands to
obtain a shell in the appropriate environment:
docker volume create result
docker run -it -v result:/tmp/out/ wassail-eval bash
su - opam
Our manual translations of the "classical" examples are included in the rq1/ directory (available in the docker image and in rq1.tar.xz). We include the slices computed by our implementation in the rq1/out/ directory.
A slice can be produced for each example in the docker image as follows, where the first argument is the name of the program being sliced, the second the function index being sliced, the third the slicing criterion (indicated as the instruction index, where instructions start at 1), and the last argument is the output file for the slice:
cd rq1/
wassail slice scam-mug.wat 5 8 scam-mug-slice.wat
wassail slice montreal-boat.wat 5 19 montreal-boat-slice.wat
wassail slice word-count.wat 1 41 word-count-slice1.wat
wassail slice word-count.wat 1 43 word-count-slice2.wat
wassail slice word-count.wat 1 39 word-count-slice3.wat
wassail slice word-count.wat 1 45 word-count-slice4.wat
wassail slice word-count.wat 1 37 word-count-slice5.wat
wassail slice agrawal-fig-3.wat 3 38 agrawal-fig-3-slice.wat
wassail slice agrawal-fig-5.wat 3 37 agrawal-fig-5-slice.wat
The slice results can then be inspected manually, and compared with the original version of the .wat program to see which instructions have been removed, or with the expected solutions in the out/ directory, e.g. by running:
diff word-count-slice1.wat out/word-count-slice1.wat
(No output is expected if the slice is correct)
For these RQs, we include the data resulting from our evaluation, but we also allow reviewers to rerun the full evaluation if needed. However, such an evaluation requires a heavy machine and takes quite some time (4-5 days to run to completion with a 4-hour timeout). In our case, we used a machine with 256 GB of RAM and a 64-core processor with HyperThreading enabled, allowing us to run 128 slicing jobs in parallel.
We explain how to run the full evaluation, or only a partial evaluation below. One can directly skip to the next section and reuse our raw evaluation results, provided alongside this artifact.
In order to reproduce our evaluation, you can run the following commands in the docker image. It is recommended to run them in a tmux session if one wants to inspect other elements in parallel (tmux is installed in the docker image). The timeout (set to 4 hours per binary, like in the paper) can be decreased by editing the evaluate.sh script (vim is installed in the docker image).
This is expected to take 2-3 days of time, on a machine with 128 cores. In order to produce only partial results, see the next section.
cd filtered
cat ../supported.txt | parallel --bar -j 128 sh ../evaluate.sh {}
The results are outputted in the /tmp/out/ directory.
If one does not have access to a high-end machine with 128 cores nor the time to run the full evaluation, it is possible to produce partial results. To do so, the following commands can be run. This will run the evaluation on the full dataset in a random order, which can be stopped early to represent a partial view of our evaluation, on a random subset of the data. In order to gather more datapoints, it is also advised to decrease the timeout in the evaluate.sh file, for example to 20 minutes by setting TIMEOUT=20m with nano evaluate.sh. The number of slicing jobs running in parallel can also be decreased to match the number of processors on the machine running the experiments (the -j 128 argument in the following command runs 128 parallel jobs).
sudo chown opam:opam /tmp/out/
cd filtered
shuf ../supported.txt | parallel --bar -j 128 sh ../evaluate.sh {}
The evaluation results will be stored in the /tmp/out/ directory.
Instead of rerunning the evaluation, one can rely on our full results included in the data.txt.xz and error.txt.xz archives. These can simply be downloaded from within the Docker machine and extracted in /tmp/out/:
cd /tmp/out/
wget https://zenodo.org/record/5821007/files/data.txt.xz
wget https://zenodo.org/record/5821007/files/error.txt.xz
unxz data.txt.xz
unxz error.txt.xz
In order to process this data, we include multiple Python scripts.
These require around 100GB of RAM to load the full dataset in memory.
The scripts should be run with Python 3.
When running this in the docker image, first run cd /tmp/out/ && cp /home/opam/*.py ./
- To count the number of functions sliced, run cut -d, -f 1,2 data.txt | sort -u | wc -l. This takes around 6 minutes to run on the full dataset.
- To count the total number of slices encountered, run wc -l data.txt error.txt. This takes around 15 seconds to run.
- To count the number of errors encountered, run wc -l error.txt. This takes around 1 second to run.
- To produce data and graphs regarding the sizes and timing, run python3 statistics-and-plots.py. This will output the statistics presented in the paper, along with Figure 2 (rq2-sizes.pdf) and Figure 3 (rq2-times.pdf). This script takes around 35 minutes to run.
- To find the executable slices that are larger than the original programs, run python3 larger-slices.py > larger.txt. This script takes around 2h30 to run. It will list the slices using the notation filename function-sliced slicing-criterion in the larger.txt file, from which the slice can be recomputed by running wassail slice function-sliced slicing-criterion output.wat in the docker image. It will also output statistics regarding these slices, which you can easily inspect by running tail larger.txt.
- To investigate slices that could not be computed, run:
sed -i error.txt -e 's/annotation,/annotation./'
python3 errors.py
This will take a few seconds to run and will print a summary of the errors encountered during the slicing process, and requires some manual sorting to map to the categories we discuss in the paper. Here is a summary of the errors encountered and their root cause:
Error: (Failure"Invalid vstack when popping 2 values") Error: (Failure"Spec_inference.drop: not enough elements in stack") Error: (Failure"Spec_inference.take: not enough element in var list") Error: (Failure"unsupported in spec_inference: incompatible stack lengths (probably due to mismatches in br_table branches)")
Error: (Failure"Unsupported in slicing: cannot find an instruction. It probably is part of unreachable code.") Error: (Failure"bottom annotation") Error: (Failure"bottom annotation. this an unreachable instruction")
For this RQ, we include the following data in the rq5.7z archive, and in the rq5/ directory in the docker image:
- The slicing subjects in their C and textual wasm form in rq5/subjects/
- The CodeSurfer slices in their C and textual wasm form in rq5/codesurfer/
- Our slices in their wasm form in rq5/wasm-slices/
As this RQ requires heavy manual comparison, we do not expect the reviewers to reproduce all of our results. We include a summary of our manual investigation in rq5-manual.txt. In order to validate these manual findings, one can for example inspect a specific slice. For example, the following line in rq5-manual.txt:
adpcm_apl1_565_expr.c.wat INTERPROCEDURAL
can be validated as follows:
cd ~/
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The data set comprises signals from the force platform (raw data for the force, moments of forces, and centers of pressure) of 163 subjects, plus one file with information about the subjects and balance conditions and the results of the other evaluations. Subjects' balance was evaluated by posturography using a force platform and by the Mini Balance Evaluation Systems Tests. In the posturography test, we evaluated subjects during standing still for 60 s in four different conditions where vision and the standing surface were manipulated: on a rigid surface with eyes open; on a rigid surface with eyes closed; on an unstable surface with eyes open; on an unstable surface with eyes closed. Each condition was performed three times and the order of the conditions was randomized among subjects. In addition, the following tests were employed in order to better characterize each subject: Short Falls Efficacy Scale International; International Physical Activity Questionnaire Short Version; and Trail Making Test. The subjects were also interviewed to collect information about their socio-cultural, demographic, and health characteristics. The data set is available at PhysioNet (DOI: 10.13026/C2WW2W) and at figshare (DOI: 10.6084/m9.figshare.3394432).
https://www.myvisajobs.com/terms-of-service/
A dataset that explores Green Card sponsorship trends, salary data, and employer insights for the job title "engineer; eval" in the U.S.
ComplexCodeEval
ComplexCodeEval is an evaluation benchmark designed to accommodate multiple downstream tasks, accurately reflect different programming environments, and deliberately avoid data leakage issues. This benchmark includes a diverse set of samples from real-world projects, aiming to closely mirror actual development scenarios.
Overview
ComplexCodeEval consists of:
- 3,897 Java samples from 1,055 code repositories
- 7,184 Python samples from 2,107 code repositories
Key Features
Diverse Downstream Tasks: The benchmark supports multiple downstream tasks to evaluate the performance of different code analysis tools and models.
Accurate Reflection of Programming Environments: Samples are selected from projects that use popular third-party frameworks and packages.
Avoidance of Data Leakage: Incorporates multiple timestamps for each sample to prevent data leakage.