Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
RepoBench is a dataset that benchmarks repository-level code auto-completion systems.
RepoBench-P denotes RepoBench for pipeline, the subtask of RepoBench that includes both relevant code retrieval and next-line code prediction.
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
RepoBench is a dataset that benchmarks repository-level code auto-completion systems.
RepoBench-C denotes RepoBench for code completion, the subtask of RepoBench for next-line code prediction given both cross-file and in-file context.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
See details from OpenAI: https://openai.com/index/introducing-swe-bench-verified/
Converted from Parquet to CSV from https://huggingface.co/datasets/princeton-nlp/SWE-bench_Verified
Data Summary from Huggingface:
SWE-bench Verified is a subset of 500 samples from the SWE-bench test set, which have been human-validated for quality. SWE-bench is a dataset that tests systems' ability to solve GitHub issues automatically. See this post for more details on the human-validation process.
The dataset collects 500 test Issue-Pull Request pairs from popular Python repositories. Evaluation is performed by unit test verification using post-PR behavior as the reference solution.
The original SWE-bench dataset was released as part of SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
Want to run inference now? This dataset only contains the problem_statement (i.e. issue text) and the base_commit which represents the state of the codebase before the issue has been resolved. If you want to run inference using the "Oracle" or BM25 retrieval settings mentioned in the paper, consider the following datasets.
princeton-nlp/SWE-bench_Lite_oracle
princeton-nlp/SWE-bench_Lite_bm25_13K
princeton-nlp/SWE-bench_Lite_bm25_27K
Supported Tasks and Leaderboards
SWE-bench proposes a new task: issue resolution, given a full repository and a GitHub issue. The leaderboard can be found at www.swebench.com
Languages
The text of the dataset is primarily English, but we make no effort to filter or otherwise clean based on language type.
Dataset Structure
An example of a SWE-bench datum is as follows:
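A minimal, hedged sketch (assuming the Hugging Face `datasets` library) that loads SWE-bench Verified and prints the two fields named above, `problem_statement` and `base_commit`; no other field names are assumed here.

```python
# Minimal sketch: load SWE-bench Verified and inspect one datum.
# Only the fields mentioned in the description above are accessed.
from datasets import load_dataset

ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
print(len(ds))                      # 500 human-validated instances

sample = ds[0]
print(sample["problem_statement"])  # issue text
print(sample["base_commit"])        # codebase state before the issue was resolved
```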
anonymous-my-repo/OmniEarth-Bench dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The entire dump of GitHub repositories.
This dataset tracks the updates made on the dataset "Bench and Bedside" as a repository for previous versions of the data and metadata.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset represents the benchmark results of Goblint, conducted on all coreutils programs of the goblint/bench repository. It includes the following measurements (note: flambda enabled means that Goblint is compiled with flambda and the provided optimization flags):
Without flambda
With flambda's -Oclassic flag
With flambda's -O2 flag
With flambda's -O3 flag
With flambda's -O3 flag, as well as the following additional flags: -inline-toplevel=400 -inline-max-depth=1 -inline-max-unroll=0
This repository contains scripts and supporting information for a set of demonstrations related to the 2022 AM-Bench challenge series. In particular, the data used in this demonstration comes from the 2018 AM-Bench challenge documented at https://www.nist.gov/ambench/amb2018-02-description. The script queries the AM-Bench 2018 repository located at https://ambench.nist.gov/, then processes the returned XML-based results, does some image processing, and generates plots of melt pool depths. This work was originally done by Miyu Mudalamane (University of Delaware) during a 2021 Summer Undergraduate Research Fellowship at NIST supervised by Chandler Becker and Gretchen Greene (NIST Office of Data and Informatics, ODI). It was later adapted as this notebook by Jordan Raddick (Johns Hopkins University) in consultation with NIST ODI staff. Additional datasets and challenge problem documentation are available through the NIST Public Data Repository record at https://data.nist.gov/od/id/6D6EC9B3A4147BE2E05324570681EEC91931, with an associated publication (https://link.springer.com/article/10.1007/s40192-020-00169-1) documenting the experiments.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
MigrationBench
1. Overview
MigrationBench is a large-scale code migration benchmark dataset at the repository level, across multiple programming languages.
The current and initial release includes Java 8 repositories with the Maven build system… See the full description on the dataset page: https://huggingface.co/datasets/AmazonScience/migration-bench-java-full.
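As a hedged illustration (not taken from the dataset card), the benchmark can be loaded with the Hugging Face `datasets` library using the dataset ID above; split and column names are printed rather than assumed.

```python
# Hedged sketch: load MigrationBench by the dataset ID given above and
# inspect what it contains.
from datasets import load_dataset

ds = load_dataset("AmazonScience/migration-bench-java-full")
print(ds)  # shows the available splits and their columns
```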
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This replication package contains the code (aws-triggers and azure-trigger), data analysis scripts (data-analysis), and dataset (data) of the TriggerBench cross-provider serverless benchmark.
It also bundles a customized extension of the serverless-benchmarker tool to automate and analyze serverless performance experiments.
TriggerBench
The GitHub repository joe4dev/trigger-bench contains the latest version of TriggerBench. This replication package describes the version for the paper "TriggerBench: A Performance Benchmark for Serverless Function Triggers".
TriggerBench currently supports three triggers on AWS and eight triggers on Microsoft Azure.
Dataset
The data/aws and data/azure directories contain data from benchmark executions from April 2022.
Each execution is a separate directory with a timestamp in the format yyyy-mm-dd_HH-MM-SS (e.g., 2022-04-15_21-58-52) and contains the following files:
k6_metrics.csv: Load generator HTTP client logs in CSV format (see K6 docs)
sb_config.yml: serverless benchmarker execution configuration including experiment label.
trigger.csv: analyzer output CSV, one row per trace, with the following columns (a short loading sketch follows this list):
root_trace_id: The trace id created by k6 and adopted by the invoker function
child_trace_id: The trace id newly created by the receiver function if trace propagation is not supported (this is the case for most asynchronous triggers)
t1-t4: Timestamps following the trace model (see paper)
t5-t9: Additional timestamps for measuring timestamping overhead
coldstart_f1=True|False: coldstart status for invoker (f1) and receiver (f2) functions
trace_ids.txt: text file with each pair of root_trace_id and child_trace_id on a new line.
traces.json: raw trace JSON representation as retrieved from the provider tracing service. For AWS, see X-Ray segment docs. For Azure, see Application Insights telemetry data model.
workload_options.json: K6 load scenario configuration.
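A minimal, hedged sketch of reading one execution's trigger.csv with pandas and deriving a rough per-trace latency from the t1-t4 timestamps described above. The example directory name and the column names come from this list; the timestamp format and the exact latency definition from the paper are not assumed.

```python
# Hedged sketch: inspect one benchmark execution directory.
import pandas as pd

run_dir = "data/aws/2022-04-15_21-58-52"  # example execution directory from above
traces = pd.read_csv(f"{run_dir}/trigger.csv")

print(traces.columns.tolist())  # e.g. root_trace_id, child_trace_id, t1..t4, ...
print(traces[["root_trace_id", "child_trace_id"]].head())

# If t1..t4 are parseable timestamps, a rough trigger latency per trace can be
# estimated as the gap between t1 and t4 (the paper's trace model defines the
# exact semantics).
if {"t1", "t4"}.issubset(traces.columns):
    t1 = pd.to_datetime(traces["t1"])
    t4 = pd.to_datetime(traces["t4"])
    print((t4 - t1).describe())
```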
Replicate Data Analysis
Installation
Install Python 3.10+
Install Python dependencies pip install -r requirements.txt
Create Plots
python plots.py generates the plots and the statistical summaries presented in the paper. By default, the plots will be saved into a plots sub-directory.
An alternative output directory can be configured through the environment variable PLOTS_PATH.
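For example, a hedged way to drive this from Python rather than the shell (plots.py and PLOTS_PATH come from the description above; the output directory name is arbitrary):

```python
# Hedged sketch: run plots.py with a custom output directory via PLOTS_PATH.
import os
import subprocess

env = dict(os.environ, PLOTS_PATH="my_plots")  # arbitrary output directory
subprocess.run(["python", "plots.py"], check=True, env=env)
```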
Hint: For interactive development, we recommend the VSCode Python extension in interactive mode.
Replicate Cloud Experiments
The following experiment plan automates benchmarking experiments with different types of workloads (constant and bursty).
This generates a new dataset in the same format as described above.
Set up a load generator as a vantage point following the description in LOADGENERATOR.
Choose the PROVIDER (aws or azure) in the constant.py experiment plan
Run the constant.py experiment plan
Open tmux
Activate virtualenv source sb-env/bin/activate
Run ./constant.py 2>&1 | tee -a constant.log
Contributors
The initial trigger implementations for AWS and Azure are based on two master's thesis projects at Chalmers University of Technology in Sweden, supervised by Joel:
AWS + Azure: Performance Comparison of Function-as-a-Service Triggers: A Cross-Platform Performance Study of Function Triggers in Function-as-a-Service by Marcus Bertilsson and Oskar Grönqvist, 2021.
Azure Extension: Serverless Function Triggers in Azure: An Analysis of Latency and Reliability by Henrik Lagergren and Henrik Tao, 2022.
Joel contributed many improvements to their original source code as documented in the import commits a00b67a and 6d2f5ef and developed TriggerBench as an integrated benchmark suite (see commit history for detailed changelog).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
MMSI-Bench
This repo contains evaluation code for the paper "MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence". Homepage | Dataset | Paper | Code | arXiv
News
[2025-10-23]: We added the normalized human response time for each MMSI-Bench sample and its difficulty level to our dataset on Hugging Face. [2025-06-18]: MMSI-Bench has been supported in the LMMs-Eval repository. [2025-06-11]: MMSI-Bench was used for evaluation in the… See the full description on the dataset page: https://huggingface.co/datasets/RunsenXu/MMSI-Bench.
Datasets released as a part of the paper "IndicGenBench: A Multilingual Benchmark to Evaluate Generation Capabilities of LLMs on Indic Languages"
Paper Link: https://arxiv.org/abs/2404.16816
IndicGenBench is a multilingual, multi-way parallel benchmark for measuring language generation capabilities across diverse user-facing tasks in 29 Indic languages spanning 13 writing scripts and 4 language families.
We extend existing datasets in Cross-lingual Summarization (CrossSum), Machine Translation (FLORES), Multi-lingual Question Answering (XQuAD), and Cross-lingual Question Answering (XorQA) by collecting human translations for English examples into the target Indic languages. Please see the paper for details.
The following datasets are released in this repository:
CrossSum-IN:
Given an English article, generate a short summary in the target language.
crosssum_in directory contains english-{target_language}_{split}.json files
Flores-IN: Given a sentence in the source language, generate a translation
in the target language. The data contains translation pairs in two directions,
English → TargetLanguage and TargetLanguage → English
flores_in directory contains flores_{source_language}_{target_language}_{split}.json.
Note: the 'dev' and 'test' split of flores_in corresponds to the 'dev' and 'devtest' split of the original flores dataset.
XQuAD-IN: Given a question and passage in an Indic language,
generate a short answer span from the passage as the answer.
xquad_in directory contains xquad_{target_language}_{split}.json files.
XorQA-IN: Given a question in an Indic language and a passage in English,
generate a short answer span. We provide both an English and target language
answer span in the annotations.
xorqa_in directory contains xorqa_{target_language}_{split}.json files.
source_language: language code of the source language.
target_language: language code of the target language.
split: one of {train, dev, test} (flores-in contains dev and devtest splits only)
After cloning the repository:
```python
import json

json_file_path = '/kaggle/input/indic-gen-bench/crosssum_in/crosssum_english-as_dev.json'

with open(json_file_path, 'r', encoding='utf-8') as f:
    data = json.load(f)

examples = data["examples"]

for example in examples:
    print("Paragraph (In English):", example["text"])
    print("Summary (In Assamese):", example["summary"])
```
### Dataset Sizes
| Task | # Languages | # Examples (train / dev / test) |
| ----------- | ------------ | ------------------------------- |
| CrossSum-IN | 29 | 2.9k / 2.9k / 14.5k |
| Flores-IN | 29 | \- / 28.9k / 29.3k |
| XQuAD-IN | 12 | 1.2k / 1.2k / 14.2k |
| XorQA-IN | 28 | 2.8k / 14k / 15.1k |
## Language Coverage
| Language | Code | Script | Family |
| ------------- | ------- | ---------- | ------------- |
| Bengali | `bn` | Bengali | Indo-European |
| Gujarati | `gu` | Gujarati | Indo-European |
| Hindi | `hi` | Devanagari | Indo-European |
| Kannada | `kn` | Kannada | Dravidian |
| Malayalam | `ml` | Malayalam | Dravidian |
| Marathi | `mr` | Devanagari | Indo-European |
| Tamil | `ta` | Tamil | Dravidian |
| Telugu | `te` | Telugu | Dravidian |
| Urdu | `ur` | Arabic | Indo-European |
| Assamese | `as` | Bengali | Indo-European |
| Bhojpuri | `bho` | Devanagari | Indo-European |
| Nepali | `ne` | Devanagari | Indo-European |
| Odia | `or` | Odia | Indo-European |
| Punjabi | `pa` | Gurumukhi | Indo-European |
| Pashto | `ps` | Arabic | Indo-European |
| Sanskrit | `sa` | Devanagari | Indo-European |
| Awadhi | `awa` | Devanagari | Indo-European |
| Haryanvi | `bgc` | Devanagari | Indo-European |
| Tibetan | `bo` | Tibetan | Sino-Tibetan |
| Bodo | `brx` | Devanagari | Sino-Tibetan |
| Garhwali | `gbm` | Devanagari | Indo-European |
| Konkani | `gom` | Devanagari | Indo-European |
| Chhattisgarhi | `hne` | Devanagari | Indo-European |
| Rajasthani | `hoj` | Devanagari | Indo-European |
| Maithili | `mai` | Devanagari | Indo-European |
| Manipuri | `mni` | Meithi | Sino-Tibetan |
| Malvi | `mup` | D...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
CORDEX-ML-Bench provides a standardized framework for evaluating machine learning models for regional climate downscaling. Participants will train models on Regional Climate Model (RCM) simulations and test their performance across different time periods and spatial domains. This Zenodo repository only provides the training data, other relevant information will be provided below.
| Domain | Resolution | Target Variables | Target Grid Size | Predictor Variables | Predictor Grid Size | Static Fields |
|---|---|---|---|---|---|---|
| New Zealand (NZ) | 0.11° | Tasmax, Pr | 128 × 128 | u, v, q, t, z at 850, 700, 500 hPa (15 variables) | 16 × 16 (2°) | Orography (128 × 128; 0.11°) |
| Europe (ALPS) | 0.11° | Tasmax, Pr | 128 × 128 | u, v, q, t, z at 850, 700, 500 hPa | 16 × 16 (2°) | Orography |
| South Africa (SA) | 0.10° | Tasmax, Pr | 128 × 128 | u, v, q, t, z at 850, 700, 500 hPa | 16 × 16 (2°) | Orography |
An illustration of the predictor (16 × 16) and target (128 × 128) domains is shown above; note that the predictor domains are slightly larger than the target domains.
u - zonal wind component
v - meridional wind component
q - specific humidity
t - temperature
z - geopotential height
For code examples and tutorials on loading the training data, refer to the official repository:
Region-specific preprocessing information:
The benchmark investigates training in the "perfect" framework:
Evaluation is supported in both perfect (coarsened RCM) and imperfect (applied directly to the GCM) configurations.
Training Period: 20 years (1961β1980)
GCM Examples: ACCESS-CM2 (NZ), CNRM-CM5 (ALPS)
This experiment mimics traditional Empirical Statistical Downscaling (ESD) training.
Configurations:
Training Period: 40 years combined
Periods: Historical (1961β1980) + Future (2081β2100)
GCM Examples: ACCESS-CM2 (NZ), CNRM-CM5 (ALPS)
This experiment evaluates model extrapolation skill and transferability across different GCMs.
Configurations:
| Training Setup | Inference Set | Evaluation Type | Notes | Metrics | Required |
|---|---|---|---|---|---|
| ESD 1961–1980, Static: Yes/No | Historical (1981–2000) | PP cross-validation | Same GCM, perfectly | Error, Clim | β |
| | Historical (1981–2000) | Imperfect cross-validation | Same GCM, imperfectly | Error, Clim | β |
| | 2041–2060, 2081–2100 | Extrapolation | Same GCM, perfectly | Change signal | β |
| | 2041–2060, 2081–2100 | Extrapolation | Same GCM, imperfectly | Change signal | β |
| Training Setup | Inference Set | Evaluation Type | Notes | Metrics | Required |
|---|---|---|---|---|---|
| Emulator 1961–1980 + 2081–2100, Static: Yes/No | Historical (1981–2000) | PP cross-validation | Same GCM, perfectly | Error, Clim | β |
| | Historical (1981–2000) | Imperfect cross-validation | Same GCM, imperfectly | Error, Clim | β |
| | Historical (1981–2000) | | Different GCM, perfectly | Error, Clim | β |
| | Historical (1981–2000) | | Different GCM, imperfectly | Error, Clim | β |
| | 2041–2060, 2081–2100 | Extrapolation | Same GCM, perfectly | Change signal | β |
| | 2041–2060, 2081–2100 | Extrapolation | Same GCM, imperfectly | Change signal | β |
| | 2041–2060, 2081–2100 | Hard Transferability | Different GCM, perfectly | Change signal | β |
| | 2041–2060, 2081–2100 | Hard Transferability | Different GCM, imperfectly | Change signal | β |
```
NZ_Domain/
├── train/
│   ├── ESD_pseudo-reality/
│   │   ├── predictors/
│   │   │   ├── ACCESS-CM2_1961-1980.nc
│   │   │   └── static.nc
│   │   └── target/
│   │       └── pr_tasmax_ACCESS-CM2_1961-1980.nc
│   │
│   └── Emulator_hist_future/
│       ├── predictors/
│       │   ├── ACCESS-CM2_1961-1980_2080-2099.nc
│       │   └── static.nc
│       └── target/
│           └── pr_tasmax_ACCESS-CM2_1961-1980_2080-2099.nc
│
└── test/
    └── historical/
        ├── predictors/
        │   ├── perfect/
        │   │   ├── ACCESS-CM2_1981-2000.nc
        │   │   │
```
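As a hedged illustration of loading the training files in the layout above (assuming the xarray and netCDF libraries are installed; the variable names inside the files are not assumed, so the datasets are only printed for inspection):

```python
# Hedged sketch: open the NZ ESD pseudo-reality training files listed above.
import xarray as xr

base = "NZ_Domain/train/ESD_pseudo-reality"
predictors = xr.open_dataset(f"{base}/predictors/ACCESS-CM2_1961-1980.nc")
static = xr.open_dataset(f"{base}/predictors/static.nc")
target = xr.open_dataset(f"{base}/target/pr_tasmax_ACCESS-CM2_1961-1980.nc")

print(predictors)  # 15 predictor variables on the 16 x 16 (2°) grid
print(static)      # orography on the 128 x 128 (0.11°) grid
print(target)      # tasmax and pr on the 128 x 128 target grid
```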
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
This page contains the data for the paper "OST-Bench: Evaluating the Capabilities of MLLMs in Online Spatio-temporal Scene Understanding." Homepage | Paper | Code | arXiv
Introduction
Download OST-Bench for evaluation only: huggingface-cli download rbler/OST-Bench --include OST_bench.json,img.zip --repo-type dataset
Download OST-Bench for both training and evaluation: huggingface-cli download rbler/OST-Bench --repo-type dataset
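A hedged sketch of inspecting the downloaded evaluation file (OST_bench.json is named in the command above; its internal schema is not assumed, so only the top-level structure is previewed):

```python
# Hedged sketch: preview the top-level structure of OST_bench.json after
# downloading it with the command above.
import json

with open("OST_bench.json", encoding="utf-8") as f:
    data = json.load(f)

if isinstance(data, list):
    print(len(data), "entries; first entry keys:", list(data[0].keys()))
else:
    print("top-level keys:", list(data.keys()))
```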
Dataset Description
The… See the full description on the dataset page: https://huggingface.co/datasets/rbler/OST-Bench.
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
Dataset accompanying the paper "The CoT Collection: Improving Zero-shot and Few-shot Learning of Language Models via Chain-of-Thought Fine-Tuning", including 1.88M CoT rationales extracted across 1,060 tasks - https://arxiv.org/abs/2305.14045
From the release repo https://github.com/kaistAI/CoT-Collection: Large Language Models (LLMs) have shown enhanced capabilities of solving novel tasks by reasoning step-by-step, known as Chain-of-Thought (CoT) reasoning; how can we instill the same capability of reasoning step-by-step on unseen tasks into LMs that possess fewer than 100B parameters? To address this question, we first introduce the CoT Collection, a new instruction-tuning dataset that augments 1.88 million CoT rationales across 1,060 tasks. We show that continually fine-tuning Flan-T5 (3B & 11B) with the CoT Collection enables the 3B & 11B LMs to perform CoT better on unseen tasks, leading to an improvement in the average zero-shot accuracy on 27 datasets of the BIG-Bench-Hard benchmark by +4.34% and +2.44%, respectively. Furthermore, we show that instruction tuning with CoT allows LMs to possess stronger few-shot learning capabilities, resulting in an improvement of +2.97% and +2.37% on 4 domain-specific tasks over Flan-T5 (3B & 11B), respectively.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This repository contains the C2Rust-Bench dataset, a minimized set of representative C functions for C-to-Rust transpilation. The C2Rust-Bench dataset is proposed for evaluating C-to-Rust transpilers. This repo contains the following two files:
1. benchmark.json
This file contains a description of the C2Rust-Bench dataset, which includes:
2,905 C functions and detailed metadata about each of them.
2. Benchmark.tar
The Benchmark.tar file contains the C and Rust… See the full description on the dataset page: https://huggingface.co/datasets/anonymous4review/C2Rust-Bench.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset summary
This dataset contains the domain data from the tau2-bench repository for agentic evaluation. To download it in your code, run: hf download HuggingFaceH4/tau2-bench-data --repo-type dataset --local-dir data/tau2
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Github | Paper Link | Blog post
CuriaBench: benchmark for the paper "Curia: A Multi-Modal Foundation Model for Radiology"
Usage
This repo contains evaluation datasets from the Curia paper: https://arxiv.org/abs/2509.06830. You can run the Curia evaluation pipeline using our GitHub repo at https://github.com/raidium-med/curia. The Curia-B model is available on Hugging Face: https://huggingface.co/raidium/curia
Acknowledgments
The… See the full description on the dataset page: https://huggingface.co/datasets/raidium/CuriaBench.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code
Paper • Github Page • GitHub
ML-Bench is a novel dual-setup benchmark designed to evaluate Large Language Models (LLMs) and AI agents in generating repository-level code for machine learning tasks. The benchmark consists of 9,641 examples from 169 diverse tasks across 18 GitHub machine learning repositories. This dataset contains the following fields:… See the full description on the dataset page: https://huggingface.co/datasets/super-dainiu/ml-bench.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Data Instances
instance_id: (str) - A unique identifier for the instance
repo: (str) - The repository name including the owner
project_name: (str) - The name of the project without owner
lang: (str) - The programming language of the repository
work_dir: (str) - Working directory path
sanitizer: (str) - The type of sanitizer used for testing (e.g., Address, Memory, Undefined)
bug_description: (str) - Description of the vulnerability
base_commit: (str) - The base commit hash where the… See the full description on the dataset page: https://huggingface.co/datasets/SEC-bench/SEC-bench.
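A hedged sketch of loading SEC-bench with the Hugging Face `datasets` library; the dataset ID comes from the URL above, only fields named in the list are accessed, and if the dataset defines multiple configurations a config name may need to be passed to load_dataset.

```python
# Hedged sketch: load SEC-bench and look at two of the fields listed above.
from datasets import load_dataset

ds = load_dataset("SEC-bench/SEC-bench")
print(ds)  # available splits and their columns

first_split = next(iter(ds.values()))
print(first_split[0]["instance_id"], first_split[0]["sanitizer"])
```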