Dataset Summary
SWE-bench is a dataset that tests systems’ ability to solve GitHub issues automatically. The dataset collects 2,294 Issue-Pull Request pairs from 12 popular Python repositories. Evaluation is performed by unit test verification using post-PR behavior as the reference solution. The dataset was released as part of SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
Want to run inference now?
This dataset only contains the problem_statement… See the full description on the dataset page: https://huggingface.co/datasets/princeton-nlp/SWE-bench.
Dataset Summary
SWE-bench Verified is a subset of 500 samples from the SWE-bench test set, which have been human-validated for quality. SWE-bench is a dataset that tests systems’ ability to solve GitHub issues automatically. See this post for more details on the human-validation process. The dataset collects 500 test Issue-Pull Request pairs from popular Python repositories. Evaluation is performed by unit test verification using post-PR behavior as the reference solution. The original… See the full description on the dataset page: https://huggingface.co/datasets/SWE-bench/SWE-bench_Verified.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
See details from OpenAI: https://openai.com/index/introducing-swe-bench-verified/
Converted from Parquet to CSV; source: https://huggingface.co/datasets/princeton-nlp/SWE-bench_Verified
Dataset Summary from Hugging Face:
SWE-bench Verified is a subset of 500 samples from the SWE-bench test set, which have been human-validated for quality. SWE-bench is a dataset that tests systems’ ability to solve GitHub issues automatically. See this post for more details on the human-validation process.
The dataset collects 500 test Issue-Pull Request pairs from popular Python repositories. Evaluation is performed by unit test verification using post-PR behavior as the reference solution.
The original SWE-bench dataset was released as part of SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
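To make the unit-test verification criterion concrete, here is a minimal Python sketch of the resolution check, with hypothetical helper names (the official harness is the swebench package): each instance lists FAIL_TO_PASS tests, which fail before the fix and must pass after it, and PASS_TO_PASS tests, which must not regress.

import subprocess

def run_tests(test_ids):
    # Run each test with pytest inside the prepared repo checkout
    # (after applying the model's patch and the instance's test patch).
    return {t: subprocess.run(["python", "-m", "pytest", t],
                              capture_output=True).returncode == 0
            for t in test_ids}

def is_resolved(fail_to_pass, pass_to_pass):
    # Resolved only if every previously failing test now passes
    # and no previously passing test regressed.
    return (all(run_tests(fail_to_pass).values())
            and all(run_tests(pass_to_pass).values()))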
Want to run inference now? This dataset only contains the problem_statement (i.e. issue text) and the base_commit which represents the state of the codebase before the issue has been resolved. If you want to run inference using the "Oracle" or BM25 retrieval settings mentioned in the paper, consider the following datasets.
princeton-nlp/SWE-bench_Lite_oracle
princeton-nlp/SWE-bench_Lite_bm25_13K
princeton-nlp/SWE-bench_Lite_bm25_27K
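Any of these variants can be loaded with the Hugging Face datasets library; a minimal sketch (assuming the standard test split):

from datasets import load_dataset

# Oracle-retrieval variant of SWE-bench Lite; each row is one task instance.
ds = load_dataset("princeton-nlp/SWE-bench_Lite_oracle", split="test")
print(len(ds), ds[0]["instance_id"])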
Supported Tasks and Leaderboards
SWE-bench proposes a new task: issue resolution, given a full repository and a GitHub issue. The leaderboard can be found at www.swebench.com.
Languages
The text of the dataset is primarily English, but we make no effort to filter or otherwise clean based on language type.
Dataset Structure
An example of a SWE-bench datum is as follows:
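The example itself did not survive this extract; as a sketch, the fields of an instance look roughly like the following Python dict (all values are hypothetical placeholders; field names follow the published schema):

{
  "repo": "owner/project",                # hypothetical source repository
  "instance_id": "owner__project-1234",   # unique task identifier
  "base_commit": "<SHA of the repo state before the fix>",
  "problem_statement": "<the GitHub issue text>",
  "hints_text": "<comments from the issue thread>",
  "created_at": "2023-01-01T00:00:00Z",
  "patch": "<gold diff from the reference pull request>",
  "test_patch": "<diff that adds the pull request's tests>",
  "version": "<repo version used to build the install environment>",
  "FAIL_TO_PASS": "<tests that fail before the fix and must pass after>",
  "PASS_TO_PASS": "<tests that must keep passing>",
  "environment_setup_commit": "<commit used to set up the evaluation environment>",
}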
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
SWE-bench/SWE-bench_Multilingual dataset hosted on Hugging Face and contributed by the HF Datasets community
Dataset Summary
SWE-bench Lite is a subset of SWE-bench, a dataset that tests systems’ ability to solve GitHub issues automatically. The dataset collects 300 test Issue-Pull Request pairs from 11 popular Python repositories. Evaluation is performed by unit test verification using post-PR behavior as the reference solution. The dataset was released as part of SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
Want to run inference now?
This dataset only contains the… See the full description on the dataset page: https://huggingface.co/datasets/SWE-bench/SWE-bench_Lite.
Other license: https://choosealicense.com/licenses/other/
👋 Overview
This repository contains the Multi-SWE-bench dataset, introduced in Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving, to address the lack of multilingual benchmarks for evaluating LLMs in real-world code issue resolution. Unlike existing Python-centric benchmarks (e.g., SWE-bench), this framework spans 7 languages (Java, TypeScript, JavaScript, Go, Rust, C, and C++) with 1,632 high-quality instances, curated from 2,456 candidates by 68 expert annotators… See the full description on the dataset page: https://huggingface.co/datasets/ByteDance-Seed/Multi-SWE-bench.
SWE-bench Multimodal
SWE-bench Multimodal is a dataset of 617 task instances that evaluates Language Models and AI Systems on their ability to resolve real-world GitHub issues. To learn more about the dataset, please visit our website. You can find the leaderboard at SWE-bench's home page.
Dataset Summary
SWE-Bench Pro is a challenging, enterprise-level dataset for testing agent ability on long-horizon software engineering tasks.
Paper: https://static.scale.com/uploads/654197dc94d34f66c0f5184e/SWEAP_Eval_Scale%20(9).pdf
See the related evaluation repository on GitHub: https://github.com/scaleapi/SWE-bench_Pro-os
Dataset Structure
We follow SWE-Bench Verified (https://huggingface.co/datasets/SWE-bench/SWE-bench_Verified) in terms of dataset structure, with several… See the full description on the dataset page: https://huggingface.co/datasets/ScaleAI/SWE-bench_Pro.
R2E-Gym/SWE-Bench-Verified dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Note: This dataset has an improved and significantly larger successor: SWE-rebench.
Dataset Summary
SWE-bench Extra is a dataset that can be used to train or evaluate agentic systems specializing in resolving GitHub issues. It is based on the methodology used to build the SWE-bench benchmark and includes 6,415 Issue-Pull Request pairs sourced from 1,988 Python repositories.
Dataset Description
The SWE-bench Extra dataset supports the development of software engineering agents… See the full description on the dataset page: https://huggingface.co/datasets/nebius/SWE-bench-extra.
SWE-bench/SWE-bench_Not_Verified dataset hosted on Hugging Face and contributed by the HF Datasets community
yilche/SWE-bench-verified-scikit-learn dataset hosted on Hugging Face and contributed by the HF Datasets community
Dataset Summary
SWE-bench is a dataset that tests systems’ ability to solve GitHub issues automatically. The dataset collects 2,294 Issue-Pull Request pairs from 12 popular Python repositories. Evaluation is performed by unit test verification using post-PR behavior as the reference solution.
Supported Tasks and Leaderboards
SWE-bench proposes a new task: issue resolution, given a full repository and a GitHub issue. The leaderboard can be found at www.swebench.com.
Languages… See the full description on the dataset page: https://huggingface.co/datasets/princeton-nlp/SWE-bench_oracle_cl100k.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
OpenHands/Devin-SWE-bench-output dataset hosted on Hugging Face and contributed by the HF Datasets community
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
A brand-new, continuously updated SWE-bench-like dataset powered by an automated curation pipeline.
For the official data release page, please see microsoft/SWE-bench-Live.
Dataset Summary
SWE-bench-Live is a live benchmark for issue resolving, designed to evaluate an AI system’s ability to complete real-world software engineering tasks. Thanks to our automated dataset curation pipeline, we plan to update SWE-bench-Live on a monthly basis to provide the… See the full description on the dataset page: https://huggingface.co/datasets/SWE-bench-Live/SWE-bench-Live.
nejumi/swe-bench-verified-ja dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
SWE-Bench Dataset - 8,712 files
The dataset comprises 8,712 files across 6 programming languages, featuring verified tasks and benchmarks for evaluating coding agents and language models. It supports coding agents, language models, and developer tools with verified benchmark scores and multi-language test sets.
Dataset characteristics:
Description: An extended benchmark of real-world software engineering tasks with enhanced… See the full description on the dataset page: https://huggingface.co/datasets/ud-nlp/swe-bench-coding-tasks.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
SWE-smith Trajectories
This dataset contains the 5,017 trajectories we fine-tuned Qwen 2.5 Coder Instruct on, leading to SWE-agent-LM-32B, a coding LM agent that achieves 40.2% on SWE-bench Verified (no verifiers or multiple rollouts, just one attempt per instance). Trajectories were generated by running SWE-agent + Claude 3.7 Sonnet on task instances from the SWE-smith dataset.
ibragim-bad/swe-bench-verified-50 dataset hosted on Hugging Face and contributed by the HF Datasets community
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
klieret/swe-bench-dummy-test-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community