59 datasets found
  1. SWE-bench

    • huggingface.co
    • opendatalab.com
    Cite
    Princeton NLP group, SWE-bench [Dataset]. https://huggingface.co/datasets/princeton-nlp/SWE-bench
    Authors
    Princeton NLP group
    Description

    Dataset Summary

    SWE-bench is a dataset that tests systems’ ability to solve GitHub issues automatically. The dataset collects 2,294 Issue-Pull Request pairs from 12 popular Python repositories. Evaluation is performed by unit test verification using post-PR behavior as the reference solution. The dataset was released as part of SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

      Want to run inference now?
    

    This dataset only contains the problem_statement… See the full description on the dataset page: https://huggingface.co/datasets/princeton-nlp/SWE-bench.

  2. SWE-bench_Verified

    • huggingface.co
    Updated Apr 29, 2025
    Cite
    SWE-bench (2025). SWE-bench_Verified [Dataset]. https://huggingface.co/datasets/SWE-bench/SWE-bench_Verified
    Dataset authored and provided by
    SWE-bench
    Description

    Dataset Summary

    SWE-bench Verified is a subset of 500 samples from the SWE-bench test set, which have been human-validated for quality. SWE-bench is a dataset that tests systems’ ability to solve GitHub issues automatically. See this post for more details on the human-validation process. The dataset collects 500 test Issue-Pull Request pairs from popular Python repositories. Evaluation is performed by unit test verification using post-PR behavior as the reference solution. The original… See the full description on the dataset page: https://huggingface.co/datasets/SWE-bench/SWE-bench_Verified.

  3. SWE Bench Verified

    • kaggle.com
    • huggingface.co
    Updated Aug 20, 2024
    Cite
    Harry Wang (2024). SWE Bench Verified [Dataset]. https://www.kaggle.com/datasets/harrywang/swe-bench-verified
    Dataset provided by
    Kaggle
    Authors
    Harry Wang
    License

    MIT License (https://opensource.org/licenses/MIT)

    Description

    See details from OpenAI: https://openai.com/index/introducing-swe-bench-verified/

    Converted from Parquet to CSV from https://huggingface.co/datasets/princeton-nlp/SWE-bench_Verified

    Data Summary from Hugging Face:

    SWE-bench Verified is a subset of 500 samples from the SWE-bench test set, which have been human-validated for quality. SWE-bench is a dataset that tests systems’ ability to solve GitHub issues automatically. See this post for more details on the human-validation process.

    The dataset collects 500 test Issue-Pull Request pairs from popular Python repositories. Evaluation is performed by unit test verification using post-PR behavior as the reference solution.

    The original SWE-bench dataset was released as part of SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

    Want to run inference now? This dataset only contains the problem_statement (i.e. issue text) and the base_commit which represents the state of the codebase before the issue has been resolved. If you want to run inference using the "Oracle" or BM25 retrieval settings mentioned in the paper, consider the following datasets (a short loading sketch follows the list).

    princeton-nlp/SWE-bench_Lite_oracle

    princeton-nlp/SWE-bench_Lite_bm25_13K

    princeton-nlp/SWE-bench_Lite_bm25_27K
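
    If you do want one of these retrieval variants, a minimal, hedged sketch using the Hugging Face datasets library is shown below; the split names and columns differ between variants, so inspect them rather than assuming a schema:

    # Hedged sketch: load one of the retrieval-augmented variants listed above
    # and inspect its splits and columns before relying on any particular field.
    from datasets import load_dataset

    oracle = load_dataset("princeton-nlp/SWE-bench_Lite_oracle")
    for split_name, split in oracle.items():
        print(split_name, len(split), split.column_names)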

    Supported Tasks and Leaderboards

    SWE-bench proposes a new task: issue resolution, given a full repository and a GitHub issue. The leaderboard can be found at www.swebench.com.

    Languages

    The text of the dataset is primarily English, but we make no effort to filter or otherwise clean based on language type.

    Dataset Structure

    An example of a SWE-bench datum is as follows (a short, hedged Python sketch follows the list):

    • instance_id: (str) - A formatted instance identifier, usually as repo_owner_repo_name-PR-number.
    • patch: (str) - The gold patch, the patch generated by the PR (minus test-related code), that resolved the issue.
    • repo: (str) - The repository owner/name identifier from GitHub.
    • base_commit: (str) - The commit hash of the repository representing the HEAD of the repository before the solution PR is applied.
    • hints_text: (str) - Comments made on the issue prior to the creation date of the solution PR’s first commit.
    • created_at: (str) - The creation date of the pull request.
    • test_patch: (str) - A test-file patch that was contributed by the solution PR.
    • problem_statement: (str) - The issue title and body.
    • version: (str) - The installation version to use for running evaluation.
    • environment_setup_commit: (str) - The commit hash to use for environment setup and installation.
    • FAIL_TO_PASS: (str) - A JSON list of strings representing the set of tests resolved by the PR and tied to the issue resolution.
    • PASS_TO_PASS: (str) - A JSON list of strings representing tests that should pass before and after the PR application.
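
    To make the schema above concrete, here is a hedged, illustrative sketch that loads one instance and mimics the unit-test verification idea described in the summary (apply the test patch and a candidate patch at base_commit, then require the FAIL_TO_PASS tests to pass and the PASS_TO_PASS tests to stay passing). The repository path, pytest invocation, and helper function are placeholders, not the official SWE-bench harness, which also handles per-repository installation via version and environment_setup_commit:

    # Hedged, illustrative sketch only; not the official evaluation harness.
    import json
    import subprocess

    from datasets import load_dataset

    ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
    instance = ds[0]
    print(instance["instance_id"], instance["repo"], instance["base_commit"])

    def resolved(instance: dict, candidate_patch: str, repo_dir: str) -> bool:
        """Simplified, pytest-only check: the candidate patch must make the
        FAIL_TO_PASS tests pass while keeping the PASS_TO_PASS tests passing."""
        # Start from the pre-PR state of the repository.
        subprocess.run(["git", "checkout", instance["base_commit"]], cwd=repo_dir, check=True)
        # Apply the reference test patch, then the model-generated patch.
        subprocess.run(["git", "apply"], input=instance["test_patch"], text=True, cwd=repo_dir, check=True)
        subprocess.run(["git", "apply"], input=candidate_patch, text=True, cwd=repo_dir, check=True)

        fail_to_pass = json.loads(instance["FAIL_TO_PASS"])  # must flip to passing
        pass_to_pass = json.loads(instance["PASS_TO_PASS"])  # must stay passing
        result = subprocess.run(["python", "-m", "pytest", *fail_to_pass, *pass_to_pass], cwd=repo_dir)
        return result.returncode == 0
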
  4. SWE-bench_Multilingual

    • huggingface.co
    Updated Apr 29, 2025
    Cite
    SWE-bench (2025). SWE-bench_Multilingual [Dataset]. https://huggingface.co/datasets/SWE-bench/SWE-bench_Multilingual
    Dataset authored and provided by
    SWE-bench
    License

    MIT License (https://opensource.org/licenses/MIT)

    Description

    SWE-bench/SWE-bench_Multilingual dataset hosted on Hugging Face and contributed by the HF Datasets community

  5. SWE-bench_Lite

    • huggingface.co
    Updated Apr 29, 2025
    Cite
    SWE-bench (2025). SWE-bench_Lite [Dataset]. https://huggingface.co/datasets/SWE-bench/SWE-bench_Lite
    Dataset authored and provided by
    SWE-bench
    Description

    Dataset Summary

    SWE-bench Lite is a subset of SWE-bench, a dataset that tests systems’ ability to solve GitHub issues automatically. The dataset collects 300 test Issue-Pull Request pairs from 11 popular Python repositories. Evaluation is performed by unit test verification using post-PR behavior as the reference solution. The dataset was released as part of SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

      Want to run inference now?
    

    This dataset only contains the… See the full description on the dataset page: https://huggingface.co/datasets/SWE-bench/SWE-bench_Lite.

  6. Multi-SWE-bench

    • huggingface.co
    Updated Jul 16, 2025
    Cite
    ByteDance Seed (2025). Multi-SWE-bench [Dataset]. https://huggingface.co/datasets/ByteDance-Seed/Multi-SWE-bench
    Dataset provided by
    ByteDance (https://www.bytedance.com/)
    Authors
    ByteDance Seed
    License

    Other (https://choosealicense.com/licenses/other/)

    Description

    👋 Overview

    This repository contains the Multi-SWE-bench dataset, introduced in Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving, to address the lack of multilingual benchmarks for evaluating LLMs in real-world code issue resolution. Unlike existing Python-centric benchmarks (e.g., SWE-bench), this framework spans 7 languages (Java, TypeScript, JavaScript, Go, Rust, C, and C++) with 1,632 high-quality instances, curated from 2,456 candidates by 68 expert annotators… See the full description on the dataset page: https://huggingface.co/datasets/ByteDance-Seed/Multi-SWE-bench.

  7. SWE-bench_Multimodal

    • huggingface.co
    Updated Apr 29, 2025
    Cite
    SWE-bench (2025). SWE-bench_Multimodal [Dataset]. https://huggingface.co/datasets/SWE-bench/SWE-bench_Multimodal
    Dataset authored and provided by
    SWE-bench
    Description

    SWE-bench Multimodal

    SWE-bench Multimodal is a dataset of 617 task instances that evaluates language models and AI systems on their ability to resolve real-world GitHub issues. To learn more about the dataset, please visit our website. You can find the leaderboard at SWE-bench's home page.

  8. SWE-bench_Pro

    • huggingface.co
    Updated Sep 21, 2025
    Cite
    Scale AI (2025). SWE-bench_Pro [Dataset]. https://huggingface.co/datasets/ScaleAI/SWE-bench_Pro
    Dataset authored and provided by
    Scale AI (https://scale.com/)
    Description

    Dataset Summary

    SWE-Bench Pro is a challenging, enterprise-level dataset for testing agents' ability on long-horizon software engineering tasks.
    Paper: https://static.scale.com/uploads/654197dc94d34f66c0f5184e/SWEAP_Eval_Scale%20(9).pdf
    See the related evaluation GitHub repository: https://github.com/scaleapi/SWE-bench_Pro-os

      Dataset Structure
    

    We follow SWE-Bench Verified (https://huggingface.co/datasets/SWE-bench/SWE-bench_Verified) in terms of dataset structure, with several… See the full description on the dataset page: https://huggingface.co/datasets/ScaleAI/SWE-bench_Pro.

  9. SWE-Bench-Verified

    • huggingface.co
    Cite
    R2E-Gym, SWE-Bench-Verified [Dataset]. https://huggingface.co/datasets/R2E-Gym/SWE-Bench-Verified
    Dataset authored and provided by
    R2E-Gym
    Description

    R2E-Gym/SWE-Bench-Verified dataset hosted on Hugging Face and contributed by the HF Datasets community

  10. SWE-bench-extra

    • huggingface.co
    Cite
    Nebius, SWE-bench-extra [Dataset]. https://huggingface.co/datasets/nebius/SWE-bench-extra
    Dataset provided by
    Nebius
    Nebius Group (https://nebius.com/)
    Authors
    Nebius
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)

    Description

    Note: This dataset has an improved and significantly larger successor: SWE-rebench.

      Dataset Summary
    

    SWE-bench Extra is a dataset that can be used to train or evaluate agentic systems specializing in resolving GitHub issues. It is based on the methodology used to build the SWE-bench benchmark and includes 6,415 Issue-Pull Request pairs sourced from 1,988 Python repositories.

      Dataset Description
    

    The SWE-bench Extra dataset supports the development of software engineering agents… See the full description on the dataset page: https://huggingface.co/datasets/nebius/SWE-bench-extra.

  11. SWE-bench_Not_Verified

    • huggingface.co
    Cite
    SWE-bench, SWE-bench_Not_Verified [Dataset]. https://huggingface.co/datasets/SWE-bench/SWE-bench_Not_Verified
    Dataset authored and provided by
    SWE-bench
    Description

    SWE-bench/SWE-bench_Not_Verified dataset hosted on Hugging Face and contributed by the HF Datasets community

  12. SWE-bench-verified-scikit-learn

    • huggingface.co
    Cite
    Yi-Ling Chen, SWE-bench-verified-scikit-learn [Dataset]. https://huggingface.co/datasets/yilche/SWE-bench-verified-scikit-learn
    Authors
    Yi-Ling Chen
    Description

    yilche/SWE-bench-verified-scikit-learn dataset hosted on Hugging Face and contributed by the HF Datasets community

  13. SWE-bench_oracle_cl100k

    • huggingface.co
    Cite
    Princeton NLP group, SWE-bench_oracle_cl100k [Dataset]. https://huggingface.co/datasets/princeton-nlp/SWE-bench_oracle_cl100k
    Authors
    Princeton NLP group
    Description

    Dataset Summary

    SWE-bench is a dataset that tests systems’ ability to solve GitHub issues automatically. The dataset collects 2,294 Issue-Pull Request pairs from 12 popular Python repositories. Evaluation is performed by unit test verification using post-PR behavior as the reference solution.

      Supported Tasks and Leaderboards
    

    SWE-bench proposes a new task: issue resolution, given a full repository and a GitHub issue. The leaderboard can be found at www.swebench.com.

      Languages… See the full description on the dataset page: https://huggingface.co/datasets/princeton-nlp/SWE-bench_oracle_cl100k.
    
  14. Devin-SWE-bench-output

    • huggingface.co
    Updated May 6, 2024
    Cite
    OpenHands (2024). Devin-SWE-bench-output [Dataset]. https://huggingface.co/datasets/OpenHands/Devin-SWE-bench-output
    Dataset authored and provided by
    OpenHands
    License

    MIT License (https://opensource.org/licenses/MIT)

    Description

    OpenHands/Devin-SWE-bench-output dataset hosted on Hugging Face and contributed by the HF Datasets community

  15. SWE-bench-Live

    • huggingface.co
    Updated Aug 30, 2025
    Cite
    SWE-bench-Live (2025). SWE-bench-Live [Dataset]. https://huggingface.co/datasets/SWE-bench-Live/SWE-bench-Live
    Dataset authored and provided by
    SWE-bench-Live
    License

    MIT License (https://opensource.org/licenses/MIT)

    Description

    A brand-new, continuously updated SWE-bench-like dataset powered by an automated curation pipeline.

    For the official data release page, please see microsoft/SWE-bench-Live.

      Dataset Summary
    

    SWE-bench-Live is a live benchmark for issue resolving, designed to evaluate an AI system’s ability to complete real-world software engineering tasks. Thanks to our automated dataset curation pipeline, we plan to update SWE-bench-Live on a monthly basis to provide the… See the full description on the dataset page: https://huggingface.co/datasets/SWE-bench-Live/SWE-bench-Live.

  16. swe-bench-verified-ja

    • huggingface.co
    Cite
    YuyaYamamoto, swe-bench-verified-ja [Dataset]. https://huggingface.co/datasets/nejumi/swe-bench-verified-ja
    Authors
    YuyaYamamoto
    Description

    nejumi/swe-bench-verified-ja dataset hosted on Hugging Face and contributed by the HF Datasets community

  17. swe-bench-coding-tasks

    • huggingface.co
    Updated Oct 4, 2025
    Cite
    Unidata NLP (2025). swe-bench-coding-tasks [Dataset]. https://huggingface.co/datasets/ud-nlp/swe-bench-coding-tasks
    Authors
    Unidata NLP
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0) (https://creativecommons.org/licenses/by-nc-nd/4.0/)

    Description

    SWE-Bench Dataset - 8,712 files

    The dataset comprises 8,712 files across 6 programming languages, featuring verified tasks and benchmarks for evaluating coding agents and language models. It supports coding agents, language models, and developer tools with verified benchmark scores and multi-language test sets.

      Dataset characteristics:
    

    Description: An extended benchmark of real-world software engineering tasks with enhanced… See the full description on the dataset page: https://huggingface.co/datasets/ud-nlp/swe-bench-coding-tasks.

  18. SWE-smith-trajectories

    • huggingface.co
    Updated May 19, 2025
    Cite
    SWE-bench (2025). SWE-smith-trajectories [Dataset]. https://huggingface.co/datasets/SWE-bench/SWE-smith-trajectories
    Dataset authored and provided by
    SWE-bench
    License

    MIT License (https://opensource.org/licenses/MIT)

    Description

    SWE-smith Trajectories

    Code • Paper • Site

    This dataset contains the 5017 trajectories we fine-tuned Qwen 2.5 Coder Instruct on, leading to SWE-agent-LM-32B, a coding LM agent that achieves 40.2% on SWE-bench Verified (no verifiers or multiple rollouts, just 1 attempt per instance). Trajectories were generated by running SWE-agent + Claude 3.7 Sonnet on task instances from the SWE-smith dataset.

  19. swe-bench-verified-50

    • huggingface.co
    Cite
    Ibragim, swe-bench-verified-50 [Dataset]. https://huggingface.co/datasets/ibragim-bad/swe-bench-verified-50
    Authors
    Ibragim
    Description

    ibragim-bad/swe-bench-verified-50 dataset hosted on Hugging Face and contributed by the HF Datasets community

  20. swe-bench-dummy-test-dataset

    • huggingface.co
    Updated Jul 1, 2025
    Cite
    Kilian Lieret (2025). swe-bench-dummy-test-dataset [Dataset]. https://huggingface.co/datasets/klieret/swe-bench-dummy-test-dataset
    Authors
    Kilian Lieret
    License

    MIT License (https://opensource.org/licenses/MIT)

    Description

    klieret/swe-bench-dummy-test-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community
