SWE-bench is one of the most widely cited benchmarks for evaluating AI coding capability. It tests whether an AI can resolve real GitHub issues: not just generate code snippets, but fix real-world bugs and implement features in existing repositories.
Each SWE-bench task gives the AI a GitHub repository and the text of an issue (a bug report or feature request) and asks it to produce a patch. The patch is applied and the repository's tests are run: the tests tied to the issue, which fail before the fix, must now pass, and the tests that already passed must still pass. This tests end-to-end coding capability: understanding the codebase, locating the relevant code, making the correct change, and verifying nothing breaks.
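The pass/fail decision for a single task can be sketched roughly as follows. This is a minimal illustration, not the official evaluation harness: the `is_resolved` and `resolved_rate` helpers and the test IDs are hypothetical. The `fail_to_pass` and `pass_to_pass` arguments mirror the `FAIL_TO_PASS` and `PASS_TO_PASS` fields in the SWE-bench dataset, which list the tests a fix must make pass and the tests that must keep passing.

```python
from typing import Dict, List


def is_resolved(
    results: Dict[str, bool],
    fail_to_pass: List[str],
    pass_to_pass: List[str],
) -> bool:
    """Return True if the patched repository counts as resolving the task.

    `results` maps a test ID to True (passed) or False (failed) after the
    model's patch has been applied. A test missing from `results` is
    treated as failed (an assumption of this sketch).
    """
    required = list(fail_to_pass) + list(pass_to_pass)
    return all(results.get(test_id, False) for test_id in required)


def resolved_rate(outcomes: List[bool]) -> float:
    """Percentage of tasks resolved, i.e. the headline benchmark score."""
    if not outcomes:
        return 0.0
    return 100.0 * sum(outcomes) / len(outcomes)


# Hypothetical test results for one task after applying a model's patch.
results = {"test_bug_fixed": True, "test_existing_behavior": True}
print(is_resolved(results, ["test_bug_fixed"], ["test_existing_behavior"]))

# A patch that fixes the bug but breaks an existing test is not a resolution.
regressed = {"test_bug_fixed": True, "test_existing_behavior": False}
print(is_resolved(regressed, ["test_bug_fixed"], ["test_existing_behavior"]))
```

The all-or-nothing check is the point of the sketch: a task counts only when every required test passes, so partial fixes contribute nothing to the score.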
SWE-bench Verified contains 500 human-validated instances and is the standard benchmark. SWE-bench Pro contains harder instances that require deeper reasoning. Claude Opus 4.7 scores 64.3% on Pro versus 87.6% on Verified, a gap of 23.3 percentage points, which shows that hard problems still challenge even the best models.
| Model | SWE-bench Verified | SWE-bench Pro |
|---|---|---|
| Opus Mythos | 93.9% | — |
| Opus 4.7 | 87.6% | 64.3% |
| GPT-5.5 | 82.6% | — |
| DeepSeek V4 Pro | 80.6% | 55.4% |
Note: SWE-bench Pro scores appear only for models that have been evaluated on it; a dash marks a model with no reported Pro score. Verified scores are available for all frontier models.
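The gap between the two variants is plain subtraction in percentage points. The short sketch below computes it for each model with both scores; the numbers are copied from the table, and the `verified_pro_gap` helper is a hypothetical name used only for illustration.

```python
from typing import Dict, Optional, Tuple

# (Verified %, Pro %) per model, copied from the table.
# None marks a model with no reported SWE-bench Pro score.
scores: Dict[str, Tuple[float, Optional[float]]] = {
    "Opus Mythos": (93.9, None),
    "Opus 4.7": (87.6, 64.3),
    "GPT-5.5": (82.6, None),
    "DeepSeek V4 Pro": (80.6, 55.4),
}


def verified_pro_gap(verified: float, pro: Optional[float]) -> Optional[float]:
    """Gap in percentage points, or None when no Pro score is available."""
    if pro is None:
        return None
    # Round to one decimal to match the precision of the reported scores.
    return round(verified - pro, 1)


gaps = {model: verified_pro_gap(v, p) for model, (v, p) in scores.items()}

for model, gap in gaps.items():
    if gap is None:
        print(f"{model}: no Pro score reported")
    else:
        print(f"{model}: {gap} points lower on Pro")
```

With these figures the gap is 23.3 points for Opus 4.7 and 25.2 points for DeepSeek V4 Pro.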
SWE-bench measures one thing: the ability to resolve GitHub issues. It doesn't measure code generation speed, IDE integration quality, multi-file refactoring ability, UI scaffolding, or developer experience. This is why AgentRanks combines SWE-bench scores with architecture analysis for a more complete picture.