Each stack's Combined Score is the LLM's SWE-bench score (as a percentage) multiplied by the Agent Architecture Score (mapped to a 0-70 scale), then divided by 100. Scoring the agent and the LLM separately lets users see the marginal value of upgrading either one.
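As a rough illustration of the formula above, the sketch below computes a Combined Score in Python. It assumes the 0-100 architecture score is mapped to 0-70 by simple linear scaling, which the text does not specify. The function names and sample inputs are hypothetical and are not taken from any leaderboard.

```python
def map_agent_score(agent_score: float, cap: float = 70.0) -> float:
    """Rescale a 0-100 architecture score to 0-cap (linear mapping is an assumption)."""
    if not 0.0 <= agent_score <= 100.0:
        raise ValueError("agent_score must be between 0 and 100")
    return agent_score * cap / 100.0


def combined_score(swe_bench_pct: float, agent_score: float) -> float:
    """Combined Score = SWE-bench % x mapped architecture score / 100."""
    if not 0.0 <= swe_bench_pct <= 100.0:
        raise ValueError("swe_bench_pct must be between 0 and 100")
    return swe_bench_pct * map_agent_score(agent_score) / 100.0


# Hypothetical inputs, for illustration only: an 80% SWE-bench model
# paired with an agent scoring 50 of 100 architecture points.
print(combined_score(80.0, 50.0))
```

Under this assumption the Combined Score tops out at 70 (a 100% SWE-bench result paired with a 100-point agent), and because the two factors are multiplied, improving either the model or the agent raises the result proportionally.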
We use SWE-bench Verified (pass@1 on 500 human-validated instances) as the primary LLM coding benchmark. Terminal-Bench 2.0 scores are also tracked. Scores are sourced from official leaderboards and verified against multiple third-party sources.
Each agent is evaluated on 7 architectural dimensions worth 100 points in total: Orchestration, Memory, Tools, Cache, Safety, Reliability, and Community. Open-source agents (Aider, Cline, Continue, OpenClaw, Goose, Hermes) are scored from direct source code analysis. For proprietary agents without public source code, scores are estimated from published documentation and community analysis.
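The seven-dimension rubric could be tallied as in the following sketch. The dimension names come from the text. The function name, the validation rules, and the sample points are hypothetical. The text does not give per-dimension maximums, so only the 100-point total is enforced here.

```python
from typing import Mapping

# Dimension names as listed in the methodology.
DIMENSIONS = (
    "Orchestration", "Memory", "Tools", "Cache",
    "Safety", "Reliability", "Community",
)


def architecture_score(points: Mapping[str, float]) -> float:
    """Sum per-dimension points into a 0-100 Agent Architecture Score."""
    if set(points) != set(DIMENSIONS):
        raise ValueError("points must cover exactly the seven dimensions")
    if any(value < 0 for value in points.values()):
        raise ValueError("dimension points cannot be negative")
    total = sum(points.values())
    if total > 100:
        raise ValueError("total cannot exceed 100 points")
    return total


# Hypothetical points for an imaginary agent, for illustration only.
example = {
    "Orchestration": 15, "Memory": 10, "Tools": 15, "Cache": 5,
    "Safety": 10, "Reliability": 10, "Community": 5,
}
print(architecture_score(example))
```

Requiring every dimension to be present keeps a partially scored agent from being compared against fully scored ones.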
We don't use benchmarks that are easily gamed or have known contamination issues. We don't use subjective user reviews. We don't factor in brand recognition or marketing spend.