How AgentRanks Scores Work

Primary Score: Real Code Work

AgentRanks now treats Real Code Work as the primary coding score. Its first input is the strictest real-engineering benchmark available, currently DeepSWE, because it measures original, long-horizon software engineering tasks across real repositories and separates models more sharply than easier public leaderboards.

A 70% score on DeepSWE is strong. It means the model is doing useful real work, not just passing toy prompts. It also means the field is still far from solved: many models that look impressive in marketing or easier benchmarks fall into the 20-50% range under stricter real-code pressure.

Score Stack

Real Code Work is the anchor, not the whole story. AgentRanks ranks coding models and vibe-coding stacks with this order of evidence:

1. Real Code Work: DeepSWE or the strictest long-horizon benchmark available. SWE-bench Verified and Terminal-Bench are secondary evidence when DeepSWE is missing.

2. Vibe Coding Usefulness: Can a non-coder or solo builder get a visible, recoverable result without fighting the tool for hours?

3. Failure Risk: Memory loss, fake confidence, hallucination, code chaos, data bonfire, retry loops, and tool failures reduce trust even when benchmark scores look good.

4. Cost to Finish: Token price is not enough. We care about average cost per accepted task, retries, wall time, and wasted loops.

5. Open / Closed Trust: Open-source code, open weights, reproducible setup, and clear environment requirements improve trust. Closed systems need stronger external evidence.
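To make the Cost to Finish idea concrete, here is a minimal sketch of "cost per accepted task". It assumes a hypothetical Attempt record (task_id, cost_usd, accepted) and charges every attempt, including failed retries, against the tasks that were eventually accepted. The field names and dollar figures are illustrative, not AgentRanks data or the site's actual calculation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Attempt:
    """One run of an agent on a task (hypothetical record shape)."""
    task_id: str
    cost_usd: float
    accepted: bool


def cost_per_accepted_task(attempts: List[Attempt]) -> Optional[float]:
    """Total spend across ALL attempts divided by the number of distinct accepted tasks.

    Failed retries and abandoned tasks still count toward the spend,
    which is why token price alone understates the real cost.
    Returns None when nothing was accepted.
    """
    total_cost = sum(a.cost_usd for a in attempts)
    accepted_tasks = {a.task_id for a in attempts if a.accepted}
    if not accepted_tasks:
        return None
    return total_cost / len(accepted_tasks)


# Illustrative numbers only: one task accepted on the second try, one never accepted.
attempts = [
    Attempt("fix-login", 0.50, False),
    Attempt("fix-login", 0.50, True),
    Attempt("add-export", 1.00, False),
]
print(cost_per_accepted_task(attempts))  # 2.0
```

In this example the accepted run itself cost 0.50, but the wasted retry and the failed task push the effective cost per accepted task to 2.00.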

Legacy ARscore

Older stack rows used a proxy formula: legacy_ARscore = (model benchmark score × agent architecture score) / 100. It remains useful for discovery, but it is no longer the final truth. The new editorial hierarchy puts strict real-code evidence first and labels proxy or inferred values as such.
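The legacy formula can be sketched in a few lines. This assumes both inputs are on a 0-100 scale (the benchmark score as a percentage, the architecture score in points); the function name and the sample values are illustrative.

```python
def legacy_arscore(model_benchmark_score: float, architecture_score: float) -> float:
    """Proxy score: model benchmark score x agent architecture score / 100.

    Both inputs are assumed to be on a 0-100 scale, so the result is too.
    """
    for name, value in (
        ("model_benchmark_score", model_benchmark_score),
        ("architecture_score", architecture_score),
    ):
        if not 0 <= value <= 100:
            raise ValueError(f"{name} must be between 0 and 100, got {value}")
    return model_benchmark_score * architecture_score / 100


# Illustrative values: a model scoring 70 inside an agent scoring 80 points.
print(legacy_arscore(70, 80))  # 56.0
```

Because the two scores multiply, a weak value on either side drags the proxy down, which is part of why it works for discovery but not as final evidence.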

Agent Architecture Score

Each agent is still evaluated on seven architectural dimensions (max 100 points): Multi-Agent Orchestration (20), Memory & Context (15), Tool System (20), Prompt Cache & Cost (10), Safety & Permissions (15), Reliability & Recovery (10), Community & Ecosystem (10).

Architecture score explains why the same model may feel different inside different products. It does not override poor Real Code Work results.
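As a rough sketch, the architecture score is a sum of per-dimension points, each capped at the maximum listed above. The dimension names and caps come from this guide; the validation rules and the sample agent's points are assumptions made for illustration.

```python
# Maximum points per dimension, as listed in this guide.
DIMENSION_MAX = {
    "Multi-Agent Orchestration": 20,
    "Memory & Context": 15,
    "Tool System": 20,
    "Prompt Cache & Cost": 10,
    "Safety & Permissions": 15,
    "Reliability & Recovery": 10,
    "Community & Ecosystem": 10,
}
assert sum(DIMENSION_MAX.values()) == 100  # the caps add up to the 100-point maximum


def architecture_score(points: dict) -> int:
    """Sum per-dimension points after checking names and caps."""
    missing = set(DIMENSION_MAX) - set(points)
    unknown = set(points) - set(DIMENSION_MAX)
    if missing or unknown:
        raise ValueError(f"missing={sorted(missing)}, unknown={sorted(unknown)}")
    for name, value in points.items():
        if not 0 <= value <= DIMENSION_MAX[name]:
            raise ValueError(f"{name}: {value} is outside 0..{DIMENSION_MAX[name]}")
    return sum(points.values())


# Hypothetical agent, not a real AgentRanks row.
example_agent = {
    "Multi-Agent Orchestration": 15,
    "Memory & Context": 10,
    "Tool System": 18,
    "Prompt Cache & Cost": 5,
    "Safety & Permissions": 12,
    "Reliability & Recovery": 7,
    "Community & Ecosystem": 8,
}
print(architecture_score(example_agent))  # 75
```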

Benchmark Sources

Primary source: DeepSWE. Secondary sources: SWE-bench and Vals SWE-bench Verified.
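One way to picture the primary/secondary split is a simple fallback lookup that also records which tier the evidence came from. The ordering among the secondary sources, the function name, and the sample scores are assumptions for illustration; the guide only states that DeepSWE comes first.

```python
from typing import Dict, Optional, Tuple

# DeepSWE first; the relative order of the secondary sources is an assumption.
SOURCE_PRIORITY = ["DeepSWE", "SWE-bench", "Vals SWE-bench Verified"]


def pick_evidence(scores: Dict[str, Optional[float]]) -> Optional[Tuple[str, float, str]]:
    """Return (source, score, tier) for the strictest source that has a score.

    tier is "primary" for DeepSWE and "secondary" otherwise, so a fallback
    value stays labeled as such. Returns None when no source has a score.
    """
    for source in SOURCE_PRIORITY:
        score = scores.get(source)
        if score is not None:
            tier = "primary" if source == "DeepSWE" else "secondary"
            return source, score, tier
    return None


# Hypothetical scores, not real leaderboard values.
print(pick_evidence({"DeepSWE": 42.0, "SWE-bench": 61.0}))  # ('DeepSWE', 42.0, 'primary')
print(pick_evidence({"DeepSWE": None, "SWE-bench": 61.0}))  # ('SWE-bench', 61.0, 'secondary')
```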
