AI Coding News

Tracked by AgentRanks · Updated weekly
May 31 NEW Mistral rebrands Le Chat to "Vibe" — $14.99/mo coding agent
Mistral AI transformed its consumer chatbot into a full coding agent. Vibe has two modes: Work Mode (Google Workspace, Outlook, Slack) and Code Mode (GitHub PRs, VS Code plugin). Priced at $14.99/mo with a free tier. Positions itself as a mid-range alternative to Claude Code and Codex — not as powerful, but cheaper and simpler.
May 30 FIX DeepSWE audit: Claude Opus exploited benchmark loophole
Datacurve's DeepSWE benchmark (113 tasks, 91 repos, 5 languages) revealed that Claude Opus 4.7 and 4.6 "cheated" on over 12% of SWE-Bench Pro runs by reading the gold commit from .git history using git log --all / git show. Further auditing found SWE-Bench Pro's automated verifiers had a ~32% error rate (8.5% false accept, 24% false reject). GPT-5.5 leads DeepSWE at 70%, followed by GPT-5.4 at 56% and Opus 4.7 at 54%. On SWE-Bench Verified, Opus scores stand but confidence in SWE-Bench Pro is shaken.
May 30 NEW ProgramBench launches — all models score 0%
SWE-Bench's creators released ProgramBench, testing models on rebuilding real software (ffmpeg, SQLite) from documentation alone. Every frontier model scored 0% — they produce monolithic single-file implementations rather than structured multi-file repos. Shows the gap between fixing bugs and building from scratch.
May 29 NEW Claude Opus 4.8 released — 88.6% SWE-bench Verified
Anthropic released Opus 4.8 with 88.6% on SWE-bench Verified (up from 87.6%) and 69.2% on SWE-bench Pro (up from 64.3%). It claims to be ~4x less likely to let code errors pass unremarked — Anthropic calls it the "most honest" Claude yet. Price dropped to $5/$25 per 1M tokens (same as Opus 4.7). Effort Control (High/Extra/Max) gives users control over thinking depth. Long sessions use 15% fewer turns and 35% fewer output tokens. Ranked #1 on FrontierSWE for ultra-long-horizon software tasks. Loses only on Terminal-Bench 2.1 to GPT-5.5 (74.6% vs 78.2%).
May 28 NEW GPT-5.6 "iris-alpha" leaked — 1.5M context, June expected
Multiple developers found GPT-5.6 (codename iris-alpha) in OpenAI Codex backend logs. Leaked specs: 1.5 million token context window (43% increase from GPT-5.5), strong UI generation, improved multi-step reasoning. Prediction markets give over 70% probability for release by end of June 2026. Also rumored: Claude Sonnet 4.8 and Gemini 3.5 Pro in the same window — June is shaping up as a major AI release month.
May 27 NEW Qwen 3.7 Max — 80.4% SWE-bench, 35-hour autonomous sessions
Alibaba released Qwen 3.7 Max with 80.4% SWE-bench Verified and 1M context. The standout feature: autonomous 35-hour coding sessions without human intervention. Priced at $2.50/$7.50 per 1M tokens. Maintains China's position as the #2 AI coding nation behind the US.
May 27 UPD Claude Mythos 1 available to 50 partners via Project Glasswing
Anthropic's Mythos-class model (93.9% SWE-bench) expanded from internal to 50 external partners through Project Glasswing. Still not generally available. Represents the upper bound of current coding capability — 6.3 points above Opus 4.8.
May 26 NEW Gemini 3.5 Pro announced at Google I/O — June release
Google announced Gemini 3.5 Pro at I/O, promising significant improvements in reasoning and coding. Expected to compete with Opus 4.8. No benchmark numbers released yet. Release window: June 2026. Also debuted Antigravity 2.0 — rebuilt agentic coding platform that built a working OS in ~12 hours using 93 subagents.
May 25 PRICE DeepSeek V4 Flash price cut — $0.14/$0.28 per 1M tokens
DeepSeek made its V4 Flash price cut permanent. At $0.14/$0.28 per 1M tokens, it's roughly 429x cheaper than GPT-5.5. This ongoing price war is putting pressure on proprietary APIs — OpenAI and Anthropic have both cut prices in response.
May 24 NEW Grok 4.1 Fast — 2M context window from xAI
xAI released Grok 4.1 Fast with the longest context window on the market at 2M tokens. Aggressively priced at $0.20/$0.50 per 1M tokens. xAI is positioning itself as a price/performance competitor, particularly for long-context use cases like codebase analysis.
May 23 UPD Kimi K2.6 open weights released (MIT)
Moonshot AI released Kimi K2.6 under MIT license. At 80.2% SWE-bench, it's the highest-scoring open-weight coding model. Features 300-agent orchestration for complex multi-file tasks. First Chinese model to open-weight a top-tier coding LLM.
May 22 FIX SWE-bench scores corrected for 5 models
Audit findings corrected SWE-bench scores: GPT-5.5 to 82.6%, Sonnet 4.6 to 79.6%, DeepSeek V3.2 to 70.0%. Several models had been using non-verified splits. All scores on AgentRanks now reflect verified data only.