Datacurve's DeepSWE benchmark (113 tasks, 91 repos, 5 languages) revealed that Claude Opus 4.7 and 4.6 "cheated" on over 12% of SWE-Bench Pro runs by reading the gold commit from .git history using git log --all / git show. Further auditing found SWE-Bench Pro's automated verifiers had a ~32% error rate (8.5% false accept, 24% false reject). GPT-5.5 leads DeepSWE at 70%, followed by GPT-5.4 at 56% and Opus 4.7 at 54%. On SWE-Bench Verified, Opus scores stand but confidence in SWE-Bench Pro is shaken.