<?xml version="1.0" encoding="UTF-8"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Chen Sagi — Blog</title><description>Chen Sagi builds software with agentic workflows, and writes about how it goes.</description><link>https://blog.chensagi.com/</link><item><title>I spent all of my tokens so you wouldn&apos;t have to: Fable&apos;s vision against Opus, Codex, and Gemini</title><link>https://blog.chensagi.com/blog/blind-ai-bug-bounty-benchmark/</link><guid isPermaLink="true">https://blog.chensagi.com/blog/blind-ai-bug-bounty-benchmark/</guid><description>I pitted Claude Fable 5, Claude Opus 4.8, GPT-5.5 Codex, and Gemini 3.1 Pro against my iOS app in a blind, peer-judged bug hunt — then graded the judges against ground truth. The results surprised me twice.</description><pubDate>Sat, 13 Jun 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Last night, I ran a bug-bounty competition where every
contestant was an AI agent, every judge was an AI agent, and the only human in
the loop — me — showed up at the end to check the judges’ homework.&lt;/p&gt;
&lt;p&gt;The target was &lt;a href=&quot;https://blog.chensagi.com/apps/finn&quot;&gt;Finn&lt;/a&gt;, my Expo/React Native paper-trading game —
think Duolingo, for stocks and investing. The contestants: &lt;strong&gt;Claude Opus 4.8&lt;/strong&gt;
and &lt;strong&gt;Claude Fable 5&lt;/strong&gt; (two runs each),
&lt;strong&gt;GPT-5.5 Codex&lt;/strong&gt;, and &lt;strong&gt;Gemini 3.1 Pro running in Antigravity&lt;/strong&gt;. Each got a
hard 15-minute budget on a live iPhone simulator, the same prompt, and the same
mandatory 16-stop sweep across the whole app.&lt;/p&gt;
&lt;p&gt;The headline: &lt;strong&gt;Fable 5 won unanimously&lt;/strong&gt; — all seven blind judges ranked the
same Fable run first, and both Fable runs finished in the top two. Going in, I
was genuinely skeptical Fable and Opus would even differ; they did, and it
wasn’t close. The judges ranked Opus dead last — but the more interesting result
wasn’t who won. It was what happened when I checked the judges’ work: it didn’t
just expose a wrong verdict, it reshuffled the bottom of the board.&lt;/p&gt;
&lt;h2 id=&quot;the-claim-that-started-it&quot;&gt;The claim that started it&lt;/h2&gt;
&lt;p&gt;When Anthropic &lt;a href=&quot;https://www.anthropic.com/news/claude-fable-5-mythos-5&quot;&gt;announced Fable 5&lt;/a&gt;,
the part that stuck with me was the vision claim: it had beaten &lt;strong&gt;Pokémon
FireRed&lt;/strong&gt; start to finish on a &lt;em&gt;vision-only&lt;/em&gt; harness — raw screenshots, no maps,
no game-state crutches — and could supposedly rebuild a web app’s source code
from a screenshot alone. My app is basically that task: look at a screen full of
numbers and work out what’s broken. So I wanted to test the claim where it
actually mattered to me — and get a clean read on how Fable holds up against
Opus 4.8, the model I’d been reaching for by default.&lt;/p&gt;
&lt;p&gt;Codex and Gemini I added, in all honesty, for one reason: I’d already burned
through my Claude tokens and had time to kill while the usage window reset. So I
threw two other stacks at the same task — and once Claude came back, I turned
the judges loose.&lt;/p&gt;
&lt;h2 id=&quot;why-count-the-bugs-doesnt-work&quot;&gt;Why “count the bugs” doesn’t work&lt;/h2&gt;
&lt;p&gt;“Let an AI test my app and count the bugs” fails in two known ways:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Claim spam.&lt;/strong&gt; Models win by reporting everything that looks odd, and the
reader pays the verification cost.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Benchmark theater.&lt;/strong&gt; Whoever wrote the harness knows which model produced
which output, and grades accordingly.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;So the design countered both with three mechanisms:&lt;/p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Mechanism&lt;/th&gt;&lt;th&gt;What it does&lt;/th&gt;&lt;th&gt;What it prevents&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Verified-only scoring&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;A finding counts only with a named verification method — re-observe, math-check, source-check, or reproduce. False positives count &lt;em&gt;against&lt;/em&gt;; honest dismissals count &lt;em&gt;for&lt;/em&gt;.&lt;/td&gt;&lt;td&gt;Claim spam — five proven bugs beat fifteen maybes&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Blind judging&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;Judges see anonymous &lt;code&gt;run-N/&lt;/code&gt; directories. Contestants are forbidden to name their own model anywhere; I grepped every run for model names first; the identity map lived in one file judges couldn’t read.&lt;/td&gt;&lt;td&gt;Brand bias — and it makes bias &lt;em&gt;measurable&lt;/em&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Trustee calibration&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;Afterwards, I hand-verify the contested and design-intent items — so the judges themselves get graded against human ground truth.&lt;/td&gt;&lt;td&gt;Trusting AI judges blindly&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;h2 id=&quot;the-arena&quot;&gt;The arena&lt;/h2&gt;
&lt;p&gt;One booted iPhone 17 Pro Max simulator, Metro live, real app state. Contestants
ran sequentially; I reset the app between runs. Saved-game data persisted
across runs — initially a fairness worry I handled by interleaving the models
across positions, but it turned into the benchmark’s most important accident
(more below).&lt;/p&gt;
&lt;p&gt;Every contestant ran the same QA methodology, read-only for the benchmark —
it’s the harness I actually use on Finn. It enforces three things: a tiered
&lt;strong&gt;User Complaint Filter&lt;/strong&gt; (a hard list of objectively-bad UX — overlapping
layout, &lt;code&gt;$NaN&lt;/code&gt;, dead buttons, content you can’t scroll to), a mandatory analysis
block after every screenshot, and a compressed-evidence contract — a 784px WebP
plus the full accessibility tree and a checksum manifest per capture. Each swept
16 mandatory stops at ~35 seconds each, then spent ~5 minutes digging into its
best candidates, logging &lt;em&gt;Observed → Hypothesis → Verification → Verdict&lt;/em&gt; for
every one — including dismissals.&lt;/p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Run&lt;/th&gt;&lt;th&gt;Model&lt;/th&gt;&lt;th&gt;Skill&lt;/th&gt;&lt;th&gt;Driver&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;Claude Opus 4.8&lt;/td&gt;&lt;td&gt;&lt;code&gt;ios-qa&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Claude Code&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;2&lt;/td&gt;&lt;td&gt;Claude Fable 5&lt;/td&gt;&lt;td&gt;&lt;code&gt;ios-qa&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Claude Code&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;3&lt;/td&gt;&lt;td&gt;Claude Fable 5&lt;/td&gt;&lt;td&gt;&lt;code&gt;ios-qa&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Claude Code&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;4&lt;/td&gt;&lt;td&gt;Claude Opus 4.8&lt;/td&gt;&lt;td&gt;&lt;code&gt;ios-qa&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Claude Code&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;5&lt;/td&gt;&lt;td&gt;GPT-5.5 Codex (xhigh)&lt;/td&gt;&lt;td&gt;&lt;code&gt;ios-qa-evidence-compression&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Codex CLI&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;6&lt;/td&gt;&lt;td&gt;Gemini 3.1 Pro (High)&lt;/td&gt;&lt;td&gt;&lt;code&gt;ios-qa-evidence-compression&lt;/code&gt;&lt;/td&gt;&lt;td&gt;Antigravity&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;Runs 5–6 used a portable sibling of that skill — &lt;code&gt;ios-qa-evidence-compression&lt;/code&gt;:
the same checks (User Complaint Filter, per-screenshot analysis block, compressed
evidence), just packaged so a plain CLI could run them without Claude Code’s Skill
tool. Every run drove the same simulator through &lt;code&gt;idb&lt;/code&gt; — what differed for the
external two was the agent and the skill packaging, not the way the app was
controlled. That’s still enough to make them &lt;em&gt;stack vs stack&lt;/em&gt;, not bare model vs
model — a caveat I’ll keep flagging.&lt;/p&gt;
&lt;h2 id=&quot;the-standings&quot;&gt;The standings&lt;/h2&gt;
&lt;p&gt;Seven blind judges (2 Sonnet 4.6, 2 Opus 4.8, 3 Fable 5 — deliberately mixed
so family bias would be measurable) independently re-verified every claim
against the evidence and source, and force-ranked all six runs. Then I ran a
ground-truth pass over the contested and design-intent calls. The honest
scoreboard isn’t their votes — it’s what &lt;em&gt;survived&lt;/em&gt;: real bugs caught, minus
false alarms, across the four stacks.&lt;/p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Stack&lt;/th&gt;&lt;th&gt;Verified real bugs (of 13)&lt;/th&gt;&lt;th&gt;False alarms&lt;/th&gt;&lt;th&gt;Net&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;strong&gt;Claude Fable 5&lt;/strong&gt;&lt;/td&gt;&lt;td&gt;&lt;strong&gt;10&lt;/strong&gt; (run 3: 7 · run 2: 3)&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;td&gt;&lt;strong&gt;+10&lt;/strong&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;GPT-5.5 Codex&lt;/td&gt;&lt;td&gt;1 (the −$0.00)&lt;/td&gt;&lt;td&gt;1 (the CTA claim)&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Claude Opus 4.8&lt;/td&gt;&lt;td&gt;1 (% wrap) + a shared flag&lt;/td&gt;&lt;td&gt;1 (run 1’s map claim)&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Gemini 3.1 Pro&lt;/td&gt;&lt;td&gt;2 (1 shared)&lt;/td&gt;&lt;td&gt;~4&lt;/td&gt;&lt;td&gt;−2&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;Every number traces to a specific finding: the 13 are the bugs that survived
both the panel and my verification; the false alarms are claims that the panel, my
verification, or both ruled out. Net is real minus false: nothing weighted, nothing to argue with.&lt;/p&gt;
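&lt;p&gt;For anyone who wants to rerun the Net column, here is a minimal TypeScript sketch of it. The counts are copied from the table above; the &lt;code&gt;StackTally&lt;/code&gt; type and the &lt;code&gt;net&lt;/code&gt; helper are names made up for illustration, and Gemini’s “~4” is treated as exactly 4.&lt;/p&gt;

```typescript
// Hypothetical tally type; field names are illustrative, not from the benchmark harness.
type StackTally = { stack: string; verifiedBugs: number; falseAlarms: number };

// Counts copied from the standings table. Gemini's "~4" false alarms is taken as 4.
const tallies: StackTally[] = [
  { stack: "Claude Fable 5", verifiedBugs: 10, falseAlarms: 0 },
  { stack: "GPT-5.5 Codex", verifiedBugs: 1, falseAlarms: 1 },
  { stack: "Claude Opus 4.8", verifiedBugs: 1, falseAlarms: 1 },
  { stack: "Gemini 3.1 Pro", verifiedBugs: 2, falseAlarms: 4 },
];

// Net is real minus false, with no weighting.
function net(t: StackTally): number {
  return t.verifiedBugs - t.falseAlarms;
}

// Sort descending by net; ties keep table order because Array.prototype.sort is stable.
const ranked = [...tallies].sort((a, b) => net(b) - net(a));

for (const t of ranked) {
  console.log(`${t.stack}: ${net(t)}`);
}
```

&lt;p&gt;Run as-is, it puts Fable at 10, Codex and Opus at 0, and Gemini at -2, matching the table.&lt;/p&gt;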
&lt;p&gt;Two things stand out. Fable isn’t just ahead, it’s in another tier — 10 of the
13 real bugs, all four of the catastrophic family, zero false alarms; the blind
panel agreed unanimously, ranking one Fable run first on every single ballot.
And the &lt;em&gt;bottom&lt;/em&gt; flips: the panel ranked Gemini above Opus, but counting real
outcomes reverses it — Opus’s one quiet, correct find nets it even, while Gemini
found real bugs and then cried wolf four times over. Discipline beat volume —
and the judges missed it.&lt;/p&gt;
&lt;h2 id=&quot;the-bug-that-decided-it&quot;&gt;The bug that decided it&lt;/h2&gt;
&lt;p&gt;Here’s where the QA earned its keep — not in &lt;em&gt;spotting&lt;/em&gt; something weird, but in
proving what it actually was.&lt;/p&gt;
&lt;p&gt;The winning run did one thing none of the others did: it resumed a saved,
half-played game instead of starting fresh, then kept &lt;em&gt;playing&lt;/em&gt; — advancing
days, holding a position. A few days in, the Portfolio went sideways. A stock it
owned — 253 shares of SWCO, bought around $31.62 — suddenly showed a price of
&lt;strong&gt;$0.00&lt;/strong&gt;. Net worth fell from $7,672.98 to $12.14. A 99.84% wipe.&lt;/p&gt;
&lt;p&gt;&lt;img __ASTRO_IMAGE_=&quot;{&amp;#x22;src&amp;#x22;:&amp;#x22;../../assets/blog/bughunt-bug-zero-price-portfolio.webp&amp;#x22;,&amp;#x22;alt&amp;#x22;:&amp;#x22;Finn&amp;#x27;s Portfolio mid-level: the held SWCO position — 253 shares at a $31.62 basis — is valued at $0.00. Holdings value $0.00, $12.14 cash left, unrealized P/L −$7,999.86 (−100%).&amp;#x22;,&amp;#x22;index&amp;#x22;:0}&quot;&gt;&lt;/p&gt;
&lt;p&gt;A weaker run files a “catastrophic data loss” bug right here and moves on. This
one didn’t trust the screenshot. It played on to the level’s Game Over screen —
where the same shares were priced normally again, net worth $6,223.29, not $12.
The “wipe” wasn’t real: a glitch that flashes on the resume boundary and clears
itself. Scary to a player, but no money actually lost.&lt;/p&gt;
&lt;p&gt;&lt;img __ASTRO_IMAGE_=&quot;{&amp;#x22;src&amp;#x22;:&amp;#x22;../../assets/blog/bughunt-bug-gameover-proof.webp&amp;#x22;,&amp;#x22;alt&amp;#x22;:&amp;#x22;The same level&amp;#x27;s Game Over screen: net worth $6,223.29 with the identical SWCO position priced normally — proof the $0.00 was a transient resume-time glitch, not a real wipe.&amp;#x22;,&amp;#x22;index&amp;#x22;:0}&quot;&gt;&lt;/p&gt;
&lt;p&gt;Then it found &lt;em&gt;why&lt;/em&gt;. Resuming corrupts the game’s internal date, so the price
lookup for that day comes back empty — and one line turns “empty” into a real,
displayed zero:&lt;/p&gt;
&lt;pre class=&quot;astro-code astro-code-themes github-light github-dark&quot; style=&quot;background-color:#fff;--shiki-dark-bg:#24292e;color:#24292e;--shiki-dark:#e1e4e8; overflow-x: auto;&quot; tabindex=&quot;0&quot; data-language=&quot;tsx&quot;&gt;&lt;code&gt;&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#6A737D;--shiki-dark:#6A737D&quot;&gt;// app/(game)/index.tsx:156 — the $0.00 stock&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;line&quot;&gt;&lt;span style=&quot;color:#D73A49;--shiki-dark:#F97583&quot;&gt;const&lt;/span&gt;&lt;span style=&quot;color:#005CC5;--shiki-dark:#79B8FF&quot;&gt; price&lt;/span&gt;&lt;span style=&quot;color:#D73A49;--shiki-dark:#F97583&quot;&gt; =&lt;/span&gt;&lt;span style=&quot;color:#24292E;--shiki-dark:#E1E4E8&quot;&gt; stock?.price &lt;/span&gt;&lt;span style=&quot;color:#D73A49;--shiki-dark:#F97583&quot;&gt;??&lt;/span&gt;&lt;span style=&quot;color:#005CC5;--shiki-dark:#79B8FF&quot;&gt; 0&lt;/span&gt;&lt;span style=&quot;color:#24292E;--shiki-dark:#E1E4E8&quot;&gt;;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;
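&lt;p&gt;To see why that fallback bites, here is a minimal sketch. It is not Finn’s actual code: the &lt;code&gt;Stock&lt;/code&gt; and &lt;code&gt;QuotesByDate&lt;/code&gt; types, the date keys, and the helpers &lt;code&gt;lookup&lt;/code&gt;, &lt;code&gt;priceOrZero&lt;/code&gt;, &lt;code&gt;priceOrNull&lt;/code&gt;, and &lt;code&gt;formatPrice&lt;/code&gt; are all hypothetical. It contrasts the zero fallback with one possible alternative, keeping a missing quote as &lt;code&gt;null&lt;/code&gt; so the UI can show a placeholder instead of a real-looking price.&lt;/p&gt;

```typescript
// Hypothetical shapes; Finn's real types are not shown in the post.
type Stock = { symbol: string; price: number };
type QuotesByDate = { [date: string]: { [symbol: string]: Stock } };

// Illustrative data: quotes exist only for one valid in-game date.
const quotes: QuotesByDate = {
  "day-4": { SWCO: { symbol: "SWCO", price: 31.62 } },
};

// Returns undefined at runtime when the date or symbol has no entry.
function lookup(date: string, symbol: string): Stock | undefined {
  return quotes[date]?.[symbol];
}

// Mirrors the line above: a missing quote silently becomes a real number.
function priceOrZero(stock: Stock | undefined): number {
  return stock?.price ?? 0;
}

// One possible alternative (an assumption, not the author's fix): keep "missing" distinct.
function priceOrNull(stock: Stock | undefined): number | null {
  return stock?.price ?? null;
}

function formatPrice(price: number | null): string {
  return price === null ? "unavailable" : `$${price.toFixed(2)}`;
}

// A corrupted date has no quotes, so the lookup comes back empty.
const corrupted = lookup("day-7656", "SWCO");
console.log(formatPrice(priceOrZero(corrupted))); // "$0.00", which looks like a real price
console.log(formatPrice(priceOrNull(corrupted))); // "unavailable"
```

&lt;p&gt;Whether a placeholder is the right product behavior is a design call; the sketch only shows that the zero and the missing value stop being distinguishable once &lt;code&gt;?? 0&lt;/code&gt; runs.&lt;/p&gt;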
&lt;p&gt;That one corrupted date was behind a whole family of weirdness: a 62-day level
reading &lt;strong&gt;“7,656 days remaining,”&lt;/strong&gt; the day counter sliding backwards, the
market list shrinking from 13 stocks to 2. &lt;strong&gt;One bug wearing four costumes.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;img __ASTRO_IMAGE_=&quot;{&amp;#x22;src&amp;#x22;:&amp;#x22;../../assets/blog/bughunt-bug-7656-days.webp&amp;#x22;,&amp;#x22;alt&amp;#x22;:&amp;#x22;Finn&amp;#x27;s home screen on Day 4 of a 62-day level, showing \&amp;#x22;7,656 days remaining\&amp;#x22; and net worth $7,672.98.&amp;#x22;,&amp;#x22;index&amp;#x22;:0}&quot;&gt;&lt;/p&gt;
&lt;p&gt;And that’s the entire point. Any model can screenshot a $0.00 and shout “bug.”
What’s hard — and what won the night — is the QA around it:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Resume a saved game.&lt;/strong&gt; The whole bug family lived on the save/resume
boundary. Fresh-launch sweeps never reached it.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Play, don’t tour.&lt;/strong&gt; Advancing days while holding a position is what made it
surface.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reconcile to the cent.&lt;/strong&gt; Cross-checking Portfolio against Game Over both
found the wipe and proved it a mirage.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Pin the line.&lt;/strong&gt; A screenshot is a claim; &lt;code&gt;index.tsx:156&lt;/code&gt; is a root cause.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The failure modes were just as instructive — because a false positive is its own
kind of failure. Opus’s run 1 had exactly one finding: that the campaign map’s
“STORM WARNING” banner mislabeled levels 10–12. It didn’t — the banner heads the
locked chapter below it, and Opus’s own star math disproved the claim. It cited
that math as &lt;em&gt;proof&lt;/em&gt; anyway: confirmation bias in its purest form, and a 7/7
false positive on its only finding.&lt;/p&gt;
&lt;p&gt;&lt;img __ASTRO_IMAGE_=&quot;{&amp;#x22;src&amp;#x22;:&amp;#x22;../../assets/blog/bughunt-fp-opus-map.webp&amp;#x22;,&amp;#x22;alt&amp;#x22;:&amp;#x22;Run 1 (Opus)&amp;#x27;s sole finding — scored a false positive by all seven judges. It claimed the campaign map&amp;#x27;s \&amp;#x22;STORM WARNING\&amp;#x22; banner mislabels levels 10–12; in fact the banner heads the locked chapter below, and Opus&amp;#x27;s own star math disproved it.&amp;#x22;,&amp;#x22;index&amp;#x22;:0}&quot;&gt;&lt;/p&gt;
&lt;p&gt;Gemini produced what the judges called verification theater: “VERIFIED” labels
with no method, no reasoning log, a missed tap reported as a dead button, and a
coverage claim its own evidence contradicted. It also stopped at minute 6, having
confused “6 minutes spent” with “6 minutes remaining,” and needed two human
nudges to use its budget, help that no other run received.&lt;/p&gt;
&lt;h2 id=&quot;the-judges-got-judged--and-the-majority-got-it-wrong&quot;&gt;The judges got judged — and the majority got it wrong&lt;/h2&gt;
&lt;p&gt;This is the part I’d actually want you to take home.&lt;/p&gt;
&lt;p&gt;The panel split 5–2 on exactly one finding: Codex claimed the level-briefing
CTA button hides the star-reward thresholds. Five judges confirmed it — the
labels were invisible in every screenshot. Two judges read the layout source,
noticed the labels were ordinary below-the-fold content in a ScrollView, and
pointed out that nobody — not the contestant, not the confirming judges — had
ever &lt;em&gt;scrolled&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img __ASTRO_IMAGE_=&quot;{&amp;#x22;src&amp;#x22;:&amp;#x22;../../assets/blog/bughunt-bug-cta-overlap.webp&amp;#x22;,&amp;#x22;alt&amp;#x22;:&amp;#x22;The contested level-briefing screen, with the green \&amp;#x22;Start Level\&amp;#x22; button at the bottom and the Bronze/Silver/Gold reward tiers above it.&amp;#x22;,&amp;#x22;index&amp;#x22;:0}&quot;&gt;&lt;/p&gt;
&lt;p&gt;I opened the app and scrolled. The minority was right. &lt;strong&gt;The 5-judge majority
was wrong on the only genuinely contested verdict of the night.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The wrinkle: the two who got it right were both Fable judges — but so was one of
the five who got it wrong. It wasn’t a family thing. They won because they
opened the layout source and noticed nobody had scrolled; every judge who
trusted the screenshot — both Sonnet, both Opus, and the third Fable — got it
wrong. Method beat pedigree.&lt;/p&gt;
&lt;p&gt;Meanwhile, every unanimous 7/7 verdict survived my review without exception.
And all three items the panel had set aside as “needs design intent” — an
academy lesson showing a 2,118-day horizon, a 0.0% win rate styled loss-red
with zero trades, a star counter with an impossible denominator — turned out
to be real bugs.&lt;/p&gt;
&lt;p&gt;So the calibrated trust rules I’m keeping:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Trust unanimous panel verdicts.&lt;/strong&gt; 7/7 agreement was a reliable signal.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Treat split verdicts as unresolved.&lt;/strong&gt; Demand a behavioral test — a
scroll, a tap — not another opinion.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Always route “design intent” questions through a human.&lt;/strong&gt; AI judges
systematically under-call bugs that require knowing what the product is
&lt;em&gt;supposed&lt;/em&gt; to do. That pile is where the cheapest extra yield lives.&lt;/li&gt;
&lt;/ol&gt;
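&lt;p&gt;The three rules above can be sketched as a small triage function. This is a minimal illustration under assumptions, not part of the benchmark harness: the &lt;code&gt;PanelVerdict&lt;/code&gt; shape, the &lt;code&gt;Action&lt;/code&gt; labels, and the &lt;code&gt;triage&lt;/code&gt; name are all invented here.&lt;/p&gt;

```typescript
// Hypothetical verdict record; the field names are illustrative.
type PanelVerdict = { confirm: number; reject: number; needsDesignIntent: boolean };
type Action = "accept-panel" | "behavioral-test" | "human-review";

function triage(v: PanelVerdict): Action {
  // Rule 3: anything that hinges on design intent goes to a human.
  if (v.needsDesignIntent) return "human-review";
  // Rule 1: a unanimous verdict, in either direction, is trusted.
  if (v.confirm === 0 || v.reject === 0) return "accept-panel";
  // Rule 2: any split is unresolved until someone scrolls or taps.
  return "behavioral-test";
}

// The 5-2 CTA split from this post.
console.log(triage({ confirm: 5, reject: 2, needsDesignIntent: false })); // "behavioral-test"
// A unanimous 7/7 verdict.
console.log(triage({ confirm: 7, reject: 0, needsDesignIntent: false })); // "accept-panel"
```

&lt;p&gt;The design-intent check runs first on purpose: those items were set aside by the panel, so vote counts say little about them.&lt;/p&gt;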
&lt;p&gt;One more result from the blind ballots: &lt;strong&gt;no self-favoritism&lt;/strong&gt;. All three
judge families ranked the Fable runs essentially identically, and the Opus
judges were the &lt;em&gt;harshest&lt;/em&gt; graders of the Opus runs.&lt;/p&gt;
&lt;h2 id=&quot;what-it-cost&quot;&gt;What it cost&lt;/h2&gt;
&lt;p&gt;Two ways to read the cost: what a subscriber actually burns (the plan meters),
and the clean per-token math (list price). Here are both.&lt;/p&gt;
&lt;p&gt;The Claude plan absorbed all four Claude hunts &lt;em&gt;plus&lt;/em&gt; the orchestration in a
single 5-hour window. Here’s the session meter at 37% (after the first hunt),
55% (midway through the second), and 89% (after the fourth):&lt;/p&gt;
&lt;p&gt;&lt;img __ASTRO_IMAGE_=&quot;{&amp;#x22;src&amp;#x22;:&amp;#x22;../../assets/blog/bughunt-usage-claude-37pct.png&amp;#x22;,&amp;#x22;alt&amp;#x22;:&amp;#x22;Claude Max(5x) session usage meter at 37% used after the first Opus hunt.&amp;#x22;,&amp;#x22;index&amp;#x22;:0}&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;img __ASTRO_IMAGE_=&quot;{&amp;#x22;src&amp;#x22;:&amp;#x22;../../assets/blog/bughunt-usage-claude-55pct.png&amp;#x22;,&amp;#x22;alt&amp;#x22;:&amp;#x22;The same meter at 55% used, midway through the second hunt (Fable).&amp;#x22;,&amp;#x22;index&amp;#x22;:0}&quot;&gt;&lt;/p&gt;
&lt;p&gt;&lt;img __ASTRO_IMAGE_=&quot;{&amp;#x22;src&amp;#x22;:&amp;#x22;../../assets/blog/bughunt-usage-claude-89pct.png&amp;#x22;,&amp;#x22;alt&amp;#x22;:&amp;#x22;The same meter at 89% used after the fourth hunt, with the session about to reset.&amp;#x22;,&amp;#x22;index&amp;#x22;:0}&quot;&gt;&lt;/p&gt;
&lt;p&gt;Here’s the real arithmetic, list price for list price: Fable is &lt;strong&gt;$10/$50&lt;/strong&gt; per
million tokens (input/output), Opus &lt;strong&gt;$5/$25&lt;/strong&gt; — exactly 2× per token, both
directions. But each Fable hunt ran ~34% leaner (≈122k tokens vs ≈185k, straight
from the run logs). Two times the rate on two-thirds the tokens lands at about
&lt;strong&gt;1.3× per hunt&lt;/strong&gt; at API prices. The subscription meter is murkier — the
orchestrator burned the same 5-hour window alongside the contestants, so I can’t
cleanly split it per run — and in everyday use, where the token thrift doesn’t
show up, it feels like the full 2× (more below).&lt;/p&gt;
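&lt;p&gt;The per-hunt ratio can be checked in a few lines. A back-of-envelope sketch using only the list prices and token counts quoted above; the variable names are illustrative, and it assumes both models had the same input/output token mix, which the run logs quoted here don’t break down. Because both rates are exactly 2×, that mix cancels out of the ratio anyway.&lt;/p&gt;

```typescript
// List prices in dollars per million tokens (input, output), as quoted above.
const fable = { inputPerM: 10, outputPerM: 50 };
const opus = { inputPerM: 5, outputPerM: 25 };

// Approximate tokens per hunt, as quoted above from the run logs.
const fableTokens = 122000;
const opusTokens = 185000;

// Input and output rates are both exactly 2x, so one ratio covers both directions.
const rateRatio = fable.inputPerM / opus.inputPerM;
const tokenRatio = fableTokens / opusTokens;
const costRatioPerHunt = rateRatio * tokenRatio;

console.log(`token savings: ${((1 - tokenRatio) * 100).toFixed(0)}%`); // 34%
console.log(`cost per hunt vs Opus: ${costRatioPerHunt.toFixed(2)}x`); // 1.32x
```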
&lt;p&gt;Codex runs on a different plan with a different meter. One hunt cost ~8% of an
entire week on the $20 plan — call it twelve hunts a week, ceiling:&lt;/p&gt;
&lt;p&gt;&lt;img __ASTRO_IMAGE_=&quot;{&amp;#x22;src&amp;#x22;:&amp;#x22;../../assets/blog/bughunt-usage-codex.png&amp;#x22;,&amp;#x22;alt&amp;#x22;:&amp;#x22;Codex usage panel after its hunt: 5-hour window 44% remaining, weekly quota 72% remaining.&amp;#x22;,&amp;#x22;index&amp;#x22;:0}&quot;&gt;&lt;/p&gt;
&lt;p&gt;Gemini’s hunt was the cheapest: about 10% of the daily quota, ~2% of the
weekly. But that low number is mostly an artifact of a lazy run. The Antigravity
agent stalled at minute 6 and needed two nudges, so it simply did less work; it was
not cheaper &lt;em&gt;per unit&lt;/em&gt; of work. (I lost the only usable cost screenshot; the survivor reads
“100% available” across every tier, which the meter itself admits is misleading.)&lt;/p&gt;
&lt;p&gt;Four days of living with Fable since — first impressions only. The cost is no
joke — and it’s the rolling &lt;strong&gt;5-hour usage window&lt;/strong&gt; you feel, not the monthly
bill. That ~1.3× was the benchmark; in everyday use the 2× sticker is brutally
real, draining a window so fast that a session that used to last me three hours
is gone in ninety minutes. Worth it — but I feel it.&lt;/p&gt;
&lt;p&gt;The more useful thing I’ve learned is about &lt;em&gt;fit&lt;/em&gt;, and it’s made me re-value
Opus rather than write it off. Fable is the better send-and-forget agent: hand
it a task and I trust it to do that task better, heads down, little chatter along
the way — which is exactly why it won here, a 15-minute solo run. But being
better at the agentic workload comes at a cost to the human in the loop, and
sometimes I &lt;em&gt;want&lt;/em&gt; to be in the loop. When I want a session — to talk with the
model, steer it, actually follow the process — Opus is the one I reach for. The
split I’ve landed on, at least for now: Fable when I want to hand it off, Opus
when I want to be part of it.&lt;/p&gt;
&lt;h2 id=&quot;caveats-before-anyone-quotes-this&quot;&gt;Caveats, before anyone quotes this&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;n=1 per cell.&lt;/strong&gt; One run per model per position. State-inheritance luck was
real: the winning run inherited the previous run’s saved game — the bug-rich
path — and the Opus run after it started post-game-over, with that path
gone. The interleaving meant both Claude variants saw both fresh and
inherited state, but a single evening is a sample, not a proof.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Codex and Gemini ran a different stack&lt;/strong&gt; — the &lt;code&gt;ios-qa-evidence-compression&lt;/code&gt;
sibling skill via their own CLIs, not the &lt;code&gt;ios-qa&lt;/code&gt; skill through Claude Code.
Same methodology and the same &lt;code&gt;idb&lt;/code&gt; control underneath, but a different agent
and packaging, so those rows are stack-vs-stack, not model-vs-model.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Gemini’s ranking partially reflects its missing audit trail&lt;/strong&gt; — a run
without a reasoning log forced every judge to redo its verification from
scratch.&lt;/li&gt;
&lt;li&gt;The orchestrator and most judges were Claude models; the no-self-favoritism
result above is the honest attempt to measure what that implies, but it’s
worth saying plainly.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;the-actual-takeaway&quot;&gt;The actual takeaway&lt;/h2&gt;
&lt;p&gt;All the machinery — verified-only scoring, blind judges, a human ground-truth
pass — existed to earn one conclusion the right to be believed. And it’s simple:
&lt;strong&gt;Fable is exceptional at QA, and it wasn’t close.&lt;/strong&gt; It caught 10 of the 13 real
bugs, the entire catastrophic family included, with zero false alarms; the other
three stacks managed one or two apiece, each with at least one false alarm attached.
And it isn’t just my scoring — seven blind judges, grading anonymized runs,
ranked a Fable run first on &lt;em&gt;every single ballot&lt;/em&gt;. The raw QA says it. The blind
judges say it. They agree.&lt;/p&gt;
&lt;p&gt;It costs more — 2× per token, and the meter lets you feel it. But for finding
real bugs in a real app, it’s not a close call. &lt;strong&gt;Pricier — but better. By a
mile.&lt;/strong&gt;&lt;/p&gt;</content:encoded><category>ai</category><category>testing</category><category>benchmarks</category></item></channel></rss>