Three Case Studies: Why Multi-Model Pipelines Need a Referee
Real data from 6 pipeline runs. Rate limiting middleware. Debate vs parallel. A $0.40 security audit.
Why Three Models Beat One
The Arbiter catches what no single model sees
We asked three frontier models to independently generate rate limiting middleware for a production API gateway. The task was specific: distributed rate limiter with Redis backing, sliding window counters, and graceful degradation.
All three models produced working code. All three made the same mistakes. No distributed locking around counter increments. No Redis connection pooling. No graceful degradation when Redis goes down. Three models, three blind spots, zero diversity of failure.
The Arbiter — running on a completely different model (Grok 4) — found 8 critical issues at 0.97 confidence.
Key insight: All 3 models independently produced the same blind spot. The Arbiter, using a different model, caught it immediately. This is the core argument for multi-model pipelines: systematic blind spots are invisible to the model that has them.
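For concreteness, here is a minimal sketch of what those three missing pieces (atomic counter updates, connection pooling, fail-open degradation) look like with the redis-py client. It is illustrative only, not CRTX output or the code from these runs; the key format, limits, and fail-open policy are assumptions.

# Sketch of the three fixes, using redis-py (illustrative, not pipeline output)
import time
import uuid

import redis

# Connection pooling: reuse sockets instead of opening a connection per request.
pool = redis.ConnectionPool(host="localhost", port=6379, max_connections=50)
r = redis.Redis(connection_pool=pool)

# Atomic sliding-window check: the prune / count / add sequence runs as a single
# Lua script on the server, so concurrent gateway workers cannot interleave it.
SLIDING_WINDOW = r.register_script("""
local key    = KEYS[1]
local now    = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local limit  = tonumber(ARGV[3])
redis.call('ZREMRANGEBYSCORE', key, 0, now - window)
if redis.call('ZCARD', key) < limit then
    redis.call('ZADD', key, now, ARGV[4])
    redis.call('EXPIRE', key, window)
    return 1
end
return 0
""")

def allow_request(client_ip: str, limit: int = 100, window: int = 60) -> bool:
    try:
        allowed = SLIDING_WINDOW(
            keys=[f"rate_limit:{client_ip}"],
            args=[time.time(), window, limit, uuid.uuid4().hex],
        )
        return bool(allowed)
    except redis.RedisError:
        # Graceful degradation: if Redis is unreachable, fail open rather than
        # turning a cache outage into a gateway outage.
        return True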
Debate vs Parallel — Same Prompt, Different Universes
$3 once beats $2 four times
Same rate-limiting task. Same models. We ran it through parallel mode four separate times. Every single run was REJECTED by the Arbiter. Then we ran it once through debate mode. Zero critical issues.
Here are the parallel results:
Four runs, $8.38 total, 18 critical issues across all runs. Every run rejected. The models working in parallel simply reinforce each other's assumptions — they don't challenge them.
Now the debate run: one pass, about $3, zero critical issues, and a single Arbiter FLAG at 0.82 for minor style concerns rather than structural flaws.
Takeaway: $3 once beats $2 four times. Debate mode forced adversarial scrutiny. The models had to defend their design choices against a motivated opponent, and each phase (position papers, rebuttals, final arguments) stripped away assumptions that parallel mode left unchallenged.
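Conceptually, a debate round looks something like the sketch below. This is a simplified illustration of the three phases, not CRTX's orchestration code; the ask() helper, prompts, and data shapes are invented for the example.

# Conceptual sketch of one debate round (not CRTX's actual implementation)
from typing import Callable, Dict, List

def debate_round(task: str, models: List[str],
                 ask: Callable[[str, str], str]) -> Dict[str, str]:
    # Phase 1: position papers, written independently by each model.
    positions = {
        m: ask(m, f"Task: {task}\nWrite a position paper: your design and code.")
        for m in models
    }

    # Phase 2: rebuttals; each model attacks the other models' positions.
    rebuttals = {}
    for m in models:
        others = "\n\n".join(f"[{o}]\n{positions[o]}" for o in models if o != m)
        rebuttals[m] = ask(m, "Critique these competing designs and name their weak assumptions:\n" + others)

    # Phase 3: final arguments; each model revises its design in light of the rebuttals.
    finals = {}
    for m in models:
        criticism = "\n\n".join(rebuttals[o] for o in models if o != m)
        finals[m] = ask(m, "Revise your design given this criticism:\n" + criticism
                        + "\n\nYour original position:\n" + positions[m])
    # A separate Arbiter model then judges the final arguments.
    return finals

The structural point: every design has to survive a hostile reading by a peer model before the Arbiter ever scores it, which is exactly the scrutiny parallel mode never applies.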
The $0.40 Security Audit
The Arbiter found an RCE vulnerability that 3 models and cross-review all missed
This one is the scariest. Three models generated rate limiting middleware with Redis caching. The cross-review stage — where each model reviews another model's output — approved the code. Everyone signed off.
Then the Arbiter found eval() being called on data read from Redis. An RCE vulnerability with a CVSS score of 9.8.
Here is what the 3 models generated (and cross-review approved):
# What 3 models generated (and cross-review approved)
cached = redis_client.get(f"rate_limit:{client_ip}")
if cached:
    config = eval(cached)  # <-- RCE vulnerability (CVSS 9.8)

And here is what the Arbiter flagged, along with the fix:
# What the Arbiter flagged — and the fix
import json
cached = redis_client.get(f"rate_limit:{client_ip}")
if cached:
    config = json.loads(cached)  # Safe deserialization

If this code shipped to production, any attacker who could write to Redis (via cache poisoning, SSRF, or a compromised internal service) could execute arbitrary Python on your servers. The eval() call turns a data store read into a full remote code execution vector.
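To make the attack concrete: with the eval() version, anyone who can write to that cache key controls what the gateway executes. The key and payload below are hypothetical, invented purely for illustration.

# Hypothetical poisoned cache entry, planted via cache poisoning, SSRF, or a
# compromised internal service. When the vulnerable middleware later runs
# eval(cached), this payload executes on the gateway host.
redis_client.set(
    "rate_limit:203.0.113.7",
    "__import__('os').system('id > /tmp/pwned')",
)

The json.loads() version raises an error on this payload instead of executing it, because it is not valid JSON.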
Three models missed it. Cross-review missed it. The Arbiter caught it for $0.17.
The Numbers

Independent generation: three models, identical blind spots; the Arbiter found 8 critical issues at 0.97 confidence.
Parallel mode: 4 runs, $8.38 total, 18 critical issues, every run rejected.
Debate mode: 1 run, about $3, zero critical issues, one FLAG at 0.82 for minor style concerns.
Security audit: $0.40 total, $0.17 of it for the Arbiter, which caught a CVSS 9.8 RCE that three models and cross-review all missed.
Try It
CRTX is open source under Apache 2.0. The complete pipeline — parallel, sequential, debate mode, the Arbiter, smart routing, context injection, apply mode — runs locally with your own API keys.
pip install crtx

crtx run "Your coding task here" \
  --mode debate \
  --arbiter bookend \
  --route quality_first